Introduction
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
FIXME
Key Points
First key point. Brief Answer to questions. (FIXME)
Introduction to Workflows
Overview
Teaching: 30 min
Exercises: 0 min
Questions
What is a workflow?
Why should I use a workflow?
What are the advantages of Nextflow?
Objectives
Describe a workflow.
Explain why workflows are useful.
Ascertain whether Nextflow meets your needs.
Workflows
Key Points
A workflow is a series of tasks to process data.
A workflow framework is a set of tools that allow you to compose tasks, connect them together, and process data.
There are many workflow frameworks available.
Nextflow is reproducible, portable, simple, extensible, well designed, open source, and supported.
The Nextflow Language
Overview
Teaching: 40 min
Exercises: 20 min
Questions
What language is Nextflow written in?
How do I write a Nextflow script?
How can I run a Nextflow script?
Objectives
Describe the Groovy syntax.
Write a Nextflow workflow.
Run a Nextflow workflow.
Nextflow is a Domain Specific Language (DSL) designed to ease the writing of computational pipelines. It is an extension of the computer language Groovy, which is a superset of the computer language Java. As such, Nextflow can execute any piece of Groovy code or use any library for the JVM platform. Full documentation can be found in the Nextflow Documentation.
This section describes how to write and run a Nextflow workflow.
The Groovy language
What’s your scripting language experience?
- How many of you use a scripting language on a regular basis?
- Bash
- Python
- Perl
- R
- Groovy
Before writing a workflow, here are some fundamental concepts of the Groovy language.
- Single line comments (lines not meant to be executed) start with //. Multi-line comments are nested between /* and */ tags.

// This is a single line comment. Everything after the // is ignored.
/*
Comments can also
span multiple
lines.
*/
- Variables are assigned using = and can have any value. Variables used inside a double quoted string are prefixed with a $ to denote the variable should be interpolated. A variable is otherwise just used by its name.

myvar = 1                         // Integer
myvar = -3.1499392                // Floating point number
myvar = false                     // Boolean
myvar = "Hello world"             // String
myvar = new java.util.Date()      // Object - Abstract data structure

message = "The file $file cannot be found!"  // Variable used inside a string.
println message                              // Print variable
- Lists (also known as arrays) are defined using the square bracket [] notation.

emptylist = []                    // an empty list
// Lists can contain duplicates, and the values can be of any type.
mixedList = [1, 1, 'string', true, null, 5 as byte]
mixedList.add("new value")        // adds "new value" to the end of mixedList
println mixedList.size()          // prints the size
println mixedList[0]              // prints 1
- Maps (also known as associative arrays) are defined using the [:] literal. They associate a unique string with a value, and are commonly referred to as key-value pairs.

emptyMap = [:]                    // an empty map
mymap = [ name: "Steve", age: 43, likes: ['walks','cooking','coding'] ]
mymap.age = 44                    // values can be accessed using map property notation.
println mymap['name']             // alternatively values can be accessed using quoted keys.
- Groovy supports common control structures such as if/else tests, switch/case tests, for loops, and while loops.

// if / else test
x = false
if ( !x ) {
    return 1
} else {
    return 0
}

// switch / case test
switch (x) {
    case "found foo":
        result = "Got foo"
        // continues to following cases
    case Number:
        result = "Number"
        break // stops at this case
    default:
        result = "default"
}

// for loop (other variants exist)
String message = ''
for (int i = 0; i < 5; i++) {
    message += 'Hi '
}

// while loop
y = 5
while ( y-- > 0 ) {
    println "Are we there yet?"
}
println "We've arrived!"
- Closures are open, anonymous blocks of code that can take arguments, return a value, and be assigned to a variable. A closure definition follows the syntax { [closureParameters ->] statements }. A closure is evaluated when it is called, returning the result of the last statement in the closure.

// Find all elements > 1
above_one = [1, 2, 3].findAll { it -> it > 1 }

// Closure in a string
message = "The file ${ file.getName() } cannot be found!"
Groovy is very syntax-rich and supports many more operations. A full description of Groovy semantics can be found in the Groovy Documentation.
Can I do X in Groovy / Nextflow too?
- Are there other data and control structures that you commonly use in your analyses?
- What kind of things would you like to do in your computational pipeline?
Writing a workflow
To ease the writing of computational pipelines, Nextflow introduces two high-level data structures: channels and processes (see Basic concepts).
A channel is a data-flow object that passes data asynchronously from one process to another. Channels provide methods for reading in data from various sources. Nextflow was developed with a strong emphasis on supporting bioinformatics, and as such includes methods for supporting common file formats like fasta and fastq. Channels send data in a first in, first out manner (FIFO), however data may arrive at the next channel in a different order (asynchrony) due to process execution time, or manipulation of channel values by channel operators.
A process is a task that executes a user script. The user script can be written in any interpreted language, although the default is bash. Each task defined by a process is executed independently, and in isolation, and so input must be communicated using channels.
Example (example.nf):
#! /usr/bin/env nextflow
number_ch = Channel.of(1,2,3,4)
process Sequence {
input:
val num from number_ch
output:
stdout into out_ch
script:
"""
echo $num
"""
}
out_ch.view()
Write a Nextflow script
Create your own Nextflow script containing the following:
- A directive to use the Nextflow interpreter.
- A channel containing the words “This”, “is”, “my”, “Nextflow”, “script”.
- A statement to see the contents of the above channel using the view method.
- A process that prints each channel value using the shell command echo.

Solution
#! /usr/bin/env nextflow

word_ch = Channel.of("This","is","my","Nextflow","script")
word_ch.view()

process Display_Words {
    input:
    val word from word_ch

    script:
    """
    echo $word
    """
}
Running a workflow
A Nextflow workflow is executed using the nextflow run <script.nf> command. Each task is executed locally (on your computer) by default, and expects all the commands in your process scripts to be available on the command line. While local execution is suitable for small scale data processing, Nextflow integrates support for several third-party software tools enabling large scale data processing through various package management tools, job schedulers, and distributed compute infrastructure tools, covered later on.
$ nextflow run example.nf
N E X T F L O W ~ version 20.01.0
Launching `example.nf` [marvelous_ride] - revision: 614fc2b804
executor > local (4)
[3e/7b764f] process > Sequence [100%] 4 of 4 ✔
1
2
3
4
Nextflow is also able to run workflows from online version control repositories. If a script is not locally available, Nextflow will attempt to connect to a GitHub repository. The repository and other settings can be configured as described in the Pipeline Sharing documentation. Configuration is discussed in more detail in the Configuration section.
Parameters to both Nextflow and the pipeline script can also be passed on the command line (use nextflow help run to see all the available Nextflow options).
# nextflow run -c <config> <workflow_script> --<workflow_parameter> <value>
nextflow run -c nxf.conf my_workflow.nf --welcome_message hello
Parameters to the Nextflow workflow engine are prefixed with a single dash (-), while parameters used in the workflow script are prefixed with a double dash (--). Additional information on parameter passing is provided later.
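For example, a parameter declared in the params scope of a script gets a default value, which the double-dash command line option overrides (a minimal sketch; params.welcome_message and its default are illustrative):

// my_workflow.nf
params.welcome_message = 'hola'  // default value, overridden by --welcome_message

println "Message: ${params.welcome_message}"

Running nextflow run my_workflow.nf --welcome_message hello prints Message: hello.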
Run your own workflows
- Run your myscript.nf script.

Solution
$ nextflow run myscript.nf
- Modify and run myscript.nf to display the output of the Display_Words process.

Solution
#! /usr/bin/env nextflow

word_ch = Channel.of("This","is","my","Nextflow","script")
word_ch.view()

process Display_Words {
    input:
    val word from word_ch

    output:
    stdout into out_ch

    script:
    """
    echo $word
    """
}

out_ch.view()
$ nextflow run myscript.nf
- Run the hello script from https://github.com/nextflow-io/hello
Solution
$ nextflow run nextflow-io/hello
or
$ nextflow run https://github.com/nextflow-io/hello
References
- Nextflow Documentation: https://www.nextflow.io/docs/latest/index.html
- Groovy Syntax: https://groovy-lang.org/semantics.html
Key Points
Nextflow is written in the Groovy computer language.
Channels and Processes are the fundamental data structures of Nextflow.
A workflow script is run using nextflow run <script.nf>.
Processes
Overview
Teaching: 30 min
Exercises: 20 min
Questions
How do I include a software tool into a Nextflow workflow?
How do I pass parameters to the tool?
How do I access tool results?
Objectives
Understand the process data structure
Detail how and when a process runs from Nextflow
Construct and run a complete process definition
Processes
Processes are the Nextflow data structure used to run user scripts.
The process data structure
A process has five parts:
- Directives: Defines configuration properties
- Input: Defines the input
- Output: Defines the output
- When: Defines when the process should execute (default: true = always when input is available)
- Script: Defines the script to run.
process < name > {
[ directives ]
input:
< process inputs >
output:
< process outputs >
when:
< condition >
[script|shell|exec]:
< user script to be executed >
}
The user script is the only required part of the process that needs to be defined.
Isolated run folder
Each instance of a process is staged in its own folder.
Example script:
x = Channel.of( 'a', 'b', 'c')
process echo_string {
input:
val x
script:
"""
echo "string-$x"
"""
}
After running the example script, a work folder is created where process tasks are cached.
$ ls -a -R work/
work/:
. .. 01 40 b6
work/01:
. .. 8f8eea1b232f6a6218c0b4998fcac2
work/01/8f8eea1b232f6a6218c0b4998fcac2:
. .. .command.begin .command.err .command.log .command.out .command.run .command.sh .exitcode
work/40:
. .. e920cb5e8d1a63b338f075444ba38c
work/40/e920cb5e8d1a63b338f075444ba38c:
. .. .command.begin .command.err .command.log .command.out .command.run .command.sh .exitcode
work/b6:
. .. c9801458f3e4924a57eac3e9154105
work/b6/c9801458f3e4924a57eac3e9154105:
. .. .command.begin .command.err .command.log .command.out .command.run .command.sh .exitcode
The hidden file .command.sh contains the interpolated script to execute.
cat work/b6/c9801458f3e4924a57eac3e9154105/.command.sh
#!/bin/bash -ue
echo "string-b"
The hidden file .command.run contains code that provides the run-time environment in which the script is run. Running this file can help debug processes during the workflow development stage.
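For example, a single task can be re-run in its staging folder (using one of the task directories from the listing above):

$ cd work/b6/c9801458f3e4924a57eac3e9154105
$ bash .command.run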
Process input
Since tasks are isolated from each other, input is passed using channels.
input:
<input qualifier> <input name> [from <source channel>] [attributes]
The input qualifier declares the type of data received.

- val: A named value of any type to be used as input.
- env: A named environment variable to be used as input.
- file: A file, which is soft-linked into the staging folder so it can be accessed properly in the execution context.
- path: A backwards-compatible drop-in replacement for file with additional functionality introduced in Nextflow 19.10.0, and is thus preferred over file.
- stdin: Takes input from stdin.
- tuple: Declares input as a group of inputs using the above qualifiers.
- each: Executes a task for each input in the collection.

A sketch combining several of these qualifiers is shown below.
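A minimal sketch combining several input qualifiers (the channels id_ch, debug_ch, and reads_ch are assumed to be defined elsewhere):

modes = ['fast', 'slow']

process combine_inputs {
    input:
    val sample_id from id_ch                    // a plain value
    env DEBUG from debug_ch                     // exported as $DEBUG in the script
    tuple val(name), path(reads) from reads_ch  // a grouped input
    each mode from modes                        // one task per list element

    script:
    """
    echo "$sample_id $name $mode \$DEBUG"
    ls $reads
    """
}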
If the <input name> matches the name of a channel, it will take input directly from that channel. In general, <input name> is treated as a variable accessible in the process scope. <input name> can also be a string describing a file glob (pathname pattern expansion) containing the characters ? or *, which can be useful to control how an input file is named.
process stage {
input:
path 'dir/*' from log_ch.collect()
path csv_files, stageAs: 'data??.csv' from csv_ch.collect()
script:
"""
ls *.csv
process_files.py $csv_files
ls dir
"""
}
A table of how file glob inputs are interpreted:

Pattern | Input file cardinality | Staged as
---|---|---
* | 1 or many | Original filename
file?.ext | 1 | file1.ext
file?.ext | many | file1.ext, file2.ext, file3.ext
file??.ext | 1 | file01.ext
file??.ext | many | file01.ext, file02.ext, file03.ext
file*.ext | 1 | file.ext
file*.ext | many | file1.ext, file2.ext, file3.ext
dir/* | 1 or many | Original filename in a directory dir
dir??/* | many | Each file staged in its own directory dir01, dir02
Process output
Output should generally be coded to write to the current folder, where it remains isolated. Using the output declaration, one can specify which files should be passed on to further processes, and/or published as results using a directive.
output:
<output qualifier> <output name> [into <target channel>[,<target channel2,...]] [attributes]
The output qualifier declares the type of data emitted.

- val: A named value of any type to be taken as output.
- env: A named environment variable to be taken as output.
- file: A filename or glob (pathname pattern expansion) to be taken as output.
- path: A filename or glob (pathname pattern expansion) to be taken as output (preferred over file, see the “input” section above).
- stdout: Reads output from stdout.
- tuple: Declares output as a group of outputs using the above qualifiers.
The <output name> can be a literal value, a file glob pattern, a variable in the process scope, or an input variable. A process task will always emit just one value into the output channel. When a file glob pattern is used, a List of files is emitted rather than a single file object. Input files are, by default, not included in the list of matched files.
process outfiles {
output:
path "somefile.txt" into onefile_ch
path "lots*.file" into manyfiles_ch
script:
"""
touch somefile.txt lots{1..4}.file
"""
}
onefile_ch.view()
manyfiles_ch.view()
N E X T F L O W ~ version 20.01.0
Launching `view_outputfiles.nf` [stoic_blackwell] - revision: 07b06e56ad
executor > local (1)
[78/ea9da5] process > outfiles [100%] 1 of 1 ✔
<pwd>/<work_cachedir>/somefile.txt
[<pwd>/<work_cachedir>/lots1.file, <pwd>/<work_cachedir>/lots2.file, <pwd>/<work_cachedir>/lots3.file, <pwd>/<work_cachedir>/lots4.file]
where the current directory has been replaced with <pwd>, and the work directory has been replaced with <work_cachedir>.
Conditional processing
A process will only execute when it receives a complete input declaration, i.e., it has a data value for each declared input. We can limit process execution further to when an input has a certain property, by using the when process block. This takes any expression that returns a true or false value. If we want to execute a process based on another process having no output, we can create an input when the channel is empty, and use the when block to test for that specific input.
An example of checking a property of the input:
Channel.of('Xa','Xb').into {proc_a_ch; proc_b_ch}
process property_a {
echo true
input:
val str from proc_a_ch
when:
str =~ /a/
script:
"""
echo Found A!
"""
}
process property_b {
echo true
input:
val str from proc_b_ch
when:
str =~ /b/
script:
"""
echo Found B!
"""
}
An example of execution from no channel input:
Channel.empty().set { step_a_ch }
process step_a {
echo true
input:
val str from step_a_ch
output:
path 'log.txt' into step_b_ch
script:
"""
echo Found A! | tee log.txt
"""
}
process step_b_if_not_a {
echo true
input:
val str from step_b_ch.ifEmpty('EMPTY')
when:
str == 'EMPTY'
script:
"""
echo Did not find A. Executing B!
"""
}
The script
Most commonly, the user script is written as a multi-line string. The default script language is Bash, but it can be any interpreted language, such as Rscript, Python, Perl, and so on, as long as an interpreter directive is provided at the start of the script (this is different from the process directives described next). The only limitation on the commands in the script is that they must be available on the target execution system.
Example script using R:
process my_R_script {
input:
path 'data.csv' from analysis_ch
script:
"""
#! /usr/bin/env Rscript
data <- read.csv("data.csv")
"""
}
One caveat of using Bash as the scripting language is that both Bash and Nextflow use the same syntax for variables, so care must be taken over whether a variable should be evaluated in the Nextflow context (by the Nextflow interpreter) or in the Bash context (by the Bash interpreter).
process foo {
script:
"""
# Escape the \$ sign to evaluate a variable in Bash context.
echo "\$PATH is not the same as $PATH"
"""
}
One can also use the shell block definition with single-quote multi-line strings to do the same thing.
process foo {
shell:
'''
# Nextflow variables are now referenced with !{var}.
echo "$PATH is not the same as !{PATH}"
'''
}
The script block can be thought of as a function that returns the multi-line string as the result. As such, a script block allows the inclusion of additional code statements including control structures such as if statements to dictate which code snippet should be executed.
process my_task {
input:
tuple val(sample), path(reads) from my_read_ch
script:
if ( sample =~ /control/ ) {
options = "Options for control"
"""
echo "Do this with control $options"
"""
} else {
options = "Options for cases"
"""
echo "Do that with cases -I -X $options"
"""
}
}
Scripts do not need to be included as multi-line strings in a Nextflow workflow process. They can be written in a separate file that can be used as a stand-alone script or as a template script. Stand-alone scripts should be stored in the bin folder within the same directory as the workflow, and can be executed as though they were a normal script available from the PATH environment variable. A template script should be stored in the templates folder within the same directory as the workflow. Template scripts interpret variables starting with $ in the Nextflow context, and must be called using the template function. Variables can either be assigned from the input declaration or from the script block before the template is called.
process task_A {
script:
"""
# Run my_script.pl from bin/
my_script.pl
"""
}
process task_B {
input:
val var from var_ch
script:
template 'vars.sh'
}
where vars.sh looks like:
#! /usr/bin/env bash
echo "This is my $var"
Directives
Directives describe the execution behaviour of the process, from the compute resources to reserve, to where it should put the results. There are lots of directives described in the documentation, which allow you to customise the running of your workflow to your infrastructure. The directives should appear before the input, output, when, and script blocks, or alternatively can be provided via configuration files (recommended for certain directives).
Here is a table of some useful directives.
Directive | Description
---|---
publishDir | Defines where files from the output block should be written. Multiple publishDir directives can be provided to a process, along with the pattern parameter, to write output to separate paths.
label | Provides the process with a label that can be used for configuring a group of processes.
executor | Defines the system on which the processes are executed. This could be local execution, submission to a job scheduler, or to other services.
cpus | Assigns the number of cores to reserve. This can be accessed within the script block using ${task.cpus}.
time | Assigns the maximum length of time the task should run for.
memory | Assigns the amount of memory to reserve.
queue | Defines the queue to use for certain grid based executors.
clusterOptions | Defines additional options necessary for your grid based executors.
scratch | Defines whether a process should be run in a temporary folder local to the execution node (highly recommended when using a grid executor!). Only output files are copied back to the working directory (saving valuable disk space, but losing potentially helpful files from unsuccessful tasks).
module | Defines which software modules to use.
conda | Defines which conda software packages to use.
container | Defines which container images of software to use. Use containerOptions to provide process-specific additional configuration.
errorStrategy | Describes how Nextflow should behave when a process terminates with an error.
Directives can also be dynamically defined using a closure.
process foo {
memory { 2.GB * task.attempt }
time { 1.hour * task.attempt }
errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries 3
script:
"""
<your job here>
"""
}
A special variable task is accessible in the process block, which provides the runtime values of the task directives and more.
process qualimap {
input:
tuple val(sample), path(alignment) from aln_ch
script:
bamfile = alignment.find { it =~ /\.bam$/ }
mem = task.memory ? task.memory.toGiga() : task.cpus * 6
"""
export JAVA_OPTS="-Xms32m -Xmx${mem}G -Djava.io.tmpdir=${task.scratch}"
qualimap bamqc -bam $bamfile -outdir ${bamfile.baseName} -outformat 'HTML' -nt ${task.cpus}
"""
}
The variables available through the task configuration are partially documented in the second table of the Tracing documentation pages.
Exercises
- Order the blocks
- Stage a parameter file
- Capture output from one process and stage in another.
- Stage a folder
- Stage some files
- When does a process task execute?
- Incomplete header
Key Points
A process has five parts: directives, input, output, when, and the script.
Each process task runs from it’s own directory.
A process only executes when there is a complete process input declaration.
Selected process output can be saved to a folder using
publishDir
.Directives control process execution behaviour.
Channels
Overview
Teaching: 30 min
Exercises: 20 min
Questions
How do I read data into the workflow?
How do I pass data around the workflow?
Objectives
Learn how to read data into the workflow.
Learn how data is passed between processes.
Learn how to use channel operations to wrangle data into the required input form.
What is a Channel?
A Channel is a data structure designed to efficiently pass data from one process to another. The primary property of channels is they are asynchronous. As soon as a process completes, the results are put in the next channel for processing, without waiting for other processes. This allows a following process to start sooner.
Nextflow utilises two types of channel: queue and value channels. Queue type channels consume data in a first in, first out manner to create process input declarations. The data in value type channels, however, can be reused when constructing process input declarations. Since a process must perform operations on input channels to make an input declaration and spawn a task, each process needs its own input channel. This is achieved using the into operator on a channel to create two or more channels that can be used by different processes.
Channel.of(1,2,3,4).into { sq_ch; db_ch }
process square_it {
input:
val x from sq_ch
script:
"""
echo $x*$x | bc -l
"""
}
process double_it {
input:
val x from db_ch
script:
"""
echo 2*$x | bc -l
"""
}
Reading data into a workflow
The primary method of reading data into a Nextflow workflow is to use channel factories: methods that produce channels.
There are several channel factories available:

- Channel.of: A queue type channel which emits any values passed as parameters to it.

Channel.of(1, 2, 3, 4, 'X', 'Y')
    .view()

1
2
3
4
X
Y

- Channel.value: A value type channel which emits a single value.

val_ch = Channel.value('Reuse_me')

- Channel.fromList: A queue type channel which emits all elements of the List passed.

Channel.fromList(['a','b','c','d'])
    .view()

a
b
c
d

- Channel.fromPath: A queue type channel which emits each file that matches the path provided. Shell wildcard characters can be used to provide a glob pattern to match.

file_ch = Channel.fromPath('path/to/my/*.files')

- Channel.fromFilePairs: A queue type channel which emits a group of files that match the glob pattern provided, along with the common prefix of the file group.

filepair_ch = Channel.fromFilePairs('path/to/my/*_{1,2}.fastq.gz')
filepair_ch.view()

[SRR2356, [path/to/my/SRR2356_1.fastq.gz, path/to/my/SRR2356_2.fastq.gz]]
[SRR2357, [path/to/my/SRR2357_1.fastq.gz, path/to/my/SRR2357_2.fastq.gz]]
[SRR2358, [path/to/my/SRR2358_1.fastq.gz, path/to/my/SRR2358_2.fastq.gz]]

- Channel.watchPath: A queue type channel which emits any filenames matching the glob pattern provided. By default, only files created while watching the path are emitted. Modification or deletion events can also be used to trigger emissions (see the sketch below).
- Channel.empty: An empty channel which emits nothing.
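A sketch of Channel.watchPath (the path and event list are illustrative):

// Emit files under results/ as they are created or modified.
Channel.watchPath('results/*.txt', 'create,modify')
    .view()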
Information on these and other channel factories can be found in the Nextflow Channel factory documentation.
Channel operators
Channels pass data from one process to another using the into and from keywords, which put data in, and take data out of, the channels. The into keyword creates a queue type channel which is named by the creating process and is available from the global scope to another process.
process WriteHello {
output:
file "myfile.txt" into file_ch
script:
"""
<commands>
"""
}
process AddWorld {
input:
file 'myfile.txt' from file_ch
script:
"""
<commands>
"""
}
Channel operators allow you to manipulate data within channels. Some common examples are:
- view (queue type channel): Prints channel data values to the console standard output. This is mostly useful for debugging your own workflows, in particular when you want to clarify an unclear error message related to file input, or why a process does not spawn a task.

process WrapText {
    input:
    file text_file from text_file_ch.view()

    script:
    """
    <commands>
    """
}
- into (queue type channels): Emits the data from one channel into one or more other named channels. This is commonly used to provide the same input to multiple processes.

Channel.fromFilePairs('/path/to/reads_{1,2}.fastq.gz')
    .into { fastqc_read_ch; fastp_read_ch }

process FastP {
    input:
    tuple val(sample), file(reads) from fastp_read_ch

    output:
    tuple val(sample), file("${sample}_trimmed_{1,2}.fastq.gz") into (bwa_ch, screen_ch)

    script:
    """
    <commands>
    """
}

process FastQC {
    input:
    tuple val(sample), file(reads) from fastqc_read_ch

    script:
    """
    <commands>
    """
}
- collect (value type channel): Gathers all data in the channel into a single output. A common use case is when one process is used to summarise the output from previous processes. When an input channel remains empty, nothing is emitted by collect. Use toList if a channel should emit an empty list when the channel is empty.

process CollectLogs {
    input:
    file logfiles from logs_ch.collect()

    script:
    """
    <commands>
    """
}
- mix (queue type channel): Combines data from other channels into this one. A common use case is when one wants to process data from different stages (e.g., pre- and post-filtering) with the same analysis process.

process BAMStats {
    input:
    file bam_file from raw_bam_ch.mix(filtered_bam_ch)

    script:
    """
    <commands>
    """
}
- join (queue type channel): Joins data together based on a common attribute. A common use case is to process related files that were previously processed separately. Reading input from two queue type channels simultaneously does not guarantee correct pairing of data due to their asynchronicity, so data must be combined using a common property before processing.

process AnnotateBlastp {
    input:
    tuple val(sample), file(fasta_file) from blastp_fasta_files

    output:
    tuple val(sample), file("${sample}_blastp-annotation.tsv") into blastp_annotation_files

    script:
    """
    <commands>
    """
}

process AnnotateInterproscan {
    input:
    tuple val(sample), file(fasta_file) from interproscan_fasta_files

    output:
    tuple val(sample), file("${sample}_interproscan-annotation.tsv") into interproscan_annotation_files

    script:
    """
    <commands>
    """
}

process MergeAnnotations {
    input:
    tuple val(sample), file(blast_annotation), file(interproscan_annotation) from blastp_annotation_files.join(interproscan_annotation_files)

    script:
    """
    <commands>
    """
}
Many more channel operators are described in the Nextflow Channel Operator Documentation.
Multiple input channels
It is important to understand how multiple input channels are processed. When two or more channels are declared as process inputs, the process waits until it receives an input value from all the channels declared as input.
Two or more queue type channels.
process foo {
echo true
input:
val x from Channel.of(1,2)
val y from Channel.of('a','b','c')
script:
"""
echo $x and $y
"""
}
In this case the process foo will only run two times, since there are only two inputs in the first channel. Channel values are consumed, so there is nothing left to pair with 'c', which is discarded.
In the example above, the process will execute on the pairings 1 and a, and 2 and b; however, for more complex workflows, the queues are asynchronous, meaning there is no guarantee of getting the pairings 1 and a, and 2 and b. The emission of 'c' may happen first, resulting in a 1 and c pairing. If certain files must be processed together, use one of the queue combining operators such as join or groupBy to generate the correct pairing before it is passed as input, as sketched below.
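A minimal sketch of pairing by a common key with join (the channel contents are illustrative; join matches on the first element of each tuple by default):

ch_a = Channel.of(['sample1', 'a1'], ['sample2', 'a2'])
ch_b = Channel.of(['sample2', 'b2'], ['sample1', 'b1'])

process paired {
    echo true
    input:
    tuple val(sample), val(a), val(b) from ch_a.join(ch_b)
    script:
    """
    echo $sample: $a $b
    """
}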
Value channels with queue channels.
process foo {
echo true
input:
val x from Channel.value(1)
val y from Channel.of('a','b','c')
script:
"""
echo $x and $y
"""
}
In this example, the process foo runs three times, since the input data from the value type channel can be reused to make a complete input declaration. This produces the pairings 1 and a, 1 and b, and 1 and c.
Exercises
- mix operator
- join operator
- each -> cross
Key Points
Channels pass data into and out of processes.
There are two types of channel, queue and value channels.
Each channel must have it’s own input channel.
Channels can be manipulated using channel operators.
Workflow Configuration
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How do I configure Nextflow for my infrastructure?
Objectives
Learn how to separate workflow logic from execution.
Learn how to create workflow profiles.
Learn how configuration nesting works.
Configuration
Configuration files are a collection of name and value pairs that tell Nextflow how to behave. They provide a way to separate workflow logic from the execution environment, allowing workflow portability and a cleaner workflow script.
When a Nextflow script is launched, it looks for configuration settings in three places: nextflow.config in the current directory, nextflow.config in the script directory (if not the same as the current directory), and finally config in $HOME/.nextflow. Additional configuration can be provided with the -c <config> parameter.
When two or more of these configuration settings exist, they are merged, where -c <config> overrides settings in $PWD/nextflow.config, which overrides <script_dir>/nextflow.config, which overrides $HOME/.nextflow/config. Lastly, command line configuration overrides configuration provided from files.
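For example (the file names and values are illustrative):

$ cat nextflow.config
params.str = 'Hello'
$ cat custom.config
params.str = 'Hi'
# The command line wins over both config files: the script sees params.str == 'Hey'
$ nextflow run -c custom.config script.nf --str 'Hey'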
Configuration scopes
Configuration can be organised into scopes using the curly bracket notation or the dot notation. Configuration can also be included from another file, which helps to include repeated configuration settings in different scopes. Relative paths are resolved against the actual path of the including config file, which helps packaging configuration.
executor {
name = 'local'
// maximum cpus (applies to 'local' executor only)
cpus = 4
memory = 32.GB
}
// How many cpus a process should request.
process.cpus = 1
// Include configuration from foo
includeConfig 'path/foo.config'
Certain scopes are reserved to have special meaning. Only some are described here.
Scope params
The params scope allows one to define parameters accessible to the workflow script.
params {
str = "Hello"
fasta = '/path/to/fasta'
}
Variables defined in the params scope are accessible from anywhere in the script, but it is better practice to provide them via an input declaration after having done appropriate checks on the input.
process echo {
echo true
script:
"""
echo ${params.str}
"""
}
Channel.fromPath(params.fasta, checkIfExists: true).set { fa_ch }
process index_fasta {
input:
path fasta from fa_ch
script:
"""
samtools faidx $fasta
"""
}
As described above, the configured value can be overridden with a command line parameter, or using another configuration file provided with -c.
nextflow run script.nf --str "Hello $USER"
Scope env
The env scope defines variables to be exported in the execution environment.
env {
PATH = "/my/new/tool:$PATH"
TOOL_LIB = "/my/new/tool/libs"
}
Scope process
The process scope defines any property described in the Process directives documentation.
process {
cpus = 1
time = '1h'
scratch = true
}
The selectors withName: <process_name> and withLabel: <label> can be used to provide configuration to specific processes.
process {
withName: 'index_fasta' {
cpus = 1
}
withLabel: 'bigMem' {
memory = 256.GB
}
}
Selector expressions can be used to group process selections or negate them.
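For example (a sketch; the process names and label are illustrative):

process {
    // A regular expression selects several processes by name.
    withName: 'blastn|blastp' {
        cpus = 4
    }
    // Prefixing a selector pattern with '!' negates it.
    withLabel: '!bigMem' {
        memory = 8.GB
    }
}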
Scope manifest
The manifest scope provides metadata for your workflow.
manifest {
homePage = 'http://foo.com'
description = 'Pipeline does this and that'
mainScript = 'foo.nf'
version = '1.0.0'
}
Configuration profiles
Configuration profiles define predefined configuration settings to be used with the workflow, enabling workflow portability. Profiles can include how to manage software, executor settings, or computer or institute specific settings, and multiple profiles can be used together to provide flexibility of use.
profiles {
/*
<profile_name> {
<configuration scope1>
<configuration scope2>
...
}
*/
// default profile
standard {
}
hal_9000 {
process {
cpus = 1000
}
}
laptop {
process {
cpus = 1
}
}
}
Profiles are used directly on the command line:
nextflow run -profile <profile1>[,<profile2>,...] <nextflow_script>
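For example, to run a workflow with the laptop profile defined above (the script name is illustrative):

nextflow run -profile laptop my_workflow.nf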
Executor configurations
Key Points
First key point. Brief Answer to questions. (FIXME)
Software Package Managers
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How is software managed?
Objectives
Learn how to leverage package managers to handle software requirements.
Software package managers
The basic premise of Nextflow is to chain lots of different tools
together. By default, Nextflow expects all the commands or interpreters
to be available via the shell PATH
environment variable.
However, many tools may have complex dependencies which can
cause conflicts with other tools in the same environment.
Nextflow supports several package management systems that isolate tools and their dependencies into separate environments, preventing the majority of software conflicts.
Ideally, package management should be handled in the configuration file rather than in the nextflow script. This allows users to tailor software execution to their computing environment.
Reproducibility
Software results can vary depending on the execution environment. In order to ensure reproducibility and cross-platform compatibility across many infrastructures, the use of container technology is recommended. In the absence of a suitable container technology, conda is recommended to manage software installation and dependencies. Although modules and self-installed tools are supported, they do not enable ease of use across platforms, creating a potentially high barrier to using the workflow.
Environment Modules
Environment Modules is a package manager that loads tools via the module load <package> command. Modules are supported in Nextflow via the module directive in the process scope.
process blast {
module 'ncbi-blast/2.9.0'
"""
blast -version
"""
}
The module directive can also be assigned in a config file:
process {
// available to all processes
module = 'cluster-utils/1.2.3'
// Override the module directive above for a specific process
withName: blast {
module = 'ncbi-blast/2.9.0:gnu-parallel/3.5'
}
}
Multiple packages can be loaded at the same time by separating the package names with a colon (:).
Note that environment modules are often centrally managed (e.g., by cluster administrators), which may limit the tools available to the user.
Conda
Conda is another package, dependency, and environment manager. Of particular interest is the Bioconda channel, which specialises in bioinformatic software. Support for Conda is provided via the conda directive in the process scope.
A user will often create an environment for themselves to use tools, e.g.:
# Create environment via commands
$ conda create -n blast_env blast=2.9.0
# Or create environment via yaml files
$ cat environment.yml
name: blast_env
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- blast=2.9.0
$ conda env create -f environment.yml
Both methods are supported in Nextflow:
process blastp {
conda 'blast=2.9.0'
"""
blastp -version
"""
}
process blastn {
conda '/path/to/environment.yml'
"""
blastn -version
"""
}
This will create the environment in the conda.cacheDir directory (the default location is the conda folder in the working directory).
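The cache location can be changed in a configuration file (the path is illustrative):

conda {
    cacheDir = '/path/to/shared/conda_envs'
}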
The use of existing environments is also supported by providing the full path to the environment.
process blastp {
conda '/path/to/existing/conda/env'
"""
blastp -version
"""
}
The conda directive can also be used in the config file.
process {
// available to all processes
conda = 'gnu-parallel=3.5'
// Override the conda directive above for a specific process.
withName: blastn {
conda = 'blast=2.9.0 gnu-parallel=3.5'
}
}
Docker
Docker is a container platform that provides a standardised packaging format known as container images. A container image is a unit of software that packages up code and all its dependencies so the application runs the same regardless of the underlying infrastructure. The Docker Engine needs to be installed to run Docker container images on your computer infrastructure.
Typically, images are run as containers in which your commands can be executed.
# Start a container based on the image `docker run <image>`
# Run a command `fastqc --version`
# Remove the container on completion `--rm`
$ docker run --rm quay.io/biocontainers/fastqc:0.11.9--0 fastqc --version
FastQC v0.11.9
Container images are built according to recipes prescribed in a Dockerfile. A base image is used as a starting point, which could be an operating system image, or a pre-defined image with other tools preinstalled (e.g., miniconda as seen below). Additional instructions then add commands to run, set environment variables, entry points, and other metadata.
An example Dockerfile:
# Select the miniconda image as the base
# https://hub.docker.com/r/continuumio/miniconda3/dockerfile
FROM continuumio/miniconda3:4.8.2
# Select the shell to use.
SHELL ["/bin/bash", "-c"]
# Add metadata to the container using labels.
LABEL description="Spade (Search for Patterned DNA Elements) container" \
author="Mahesh Binzer-Panchal" \
version="1.0.0"
# APT (Advanced Packaging Tool) is a package manager for certain Linux distributions.
# It can be used to update and install software dependencies
RUN apt-get update --fix-missing && \
apt-get install -y procps ghostscript
# Conda is another package manager that simplifies package installation
# This is a useful option for installing bioinformatic packages
RUN conda update -n base conda && \
conda install -c conda-forge -c bioconda \
python=3.6 mafft=7.455 blast=2.9.0 openssl=1.1.1e && \
conda clean --all -f -y
# Some tools are not available via package managers.
# These must then be manually installed.
WORKDIR /opt
RUN git clone --depth 1 https://github.com/yachielab/SPADE && \
cd SPADE && chmod u+x *.py && \
pip install matplotlib==2.2.3 && \
pip install seaborn==0.8.1 && \
pip install weblogo==3.6.0 && \
pip install biopython==1.76
# Environment variables can be set to provide settings for new tools.
ENV PATH="/opt/SPADE:${PATH}"
# A default command can be provided when a container is started (entry point).
CMD [ "SPADE.py" ]
When a container image is built, it is stored in a registry, either locally or online. Nextflow is able to retrieve these container images when provided with a path to the image and version (tag) ('docker-repository/image-name:tag') given by the container directive in the process scope. Images should preferably be stored in an online registry to enable access for others.
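For example, the Dockerfile above could be built and pushed to a registry like so (the repository name and tag are illustrative):

$ docker build -t myrepo/spade:1.0.0 .
$ docker push myrepo/spade:1.0.0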
Docker Registries
Here are some useful Docker registries:
- Docker Hub: https://hub.docker.com/
- Red Hat Quay: https://quay.io/
- Biocontainers (Bioconda images): https://biocontainers.pro/#/
Github also supports hosting Docker images using Github packages.
An additional docker scope is provided by Nextflow which allows you to supply extra parameters to Docker. In order to use a container image with Docker, it must be enabled.
docker {
enabled = true
}
process {
// available to all processes
container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'
// Override the container directive above for a specific process.
withName: blastn {
container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_4'
}
}
Default user

When running a container using Docker, the default Docker user is used to run the process. This is the same user with which the image was built (default: root). For life science projects it is rare that tools need to be run with superuser privileges, and often one wants to run tools as the current user. It is helpful to add your user and group IDs to the runOptions in the docker configuration scope like so:

docker {
    enabled = true
    // Uses `id` to get the current user id and group id of the user
    runOptions = '-u "$( id -u ):$( id -g )"'
}
Docker images run using other container platforms use the settings prescribed by the other container platform. For example, Singularity will use the current user as the default user.
References
- Nextflow Docker integration: https://www.nextflow.io/docs/latest/docker.html
- Nextflow Docker Configuration: https://www.nextflow.io/docs/latest/config.html#scope-docker
- Docker Documentation: https://docs.docker.com/
- Docker Best Practices: https://docs.docker.com/develop/dev-best-practices/
- Dockerfile reference: https://docs.docker.com/engine/reference/builder/
Singularity
Singularity is another container platform, but it is root-less and daemon-less, which means it runs as a regular user. This platform is often used in compute infrastructures where escalated privileges are not desirable. Singularity can create and run containers from both Singularity and Docker images.
A typical command-line usage of a singularity container would be like so:
# Start a container based on the image `singularity exec <image>`
# Run a command `fastqc --version`
# The image is prefixed with docker:// to denote it's a docker image
$ singularity exec docker://quay.io/biocontainers/fastqc:0.11.9--0 fastqc --version
FastQC v0.11.9
Although image definition files can be written for Singularity, writing images using Docker allows greater portability of the image. However, see singularity build --help for the image definition syntax.
An additional singularity scope is provided by Nextflow which allows you to supply extra parameters to Singularity. In order to use a container image with Singularity, it must be enabled.
singularity {
enabled = true
}
process {
// available to all processes
container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'
// Override the container directive above for a specific process.
withName: blastn {
container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_4'
}
}
Other container platforms
Nextflow also supports other container platforms such as Podman and Shifter, which can be used in the same transparent manner as Docker and Singularity.
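For example, Podman is enabled in the same way (a sketch reusing the container image from the examples above):

podman {
    enabled = true
}
process {
    container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'
}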
Key Points
Commands are expected to be available from the shell PATH.
Environment modules can provide centrally managed software environments.
Conda can provide user managed software environments.
Container platforms can provide self-contained software environments, and are recommended for reproducibility.
The new Nextflow syntax - DSL2
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is different in DSL2?
Objectives
Learn about the new workflow data structure.
Learn how the syntax changes between the DSLs.
DSL2 is the second version of the Nextflow Domain Specific Language.
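A minimal sketch of a DSL2 script (assuming a Nextflow version with DSL2 support): processes no longer use from and into; instead, channels are wired together in a workflow block, and a channel can be passed to several processes without explicit forking.

nextflow.enable.dsl = 2

process SAY_HELLO {
    input:
    val word

    output:
    stdout

    script:
    """
    echo Hello $word!
    """
}

workflow {
    word_ch = Channel.of('This', 'is', 'DSL2')
    SAY_HELLO(word_ch).view()
}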
Key Points
The workflow block describes the workflow.
Workflows can be modularised.
Channel forking is now implicit.
Converting your project to Nextflow
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can I convert my project to Nextflow?
Objectives
Learn how to structure an analysis.
Learn how to continue the analysis.
Learn how to test your workflow.
FIXME
Key Points
Use an organised folder structure.
Start by making process templates for code.
Refactor once the workflow logic is in place.
Make a test data set.
Supplementary: Version control
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How does Nextflow integrate with version control?
Objectives
Demonstrate how a workflow can be used from a version control repository.
Demonstrate how to use version control to develop a workflow.
FIXME
Version control
- https://swcarpentry.github.io/git-novice/
Github
- Using -r to get a specific branch or revision.
Others
Key Points
Nextflow can run workflows from version control platforms.
Version control can aid workflow development.
Supplementary - Nextflow on Grid Computing Infrastructures
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
FIXME
Grid Computing Infrastructures
Job schedulers
Key Points
First key point. Brief Answer to questions. (FIXME)
Supplementary - Nextflow on Cloud Computing Infrastructures
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
FIXME
Cloud Computing Infrastructures
Amazon
Key Points
First key point. Brief Answer to questions. (FIXME)