This lesson is still being designed and assembled (Pre-Alpha version)

Nextflow for Reproducible Scientific Analysis

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

FIXME

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Introduction to Workflows

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What is a workflow?

  • Why should I use a workflow?

  • What are the advantages of Nextflow?

Objectives
  • Describe a workflow.

  • Explain why workflows are useful.

  • Ascertain whether Nextflow meets your needs.

Workflows

Key Points

  • A workflow is a series of tasks to process data.

  • A workflow framework is a set of tools that allow you to compose tasks, connect them together, and process data.

  • There are many workflow frameworks available.

  • Nextflow is reproducible, portable, simple, extensible, well designed, open source, and supported.


The Nextflow Language

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • What language is Nextflow written in?

  • How do I write a Nextflow script?

  • How can I run a Nextflow script?

Objectives
  • Describe the Groovy syntax.

  • Write a Nextflow workflow.

  • Run a Nextflow workflow.

Nextflow is a Domain Specific Language (DSL) designed to ease the writing of computational pipelines. It is an extension of the computer language Groovy, which is a superset of the computer language Java. As such, Nextflow can execute any piece of Groovy code or use any library for the JVM platform. Full documentation can be found on the Nextflow documentation website.

This section describes how to write and run a Nextflow workflow.

The Groovy language

What’s your scripting language experience?

  • How many of you use a scripting language on a regular basis?
    • Bash
    • Python
    • Perl
    • R
    • Groovy

Before writing a workflow, here are some fundamental concepts of the Groovy language.
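
A minimal sketch of some of these concepts (the variable names are only for illustration):

// Variables
def name = "Nextflow"

// Lists and maps use square-bracket literals
def samples  = ['sampleA', 'sampleB', 'sampleC']
def settings = [threads: 4, memory: '8 GB']

// Closures are blocks of code that can be passed to methods such as each()
samples.each { sample -> println "Processing $sample with ${settings.threads} threads" }

// Variables are interpolated in double-quoted strings only
println "Hello, $name"   // prints: Hello, Nextflow
println 'Hello, $name'   // prints: Hello, $name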

Groovy is very syntax-rich and supports many more operations. A full description of Groovy semantics can be found in the Groovy Documentation.

Can I do X in Groovy / Nextflow too?

  • Are there other data and control structures that you commonly use in your analyses?
  • What kind of things would you like to do in your computational pipeline?

Writing a workflow

To ease the writing of computational pipelines, Nextflow introduces two high-level data structures: channels and processes (see Basic concepts).

A channel is a data-flow object that passes data asynchronously from one process to another. Channels provide methods for reading in data from various sources. Nextflow was developed with a strong emphasis on supporting bioinformatics, and as such includes methods that support common file formats such as fasta and fastq. Channels send data in a first in, first out (FIFO) manner; however, data may arrive at the next process in a different order (asynchrony) due to differences in task execution time or manipulation of channel values by channel operators.

A process is a task that executes a user script. The user script can be written in any interpreted language, although the default is bash. Each task defined by a process is executed independently and in isolation, so input must be communicated using channels.

Example (example.nf):

#! /usr/bin/env nextflow

number_ch = Channel.of(1,2,3,4)

process Sequence {

    input:
    val num from number_ch

    output:
    stdout into out_ch

    script:
    """
    echo $num
    """

}

out_ch.view()

Write a Nextflow script

Create your own Nextflow script containing the following:

  • A directive to use the Nextflow interpreter.
  • A channel containing the words “This”, “is”, “my”, “Nextflow”, “script”.
  • A statement to see the contents of the above channel using the view method.
  • A process that prints each channel value using the shell command echo.

Solution

#! /usr/bin/env nextflow

word_ch = Channel.of("This","is","my","Nextflow","script")
word_ch.view()

process Display_Words {

    input:
    val word from word_ch

    script:
    """
    echo $word
    """

}

Running a workflow

A Nextflow workflow is executed using the nextflow run <script.nf> command. Each task is executed locally (on your computer) by default, and expects all the commands in your process scripts to be available on the command line. While local execution is suitable for small scale data processing, Nextflow integrates support for several third-party tools, enabling large scale data processing through various package managers, job schedulers, and distributed compute infrastructures, covered later on.

$ nextflow run example.nf
N E X T F L O W  ~  version 20.01.0
Launching `example.nf` [marvelous_ride] - revision: 614fc2b804
executor >  local (4)
[3e/7b764f] process > Sequence [100%] 4 of 4 ✔
1
2
3
4

Nextflow is also able to run workflows from online version control repositories. If a script is not locally available, Nextflow will attempt to connect to a GitHub repository. The repository and other settings can be configured as described in the Pipeline Sharing documentation. Configuration is discussed in more detail in the Configuration section.

Parameters to both Nextflow and the pipeline script can also be passed on the command line (use nextflow help run to see all the available Nextflow options).

# nextflow run -c <config> <workflow_script> --<workflow_parameter> <value>
nextflow run -c nxf.conf my_workflow.nf --welcome_message hello

Parameters to the Nextflow workflow engine are prefixed with a single dash -, while parameters used in the workflow script are prefixed with a double dash --. Additional information on parameter passing is provided later.

Run your own workflows

  • Run your myscript.nf script

Solution

$ nextflow run myscript.nf
  • Modify and run myscript.nf to display the output of the Display_Words process.

Solution

#! /usr/bin/env nextflow

word_ch = Channel.of("This","is","my","Nextflow","script")
word_ch.view()

process Display_Words {

    input:
    val word from word_ch

    output:
    stdout into out_ch

    script:
    """
    echo $word
    """

}

out_ch.view()
$ nextflow run myscript.nf
  • Run the hello script from https://github.com/nextflow-io/hello

Solution

$ nextflow run nextflow-io/hello

or

$ nextflow run https://github.com/nextflow-io/hello

References

Key Points

  • Nextflow is written in the Groovy computer language.

  • Channels and Processes are the fundamental data structures of Nextflow.

  • A workflow script is run using nextflow run <script.nf>.


Processes

Overview

Teaching: 30 min
Exercises: 20 min
Questions
  • How do I include a software tool into a Nextflow workflow?

  • How do I pass parameters to the tool?

  • How do I access tool results?

Objectives
  • Understand the process data structure

  • Detail how and when a process runs from Nextflow

  • Construct and run a complete process definition

Processes

Processes are the Nextflow data structure used to run user scripts.

The process data structure

A process has five parts:

process < name > {

   [ directives ]

   input:
    < process inputs >

   output:
    < process outputs >

   when:
    < condition >

   [script|shell|exec]:
   < user script to be executed >

}

The user script is the only part of the process that must be defined.

Isolated run folder

Each instance of a process is staged in its own folder.

Example script:

x = Channel.of( 'a', 'b', 'c')

process echo_string {

    input:
    val x

    script:
    """
    echo "string-$x"
    """

}

After running the example script, a work folder is created where process tasks are cached.

$ ls -a -R work/
work/:
.  ..  01  40  b6

work/01:
.  ..  8f8eea1b232f6a6218c0b4998fcac2

work/01/8f8eea1b232f6a6218c0b4998fcac2:
.  ..  .command.begin  .command.err  .command.log  .command.out  .command.run  .command.sh  .exitcode

work/40:
.  ..  e920cb5e8d1a63b338f075444ba38c

work/40/e920cb5e8d1a63b338f075444ba38c:
.  ..  .command.begin  .command.err  .command.log  .command.out  .command.run  .command.sh  .exitcode

work/b6:
.  ..  c9801458f3e4924a57eac3e9154105

work/b6/c9801458f3e4924a57eac3e9154105:
.  ..  .command.begin  .command.err  .command.log  .command.out  .command.run  .command.sh  .exitcode

The hidden file .command.sh contains the interpolated script to execute.

cat work/b6/c9801458f3e4924a57eac3e9154105/.command.sh
#!/bin/bash -ue
echo "string-b"

The hidden file .command.run contains code that provides the run-time environment in which the script is run. Running this file can help to debug processes during workflow development.
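
For example, a single task can be re-run by hand from its work directory (the hash below is from the listing above and will differ on your system):

# Change into the task's work directory and re-run it with its full run-time setup
$ cd work/b6/c9801458f3e4924a57eac3e9154105
$ bash .command.run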

Process input

Since tasks are isolated from each other, input is passed using channels.

input:
  <input qualifier> <input name> [from <source channel>] [attributes]

The input qualifier declares the type of data received.

If the <input name> matches the name of a channel, it will take input directly from that channel. In general, <input name> is treated as a variable accessible in the process scope. <input name> can also be a string describing a file glob (pathname pattern expansion) containing the characters ? or *, which can be useful to control how an input file is named.

process stage {

    input:
    path 'dir/*' from log_ch.collect()
    path csv_files, stageAs: 'data??.csv' from csv_ch.collect()

    script:
    """
    ls *.csv
    process_files.py $csv_files
    ls dir
    """
}

A table of how file glob inputs are interpreted:

Pattern      Input file cardinality   Staged as
*            1 or many                Original filename
file?.ext    1                        file1.ext
file?.ext    many                     file1.ext, file2.ext, file3.ext
file??.ext   1                        file01.ext
file??.ext   many                     file01.ext, file02.ext, file03.ext
file*.ext    1                        file.ext
file*.ext    many                     file1.ext, file2.ext, file3.ext
dir/*        1 or many                Original filename in a directory dir
dir??/*      many                     Each file staged in its own directory dir01, dir02, ...

Process output

Output should generally be written to the current (task) folder, where it remains isolated. Using the output declaration, one can specify which files should be passed on to further processes, and/or published as results using a directive.

output:
  <output qualifier> <output name> [into <target channel>[,<target channel2>,...]] [attributes]

The output qualifier declares the type of data emitted.

The <output name> can be a literal value, a file glob pattern, a variable in the process scope, or an input variable. A process task will always emit just one value into the output channel. When a file glob pattern is used, a List of files is emitted rather than a single file object. Input files are by default not included in the list of matched files.

process outfiles {

    output:
    path "somefile.txt" into onefile_ch
    path "lots*.file" into manyfiles_ch

    script:
    """
    touch somefile.txt lots{1..4}.file
    """
}

onefile_ch.view()
manyfiles_ch.view()
N E X T F L O W  ~  version 20.01.0
Launching `view_outputfiles.nf` [stoic_blackwell] - revision: 07b06e56ad
executor >  local (1)
[78/ea9da5] process > outfiles [100%] 1 of 1 ✔
<pwd>/<work_cachedir>/somefile.txt
[<pwd>/<work_cachedir>/lots1.file, <pwd>/<work_cachedir>/lots2.file, <pwd>/<work_cachedir>/lots3.file, <pwd>/<work_cachedir>/lots4.file]

where the current directory has been replaced with <pwd>, and the work directory has been replaced with <work_cachedir>.

Conditional processing

A process will only execute when it receives a complete input declaration, i.e. has a data value for each declared input. We can also limit a process execution further to when an input has a certain property, by using the when process block. This takes any expression that returns a true or false value. If we want to execute a process based on another process having no output, we can create an input when the channel is empty, and use the when block to test for that specific input.

An example of checking a property of the input:

Channel.of('Xa','Xb').into {proc_a_ch; proc_b_ch}

process property_a {

        echo true

        input:
        val str from proc_a_ch

        when:
        str =~ /a/

        script:
        """
        echo Found A!
        """
}

process property_b {

        echo true

        input:
        val str from proc_b_ch

        when:
        str =~ /b/

        script:
        """
        echo Found B!
        """

}

An example of execution from no channel input:

Channel.empty().set { step_a_ch }

process step_a {

        echo true

        input:
        val str from step_a_ch

        output:
        path 'log.txt' into step_b_ch

        script:
        """
        echo Found A! | tee log.txt
        """
}

process step_b_if_not_a {

        echo true

        input:
        val str from step_b_ch.ifEmpty('EMPTY')

        when:
        str == 'EMPTY'

        script:
        """
        echo Did not find A. Executing B!
        """
}

The script

Most commonly, the user script is written as a multi-line string. The default script language is Bash, but can be any interpreted language, such as Rscript, Python, Perl, and so on, as long as an interpreter directive is provided at the start of the script (this is different from the process directives described next). The only limitation to the commands in the script is that they must be available on the target execution system.

Example script using R:

process my_R_script {

    input:
    path 'data.csv' from analysis_ch

    script:
    """
    #! /usr/bin/env Rscript

    data <- read.csv("data.csv")
    """
}

One caveat of using Bash as the scripting language is that both Bash and Nextflow use the same syntax for variables, and so care must be taken if you want to evaluate a variable in Nextflow context (by the nextflow interpreter) or in Bash context (by the bash interpreter).

process foo {

    script:
    """
    # Escape the \$ sign to evaluate a variable in Bash context.
    echo "\$PATH is not the same as $PATH"
    """
}

One can also use the shell block definition with single quote multi-line strings to do the same thing.

process foo {

    shell:
    '''
    # Nextflow variables are now referenced with !{var}.
    echo "$PATH is not the same as !{PATH}"
    '''
}

The script block can be thought of as a function that returns the multi-line string as the result. As such, a script block allows the inclusion of additional code statements including control structures such as if statements to dictate which code snippet should be executed.

process my_task {

    input:
    tuple val(sample), path(reads) from my_read_ch

    script:
    if ( sample =~ /control/ ) {
        options = "Options for control"
        """
        echo "Do this with control $options"
        """
    } else {
        options = "Options for cases"
        """
        echo "Do that with cases -I -X $options"
        """
    }
}

Scripts do not need to be included as multi-line strings in a Nextflow workflow process. They can be written in a separate file that is used either as a stand-alone script or as a template script. Stand-alone scripts should be stored in the bin folder within the same directory as the workflow, and can be executed as though they were a normal script available from the PATH environment variable. A template script should be stored in the templates folder within the same directory as the workflow. Template scripts interpret variables starting with $ in Nextflow context, and must be called using the template function. Variables can be assigned either from the input declaration or from the script block before the template is called.

process task_A {

    script:
    """
    # Run my_script.pl from bin/
    my_script.pl
    """
}

process task_B {

    input:
    val var from var_ch

    script:
    template 'vars.sh'
}

where vars.sh looks like:

#! /usr/bin/env bash

echo "This is my $var"

Directives

Directives describe the execution behaviour of the process, from the compute resources to reserve, to where it should put the results. There are lots of directives described in the documentation, which allow you to customise the running of your workflow to your infrastructure. The directives should appear before the input, output, when, and script blocks, or alternatively can be provided via configuration files (recommended for certain directives).

Here is a table of some useful directives.

Directive Description
publishDir Defines where files from the output block should be written (see the example below). Multiple publishDir directives can be provided to a process, along with the pattern parameter, to write output to separate paths.
label Provides the process with a label that can be used for configuring a group of processes.
executor Defines the system on which the processes are executed. This could be local execution, submission to a job scheduler, or to other services.
cpus Assigns the number of cores to reserve. This can be accessed within the script block using ${task.cpus}.
time Assigns the maximum length of time the task should run for.
memory Assigns the amount of memory to reserve.
queue Defines the queue to use for certain grid-based executors.
clusterOptions Defines additional options necessary for your grid-based executor.
scratch Defines whether a process should run in a temporary folder local to the execution node (highly recommended when using a grid executor!). Only output files are copied back to the working directory, saving valuable disk space, but losing potentially helpful files from unsuccessful tasks.
module Defines which software modules to use.
conda Defines which conda software packages to use.
container Defines which container images of software to use. Use containerOptions to provide process-specific additional configuration.
errorStrategy Describes how Nextflow should behave when a process terminates with an error.
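
As an example of publishDir (a minimal sketch; the process name, channel names, and paths are hypothetical), only files matching the pattern are copied to the results folder, while other output stays in the task work directory:

process assemble {

    // Copy files matching the pattern to the results folder;
    // mode 'copy' writes a real copy instead of a symbolic link.
    publishDir 'results/assembly', mode: 'copy', pattern: '*.fasta'

    output:
    path 'assembly.fasta' into assembly_ch
    path 'assembly.log' into log_ch

    script:
    """
    echo ">contig1" > assembly.fasta
    echo "assembly finished" > assembly.log
    """
}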

Directives can also be dynamically defined using a closure.

process foo {

    memory { 2.GB * task.attempt }
    time { 1.hour * task.attempt }

    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
    maxRetries 3

    script:
    """
    <your job here>
    """

}

A special variable task is accessible to the process block, which provides the runtime values of the task directives and more.

process qualimap {

    input:
    tuple val(sample), path(alignment) from aln_ch

    script:
    bamfile = alignment.find { it =~ /\.bam$/ }
    mem = task.memory ? task.memory.toGiga() : task.cpus * 6
    """
    export JAVA_OPTS="-Xms32m -Xmx${mem}G -Djava.io.tmpdir=${task.scratch}"
    qualimap bamqc -bam $bamfile -outdir ${bamfile.baseName} -outformat 'HTML' -nt ${task.cpus}
    """
}

The variables available through the task configuration are partially documented in the second table in Tracing Documentation pages.

Exercises

  • Order the blocks
  • Stage a parameter file
  • Capture output from one process and stage in another.
    • Stage a folder
    • Stage some files
  • When does a process task execute?
    • Incomplete header

Key Points

  • A process has five parts: directives, input, output, when, and the script.

  • Each process task runs from its own directory.

  • A process only executes when there is a complete process input declaration.

  • Selected process output can be saved to a folder using publishDir.

  • Directives control process execution behaviour.


Channels

Overview

Teaching: 30 min
Exercises: 20 min
Questions
  • How do I read data into the workflow?

  • How do I pass data around the workflow?

Objectives
  • Learn how to read data into the workflow.

  • Learn how data is passed between processes.

  • Learn how to use channel operations to wrangle data into the required input form.

What is a Channel?

A Channel is a data structure designed to efficiently pass data from one process to another. The primary property of channels is they are asynchronous. As soon as a process completes, the results are put in the next channel for processing, without waiting for other processes. This allows a following process to start sooner.

Nextflow utilises two types of Channel: queue and value channels. Queue type channels consume data in a first in, first out manner to create process input declarations. The data in value type channels, however, can be reused when constructing process input declarations. Since a process must perform operations on input channels to make an input declaration and spawn a task, each process needs its own input channel. This is achieved using the into operator on a channel to create two or more channels that can be used by different processes.

Channel.of(1,2,3,4).into { sq_ch; db_ch }

process square_it {

    input:
    val x from sq_ch

    script:
    """
    echo $x*$x | bc -l
    """
}

process double_it {

    input:
    val x from db_ch

    script:
    """
    echo 2*$x | bc -l
    """
}

Reading data into a workflow

The primary method of reading data into a Nextflow workflow is to use channel factories: methods that produce channels.

There are several channel factories available, for example Channel.of, Channel.value, Channel.fromPath, and Channel.fromFilePairs.
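
A minimal sketch of some commonly used factories (the file paths are hypothetical):

// A fixed set of values
num_ch   = Channel.of(1, 2, 3)

// A single value that can be reused by many tasks
ref_ch   = Channel.value('/path/to/reference.fasta')

// All files matching a glob pattern
fasta_ch = Channel.fromPath('data/*.fasta')

// Paired files grouped by a common prefix, e.g. sample1_R1.fastq and sample1_R2.fastq
reads_ch = Channel.fromFilePairs('data/*_R{1,2}.fastq')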

Information on these and other channel factories can be found in the Nextflow Channel factory documentation

Channel operators

Channels pass data from one process to another using the into and from keywords, which put data into, and take data out of, channels. The into keyword creates a queue type channel, named in the output declaration, that is available from the global scope to another process.

process WriteHello {

    output:
    file "myfile.txt" into file_ch

    script:
    """
    <commands>
    """
}

process AddWorld {

    input:
    file 'myfile.txt' from file_ch

    script:
    """
    <commands>
    """
 }

Channel operators allow you to manipulate data within channels. Common examples include filter, map, collect, mix, and join.
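
A minimal sketch of chaining a few of these operators:

Channel.of(1, 2, 3, 4)
    .filter { it % 2 == 0 }   // keep even numbers only: 2, 4
    .map { it * it }          // square each value: 4, 16
    .view()                   // print each value as it passes through

Channel.of('a', 'b')
    .mix( Channel.of('c') )   // combine the emissions of two channels
    .collect()                // gather all values into a single list
    .view()                   // prints a list such as [a, b, c]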

Many more channel operators are described in the Nextflow Channel Operator Documentation.

Multiple input channels

It is important to understand how multiple input channels are processed. When two or more channels are declared as process inputs, the process waits until it receives an input value from all the channels declared as input.

Two or more queue type channels.

process foo {

    echo true

    input:
    val x from Channel.of(1,2)
    val y from Channel.of('a','b','c')

    script:
    """
    echo $x and $y
    """
}

In this case the process foo will only run two times since there are only two inputs in the first channel. Channel values are consumed, and so there is nothing left to pair with 'c', which is discarded.

In the example above the process executes on the pairings 1 and a, and 2 and b, but note that in more complex workflows the queues are asynchronous, so there is no guarantee of those exact pairings: the emission of 'c' may happen first, resulting in a 1 and c pairing. If certain files must be processed together, use one of the queue combining operators, such as join or groupBy, to generate the correct pairing before it is passed as input.
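
A minimal sketch of pairing values by a shared key with the join operator (the sample names and file names are hypothetical):

reads_ch = Channel.of( ['sample1', 'reads1.fastq'], ['sample2', 'reads2.fastq'] )
index_ch = Channel.of( ['sample2', 'index2.idx'],   ['sample1', 'index1.idx'] )

// join matches items on the first element of each tuple, so each sample is
// paired with its own index regardless of the order of emission.
reads_ch.join(index_ch).view()
// [sample1, reads1.fastq, index1.idx]
// [sample2, reads2.fastq, index2.idx]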

Value channels with queue channels.

process foo {

    echo true

    input:
    val x from Channel.value(1)
    val y from Channel.of('a','b','c')

    script:
    """
    echo $x and $y
    """
}

In this example, the process foo runs three times since the input data from the value type channel can be reused to make a complete input declaration. This produces the pairings 1 and a, 1 and b, and 1 and c.

Exercises

  • mix operator
  • join operator
  • each -> cross

Key Points

  • Channels pass data into and out of processes.

  • There are two types of channel, queue and value channels.

  • Each process must have its own input channel.

  • Channels can be manipulated using channel operators.


Workflow Configuration

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do I configure Nextflow for my infrastructure?

Objectives
  • Learn how to separate workflow logic from execution.

  • Learn how to create workflow profiles.

  • Learn how configuration nesting works.

Configuration

Configuration files are a collection of name and value pairs that tell Nextflow how to behave. They provide a way to separate workflow logic from the execution environment, allowing workflow portability and a cleaner workflow script.

When a Nextflow script is launched, it looks for configuration settings in three places: nextflow.config in the current directory, nextflow.config in the script directory (if not the same as the current directory), and finally the config file in $HOME/.nextflow. Additional configuration can be provided with the -c <config> parameter. When two or more of these configuration files exist, they are merged: settings given with -c <config> override those in $PWD/nextflow.config, which override those in <script_dir>/nextflow.config, which override those in $HOME/.nextflow/config. Lastly, configuration given on the command line overrides configuration provided in files.

Configuration scopes

Configuration can be organised into scopes using the curly bracket notation or the dot notation. Configuration can also be included from another file, which helps to reuse common configuration settings across different scopes. Relative paths are resolved against the actual path of the including config file, which helps with packaging configuration.

executor {
    name = 'local'
    // maximum cpus (applies to 'local' executor only)
    cpus = 4
    memory = 32.GB
}
// How many cpus a process should request.
process.cpus = 1

// Include configuration from foo
includeConfig 'path/foo.config'

Certain scopes are reserved to have special meaning. Only some are described here.

Scope params

The params scope allows one to define parameters accessible to the workflow script.

params {
    str = "Hello"
    fasta = '/path/to/fasta'
}

Variables defined in the params scope are accessible from anywhere in the script, but it is better practice to provide them via an input declaration after having done appropriate checks on the input.

process echo {

    echo true

    script:
    """
    echo ${params.str}
    """
}

Channel.fromPath(params.fasta, checkIfExists: true).set { fa_ch }

process index_fasta {

    input:
    path fasta from fa_ch

    script:
    """
    samtools faidx $fasta
    """
}

As described above, the configured value can be overridden with a command line parameter, or using another configuration file provided with -c.

nextflow run script.nf --str "Hello $USER"

Scope env

The env scope defines variables to be exported in the execution environment.

env {
    PATH = "/my/new/tool:$PATH"
    TOOL_LIB = "/my/new/tool/libs"
}

Scope process

The process scope defines any property described in the Process directives documentation.

process {
    cpus = 1
    time = '1h'
    scratch = true
}

The selectors withName: <process_name> and withLabel: <label> can be used to provide configurations to specific processes.

process {
    withName: 'index_fasta' {
        cpus = 1
    }
    withLabel: 'bigMem' {
        memory = 256.GB
    }
}

Selector expressions can be used to group process selections or negate them.
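
For instance (the process name and label below are hypothetical):

process {
    // Apply to any process whose name matches the pattern
    withName: 'blastn|blastp' {
        cpus = 4
    }
    // Apply to every process that does NOT carry the bigMem label
    withLabel: '!bigMem' {
        memory = 8.GB
    }
}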

Scope manifest

The manifest scope provides metadata for your workflow.

manifest {
    homePage = 'http://foo.com'
    description = 'Pipeline does this and that'
    mainScript = 'foo.nf'
    version = '1.0.0'
}

Configuration profiles

Configuration profiles are named sets of configuration settings that can be selected when the workflow is launched, enabling workflow portability. Profiles can include how software is managed, executor settings, or computer- or institute-specific settings, and multiple profiles can be used together to provide flexibility of use.

profiles {

	/*
	<profile_name> {
		<configuration scope1>
		<configuration scope2>
		...
	}
	*/

	// default profile
	standard {
	}

	hal_9000 {
		process {
			cpus = 1000
		}
	}

	laptop {
		process {
			cpus = 1
		}
	}
}

Profiles are used directly on the command line:

nextflow run -profile <profile1>[,<profile2>,...] <nextflow_script>

Executor configurations

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Software Package Managers

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How is software managed?

Objectives
  • Learn how to leverage package managers to handle software requirements

Software package managers

The basic premise of Nextflow is to chain lots of different tools together. By default, Nextflow expects all the commands or interpreters to be available via the shell PATH environment variable. However, many tools may have complex dependencies which can cause conflicts with other tools in the same environment. Nextflow supports several package management systems that isolate tools and their dependencies into separate environments, preventing most software conflicts.

Ideally, package management should be handled in the configuration file rather than in the Nextflow script. This allows users to tailor software execution to their computing environment.

Reproducibility

Software results can vary depending on the execution environment. In order to ensure reproducibility and cross-platform compatibility across many infrastructures, the use of container technology is recommended. In the absence of a suitable container technology, conda is recommended to manage software installation and dependencies. Although modules and self-installed tools are supported, they do not enable ease of use across platforms, creating a potentially high barrier to using the workflow.

Environment Modules

Environment Modules is a package manager that loads tools via the module load <package> command. Modules are supported in Nextflow via the module directive in the process scope.

process blast {

    module 'ncbi-blast/2.9.0'

    """
    blast -version
    """
}

The module directive can also be assigned in a config file:

process {

    // available to all processes
    module = 'cluster-utils/1.2.3'

    // Override the module directive above for a specific process
    withName: blast {
        module = 'ncbi-blast/2.9.0:gnu-parallel/3.5'
    }
}

Multiple packages can be loaded at the same time by separating the package names with a colon (:).

Note that environment modules are often centrally managed (e.g. by cluster administrators) which may limit tools available to the user.

Conda

Conda is another package, dependency and environment manager. Of particular interest is the Bioconda channel which specialises in bioinformatic software. Support for Conda is provided via the conda directive in the process scope.

A user will often create an environment for themselves to use tools. e.g.

# Create environment via commands
$ conda create -n blast_env blast=2.9.0

# Or create environment via yaml files
$ cat environment.yml
name: blast_env
channels:
  - conda-forge
  - bioconda
  - defaults

dependencies:
  - blast=2.9.0
$ conda env create -f environment.yml

Both methods are supported in Nextflow:

process blastp {

    conda 'blast=2.9.0'

    """
    blastp -version
    """
}

process blastn {

    conda '/path/to/environment.yml'

    """
    blastn -version
    """
}

This will create the environment in the conda.cacheDir directory (the default location is the conda folder in the working directory).
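
The cache location can be changed in a configuration file, which is useful for sharing environments between runs (the path below is only an example):

conda {
    cacheDir = '/path/to/shared/conda_envs'
}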

The use of existing environments is also supported by providing the full path to the environment.

process blastp {

    conda '/path/to/existing/conda/env'

    """
    blastp -version
    """
}

The conda directive can also be used in the config file.

process {

    // available to all processes
    conda = 'gnu-parallel=3.5'

    // Override the conda directive above for a specific process.
    withName: blastn {
        conda = 'blast=2.9.0 gnu-parallel=3.5'
    }
}

Docker

Docker is a container platform that provides a standardised packaging format known as container images. A container image is a unit of software that packages up code and all its dependencies so the application runs the same regardless of the underlying infrastructure. The Docker Engine needs to be installed to run Docker container images on your computer infrastructure.

Typically, images are run as containers in which your commands can be executed.

# Start a container based on the image `docker run <image>`
# Run a command `fastqc --version`
# Remove the container on completion `--rm`
$ docker run --rm quay.io/biocontainers/fastqc:0.11.9--0 fastqc --version
FastQC v0.11.9

Container images are built according to recipes prescribed in a Dockerfile. A base image is used as a starting point, which could be an operating system image, or a pre-defined image with other tools preinstalled (e.g. miniconda as seen below). Additional instructions then add commands to run, set environment variables, entry points, and other metadata.

An example Dockerfile:

# Select the miniconda image as the base
# https://hub.docker.com/r/continuumio/miniconda3/dockerfile
FROM continuumio/miniconda3:4.8.2

# Select the shell to use.
SHELL ["/bin/bash", "-c"]

# Add metadata to the container using labels.
LABEL description="Spade (Search for Patterned DNA Elements) container" \
      author="Mahesh Binzer-Panchal" \
      version="1.0.0"

# APT (Advanced Packaging Tool) is a package manager for certain Linux distributions
# It can be used to update and install software dependencies
RUN apt-get update --fix-missing && \
    apt-get install -y procps ghostscript

# Conda is another package manager that simplifies package installation
# This is a useful option for installing bioinformatic packages
RUN conda update -n base conda && \
    conda install -c conda-forge -c bioconda \
	python=3.6 mafft=7.455 blast=2.9.0 openssl=1.1.1e && \
    conda clean --all -f -y

# Some tools are not available via package managers.
# These must then be manually installed.
WORKDIR /opt
RUN git clone --depth 1 https://github.com/yachielab/SPADE && \
    cd SPADE && chmod u+x *.py && \
    pip install matplotlib==2.2.3 && \
    pip install seaborn==0.8.1 && \
    pip install weblogo==3.6.0 && \
    pip install biopython==1.76

# Environment variables can be set to provide settings for new tools.
ENV PATH="/opt/SPADE:${PATH}"

# A default command can be provided when a container is started (entry point).
CMD [ "SPADE.py" ]

When a container image is built, it is stored in a registry, either locally or online. Nextflow is able to retrieve these container images when provided with a path to the image and version (tag) ('docker-repository/image-name:tag') given by the container directive in the process scope. Images should preferably be stored in an online repository to enable access for others.

Docker Registries

Here are some useful Docker registries: Docker Hub (hub.docker.com) and Quay (quay.io), which hosts the Biocontainers images used in the examples here.

GitHub also supports hosting Docker images using GitHub Packages.

An additional docker scope is provided by Nextflow which allows you to supply extra parameters to Docker. In order to use a container image with Docker, it must be enabled.

docker {
    enabled = true
}
process {

    // available to all processes
    container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'

    // Override the container directive above for a specific process.
    withName: blastn {
        container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_4'
    }
}

Default user

When running a container using Docker, the default Docker user is used to run the process. This is the user defined when the image was built (default: root). For life science projects it is rare that tools need to be run with superuser privileges, and often one wants to run tools as the current user. It is helpful to add your user ID and group ID to the runOptions in the docker configuration scope like so:

docker {
    enabled = true
    // Uses `id` to get the current user id and group id of the user
    runOptions='-u "$( id -u ):$( id -g )"'
}

Docker images run using other container platforms use the settings prescribed by that platform. For example, Singularity will use the current user as the default user.

References

Singularity

Singularity is another container platform, but it is root-less and daemon-less, which means it runs as a regular user. This platform is often used on compute infrastructures where escalated privileges are not desirable. Singularity can create and run containers from both Singularity and Docker images.

A typical command-line usage of a singularity container would be like so:

# Start a container based on the image `singularity exec <image>`
# Run a command `fastqc --version`
# The image is prefixed with docker:// to denote it's a docker image
$ singularity exec docker://quay.io/biocontainers/fastqc:0.11.9--0 fastqc --version
FastQC v0.11.9

Although image definition files can be written for Singularity, writing images using Docker allows greater portability of the image. See singularity build --help for the image definition syntax.
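
For example, a Docker image can be pulled and converted into a local Singularity image file on the command line:

# Download the Docker image layers and build a Singularity image file (.sif)
$ singularity pull docker://quay.io/biocontainers/fastqc:0.11.9--0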

An additional singularity scope is provided by Nextflow which allows you to supply extra parameters to Singularity. In order to use a container image with Singularity, it must be enabled.

singularity {
    enabled = true
}
process {

    // available to all processes
    container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'

    // Override the container directive above for a specific process.
    withName: blastn {
        container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_4'
    }
}

Other container platforms

Nextflow also supports other container platforms such as Podman and Shifter, which can be used in the same transparent manner as Docker and Singularity.

Key Points

  • Commands are expected to be available from the shell PATH.

  • Environment modules can provide centrally managed software environments.

  • Conda can provide user-managed software environments.

  • Container platforms can provide self-contained software environments, and are recommended for reproducibility.


The new Nextflow syntax - DSL2

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is different in DSL2?

Objectives
  • Learn about the new workflow data structure.

  • Learn how the syntax changes between the DSLs

DSL2 is the second version of the Nextflow domain-specific language.
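
A minimal sketch of the new syntax (assuming a Nextflow version that supports DSL2): processes no longer use from and into; instead, channels are connected to processes inside a workflow block, and a process output can be reused by several processes without explicit forking.

nextflow.enable.dsl = 2

process SEQUENCE {

    input:
    val num

    output:
    stdout

    script:
    """
    echo $num
    """
}

workflow {
    number_ch = Channel.of(1, 2, 3, 4)
    SEQUENCE(number_ch)
    SEQUENCE.out.view()
}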

Key Points

  • The workflow block describes the workflow.

  • Workflows can be modularised.

  • Channel forking is now implicit.


Converting your project to Nextflow

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can I convert my project to Nextflow?

Objectives
  • Learn how to structure an analysis

  • Learn how to continue the analysis

  • Learn how to test your workflow

FIXME

Key Points

  • Use an organised folder structure.

  • Start by making process templates for code.

  • Refactor once the workflow logic is in place.

  • Make a test data set.


Supplementary: Version control

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How does Nextflow integrate with version control?

Objectives
  • Demonstrate how a workflow can be used from a version control repository.

  • Demonstrate how to use version control to develop a workflow.

FIXME

Version control

GitHub

Others

Key Points

  • Nextflow can run workflows from version control platforms.

  • Version control can aid workflow development.


Supplementary - Nextflow on Grid Computing Infrastructures

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

FIXME

Grid Computing Infrastructures

Job schedulers

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Supplementary - Nextflow on Cloud Computing Infrastructures

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

FIXME

Cloud Computing Infrastructures

Amazon

Google

Key Points

  • First key point. Brief Answer to questions. (FIXME)

