Software Package Managers

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How is software managed for a Nextflow workflow?

Objectives
  • Learn how to leverage package managers to handle software requirements

Software package managers

The basic premise of Nextflow is to chain many different tools together. By default, Nextflow expects all commands and interpreters to be available via the shell PATH environment variable. However, many tools have complex dependencies which can conflict with other tools in the same environment. Nextflow supports several package management systems that isolate tools and their dependencies into separate environments, preventing the majority of software conflicts.

Ideally, package management should be handled in the configuration file rather than in the Nextflow script. This allows users to tailor software execution to their computing environment.
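For example, configuration profiles can be used to select a package manager at run time (a minimal sketch; the profile names and package versions here are illustrative):

// nextflow.config
profiles {
    // Selected with: nextflow run main.nf -profile conda
    conda {
        process.conda = 'blast=2.9.0'
    }
    // Selected with: nextflow run main.nf -profile docker
    docker {
        docker.enabled = true
        process.container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'
    }
}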

Reproducibility

Software results can vary depending on the execution environment. To ensure reproducibility and cross-platform compatibility across many infrastructures, the use of container technology is recommended. In the absence of a suitable container technology, Conda is recommended to manage software installation and dependencies. Although environment modules and self-installed tools are supported, they are not easily portable across platforms, creating a potentially high barrier to using the workflow.

Environment Modules

Environment Modules is a package manager that loads tools into the shell environment via the module load <package> command.
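On the command line, it is typically used like this (a sketch; the module names available depend on your cluster):

# List the available modules
$ module avail

# Load a module into the current shell environment
$ module load ncbi-blast/2.9.0

# Tools provided by the module are now on the PATH
$ blastp -version

# Unload all loaded modules
$ module purge

Modules are supported in Nextflow via the module directive in the process scope: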

process blast {

    module 'ncbi-blast/2.9.0'

    """
    blastp -version
    """
}

The module directive can also be assigned in a config file:

process {

    // available to all processes
    module = 'cluster-utils/1.2.3'

    // Override the module directive above for a specific process
    withName: blast {
        module = 'ncbi-blast/2.9.0:gnu-parallel/3.5'
    }
}

Multiple packages can be loaded at the same time by separating the package names with a colon (:).

Note that environment modules are often centrally managed (e.g., by cluster administrators), which may limit the tools available to the user.

Conda

Conda is another package, dependency, and environment manager. Of particular interest is the Bioconda channel, which specialises in bioinformatics software. Support for Conda is provided via the conda directive in the process scope.

A user will often create an environment containing the tools they need, e.g.:

# Create an environment from the command line
# (blast is distributed via the bioconda channel)
$ conda create -n blast_env -c conda-forge -c bioconda blast=2.9.0

# Or create an environment from a YAML file
$ cat environment.yml
name: blast_env
channels:
  - conda-forge
  - bioconda
  - defaults

dependencies:
  - blast=2.9.0
$ conda env create -f environment.yml
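
Once created, an environment is activated to make its tools available on the command line, e.g.:

# Activate the environment
$ conda activate blast_env

# Tools installed in the environment are now on the PATH
$ blastp -version

# Deactivate the environment when finished
$ conda deactivate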

Both methods are supported in Nextflow:

process blastp {

    conda 'blast=2.9.0'

    """
    blastp -version
    """
}

process blastn {

    conda '/path/to/environment.yml'

    """
    blastn -version
    """
}

This will create the environment in the conda.cacheDir directory (by default, a folder named conda in the pipeline work directory).
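The cache location can be changed using the conda scope in the configuration file, for example:

conda {
    // Store environments in a shared folder so they can be
    // reused across pipeline runs (the path is illustrative)
    cacheDir = '/path/to/conda/cache'
}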

The use of existing environments is also supported by providing the full path to the environment.

process blastp {

    conda '/path/to/existing/conda/env'

    """
    blastp -version
    """
}

The conda directive can also be used in the config file.

process {

    // available to all processes
    conda = 'gnu-parallel=3.5'

    // Override the conda directive above for a specific process.
    withName: blastn {
        conda = 'blast=2.9.0 gnu-parallel=3.5'
    }
}
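
Note that recent versions of Nextflow also require Conda support to be enabled explicitly in the configuration (check the documentation for your Nextflow version):

conda {
    enabled = true
}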

Docker

Docker is a container platform that provides a standardised packaging format known as container images. A container image is a unit of software that packages up code and all its dependencies so the application runs the same regardless of the underlying infrastructure. The Docker Engine needs to be installed to run Docker container images on your computer infrastructure.

Typically, images are run as containers in which your commands can be executed.

# Start a container based on the image `docker run <image>`
# Run a command `fastqc --version`
# Remove the container on completion `--rm`
$ docker run --rm quay.io/biocontainers/fastqc:0.11.9--0 fastqc --version
FastQC v0.11.9

Container images are built according to recipes prescribed in a Dockerfile. A base image is used as a starting point, which could be an operating system image, or a pre-built image with other tools preinstalled (e.g., miniconda, as seen below). Additional instructions then add commands to run, set environment variables, entry points, and other metadata.

An example Dockerfile:

# Select the miniconda image as the base
# https://hub.docker.com/r/continuumio/miniconda3/dockerfile
FROM continuumio/miniconda3:4.8.2

# Select the shell to use.
SHELL ["/bin/bash", "-c"]

# Add metadata to the container using labels.
LABEL description="Spade (Search for Patterned DNA Elements) container" \
      author="Mahesh Binzer-Panchal" \
      version="1.0.0"

# APT (Advanced Packaging Tool) is the package manager for Debian-based Linux distributions.
# It can be used to update and install software dependencies.
RUN apt-get update --fix-missing && \
    apt-get install -y procps ghostscript

# Conda is another package manager that simplifies package installation
# It is a useful option for installing bioinformatics packages
RUN conda update -n base conda && \
    conda install -c conda-forge -c bioconda \
        python=3.6 mafft=7.455 blast=2.9.0 openssl=1.1.1e && \
    conda clean --all -f -y

# Some tools are not available via package managers.
# These must then be manually installed.
WORKDIR /opt
RUN git clone --depth 1 https://github.com/yachielab/SPADE && \
    cd SPADE && chmod u+x *.py && \
    pip install matplotlib==2.2.3 && \
    pip install seaborn==0.8.1 && \
    pip install weblogo==3.6.0 && \
    pip install biopython==1.76

# Environment variables can be set to provide settings for new tools.
ENV PATH="/opt/SPADE:${PATH}"

# A default command can be provided, which is run when the container starts.
CMD [ "SPADE.py" ]

When a container image is built, it is stored in a registry, either locally or online. Nextflow retrieves container images when the container directive in the process scope provides the image path and version tag ('docker-repository/image-name:tag'). Images should preferably be stored in an online registry so that others can access them.
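For example, the image defined by the Dockerfile above could be built and published like this (a sketch; myorg/spade is a hypothetical repository name):

# Build the image from the Dockerfile in the current directory
$ docker build -t myorg/spade:1.0.0 .

# Push the image to an online registry (here, Docker Hub)
$ docker push myorg/spade:1.0.0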

Docker Registries

Here are some useful Docker registries:

  • Docker Hub: https://hub.docker.com
  • Quay.io: https://quay.io (home of the Biocontainers images used in this lesson)

GitHub also supports hosting Docker images using GitHub Packages.

Nextflow also provides a docker scope, which allows you to supply extra parameters to Docker. Docker must be enabled in the configuration before container images can be used with it:

docker {
    enabled = true
}
process {

    // available to all processes
    container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'

    // Override the container directive above for a specific process.
    withName: blastn {
        container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_4'
    }
}

Default user

When running a container using Docker, the process is run as the default Docker user, i.e. the user with which the image was built (root by default). For life science projects it is rare that tools need to be run with superuser privileges, and often one wants to run tools as the current user. It is helpful to add your user and group IDs to the runOptions in the docker configuration scope like so:

docker {
    enabled = true
    // Uses `id` to get the current user id and group id of the user
    runOptions='-u "$( id -u ):$( id -g )"'
}

Docker images run with other container platforms use the settings prescribed by that platform. For example, Singularity runs containers as the current user by default.

Singularity

Singularity is another container platform, but it is rootless and daemonless, which means it runs as a regular user. This platform is often used on compute infrastructures where escalated privileges are undesirable. Singularity can create and run containers from both Singularity and Docker images.

A typical command-line usage of a Singularity container looks like this:

# Start a container based on the image `singularity exec <image>`
# Run a command `fastqc --version`
# The image is prefixed with docker:// to denote it is a Docker image
$ singularity exec docker://quay.io/biocontainers/fastqc:0.11.9--0 fastqc --version
FastQC v0.11.9

Although image definition files can also be written for Singularity, building images with Docker allows greater portability. See singularity build --help for the Singularity image definition syntax.
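
A Docker image can also be downloaded and converted to Singularity's own image format ahead of time (the name of the resulting file may vary with your Singularity version):

# Pull a Docker image and convert it to a Singularity image file (SIF)
$ singularity pull docker://quay.io/biocontainers/fastqc:0.11.9--0

# Run a command using the converted image
$ singularity exec fastqc_0.11.9--0.sif fastqc --version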

Nextflow also provides a singularity scope, which allows you to supply extra parameters to Singularity. Singularity must be enabled in the configuration before container images can be used with it:

singularity {
    enabled = true
}
process {

    // available to all processes
    container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'

    // Override the container directive above for a specific process.
    withName: blastn {
        container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_4'
    }
}

Other container platforms

Nextflow also supports other container platforms such as Podman and Shifter, which can be used in the same transparent manner as Docker and Singularity.
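
Each of these platforms has its own configuration scope that follows the same pattern, for example (a sketch for Podman):

podman {
    enabled = true
}
process {
    container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_1'
}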

Key Points

  • Commands are expected to be available from the shell PATH.

  • Environment modules can provide centrally managed software environments.

  • Conda can provide user-managed software environments.

  • Container platforms can provide self-contained software environments, and are recommended for reproducibility.