PHA4GE: Implementation of CZ ID Workflow at AKU for Analyzing SARS-CoV-2 Genomics Data

Deprecated

Please go to this link

IDSeq pipeline is implemented using the following link:

https://github.com/czbiohub/sc2-illumina-pipeline

Main steps are:

1. Installation

Nextflow (https://www.nextflow.io/)
Pipeline software dependencies
Pipeline code

2. Running the pipeline

Main arguments used in pipeline

3. Overview of pipeline

Description of processes and functions used in the pipeline

4. Output

Summary of results

The configuration of our workstation is as follows:

Memory

93.1 GiB

Processor

Intel Xeon Gold 6240 CPU @ 2.60GHz x 72

Disk Capacity

3.0 TB

OS Name

Ubuntu 21.04

OS Type

64-bit

1. Installation:

Install Nextflow (Checking prerequisites)

Checking java version:

java -version

openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.10

Setup Nextflow:

curl -s https://get.nextflow.io | bash

The command is executed with error:

CAPSULE EXCEPTION: Error resolving dependencies. while processing attribute Allow-Snapshots: false (for stack trace, run with -Dcapsule.log=verbose) Unable to initialize nextflow environment

Next option is we have to install nextflow through conda (https://docs.conda.io/en/latest/).

conda install -c bioconda nextflow

Output:

Nextflow is installed successfully.

Update nextflow:

conda update nextflow

Launch nextflow:

nextflow run hello

N E X T F L O W ~ version 19.01.0 Pulling nextflow-io/hello ... downloaded from https://github.com/nextflow-io/hello.git Launching `nextflow-io/hello` [desperate_goldberg] - revision: ec11eb0ec7 [master] [warm up] executor > local [d9/0caedb] Submitted process > sayHello (1) [ce/bb4e73] Submitted process > sayHello (4) [58/63468f] Submitted process > sayHello (2) [b2/915799] Submitted process > sayHello (3) Bonjour world! Hola world! Ciao world! Hellow world!

Pipeline software dependencies

Docker container czbiohub/sc2-msspe pipeline:

docker pull czbiohub/sc2-msspe

Using default tag: latest latest: Pulling from czbiohub/sc2-msspe 68ced04f60ab: Pull complete 9c388eb6d33c: Pull complete 96cf53b3a9dd: Pull complete 6b5e6e9d0010: Pull complete 16f5187a04c9: Pull complete c69c31f8d73a: Pull complete c69c31f8d73a: Pull complete 69e38dd6074a: Pull complete 84d4143732bf: Pull complete 020ac22afab7: Pull complete Digest: sha256:d0e177deef2359b271445f4000cf739193e261f0b059b2a6beceb91de04fdf62 Status: Downloaded newer image for czbiohub/sc2-msspe:latest docker.io/czbiohub/sc2-msspe:latest

Installation of dependencies into a conda environment sc2-msspe:

conda env create -f environment.yaml

Output:

Make file:

make

Output:

docker build . -t czbiohub/sc2-msspe:latest Sending build context to Docker daemon 41.27MB Step 1/8 : FROM nfcore/base:1.10.2 ---> 9e316f514716 Step 2/8 : LABEL authors="Jack Kamm and Samantha Hao" description="Docker image containing all software requirements for the nf-core/msspe pipeline" ---> Using cache ---> 525fb38872e9 Step 3/8 : COPY environment.yaml / ---> Using cache ---> 0dc62ac5b33c Step 4/8 : RUN conda env create -f /environment.yaml && conda clean -a ---> Using cache ---> f2117adfcc63 Step 5/8 : ENV PATH /opt/conda/envs/sc2-msspe/bin/:$PATH ---> Using cache ---> 590cabd71879 Step 6/8 : RUN conda env export --name nf-core-msspe-1.0dev > nf-core-msspe-1.0dev.yaml ---> Using cache ---> 809b6527b82a Step 7/8 : RUN apt update && apt install -y r-base ---> Using cache ---> 45559db3313c Step 8/8 : RUN Rscript -e "install.packages(c('BiocManager', 'aplot')); BiocManager::install('ggtree')" ---> Using cache ---> b4f2218da097 Successfully built b4f2218da097 Successfully tagged czbiohub/sc2-msspe:latest

2. Running the pipeline

Install kraken2 db

git clone https://github.com/DerrickWood/kraken2.git

Output:

loning into 'kraken2'... remote: Enumerating objects: 903, done. remote: Counting objects: 100% (203/203), done. remote: Compressing objects: 100% (152/152), done. remote: Total 903 (delta 115), reused 104 (delta 51), pack-reused 700 Receiving objects: 100% (903/903), 428.83 KiB | 1.67 MiB/s, done. Resolving deltas: 100% (607/607), done.

cd kraken2/
./install_kraken2.sh kraken2

Output:

Kraken 2 installation complete.

To make things easier for you, you may want to copy/symlink the following files into a directory in your PATH:

Link to download kraken DB:

https://github.com/czbiohub/test-datasets/raw/olgabot/mssspe--kraken-coronavirus/reference/kraken_coronavirus_db_only.tar.gzgithub.com

Download testing data from GitHub:

git clone https://github.com/czbiohub/sc2-illumina-pipeline.git

Output:

Cloning into 'sc2-illumina-pipeline'... remote: Enumerating objects: 1709, done. remote: Counting objects: 100% (235/235), done. remote: Compressing objects: 100% (172/172), done. remote: Total 1709 (delta 133), reused 132 (delta 60), pack-reused 1474 Receiving objects: 100% (1709/1709), 12.06 MiB | 485.00 KiB/s, done. Resolving deltas: 100% (1058/1058), done.

Testing:

Simple test to make sure things are not broken:

nextflow run czbiohub/sc2-msspe-bioinfo -profile docker,test
nextflow run czbiohub/sc2-illumina-pipeline -profile docker,test

nextflow run czbiohub/sc2-illumina-pipeline -profile docker, --reads '/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_15Jun2021/sc2-illumina-pipeline/test_data/*_R{1,2}.fq.gz' --primers 'data/nCoV-2019.bed' --kraken2_db 'kraken2/kraken_coronavirus_db_only/' --outdir results_1

Error encountered which was then removed by using the following command:

nextflow self-update

Output

IDSeq pipeline successfully installed.

Updating pipeline:

Command: nextflow pull czbiohub/sc2-illumina-pipeline

Main arguments:

-profile: Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.

--reads: Use this to specify the location of your input SARS-CoV-2 read files

--primers: Use this to specify the BED file with the enrichment/amplification primers used in the library prep. The repository comes with two sets of primers corresponding to the MSSPE (SARS-COV-2_spikePrimers.bed) and ARTIC V3 (nCoV-2019.bed) protocols.

--kraken2_db: Use this to specify the path to the folder containing the Kraken2 database. This is required to filter the reads to only SARS-CoV-2 reads.

--outdir: Specify the output directory. This is required and can be an S3 path given the system has the appropriate credentials.

3. Overview of pipeline:

Description of processes and functions used in the pipeline

prefilterHostReads

Filters host reads by mapping with minimap2 to the file given with --ref_host and keeping unmapped reads. This process is optional; enable it with the flag --prefilter_host_reads. The filtered reads will be output into the folder host-subtracted-reads. These reads are sent to filterReads.

This step may be computationally expensive. If you only need the SARS-CoV-2 reads, and don't care about reads from other pathogens, you can use the fastqs output from filterReads instead.

By default, this step is skipped for -profile artic, but enabled for -profile msspe.

quantifyERCCs

Quantify ERCC reads by mapping with minimap2 to the file given with --ercc_fasta. The individual output files are saved to the folder ercc-stats and the information is summarized in combined.stats.tsv.

filterReads

Filter for only SARS-CoV-2 reads. This is accomplished by first mapping with minimap2 to the file given with --ref and only keeping mapped reads. These reads are then run through kraken2 against the database provided by --kraken2_db. Only reads that are assigned uniquely to SARS-CoV-2 are kept. These reads are sent to trimReads.

When the option --save_sars2_filtered_reads is enabled, the fastqs of filtered reads are saved to filtered-sars2-reads. By default, this option is enabled when -profile artic, but disabled when -profile msspe.

trimReads

Trim adapter sequences and perform quality-trimming using Trim Galore. Skip this step with the flag --skip_trim_adapters. The trimmed reads will be published to the folder trimmed-reads. These will be sent to alignReads and callVariants.

alignReads

Align the reads to the reference genome provided by --ref with minimap2 and sort by position with samtools sort. The BAM files are sent to trimPrimers

trimPrimers

Trim the primers provided in --primers from the BAM using ivar trim. The primers will be soft-clipped from the ends of the reads in the BAM file. This step is necessary to properly call consensus alleles in these regions as the primer sequences can be artificially amplified. The trimmed BAMs will be published to the folder aligned-reads.

makeConsensus

Call a consensus genome using ivar consensus. See here for parameters. These genomes will be published to the folder consensus-seqs.

quast

Run QUAST on the consensus genomes. The results will be published to the folder QUAST.

callVariants

Call variant sites (SNPs) use samtools mpileup and bcftools call. This will also produce files from bcftools stats. The VCF files and stats files are published to sample-variants.

computeStats

Compute alignment and assembly statistics using samtools stats and the python script alignment_assembly_stats.py. Coverage plots are created during this process and published to coverage-plots.

combinedVariants

Creates a joint VCF using samtools and bcftools. combined.vcf will be published to the output directory. This is skipped by default.

mergeAllAssemblies

Combine all the consensus genomes into one file combined.fa, which will be published to the output directory.

mergeAssemblyStats

Parse previous files to create a summary statistics file combined.stats.tsv. This will be published to call_consensus-stats/combined.stats.tsv.

filterAssemblies

Filter the assemblies based on the criteria given with --maxNs and --minLength. This outputs filtered.fa, published to the output directory, and filtered.stats.tsv, published to call_consensus-stats.

multiqc

Run MultiQC to collect the reports from Trim Galore (FastQC), QUAST, samtools and bcftools. The output will be published to the folder MultiQC.

4. Output:

Summary of results

The output of test data is stored in the directory named “results” containing subdirectories shown in the screenshot.

Pipeline also generated the coverage plot of samples.

Chart

Description generated with very high confidence

Local Test Data

Command:

nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker \

--reads 'testSamples/*_R{1,2}_001.fastq.gz*' \

--kraken2_db 'kraken2/kraken_coronavirus_db_only' \

--outdir 'result_2'

Output:

qcTreeMap didn't generate any plot on local test data because no sequence passed the filtering criteria.

All other processes ran successfully.

Consensus genome is stored in a file named “combined.fa”

Sample variant is stored as vcf file named “combined.vcf”

Quality plots are stored in the folder “MultiQC”

Combined.fa:

A multi-FASTA file containing all consensus sequences.

Filtered.fa:

A multi-FASTA file containing only sequences that pass filtering.

Combined.vcf:

A VCF of variants in all samples.

IDSeq miniwdl pipeline

On workstation:

with the specs

NextImplementation of CZ ID mini-WDL-based SARS-CoV-2 Consensus Genome Workflow Pipeline at AKU

Last updated 3 years ago

Was this helpful?