PHA4GE: Implementation of CZ ID Workflow at AKU for Analyzing SARS-CoV-2 Genomics Data
Deprecated
Please go to this link
IDSeq pipeline is implemented using the following link:
https://github.com/czbiohub/sc2-illumina-pipeline
Main steps are:
1. Installation
Nextflow (https://www.nextflow.io/)
Pipeline software dependencies
Pipeline code
2. Running the pipeline
Main arguments used in pipeline
3. Overview of pipeline
Description of processes and functions used in the pipeline
4. Output
Summary of results
The configuration of our workstation is as follows:
Memory
93.1 GiB
Processor
Intel Xeon Gold 6240 CPU @ 2.60GHz x 72
Disk Capacity
3.0 TB
OS Name
Ubuntu 21.04
OS Type
64-bit
1. Installation:
Install Nextflow (Checking prerequisites)
Checking java version:
openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.10
Setup Nextflow:
The command is executed with error:
CAPSULE EXCEPTION: Error resolving dependencies. while processing attribute Allow-Snapshots: false (for stack trace, run with -Dcapsule.log=verbose) Unable to initialize nextflow environment
Next option is we have to install nextflow through conda (https://docs.conda.io/en/latest/).
Output:
Nextflow is installed successfully.
Update nextflow:
Launch nextflow:
N E X T F L O W ~ version 19.01.0
Pulling nextflow-io/hello ... downloaded from https://github.com/nextflow-io/hello.git Launching `nextflow-io/hello` [desperate_goldberg] - revision: ec11eb0ec7 [master] [warm up] executor > local [d9/0caedb] Submitted process > sayHello (1) [ce/bb4e73] Submitted process > sayHello (4) [58/63468f] Submitted process > sayHello (2) [b2/915799] Submitted process > sayHello (3) Bonjour world! Hola world! Ciao world! Hellow world!
Pipeline software dependencies
Docker container czbiohub/sc2-msspe pipeline:
Using default tag: latest latest: Pulling from czbiohub/sc2-msspe 68ced04f60ab: Pull complete 9c388eb6d33c: Pull complete 96cf53b3a9dd: Pull complete 6b5e6e9d0010: Pull complete 16f5187a04c9: Pull complete c69c31f8d73a: Pull complete c69c31f8d73a: Pull complete 69e38dd6074a: Pull complete 84d4143732bf: Pull complete 020ac22afab7: Pull complete Digest: sha256:d0e177deef2359b271445f4000cf739193e261f0b059b2a6beceb91de04fdf62 Status: Downloaded newer image for czbiohub/sc2-msspe:latest docker.io/czbiohub/sc2-msspe:latest
Installation of dependencies into a conda environment sc2-msspe:
Output:
Collecting package metadata (repodata.json): done Solving environment: done perl-math-round-0.07 | 9 KB chardet-4.0.0 | 204 KB perl-xml-sax-expat-0 | 10 KB zlib-1.2.11 | 106 KB glimmerhmm-3.0.4 | 40.6 MB ca-certificates-2021 | 136 KB future-0.18.2 | 714 KB xopen-1.1.0 | 20 KB perl-common-sense-3. | 11 KB perl-http-message-6. | 51 KB ld_impl_linux-64-2.3 | 618 KB perl-text-format-0.5 | 16 KB perl-app-cpanminus-1 | 234 KB liblapack-3.9.0 | 11 KB perl-ntlm-1.09 | 15 KB perl-uri-1.76 | 55 KB curl-7.77.0 | 149 KB entrez-direct-13.9 | 5.2 MB decorator-5.0.9 | 11 KB perl-digest-md5-2.55 | 18 KB perl-params-validate | 25 KB perl-compress-raw-bz | 44 KB perl-module-runtime- | 15 KB perl-io-socket-ssl-2 | 151 KB perl-http-cookies-6. | 19 KB perl-regexp-common-2 | 93 KB pyyaml-5.4.1 | 189 KB matplotlib-base-3.1. | 6.7 MB circos-0.69.8 | 26.2 MB fastqc-0.11.9 perl-net-http-6.19 | 19 KB multiqc-1.8 | 789 KB pandas-1.0.3 | 11.1 MB libdeflate-1.5 | 59 KB Preparing transaction: done Verifying transaction: done Executing transaction: / The default QUAST package does not include: * GRIDSS (needed for structural variants detection) * SILVA 16S rRNA database (needed for reference genome detection in metagenomic datasets) * BUSCO tools and databases (needed for searching BUSCO genes) -- works in Linux only! To be able to use those, please run quast-download-gridss quast-download-silva quast-download-busco done
Make file:
Output:
docker build . -t czbiohub/sc2-msspe:latest Sending build context to Docker daemon 41.27MB Step 1/8 : FROM nfcore/base:1.10.2 ---> 9e316f514716 Step 2/8 : LABEL authors="Jack Kamm and Samantha Hao" description="Docker image containing all software requirements for the nf-core/msspe pipeline" ---> Using cache ---> 525fb38872e9 Step 3/8 : COPY environment.yaml / ---> Using cache ---> 0dc62ac5b33c Step 4/8 : RUN conda env create -f /environment.yaml && conda clean -a ---> Using cache ---> f2117adfcc63 Step 5/8 : ENV PATH /opt/conda/envs/sc2-msspe/bin/:$PATH ---> Using cache ---> 590cabd71879 Step 6/8 : RUN conda env export --name nf-core-msspe-1.0dev > nf-core-msspe-1.0dev.yaml ---> Using cache ---> 809b6527b82a Step 7/8 : RUN apt update && apt install -y r-base ---> Using cache ---> 45559db3313c Step 8/8 : RUN Rscript -e "install.packages(c('BiocManager', 'aplot')); BiocManager::install('ggtree')" ---> Using cache ---> b4f2218da097 Successfully built b4f2218da097 Successfully tagged czbiohub/sc2-msspe:latest
2. Running the pipeline
Install kraken2 db
Output:
loning into 'kraken2'... remote: Enumerating objects: 903, done. remote: Counting objects: 100% (203/203), done. remote: Compressing objects: 100% (152/152), done. remote: Total 903 (delta 115), reused 104 (delta 51), pack-reused 700 Receiving objects: 100% (903/903), 428.83 KiB | 1.67 MiB/s, done. Resolving deltas: 100% (607/607), done.
Output:
Kraken 2 installation complete.
To make things easier for you, you may want to copy/symlink the following files into a directory in your PATH:
Link to download kraken DB:
Download testing data from GitHub:
Output:
Cloning into 'sc2-illumina-pipeline'... remote: Enumerating objects: 1709, done. remote: Counting objects: 100% (235/235), done. remote: Compressing objects: 100% (172/172), done. remote: Total 1709 (delta 133), reused 132 (delta 60), pack-reused 1474 Receiving objects: 100% (1709/1709), 12.06 MiB | 485.00 KiB/s, done. Resolving deltas: 100% (1058/1058), done.
Testing:
Simple test to make sure things are not broken:
Error encountered which was then removed by using the following command:
Output
IDSeq pipeline successfully installed.
Updating pipeline:
Command: nextflow pull czbiohub/sc2-illumina-pipeline
Main arguments:
-profile: Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
--reads: Use this to specify the location of your input SARS-CoV-2 read files
--primers: Use this to specify the BED file with the enrichment/amplification primers used in the library prep. The repository comes with two sets of primers corresponding to the MSSPE (SARS-COV-2_spikePrimers.bed) and ARTIC V3 (nCoV-2019.bed) protocols.
--kraken2_db: Use this to specify the path to the folder containing the Kraken2 database. This is required to filter the reads to only SARS-CoV-2 reads.
--outdir: Specify the output directory. This is required and can be an S3 path given the system has the appropriate credentials.
3. Overview of pipeline:
Description of processes and functions used in the pipeline
prefilterHostReads
Filters host reads by mapping with minimap2 to the file given with --ref_host and keeping unmapped reads. This process is optional; enable it with the flag --prefilter_host_reads. The filtered reads will be output into the folder host-subtracted-reads. These reads are sent to filterReads.
This step may be computationally expensive. If you only need the SARS-CoV-2 reads, and don't care about reads from other pathogens, you can use the fastqs output from filterReads instead.
By default, this step is skipped for -profile artic, but enabled for -profile msspe.
quantifyERCCs
Quantify ERCC reads by mapping with minimap2 to the file given with --ercc_fasta. The individual output files are saved to the folder ercc-stats and the information is summarized in combined.stats.tsv.
filterReads
Filter for only SARS-CoV-2 reads. This is accomplished by first mapping with minimap2 to the file given with --ref and only keeping mapped reads. These reads are then run through kraken2 against the database provided by --kraken2_db. Only reads that are assigned uniquely to SARS-CoV-2 are kept. These reads are sent to trimReads.
When the option --save_sars2_filtered_reads is enabled, the fastqs of filtered reads are saved to filtered-sars2-reads. By default, this option is enabled when -profile artic, but disabled when -profile msspe.
trimReads
Trim adapter sequences and perform quality-trimming using Trim Galore. Skip this step with the flag --skip_trim_adapters. The trimmed reads will be published to the folder trimmed-reads. These will be sent to alignReads and callVariants.
alignReads
Align the reads to the reference genome provided by --ref with minimap2 and sort by position with samtools sort. The BAM files are sent to trimPrimers
trimPrimers
Trim the primers provided in --primers from the BAM using ivar trim. The primers will be soft-clipped from the ends of the reads in the BAM file. This step is necessary to properly call consensus alleles in these regions as the primer sequences can be artificially amplified. The trimmed BAMs will be published to the folder aligned-reads.
makeConsensus
Call a consensus genome using ivar consensus. See here for parameters. These genomes will be published to the folder consensus-seqs.
quast
Run QUAST on the consensus genomes. The results will be published to the folder QUAST.
callVariants
Call variant sites (SNPs) use samtools mpileup and bcftools call. This will also produce files from bcftools stats. The VCF files and stats files are published to sample-variants.
computeStats
Compute alignment and assembly statistics using samtools stats and the python script alignment_assembly_stats.py. Coverage plots are created during this process and published to coverage-plots.
combinedVariants
Creates a joint VCF using samtools and bcftools. combined.vcf will be published to the output directory. This is skipped by default.
mergeAllAssemblies
Combine all the consensus genomes into one file combined.fa, which will be published to the output directory.
mergeAssemblyStats
Parse previous files to create a summary statistics file combined.stats.tsv. This will be published to call_consensus-stats/combined.stats.tsv.
filterAssemblies
Filter the assemblies based on the criteria given with --maxNs and --minLength. This outputs filtered.fa, published to the output directory, and filtered.stats.tsv, published to call_consensus-stats.
multiqc
Run MultiQC to collect the reports from Trim Galore (FastQC), QUAST, samtools and bcftools. The output will be published to the folder MultiQC.
4. Output:
Summary of results
The output of test data is stored in the directory named “results” containing subdirectories shown in the screenshot.
Pipeline also generated the coverage plot of samples.
Local Test Data
Command:
nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker \
--reads 'testSamples/*_R{1,2}_001.fastq.gz*' \
--kraken2_db 'kraken2/kraken_coronavirus_db_only' \
--outdir 'result_2'
Output:
qcTreeMap didn't generate any plot on local test data because no sequence passed the filtering criteria.
All other processes ran successfully.
Consensus genome is stored in a file named “combined.fa”
Sample variant is stored as vcf file named “combined.vcf”
Quality plots are stored in the folder “MultiQC”
Combined.fa:
A multi-FASTA file containing all consensus sequences.
Filtered.fa:
A multi-FASTA file containing only sequences that pass filtering.
Combined.vcf:
A VCF of variants in all samples.
IDSeq miniwdl pipeline
On workstation:
with the specs
Last updated