Implementation of CZ ID mini-WDL-based SARS-CoV-2 Consensus Genome Workflow Pipeline at AKU
To overcome bioinformatics bottlenecks in SARS-CoV-2 genomics analysis, we have implemented the CZ ID SARS-CoV-2 consensus genome pipeline on the existing IT infrastructure of The Aga Khan University.
Khan W, Kanwar S
AKU CITRIC Center for Bioinformatics and Computational Biology, Department of Pediatrics and Child Health, Faculty of Health Sciences, Medical College, The Aga Khan University, Karachi-74800, Pakistan.
The CZ ID pipeline is implemented from:
https://github.com/chanzuckerberg/czid-workflows
We hope you find it useful!
Running WDL workflows locally
The analysis steps are wrapped in Workflow Description Language (WDL). The pipeline receives FASTQ or FASTA inputs and processes them through numerous tools and reference databases. To enable external reproduction:
Reference databases should be downloaded from Amazon S3: s3://czid-public-references/consensus-genome/
Tools should be packaged as a Docker image
Install and configure miniwdl and other dependencies
Prepare to run the workflow by setting up miniwdl, a local WDL runner.
Install miniwdl using the following command (assuming pip for Python 3 is available; installing via conda works as well):
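```bash
pip3 install miniwdl
```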
Then, open a command prompt and try:
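```bash
miniwdl run_self_test
```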
This tests the installation with a trivial built-in workflow; numerous log messages should be printed, concluding with miniwdl run_self_test OK in about 30 seconds.
miniwdl's download cache feature should also be activated, so that reference databases need to be downloaded from S3 on first use only. (This should be a one-time download, but it did not work for us at first; you can find the resolved query at this link.) To activate the cache, set and export environment variables as sketched below.
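A sketch of the cache settings; the cache directory path here is just an example, so adjust it to your storage layout:

```bash
export MINIWDL__DOWNLOAD_CACHE__PUT=true
export MINIWDL__DOWNLOAD_CACHE__GET=true
export MINIWDL__DOWNLOAD_CACHE__DIR=/mnt/miniwdl_download_cache   # example location
```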
Fetch workflows
Clone the chanzuckerberg/czid-workflows git repository containing the WDL and Docker source files.
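For example:

```bash
git clone https://github.com/chanzuckerberg/czid-workflows.git
cd czid-workflows
```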
Configure scratch space
When running miniwdl, use the --dir DIR option to use the given DIR as scratch space. For example, if your instance storage is mounted on /mnt (the default on Ubuntu), use miniwdl --dir /mnt to run there. If you plan to run the czid-workflows test suite, also point TMPDIR to that directory with export TMPDIR=/mnt.
Pull or build Docker image
Building the image from the Dockerfile takes several minutes, but doesn't require logging in anywhere. To build it (assuming the Dockerfile sits in the consensus-genome directory of the cloned repository):
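```bash
# the tag matches the docker_image_id used in the run command below
docker build -t czid-consensus-genome consensus-genome
```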
Running the pipeline
Now for the exciting part: run the workflow using a command along these lines (the test FASTQ and local_test.yml paths are assumptions based on the repository layout; substitute your own reads and reference):
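```bash
miniwdl run --dir /mnt consensus-genome/run.wdl \
    docker_image_id=czid-consensus-genome \
    fastqs_0=consensus-genome/test/sample_R1.fastq.gz \
    fastqs_1=consensus-genome/test/sample_R2.fastq.gz \
    sample=test_sample \
    technology=Illumina \
    ref_fasta=/path/to/reference.fa \
    -i consensus-genome/test/local_test.yml
```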
Breaking this down:
docker_image_id= should be set to the Docker image, in this case czid-consensus-genome.
The pair of FASTQ files is given to the fastqs_0= and fastqs_1= parameters.
sample is the name used when referencing the sample in the output files.
technology specifies the sequencing technology used, either Illumina or ONT.
ref_fasta is the path to the reference file; if you want to use your own reference, give the path to its FASTA file in this parameter.
local_test.yml supplies boilerplate workflow inputs, such as the S3 paths for the viral reference databases.
The first attempt will take some time to download the reference databases (about 6 GiB total). Thereafter, if miniwdl download caching is enabled as suggested above, running this or other small samples should take just a few minutes, since the reference databases were already downloaded and only the analysis itself runs.
Output of workflow
When the run completes, miniwdl prints a large JSON structure with all the outputs and output file paths in the created run directory.
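For example, each miniwdl run directory contains an outputs.json file and an out/ folder of symlinks to the output files (the run directory name below is illustrative):

```bash
cat 20220523_123456_consensus_genome/outputs.json
ls 20220523_123456_consensus_genome/out/
```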
Summary
In the output directory specified by --outdir, the following folders and files will be created. Let's break down the structure of the output files:
consensus.fa
A multi-FASTA file containing the consensus sequence.
variants.vcf.gz
A VCF file of variants with respect to the reference.
MultiQC/
All MultiQC output files.
QUAST/
All QUAST output files.
aligned-reads/
Primer-trimmed BAM files after alignment of reads to the reference provided by ref_fasta. These files are the output of ivar trim.
call_consensus-stats/
Summary QC statistics for a sample run through the pipeline.
coverage-plot/
Line plot of coverage for a sample.
ercc-stats/
The output of samtools stats for the alignment to the file --ercc_fasta.
RemoveHost/
This folder is created when the flag --prefilter_host_reads is set. By default, it is disabled with -profile artic but enabled with -profile msspe. It contains untrimmed reads that do not map to the host genome.
filtered-reads/
This folder is created when the flag --save_sars2_filtered_reads is set. By default, it is enabled with -profile artic but disabled with -profile msspe. It contains untrimmed reads that map to the reference genome, and filtered reads that map to host or other viruses.
trimmed-reads/
Sample reads that have undergone all filtering steps and adapter trimming.
Overview of pipeline
Processes
The main script that runs the pipeline is run.wdl. Below, we explain the major functions that are called to process the FASTQ files. Although this information is available in the czbiohub SC2 pipeline documentation, we have put it all together in one place for the reader's convenience.
RemoveHostReads
Filters host reads by mapping with minimap2 to the file given with --ref_host and keeping unmapped reads. The filtered reads are output into the folder call-RemoveHost. These reads are sent to filterReads.
This step may be computationally expensive. If you only need the SARS-CoV-2 reads and don't care about reads from other pathogens, you can use the FASTQ output from filterReads instead.
By default, this step is skipped for -profile artic, but enabled for -profile msspe.
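A conceptual stand-alone equivalent of this step, not the pipeline's exact invocation (file names are hypothetical):

```bash
# map to the host genome, keep pairs where neither mate maps (-f 12),
# excluding secondary alignments (-F 256)
minimap2 -ax sr host.fa sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools fastq -f 12 -F 256 -1 nohost_R1.fastq.gz -2 nohost_R2.fastq.gz -
```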
quantifyERCCs
Quantifies ERCC reads by mapping with minimap2 to the file given with --ercc_fasta. The individual output files are saved to the folder ercc-stats and the information is summarized in combined.stats.tsv.
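A conceptual equivalent (file names are hypothetical):

```bash
# align reads to the ERCC spike-in sequences and summarize the alignment
minimap2 -ax sr ercc.fasta sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools sort -o ercc.bam -
samtools stats ercc.bam > ercc_stats.txt
```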
filterReads
Filters for only SARS-CoV-2 reads. This is accomplished by first mapping with minimap2 to the file given with --ref and keeping only mapped reads. These reads are then run through kraken2 against the database provided by --kraken2_db. Only reads that are assigned uniquely to SARS-CoV-2 are kept. These reads are sent to trimReads.
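A conceptual sketch of the two stages (file and database names are hypothetical; the pipeline additionally keeps only reads that kraken2 assigns uniquely to SARS-CoV-2):

```bash
# stage 1: keep read pairs where both mates map to the SARS-CoV-2 reference
minimap2 -ax sr ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools fastq -F 12 -1 mapped_R1.fastq.gz -2 mapped_R2.fastq.gz -
# stage 2: classify the mapped reads against the kraken2 database
kraken2 --db kraken2_db --paired --report kraken2_report.txt \
  --classified-out classified#.fastq mapped_R1.fastq.gz mapped_R2.fastq.gz
```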
trimReads
Trims adapter sequences and performs quality-trimming using Trim Galore. Skip this step with the flag --skip_trim_adapters. The trimmed reads will be published to the folder trimmed-reads. These are sent to alignReads and callVariants.
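A conceptual equivalent using Trim Galore (file names are hypothetical; --fastqc also produces the FastQC reports that MultiQC later collects):

```bash
trim_galore --paired --fastqc -o trimmed-reads \
  mapped_R1.fastq.gz mapped_R2.fastq.gz
```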
alignReads
Aligns the reads to the reference genome provided by --ref with minimap2 and sorts them by position with samtools sort. The BAM files are sent to trimPrimers.
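A conceptual equivalent (file names are hypothetical):

```bash
# align trimmed reads to the reference, coordinate-sort, and index
minimap2 -ax sr ref.fa trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
  | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam
```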
trimPrimers
Trims the primers provided in --primers from the BAM using ivar trim. The primers are soft-clipped from the ends of the reads in the BAM file. This step is necessary to properly call consensus alleles in these regions, as the primer sequences can be artificially amplified. The trimmed BAMs are published to the folder aligned-reads.
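A conceptual equivalent with ivar trim (file names are hypothetical; -e also keeps reads without a matching primer):

```bash
ivar trim -e -i sample.sorted.bam -b primers.bed -p sample.primertrimmed
samtools sort -o sample.primertrimmed.sorted.bam sample.primertrimmed.bam
samtools index sample.primertrimmed.sorted.bam
```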
makeConsensus
Calls a consensus genome using ivar consensus. See here for parameters.
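A conceptual equivalent following the standard ivar consensus usage; the quality (-q), frequency (-t), and depth (-m) thresholds here are illustrative, not necessarily the pipeline's defaults:

```bash
samtools mpileup -aa -A -d 0 -Q 0 sample.primertrimmed.sorted.bam \
  | ivar consensus -p sample.consensus -q 20 -t 0.75 -m 10
```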
quast
Runs QUAST on the consensus genome. The results are published to the folder QUAST.
callVariants
Calls variant sites (SNPs) using samtools mpileup and bcftools call. This also produces files from bcftools stats.
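A conceptual equivalent (in recent htslib releases the mpileup step has moved from samtools into bcftools, so this sketch uses bcftools mpileup; file names are hypothetical):

```bash
bcftools mpileup -f ref.fa sample.primertrimmed.sorted.bam \
  | bcftools call -mv -Oz -o variants.vcf.gz
bcftools stats variants.vcf.gz > variants_stats.txt
```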
computeStats
Computes alignment and assembly statistics using samtools stats and a Python script. The coverage plot is created during this process and published to coverage-plots.
multiqc
Runs MultiQC to collect the reports from Trim Galore (FastQC), QUAST, samtools, and bcftools. The output is published to the folder MultiQC.
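A conceptual equivalent, run from the directory containing the tool reports:

```bash
multiqc -o MultiQC .
```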
Specifications of Workstation:
Processor: Intel Xeon Gold 6240 CPU @ 2.60GHz x 72
Disk Capacity: 3.0 TB
LSB Version: core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 21.04
Release: 21.04
Codename: hirsute
Type: 64-bit
Specifications of Server:
Processor: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz x 160
Disk Capacity: 45 TB
LSB Version: core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
Type: 64-bit
Prerequisites:
CZ ID consensus-genome-v3.4.8 is installed on our server and workstation as of this manual's writing (23 May 2022). Versions of the prerequisites* are listed below:
miniwdl v1.1.3
Docker version 20.10.7, build f0df350
conda 4.10.3
python 3.7.11
*These prerequisite versions are specific to the workstation only.
Note
Modifications required while running on your samples:
In the test.yml file, chr 1 is specified to keep resource usage minimal. You can also develop your own YAML input file according to your requirements and pass it to miniwdl, as sketched below.
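A sketch of such a run; my_inputs.yml is a hypothetical file name passed in place of the bundled test YAML:

```bash
miniwdl run consensus-genome/run.wdl \
    docker_image_id=czid-consensus-genome \
    fastqs_0=my_sample_R1.fastq.gz \
    fastqs_1=my_sample_R2.fastq.gz \
    sample=my_sample \
    technology=Illumina \
    -i my_inputs.yml
```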
Example
Output files generated after running the pipeline:
Depth Plot:
On average, a pair of 50 MB FASTQ files takes ~7.5 minutes to process through the CZ ID consensus genome pipeline (on either the server or the workstation).
We hope you liked our efforts; we have tried our best to keep the steps and processes simple. If you have any queries or suggestions regarding the pipeline or this GitBook, feel free to contact us at:
Dear folks,
If you are interested in our further contributions to fighting the SARS-CoV-2 pandemic, kindly visit the following links:
QC of SARS-CoV-2 Genomics Analysis: https://github.com/pha4ge/pipeline-resources/blob/main/docs/qc-solutions.md & with PHA4GE newsletter: https://pha4ge.org/articles/bioinformatics-pipelines-and-visualization-working-group-july-2022-update/
Feel free to contact us regarding these as well!