Implementation of CZ ID mini-WDL-based SARS-CoV-2 Consensus Genome Workflow Pipeline at AKU
To overcome bioinformatics bottlenecks in SARS-CoV-2 genomics analysis, we have implemented the CZ ID SARS-CoV-2 consensus genome pipeline on the existing IT infrastructure of The Aga Khan University.
Khan W, Kanwar S
AKU CITRIC Center for Bioinformatics and Computational Biology, Department of Pediatrics and Child Health, Faculty of Health Sciences, Medical College, The Aga Khan University, Karachi-74800, Pakistan.
The CZ ID pipeline is implemented from:
https://github.com/chanzuckerberg/czid-workflows
We hope you find it useful!
Running WDL workflows locally
The analysis steps are wrapped in Workflow Description Language (WDL). The pipeline receives FASTQ or FASTA inputs and processes them through numerous tools and reference databases. To enable external reproduction:
Reference databases should be downloaded from Amazon S3: s3://czid-public-references/consensus-genome/
Tools should be packaged as a Docker image
Install and configure miniwdl and other dependencies
Prepare to run the workflow by setting up miniwdl, a local WDL runner.
Install miniwdl using the following command (assuming pip for Python 3 is available; installing via conda works as well):
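```bash
pip3 install miniwdl
```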
Then, open a command prompt and try:
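```bash
miniwdl run_self_test
```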
This tests the installation with a trivial built-in workflow; numerous log messages should be printed, concluding with miniwdl run_self_test OK in about 30 seconds.
miniwdl's download cache feature should also be activated, so that reference databases need to be downloaded from S3 on first use only. (This should be a one-time download, but it did not work for us at first; you can find the resolved query at this link.) To activate the cache, set and export environment variables as sketched below.
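A sketch of the cache settings; the cache directory path here is just an example, so adjust it to your storage layout:

```bash
export MINIWDL__DOWNLOAD_CACHE__PUT=true
export MINIWDL__DOWNLOAD_CACHE__GET=true
export MINIWDL__DOWNLOAD_CACHE__DIR=/mnt/miniwdl_download_cache   # example location
```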
Fetch workflows
Clone the chanzuckerberg/czid-workflows git repository containing the WDL and Docker source files.
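For example:

```bash
git clone https://github.com/chanzuckerberg/czid-workflows.git
cd czid-workflows
```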
Configure scratch space
When running miniwdl, use the --dir DIR option to use the given DIR as scratch space. For example, if your instance storage is mounted on /mnt (the default on Ubuntu), use miniwdl --dir /mnt to run there. If you plan to run the czid-workflows test suite, also point TMPDIR to that directory with export TMPDIR=/mnt.
Pull or build Docker image
Building the image from the Dockerfile takes several minutes, but doesn't require logging in anywhere. To build it (assuming the Dockerfile sits in the consensus-genome directory of the cloned repository):
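```bash
# the tag matches the docker_image_id used in the run command below
docker build -t czid-consensus-genome consensus-genome
```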
Running the pipeline
Now for the exciting part: run the workflow using a command along these lines (the test FASTQ and local_test.yml paths are assumptions based on the repository layout; substitute your own reads and reference):
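```bash
miniwdl run --dir /mnt consensus-genome/run.wdl \
    docker_image_id=czid-consensus-genome \
    fastqs_0=consensus-genome/test/sample_R1.fastq.gz \
    fastqs_1=consensus-genome/test/sample_R2.fastq.gz \
    sample=test_sample \
    technology=Illumina \
    ref_fasta=/path/to/reference.fa \
    -i consensus-genome/test/local_test.yml
```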
Breaking this down:
docker_image_id= should be set to the Docker image, in this case czid-consensus-genome.
The pair of FASTQ files is given to the fastqs_0= and fastqs_1= parameters.
sample is the name used when referencing the sample in the output files.
technology specifies the sequencing technology used, either Illumina or ONT.
ref_fasta is the path to the reference file; if you want to use your own reference, give the path to its FASTA file in this parameter.
local_test.yml supplies boilerplate workflow inputs, such as the S3 paths for the viral reference databases.
The first attempt will take some time to download the reference databases (about 6 GiB total). Thereafter, if miniwdl download caching is enabled as suggested above, running this or other small samples should take just a few minutes, since the reference databases were already downloaded and only the analysis itself runs.
Output of workflow
When the run completes, miniwdl prints a large JSON structure with all the outputs and output file paths in the created run directory.
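For example, each miniwdl run directory contains an outputs.json file and an out/ folder of symlinks to the output files (the run directory name below is illustrative):

```bash
cat 20220523_123456_consensus_genome/outputs.json
ls 20220523_123456_consensus_genome/out/
```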
Summary
In the output directory specified by --outdir, the following folders and files will be created. Let's break down the structure of the output files:
consensus.fa
A multi-FASTA file containing the consensus sequence.
variants.vcf.gz
A VCF file of variants with respect to the reference.
MultiQC/
All MultiQC output files.
QUAST/
All QUAST output files.
aligned-reads/
Primer-trimmed BAM files after alignment of reads to the reference provided by ref_fasta. These files are the output of ivar trim.
call_consensus-stats/
Summary QC statistics for a sample run through the pipeline.
coverage-plot/
Line plot of coverage for a sample.
ercc-stats/
The output of samtools stats for the alignment to the file --ercc_fasta.
RemoveHost/
This folder is created when the flag --prefilter_host_reads is set. By default, it is disabled with -profile artic but enabled with -profile msspe. It contains untrimmed reads that do not map to the host genome.
filtered-reads/
This folder is created when the flag --save_sars2_filtered_reads is set. By default, it is enabled with -profile artic but disabled with -profile msspe. It contains untrimmed reads that map to the reference genome, and filtered reads that map to host or other viruses.
trimmed-reads/
Sample reads that have undergone all filtering steps and adapter trimming.
Overview of pipeline
Processes
The main script that runs the pipeline is run.wdl. Below, we explain the major functions that are called to process the FASTQ files. Although this information is available in the czbiohub SC2 pipeline documentation, we have put it all together in one place for the reader's convenience.
RemoveHostReads
Filters host reads by mapping with minimap2 to the file given with --ref_host and keeping unmapped reads. The filtered reads are output into the folder call-RemoveHost. These reads are sent to filterReads.
This step may be computationally expensive. If you only need the SARS-CoV-2 reads and don't care about reads from other pathogens, you can use the FASTQ output from filterReads instead.
By default, this step is skipped for -profile artic, but enabled for -profile msspe.
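A conceptual stand-alone equivalent of this step, not the pipeline's exact invocation (file names are hypothetical):

```bash
# map to the host genome, keep pairs where neither mate maps (-f 12),
# excluding secondary alignments (-F 256)
minimap2 -ax sr host.fa sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools fastq -f 12 -F 256 -1 nohost_R1.fastq.gz -2 nohost_R2.fastq.gz -
```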
quantifyERCCs
Quantifies ERCC reads by mapping with minimap2 to the file given with --ercc_fasta. The individual output files are saved to the folder ercc-stats and the information is summarized in combined.stats.tsv.
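A conceptual equivalent (file names are hypothetical):

```bash
# align reads to the ERCC spike-in sequences and summarize the alignment
minimap2 -ax sr ercc.fasta sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools sort -o ercc.bam -
samtools stats ercc.bam > ercc_stats.txt
```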
filterReads
Filters for only SARS-CoV-2 reads. This is accomplished by first mapping with minimap2 to the file given with --ref and keeping only mapped reads. These reads are then run through kraken2 against the database provided by --kraken2_db. Only reads that are assigned uniquely to SARS-CoV-2 are kept. These reads are sent to trimReads.
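A conceptual sketch of the two stages (file and database names are hypothetical; the pipeline additionally keeps only reads that kraken2 assigns uniquely to SARS-CoV-2):

```bash
# stage 1: keep read pairs where both mates map to the SARS-CoV-2 reference
minimap2 -ax sr ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools fastq -F 12 -1 mapped_R1.fastq.gz -2 mapped_R2.fastq.gz -
# stage 2: classify the mapped reads against the kraken2 database
kraken2 --db kraken2_db --paired --report kraken2_report.txt \
  --classified-out classified#.fastq mapped_R1.fastq.gz mapped_R2.fastq.gz
```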
trimReads
Trims adapter sequences and performs quality-trimming using Trim Galore. Skip this step with the flag --skip_trim_adapters. The trimmed reads will be published to the folder trimmed-reads. These are sent to alignReads and callVariants.
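A conceptual equivalent using Trim Galore (file names are hypothetical; --fastqc also produces the FastQC reports that MultiQC later collects):

```bash
trim_galore --paired --fastqc -o trimmed-reads \
  mapped_R1.fastq.gz mapped_R2.fastq.gz
```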
alignReads
Aligns the reads to the reference genome provided by --ref with minimap2 and sorts them by position with samtools sort. The BAM files are sent to trimPrimers.
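A conceptual equivalent (file names are hypothetical):

```bash
# align trimmed reads to the reference, coordinate-sort, and index
minimap2 -ax sr ref.fa trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
  | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam
```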
trimPrimers
Trims the primers provided in --primers from the BAM using ivar trim. The primers are soft-clipped from the ends of the reads in the BAM file. This step is necessary to properly call consensus alleles in these regions, as the primer sequences can be artificially amplified. The trimmed BAMs are published to the folder aligned-reads.
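A conceptual equivalent with ivar trim (file names are hypothetical; -e also keeps reads without a matching primer):

```bash
ivar trim -e -i sample.sorted.bam -b primers.bed -p sample.primertrimmed
samtools sort -o sample.primertrimmed.sorted.bam sample.primertrimmed.bam
samtools index sample.primertrimmed.sorted.bam
```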
makeConsensus
Calls a consensus genome using ivar consensus. See here for parameters.
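A conceptual equivalent following the standard ivar consensus usage; the quality (-q), frequency (-t), and depth (-m) thresholds here are illustrative, not necessarily the pipeline's defaults:

```bash
samtools mpileup -aa -A -d 0 -Q 0 sample.primertrimmed.sorted.bam \
  | ivar consensus -p sample.consensus -q 20 -t 0.75 -m 10
```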
quast
Runs QUAST on the consensus genome. The results are published to the folder QUAST.
callVariants
Calls variant sites (SNPs) using samtools mpileup and bcftools call. This also produces files from bcftools stats.
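A conceptual equivalent (in recent htslib releases the mpileup step has moved from samtools into bcftools, so this sketch uses bcftools mpileup; file names are hypothetical):

```bash
bcftools mpileup -f ref.fa sample.primertrimmed.sorted.bam \
  | bcftools call -mv -Oz -o variants.vcf.gz
bcftools stats variants.vcf.gz > variants_stats.txt
```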
computeStats
Computes alignment and assembly statistics using samtools stats and a Python script. The coverage plot is created during this process and published to coverage-plots.
multiqc
Runs MultiQC to collect the reports from Trim Galore (FastQC), QUAST, samtools, and bcftools. The output is published to the folder MultiQC.
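A conceptual equivalent, run from the directory containing the tool reports:

```bash
multiqc -o MultiQC .
```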
Specifications of Workstation:
Processor: Intel Xeon Gold 6240 CPU @ 2.60GHz x 72
Disk Capacity: 3.0 TB
LSB Version: core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 21.04
Release: 21.04
Codename: hirsute
Type: 64-bit
Specifications of Server:
Processor: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz x 160
Disk Capacity: 45 TB
LSB Version: core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
Type: 64-bit
Prerequisites:
CZ ID consensus-genome-v3.4.8 is installed on our server and workstation as of this manual's writing (23 May 2022). Versions of the prerequisites* are listed below:
miniwdl v1.1.3
Docker version 20.10.7, build f0df350
conda 4.10.3
python 3.7.11
*These prerequisite versions are specific to the workstation only.
Note
Modifications required while running on your samples:
In the test.yml file, chr 1 is specified to keep resource usage minimal. You can also develop your own YAML input file according to your requirements and pass it to miniwdl, as sketched below.
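A sketch of such a run; my_inputs.yml is a hypothetical file name passed in place of the bundled test YAML:

```bash
miniwdl run consensus-genome/run.wdl \
    docker_image_id=czid-consensus-genome \
    fastqs_0=my_sample_R1.fastq.gz \
    fastqs_1=my_sample_R2.fastq.gz \
    sample=my_sample \
    technology=Illumina \
    -i my_inputs.yml
```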
Example
Output files generated after running the pipeline:
Depth Plot:
On average, a pair of 50 MB FASTQ files takes ~7.5 minutes to process through the CZ ID consensus genome pipeline (on either the server or the workstation).
We hope you liked our efforts; we have tried our best to keep the steps and processes simple. If you have any queries or suggestions regarding the pipeline or this GitBook, feel free to contact us at:
Dear folks,
If you are interested in our further contributions to fighting the SARS-CoV-2 pandemic, kindly visit the following links:
QC of SARS-CoV-2 Genomics Analysis: https://github.com/pha4ge/pipeline-resources/blob/main/docs/qc-solutions.md & with PHA4GE newsletter: https://pha4ge.org/articles/bioinformatics-pipelines-and-visualization-working-group-july-2022-update/
Feel free to contact us regarding these as well!