> For the complete documentation index, see [llms.txt](https://waqasnayab.gitbook.io/pipeline/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://waqasnayab.gitbook.io/pipeline/master/implementation-of-cz-id-mini-wdl-based-sars-cov-2-consensus-genome-workflow-pipeline-at-aku.md).

# Implementation of CZ ID mini-WDL-based SARS-CoV-2 Consensus Genome Workflow Pipeline at AKU

> *Khan W, Kanwar S*
>
> *AKU CITRIC Center for Bioinformatics and Computational Biology, Depratment of Pediatrics and Child Health, Faculty of Health Sciences, Medical College, The Aga Khan Universitry, Karachi-74800, Pakistan.*
>
> *CZ ID pipeline is implemented from:*
>
> [*https://github.com/chanzuckerberg/czid-workflows*](https://github.com/chanzuckerberg/czid-workflows)
>
> Hope you like it!!!&#x20;

### **Running WDL workflows locally**

The analysis steps are wrapped in [Workflow Description Language](https://openwdl.org/) (WDL). The pipeline receives FASTQ or FASTA inputs and processes them through numerous tools and reference databases. To enable external reproduction:

1. Reference databases should be downloaded from Amazon S3: [s3://czid-public-references/consensus-genome/](s3://czid-public-references/consensus-genome/)&#x20;
2. Tools should be packaged-in as a Docker image

### Install and configure miniwdl and other dependencies

Prepare to run the workflow by setting up [miniwdl](https://github.com/chanzuckerberg/miniwdl), a local WDL runner.

Let's try to install miniwdl using following command:

```
conda install -c conda-forge miniwdl
```

Then, open a command prompt and try:

```
miniwdl run_self_test
```

To test the installation with a trivial built-in workflow, numerous log messages should be printed, and conclude with `miniwdl run_self_test OK` in \~30 seconds.

![miniwdl installed successfully on workstation (You will find specs at the end of page)](/files/2ELIhmu3FuXhjDThxEe7)

miniwdl’s [download cache feature](https://miniwdl.readthedocs.io/en/latest/runner_reference.html#file-download-cache) should also be activated, so that reference databases need to be downloaded from S3 on first use only **(This is a one-time download, but it did not happen to us!!! You can find resolved query at this** [link](<https://github.com/chanzuckerberg/idseq-workflows/issues/179 >)**)**. To activate the cache, set and export environment variables:

```
export MINIWDL__DOWNLOAD_CACHE__PUT=true
export MINIWDL__DOWNLOAD_CACHE__GET=true
export MINIWDL__DOWNLOAD_CACHE__DIR=/mnt/miniwdl_download_cache
```

### Fetch workflows

Clone the `chanzuckerberg/czid-workflows` git repository containing the WDL and Docker source files.

```
git clone https://github.com/chanzuckerberg/czid-workflows.git
```

### Configure scratch space

When running miniwdl, use the `--dir DIR` option to use the given DIR as scratch space; for example, if your instance storage is mounted on /mnt (default settings on UBUNTU), use `miniwdl --dir /mnt` to run there. If you plan to run the czid-workflows test suite, set `TMPDIR` to your directory using `export TMPDIR=/mnt`.

### Pull or build Docker image

Building the image from the Dockerfile takes several minutes, but doesn’t require logging in anywhere. To build it,

```
docker build -t czid-consensus-genome czid-workflows/consensus-genome
```

### **Running the pipeline**

Excited though!!! run the workflow using the following command:

```
export MINIWDL__DOWNLOAD_CACHE__PUT=true && \
export MINIWDL__DOWNLOAD_CACHE__GET=true && \
export MINIWDL__DOWNLOAD_CACHE__DIR=/tmp/miniwdl_download_cache && \
time miniwdl run --verbose czid-workflows/consensus-genome/run.wdl \
docker_image_id=czid-consensus-genome \
fastqs_0= /path/sample_R1_001.fastq.gz \
fastqs_1= /path/sample_R2_001.fastq.gz \
sample= SampleRun \
technology= Illumina \
ref_fasta=s3://idseq-public-references/consensus-genome/MN908947.3.fa \
-i czid-workflows/consensus-genome/test/local_test.yml --debug -d SampleRun/.
```

Breaking this down,

* `docker_image_id=`should be set to the docker image, in this case `czid-consensus-genome`.
* The pair of FASTQ files given to `fastqs_0=` and `fastqs_1=` parameters.
* `sample` is the name to use where referencing the sample in the output files.
* Sequencing technology used should be mentioned in `technology` parameter either `Illumina` or `ONT`.
* `ref_fasta` path to reference file should be given in this parameter, if you want to use your own reference, mention the path of reference fasta file in this parameter.
* `local_test.yml` supplies boiler plate workflow inputs, such as the S3 paths for the viral reference databases.

The first attempt will take some time to download the reference databases (about 6 GiB total). Thereafter, if miniwdl download caching is enabled as suggested above, running this or other small samples should take just a few minutes **(only analysis with no downloads as reference databases are downloaded earlier)**.

### **Output of workflow**

When the run completes, miniwdl prints a large JSON structure with all the outputs and output file paths in the created run directory.

```
{
  "consensus_genome.vadr_errors": null,
  "consensus_genome.trim_reads_out_trimmed_fastqs": [
    "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/trim_reads_out_trimmed_fastqs/0/trim_reads_val_1.fq.gz",
    "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/trim_reads_out_trimmed_fastqs/1/trim_reads_val_2.fq.gz"
  ],
  "consensus_genome.vadr_quality_out": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/vadr_quality_out/vadr-output.vadr.sqc",
  "consensus_genome.quast_out_quast_tsv": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/quast_out_quast_tsv/report.tsv",
  "consensus_genome.quast_out_quast_txt": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/quast_out_quast_txt/report.txt",
  "consensus_genome.compute_stats_out_depths_fig": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/compute_stats_out_depths_fig/depths.png",
  "consensus_genome.remove_host_out_host_removed_fastqs": [
    "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/remove_host_out_host_removed_fastqs/0/no_host_1.fq.gz",
    "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/remove_host_out_host_removed_fastqs/1/no_host_2.fq.gz"
  ],
  "consensus_genome.compute_stats_out_output_stats": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/compute_stats_out_output_stats/stats.json",
  "consensus_genome.align_reads_out_alignments": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/align_reads_out_alignments/aligned_reads.bam",
  "consensus_genome.quantify_erccs_out_ercc_out": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/quantify_erccs_out_ercc_out/ercc_stats.txt",
  "consensus_genome.trim_primers_out_trimmed_bam_ch": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/trim_primers_out_trimmed_bam_ch/primertrimmed.bam",
  "consensus_genome.trim_primers_out_trimmed_bam_bai": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/trim_primers_out_trimmed_bam_bai/primertrimmed.bam.bai",
  "consensus_genome.call_variants_out_variants_ch": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/call_variants_out_variants_ch/variants.vcf.gz",
  "consensus_genome.realign_consensus_fa": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/realign_consensus_fa/S12_S12_L001.muscle.out.fasta",
  "consensus_genome.vadr_alerts_out": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/vadr_alerts_out/vadr-output.vadr.alt.list",
  "consensus_genome.filter_reads_out_filtered_fastqs": [
    "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/filter_reads_out_filtered_fastqs/0/filtered_1.fq.gz",
    "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/filter_reads_out_filtered_fastqs/1/filtered_2.fq.gz"
  ],
  "consensus_genome.compute_stats_out_sam_depths": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/compute_stats_out_sam_depths/samtools_depth.txt",
  "consensus_genome.minion_log": null,
  "consensus_genome.zip_outputs_out_output_zip": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/zip_outputs_out_output_zip/outputs.zip",
  "consensus_genome.make_consensus_out_consensus_fa": "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_miniwdl/S12_S12_L001/out/make_consensus_out_consensus_fa/consensus.fa"
}
```

### Summary&#x20;

In the output directory specified by `--outdir`, the following folders and files will be created:

```
outdir
├── consensus.fa
├── variants.vcf.gz
├── MultiQC
│   ├── multiqc_data
│   └── multiqc_plots
│       ├── pdf
│       ├── png
│       └── svg
├── QUAST
│   ├── sample1
│   │   ├── aligned_stats
│   │   ├── basic_stats
│   │   ├── contigs_reports
│   │   │   └── minimap_output
│   │   ├── genome_stats
│   │   └── icarus_viewers
│   └── ...
├── aligned-reads
│   ├── sample1.primertrimmed.bam
│   ├── sample1.primertrimmed.bam.bai
│   └── call-RealignConsensus
│   │   ├── sample.muscle.out.fasta
├── call_consensus-stats
│   ├── combined.stats.tsv
│   └── filtered.stats.tsv
├── consensus-seqs
│   ├── sample1.consensus.fa
│   └── ...
├── coverage-plots
│   ├── sample1.depths.png
│   └── ...
├── ercc-stats
│   ├── sample1.ercc_stats
│   └── ...
├── sample-variants
│   ├── sample1.bcftools_stats
│   ├── sample1.vcf.gz
│   ├── sample1.vcf.gz.tbi
│   └── ...
└── trimmed-reads
    ├── sample1_covid_1_val_1.fq.gz
    ├── sample1_covid_2_val_2.fq.gz
    └── ...
```

Let's breakdown the whole structure of output files generated at the end:

#### `consensus.fa`

A multi-FASTA file containing consensus sequence.

#### `variants.vcf.gz`

A VCF file of variants with respect to the reference.

#### `MultiQC/`

All MultiQC output files.

#### `QUAST/`

All QUAST output files.

#### `aligned-reads/`

Primer-trimmed BAM files after alignment of reads to the reference provided by `ref_fasta`. These files are the output of `ivar trim`.

#### `call_consensus-stats/`

Summary QC statistics for a sample run through the pipeline.

#### `coverage-plot/`

Line plot of coverage for a sample.

#### `ercc-stats/`

The output of `samtools stats` for the alignment to the file `--ercc_fasta`.

#### `RemoveHost/`

This folder is created when the flag `--prefilter_host_reads` is set. By default, it is disabled when `-profile artic`, but enabled when `-profile msspe`. It contains untrimmed reads that do not map to the host genome.

#### `filtered-reads/`

This folder is created when the flag `--save_sars2_filtered_reads` is set. By default, it is enabled when `-profile artic`, but disabled when `-profile msspe`. It contains untrimmed reads that map to the reference genome, and filtered reads that map to host or other viruses.

#### `trimmed-reads/`

Sample reads that have undergone all filtering steps and adapter-trimming.

### **Overview of pipeline**

### Processes

The main script that runs the pipeline is `run.wdl`Below, we explain the major functions that are called to process the `fastq files`. Although this information is available on czbiohub [SC2 pipeline](https://github.com/czbiohub/sc2-illumina-pipeline/blob/master/docs/overview.md) but for reader's convenience, we have put together all information at one place.&#x20;

#### `RemoveHostReads`

Filters host reads by mapping with `minimap2` to the file given with `--ref_host` and keeping unmapped reads. The filtered reads will be output into the folder `call-RemoveHost`. These reads are sent to [`filterReads`](https://github.com/czbiohub/sc2-illumina-pipeline/blob/master/docs/overview.md#filterreads).

This step may be computationally expensive. If you only need the SARS-CoV-2 reads, and don't care about reads from other pathogens, you can use the fastqs output from `filterReads` instead.

By default, this step is skipped for `-profile artic`, but enabled for `-profile msspe`.

#### `quantifyERCCs`

Quantify ERCC reads by mapping with `minimap2` to the file given with `--ercc_fasta`. The individual output files are saved to the folder `ercc-stats` and the information is summarized in `combined.stats.tsv`.

#### `filterReads`

Filter for only SARS-CoV-2 reads. This is accomplished by first mapping with `minimap2` to the file given with `--ref` and only keeping mapped reads. These reads are then run through `kraken2` against the database provided by `--kraken2_db`. Only reads that are assigned uniquely to SARS-CoV-2 are kept. These reads are sent to [`trimReads`](https://github.com/czbiohub/sc2-illumina-pipeline/blob/master/docs/overview.md#trimreads).

#### `trimReads`

Trim adapter sequences and perform quality-trimming using [Trim Galore](https://github.com/FelixKrueger/TrimGalore). Skip this step with the flag `--skip_trim_adapters`. The trimmed reads will be published to the folder `trimmed-reads`. These will be sent to [`alignReads`](https://github.com/czbiohub/sc2-illumina-pipeline/blob/master/docs/overview.md#alignreads) and [`callVariants`](https://github.com/czbiohub/sc2-illumina-pipeline/blob/master/docs/overview.md#callvariants).

#### `alignReads`

Align the reads to the reference genome provided by `--ref` with `minimap2` and sort by position with `samtools sort`. The BAM files are sent to [`trimPrimers`](https://github.com/czbiohub/sc2-illumina-pipeline/blob/master/docs/overview.md#trimprimers)

#### `trimPrimers`

Trim the primers provided in `--primers` from the BAM using `ivar trim`. The primers will be soft-clipped from the ends of the reads in the BAM file. This step is necessary to properly call consensus alleles in these regions as the primer sequences can be artificially amplified. The trimmed BAMs will be published to the folder `aligned-reads`.

#### `makeConsensus`

Call a consensus genome using `ivar consensus`. See [here](https://github.com/czbiohub/sc2-illumina-pipeline/blob/master/docs/running.md#primer-trimming) for parameters.

#### `quast`

Run `QUAST` on the consensus genome. The results will be published to the folder `QUAST`.

#### `callVariants`

Call variant sites (SNPs) use `samtools mpileup` and `bcftools call`. This will also produce files from `bcftools stats`.

#### `computeStats`

Compute alignment and assembly statistics using `samtools stats` and the python script. Coverage plot is created during this process and published to `coverage-plots`.

#### `multiqc`

Run MultiQC to collect the report from `Trim Galore` (`FastQC`), `QUAST`, `samtools` and `bcftools`. The output will be published to the folder `MultiQC`.

### Specifications of Workstation:

| Memory         | 96 GiB                                                  |
| -------------- | ------------------------------------------------------- |
| Processor      | Intel Xeon Gold 6240 CPU @ 2.60GHz x 72                 |
| Disk Capacity  | 3.0 TB                                                  |
| LSB Version    | core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch |
| Distributor ID | Ubuntu                                                  |
| Description    | Ubuntu 21.04                                            |
| Release        | 21.04                                                   |
| Codename       | hirsute                                                 |
| Type           | 64-bit                                                  |

### Specifications of Server:

| Memory         | 132 GiB                                                 |
| -------------- | ------------------------------------------------------- |
| Processor      | Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz x 160          |
| Disk Capacity  | 45 TB                                                   |
| LSB Version    | core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch |
| Distributor ID | Ubuntu                                                  |
| Description    | Ubuntu 20.04.3 LTS                                      |
| Release        | 20.04                                                   |
| Codename       | focal                                                   |
| Type           | 64-bit                                                  |

### **Pre-requisites:**&#x20;

CZ ID consensus-genome-v3.4.8 is installed on our server and workstation as of writing this manual (23rd May,2022). Versions of pre-requisites\* are mentioned below:&#x20;

`miniwdl v1.1.3`&#x20;

`Docker version 20.10.7, build f0df350`

`conda 4.10.3`

`python 3.7.11`

*\*These prerequisite versions are specific to workstation only.*&#x20;

### &#x20;Note

Modifications required while running on your samples:

```
test.yml ---> chr1 to hg38 
run.wdl ---> threads = 72 on workstation         #depends on system's capacity
```

In `test.yml` file, chr 1 is mentioned in the script for minimal usage of resources. You can also develop your own `yml file` according to your requirements.

### **Example**

#### **Output files generated after running pipeline:**&#x20;

![Folders generated as a result of CZ ID pipeline: Every process has its separate folder containing input, output and log files.](/files/OAHBu4D7cxQ152Q0KuUy)

![Overview of call-ZipOutputs folder having final out files ](/files/74WQkUNVX4S4c0frxzY0)

**Depth Plot:**&#x20;

![Depth plot of example sample showing nucleotide position at x-axis and depth on y-axis. ](/files/BcEuhn0BwlNtaOjHiw0q)

#### On average, FASTQ files of 50 MB takes \~7.5 mins to be processed on CZ ID consensus genome  pipeline (either on server or on workstation!!!)

We hope you liked our efforts, as we tried our best to keep steps/processes simple. If you have any queries or suggestions regarding pipeline and GitBook, feel free to contact us at:

<img src="/files/yLZf0Rqeih63qeHM3cCX" alt="" data-size="line">[*Samiah Kanwar* ](https://twitter.com/Samiah__k?t=Pdz8E_WiI3XQV-ADrXq16g\&s=09)*and <samiah.kanwar@aku.edu>*

<img src="/files/CbdiSOylWwienPSgLgV7" alt="" data-size="line">[*Waqasuddin Khan*](https://twitter.com/WaqasuddinK?t=LVh7snGQZLgdrqGf48rk3Q\&s=09) *and <waqasuddin.khan@aku.edu>*

**Dear folks,**

**If you are interested to know our further contributions in fighting the SARS-CoV-2 pandemic, kindly visit the following links:**

* <img src="/files/0koYEJAkGvYymXRtr4Ny" alt="" data-size="line"> [Genomic Epidemiology of Novel Coronavirus - Pakistan-focused Subsampling (Non-contextual/Non-targeted)](https://nextstrain.org/community/AKU-CITRIC-Center-for-Bioinformatics/ncov/Pakistan)
* <img src="/files/oGSudhjJYEvvKMozEy1g" alt="" data-size="line"> [Genomic Epidemiology of Novel Coronavirus - Pakistan and Related Contextual (Targeted) Samples](https://nextstrain.org/community/AKU-CITRIC-Center-for-Bioinformatics/ncov/Pakistan-Contextual)
* <img src="/files/JNK76Dt0yFGDsrqMP4cD" alt="" data-size="line"> [AKU CITRIC Center for Bioinformatics- GitHub repository](https://github.com/AKU-CITRIC-Center-for-Bioinformatics)
* <img src="/files/68RpVtzdupsK8m7n4gLN" alt="" data-size="line"> [SARS-CoV-2 nucleotide data submission to NCBI GenBank](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide\&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049\&Authors_idx%20q.op%3DAND=Nisar)&#x20;
* <img src="/files/4TD5gqnVgYxatep0tMl8" alt="" data-size="line"> [SARS-CoV-2 raw sequencing reads submission to NCBI SRA](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA764553/)
* <img src="/files/XdeqLIxIdff0dzPZkUrk" alt="" data-size="line">[ SARS-CoV-2 data submission to COVID-19 data portal](https://www.covid19dataportal.org/sequences?db=embl-covid19\&query=nisar\&size=15\&requestFrom=searchBox\&crossReferencesOption=all#search-content)
* QC of SARS-CoV-2 Genomics Analysis: <https://github.com/pha4ge/pipeline-resources/blob/main/docs/qc-solutions.md> & with PHA4GE newsletter: <https://pha4ge.org/articles/bioinformatics-pipelines-and-visualization-working-group-july-2022-update/>

***Feel free to contact us in these regards as well!!!***


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://waqasnayab.gitbook.io/pipeline/master/implementation-of-cz-id-mini-wdl-based-sars-cov-2-consensus-genome-workflow-pipeline-at-aku.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
