# PHA4GE: Implementation of CZ ID Workflow at AKU for Analyzing SARS-CoV-2 Genomics Data

> ### Please go to this [link](https://waqasnayab.gitbook.io/pipeline/master/implementation-of-cz-id-mini-wdl-based-sars-cov-2-consensus-genome-workflow-pipeline-at-aku)

The IDSeq pipeline is implemented by following the repository at the link below:

<https://github.com/czbiohub/sc2-illumina-pipeline>

The main steps are:

**1. Installation**

* Nextflow (<https://www.nextflow.io/>)
* Pipeline software dependencies
* Pipeline code

**2. Running the pipeline**

* Main arguments used in pipeline

**3. Overview of pipeline**

* Description of processes and functions used in the pipeline

**4. Output**

* Summary of results

The configuration of our workstation is as follows:

| Specification | Value |
| --- | --- |
| Memory | 93.1 GiB |
| Processor | Intel Xeon Gold 6240 CPU @ 2.60GHz x 72 |
| Disk Capacity | 3.0 TB |
| OS Name | Ubuntu 21.04 |
| OS Type | 64-bit |

**1. Installation:**

* **Install Nextflow** (Checking prerequisites)

#### ***Checking the Java version:***

```bash
java -version
```

```
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.10
```

***Set up Nextflow:***

```bash
curl -s https://get.nextflow.io | bash
```

The command failed with the following error:

```
CAPSULE EXCEPTION: Error resolving dependencies. while processing attribute Allow-Snapshots: false (for stack trace, run with -Dcapsule.log=verbose)
Unable to initialize nextflow environment
```

As an alternative, Nextflow was installed through conda (<https://docs.conda.io/en/latest/>):

```bash
conda install -c bioconda nextflow
```

***Output:***

![](https://3625414896-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2Fpipeline%2F-MdH8sRmL6FzP8JccQYw%2F-MdH9IP5JQtIawW_LG2W%2F0.png?generation=1624876072240446\&alt=media)

**Nextflow was installed successfully.**

***Update Nextflow:***

```bash
conda update nextflow
```

***Launch Nextflow:***

```bash
nextflow run hello
```

```
N E X T F L O W  ~  version 19.01.0
Pulling nextflow-io/hello ...
downloaded from https://github.com/nextflow-io/hello.git
Launching nextflow-io/hello [desperate_goldberg] - revision: ec11eb0ec7 [master]
[warm up] executor > local
[d9/0caedb] Submitted process > sayHello (1)
[ce/bb4e73] Submitted process > sayHello (4)
[58/63468f] Submitted process > sayHello (2)
[b2/915799] Submitted process > sayHello (3)
Bonjour world!
Hola world!
Ciao world!
Hello world!
```

* **Pipeline software dependencies**

***Pull the czbiohub/sc2-msspe Docker container:***

```bash
docker pull czbiohub/sc2-msspe
```

```
Using default tag: latest
latest: Pulling from czbiohub/sc2-msspe
68ced04f60ab: Pull complete
9c388eb6d33c: Pull complete
96cf53b3a9dd: Pull complete
6b5e6e9d0010: Pull complete
16f5187a04c9: Pull complete
c69c31f8d73a: Pull complete
69e38dd6074a: Pull complete
84d4143732bf: Pull complete
020ac22afab7: Pull complete
Digest: sha256:d0e177deef2359b271445f4000cf739193e261f0b059b2a6beceb91de04fdf62
Status: Downloaded newer image for czbiohub/sc2-msspe:latest
docker.io/czbiohub/sc2-msspe:latest
```

***Install the dependencies into a conda environment named sc2-msspe:***

```bash
conda env create -f environment.yaml
```

***Output:***

```
Collecting package metadata (repodata.json): done
Solving environment: done
perl-math-round-0.07 | 9 KB
chardet-4.0.0 | 204 KB
perl-xml-sax-expat-0 | 10 KB
zlib-1.2.11 | 106 KB
glimmerhmm-3.0.4 | 40.6 MB
ca-certificates-2021 | 136 KB
future-0.18.2 | 714 KB
xopen-1.1.0 | 20 KB
perl-common-sense-3. | 11 KB
perl-http-message-6. | 51 KB
ld_impl_linux-64-2.3 | 618 KB
perl-text-format-0.5 | 16 KB
perl-app-cpanminus-1 | 234 KB
liblapack-3.9.0 | 11 KB
perl-ntlm-1.09 | 15 KB
perl-uri-1.76 | 55 KB
curl-7.77.0 | 149 KB
entrez-direct-13.9 | 5.2 MB
decorator-5.0.9 | 11 KB
perl-digest-md5-2.55 | 18 KB
perl-params-validate | 25 KB
perl-compress-raw-bz | 44 KB
perl-module-runtime- | 15 KB
perl-io-socket-ssl-2 | 151 KB
perl-http-cookies-6. | 19 KB
perl-regexp-common-2 | 93 KB
pyyaml-5.4.1 | 189 KB
matplotlib-base-3.1. | 6.7 MB
circos-0.69.8 | 26.2 MB
fastqc-0.11.9
perl-net-http-6.19 | 19 KB
multiqc-1.8 | 789 KB
pandas-1.0.3 | 11.1 MB
libdeflate-1.5 | 59 KB
Preparing transaction: done
Verifying transaction: done
Executing transaction: / The default QUAST package does not include:
* GRIDSS (needed for structural variants detection)
* SILVA 16S rRNA database (needed for reference genome detection in metagenomic datasets)
* BUSCO tools and databases (needed for searching BUSCO genes) -- works in Linux only!
To be able to use those, please run
quast-download-gridss
quast-download-silva
quast-download-busco
done
```
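If the pipeline is to be run outside the Docker container, the environment can then be activated (a minimal step, assuming the environment name defined in environment.yaml is `sc2-msspe` as stated above):

```bash
# Activate the pipeline's conda environment (name taken from environment.yaml)
conda activate sc2-msspe
```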

***Build the Docker image with make:***

```bash
make
```

***Output:***

```
docker build . -t czbiohub/sc2-msspe:latest
Sending build context to Docker daemon 41.27MB
Step 1/8 : FROM nfcore/base:1.10.2
 ---> 9e316f514716
Step 2/8 : LABEL authors="Jack Kamm and Samantha Hao" description="Docker image containing all software requirements for the nf-core/msspe pipeline"
 ---> Using cache
 ---> 525fb38872e9
Step 3/8 : COPY environment.yaml /
 ---> Using cache
 ---> 0dc62ac5b33c
Step 4/8 : RUN conda env create -f /environment.yaml && conda clean -a
 ---> Using cache
 ---> f2117adfcc63
Step 5/8 : ENV PATH /opt/conda/envs/sc2-msspe/bin/:$PATH
 ---> Using cache
 ---> 590cabd71879
Step 6/8 : RUN conda env export --name nf-core-msspe-1.0dev > nf-core-msspe-1.0dev.yaml
 ---> Using cache
 ---> 809b6527b82a
Step 7/8 : RUN apt update && apt install -y r-base
 ---> Using cache
 ---> 45559db3313c
Step 8/8 : RUN Rscript -e "install.packages(c('BiocManager', 'aplot')); BiocManager::install('ggtree')"
 ---> Using cache
 ---> b4f2218da097
Successfully built b4f2218da097
Successfully tagged czbiohub/sc2-msspe:latest
```

**2. Running the pipeline**

* **Install Kraken2**

```bash
git clone https://github.com/DerrickWood/kraken2.git
```

***Output:***

```
Cloning into 'kraken2'...
remote: Enumerating objects: 903, done.
remote: Counting objects: 100% (203/203), done.
remote: Compressing objects: 100% (152/152), done.
remote: Total 903 (delta 115), reused 104 (delta 51), pack-reused 700
Receiving objects: 100% (903/903), 428.83 KiB | 1.67 MiB/s, done.
Resolving deltas: 100% (607/607), done.
```

```bash
cd kraken2/
./install_kraken2.sh kraken2
```

***Output:***

```
Kraken 2 installation complete.
```

The installer also suggests copying or symlinking the installed binaries into a directory on your PATH.

* **Link to download the Kraken2 database** (one way to fetch and unpack it is shown after the embed):

{% embed url="https://github.com/czbiohub/test-datasets/raw/olgabot/mssspe--kraken-coronavirus/reference/kraken_coronavirus_db_only.tar.gz" %}
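A minimal sketch for downloading and unpacking the database on the command line; the extraction location is an assumption and should simply match the path later passed to `--kraken2_db`:

```bash
# Download the coronavirus-only Kraken2 database
wget https://github.com/czbiohub/test-datasets/raw/olgabot/mssspe--kraken-coronavirus/reference/kraken_coronavirus_db_only.tar.gz

# Unpack it next to the kraken2 directory (assumed layout; adjust --kraken2_db accordingly)
tar -xzvf kraken_coronavirus_db_only.tar.gz -C kraken2/
```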

* **Clone the pipeline repository, which includes the test data, from GitHub:**

```bash
git clone https://github.com/czbiohub/sc2-illumina-pipeline.git
```

***Output:***

```
Cloning into 'sc2-illumina-pipeline'...
remote: Enumerating objects: 1709, done.
remote: Counting objects: 100% (235/235), done.
remote: Compressing objects: 100% (172/172), done.
remote: Total 1709 (delta 133), reused 132 (delta 60), pack-reused 1474
Receiving objects: 100% (1709/1709), 12.06 MiB | 485.00 KiB/s, done.
Resolving deltas: 100% (1058/1058), done.
```

* **Testing:**

A simple test to make sure things are not broken:

```bash
nextflow run czbiohub/sc2-msspe-bioinfo -profile docker,test
nextflow run czbiohub/sc2-illumina-pipeline -profile docker,test
```

The pipeline was then run on the cloned test data with explicit arguments:

```bash
nextflow run czbiohub/sc2-illumina-pipeline -profile docker \
    --reads '/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_15Jun2021/sc2-illumina-pipeline/test_data/*_R{1,2}.fq.gz' \
    --primers 'data/nCoV-2019.bed' \
    --kraken2_db 'kraken2/kraken_coronavirus_db_only/' \
    --outdir results_1
```

An error was encountered, which was resolved by running the following command:

```bash
nextflow self-update
```

**Output**

![](https://3625414896-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2Fpipeline%2F-MdH8sRmL6FzP8JccQYw%2F-MdH9IP6CF4U62qNEHFC%2F1.png?generation=1624876072261717&alt=media)

![](https://3625414896-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2Fpipeline%2F-MdH8sRmL6FzP8JccQYw%2F-MdH9IP7In7XlJMuuUXW%2F2.png?generation=1624876072256312\&alt=media)

**The IDSeq pipeline was installed successfully.**

* **Updating pipeline:**

Command:

```bash
nextflow pull czbiohub/sc2-illumina-pipeline
```

**Main arguments:**

**-profile:** Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.

**--reads:** Use this to specify the location of your input SARS-CoV-2 read files.

**--primers:** Use this to specify the BED file with the enrichment/amplification primers used in the library prep. The repository comes with two sets of primers corresponding to the MSSPE (SARS-COV-2\_spikePrimers.bed) and ARTIC V3 (nCoV-2019.bed) protocols.

**--kraken2\_db:** Use this to specify the path to the folder containing the Kraken2 database. This is required to filter the reads to only SARS-CoV-2 reads.

**--outdir:** Specify the output directory. This is required and can be an S3 path given the system has the appropriate credentials.
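Putting these arguments together, a typical invocation looks like the following sketch; the read glob, primer BED, database path, and output directory are placeholders to be replaced with your own paths:

```bash
# Example invocation combining the main arguments (all paths are placeholders)
nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker \
    --reads 'fastqs/*_R{1,2}_001.fastq.gz' \
    --primers 'data/nCoV-2019.bed' \
    --kraken2_db 'kraken2/kraken_coronavirus_db_only' \
    --outdir 'results'
```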

**3. Overview of pipeline:**

* **Description of processes and functions used in the pipeline**

**prefilterHostReads**

Filters host reads by mapping with minimap2 to the file given with --ref\_host and keeping unmapped reads. This process is optional; enable it with the flag --prefilter\_host\_reads. The filtered reads will be output into the folder host-subtracted-reads. These reads are sent to filterReads.

This step may be computationally expensive. If you only need the SARS-CoV-2 reads, and don't care about reads from other pathogens, you can use the fastqs output from filterReads instead.

By default, this step is skipped for -profile artic, but enabled for -profile msspe.
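As an illustration, host prefiltering could be enabled like this (a sketch only; the host reference FASTA path is a placeholder, and the remaining arguments follow the examples above):

```bash
# Hypothetical run with host-read prefiltering enabled;
# --ref_host points to a host (e.g. human) reference FASTA -- the path below is a placeholder
nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker \
    --prefilter_host_reads \
    --ref_host 'refs/human_host.fa' \
    --reads 'fastqs/*_R{1,2}_001.fastq.gz' \
    --kraken2_db 'kraken2/kraken_coronavirus_db_only' \
    --outdir 'results_host_filtered'
```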

**quantifyERCCs**

Quantify ERCC reads by mapping with minimap2 to the file given with --ercc\_fasta. The individual output files are saved to the folder ercc-stats and the information is summarized in combined.stats.tsv.

**filterReads**

Filter for only SARS-CoV-2 reads. This is accomplished by first mapping with minimap2 to the file given with --ref and only keeping mapped reads. These reads are then run through kraken2 against the database provided by --kraken2\_db. Only reads that are assigned uniquely to SARS-CoV-2 are kept. These reads are sent to trimReads.

When the option --save\_sars2\_filtered\_reads is enabled, the fastqs of filtered reads are saved to filtered-sars2-reads. By default, this option is enabled when -profile artic, but disabled when -profile msspe.

**trimReads**

Trim adapter sequences and perform quality-trimming using Trim Galore. Skip this step with the flag --skip\_trim\_adapters. The trimmed reads will be published to the folder trimmed-reads. These will be sent to alignReads and callVariants.

**alignReads**

Align the reads to the reference genome provided by --ref with minimap2 and sort by position with samtools sort. The BAM files are sent to trimPrimers.

**trimPrimers**

Trim the primers provided in --primers from the BAM using ivar trim. The primers will be soft-clipped from the ends of the reads in the BAM file. This step is necessary to properly call consensus alleles in these regions as the primer sequences can be artificially amplified. The trimmed BAMs will be published to the folder aligned-reads.

**makeConsensus**

Call a consensus genome using ivar consensus; see the ivar documentation for the parameters used. These genomes will be published to the folder consensus-seqs.

**quast**

Run QUAST on the consensus genomes. The results will be published to the folder QUAST.

**callVariants**

Call variant sites (SNPs) using samtools mpileup and bcftools call. This will also produce files from bcftools stats. The VCF files and stats files are published to sample-variants.

**computeStats**

Compute alignment and assembly statistics using samtools stats and the python script alignment\_assembly\_stats.py. Coverage plots are created during this process and published to coverage-plots.

**combinedVariants**

Creates a joint VCF using samtools and bcftools. combined.vcf will be published to the output directory. This is skipped by default.

**mergeAllAssemblies**

Combine all the consensus genomes into one file combined.fa, which will be published to the output directory.

**mergeAssemblyStats**

Parse previous files to create a summary statistics file combined.stats.tsv. This will be published to call\_consensus-stats/combined.stats.tsv.

**filterAssemblies**

Filter the assemblies based on the criteria given with --maxNs and --minLength. This outputs filtered.fa, published to the output directory, and filtered.stats.tsv, published to call\_consensus-stats.
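For instance, the filtering thresholds could be overridden on the command line; the values below are hypothetical and the pipeline's defaults may differ:

```bash
# Hypothetical thresholds for assembly filtering:
#   --maxNs      maximum number of ambiguous bases (Ns) allowed per genome
#   --minLength  minimum consensus genome length to keep
nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker \
    --reads 'fastqs/*_R{1,2}_001.fastq.gz' \
    --kraken2_db 'kraken2/kraken_coronavirus_db_only' \
    --outdir 'results_filtered' \
    --maxNs 100 \
    --minLength 25000
```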

**multiqc**

Run MultiQC to collect the reports from Trim Galore (FastQC), QUAST, samtools and bcftools. The output will be published to the folder MultiQC.

**4. Output:**

* **Summary of results**

The output of the test data is stored in the directory named “results”, which contains the subdirectories shown in the screenshot below.

![](https://3625414896-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2Fpipeline%2F-MdH8sRmL6FzP8JccQYw%2F-MdH9IP85Ed-6pn509kG%2F3.png?generation=1624876072233086&alt=media)

The pipeline also generated coverage plots for the samples.

![](https://3625414896-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2Fpipeline%2F-MdH8sRmL6FzP8JccQYw%2F-MdH9IP9ManTUETTrdmG%2F4.png?generation=1624876072233401&alt=media)

![](https://3625414896-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2Fpipeline%2F-MdH8sRmL6FzP8JccQYw%2F-MdH9IPAsYTjhFaz3Wsk%2F5.png?generation=1624876072241243&alt=media)

**Local Test Data**

Command:

```bash
nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker \
    --reads 'testSamples/*_R{1,2}_001.fastq.gz*' \
    --kraken2_db 'kraken2/kraken_coronavirus_db_only' \
    --outdir 'result_2'
```

Output:

![](https://3625414896-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2Fpipeline%2F-MdH8sRmL6FzP8JccQYw%2F-MdH9IPBU0lFngrc7EwZ%2F6.png?generation=1624876072283126\&alt=media)

qcTreeMap did not generate any plot for the local test data because no sequences passed the filtering criteria.

All other processes ran successfully.

The consensus genomes are stored in a file named “combined.fa”.

Sample variants are stored in a VCF file named “combined.vcf”.

Quality plots are stored in the folder “MultiQC”.

**Combined.fa:**

A multi-FASTA file containing all consensus sequences.

**Filtered.fa:**

A multi-FASTA file containing only sequences that pass filtering.

**Combined.vcf:**

A VCF of variants in all samples.
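A few quick checks can confirm the contents of these files after a run; this is a sketch, assuming the pipeline was run with `--outdir result_2` as above:

```bash
# Count consensus sequences in the combined and filtered FASTAs
grep -c "^>" result_2/combined.fa
grep -c "^>" result_2/filtered.fa

# Summarize the variants called across all samples
bcftools stats result_2/combined.vcf | head -n 30
```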

**IDSeq miniwdl pipeline**

**On workstation:**

with the specs
