Implementation of Bioinformatics Pipeline at AKU

SARS-CoV-2 Consensus Genome QC: Mixed Sites Correction

This page explains how to remove mixed sites (MS) from your consensus genome FASTA files before submitting them to GISAID.


Last updated 2 years ago


Khan W, Kanwar S

AKU CITRIC Center for Bioinformatics and Computational Biology, Department of Pediatrics and Child Health, Faculty of Health Sciences, Medical College, The Aga Khan University, Karachi-74800, Pakistan.

Mixed sites are mixtures of two or more different bases at a given nucleotide position within the sequence.

  • Red indicates the number of mixed sites >10.

  • Yellow indicates the number of mixed sites >2.

  • Green indicates the number of mixed sites <=2 (acceptable).
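The color bands above can be checked locally before a sequence is uploaded. The following is a minimal sketch, not part of the original workflow. It assumes that a mixed site is any IUPAC ambiguity code other than N, and the function names `count_mixed_sites` and `qc_color` are hypothetical.

```python
# IUPAC codes that stand for two or three possible bases (N is deliberately excluded).
MIXED_CODES = set("RYSWKMBDHV")


def count_mixed_sites(sequence):
    """Count IUPAC ambiguity codes (other than N) in a nucleotide sequence."""
    return sum(1 for base in sequence.upper() if base in MIXED_CODES)


def qc_color(mixed_site_count):
    """Map a mixed-site count to the color bands listed above."""
    if mixed_site_count > 10:
        return "red"
    if mixed_site_count > 2:
        return "yellow"
    return "green"  # two or fewer mixed sites is acceptable
```

For example, `qc_color(count_mixed_sites(sequence))` gives the band for a consensus sequence held as a string.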

On NextClade, mixed sites are represented by "M", and the color coding listed above reflects the number of mixed sites in the sequence.

Mixed sites can be corrected by following the steps below:

1) Use the "grep" command to find ambiguity codes in the consensus.fa file. The header line is excluded first so that letters in the sequence name are not reported as matches.

grep -v '^>' consensus.fa | grep '[YRWSKMDVHBX]'

2) Identify the position of each ambiguity code by opening the alignment file, for example muscle.out.fasta (generated as an intermediate file in the CZ ID pipeline), using MegaX. Copy the three nucleotides upstream and downstream of the MS position, click on the "Find Motif" option in the MegaX dropdown menu, paste the sequence and search.

Hint: To find the exact MS position, copy the adjacent/flanking nucleotides (+/-3) around that specific MS and search for them in the consensus genome FASTA file.

In the example image, the MS "R" is located at position 6,459 (displayed at the bottom left corner).

3) Find the frequency of nucleotides at the identified position using Integrative Genomics Viewer (IGV). Open IGV and first load the reference genome FASTA file (in our case MN908947.3), then load the BAM file of the sample containing the MS and search for the position (identified through MegaX) by pasting it in IGV's search box.

Hint: In IGV's search box, paste the position after the colon. Do not delete the reference name already shown in the search box; the entry should read, e.g., MN908947.3:position

Click on the coverage bar to view the frequency of nucleotides at the specified position. If the frequency of a nucleotide is >=75%, replace the ambiguity code in the consensus.fa file with that most frequent nucleotide. In the example figure, the frequency of "A" is 90%, so "R" can be replaced by "A" in the consensus.fa file.

One thing to keep in mind is the threshold frequency value. Do not modify the nucleotide position if the nucleotide frequency is less than 75 percent. The threshold in the CZ ID pipeline was previously 90% and was changed to 75% a few months ago (see the exact context on GitHub). It is always better to stay up to date with the standards accepted by the scientific community (Hint: maintain your CZ ID pipeline by continuously updating it!).

We hope this helps you improve the QC of your consensus genome.
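The manual check described on this page can also be sketched in Python: list each ambiguity code with its position and +/-3 flanking nucleotides, then apply the 75% rule to the base counts read off the IGV coverage bar. This is only an illustration under stated assumptions, not the authors' tooling. The function names are hypothetical. Positions are 1-based in the consensus sequence itself and may differ from MN908947.3 coordinates if the consensus contains insertions or deletions. Requiring the majority base to be one of the bases the IUPAC code stands for is an extra safeguard added here. "X" from the grep pattern is not handled because it has no defined base set.

```python
# IUPAC ambiguity codes and the bases each one can stand for.
IUPAC_MIXED = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def read_single_fasta(path):
    """Return the sequence of a single-record FASTA file as one uppercase string."""
    with open(path) as handle:
        lines = [line.strip() for line in handle if not line.startswith(">")]
    return "".join(lines).upper()


def find_mixed_sites(sequence, flank=3):
    """List (position, code, context) for every ambiguity code in the sequence.

    Positions are 1-based; context is the code plus up to `flank` bases per side.
    """
    sites = []
    for index, base in enumerate(sequence):
        if base in IUPAC_MIXED:
            start = max(0, index - flank)
            context = sequence[start:index + flank + 1]
            sites.append((index + 1, base, context))
    return sites


def resolve_mixed_site(code, counts, threshold=0.75):
    """Return the base to substitute for `code`, or None to leave it unchanged.

    `counts` maps each base to the read count noted from the IGV coverage bar.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    base, count = max(counts.items(), key=lambda item: item[1])
    # Assumption: only accept a base that the ambiguity code can represent.
    if count / total >= threshold and base in IUPAC_MIXED.get(code, ""):
        return base
    return None
```

With illustrative counts of 90 reads for A and 10 for G, `resolve_mixed_site("R", {"A": 90, "G": 10})` returns "A", matching the 90% example on this page. A 60/40 split returns None, so the site is left unchanged.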

NextClade
CZ ID pipeline
MegaX
Integrative Genomics Viewer
MN908947.3
GitHub
IUPAC nucleotide code for single and mixed bases
"M" in the QC column represents the mixed sites in FASTA file.
Identification of ambiguous nucleotides in genome.
Visualization of nucleotide position on MegaX
IGV's search box
IGV is used to identify the nucleotide frequency
NextClade QC results after MS manual correction