SARS-CoV-2 Consensus Genome QC: Mixed Sites Correction

This page explains how to remove mixed sites (MS) from your consensus genome FASTA files before submitting them to GISAID.

Khan W, Kanwar S

AKU CITRIC Center for Bioinformatics and Computational Biology, Department of Pediatrics and Child Health, Faculty of Health Sciences, Medical College, The Aga Khan University, Karachi-74800, Pakistan.

A mixed site is a nucleotide position in the sequence where two or more different bases were detected, so the consensus reports an ambiguity code instead of a single base.

IUPAC nucleotide code for single and mixed bases
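For quick reference, the standard IUPAC ambiguity codes can be written as a small lookup table. This is only an illustrative sketch: the names `IUPAC_AMBIGUITY` and `possible_bases` are ours and do not come from any tool mentioned on this page.

```python
# IUPAC nucleotide ambiguity codes and the bases each one may stand for.
IUPAC_AMBIGUITY = {
    "R": {"A", "G"},
    "Y": {"C", "T"},
    "S": {"C", "G"},
    "W": {"A", "T"},
    "K": {"G", "T"},
    "M": {"A", "C"},
    "B": {"C", "G", "T"},
    "D": {"A", "G", "T"},
    "H": {"A", "C", "T"},
    "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},
}


def possible_bases(code):
    """Return the set of bases a code may represent (a plain base maps to itself)."""
    code = code.upper()
    return IUPAC_AMBIGUITY.get(code, {code})
```

For example, `possible_bases("R")` gives the set containing "A" and "G", which is why an "R" can only ever be resolved to one of those two bases.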

On NextClade, mixed sites are represented by "M". The color coding reflects the number of mixed sites in the sequence.

  • Red indicates the number of mixed sites >10.

  • Yellow indicates the number of mixed sites >2 and <=10.

  • Green indicates the number of mixed sites <=2 (acceptable).
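The color bands above amount to a simple threshold check. The sketch below restates them as code, using the cut-offs listed on this page; the function name `mixed_site_color` is hypothetical and is not part of NextClade.

```python
def mixed_site_color(n_mixed):
    """Map a mixed-site count to the color bands listed above."""
    if n_mixed > 10:
        return "red"
    if n_mixed > 2:
        return "yellow"
    return "green"  # 2 or fewer mixed sites: acceptable
```

A genome with 3 mixed sites would therefore be flagged yellow, and it turns green once at least one of them has been resolved.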

"M" in the QC column represents the mixed sites in FASTA file.

These can be corrected with the following steps:

1) Use the "grep" command to find ambiguity codes in the consensus.fa file.

# Skip the FASTA header (which may itself contain these letters), then search the sequence lines.
grep -v '^>' consensus.fa | grep '[YRWSKMDVHBX]'
Identification of ambiguous nucleotides in genome.
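As a complement to grep, the following minimal sketch lists each ambiguity code together with its position and its +/-3 flanking motif, which is the string you would paste into a motif search. It makes two assumptions: the file holds a single FASTA record, and the helper names (`read_single_fasta`, `find_mixed_sites`) are ours. Positions are counted along the consensus sequence itself, so they can differ from reference coordinates if the consensus contains insertions or deletions.

```python
# Ambiguity codes treated as mixed sites. N is left out because it usually
# marks missing data rather than a mixture of bases.
AMBIGUITY_CODES = set("RYSWKMBDHV")


def read_single_fasta(path):
    """Return the sequence of a single-record FASTA file as one uppercase string."""
    with open(path) as handle:
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()


def find_mixed_sites(sequence, flank=3):
    """Return (1-based position, code, flanking motif) for each ambiguity code."""
    hits = []
    for index, base in enumerate(sequence):
        if base in AMBIGUITY_CODES:
            start = max(0, index - flank)
            motif = sequence[start:index + flank + 1]
            hits.append((index + 1, base, motif))
    return hits


# Example on an in-memory sequence (no file needed):
print(find_mixed_sites("ACGTRACGT"))  # [(5, 'R', 'CGTRACG')]
```

With a real file, `find_mixed_sites(read_single_fasta("consensus.fa"))` would return one tuple per mixed site.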

2) Identify the position of the ambiguous code by opening the alignment file, for example muscle.out.fasta (generated as an intermediate file by the CZ ID pipeline), in MegaX. Copy the three nucleotides upstream and downstream of the MS, select the "Find Motif" option in the MegaX dropdown menu, paste the sequence, and search.

Hint: To find the exact MS position, copy the flanking nucleotides (+/-3) around that specific MS from the consensus genome FASTA file and use them as the search motif.

As shown in the image below, the MS "R" is located at position 6,459 (displayed in the bottom left corner).

Visualization of nucleotide position on MegaX
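If you prefer to read the position from the alignment programmatically, the idea can be sketched as follows. This assumes you have already extracted two rows of the same alignment as strings, one for the reference and one for the consensus, with "-" marking gaps; whether your alignment file contains both rows is something to check in your own output. The function name `mixed_sites_in_reference_coords` is hypothetical.

```python
AMBIGUITY_CODES = set("RYSWKMBDHV")


def mixed_sites_in_reference_coords(aligned_ref, aligned_consensus):
    """Yield (reference position, code) for ambiguity codes in the consensus row.

    Both arguments are rows of the same alignment, so they must have equal
    length and use "-" for gaps. Positions are 1-based and counted on the
    reference. A code that falls inside an insertion relative to the
    reference is reported at the preceding reference position.
    """
    if len(aligned_ref) != len(aligned_consensus):
        raise ValueError("aligned rows must have the same length")
    ref_pos = 0
    for ref_base, cons_base in zip(aligned_ref.upper(), aligned_consensus.upper()):
        if ref_base != "-":
            ref_pos += 1  # only reference bases advance the coordinate
        if cons_base in AMBIGUITY_CODES:
            yield ref_pos, cons_base
```

Counting only the non-gap reference characters is what keeps the reported position usable as a reference coordinate in a genome browser.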

3) Find the frequency of each nucleotide at the identified position using the Integrative Genomics Viewer (IGV). Open IGV and first load the reference genome FASTA file (in our case MN908947.3), then load the BAM file of the sample containing the MS. Go to the position identified in MegaX by pasting it into IGV's search box.

Hint: In IGV's search box, paste the position after the colon. Do not delete the reference name that is already in the search box; the entry should read, e.g., MN908947.3:position

IGV's search box

Click on the coverage bar to view the frequency of each nucleotide at the specified position. If the most frequent nucleotide has a frequency >=75%, replace the ambiguous code in the consensus.fa file with that nucleotide. As shown in the figure below, the frequency of "A" is 90%, so "R" can be replaced by "A" in the consensus.fa file.

IGV is used to identify the nucleotide frequency
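When editing the sequence, it helps to guard against typing the wrong base or the wrong position. The sketch below replaces one ambiguity code in a sequence string and refuses the change if the new base is not one of the bases that code can represent. The helper name `replace_mixed_site` is ours, and the position is counted along the sequence string you pass in.

```python
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def replace_mixed_site(sequence, position, new_base):
    """Return a copy of sequence with the ambiguity code at a 1-based position replaced."""
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    code = sequence[position - 1].upper()
    new_base = new_base.upper()
    if code not in IUPAC_AMBIGUITY:
        raise ValueError(f"position {position} holds {code!r}, not an ambiguity code")
    if new_base not in IUPAC_AMBIGUITY[code]:
        raise ValueError(f"{new_base!r} is not one of the bases encoded by {code!r}")
    return sequence[:position - 1] + new_base + sequence[position:]


# Toy example: resolve the "R" at position 5 to "A".
print(replace_mixed_site("ACGTRACGT", 5, "A"))  # ACGTAACGT
```

Trying to replace that same "R" with "C" raises an error, because "R" only stands for "A" or "G".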

One thing to keep in mind is the threshold frequency. Do not modify the position if the frequency of the most common nucleotide is below 75%. The threshold in the CZ ID pipeline was previously 90% and was changed to 75% a few months ago (see the exact context on GitHub). It is always better to stay up to date with the standards accepted by the scientific community (Hint: maintain your CZ ID pipeline by continuously updating it!).
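The threshold rule can be written as a short decision function. This is a minimal sketch, assuming you type in the per-base read counts shown in IGV's coverage pop-up by hand; the function name `resolve_mixed_site` and the example counts are hypothetical. The threshold is a parameter because, as noted above, the accepted value has changed over time.

```python
def resolve_mixed_site(base_counts, threshold=0.75):
    """Return the most frequent base if its share of reads meets the threshold, else None."""
    total = sum(base_counts.values())
    if total == 0:
        return None  # no coverage at this position, nothing to decide
    base, count = max(base_counts.items(), key=lambda item: item[1])
    return base if count / total >= threshold else None


# Hypothetical counts: 90 reads support A and 10 support G.
print(resolve_mixed_site({"A": 90, "G": 10}))  # A
# 70% is below the threshold, so the ambiguity code is left unchanged.
print(resolve_mixed_site({"A": 70, "G": 30}))  # None
```

A return value of `None` means the site should stay as it is in the consensus.fa file.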

NextClade QC results after MS manual correction

We hope this helps you improve the QC of your consensus genomes.
