SARS-CoV-2 Consensus Genome QC: Mixed Sites Correction

This page explains how to remove mixed sites (MS) from your consensus genome FASTA files before submitting them to GISAID.

Khan W, Kanwar S

AKU CITRIC Center for Bioinformatics and Computational Biology, Department of Pediatrics and Child Health, Faculty of Health Sciences, Medical College, The Aga Khan University, Karachi-74800, Pakistan.

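Mixed sites are positions where two or more different bases are detected at the same nucleotide position within the sequence. In the consensus genome they appear as IUPAC ambiguity codes (for example, "R" for A or G).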

On Nextclade, mixed sites are represented by "M". The color coding reflects the number of mixed sites in the sequence:

  • Red indicates more than 10 mixed sites.

  • Yellow indicates more than 2 (and up to 10) mixed sites.

  • Green indicates 2 or fewer mixed sites (acceptable).

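Mixed sites can be corrected by following the steps below:

1) Use the "grep" command to find ambiguity codes in the consensus.fa file. The first grep drops the FASTA header line so that letters in the sample name are not matched.

grep -v '^>' consensus.fa | grep '[RYSWKMBDHVX]'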

2) Identify the position of the ambiguity code by opening the alignment file (for example, muscle.out.fasta, generated as an intermediate file by the CZ ID pipeline) in MegaX. Copy the three nucleotides upstream and downstream of the MS, click the "Find Motif" option in the MegaX dropdown menu, paste the sequence, and search.

Hint: To find the exact MS position, copy the MS together with its flanking nucleotides (3 on each side) and search for that motif in the consensus genome FASTA file.

As shown in the image below, the MS "R" is located at position 6,459 (displayed in the bottom left corner).
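Steps 1 and 2 can also be cross-checked from the command line. The following is a minimal sketch, not part of the CZ ID pipeline: the function name find_mixed_sites is hypothetical, and it assumes a plain FASTA file with Unix line endings. It prints each ambiguity code with its 1-based position in the consensus sequence and the three flanking bases on each side, which is the motif you would paste into "Find Motif".

```shell
# Hypothetical helper (not from CZ ID): list every IUPAC ambiguity code in a
# FASTA file as: record name, 1-based position, code, motif (code +/- 3 bases).
# Usage: find_mixed_sites consensus.fa
find_mixed_sites() {
  awk '
    # A header line starts a new record: report the previous one first.
    /^>/ { if (seq != "") report(); name = substr($0, 2); seq = ""; next }
    # Sequence lines are joined so positions count across line breaks.
    { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
    END { if (seq != "") report() }
    function report(   i, base, start) {
      for (i = 1; i <= length(seq); i++) {
        base = substr(seq, i, 1)
        if (index("RYSWKMBDHVX", base) > 0) {
          start = (i > 3) ? i - 3 : 1
          printf "%s\t%d\t%s\t%s\n", name, i, base, substr(seq, start, i - start + 4)
        }
      }
    }
  ' "$@"
}
```

The reported position is counted along the consensus sequence itself. It matches the reference coordinate only when the consensus has no insertions or deletions relative to MN908947.3, so the position should still be confirmed in the alignment before moving on to IGV.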

3) Find the frequency of each nucleotide at the identified position using the Integrative Genomics Viewer (IGV). Open IGV and first load the reference genome FASTA file (in our case MN908947.3), then load the BAM file of the sample containing the MS, and go to the position identified in MegaX by pasting it into IGV's search box.

Hint: In IGV's search box, paste the position after the colon and keep the reference name in front of it, e.g., MN908947.3:position (for the example above, MN908947.3:6459).

Click on the coverage bar to view the frequency of each nucleotide at the specified position. If the frequency of one nucleotide is >=75%, replace the ambiguity code in the consensus.fa file with that most frequent nucleotide. As shown in the figure below, the frequency of "A" is 90%, so "R" can be replaced by "A" in the consensus.fa file.
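The replacement itself can be done in a text editor, but scripting it reduces the risk of editing the wrong base. The sketch below is one possible way to do it, under stated assumptions: the function name replace_base is hypothetical, the FASTA file holds a single record with Unix line endings, and the position is counted along the consensus sequence. As a guard, it only makes the change if the base currently at that position is the ambiguity code you expect.

```shell
# Hypothetical helper: print a copy of a single-record FASTA in which the base
# at 1-based position POS is changed from the expected code OLD to REPL.
# If the base at POS is not OLD, nothing is changed and the exit status is 1.
# Usage: replace_base POS OLD REPL < consensus.fa > consensus.fixed.fa
replace_base() {
  awk -v pos="$1" -v old="$2" -v repl="$3" '
    /^>/ { print; next }                      # copy the header unchanged
    {
      len = length($0)
      if (pos + 0 > seen && pos + 0 <= seen + len) {
        k = pos - seen                        # offset of POS within this line
        if (toupper(substr($0, k, 1)) != old) {
          print "base at " pos " is not " old > "/dev/stderr"
          bad = 1
        } else {
          $0 = substr($0, 1, k - 1) repl substr($0, k + 1)
        }
      }
      seen += len                             # bases consumed so far
      print
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

For the example in this guide, the call would be replace_base 6459 R A < consensus.fa > consensus.fixed.fa. Writing to a new file keeps the original consensus untouched for comparison.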

One thing to keep in mind is the frequency threshold. Do not modify the nucleotide at a position if the frequency of the most frequent nucleotide is less than 75%. Previously, the threshold in the CZ ID pipeline was 90%; it was changed to 75% a few months ago (see the exact context on GitHub). It is always better to stay up to date with the standards accepted by the scientific community (Hint: maintain your CZ ID pipeline by continuously updating it!).
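The threshold rule can be written down as a small check, which is useful when many positions need to be reviewed. This is only an illustrative sketch: the function name resolve_mixed_site is hypothetical, and the read counts for A, C, G, and T are assumed to be typed in by hand from the IGV coverage pop-up. The threshold is an optional fifth argument that defaults to 75, so it can be adjusted if the accepted standard changes again.

```shell
# Hypothetical helper: given read counts for A, C, G, T at one position, print
# the most frequent base and its percentage if it reaches the threshold,
# otherwise print "keep" (leave the ambiguity code as it is).
# Usage: resolve_mixed_site A_count C_count G_count T_count [threshold_percent]
resolve_mixed_site() {
  awk -v a="$1" -v c="$2" -v g="$3" -v t="$4" -v thr="${5:-75}" 'BEGIN {
    a += 0; c += 0; g += 0; t += 0; thr += 0   # force numeric values
    total = a + c + g + t
    if (total == 0) { printf "keep\t0.0\n"; exit }
    best = "A"; max = a
    if (c > max) { best = "C"; max = c }
    if (g > max) { best = "G"; max = g }
    if (t > max) { best = "T"; max = t }
    pct = 100 * max / total
    if (pct >= thr) printf "%s\t%.1f\n", best, pct
    else            printf "keep\t%.1f\n", pct
  }'
}
```

For instance, counts of 90 reads with A and 10 with G pass the 75% threshold, so the call resolve_mixed_site 90 0 10 0 reports A, while a 60/40 split reports "keep".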

We hope this helps you improve the QC of your consensus genomes.
