Identification of Novel Missense Mutations in a Large Number of Recent SARS-CoV-2 Genome Sequences
Hugh Y Cai1*, Kimberly K Cai2, Julang Li3*
1Animal Health Laboratory, University of Guelph, Guelph, Ontario, Canada
2Department of Family Medicine, McMaster University, Hamilton and Forbes Park Medical Centre, Cambridge, Ontario, Canada
3Department of Animal Biosciences, University of Guelph, Guelph, Ontario, Canada
*Corresponding Author: Hugh Y Cai, Animal Health Laboratory, University of Guelph, Guelph, Ontario, Canada
*Corresponding Author: Julang Li, Department of Animal Biosciences, University of Guelph, Guelph, Ontario, Canada
Received: 19 May 2020; Accepted: 08 June 2020; Published: 20 July 2020
Hugh Y Cai, Kimberly K Cai, Julang Li. Identification of Novel Missense Mutations in a Large Number of Recent SARS-CoV-2 Genome Sequences. Journal of Biotechnology and Biomedicine 3 (2020): 93-103.
Background: SARS-CoV-2 infection has spread to over 200 countries since it was first reported in December of 2019. Significant country-specific variations in infection and mortality rate have been noted. Although country-specific differences in public health response have had a large impact on infection rate control, it is currently unclear as to whether evolution of the virus itself has also contributed to variations in infection and mortality rate. Previous studies on SARS-CoV-2 mutations were based on the analysis of ~ 160 SARS-CoV-2 sequences available until mid-February 2020. By mid-April, > 550 SARS-CoV-2 sequences had been deposited in GenBank, and over 8,200 in the GISAID database.
Methods: We analyzed 474 SARS-CoV-2 genomes submitted to GenBank up to April 11, 2020, by multiple alignment using the Map to a Reference Assembly function followed by variant/SNP identification. The results were verified on a larger scale with 8,126 hCoV-19 (SARS-CoV-2) sequences from the GISAID database.
Results: We identified 5 frequent new mutations that have emerged since late February 2020 and are present in many isolates (up to 40%). These are one missense (non-synonymous) mutation each in orf1ab (C1059T), orf3a (G25563T), and orf8 (C27964T), each leading to an amino acid substitution in the viral protein sequence; one mutation in the 5’UTR (C241T); and one in a non-coding region (G29553A). G29553A was found to be almost exclusive to the US isolates. Except for C241T, all the novel mutations identified are absent from the isolates from Italy and Spain among the SARS-CoV-2 genomes deposited in GenBank and GISAID by April 13, 2020.
Conclusion: The results of the current study indicate that new mutations are emerging as the COVID-19 pandemic spreads to different countries and that geography-specific mutants may exist.
COVID-19, SARS-CoV-2, Virus, Mutation, Polymorphism, Genome sequencing
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), an RNA coronavirus, is the pathogen that causes coronavirus disease 2019 (COVID-19). Since it was first reported to the WHO in December 2019, it has spread to 213 countries, areas, or territories, causing 2,356,414 confirmed infections and 160,120 deaths worldwide (WHO, April 20, 2020). The COVID-19 outbreak began in China, with 84,237 confirmed cases and 4,642 deaths; it then seriously hit Italy, with 178,972 confirmed cases and 23,660 deaths, and Spain, with 195,944 confirmed cases and 20,453 deaths, and more recently the US, with 723,605 confirmed cases and 34,203 deaths, as well as other countries (WHO, April 20, 2020). Although country-specific differences in public health response have had a large impact on infection rate control, it is currently unclear whether evolution of the virus itself has also contributed to variations in infection and mortality rate.
RNA viruses possess a high mutation rate, ranging from 10⁻⁴ to 10⁻⁶ mutations per round of genome replication. Over 45 mutations have been described since the first SARS-CoV-2 sequence was identified in January 2020 [2-5]. However, these previous studies were based on the analysis of ~160 SARS-CoV-2 sequences available until mid-February 2020 [2-5]. By mid-April, >550 SARS-CoV-2 sequences had been deposited in GenBank, and over 8,200 in the GISAID database. The geographic sources of the sequences have also changed significantly. To provide an up-to-date view of the genetic variation of SARS-CoV-2, we retrieved 474 complete or close-to-complete genomes (>29,100 nt) from the National Center for Biotechnology Information (NCBI) to search for novel and high-frequency mutations. The GenBank SARS-CoV-2 genomes were compared with those in the GISAID hCoV-19 database, consisting of 8,008 complete or close-to-complete SARS-CoV-2 genomes (>29,100 nt). We discovered that many SARS-CoV-2 isolates possessed mutations that had not been described previously.
On April 11, 2020, there were 547 SARS-CoV-2 sequences deposited in GenBank, from which we downloaded 474 complete or near-complete genomes of 29,161 to 29,866 nucleotides (nt) (hereafter referred to as full genomes), including 358 from the US, 64 from China, 24 from Spain, and 27 from other countries or regions (Table 1). The SARS-CoV-2 isolate Wuhan-Hu-1, collected on December 19, 2019 and deposited in GenBank in January 2020 (GenBank accession NC_045512), was used as the reference for mutation analysis. All nucleotide position labeling in our study was based on the alignment with this sequence. The 474 SARS-CoV-2 full genome sequences downloaded from GenBank were aligned with the bioinformatics software Geneious v.11 (Auckland, New Zealand) using the Map to a Reference Assembly function. The aligned sequences were visually examined to confirm that they were aligned properly. Variants/SNPs were identified automatically by the software and verified by visual confirmation. Short fragments (30 nt) containing the novel mutations identified in our study were used as queries in BLAST searches against the sequences downloaded from GenBank to verify the existence of the mutations. To verify our findings on a larger scale, 8,126 hCoV-19 (SARS-CoV-2) sequences were downloaded from the GISAID (Global Initiative on Sharing All Influenza Data) website (https://www.gisaid.org) and analyzed with the same methods as described above for the GenBank sequences.
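The variant identification step can be illustrated with a minimal sketch. This is not the Geneious workflow used in the study; it only shows the general idea of comparing each aligned genome with the reference and tallying substitutions. The function names (`call_snps`, `snp_frequencies`) and the 10-nt toy sequences are hypothetical, and the sketch assumes that every sample has already been aligned to the reference so that all strings have the same length.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def call_snps(reference, sample):
    """Return substitutions such as 'G3T' between two aligned sequences.

    Positions are 1-based and counted along the reference, so gap columns
    in the reference do not advance the position. Columns in which either
    sequence has a gap or an ambiguous base (e.g. N) are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    ref_pos = 0
    for ref_base, alt_base in zip(reference.upper(), sample.upper()):
        if ref_base != "-":
            ref_pos += 1
        if ref_base in VALID_BASES and alt_base in VALID_BASES and ref_base != alt_base:
            snps.append(f"{ref_base}{ref_pos}{alt_base}")
    return snps


def snp_frequencies(reference, samples):
    """Return the fraction of samples that carry each substitution."""
    counts = Counter()
    for sequence in samples.values():
        counts.update(call_snps(reference, sequence))
    return {snp: n / len(samples) for snp, n in counts.items()}


# Toy data (hypothetical, not real SARS-CoV-2 sequence).
reference = "ACGTACGTAC"
samples = {
    "isolate_1": "ACGTACGTAC",  # identical to the reference
    "isolate_2": "ACTTACGTAC",  # G3T
    "isolate_3": "ACTTACGNAC",  # G3T; the ambiguous N at position 8 is ignored
    "isolate_4": "AC-TACGTAT",  # gap at position 3 is skipped; C10T
}
frequencies = snp_frequencies(reference, samples)
```

With real data, substitutions that reach a high frequency in such a tally would still need the visual and BLAST-based confirmation described above.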
3. Results and Discussion
3.1 Identification of 5 novel mutations
From the 474 sequences available in GenBank, a group of 100 SARS-CoV-2 genomes were found to have a G-to-T mutation at nucleotide 25563 (G25563T). The mutation was exclusive to US isolate sequences collected since March 2020 in GenBank (downloaded April 11, 2020). These new mutants account for 21.1% (100/474) of all full genome sequences submitted to GenBank, or 27.9% (100/358) of the US full genome sequences in GenBank. Most of the G25563T isolates (94/100) also possessed a C1059T mutation. Moreover, 16 of the G25563T isolates had an additional C27964T mutation, accounting for 3.4% (16/474) of all full genome GenBank sequences, or 4.5% (16/354) of the US full genome sequences in GenBank. Among all 474 full genome sequences in GenBank, 48 collected from the US in March 2020 have a G29553A mutation. In addition, a C241T mutation was found in 30.8% (109/354) of the US isolates, collected mostly in March 2020. The GenBank accessions of the isolates in which we found the novel mutations are shown in Supplement Table S1. Of the 5 mutations described above, 3 are substitutions in coding regions that change the amino acid sequence (missense, or non-synonymous, mutations): C1059T, which changes amino acid 265 from T to I (T265I) in orf1ab; G25563T (Q57H) in orf3a; and C27964T (S24L) in orf8. The G29553A mutation is in a noncoding region upstream of orf10; the C241T mutation is in the 5’ untranslated region (5’UTR). To our knowledge, these mutations have not been described previously, and they were found only in isolates submitted mostly in and after March 2020 (including a few isolates in late February; Table 1). Representative images of the 5 mutations are shown in Supplement Figure S1.
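The relationship between the nucleotide positions and the amino acid numbers reported above can be checked with a short sketch. The ORF start coordinates used here (266 for orf1ab, 25393 for orf3a, 27894 for orf8) are an assumption taken from the commonly cited annotation of the Wuhan-Hu-1 reference and are not stated in this article; the helper names are hypothetical. The codon table is the standard genetic code. For Q57H, the reference codon can be inferred from the reported data alone: a glutamine codon whose third base is G must be CAG.

```python
# Standard genetic code, enumerated in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# ASSUMPTION: 1-based start coordinates of each ORF on the reference genome;
# these values are not given in the article.
ORF_STARTS = {"orf1ab": 266, "orf3a": 25393, "orf8": 27894}


def codon_location(nt_pos, orf_start):
    """Map a 1-based genome position to (amino acid number, offset in codon).

    The amino acid number is 1-based; the offset is 0, 1, or 2.
    """
    offset = nt_pos - orf_start
    if offset < 0:
        raise ValueError("position lies upstream of the ORF start")
    return offset // 3 + 1, offset % 3


def substitution_effect(ref_codon, codon_offset, alt_base):
    """Return (reference aa, mutant aa, effect) for a single-base change."""
    alt_codon = ref_codon[:codon_offset] + alt_base + ref_codon[codon_offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"
    else:
        effect = "missense"
    return ref_aa, alt_aa, effect


locations = {
    "C1059T": codon_location(1059, ORF_STARTS["orf1ab"]),
    "G25563T": codon_location(25563, ORF_STARTS["orf3a"]),
    "C27964T": codon_location(27964, ORF_STARTS["orf8"]),
}
# G25563T falls on the third base of codon 57 of orf3a: CAG (Q) -> CAT (H).
q57h = substitution_effect("CAG", locations["G25563T"][1], "T")
```

Under these assumed coordinates, the three positions map to codons 265, 57, and 24, in agreement with the T265I, Q57H, and S24L labels used in this study.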
3.2 Proposed classification of the new SARS-CoV-2 isolates
Recently, SARS-CoV-2 isolates have been classified into 3 clusters (groups), namely groups A, B, and C, based on 3 mutations. The original isolates without mutations, collected in December 2019 from China, were classified as group A; isolates with the C8782T/Y and T28144C mutations were labeled group B (mutated from group A); and group B isolates that acquired the G26144T mutation were labeled group C. The isolates with the 3 non-synonymous (missense) mutations identified in our study did not fall into group A, B, or C, since they had many mutations on top of group A but had neither the marker mutations C8782T/Y and T28144C (group B) nor G26144T (group C). To be consistent with the recent cluster (group) classification, we classified the isolates with novel amino acid changes as follows: isolates with C1059T (T265I) and G25563T (Q57H), which usually co-exist, are group D; isolates with the C27964T (S24L) change are group E.
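The grouping rules described above can be written as a small decision function. This is an illustrative sketch only: the function name is hypothetical, and the order in which the markers are tested (E before D, D before C, C before B) is an assumption made here so that each isolate receives a single label; the article does not specify a precedence. The sketch also treats G25563T alone as sufficient for group D, and labels every isolate without a marker mutation as group A, which is a simplification.

```python
def classify_isolate(mutations):
    """Assign a group label (A-E) from a collection of substitution labels.

    ASSUMPTION: markers are tested in the order E, D, C, B, and anything
    without a marker mutation is reported as group A.
    """
    found = set(mutations)
    # C8782T/Y: Y is the IUPAC code for an ambiguous C-or-T call.
    has_8782 = "C8782T" in found or "C8782Y" in found
    if "C27964T" in found:
        return "E"
    if "G25563T" in found:  # usually co-occurs with C1059T
        return "D"
    if "G26144T" in found:
        return "C"
    if has_8782 and "T28144C" in found:
        return "B"
    return "A"


# Hypothetical isolates described by their substitutions.
examples = {
    "isolate_a": [],
    "isolate_b": ["C8782T", "T28144C"],
    "isolate_c": ["G26144T"],
    "isolate_d": ["C241T", "C1059T", "G25563T"],
    "isolate_e": ["C1059T", "G25563T", "C27964T"],
}
groups = {name: classify_isolate(muts) for name, muts in examples.items()}
```

A different precedence would change the label of isolates that carry markers of more than one group, for example those with both G25563T and C27964T.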
3.3 The emerging geographic locations of group D and E SARS-CoV-2 isolates
The earliest SARS-CoV-2 sequences were collected from China in December 2019 (Table 1). Of the 19 early sequences, 12 were group A, 2 were group B, and 5 were group C. These data suggest that most of the isolates in the early stage of the outbreak were group A, and that mutations to groups B and C existed as early as December 2019. Similarly, group A and B isolates were collected in Taiwan and India in January 2020, while only group A isolates were collected in Iran, Japan, Pakistan, Viet Nam, and Australia in January 2020 (Table 1). By the time the outbreak spread to Spain in February and March 2020, all isolates collected there and deposited in GenBank belonged to groups B and C. In the US, the SARS-CoV-2 isolates collected in the early stage (January 2020) were group A and B, each accounting for about 50% of the isolates (9 of 17 group A and 8 of 17 group B). However, in March, the percentage of group A isolates dropped dramatically to 5.7% (17/300); group B isolates and their group C variants together accounted for 62% (179/300) of the isolates submitted from the country (Table 1). More strikingly, ~1/3 of the US isolates from March 2020 have at least 2 of the mutations identified in the current study. The GenBank SARS-CoV-2 data (Table 1) show that the virus started mainly as group A, with a portion of variants having mutated into groups B and C by December 2019. Thereafter, most isolates were group B and C. New mutants of groups D and E then started to emerge, accounting for nearly 40% of the US isolates in March 2020.
Although it provides a fairly representative snapshot, the GenBank information is not a complete picture. As of April 13, 2020, 8,126 sequences were available in the GISAID hCoV-19 (SARS-CoV-2) database. To validate our GenBank findings, we retrieved all complete or near-complete genomes (>29,160 nt) from the GISAID hCoV-19 database. We analyzed these 8,008 genomes with a focus on the new mutations (Table 2).
In the GISAID hCoV-19 database, 17.7% (1,417/8,008) and 0.6% (50/8,008) of the genomes were group D and E isolates, respectively (Table 2). In addition, we identified the novel C241T mutation in 55.3% (4,427/8,008) of the genomes. Consistent with our finding from the GenBank sequences, 43% of the US isolates belong to group D. In addition, group D isolates are widely distributed; they account for a substantial share of the isolates submitted to the GISAID hCoV-19 database from late February to March 2020: Canada (21.7%, 28/129), UK (6.4%, 175/2,726), France (53.9%, 110/204), Iceland (17.3%, 104/601), Australia (16.9%, 66/391), Netherlands (11.1%, 65/585), Belgium (12.1%, 39/322), Luxembourg (37.2%, 32/86), and Finland (40%, 16/40). It is striking that no group D mutation was found in any of the SARS-CoV-2 isolates submitted by Italy (44) or Spain (105), although the outbreaks in those 2 countries were severe and occurred several weeks earlier than those in other parts of Europe and North America. We speculate that the group D mutations occurred in late February to early March 2020. Since group D isolates were found in multiple countries within a relatively short period of time, the mutations may have emerged in multiple countries independently. Among the 8,008 genomes in the GISAID hCoV-19 database, 50 (0.6%) had the C27964T (group E) mutation: 42 from the US, 2 from Canada, and 6 from Australia. Although this is a relatively small number, the mutation is in a coding region and results in an amino acid sequence change, and is thus also worth attention. The 6 Australian group E isolates differ from those collected from the US in that they did not have the group D mutations (G25563T and C1059T). Since the Australian group E isolates differ from the ones collected in the US and Canada, they possibly evolved in Australia independently. The group B mutations (C8782T/Y and T28144C) and the group C mutation (G26144T) were found in 29.5%, 30.5%, and 6.3%, respectively, of 95 isolates collected before February 14, 2020.
However, these mutations are absent from the genomes of the US group D and E isolates, suggesting that the US group D isolates evolved directly from the ancestral strains (group A). Another interesting finding of our study was the mutation G29553A. It was found in 1.4% (110/8,008) of the GISAID SARS-CoV-2 genomes worldwide, or 6.9% (109/1,591) of the US SARS-CoV-2 genomes. With the exception of one isolate from Iceland, the >100 G29553A isolates are exclusively from the US. The mutation is in a noncoding region of the virus genome, and its significance is currently unknown.
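The frequencies quoted for the GISAID data follow directly from the stated counts, as the following sketch shows. The counts and percentages are taken from the text above; the helper names and the data layout are hypothetical.

```python
def percent(count, total, digits=1):
    """Return count/total as a percentage rounded to the given digits."""
    return round(100 * count / total, digits)


# (count, total, percentage as reported in the text)
reported = {
    "group D, all genomes": (1417, 8008, 17.7),
    "group E, all genomes": (50, 8008, 0.6),
    "C241T, all genomes": (4427, 8008, 55.3),
    "G29553A, all genomes": (110, 8008, 1.4),
    "G29553A, US genomes": (109, 1591, 6.9),
    "group D, Canada": (28, 129, 21.7),
    "group D, UK": (175, 2726, 6.4),
    "group D, France": (110, 204, 53.9),
    "group D, Iceland": (104, 601, 17.3),
    "group D, Australia": (66, 391, 16.9),
    "group D, Netherlands": (65, 585, 11.1),
    "group D, Belgium": (39, 322, 12.1),
    "group D, Luxembourg": (32, 86, 37.2),
    "group D, Finland": (16, 40, 40.0),
}

# Labels whose recomputed percentage differs from the reported one.
mismatches = [
    label
    for label, (count, total, stated) in reported.items()
    if percent(count, total) != stated
]
```

An empty `mismatches` list means that each reported percentage agrees with its count and denominator at one decimal place.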
& C1059T and G25563T are both marker mutations of group D and mostly coexist; * four of the group B isolates have the C241T mutation; # six of the group B isolates have the C241T mutation; $ this isolate did not have the group B mutations; % these 7 group C isolates have no group B mutations; ^ previously described [2, 5]
Table 1: Novel mutations identified in GenBank SARS-CoV-2 genomes as of April 11, 2020.
Table 2: Novel mutations identified in GISAID hCoV-19 (SARS-CoV-2) genomes as of April 13, 2020.
3.4 The potential impact of the emergence of group D and E SARS-CoV-2 strains
The group D and group E defining mutations are found in orf3a and orf8, respectively, regions associated with the expression of accessory proteins. Accessory proteins are not required for viral replication but may affect viral virulence and pathogenesis. Orf3a is 72% conserved between SARS-CoV and SARS-CoV-2. Based on its function in SARS-CoV, it has been postulated that Orf3a is involved in cell apoptosis. Mutations in Orf3a in SARS-CoV-2 have also been shown to result in loss or change of epitopes, which may help the virus evade the host immune response. The missense mutations in these proteins may have clinical implications. First, patients who have already recovered from an earlier COVID-19 infection may have incomplete or reduced immunity when subsequently exposed to the newly emerging group D or group E SARS-CoV-2. Second, development of ELISA serologic testing must account for the potential epitope variability among different SARS-CoV-2 groups. Accuracy of serologic testing may be adversely affected by current and emerging mutations in these accessory proteins. Further studies on the biochemical and clinical impact of the Q57H substitution in orf3a (group D) and the S24L substitution in orf8 (group E), especially on viral virulence, pathogenesis, and host immune response, are warranted. Most group D isolates also carried the missense C1059T mutation in orf1ab (T265I). Orf1ab encodes a replicase that is involved in viral transcription and replication. It would be important to further elucidate the role of the T265I substitution in viral replication.
Global efforts to increase sequencing of SARS-CoV-2 isolates will be critical for mutation monitoring and clinical correlation. In addition to epidemiologic analysis, identifying new mutations in the SARS-CoV-2 isolates may, among other efforts, shed light on vaccine development and help in evaluating the current molecular testing protocols. Fortunately, none of the group D and E mutations that we identified were in the PCR targets of the protocols listed on the WHO website (WHO.int, accessed April 17, 2020). Update before submission of the manuscript to Journal of Biotechnology and Biomedicine on May 19, 2020: Among the 29,633 complete SARS-CoV-2 genomes in the GISAID hCoV-19 database, 6,367 (21.5%) had the G25563T (group D) mutation, found mostly in the US (3,244) and in some other countries including Spain (9), but none were from Italy; 516 (1.7%) had the C27964T (group E) mutation, 451 from the US, 3 from Canada, and 62 from Australia; 294 (1%) had the G29553A mutation, 293 from the US and 1 from Iceland.
The results of the current study indicate that new mutations are emerging as the COVID-19 pandemic spreads to different countries and that geography-specific mutants may exist. The findings of the current study lay the foundation for further investigation into the impact of SARS-CoV-2 mutations on disease incidence, severity, and host immune response. In addition, they may also provide insights into vaccine development and serological response detection for the virus.
Ethics Approval and Consent to Participate
Consent for Publication
All authors have read and approved the final version for publication.
Availability of Data and Materials
All sequence data used in this study are available from GenBank and GISAID. GenBank accessions of the isolates with novel mutations identified in this study can be found in Supplement Table S1.
The authors declare no competing interests.
Natural Sciences and Engineering Research Council of Canada (NSERC), Food from Thought and OMAFRA.
HYC and JL conceived the study. HYC collected and analyzed the data. HYC, KC, and JL co-interpreted the data and wrote the article. All authors reviewed and commented on the final version.
We gratefully acknowledge the authors and the originating and submitting laboratories of the sequences from GenBank and GISAID’s hCoV-19 Database on which this research is based. We thank Dr. Grant Maxie, University of Guelph, for editing the article. This article was posted online on April 28, 2020 (doi: 10.20944/preprints202004.0482.v1).
- Jenkins G, Rambaut A, Pybus O, et al. Rates of Molecular Evolution in RNA Viruses: A Quantitative Phylogenetic Analysis. J Mol Evol 54 (2002): 156-165.
- Forster P, Forster L, Renfrew C, et al. Phylogenetic network analysis of SARS-CoV-2 genomes. Proc Natl Acad Sci U S A 117 (2020): 9241-9243.
- Pachetti M, Marini B, Benedetti F, et al. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant. J Translational Med 179 (2020).
- Phan T. Genetic diversity and evolution of SARS-CoV-2. Infect Genet Evol 81 (2020): 104260.
- Wang C, Liu Z, Chen Z, et al. The establishment of reference sequence for SARS-CoV-2 and variation analysis. J Med Virol 92 (2020): 667-674.
- Wu F, Zhao S, Yu B, et al. A new coronavirus associated with human respiratory disease in China. Nature 579 (2020): 265-269.
- Issa E, Merhi G, Panossian B, et al. SARS-CoV-2 and ORF3a: Non-Synonymous Mutations and Polyproline Regions. MSystems 5 (2020): 266.
- Wu C, Liu Y, Yang Y, et al. Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods. Acta Pharmaceutica Sinica B 10 (2020): 766-788.
- Liu DX, Fung TS, Chong KKL, et al. Accessory proteins of SARS-CoV and other coronaviruses. Antiviral Research 109 (2014): 97-109.