Abstracting and Indexing

  • Google Scholar
  • CrossRef
  • WorldCat
  • ResearchGate
  • Academic Keys
  • DRJI
  • Microsoft Academic
  • Academia.edu
  • OpenAIRE

Algorithm for Selecting Potential SARS-CoV-2 Dominant Variants based on POS-NT Frequency

Article Information

Eunhee Kang1, TaeJin Ahn*, 1 and Taesung Park*, 2

1Department of Life Science, Handong Global University, Pohang, Republic of Korea

2Department of Statistics, Seoul National University, Seoul, Republic of Korea

*Corresponding author: Taesung Park, Department of Statistics, Seoul National University, Seoul, Republic of Korea.

TaeJin Ahn, Department of Life Science, Handong Global University, Pohang, Republic of Korea

Received: 29 February 2024; Accepted: 08 March 2023; Published: 27 March 2024

Citation:

Eunhee Kang, TaeJin Ahn and Taesung Park. Algorithm for Selecting Potential SARSCoV- 2 Dominant Variants based on POS-NT Frequency. Archives of Microbiology and Immunology. 8 (2024): 101-117.

View / Download Pdf Share at Facebook

Abstract

Coronavirus disease 19 (COVID-19), currently prevalent worldwide, is caused by a novel coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Similar to other RNA viruses, SARS-CoV-2 continues evolving through random mutations, creating numerous variants, including Alpha, Beta, and Delta. It is, therefore, necessary to predict the mutations constituting the dominant variant before they are generated. This can be achieved by continuously monitoring the mutation trends and patterns. Hence, we sought to design a dominant variant candidate (DVC) selection algorithm in the current study. To this end, we obtained COVID-19 sequence data from GISAID and extracted position-nucleotide (POS-NT) frequency ratio data by country and date through data preprocessing. We then defined the dominant dates for each variant in the USA and developed a frequency ratio prediction model for each POS-NT. Based on this model, we applied DVC criteria to build the selection algorithm, which was verified for Delta and Omicron. Using Condition 3 as the DVC criterion, 69 and 102 DVC POS-NTs were identified for Delta and Omicron an average of 47 and 82 days before the dominant dates, respectively. Moreover, 13 and 44 Delta- and Omicron-defining POS-NTs were recognized 18 and 25 days before the dominant dates, respectively. We identified all DVC POS-NTs before the dominant dates, including rapidly and gently increasing POS-NTs. Considering that we successfully defined all POS-NT mutations for Delta and Omicron, the DVC algorithm may represent a valuable tool for providing early predictions regarding future variants, helping improve global health.

Keywords

BAM binary alignment/map, COVID Coronavirus disease 19, DVC dominant variant candidate, GISAID Global Initiative for Sharing All Influenza Data, POS-NT position-nucleotide, SAM sequence alignment/map, SARS-CoV-2 severe acute respiratory syndrome coronavirus 2

Article Details

Abbreviation:

BAM binary alignment/map

COVID Coronavirus disease 19

DVC dominant variant candidate

GISAID Global Initiative for Sharing All Influenza Data

POS-NT position-nucleotide

SAM sequence alignment/map

SARS-CoV-2 severe acute respiratory syndrome coronavirus 2

Background

The recent Coronavirus disease 19 (COVID-19) pandemic, caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, has had severe implications worldwide. Continuous mutations in the genome generate new variants, enabling the virus to thwart disease control measures. Next-generation sequencing technology is widely employed to characterize the genetic SARS-CoV-2 variants. Owing to the contributions of many researchers, SARS-CoV-2 genomic data has been collected from infected individuals worldwide. GISAID is a database that stores and provides sequenced SARS-COV-2 genomes along with basic metadata, including the sequencing date and location. GISAID also visually presents the status of SARS-CoV-2 variant spread in a geographical and time-dependent manner. In particular, predicting the emergence of a novel variant is critical to identifying new potential outbreaks capable of evading the current diagnostic and vaccine strategies.

In this study, we provide a prediction model that estimates whether a single SARS-COV-2 mutation is a prominent factor in determining disease severity in infected patients. This functionality is helpful in disease control in several aspects. First, single mutations may be associated with known clinical characteristics, such as symptom severity, incubation period, and morbidity rate. Second, mutations in PCR primer binding regions can be used to estimate if an infectious virus evades diagnostic methods. Third, single mutations help assess the vaccination efficacy of the designed epitope.

SNPs defining Delta and Omicron variants

Based on the WHO nomenclature system (GISAID, Pango lineage, Nextstrain clade): Alpha (GRY, B.1.1.7, 20I (V1)), Beta (GH/501Y. V2, B.1.351, 20H (V2)), Gamma (GR/501Y.V3, P.1, 20J(V3)), Delta (G/478K.V1, B.1.617.2, 21A-21I-21J), and Omicron (GR/484A, B.1.1.529, 21K-21L-21M-22A-22B-22C-22D) (WHO: https://www.who.int/activities/tracking-SARS-CoV-2-variants) SARS-CoV-2 strains have arisen due to mutations in the genomic sequence. The SARS-CoV-2 genome comprises 29,903 nucleotides, encoding 12 proteins (ORF1a/1ab, S, ORF3a, ORF3b, E, M, ORF6, ORF7a, ORF7b, ORF8, and ORF10*). These mutations are caused by single nucleotide changes, i.e., replacement, insertion, or deletion, leading to changes in the amino acid sequence (Fig 1; GISAID: https://gisaid.org/). Figure 1 presents a genome sequence map of SARS-Cov-2 and the major mutational positions of several variants. In this study, we attempted to predict the Delta and Omicron variants, i.e., the most recent dominant SARS-CoV-2 variants. According to Pango (Pango cov-lineages: https://cov-lineages.org/), the Delta and Omicron variants have 13 and 47 defining SNPs, respectively (Tables 1 and 2).

fortune-biomass-feedstock

Figure 1: SARS-CoV-2 genome sequence map. The 29,903 nucleotide positions are shown in the context of the 12 encoded proteins. The main mutations of each dominant variant are shown; pink: Alpha mutation, green: Delta mutation, blue: Omicron mutation.

Table 1: Delta defining position-nucleotides

 

Amino acid

Nucleotide

1

T19R

C21618G

2

T478K

C22995A

3

L452R

T22917G

4

D950N

G24410A

5

P681R

C23604G

6

D377Y

G21987A

7

S26L

T26767C

8

I82T

T27638C

9

V82A

C27752T

10

R203M

C25469T

11

T120I

G29402T

12

G142D

A28461G

13

D63G

G28881T

Table 2: Omicron defining position-nucleotides

 

Amino acid

Nucleotide

1

6513>3

 

2

11283>9

 

3

C241T

C241T

4

K856R

A2832G

5

C3037T

C3037T

6

T5386G

T5386G

7

A2710T

G8393A

8

T3255I

C10029T

9

P3395H

C10449A

10

I3758V

A11537G

11

T13195C

T13195C

12

P314L

C14408T

13

C15240T

C15240T

14

I1566V

A18163G

15

A67V

C21762T

16

T95I

C21846T

17

G339D

G22578A

18

S371L

T22673C

19

S373P

T22679C

20

K417N

G22813T

21

N440K

T22882G

22

G446S

G22898A

23

S477N

G22992A

24

T478K

C22995A

25

E484A

A23013C

26

Q493R

A23040G

27

G496S

G23048A

28

Q498R

A23055G

29

N501Y

A23063T

30

T547K

C23202A

31

D614G

A23403G

32

H655Y

C23525T

33

N679K

T23599G

34

P681H

C23604A

35

N764K

C23854A

36

D796Y

G23948T

37

N856K

C24130A

38

Q954H

A24424T

39

N969K

T24469A

40

C25000T

C25000T

41

T9I

C26270T

42

D3G

A26530G

43

Q19E

C26577G

44

A63T

G26709A

45

A27259C

A27259C

46

C27807T

C27807T

47

RG203KR

GGG28881AAC

COVID-19 sequence data from GISAID

The Global Initiative for Sharing All Influenza Data (GISAID) provides a database of nucleotide sequence information and related epidemiological information for all influenza viruses and COVID-19-causing coronaviruses. GISAID provides multiple SARS-CoV-2 sequence data analyses collected worldwide, as well as sequence alignments, diagnostic primer and probe coordinates, 3D protein models, drug targets, and phylogenetic trees. In this study, global SARS-CoV-2 sequence data were obtained from GISAID on February 22, 2022; 8,474,962 sequence data were obtained from December 1, 2019, to February 22, 2022 (GISAID: https://gisaid.org/).

Materials and Methods

Data preprocessing and formatting

COVID-19 sequence data obtained in a FASTA file format from GISAID were converted from a multiline to a single-line format; only complete sequences corresponding to > 29,000 bp were extracted. We secured sequence data by country and date through the GISAID unique ID, country, collection date, and sequence information in the header of the sequence data. In this study, countries with the most sequencing data, namely, the USA (2,702,068), UK (1,936,958), and Germany (415,309), as well as Korea, were analyzed. The sequence data obtained by country were mapped to the original sequence (NC_045512) to obtain a sequence alignment/map (SAM) file. The binary alignment/map (BAM) file was then converted to a binary format using SAM tools to reduce the file size. From the generated BAM file, sequencing reads were synthesized for each position of the original sequence to determine whether bases differed from the original data; the mutation data was extracted in a variant call format. The obtained data were used to extract information on the number of mutations and mutation frequency ratio information for each position in the sequence and to confirm the mutation trend by securing the frequency ratio data by country, date, and POS-NTs (total 29,903 positions × 4 nucleotides × 4 countries; Fig. 2).

fortune-biomass-feedstock

Figure 2: Frequency ratio data acquisition process by country, date, and POS-NT. Freq, frequency.

As it is overly computationally intensive to determine the trend for all mutations in SARS-COV-2 (i.e., a combination of 29,903 positions and 3 SNP mutations), each nucleotide position was subjected to additional preprocessing to remove those without mutations (reference frequency = 100), those with no information on the time point of the dominant variant, and those where the change in reference allele frequency was < 10%. Next, we created continuous data for date information for which frequency ratio information did not exist and the position where the total data date was < 50 days removed. Subsequently, cubic spline interpolation was used to fill in the data for which the frequency ratio information did not exist. We removed the reference allele from the four nucleotides as we were interested in mutations. An additional preprocessing step is shown in Figure 3.

fortune-biomass-feedstock

Figure 3: Additional preprocessing step to select POS-NTs for modeling.

DVC selection for the prediction model

We attempted to predict the mutations comprising the dominant variants by analyzing and predicting the Delta and Omicron variants. To confirm the trend of a specific POS-NT, a dominant variant selection time point was required. Moreover, we aimed to confirm whether the developed algorithm could identify all mutations constituting the Delta and Omicron variants at the dominant variant time point after determining the DVC POS-NT until the variant became dominant. Therefore, we attempted to define the dominant variant time-point (i.e., dominant date) for Delta and Omicron. We defined the strains that accounted for > 50% of all new COVID-19 cases as the dominant variants. However, information on the strain and lineage of the sequences was not available in the data provided by GISAID. Therefore, we proceeded with the lineage analysis provided by Pangolin, assigned a strain label, including Delta and Omicron, for each sequence, and secured the daily frequency ratio data of the strain. The strain that accounted for > 50% of all new COVID-19 cases was defined as the dominant variant and the corresponding time point was defined as the dominant date. Figure 4 illustrates the scheme determining the dominant date for each country and its variants. The dominant date was used as the time point for selecting the dominant variant using this algorithm and as a criterion for learning and prediction date windows for each variant. For the analysis and prediction of Delta mutations, the extracted data, including the alpha-dominant to the delta-dominant dates, were employed for each POS-NT. The analysis and prediction of Omicron mutation data from the Delta dominant to Omicron dominant date.

fortune-biomass-feedstock

Figure 4: Dominant variant time point (dominant date definition process). After detecting the sequence data for each country, the strain and lineage information for each sequence was allocated through lineage analysis provided by Pangolin. After securing daily rate data for each strain, those accounting for > 50% of all new COVID-19 cases were defined as dominant, and the corresponding time point was defined as the dominant variant time point (dominant date). The dominant date was used as the dominant variant selection time point and as a criterion for learning and prediction date windows for each variant in the algorithm.

Results

POS-NT frequency ratio prediction model

A POS-NT frequency ratio prediction model was developed to confirm the trend in each POS-NT frequency ratio. Gaussian Process Regression (GPR) is a powerful Bayesian-based, non-parametric kernel-based probabilistic model for regression analyses applied in exploration and utilization scenarios. It predicts the output of a new test set considering the novel input vectors of the test and training sets [1–3]. The most prominent advantage of GPRs is their ability to obtain the forecast uncertainty with the forecast value. In addition, GPR boasts computational efficiency and high accuracy and is suitable for other time series forecasting, such as weather forecasting [4]. Recently, the GPR model was used widely in predicting COVID-19 spread and deaths, exhibiting improved performance compared with other models [3, 5–7].

The following four patterns were identified for Delta and Omicron variant-defining mutations: (1) a gentle increase from a ratio of 0 (Fig. 5A); (2) consistent high-frequency ratio values (Fig. 5B); (3) a gentle increase through the dominant date, with a high-frequency ratio, of the previous variant (Fig. 5C); (4) soaring pattern (Fig. 5D). To identify the trend of a gently increasing pattern and soaring pattern, it was necessary to select optimal training and prediction dates. Therefore, to learn the soaring pattern trend, we applied the latest information to predict the future and modeled each learning and prediction combination until the dominant date for each variant (i.e., learn for 10 and 20 days and predict 3, 5, 8, and 10 days later; Fig. 6A). In the case of the Delta mutation, data from the Alpha-dominant to Delta-dominant dates were employed for the analysis window based on the variant. In the case of the Omicron mutation, data from the Delta-dominant date to the Omicron-dominant date were modeled (Fig. 6B).

fortune-biomass-feedstock

Figure 5: Time-dependent patterns of delta- and omicron-defining mutations. (A) Delta variant pattern: gentle increase from a ratio value of 0. (B–E) Omicron variant pattern; (B) high-frequency ratio values are consistently present; (C) gentle increase through the dominant date (with a high-frequency ratio) of the previous variant; (D) soaring pattern.

fortune-biomass-feedstock

Figure 6: Learning and prediction window selection. (A) Example training and prediction window for one POS-NT (training for 10 days and predicting after 3 days) and (B) Delta and Omicron analysis date time.

DVC selection algorithm

Based on the frequency ratio prediction model for each POS-NT, a dominant variant candidate selection algorithm (DVC selection algorithm) was developed by applying the dominant variant candidate criteria (DVC criteria; Fig. 7). We then determined whether all POS-NTs met the DVC criteria for each prediction time point; upon failing to meet the DVC criteria, the corresponding POS-NT was reanalyzed the next day. If it met the DVC criteria at that time point, the corresponding POS-NT was classified as DVC POS-NT. The DVC POS-NT was identified up to the dominant date of the variant, and then the identified DVC POS-NT was compared with the actual variant definition POS-NT list. Eight conditions were simulated to select the optimal DVC criteria based on the criteria for outliers in which the frequency of the DVC increased the next day (i.e., Criterion 2), and the measured value was higher than the predicted value (i.e., Criterion 4; Table 3). The DVC criteria defined the corresponding POS-NT as DVC POS-NT when all four detailed criteria were satisfied.

fortune-biomass-feedstock

Figure 7: DVC POS-NT Selection Algorithm and Combined DVC Criteria. When POS-NT ratio data at a specific point occur, predictions for the future can be made. If the DVC criteria are met, the corresponding POS-NT is identified as the DVC POS-NT. If it does not meet the DVC criteria at that time point, the POS-NT moves to the next point, and the analysis continues. Criteria 1: number of days in which all dominant variant candidate criteria were satisfied; Criterion 2: whether the observed frequency ratio increased the next day compared with the previous day; Criterion 3: threshold of the predicted frequency ratio; Criterion 4: Observed value greater than the predicted value.

Table 3: Eight DVC criteria combinations

 

Criteria 1

Criteria 2

Criteria 3

Criteria 4

Condition 1

3 days in a row

Freq ratioD – Freq ratioD-1 > 0

Pred freq ratio ≥ 30

Actual > Pred

Condition 2

3 days in a row

Freq ratioD – Freq ratioD-1 > 0

Pred freq ratio ≥ 20

Actual > Pred

Condition 3

3 days in a row

Freq ratioD – Freq ratioD-1 > 0

Pred freq ratio ≥ 10

Actual > Pred

Condition 4

3 days in a row

Freq ratioD – Freq ratioD-1 > 0

Pred freq ratio ≥ 5

Actual > Pred

Condition 5

2 days in a row

Freq ratioD – Freq ratioD-1 > 0

Pred freq ratio ≥ 30

Actual > Pred

Condition 6

2 days in a row

Freq ratioD – Freq ratioD-1 > 0

Pred freq ratio ≥ 20

Actual > Pred

Condition 7

2 days in a row

Freq ratioD – Freq ratioD-1 > 0

Pred freq ratio ≥ 10

Actual > Pred

Condition 8

2 days in a row

Freq ratioD – Freq ratioD-1 > 0

Pred freq ratio ≥ 5

Actual > Pred

The DVC criteria define the corresponding POS-NT as the DVC POS-NT when all four detailed criteria are satisfied. Criteria 2 and 4 are fixed, and Criteria 1 and 3 are manipulated to simulate each combination. Freq ratioD: frequency ratio at time point D (current); Freq ratioD-1: frequency ratio at time point D-1 (previous day); Actual: actual frequency ratio; Pred: predicted frequency ratio.

A visual summary of the methodology is shown in Figure 8. The SARS-CoV-2 sequence data from GISAID were formalized to secure frequency ratio information for each POS-NT by country and date. A time-series forecasting model was developed using the time-series frequency ratio data obtained using POS-NT. Over time, learning and prediction progressed to the dominant date for each variant. For each prediction date, the DVC POS-NT selection algorithm was applied to all POS-NTs to secure DVC POS-NT for each variant. When all DVC POS-NTs were selected until the dominant date for each variant, they were compared with the actual variant-defining POS-NT to determine the number of days preceding the dominant date of the average number of variant-defining POS-NTs. We also compared the prediction results with the actual variant-defining POS-NT, to determine how many variant-defining POS-NTs could be identified, on average, how many days ago.

fortune-biomass-feedstock

Figure 8: Visual summary of the methodology.

Confirmation metric for the results

The following four metrics were used to confirm the results: (1) number of DVC POS-NTs identified by the algorithm developed in this study, i.e., candidate count; (2) average number of days for identification; (3) number of POS-NTs corresponding to the POS-NTs that define the actual variant (candidate∩actual) among the identified DVC POS-NTs; (4) ratio of the number of POS-NTs corresponding to the actual variant-defining POS-NTs among the identified DVC POS-NTs (Eq. (1)). Upon identifying all actual variant-defining POS-NT, the Candidate∩Actual value will be incremented. The algorithm can sensitively identify the DVC POS-NT as the ratio value increases.

image

POS-NT frequency ratio data for modeling

The number of nucleotide positions for the final model and the total number of models (POS-NT) are listed in Table 4. In this study, the USA data were analyzed for the first time. Prediction modeling was performed with the frequency ratio data for 6951 POS-NTs of Delta and 6990 of Omicron variants.

Table 4: Number of nucleotides removed during the preprocessing process and the number of POS-NTs to be modeled.

Table 4-1: Number of positions removed during the preprocessing process

 

Delta

Omicron

 

USA

UK

Korea

Germany

USA

UK

Korea

Germany

POS where mutation do not occur

193

2399

15019

2375

193

2399

15019

2374

POS without dominant date data

1008

3478

6063

2292

99

327

274

240

POS with < 10% variance in REF allele freq ratio

26,382

19,962

4035

8461

27,281

23,110

9730

10,507

POS with date count < 50 days

3

0

89

1

0

0

22

0

Table 4-2: Number of POS-NTs to be modeled (the number of total models)

 

USA

UK

Korea

Germany

USA

UK

Korea

Germany

Number of POS for modeling

2317

4064

4697

16773

2330

4067

4858

16782

Number of POS-NTs for modeling

6951

12,192

14,091

50,319

6990

12,201

14,574

50,346

Position (POS), Frequency (freq).

Dominant date by country and variant

In the USA, the Delta and Omicron variants were confirmed as the dominant variants on June 21 and December 18, 2021, respectively (Fig. 9A). In the UK, Delta emerged as the dominant variant on May 15, 2021, and Omicron on December 14, 2021 (Fig. 9B). In Korea, Delta emerged as the dominant variant on July 4, 2021, and Omicron on January 5, 2022 (Fig. 9C). In Germany, the Delta mutation was defined as the dominant variant on June 13, 2021, while the Omicron mutation accounted for > 50% of new COVID-19 cases on December 28, 2021 (Fig. 9D). We used the dominant date as the dominant variant selection time point for this algorithm and as the criterion for the learning and prediction date windows for each variant.

fortune-biomass-feedstock

Figure 9: Definition of dominant dates for Delta and Omicron by country. (A) In the USA, Delta became the dominant variant on June 21, 2021, and Omicron on December 18, 2021; (B) in the UK, Delta became the dominant variant on May 15, 2021, and Omicron on December 14, 2021; (C) in Korea, Delta became the dominant variant on July 4, 2021, and Omicron on January 5, 2022; (D) in Germany, Delta was defined as the dominant variant on June 13, 2021, and Omicron accounted for more than 50% of all new COVID-19 cases on December 28, 2021.

POS-NT frequency ratio prediction model

The prediction results for each learning and prediction date combination, i.e., 10- and 20-day training and prediction after 3, 5, 8, and 10 days, were confirmed. Figures 10 and 11 show the Delta and Omicron predictions for a model trained for 20 days and predicted three days after the learning period. The results for the learning and prediction for other combinations are shown in Figures S1–14. It was confirmed that as the number of forecast days decreased, the forecast trend improved (after 3 days > after 5 days > after 8 days > after 10 days).

fortune-biomass-feedstock

Figure 10: Delta: Results of learning for 20 days and predicting 3 days later. TR: learning dates (training dates), TS: test dates.

fortune-biomass-feedstock

Figure 11: Omicron: Results of learning for 20 days and predicting 3 days later. TR: learning dates (training dates), TS: test dates.

POS-NT identification with the algorithm

Based on the developed frequency ratio prediction model for each POS-NT, eight combinations of DVC criteria were applied to identify DVC POS-NT until the dominant date of each variant and were compared with the actual variant-defining POS-NT. In addition, the number of days ago, on average, that the POS-NT was identified as a DVC POS-NT and the ratio of the identified POS-NT corresponding to the actual variant-defining POS-NT to the identified DVC POS-NT was confirmed (Eq. (1)). Table S1 provides the learning dates, prediction dates, number of POS-NTs recognized as DVC POS-NTs by condition and the average number of days for identifying Delta mutation for all combinations of learning and prediction dates and the eight DVC conditions. Table S2 shows Delta-like information for Omicron.

The optimal DVC criterion was specified when two conditions were satisfied: (1) identify all variant-defining POS-NTs in Delta and Omicron, and (2) have the highest ratio (Eq. (1)). In the case of Delta mutation, all Delta-defined POS-NTs were identified in 39 model-specific and DVC criteria combinations and showed the highest ratio values in prediction using Condition 2, 3 days after the 20-day learning period. In the case of the Omicron mutation, all Omicron-defined POS-NTs were identified in 11 model-specific and DVC criteria combinations and showed the highest ratio values in prediction using Condition 3, 3 days after the 20-day learning period. As a result, when using the frequency ratio prediction model that learns for 20 days and predicts 3 days later and the DVC selection algorithm using Condition 3 (3 days in a row, difference between the frequency ratio of the current and previous day is ≥ 0, predicted frequency is > 10%, and measured value exceeds the predicted value), all variant-defining POS-NTs are identified for Delta and Omicron with the highest ratio (Eq. (1), Table 5).

Table 5: Combination results of POS-NT identification model and DVC criteria for all variant definitions.

Table 5-1: Combination results of model and DVC criteria that identify all Delta-defined POS-NTs.

 

TR_TS date

Condition

Candidate Count

Candidate∩Actual

Ratio

1

TR_20_TS_3

Condition 2

45 (49 days ago)

13 (10 days ago)

0.288889

2

TR_20_TS_10

Condition 2

49 (44 days ago)

13 (7 days ago)

0.265306

3

TR_10_TS_8

Condition 6

56 (48 days ago)

13 (11 days ago)

0.232143

4

TR_20_TS_8

Condition 3

58 (33 days ago)

13 (16 days ago)

0.224138

5

TR_10_TS_10

Condition 6

60 (46 days ago)

13 (9 days ago)

0.216667

6

TR_10_TS_10

Condition 3

61 (43 days ago)

13 (14 days ago)

0.213115

7

TR_10_TS_3

Condition 6

61 (47 days ago)

13 (13 days ago)

0.213115

8

TR_20_TS_10

Condition 6

61 (47 days ago)

13 (8 days ago)

0.213115

9

TR_20_TS_3

Condition 6

61 (47 days ago)

13 (12 days ago)

0.213115

10

TR_20_TS_10

Condition 3

63 (44 days ago)

13 (15 days ago)

0.206349

11

TR_10_TS_8

Condition 3

63 (40 days ago)

13 (15 days ago)

0.206349

12

TR_10_TS_5

Condition 3

67 (47 days ago)

13 (18 days ago)

0.19403

13

TR_20_TS_5

Condition 3

67 (47 days ago)

13 (18 days ago)

0.19403

14

TR_20_TS_3

Condition 3

69 (47 days ago)

13 (18 days ago)

0.188406

15

TR_10_TS_3

Condition 3

69 (46 days ago)

13 (18 days ago)

0.188406

16

TR_10_TS_10

Condition 4

80 (43 days ago)

13 (19 days ago)

0.1625

17

TR_10_TS_8

Condition 4

83 (42 days ago)

13 (23 days ago)

0.156627

18

TR_10_TS_10

Condition 7

97 (47 days ago)

13 (16 days ago)

0.134021

19

TR_20_TS_10

Condition 7

97 (46 days ago)

13 (16 days ago)

0.134021

20

TR_10_TS_3

Condition 4

97 (49 days ago)

13 (26 days ago)

0.134021

21

TR_20_TS_8

Condition 4

97 (42 days ago)

13 (26 days ago)

0.134021

22

TR_20_TS_8

Condition 7

98 (47 days ago)

13 (17 days ago)

0.132653

23

TR_10_TS_5

Condition 7

99 (50 days ago)

13 (19 days ago)

0.131313

24

TR_10_TS_5

Condition 4

99 (51 days ago)

13 (26 days ago)

0.131313

25

TR_20_TS_10

Condition 4

100 (46 days ago)

13 (24 days ago)

0.13

26

TR_10_TS_8

Condition 7

101 (47 days ago)

13 (21 days ago)

0.128713

27

TR_20_TS_5

Condition 4

101 (53 days ago)

13 (27 days ago)

0.128713

28

TR_20_TS_5

Condition 7

102 (47 days ago)

13 (20 days ago)

0.127451

29

TR_20_TS_3

Condition 7

102 (51 days ago)

13 (20 days ago)

0.127451

30

TR_20_TS_3

Condition 4

102 (53 days ago)

13 (27 days ago)

0.127451

31

TR_10_TS_3

Condition 7

103 (51 days ago)

13 (20 days ago)

0.126214

32

TR_10_TS_10

Condition 8

122 (51 days ago)

13 (24 days ago)

0.106557

33

TR_10_TS_8

Condition 8

126 (54 days ago)

13 (28 days ago)

0.103175

34

TR_10_TS_5

Condition 8

128 (54 days ago)

13 (28 days ago)

0.101562

35

TR_20_TS_8

Condition 8

131 (55 days ago)

13 (27 days ago)

0.099237

36

TR_10_TS_3

Condition 8

131 (55 days ago)

13 (28 days ago)

0.099237

37

TR_20_TS_10

Condition 8

131 (55 days ago)

13 (26 days ago)

0.099237

38

TR_20_TS_5

Condition 8

132 (55 days ago)

13 (29 days ago)

0.098485

39

TR_20_TS_3

Condition 8

137 (57 days ago)

13 (28 days ago)

0.094891

Twenty-nine model-specific and DVC criterion combinations identified all Delta-defined POS-NTs and showed the highest ratio values in prediction and Condition 2 after three days of 20-day learning. TR: learning dates(training dates), TS: test dates. Bold marks indicate combinations that identified all variant-defining POS-NT. Bold marks indicate the combination of the final DVC selection algorithm proposed in this study.

Table 5-2: Combination results of model and DVC criteria that identify all Omicron-defined POS-NTs.

 

TR_TS date

Condition

Candidate Count

Candidate∩Actual

Ratio

1

TR_20_TS_3

Condition 3

102 (82 days ago)

44 (25 days ago)

0.431373

2

TR_10_TS_3

Condition 3

104 (81 days ago)

44 (25 days ago)

0.423077

3

TR_10_TS_3

Condition 7

110 (87 days ago)

44 (29 days ago)

0.4

4

TR_20_TS_3

Condition 4

111 (87 days ago)

44 (29 days ago)

0.396396

5

TR_20_TS_3

Condition 7

113 (89 days ago)

44 (30 days ago)

0.389381

6

TR_20_TS_5

Condition 4

113 (88 days ago)

44 (27 days ago)

0.389381

7

TR_10_TS_3

Condition 4

114 (86 days ago)

44 (29 days ago)

0.385965

8

TR_10_TS_3

Condition 8

122 (94 days ago)

44 (34 days ago)

0.360656

9

TR_20_TS_5

Condition 8

136 (107 days ago)

44 (37 days ago)

0.323529

10

TR_20_TS_3

Condition 8

137 (105 days ago)

44 (38 days ago)

0.321168

11

TR_10_TS_5

Condition 8

141 (109 days ago)

44 (35 days ago)

0.312057

All omicron-defined POS-NTs were identified in 11 model-specific and DVC criterion combinations and showed the highest ratio values in prediction and Condition 3 after 3 days of 20-day learning. TR: learning dates(training dates), TS: test dates. Bold marks indicate combinations that identified all variant-defining POS-NT. Bold marks indicate the combination of the final DVC selection algorithm proposed in this study.

Through the optimal ratio prediction model (i.e., learning for 20 days and prediction 3 days later) and DVC selection algorithm (i.e., Condition 3), 69 DVC POS-NTs were identified for Delta mutation, an average of 47 days before the dominant date. Among them, 13 Delta variant-defining POS-NTs were recognized 18 days before the dominant date. Similarly, 102 DVC POS-NTs were identified for Omicron mutation an average of 82 days before the dominant date, of which 44 Omicron variant-defining POS-NTs were recognized 25 days before the dominant date.

Discussion

Many previous studies have predicted the incidence of COVID-19 and the ratio of Delta and Omicron mutations. For example, Pathan and Biswas predicted the COVID-19 time series by analyzing the ratio of 12 base mutations using 3,068 samples and the LSTM model from NCBI GenBank in 2020 to predict the mutation rate for future patients who do not yet exist [8]. Singh et al. obtained COVID-19 case count data for 15 states in India through the Kaggle website and predicted the future spread of SARS-CoV-2 using the Kalman filter [9]. Marzouk et al. collected the COVID-19 data of Engypt from the Flevy open source in 2021 and predicted a COVID-19 outbreak (i.e., cumulative infection) after one week and one month, using LSTM, CNN, and MLP; the prediction results were in excellent agreement with the reported results [10]. Meanwhile, Obermeyer et al. proceeded with clustering using GISAID data on January 20, 2022, and the Pango lineage to infer prevalence for each lineage. Subsequently, they developed a hierarchical Bayesian regression model, PyR0, to detect and predict increases in B.1.1.7, AY.4, and BA.I in England [11]. De Hoffer et al. used 646.697 spike protein sequence data from the UK through GISAID in 2022 to perform clustering on a monthly or weekly basis based on amino acid substitution information and defined the appearance of a major cluster. They defined a new permanent variant as a chain containing clusters that share the same variant three or more consecutive times and designated an early warning for the emergence of a new permanent variant when 1% of the total sequence data was reached. As a result, an early warning was provided for the Alpha cluster as a new permanent variant six weeks before the WHO officially classified it as a VoC [12]. Although a few studies have predicted the occurrence of new mutations [Jankowiak, 12], they used protein-based data, and no studies have confirmed the trend by predicting the POS-NT ratio. Therefore, the current study can provide more detailed information regarding SARS-CoV-2 variants by predicting the trend and aspect of the mutation for each POS-NT.

This study has several limitations. First, the increasing POS-NT ratio was predicted using the DVC candidate selection algorithm, while the decreasing POS-NT ratio remained unanalyzed. Second, given that the dominant variant candidate identification algorithm was developed based on USA data, the algorithm may not apply to other countries in Asia. Hence, as different countries have demonstrated different rates of SARS-CoV-2 transmission and emergence of dominant variants, it is necessary to develop DVC selection algorithms for other countries, such as the UK, Germany, and Korea. Third, only replacement mutations were analyzed in this study, whereas other mutation types, such as insertions and deletions, were not considered.

Conclusions

We obtained SARS-CoV-2 POS-NT frequency ratio data for each country using a large amount of GISAID sequence data and defined the time point of the dominant variants for each mutation in each country. Subsequently, we developed a SARS-CoV-2 POS-NT frequency ratio prediction model and DVC selection algorithm using GPR for the USA and verified them for Delta and Omicron. Using this algorithm, we successfully identified all DVC POS-NTs before the dominant date, regardless of the soaring or gently increasing POS-NT patterns. As we were able to identify all mutation definitions of POS-NT for Delta and Omicron mutations, the algorithm can provide early warnings for other mutations in the future. If sufficient data exists, our model is expected to serve as an early warning algorithm for other viruses, thus improving global health.

Availability of the data and materials

The COVID-19 nucleotide sequence data used in this study can be obtained through GISAID (https://gisaid.org/) and compared with the original nucleotide sequence NC_045512. Correspondence and requests for materials should be addressed to TaeJin Ahn.

Funding

This research was supported by the research grants from Ministry of Science and ICT, South Korea (No.2021M3E5E3081425).

Conflict of interest

The authors have declared that no competing interests exist.

References

  1. Rasmussen CE. Gaussian Processes in Machine Learning. In: Bousquet O, von Luxburg U, Rätsch G, editors. Summer School on Machine Learning. Springer (2004): 63–71.
  2. Schulz E, Speekenbrink M, Krause A. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. Journal of Mathematical Psychology 85 (2018): 1–16.
  3. Jarndal A, Husain S, Zaatar O, Al Gumaei T, Hamadeh A. In: 2020 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI) (2020): 1–5.
  4. Tolba H, Dkhili N, Nou J, Eynard J, Thil S, et al. GHI forecasting using Gaussian process regression: Kernel study. IFAC-PapersOnLine 52 (2019): 455–460.
  5. Velásquez RMA, Lara J VM. Forecast and evaluation of COVID-19 spreading in USA with reduced-space Gaussian process regression. Chaos, Solitons & Fractals 136 (2020): 109924.
  6. Dhamodharavadhani S, Rathipriya R. COVID-19 mortality rate prediction for India using statistical neural networks and Gaussian process regression model. African Health Sciences 21 (2021): 194–206.
  7. Lounis M, Khan FM. Predicting COVID-19 cases, deaths and recoveries using machine learning methods. Engineering and Applied Science Letters 4 (2021): 43–49.
  8. Pathan RK, Biswas M, Khandaker MU. Time series prediction of COVID-19 by mutation rate analysis using recurrent neural network-based LSTM model. Chaos, Solitons & Fractals 138 (2020): 110018.
  9. Singh KK, Kumar S, Dixit P, Bajpai, MK. Kalman filter based short term prediction model for COVID-19 spread. Applied Intelligence 51 (2021): 2714–2726.
  10. Marzouk M, Elshaboury N, Abdel-Latif A, Azab S. Deep learning model for forecasting COVID-19 outbreak in Egypt. Process Safety and Environmental Protection 153 (2021): 363–375.
  11. Obermeyer F, Jankowiak M, Barkas N, Schafner SF, Pyle JD, et al. Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science 376 (2022): 1327–1332.
  12. de Hoffer A, Vatani S, Cot C, Cacciapaglia G, Chiusano ML, et al. Variant-driven early warning via unsupervised machine learning analysis of spike protein mutations for COVID-19. Scientific Reports 12 (2022): 9275.

Journal Statistics

Impact Factor: * 3.1

CiteScore: 2.9

Acceptance Rate: 11.01%

Time to first decision: 10.4 days

Time from article received to acceptance: 2-3 weeks

Discover More: Recent Articles

Grant Support Articles

    Editor In Chief

    Masashi Emoto

  • Professor of Laboratory of Immunology
    Department of Laboratory Sciences
    Gunma University Graduate School of Health Sciences
    Gunma, Japan

© 2016-2024, Copyrights Fortune Journals. All Rights Reserved!