Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches

Despite the active development of SARS-CoV-2 surveillance methods (e.g., Nextstrain, GISAID, Pangolin), the global emergence of various SARS-CoV-2 viral lineages that potentially cause antiviral and vaccine failure has driven the need for accurate and efficient SARS-CoV-2 genome sequence classifiers. This study presents an optimized method that accurately identifies the viral lineages of SARS-CoV-2 genome sequences using existing schemes. For Nextstrain and GISAID clades, a template matching-based method is proposed to quantify the differences between viral clades and to play an important role in classification evaluation. Furthermore, to improve the typing accuracy of SARS-CoV-2 genome sequences, an ensemble model that integrates a combination of machine learning-based methods (such as Random Forest and Catboost) with optimized weights is proposed for Nextstrain, Pangolin, and GISAID clades. Cross-validation is applied to optimize the parameters of the machine learning-based method and the weight settings of the ensemble model. To improve the efficiency of the model, in addition to the one-hot encoding method, we have proposed a nucleotide site mutation-based data structure that requires less computational resources and performs better in SARS-CoV-2 genome sequence typing. Based on an accumulated database of >1 million SARS-CoV-2 genome sequences, performance evaluations show that the proposed system has a typing accuracy of 99.879%, 97.732%, and 96.291% for Nextstrain, Pangolin, and GISAID clades, respectively. A single prediction only takes an average of <20 ms on a portable laptop. Overall, this study provides an efficient and accurate SARS-CoV-2 genome sequence typing system that benefits current and future surveillance of SARS-CoV-2 variants.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative virus of coronavirus disease 2019 , emerged at the end of 2019, burdening both the global economy and public health [1][2][3][4]. Next-generation sequencing has provided an unprecedented opportunity to monitor the COVID-19 pandemic in real-time [5,6]. During the pandemic, vast amounts of SARS-CoV-2 genome sequences have been accumulated at ever-growing rates and shared in the public database. As of 12 July 2022, more than 10 million SARS-CoV-2 genome sequences worldwide are available to researchers in the online database Global Initiative on Sharing all Individual Data (GISAID) [7] (available at https://www.gisaid.org/ (accessed on 12 July 2022)). Rapidly growing genome sequences contribute to surveilling this fast-spreading pathogen and distinguishing emerging lineages [8][9][10][11][12]. In particular, lineage classification is a critical tool for monitoring variants of concern (VOCs) or variants of interest (VOIs) with reduced susceptibility to neutralizing antibodies or having higher transmissibility [13]. Research indicated that distinct SARS-CoV-2 lineages could play a pivotal role in developing drugs and designing vaccines Figure 1. Processing pipeline of the SARS-CoV-2 genome sequence typing system. The system mainly includes data acquisition and preprocessing, multiple sequence alignment, data compression and feature extraction, supervised model training, and model testing. The ensemble model can achieve three different genome sequence typing predictions (GISAID, Nextstrain, and Pangolin). Templates can be obtained from the training set, and the Nextstrain or GISAID clade of the testing SARS-CoV-2 genome sequence can be obtained directly through the matching algorithm.

Data Collection and Preprocessing
As of 20 April 2022, 1,088,952 complete SARS-CoV-2 genome sequences with high coverage [31] were extracted from the GISAID database (https://www.gisaid.org/ (accessed on 12 July 2022)). Notably, to improve the persuasiveness of the results, these sequences were extracted by the collection dates and regions ( Figure 2). Given the different classification densities of the three typing tools (GISAID, Nextstrain, and Pangolin), sequences of each clade or lineage were uniformly sampled according to the collection date. Overall, the amounts of downsampled sequences are 91,772, 203,740, and 279,899 for GISAID, Nextstrain, and Pangolin, respectively. The above nucleotide sequences were aligned with the reference sequence (Wuhan-Hu-1, NCBI accession NC_045512) using the option of "addfragments" in MAFFT version 7.490. Each sample was composed of the aligned nucleotide sequence and its designated label. For the GISAID and Pangolin, the designated clades or lineages were contained in the metadata of the corresponding genome sequences. For the Nextstrain part, labels of sequences were obtained from the Nextclade system (https://clades.nextstrain.org/ (accessed on 12 July 2022)). After this step, samples with classification labels were obtained. These samples were then divided into training sets (25%) and testing sets (75%).

Data Compression and Feature Extraction
As only part of the SARS-CoV-2 genomes have mutations [31], to reduce the computational cost, invariant sites of the aligned genome sequences were discarded [24] to obtain compressed sequences. We referred to the nucleotide site screening protocol provided by Nextclade (https://clades.nextstrain.org/ (accessed on 12 July 2022)) and PangoLEARN (https://github.com/cov-lineages/pangoLEARN/ (accessed on 12 July 2022)) and selected feature extraction strategies suitable for different typing methods through subsequent model training and cross-validation. Specifically, Nextclade defined each clade by the combination of signature mutations, providing a total of 83 mutations for 25 clades. On the other hand, PangoLEARN removed nucleotide sites without any SNPs and a total of 4544 sites were preserved.
Given the number of reserved nucleotide sites, we have tested three data structures ( Figure 3): (1) f 1 is with the size of n × 1, where n is the number of reserved sites (n ∈ [83, 4544]). f 1 contains five different values (A, T, G, C, and -), and those sites with unknown nucleotides due to sequencing errors were replaced with the nucleotides of the reference sequence [24].
(2) f 2 is with the same size as f 1 . Element 1 indicates that the nucleotide type of this site is different from the reference sequence.
(3) f 3 is with the size of n × 5. Each sequence was represented as a vector of one-hot encoded nucleotides [24]. Examples of the three data structures applied in this study. f 1 is basically the same as the sample, and the unknown nucleotides in the sample are replaced with nucleotides at the corresponding locations in the reference sequence. f 2 is obtained by aligning the sample sequence with the reference sequence, thereby highlighting the mutation sites. f 3 has the largest amount of data, and each site is represented by a 5 × 1 vector.
It is noted that f 1 is only applied to the template matching method, while f 2 and f 3 are applied to machine learning-based methods. Details will be explained in the subsequent algorithm description section.

Template Matching Method
The template matching method was proposed for GISAID and Nextstrain clades. These two typing strategies have fewer branches (11 and 25, respectively) than Pangolin, avoiding the computational explosion of the matching algorithm. Furthermore, to balance the calculation efficiency and matching accuracy, a hierarchical matching algorithm is applied. Specifically, the template matching method is based on data structure f 1 , and the exact matching score is computed by Hamming Distance [32]: where N is the number of selected sites, and ⊕ stands for the exclusive OR (XOR) operation. Based on (1), the exact matching score between the query sequence and one of the template sequences is defined as: where Q is the compressed query sequence, T j denotes the jth template sequence, and N is the number of reserved sites.
Templates can be obtained from the training set, and this work selected the sequence with the highest coverage as the template for the corresponding clade. The proposed multilayer matching algorithm is described in Algorithm 1, where C * l and C * h are the output clades at the low and high resolution of matching, respectively. For most query sequences, the exact matching (Step 2) at the high resolution is only performed twice, ensuring the accuracy and efficiency of the proposed algorithm. 12: return C * l . 13: else 14: The clade index numbers with the same and the highest score form the set I h , and the

Difference Matrix between Clades
Based on the template matching method, this study proposed a difference matrix D to characterize the distance between clades. D is a diagonal matrix, and its element D ij is computed by (3), where i and j refer to two different clades, and T i and T j are the corresponding templates. M 1 and M 2 are the numbers of sequences that can be correctly classified with the feature length N f .
The difference matrix D of the 25 Nextstrain clades is shown in Figure 4. A larger D ij implies a larger difference between clades i and j. D can not only be used to quantify differences between clades but also play a role in the evaluation of models.

Ensemble Learning-Based Classifier
Seven supervised classifiers were applied in this work, including Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Adaboost, and Catboost. Our goal is to screen out the optimal classification models and evaluate the performance of those classifiers on the two data structures f 2 and f 3 , respectively. In addition, this study explored the ensemble of multiple models, such as the weighted fusion of multi-model predictions.
• LR is one of the most commonly used analytical methods in epidemiology and medicine [33]. As an extension of linear regression, LR is quite efficient with time and memory requirements, processing larger data with smaller resources. Using the one-vs.-rest (OvR) scheme, LR is applied for multiclass tasks. However, LR is prone to underfitting, resulting in low accuracy, especially in multi-classification tasks with unbalanced samples. • DT adopts a tree structure for classification model training [34]. Starting from the root node, each branch divides the training data into disjoint subsets. The decision tree can be visualized and easily understood and interpreted. On the other hand, DT is prone to overfitting and is sensitive to data bias. • RF is ensembled by multiple decision trees [24,35]. Each tree is built using a sub-set of the training sets. All decision trees vote on the classification, and the category with the most votes is the classification result of the RF. The RF prediction model can be trained fast and is easy to operate in parallel. In addition, the RF can output the feature importance computed as the total reduction of the criterion brought by that feature [36]. Previous studies have revealed that RF shows better performance than LR and DT in terms of SARS-CoV-2 clade classification [24,28]. • SVM solves the classification problem by finding the best hyperplane, which correctly divides the training sets and maximizes the geometric interval between the support vectors [37]. The hyperplane is presented as: where x is the feature vector, and w and b represent the normal and intercept vectors of the hyperplane, respectively. By introducing kernel functions, SVM can solve nonlinear classification problems. • MLP is an artificial neural network consisting of fully connected layers with at least one hidden layer [38]. Taking the case with one hidden layer as an example, the mathematical model of the MLP can be expressed as: where x is the input vector, w 1 and w 2 are the weights of input and hidden layers, b 1 and b 2 represent the bias vectors. φ is the activation function, such as the rectified linear unit function (ReLU) or the hyperbolic tan function (tanh). • Adaboost is applied as a strong classifier constituted with multiple weak classifiers [39]. A base classifier is first trained from the initial training set, and the weights of training samples are then adjusted based on the training loss. As a result, the misclassified samples obtain more attention in subsequent training iterations. After T iterations of training, these T weak classifiers (h t (x)) are weighted to form a strong classifier (H(x)), and a weak classifier with a smaller classification error has a larger weight α t : • Catboost is an algorithm for gradient boosting on decision trees [40,41]. Since the default parameters of Catboost provide great training results, it can reduce the time of parameter tuning. Catboost requires less hyperparameter tuning, reducing the possibility of overfitting and making the model more general. Additionally, Catboost supports model training on GPUs, improving training efficiency on large datasets. However, for the processing of categorical features, Catboost still consumes a lot of memory and time.
After training and testing the above models, on the one hand, this study analyzed the advantages and disadvantages of different models in the application of SARS-CoV-2 genome sequence typing and selected the optimal model; on the other hand, this study explored the combination of different models to obtain better prediction results than a single model. To facilitate the combination of the above methods, all classifiers were trained with enabled probability estimates. Furthermore, the proposed ensemble learning system applied weighted voting as the combination strategy [42]. Taking the Nextstrain clade typing task as an example, p i j represents the probability that the ith classifier predicts the jth clade. The ensembled probability of the jth clade is computed as: where N is the number of ensembled models, and ω i is the weight of the ith classifier, satisfying that ∑ ω i = 1. The ensembled sequence typing prediction result C * is determined with the maximum probability:

Evaluation Metrics
To facilitate the internal verification and external testing of the model, this study mainly applies statistics including precision, recall, F-score, training and testing efficiency to quantify the model. In addition, for GISAID and Nextstrain, this study also applies the average differenceD to test the classifier: where N F represents the number of samples with incorrect predictions.

Results
All experiments in this study were conducted on a portable laptop with an Intel Core i7 CPU (32G memory) at 2.30 GHz and an Nvidia RTX3070 GPU (8 G).

Feature Extraction
This study applied the sites provided by Nextstrain and Pangolin as the basis for feature extraction. As of 20 April 2022, Nextstrain provides 83 (F min ) nucleotide sites for the classification of 25 SARS-CoV-2 clades. On the other hand, Pangolin provides 4544 (F max ) nucleotide sites for the classification of over one thousand lineages. To meet the requirements of different typing strategies for accuracy, efficiency, and calculation memory usage, this study filtered the features and obtained different levels of feature scales in the range of [F min , F max ].
Given the successful application of the RF classifier in Pangolin [24], the RF classifier was trained to classify the 25 clades of Nextstrain, and the initial feature scale was set to l 5 = F max . Then, the feature importance distribution was obtained as shown in Figure 5. Since the RF classifier adopts the one-hot method ( f 3 in Section 2.2), each site corresponds to a 5 × 1 vector. Therefore, for the site weight calculation, the maximum value of the five features was used as the importance weight of the site. Taking 10 −3 , 10 −4 , and 10 −5 as the thresholds, l 2 , l 3 , and l 4 were obtained, respectively. The corresponding numbers of effective sites are 192, 464, and 1048, respectively. Obviously, l 1 corresponds to the 83 effective sites provided by Nextstrain, and l 5 corresponds to the 4544 sites offered by Pangolin. It should be pointed out that GISAID has the least number of clades. To simplify the model training, this study applied the same feature settings as Nextstrain for the construction of the GISAID clade typing model.

Nextstrain Clade Typing Results
The datasets used for the Nextstrain typing experiment consist of two groups: the first group (S N 1 ) includes 50,935 sequences for model training; the second group (S N 2 ) includes 152,805 sequences for model testing.

Training of the Nextstrain Clade Typing Models
Firstly, parameter optimization was performed for template matching using dataset S N 1 . As shown in Algorithm 1, hyperparameters of the proposed multilayer matching algorithm are mainly N l and N h . N h was set to 4544 (l 5 ), and N l was set to four levels (l 1 , l 2 , l 3 , l 4 ). As shown in Figure 6a, test1 (N l is set to l 1 = 83) obtains the highest matching accuracy and efficiency. As the number of features for coarse matching increases, the timeconsuming increases; however, the accuracy decreases. At the same time, the number of misclassified samples with large differences (>0.1) also increases accordingly (Figure 6b). The template matching algorithm assigns all features the same score weight. Despite the shortcomings of this design, the proposed method has simple parameter settings and high matching efficiency, and subsequent experiments show that this method can match the performance of machine learning methods in sequence typing. In addition to the optimized parameter N l for the template matching algorithm, the difference matrix of the Nextstrain clades as shown in Figure 4 was also obtained.
Different from the template matching algorithm, the machine learning method can obtain the weights of features, so as to play the role of automatic screening of features in the training process. In light of RF's excellent performance in SARS-CoV-2 clade classification [24], this part applied the training dataset S N 1 to test the performance of the RF classifier with different numbers of features (l 1 , l 2 , l 3 , l 4 , and l 5 ) and data structures ( f 2 and f 3 ).
The 3-fold cross-validation results of the RF classifier are shown in Figure 7. The four curves in each sub-figure correspond to 100, 200, 500, and 1000 estimators, respectively. The recall and F-score curves show similar trends across different experimental groups. As the number of features increases, the training and validation time increases. In addition, the cross-validation based on data structure f 3 is much more time-consuming than that on f 2 . However, the recall and F-score of f 2 are only marginally inferior to those of f 3 , showing the superiority of the lightweight data structure f 2 for typing SARS-CoV-2 genome sequences. Overall, the group with feature size l 3 , data structure f 3 , and estimator number 1000 obtained the best classification accuracy (Figure 7e   Since Catboost has the advantages of rapid parameter tuning, high accuracy, low risk of overfitting, and is suitable for GPU-accelerated training [40,41], this study further analyzed the cross-validation results of the Catboost classifier. The results of the 3-fold cross-validation of the Catboost classifier on dataset S N 1 are shown in Figure 8. The recall and F-score curves shown in Figure 8a,b indicate that the Catboost classifier is slightly better than the RF classifier. In addition, the best classification result is obtained at the number of features l 3 . As for time cost, the advantage of f 2 is more prominent (Figure 8c). Figure 8d shows the curves of learning error and testing error in one of the cross-validations, and the horizontal axis represents the number of iterations.
The above two sets of experimental results show that the SARS-CoV-2 genome sequence typing based on both data structures f 2 and f 3 can achieve ideal results. The former ( f 2 ) has a prominent advantage in efficiency, while the latter ( f 3 ) has a slight advantage in accuracy. In addition, choosing the number of features as l 3 achieved ideal results in both accuracy and efficiency. The other five classifiers were trained by 3-fold cross-validation with the feature size of l 3 . The cross-validation results of the seven models based on the training dataset S N 1 are shown in Table 1, arranged in descending order of the F-score.D stands for the average difference computed by (9). Catboost, RF, and Adaboost obtained the top three classification accuracy. DT obtained the highest efficiency but the worst accuracy. In addition, theD of the seven models are all less than 0.08, and over half of them are less than 0.06, indicating that the misclassified samples mainly exist between clades with small differences.  The dataset S N 2 with 152,805 sequences was tested for external validation. In addition to the seven learning-based classifiers, this part tested the proposed template matching method (TM) and the ensemble model (Ensemble). Based on the cross-validation results in Section 3.  (Table 1) and was not used for the ensemble method. Table 2 shows the results of the nine classification methods on dataset S N 2 , and the average testing time required for each aligned sequence is also represented in the table. Firstly, considering the precision, recall, and F-score measures, the Catboost classifier achieved the best performance among seven machine learning-based methods for both f 2 and f 3 . RF, Adaboost, and LR also achieved ideal classification results for both f 2 and f 3 . Secondly, methods using data structure f 2 were less time-consuming. RF, LR, and DT obtained higher accuracy and efficiency on data structure f 2 . Thirdly, TM achieved better classification measures (precision, recall, and F-score) than any machine learning-based method. Finally, the ensemble model achieved the highest accuracy among all methods. The confusion matrix produced by the ensemble model based on f 3 is shown in Figure 9. Although the f 2 -based ensemble method is slightly inferior to the f 3 -based one on the three classification measures, the former is significantly more computationally efficient than the latter. In addition, except for DT, all methods, including TM and the ensemble model, obtainedD < 0.055, indicating that the misclassified samples are mainly distributed among clades with small differences, like the Delta (21A, 21I, and 21J) and the Omicron (21M, 21K, and 21L) clades (as marked with pink boxes in Figure 9). To further compare the classification performance of different methods, the receiver operating characteristic (ROC) curves of different methods on dataset S N 2 using data structure f 2 were plotted in Figure S1. The ensemble model obtained the best performance with the largest area under the curve (AUC).  using data structure f 3 . The dark color represents a larger number of samples. Pink boxes mark easily misclassified samples distributed between clades with small differences.

GISAID Clade Typing Results
The experimental process of the GISAID clade typing is similar to Section 3.2. This part adopted the feature extraction method as described in Section 3.1 and conducted experiments on both f 2 and f 3 data structures. The datasets used for the GISAID typing experiment consist of two groups: the first group (S G 1 ) includes 22,943 sequences for training; the second group (S G 2 ) includes 68,829 sequences for testing.

Training of the GISAID Clade Typing Models
There are 11 GISAID clades involved in this study ( Figure 10). The difference matrix D was calculated based on the training dataset S G 1 by Equation (3). Compared with Nextstrain's difference matrix (Figure 4), GISAID's D shows clear discrimination. Differences between the eight clades (L, V, S, O, G, GH, GV, and GR) in Figure 10 are quite small (D ij ≤ 0.04), while GK (Delta), GRY (Alpha), and GRA (Omicron) are quite different from other clades (D ij ≥ 0.09).
Firstly, the hyperparameters N l and N h of the proposed multilayer matching Algorithm (1) were set based on S G 1 . N h was set to 4544, and N l was set to four levels (l 1 , l 2 , l 3 , and l 4 ). The corresponding four sets of training results are shown in Figure 11. Test3 and test4 obtained the same F-score, and the former was much more efficient. In addition, Figure 11b shows that there is no significant difference in the distribution of D ij . Based on the above results, the parameter N l of the template matching method for GISAID clade typing was set to l 3 .
Secondly, the 3-fold cross-validation was applied to the machine learning-based methods. Based on the good performance of the RF and Catboost classifiers in the Nextstrain clade typing, we applied these two methods for feature scale selection. The cross-validation results based on five different feature scales and two data structures are shown in Figure 12. Overall, as the number of features increases, the training time increases, and the recall and F-score increase. Furthermore, the model with data structure f 3 slightly outperforms f 2 in terms of recall and F-score. To this end, the number of features used for the GISAID clade classification was set to l 5 .
The cross-validation results of the seven models based on the training dataset S G 1 are shown in Table 3. They are sorted in descending order by the F-score. All six other methods except MLP achieved higher recall rates and F-scores on data structure f 3 . DT obtained the highest efficiency, but the second-to-last accuracy. SVM obtained the lowest accuracy on both f 2 and f 3 data structures. Furthermore, the accuracy of the GISAID clade typing is lower than that of Nextstrain, and theD of misclassified samples of the GISAID clade typing is larger. Figure 10. The difference matrix D of the GISAID clades. Each element in the matrix represents the difference between the corresponding two GISAID clades, and the dark color represents a larger difference.

Testing of the GISAID Clade Typing Models
Further external validation was conducted to compare different typing models, using the dataset S G 2 with 68,829 sequences. In addition to the seven supervised learning-based methods, TM and the ensemble model were also tested. Different from Nextstrain, the number of features used by the GISAID classification models is l 5 . DT and SVM with the worst classification accuracy were removed from the ensemble model. Based on the crossvalidation results shown in Table 3, the weights of the five classifiers in the ensemble model were set as: ω 1 = 0.25 (Catboost), ω 2 = 0.25 (MLP), ω 3 = 0.20 (LR), ω 4 = 0.20 (RF), ω 5 = 0.10 (Adaboost). Table 4 shows the results of nine classification methods on the testing dataset S G 2 , and the average testing time per sequence is also presented. Firstly, considering precision, recall, and F-score, the RF classifier achieved the best performance among those seven machine learning-based methods on both f 2 and f 3 , followed by Catboost, LR, Adaboost, and MLP. TM is inferior to other models in the precision, recall, and F-score. However, itsD is smaller, indicating that the misclassified samples of TM are mainly distributed between clades with small differences. Notably, the seven machine learning-based models have very little difference in accuracy between f 2 and f 3 . Moreover, the ensemble model achieved the highest precision, recall, and F-score on f 2 . In terms of computational efficiency, the prediction time per sample of the ensemble model on f 2 was only 31.7% of that on f 3 , providing an accurate and efficient solution for the GISAID clade typing. The confusion matrix produced by the ensemble model on f 2 is shown in Figure 13. To facilitate comparison, elements in the matrix are expressed as proportions. Among them, the recall rates of clades O and GR are less than 90% and the recall rate of clade O is the lowest (78.7%). Figure S2 shows the ROC curves of different methods on dataset S G 2 using data structure f 2 . The ensemble model obtained the largest AUC.  using data structure f 2 . The dark color represents a larger proportion.

Pango Lineage Typing Results
Unlike the Nextstrain and GISAID typing issues, the number of lineages defined by Pangolin is significantly increased [24], and TM is no longer suitable for Pango lineage typing. Due to a large number of lineages, the time-consuming and computational cost of model training increases significantly. Further considering the results in Sections 3.2 and 3.3 and the performance of pangoLEARN [24], this study mainly applied RF and Catboost to conduct the Pango lineage typing research. In view of the validity of the model and the limitation of computing resources, we set the minimum number of samples of each lineage to 50, and a total of 710 lineages were obtained (lineages with less than 50 samples were discarded). In addition, no more than 2000 samples were screened for each lineage. Finally, a total of 279,899 sequences (69,565 training samples (S P 1 ) and 210,334 testing samples (S P 2 )) were obtained.

Training of the Pango Lineage Typing Models
S P 1 was applied to build the RF classifier, achieving the feature importance distribution (shown in Figure 14). Comparing it with the feature distributions of Nextstrain and GISAID, Pangolin (green) has the widest distribution of effective features, with only 53 of the 4544 sites weighting 0. Therefore, this study adopted the same setting as PangoLEARN [24] in the number of effective nucleotide sites (l 5 = 4544).  Figure 15. The horizontal axis in Figure 15 represents the number of samples involved in training in each cross-validation, which is 66.67% of N train . Figure 15a shows that the training time is positively correlated with the number of samples, and the training on f 2 is more efficient. Figure 15b,c show that the F-scores and recall rates have very similar trends, and the classification performance on the two data structures differs very little. Considering both accuracy and efficiency, f 2 performs better than f 3 in Pango lineage typing. Further experiments showed that Catboost had very close validation results to RF.

Testing of the Pango Lineage Typing Models
We applied RF and Catboost to construct the Pango lineage classifiers and integrated the prediction results of the two models. To improve the accuracy of the classifiers, all samples in S P 1 (69,565) were used for model training. The testing results on the dataset S P (210,334) are shown in Table 5. Furthermore, in the ensemble model, the weights of the prediction results of both models (RF and Catboost) were set to 0.5. It is worth noting that the RF classifier on data structure f 3 is the same as that applied by PangoLEARN [24] and can be used for comparison. As shown in Table 5, the proposed ensemble model on data structure f 2 achieved the highest classification precision, recall, and F-score. The ensemble model improved the classification accuracy on both data structures and achieved better performance on f 2 with less computation. Figure 16 shows the F-score distribution of the three models. The vertical axis represents the proportion of the Pango lineages in different F-score intervals. The ensemble model obtains the highest proportion of lineages with an F-score ≥ 95%. Furthermore, Figure S3 shows the ROC curves of the three methods on dataset S P 2 using data structure f 2 . The ensemble model outperforms RF and Catboost with a larger AUC.

Model Extension with Newly Emerging Clades
To deal with the classification of newly emerging clades, we tried to obtain an extended model by training a sub-model based on the existing model. Taking Nextstrain as an example, we obtained 561 sequences with high coverage of two new clades (22A (Omicron) and 22B (Omicron)) from the GISAID database. The collection dates of these sequences are from 25 April 2022 to 12 July 2022. According to Nextclade (https://clades.nextstrain.org/ (accessed on 12 July 2022)), 22A (Omicron) and 22B (Omicron) evolved from 21L (Omicron). For brevity, they are abbreviated as 22A, 22B, and 21L.
We applied the same method as the main model (Section 2.5) to construct a sub-model for the classification of the three clades (21L, 22A, and 22B). An extended model is composed of the main model and a sub-model. For sequences to be classified, we firstly applied the main model for classification. For the sample whose main model output was 21L, we continued to apply the sub-model for further classification. To this end, an extended classification model capable of handling newly emerging clades can be obtained with only a small amount of work. It should be pointed out that the construction of the sub-model needs to incorporate the nucleotide mutation sites of the two new clades 22A and 22B based on the original features. In this study, the training set (22A and 22B) was compared with the reference sequence, and the nucleotide sites with mutation rates exceeding a certain threshold were extracted. As a result, 35 additional feature points were obtained.
The resulting confusion matrix (Table 6) on the testing dataset shows that only four samples were mistakenly assigned to another clade by the sub-model, achieving a weighted accuracy of 99.519%. Furthermore, out of all 831 samples, only one sample with clade 21L was misassigned to 21M by the main model. All samples of clades 22A and 22B were correctly assigned to their father clade 21L by the main model. The training of the submodel took only a few minutes and the extra testing time for each aligned sequence was less than 10 ms. Therefore, by introducing the sub-model, this study can rapidly construct an extended model that accurately identifies newly emerging clades.

Discussion
Facing the SARS-CoV-2 genome sequence typing problem, this study built classifiers for three typing strategies of GISAID, Nextstrain, and Pangolin. In addition to the machine learning-based methods, this study has proposed a method based on template matching for GISAID and Nextstrain. Based on the template matching algorithm, we obtained the difference matrix between viral clades and applied it as one of the classifier evaluation indicators. To achieve a fast and accurate classifier, two improvements have been made. First, two data structures based on one-hot coding and site mutation were used for nucleotide sequence transformation. Second, a weighted fusion strategy was applied to obtain an ensemble model. Overall, our study achieved the highest accuracy on Nextstrain clade typing (precision: 99.879%, recall: 99.879%, F-score: 99.879%), followed by the Pangolin (precision: 97.889%, recall: 97.732%, F-score: 97.766%) and the GISAID (precision: 96.433%, recall: 96.291%, F-score: 96.235%).
(1) Nextstrain: Our study has studied the classification of 25 Nextstrain clades, using seven machine learning-based methods and a template matching-based method. The ensemble model achieved the highest classification precision, recall, and F-score. The template matching algorithm achieved a classification performance comparable to any machine learning-based classifier. In addition, the difference matrix D obtained from the matching algorithm can intuitively represent the distance between different clades. Figures 4 and 9 show that the misclassified samples are mainly distributed between clades with small differences. Furthermore, data structure f 2 has a better classification performance in SARS-CoV-2 genome sequence typing. Although the accuracy is slightly lower than that of f 3 , the computational efficiency is improved by more than five times (as shown in Table 2).
(2) GISAID: Research on the classification of 11 GISAID clades has been carried out in this work. The ensemble model on data structure f 3 achieved the best results. Compared with the Nextstrain clade typing, TM performs worse in the GISAID clade classification, and the F-score is lower than 85%. Figure 10 shows that except for GRA, GRY, and GK, the GISAID clades are less diverse (D ij ≤ 0.04). Furthermore, a total of 13 (23.6%) elements in Figure 10 are less than 0.03, while those in Figure 4 equal zero. It indicates that the separability between GISAID clades is lower than that of Nextstrain clades. The ensemble model on f 3 obtained the highest typing accuracy with an ideal computational speed.
(3) Pangolin: A total of 710 Pango lineages are included in this study. The classification accuracy of RF and Catboost is very close, and the ensemble of the two methods can obtain higher precision, recall, and F-score. More interestingly, the performance of the ensemble model on f 2 is better than that on f 3 , with higher accuracy and less computation time.
Compared with existing SARS-CoV-2 typing studies, our results have both improvements and limitations. The Genome Detective Coronavirus Typing Tool [25] can only identify the SARS-CoV-2 clades of several VOCs. In addition, this method is computationally inefficient, taking an average of 30 ms per genome. UShER [43] places sequences on a comprehensive tree and supplied sequences need to be uploaded to UShER's servers where processing takes place. In addition, it takes an average of 18 ms to place one sample onto the reference tree using 16 threads and achieves an accuracy of 98.5% for samples with one parsimony-optimal placement. On the other hand, Nextclade [44] is an open-source project for viral genome alignment, mutation calling, clade assignment, quality checks, and phylogenetic placement. Although its web version can provide comprehensive and up-to-date sequence analysis results, its offline version performs clade assignment based on a small number of valid nucleotide sites, with low accuracy, and partial sequences cannot be effectively identified. Compared with Nextclade and UShER, this study does not construct the evolutionary tree but focuses on the typing of genomes. In addition, the methods proposed in this study (the template matching and the ensemble model) are computationally efficient (<20 ms for one sample) with higher accuracy (>99.85%). The disadvantage of our work is that we can only identify existing clades and cannot discover new SARS-CoV-2 clades. However, the proposed extended model can identify newly emerging clades by training sub-models with only a small amount of work.
As for the GISAID clade typing, its classification accuracy is relatively low. GISAID classification is based more on several marker variants than strictly phylogenetic relationships [18]. Moreover, clade O refers to other clades that do not meet the GISAID clade definition [45]. This can further explain that the typing model has the worst accuracy on clade O (recall: 77.249%, F-score: 86.625%). The PhenoGraph [46] classification identifies 303 SARS-CoV-2 clades and is consistent with, but more detailed and precise, than the known GISAID clades [18]. It provides an unsupervised clustering method for SARS-CoV-2 clades. In contrast, we provide supervised models for a different classification density. Although the weighted recall of the proposed model is about 96%, VOCs such as GK (Delta) and GRA (Omicron) can achieve an accuracy of over 99%.
The Pangolin classification tool [24] provides the basis for the research in this study. Different from PangoLEARN, this study tried a lightweight data structure f 2 with higher efficiency. The classification accuracy has been improved through model integration. The limitation of our method is that only 710 SARS-CoV-2 lineages are included in this study due to the constraints of computational resources. This problem can be solved by increasing the hardware configuration level and downloading more data. In addition, GNU-based Virus IDentification (GNUVID) is applied to assign sequence type profiles to all high-quality SARS-CoV-2 genomes [28]. The overall prediction statistics of GNUVID on high-quality genomes are precision (94.7%), recall (96.4%), and F-score (95.0%), which are lower than those of the classifier proposed in this study. In addition, this study adopts the lightweight data structure f 2 to improve the classification efficiency, and the average time per sequence is about 10 ms, which is much lower than the 31 ms of GNUVID [28].

Conclusions
This study presents a SARS-CoV-2 genome sequence classification system based on supervised learning methods. Overall, the system aims to achieve rapid and accurate SARS-CoV-2 genome sequence typing for the three typing strategies of Nextstrain, GISAID, and Pangolin, respectively. When we obtained SARS-CoV-2 genome sequences from COVID-19 patients, the system proposed in this study can be applied to efficiently and accurately type these sequences, which would help to carry out relevant epidemiological analysis and provide reliable typing and traceability basis for effectively blocking its spread. For Nextstrain and GISAID, this study has proposed a method based on template matching. Through the strategy of multi-layer matching, we improved the efficiency of the matching algorithm. The template matching method achieved satisfactory results in the Nextstrain clade typing. A template matching-based difference metric method is proposed to quantify the difference between two clades and serve as an evaluation factor for classifier performance. Furthermore, we have proposed an ensemble model that integrates a combination of machine learning methods (such as Random Forest and Catboost) with optimized weights. In addition to the one-hot coding method, this study has proposed a data structure based on nucleotide site mutation, which obtains good results in SARS-CoV-2 genome sequence typing. While obtaining ideal classification accuracy, the computational resources are greatly reduced. Finally, verified by a large number of testing datasets, the ensemble model proposed in this study helps to improve the accuracy of the classification system (Nextstrain: 99.879%, Pangolin: 97.732%, GISAID: 96.291%). This study provides a comprehensive and efficient method for SARS-CoV-2 genome sequence typing, which helps to monitor the diversity of SARS-CoV-2, thereby serving the global anti-epidemic. In addition, by introducing sub-models, this study can rapidly construct an extended model that accurately identifies newly emerging clades without retraining the main model constantly. Future work will focus on the discovery of new clades and the identification of recombination.
Supplementary Materials: The following supporting information can be downloaded at: https://www. mdpi.com/article/10.3390/microorganisms10091785/s1, A list of SARS-CoV-2 genome sequence acknowledgments is provided in Data S1. Figure S1: ROC curves on the Nextstrain testing dataset, Figure S2: ROC curves on the GISAID testing dataset, Figure S3: ROC curves on the Pangolin testing dataset.  The funders played no role in study design, data collection, data analysis, data interpretation, or writing of the report.