PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences

Simple Summary The family of coronaviruses comprises a diverse set of strains and variants which cause diseases from the common cold to COVID-19. Moreover, they infect a wide array of hosts from bats, camels, birds, to humans. Studying coronaviruses through the lens of host specificity provides a unique perspective to understanding the evolution, diversity and dynamics of this family. In particular, this can reveal groups of different hosts infected by similar strains, giving clues on strains which were more likely to have evolved to jump from one host to another. In this work, we frame host specificity as a classification task, in designing a very compact numerical representation of the spike sequences of different coronaviruses. Based on this numerical representation, classification methods are able to detect the target host with high accuracy. Such an approach can used to efficiently scale to large volumes of sequences, in order to unveil trends in the host specificity of different coronavirus strains. Abstract The study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic—an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime—in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).The pandemic has put millions of people at risk in numerous countries worldwide and made an unprecedented public health crisis [1].Although the origin of COVID-19 (SARS-CoV-2) in humans is still unknown, there are many theories that it could have been transferred to humans from bats [2].Likewise, several related coronaviruses (CoVs), such as SARS (SARS-CoV), was transmitted to humans from civets (civets are closely related to cats [3]), and MERS (MERS-CoV) from dromedary camels [4].SARS-CoV-2, and other CoVs, belong to the family coronaviridae (of order nidovirales [5]), which is a large family of diverse enveloped, positive-sense single-stranded genomic RNA(+ssRNA) viruses that can bring about respiratory diseases to humans and animals [6].They are grouped into five genera namely alphacoronavirus, betacoronavirus, gammacoronavirus, alphaletovirus and deltacoronavirus.They infect a range of hosts such as human, palm civets, bats, dogs, monkeys among others [7].Alphacoronaviruses and betacoronaviruses largely infect mammals, and gammacoronaviruses largely infect avian, the delta coronavirus both avian and mammals [8] (see Figure 6).SARS-CoV-2 is the seventh of this coronavirus family known to affect humans, and the other six are severe acute respiratory syndrome-CoV (SARS-CoV), HCoVs-NL63, HCoVs-OC43, HCoVs-HKU1, HCoVs-229E, and middle east respiratory syndrome-CoV (MERS-CoV) [9].SARS-CoV-2 is similar to SARS-CoV which is one of the previous virus from the family coronaviridae, that lead to the SARS epidemic in 2003 caused more than 8400 cases and approximately 800 deaths [10].As compared to the known SARS-CoV virus, the novel SARS-CoV-2 has a lower mortality rate but higher human-to-human transmission rate, also SARS-CoV-2 can have adverse impact on the human body.It is highly infectious and is the matter of big concern since it can not only damage the respiratory system, gastrointestinal system, heart, and central nervous system, but also may lead to multi-organ failure and eventually death [11,12].
Monitoring zoonotic disease and host specificity are integral to understanding disease dynamics.The 60% of known infectious diseases in humans and 75% of all emerging disease are zoonotic as per report by United Nations Environment Program (UNEP) and International Livestock Research Institute (ILRI) [13].The study of the COVID-19 pandemic is of great significance, not only because it can help healthcare institutions to cope with the ongoing epidemic but also because it allows researchers to learn more fundamentals about the family coronaviridae, which can provide new knowledge for the prevention of potential pandemics in the future.CoVs are widespread among birds and mammals and can be a cause for zoonoses.Zoonoses is defined as any disease or infection that is naturally transmissible from vertebrate animals to humans by the World Health Organization, and COVID-19 has been classified as a zoonotic disease [14].One important step to learn zoonoses and understand the pandemic better is finding how human infection began.CoVs can lead to various diseases in domestic animals, including dogs, pigs, chickens, and cats.Although the origin of COVID-19 in humans is still unknown, genetic analysis results show that it's highly possible that SARS-CoV-2 originates from bats and utilizes the pangolin as an intermediate host [15][16][17].
The CoVs have an envelope membrane that is associated with five structural proteins, namely the surface (S) protein, or spike protein, haemagglutinin-esterase protein (HE), membrane protein (M), envelope protein (E), and the nucleocapsid protein (N) [18], see Figure 1.The spike protein mediates the binding and fusion between the virus and the host cell receptors, and also the infected host cells and adjacent uninfected cells [19].Hence the spike protein of different CoVs largely determines the range of host specificity of these CoVs.Changes in spike protein sequences are reportedly sufficient to change tissue and species tropism, and viral virulence [7,20,21].The S protein is a trimeric transmembrane protein with protrusion on the viral surface, like spikes, which are the key for binding and entry to host cells.It is composed of the receptor binding domain or S1 subunit and S2 subunit that harbor sequences for viral fusion to the cell membrane [7,20,21].Because of its importance, using the specificity of spike proteins offers an approach to classify the potential hosts of CoVs.
A common way to classify and understand the dynamics of viruses is to construct a phylogenetic tree of evolution using the sequencing data of the virus [22,23].After the COVID-19 pandemic breakout, databases like GISAID [24] collected a large number of sequence data of SARS-CoV-2 from researchers and clinicians worldwide, which provides data to conduct virus phylogenetic tree inference.Many The structural genes include spike (S), envelope (E), membrane (M) and nucleocapsid (N).The S gene region encodes the spike protein, which is responsible for attaching the virus to receptors on the host cell membrane.
methods have been developed and applied for constructing phylogenetic trees, including the most similar supertree algorithm (MSSA) method [25], MRP method [26], and approximate maximum likelihood (ML) supertree method [27,28], similar methods upon which the state-of-the art Nextstrain [22] and IQTREE2 [23] have been developed.However, these methods of building a phylogenetic tree for CoVs require a high computational complexity, and the vast volume of sequences data size can cause a scalability issue for the phylogenetic-based approaches [29].For example, Nextstrain [22] is able to construct trees on thousands of sequences, while IQTREE2 [23] is able to scale to tens of thousands of sequences.There are currently millions of sequences available on GISAID alone -clearly viable alternatives are necessary.Here we study machine learning clustering and classification as an alternative to phylogenetic tree building.
Some effort have been made to study the coronavirus host data [30] by using one-hot embedding (OHE) approach to get fixed length feature embedding for the spike sequence.Although OHE gives better predictive performance, one of its drawbacks is the high dimensionality of the feature vectors.Also, the columns of the OHE based vector have a linear relationship with each other, which means that one variable can be easily predicted using the other variables.This behavior can cause problems of parallelism and multicollinearity (when multiple features are correlated with each other) in high dimensions.Authors in [31,32] using the coronavirus spike sequences to classify different variants of COVID-19 using k-mers based frequency vectors.Researchers have performed clustering on the COVID-19 spike sequences using the same k-mers based frequency vector generation approach [33,34].Although their approaches are effective in terms of predictive performance, the dimensionality of feature vector representation is still high, which can creates the very well known problem in machine learning the problem of the curse of dimensionality.Moreover, since for each k-mer, it is required to find the appropriate bin dedicated for that specific k-mer, this 'Bin Matching' could be expensive in terms of computational cost.
Another possible solution, which is what we propose here, is the use of the position weight matrix (PWM), sometimes also called as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM) [35].It is a good representation for motifs in biological sequences.A motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance.The PWM is a application of entropy and relative entropy towards identifying transcription factor binding sites (TFBSs), for example.A PWM contains information about frequency of nucleotide for each position in form of weights, these log-odds or log-probability weights are used for computing the binding affinity score.
The PWM can be used to distinguish the binding sites from the sequence, a well known method for de novo motif sequence finding.If we do not know about a possible motif in given sequence, there are methods such as expectation maximization (EM) and Gibbs sampling which uses the PWM.Inspired by this application, we compute an absolute score from the PWM while scanning the sequence for "motifs" (here k-mers) of our sequence using a sliding window (of size k) to scan the sequence and computing absolute score as shown in Figure 4. We can find relevant information of the motifs based on score calculated from the PWM.The higher the score, the more relevant the k-mer is.
In this paper, we propose an approach, called PWM2Vec, which is also basic implementation of position weight matrix (PWM) to generate feature vector representation from coronavirus spike sequences.Given a spike sequence, we first extract k-mers.From the k-mers, we generate the PWM (see Figure 3).After that, we assign a score to each k-mer using PWM to design a feature embedding and apply machine learning methods like classification and clustering on this feature embedding.While this is inspired by methods for finding motifs (e.g., TF-promoter binding sites), here our goal to obtain a numerical representation of these k-mers generated from each sequence.
Our contributions in this paper are as follow: 1. We propose an approach to generate fixed-length numerical representation of spike sequences using PWMs.The generated feature vectors could be used as input for any machine learning algorithm for tasks such as classification and clustering.2. Our proposed feature embedding approach contains more compact information and give results better than the baselines in terms of classification and clustering.3. Our feature vector contains fewer dimensions as compared to k-mers and one-hot encoding based feature vectors (≈ 24 times fewer dimensions than one-hot encoding and ≈ 4 times fewer dimensions than k-mers based embedding), which improves the runtime for classification and clustering algorithms.4. We performed statistical analysis on the data and show importance of different positions of amino acids that play key role towards classification of different hosts, as well as to validate the compactness of our proposed embedding from an orthogonal point of view.
The rest of the paper is organised as follows: Section 2 contains the previous work related to the sequence classification in general and coronavirus spike sequence classification in particular.Section 3 contains the detail about our proposed alignment free method for spike sequence classification.Section 4 contains the experimental setup and dataset collection and statistics detail.The results for our proposed method are given in Section 5. Finally, we conclude our paper in Section 6.

Related Work
Several machine learning approaches based on k-mers have been proposed in the literature for classification and clustering tasks [31,33,36,37].More specifically, there are a lot of classical algorithms for sequence classification [38,39].Although these methods are proven to be useful in respective studies, it is not clear if they can be used in context of coronavirus data.Also, another major problem with all those methods is high computational complexity of the algorithms (because of high dimensional representation of data), which can result in the higher runtime of the underlying classification algorithms.
Postition weight matrix (PMW) based approaches have been successfully applied for diverse sequence analysis, motifs predictions and identification studies.Several popular software applications or web servers have been build based on the implementation of PWM, e.g., the PWMscan software package [40] and PSI-BLAST [41].Also, a many other advanced algorithms have been implemented to optimize PWM technique: examples include MEME (multiple EM for motif elicitation) [42] implemented based expectation maximum (EM) and Gibbs Sampler [43] for de novo motifs discovery that uses Gibbs sampling algorithms [44], [45].The MEME EM algorithm basically finds an initial motif and repeatedly use EM steps improve motifs until the PWM values do not improve further [42,46].Also, the BaMM (Bayesian Markov model) algorithm build based on the Markov to model correlations among nucleotides at other positions -since the PWM cannot because the method assumes probabilities at different sites are independent from each others [45].The PWM method continues to be applied and extended.Log2PWMs is simple implementation of PWM extended to enable conversion or reconstruction of PWM representation from sequence logo [47].
The PWM is also used for the binding specificity of a transcription factor (TF) [48].It can be used to scan a sequence for the presence of DNA words, which are comparatively more similar to the PWM than to the background [49,50].Authors in [51] evaluate the Bayesian network and support vector machine (SVM) algorithm on four distinct TF binding sites based data sets and analyze their performance using PWM.Authors in [52] develop a tree-based PWM algorithm to accurately simulate the interaction between TF and its binding sites.A new di-nucleotide PWM approach is proposed in [53], which outperforms the conventional mono-nucleotide PWM method.Moreover, the research done in [54] proposes an improved position weight matrix (IPWM) method to recognize the prokaryotic promoters based on an entropy measurement.Using the Hepatitis C Virus (HCV) nucleotide sequences, authors in [55] designed a global PWM for the genotype of HCV genomes.Then using the PWM, signatures were selected from the 5' NCR, CORE, E1, and NS5B regions of the HCV genome.The predictive power of the selected signatures is then evaluated for predicting the most common HCV genotypes and subtypes.
Aside from DNA analysis, the PWM can also be applied to amino acid sequences.Authors in [56] develop an approach involving position specific scoring matrix (PSSM), another name of PWM, to predict protein-protein interactions between protein sequences.They first transform each protein into a PSSM, and then adopt the PSSMs to detect distantly related proteins as well as the protein quaternary structural attributes and protein folding patterns.The research of in [57] proposes a PWM-based algorithm to predict signal peptide sequences and their cleavage positions in the amino acid sequences of bacteria and eukaryotes.Authors in [58] develop a PWM-based method for protein function prediction and propose an argument for why the PWM and associated features have great potential for protein sequence analysis.Although the above methods are successful in respective domains, they do not propose a general method to design a feature embedding for the underlying sequence, which contains rich information about the sequence and can be used as input to various machine learning algorithm.
Designing efficient feature vector based representations haave been studied in many domains such as graph analytics [59,60], smart grid [61,62], electromyography (EMG) [63], clinical data analysis [64], network security [65], and text classification [66].After the spread of COVID-19, efforts have been made to study the behavior of the virus using machine learning approaches.Several methods have been proposed recently for the classification of spike sequences.Authors in [31,67] uses k-mers along with kernel based approach to classify the spike sequences.Authors in [30] propose the use of one-hot encoding to classify the viral hosts of coronavirus using spike sequences only.Although they were able to achieve higher predictive performance, authors in [31] shows that k-mers based approach performs better than the one-hot since it preserves sequence order information more effectively.An efficient clustering of the spike sequences is done in [33].

Proposed Approach
In this section, we propose an approach, PWM2Vec, to generate a fixed-length numerical feature embedding from coronavirus spike sequences for the purposes of host-specification.We also discuss the baseline approaches, specifically one-hot embedding (OHE) [30,32] and k-mers based feature embedding [31,32].We perform feature selection using ridge regression [68] on the resulting embedding before applying machine learning (ML) algorithms.This helps to reduce the dimensionality of the embedding, hence the training time of the downstream ML algorithms.

One-Hot Embedding (OHE)
Machine learning algorithms requires the input to be in a numerical format, it is necessary to processes the (spike protein) sequence data into some numerical representation to apply these algorithms.One-hot embedding (OHE) [30] is a typical approach for obtaining a fixed-length numerical representation from sequence data.Considering an alphabet Σ, which contains the characters (amino acids) of the spike protein sequence, we need to map each character of Σ to a numerical (binary 0-1 vector) representation.We have 24 unique amino acids in the protein sequence data, namely, "ABCDEFGHIJKLMNPQRSTVWXYZ".For each character in the protein sequence, we design a feature vector wherein we encode each symbol separately.Each symbol has length 24 and has a value of 1 at the location corresponding to the position of the character in the alphabet, and 0 at all other places in the vector.For example, amino acid Cysteine (C) is encoded as 001 . . .0. We then concatenate the numerical representations all characters of the protein sequence into a single binary feature vector of this spike sequence.In our coronavirus spike protein sequence dataset, after multiple alignment, the length of each spike sequence is 3498.Therefore, the length of each binary vector computed using OHE would be 3498 × 24 = 83, 952.

k-mers based Frequency Vectors
One of the major drawbacks of OHE is the high dimensionality of the resulting set of feature vectors.Another problem with OHE is that some (sequential) ordering information on the characters (amino acids) of the sequence is not preserved.An approach which addresses both of these problems is to use sub-strings (also called mers) of length k, i.e., k-mers.From a sequence, k-mers are generated by applying a sliding window of size k over the sequences (see Figure 2).Given a sequence of length N, the total number of k-mers that could be generated is as follows: Remark.In our experiments to generate k-mers based frequency vectors, we use k = 3 (as done in [31,32]).After generating the k-mers, we create a feature vector Φ (a frequency vector), which contains the frequency (count) of each k-mer occurring in the sequence [31,32].Given some sequence σ on alphabets Σ, the length of feature vector Φ k (σ) will be |Σ| k .Since we are working with spike protein (amino acid) sequences, and taking k = 3 in our experiments, the length of the feature vector we use is 24 3 = 13824.This feature vector can be used as input for various ML algorithms.Note that generating sub-strings with a sliding window preserves some (sequential) ordering information on the characters (amino acids) of the (spike protein) sequences, which counters one of the drawbacks of the OHE approach.But still to get the frequency count for k-mers we got a problem of high computational cost for bin matching, especially for worst case searches.Also the dimensionality of the frequency vectors is still high.

PWM2Vec
Despite the fact that the problem of high dimensionality of the vectors generated in the OHE approach is somewhat mitigated in the k-mers approach, the dimensionality of the frequency vectors generated in the k-mers approach remains quite high -further improvements can certainly be made.Also, if we can reduce the computational cost for bin matching that would be huge improvement in computational cost.
To address these problems, we propose PWM2Vec, an approach for generating a fixed-length numerical feature vector based on the concept of the position-weight matrix [69].
While our approach is inspired by the value of the PWM for finding motifs in (e.g., protein) sequences, we use it in a slightly different way here.Here, we build a PWM from the k-mers of our sequence, and our feature vector is the score of each k-mer from this PWM.This allows us to take advantage of k-mers in their ability to capture locality information, while also capturing the importance of the position of each amino acid in the sequence (information that is lost in computing k-mer frequency vector).Combining these pieces of information in this way allows us to devise a compact and general feature embedding that can be used with many downstream ML tasks.
Our approach PWM2Vec for feature vector generation, is summarized in Figure 3, we follow the steps (a-h) as explained below.Figure 3 (a) Given an input spike protein sequence , Figure 3 (b) we first extract the k-mers (we use k = 9 in the experiments, which is decided using a standard validation set approach [70]).Figure 3 (c) We then generate a position frequency matrix , which contains the frequency count for each character at each position.Note that, in the example, since the (amino acid) sequence is composed of four characters, there are four possible characters at any position.At position 1, for example, in all 5 k-mers, there are 2 B characters, and so the frequency count of B at position 1 is 2. In our experiments, since we have 24 unique amino acids in our spike protein sequence dataset, our PFMs will have 24 rows and k = 9 columns.Figure 3 (d) Next, we normalize the PFM matrix, and create a position probability matrix (PPM) containing the probability of each (amino acid) character at each position.For example, the probability of B in the k-mers at position 1 is the following: It is possible that the frequency (hence probability in the PPM) of a character at a certain position is 0. In order to avoid 0 values at any position in the matrix while calculating the probability, we add a Laplace estimator, or a pseudocount to each value in the position probability matrix as shown in Figure 3 (e).We use a pseudocount of 0.1 in our experiments [71].We then compute a position weight matrix (PWM) from the adjusted probability matrix.We make the PWM by computing the log likelihood of each amino acid character c, i.e., c ∈ A, B, C, . . ., Z, appearing at each position i according to Note that this likelihood is taken under the assumption that the expected frequency of each amino acid is the same (i.e., p(c) = 1/|Σ|).since we have 24 bases amino acids(p(c) = 1/24 = 0.041).Figure 3 (f) shows the computed position weight matrix (PWM).
After getting the PWM, we use it to compute the absolute scores for each individual k-mer generated from the sequence (see Figure 3 (g) for an example).It is sum of score of bases for the index.Calculating the absolute score is shown in Figure 4 for k-mer (BFDBEDDFF).The highlighted values in matrices are summed up to give Absolute score for the k-mer which sums up to 28.28.
Finally, the scores of all the 9-mers are concatenated to get the final feature vector for the given sequence (see Figure 3 (h) for an example).This whole process is repeated for each sequence.Given a k-mer and PWM, the score for that k-mer can be computed as given in Figure 4.The final length of PWM2Vec based feature vector is 3490, which is equal to the number of k-mers in each spike sequence.

Feature Selection Method
We use Ridge Regression (RR) as the feature selection approach, which is a commonly used for estimating parameters thereby addressing collinearity in multiple linear regression model problems [72,73].This method uses a bias to boost the performance of the model by improving the variance and making the slope more horizontal like.This is useful when we need to find out which of the independent attributes are not needed.This gives us the option of removing such columns (attributes) and bring the slope to zero (see Figure 5).The expression for performing ridge regression is the following: where α × slope 2 is a penalty term.The total number of selected attributes after performing RR in OHE is 22322, for k-mers approach is 7088 while 1616 features are selected for the PWM2Vec based approach.

Experimental Setup
In this section, we describe the setup we used for the experiments followed by the dataset statistics.We also give visual representation of the data using t-SNE plots.All experiments are conducted using an Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.40GHz having Windows 10 64 bit OS with 32 GB memory.We Implemented our algorithm in Python and the code is available online for reproducibility 1 .Our pre-processed data is also available online2 .For classification, we use Support Vector Machine (SVM), Naive Bayes (NB), Multiple Linear Regression (MLP), K Nearest Neighbors (KNN), Random Forest (RF), Logistic Regression (LR), and Decision Tree (DT) with default parameters.

Evaluation Metrics
To measure the performance of the classifiers, we use average accuracy, precision, recall, weighted and macro F1, and receiver operating characteristic curve "ROC" area under the curve "AUC" (one-vs-rest approach).We also measure the training time (in seconds) for each of the classifier.
For clustering, we use simple k-means algorithm and use 3 internal clustering quality metrics namely Silhouette Coefficient, Calinski-Harabasz Score, and Davies Bouldin Score to measure the performance of the clusters.We also show the runtime for k-means for different embedding methods.

Silhouette Coefficient
The silhouette coefficient [74] is used for interpretation and validation of consistency within clusters of data.A clustering algorithm having well-defined (comparatively pure) clusters will have a higher silhouette coefficient value.The Silhouette Coefficient (SC) is computed as follows: where x is the average distance between a sample and all other points in the data belonging to the same class and y is the average distance between sample x and all other data points in the next nearest cluster.

Calinski-Harabasz Score
The Calinski-Harabasz score [75] is used to measure the quality of a clustering algorithm based on the mean between and inside cluster sum of squares.A clustering algorithm with well defined clusters will have a higher Calinski-Harabasz score.The Calinski-Harabasz score is defined as the ratio of the between-clusters dispersion (the sum of distances squared) mean and the within-clusters dispersion.More formally, given a dataset D of size n D that has been clustered into j clusters, we use following expression to compute Calinski-Harabasz (CH) score: where tr(B j ) is the trace of the between cluster dispersion matrix, tr(W j ) is the trace of the within-group dispersion matrix.

Davies-Bouldin Score
The Davies-Bouldin (DB) score [76] of a clustering C is defined as follows: where D ij is the ratio of the "within-to-between cluster distance" of the ith and jth clusters.For each cluster, we compute the worst case ratio (D ij ) of a within-to-between cluster distance between it and any other cluster, and then take the mean.Therefore, by minimizing the DB score, we can make sure that different clusters are separate from each other (smaller value is better).

Dataset Collection and Statistics
The Spike protein sequences of CoVs for all hosts used in this analysis were retrieved (on September 21, 2021) from the NIAD Virus Pathogen Database and Analysis Resource (ViPR) [77] and GISAID [24].A total of 5568 complete protein sequence were collected (3358 from ViPR and 2210 from GISAID), later dropping 10 not attributable to any host detail.The distribution of the dataset across the different host types (grouped by family) is shown in Table 1, which contains information about the 21 host types that we collected from the annotation of the sequences.We also divided the viral sequences themselves into genus and subgenus to see which category a specific coronavirus belongs to. Figure 6 and Figure 7 contains distributions of viral genus and subgenus, respectively.The multiple sequence alignment (MSA) for the sequence dataset was conducted using Mafft Alignment software with default parameter settings which automatically select the appropriate strategy according to the sequence data size.In our case, the gap opening penalty op was 1.53 and gap extension penalty ep was 0.123 [78].Given that our dataset was already sufficiently large, and contained a number of unknown or identified amino acids, we were constrained to using the minimum accuracy parameter of Mafft MSA in order to allow the alignment to complete in a reasonable amount of time.Attempts to set more stringent parameters (op and ep) in order to improve this alignment resulted in runtimes > 24 hours.Since this is already the case when performing multiple alignment of even 5000 sequences, anything substantially larger is out of reach.

Data Visualization
In order to see if there is any natural (hidden) clustering in the data, we used t-distributed stochastic neighbor embedding (t-SNE) [79] approach, which maps input sequences to 2D real vectors.The t-SNE plots for different embedding methods are shown in Figure 8, Figure 9, and Figure 10 contains the t-SNE plots for OHE, k-mers, and PWM2Vec, respectively.We can observe that although with PWM2Vec, more information is included in less dimensional feature vectors, the proposed embedding approach was able to preserve the structure of data similar to OHE and k-mers.

Results and Discussion
In this section, we present our results for PWM2Vec and compare its performance with the baseline One-Hot Embedding (OHE) and the more recent k-mers based embedding approach which has shown to be an improvement over OHE [31,32].For classification, we also show the results for feature selection method (ridge regression) for all embedding approaches.We also show the effect of runtime with increasing number of sequences for the best performing classification algorithm.In the end, we show some statistical analysis on the data as well as on the feature vectors computed using different approaches.

Classification Results
Table 2 shows the results for different embedding methods with different classification methods without performing any feature selection on the feature vectors.We can see that RF with PWM2Vec is consistently performing better than other embedding methods (in some cases the performance gain for PWM2Vec on third significant digit).Also in terms of training runtime, the NB classifier with PWM2Vec is much better than other approaches.This behavior shows that PWM2Vec is not only better in terms of predictive performance, but is also better in terms of runtime.

Effect of Runtime
To evaluate the effect of runtime on the sequences, we use the best performing classifier (Random Forest) and use different embedding methods to perform classification with increasing number of sequences.Figure 11 shows the effect of runtime for (a) OHE vs PWM2Vec and (b) k-mers vs PWM2Vec.We can see that in both cases, PWM2Vec is better in terms of runtime as we increase the number of sequences (on x-axis).

Clustering Results
For clustering, we use the same feature embeddings that we use for classification task.To get the optimal number of clusters, we used the Elbow method [33].This method for different number of clusters (ranging from 2 to 30) is performing clustering to see the trade-off between the runtime and the sum of squared error (distortion score).The optimal number of clusters selected are 9 (see Figure 12).

Statistical Analysis on Data
To measure the importance of amino acids corresponding to the class label, we use the information gain (IG).The IG is defined as follows: where H is the entropy, and p i is the probability of the class i. Figure 13 shows the IG values for different amino acids corresponding to the class labels (hosts).We can see that some amino acids have higher IG values, which means that they play an important role towards the prediction of hosts.Here we can conclude that many amino acids contribute to the host specification as compared to that for SARS-CoV-2 variants specification [31], which is expected, since the genomic variability within the family coronavidiridae should be much higher.The IG values for all amino acids are also available online for further analysis 3 .
0 500 1,000 1,500 2,000 2,500 3,000 3,500 0 0.5 Since information gain does not give us the negative (or opposite) contribution of an attribute (feature) corresponding to the class label (host names), we use other statistical measures such as Pearson correlation [80] and Spearman correlation [81] to further evaluate the contribution of features in PWM2Vec based feature vector.The Pearson Correlation is computed using the following expression: where r is the correlation coefficient, x i is the values of the x-variable in a sample, x is the mean of the values of the x-variable, y i is the values of the y-variable in a sample, and ȳ is the mean of the values of the y-variable.The Spearman Correlation is computed using the following expression: where ρ is the Spearman's rank correlation coefficient, d i is the difference between the two ranks of each observation, and nis the total number of observations.The Pearson correlation and corresponding P-values for PWM2Vec are given in Figure 14 while the Spearman correlation and corresponding P-values for PWM2Vec are given in Figure 15.We can observe that most of the features are contributing towards the prediction of different hosts.Table 5 contains the comparison of correlation values computed using Pearson correlation and Spearman correlation for different embedding methods.Here, we can observe that with less dimensional and more compact approach for feature embedding (PWM2Vec), the fraction of features having correlation values greater than the threshold (i.e.0.3 and −0.3) are greater than those given for OHE and comparable with those given for k-mers (sometimes better also).This behavior shows that using PWM2Vec, we are able to preserve more information in a smaller feature vector and improve the runtime of underlying ML algorithms while giving better (sometimes comparable) predictive performance.

Conclusion
We propose an approach, called PWM2Vec to generate feature vector representation for different hosts of different coronaviruses using spike sequences only, which can be used as input for different machine learning algorithms such as classification and clustering.We show that our approach is not only efficient to generate feature vectors as compared to the popular method based on k-mers, but has comparable prediction accuracies and much better training runtime.This behavior is also observed after applying feature selection algorithm.We also provided some statistical analysis on the data and feature vectors to show the importance of attributes towards the prediction of class labels (hosts).This statistical analysis provides a validation, from an independent point of view (in terms of fraction of features statistically correlated to the label), of the compactness of our PWM2Vec embedding, compared to the baselines.
In the future, we would focus on collecting more data to evaluate the scalability of the PWM2Vec.Using unsupervised methods for dimensionality reduction is another future extension of this work.We would also like to use deep learning models such as LSTM and GRU for the classification purposes in the future.The application of this to larger families of viruses could also be another interesting future direction.

Figure 1 .
Figure 1.The genome of a coronavirus ranges from 26-32kb in length [5], each of them coding for two non-structural and four structural proteins.The non-structural proteins are coded by open reading frame (ORF) 1a and 1ab, where ORF1ab contains the RNA-dependent RNA polymerase gene (RdRp).The structural genes include spike (S), envelope (E), membrane (M) and nucleocapsid (N).The S gene region encodes the spike protein, which is responsible for attaching the virus to receptors on the host cell membrane.

Figure 3 .
Figure 3. Building feature vector for the classification models by computing the Position Weight Matrix from the k-mers of a sequence.

Figure 4 .
Figure 4. Computing the Score from PWM using sliding window on a sequence for a 9-mer.

Figure 5 .
Figure 5. Ridge Regression diagram.With alpha value tending to a large positive number, the slope will tend to zero thereby reducing variance in the dataset.

Figure 6 .
Figure 6.Pie chart for the distribution of different Genus in the dataset.

Figure 7 .
Figure 7. Pie chart for the distribution of different SubGenus in the dataset.

Figure 12 .
Figure 12.Elbow method for optimal number of clusters For the purposes of clustering, we use the simple k-means algorithm.The results of clustering on different embedding methods are shown in Table4.We can see that PWM2Vec is better in terms of

Figure 13 .
Figure 13.Information gain for each amino acid position with respect to hosts.

Table 3
contains the classification results after applying ridge regression classifier of different feature embedding approaches.We can again see that RF with PWM2Vec outperforms other embedding methods for majority of the metrics and NB with PWM2Vec is best in terms of runtime.

Table 2 .
Performance comparing for different embedding methods and different classifiers without using any feature selection approach.Best values are shown in bold.

Table 3 .
Runtime comparison for different embedding methods with increasing number of sequences using random forest classifier (best performing classifier).The figure is best seen in color.Performance comparing for different embedding methods and different classifiers using ridge regression as the feature selection approach.Best values are shown in bold.

Table 4 .
We can see that PWM2Vec is better in terms of Silhouette Coefficient and runtime while k-mers based approach is better in terms of Calinski-Harabasz Score and Davies-Bouldin Score.

Table 4 .
Internal Clustering quality metrics for k-means.Best values are show in bold.

Table 5 .
Correlation values for different embedding approaches computed using Pearson Correlation and Spearman Correlation.We show the count (No.) and fraction (frac.) of features values greater than or less than the threshold (0.3 or -0.3).The fractions are computed by taking denominator as the size of embeddings (83952 for OHE, 13824 for k-mers, and 3490 for PWM2Vec).