Machine Learning Methods for COVID-19 Prediction Using Human Genomic Data

Abstract: Accurate identification of COVID-19 has become a critical task because the disease has seriously damaged daily life, public health, and the economy. Identifying infected people is essential to prevent further spread of the pandemic and to treat patients quickly. Machine learning techniques play a significant role in predicting COVID-19. In this study, we performed binary classification (COVID-19 vs. other types of coronavirus) by extracting features from genome sequences. Support vector machines, naive Bayes, K-nearest neighbor, and random forest methods were used for classification. We used viral gene sequences from the 2019 Novel Coronavirus Resource database. Experimental results show that the random forest method achieved 93% accuracy.


Introduction
Coronaviruses, known to include some of the largest viral genomes (about 30,000 bp in length), are single-stranded positive-sense RNA viruses [1]. The family of coronaviruses contains four genera: alphacoronavirus, betacoronavirus, gammacoronavirus, and the recently defined deltacoronavirus. Although alphacoronavirus and betacoronavirus are able to infect mammalian hosts, gammacoronavirus and deltacoronavirus mainly infect avian species [2]. Severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), which belong to betacoronavirus, are human coronaviruses causing highly pathogenic outcomes. Both coronaviruses can be transmitted to humans due to their zoonotic nature, and cause symptoms such as viral pneumonia, fever, and breathing difficulties [3]. An unrecognized pneumonia disease, which is thought to have originated from a local seafood market in December 2019, caused an outbreak in Wuhan, China. The causative virus diverged sufficiently from SARS-CoV to be considered a new human-infecting betacoronavirus; the disease was named COVID-19, and the virus was officially named SARS-CoV-2 [1].
Sequence alignment methods, such as BLAST [4] and FASTA [5], perform classification using viral sequencing techniques. These methods are based on the assumption that DNA sequences share common features [6]. Although alignment-based methods are successful in detecting similarities, their application can be challenging in most cases [7]. In particular, analyzing thousands of complete genomes using alignment-based methods is computationally expensive. To overcome these difficulties, alignment-free methods have been introduced [8,9]. Recent studies have shown that machine learning techniques can be applied successfully to virus classification [10,11]. Reyes, Avino, and Kari [10] proposed an open-source supervised alignment-free method operating on k-mer frequencies in HIV-1 sequences. They used support vector machines, multilayer perceptron, and logistic regression for classification. They demonstrated classification accuracies over 90% in all cases for full-length genome datasets of hepatitis B, hepatitis C, and influenza A viruses. Randhawa, Hill, and Kari [3] proposed a combination of supervised machine learning with digital signal processing for accurate and scalable genome annotation. They mapped genomic sequences into discrete values in order to apply digital signal processing techniques. They classified plastic genomes of viruses such as dengue and influenza accurately. Reyes et al. [12] proposed an alignment-free method based on intrinsic genomic signatures that delivers highly accurate real-time taxonomic predictions. They used a decision tree method and confirmed the results with Spearman's rank correlation coefficient analyses.
Wang et al. [13] reported that COVID-19 has extremely low CG abundance in its open reading frame. They found that CG reduction in COVID-19 can be achieved by mutating C/G into A/T. Based on this idea, in this study, we used CpG island features to identify COVID-19 sequences. We applied four machine learning techniques: support vector machines, naive Bayes, k-nearest neighbor, and random forest. Results were evaluated on the 2019 Novel Coronavirus Resource (2019nCoVR) repository [14].

Material and Method
In this section, we first explain how genome sequences were retrieved. Second, we explain how distinguishing features were extracted. Finally, we give an overview of the machine learning algorithms that we used for prediction of COVID-19.

Dataset
The 2019 Novel Coronavirus Resource (2019nCoVR) by China's National Center for Bioinformation [14] collects public coronavirus sequences from various databases, including NCBI, NMDC, GISAID, and CNCB/NGDC. We downloaded 1000 available COVID-19 sequences in August 2020. For non-COVID-19 sequences, 2019nCoVR includes alphacoronavirus, betacoronavirus-1, human coronavirus 229E, human coronavirus HKU1, and human coronavirus NL63 species. We downloaded all 334 available human coronavirus sequences other than COVID-19 in August 2020. Properties of the sequences are given in Table 1. All sequences were complete genome sequences of about 30 kbp, and the host was restricted to Homo sapiens.

Feature Extraction
The choice of discriminative features is a critical step for improving recognition performance, and it depends on the characteristics of the COVID-19 virus. Based on the assumption that SARS-CoV-2 exhibits a strong absence of CpG [13,15], we propose the use of CpG island features [16,17], extracted by using Equations (1) and (2),
where p(C), p(G), and p(CG) are the percentages of C, G, and CG in a sequence. Thus, each sequence is represented by two CpG island features. Figure 1 illustrates an example of computing the features from a sequence.
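Equations (1) and (2) are not reproduced in this text, so the following is only a minimal sketch of how two such features could be computed. It assumes the widely used CpG island quantities, namely the GC content p(C) + p(G) and the observed-to-expected CpG ratio p(CG) / (p(C) p(G)); these definitions and the function name cpg_features are assumptions for illustration and may differ from the equations actually used in the study.

```python
def cpg_features(sequence):
    """Return (gc_content, cpg_ratio) for a nucleotide sequence.

    Assumed definitions (not confirmed by the source text):
      gc_content = p(C) + p(G)
      cpg_ratio  = p(CG) / (p(C) * p(G))
    """
    seq = sequence.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")

    p_c = seq.count("C") / n          # fraction of cytosine
    p_g = seq.count("G") / n          # fraction of guanine
    # There are n - 1 dinucleotide positions; "CG" cannot overlap itself,
    # so str.count gives the exact number of CG dinucleotides.
    p_cg = seq.count("CG") / (n - 1)

    gc_content = p_c + p_g
    # Guard against division by zero when C or G is absent.
    cpg_ratio = p_cg / (p_c * p_g) if p_c > 0 and p_g > 0 else 0.0
    return gc_content, cpg_ratio


if __name__ == "__main__":
    print(cpg_features("ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTC"))
```

Under these assumed definitions, a sequence with strong CpG depletion would yield a low ratio, which is the property the classification relies on.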

Machine Learning Algorithms
The task was to classify each genome sequence obtained from a human host as COVID-19 or not. Various machine learning techniques can be used for such a classification. In this study, support vector machines, naive Bayes, K-nearest neighbor, and random forest were used.

Support Vector Machines
The support vector machine (SVM) method is a supervised nonparametric statistical learning technique and therefore makes no assumption about the underlying data distribution. It has various advantages, such as the sparsity of the solution, global optimization, solid theoretical foundation, generalization, and nonlinearity. In the original formulation of SVMs, the method finds an optimal separating hyperplane from a set of observations with known labels (i.e., the training set) by maximizing the margin between two classes. The term optimal separating hyperplane refers to the decision boundary minimizing misclassifications. The data points that lie on the margin are called support vectors. New unlabeled data are assigned to a class based on their geometric position relative to the classifier function. In practice, data points belonging to different classes may overlap one another, which makes linear separation difficult. Two remedies are used to address this inseparability problem: the soft margin method, which introduces slack variables, and the kernel trick [18].
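The soft margin idea can be illustrated with a small linear example. The sketch below is not the Weka implementation used in the study; it assumes a linear kernel and fits the weights by plain subgradient descent on the regularized hinge loss, and the function names and hyperparameter values are hypothetical. The kernel trick is not shown.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b for a soft-margin linear SVM.

    Labels in y must be -1 or +1. The objective is the L2-regularized
    hinge loss, minimized here by subgradient descent.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified:
                # the hinge term contributes to the subgradient.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Only the regularizer contributes.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict_svm(w, b, x):
    """Assign a class from the side of the hyperplane on which x lies."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


if __name__ == "__main__":
    X = [[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]]
    y = [-1, -1, 1, 1]
    w, b = train_linear_svm(X, y)
    print([predict_svm(w, b, x) for x in X])
```

The regularization strength lam plays the role of the trade-off between a wide margin and few margin violations.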

Naive Bayes
Naive Bayes (NB) is a frequently used machine learning classification algorithm based on Bayes' theorem, which provides a way to evaluate explicit probabilities for any hypothesis. The theorem states that

$$P(h \mid T) = \frac{P(T \mid h)\, P(h)}{P(T)}$$

where P(h) denotes the prior probability of hypothesis h, P(T) is the prior probability of the training data T, P(T|h) is the probability of T given h, and P(h|T) is the probability of h given T. Among the candidate hypotheses, the most probable one is selected.
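As a minimal sketch of how this rule turns into a classifier, the code below assumes Gaussian class-conditional likelihoods for numeric features; the source does not state which likelihood model was used, and the function names are hypothetical. Because P(T) is the same for every hypothesis, it is omitted and only log P(h) plus the log-likelihoods are compared.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate per-class prior, feature means, and feature variances."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)

    model = {}
    for label, rows in groups.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor avoids division by zero for constant features.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_nb(model, x):
    """Return the label h that maximizes log P(h) + sum_j log P(x_j | h)."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    return max(model, key=lambda label: log_posterior(model[label]))


if __name__ == "__main__":
    # Toy feature vectors, invented for illustration only.
    X = [[0.50, 0.40], [0.52, 0.45], [0.48, 0.42],
         [0.38, 0.90], [0.40, 0.95], [0.36, 0.88]]
    y = ["covid"] * 3 + ["other"] * 3
    print(predict_nb(fit_gaussian_nb(X, y), [0.49, 0.43]))
```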

K-Nearest Neighbor
K-nearest neighbor (KNN) is known as one of the simplest nonparametric classifiers. It is a lazy learning algorithm and does not require an explicit training phase. KNN assigns a new observation to the class that receives the majority of votes among its k nearest neighbors [19]. In this step, a Euclidean-like distance is used. The optimal value of k can be determined using a cross-validation technique.
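The voting rule described above can be sketched in a few lines. This is an illustrative version that assumes plain Euclidean distance and a fixed k; the function name knn_predict and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to x with its label,
    # then sort so the closest points come first.
    neighbors = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy feature vectors, invented for illustration only.
    X = [[0.50, 0.40], [0.52, 0.45], [0.48, 0.42],
         [0.38, 0.90], [0.40, 0.95], [0.36, 0.88]]
    y = ["covid"] * 3 + ["other"] * 3
    print(knn_predict(X, y, [0.49, 0.43], k=3))
```

There is no fitting step: all the work happens at prediction time, which is why the method is called lazy.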

Random Forest
The random forest (RF) classifier is an ensemble machine learning algorithm for classification that combines many decision trees. It uses the bootstrap aggregating (bagging) method for training, so each tree is built from a randomly selected subset of the data. When feature vectors are given as input, every tree in the forest produces its own prediction. The algorithm then tallies the votes of the decision trees, and the majority decides whether a sequence is predicted as COVID-19 or not.
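The bagging and voting steps can be illustrated with a deliberately simplified sketch. It assumes depth-one trees (a single threshold on one randomly chosen feature per tree), whereas a full random forest grows deeper trees and samples candidate features at every split. All function names, the tree count, and the seed are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Depth-one tree: best threshold on one feature, majority label per side."""
    majority = Counter(y).most_common(1)[0][0]
    best = (len(y) + 1, None, majority, majority)  # (errors, threshold, left, right)
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
        if errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    _, t, l_lab, r_lab = best
    return feature, t, l_lab, r_lab


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample with a randomly chosen feature."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(dim)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, x):
    """Tally one vote per tree and return the majority label."""
    votes = Counter()
    for feature, t, l_lab, r_lab in forest:
        if t is None:  # degenerate sample with a single feature value
            votes[l_lab] += 1
        else:
            votes[l_lab if x[feature] <= t else r_lab] += 1
    return votes.most_common(1)[0][0]
```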

Results
We were interested in the effectiveness of CpG island features in COVID-19 classification. After the CpG island features were extracted using Equations (1) and (2), they were classified using four machine learning techniques: support vector machines, naive Bayes, k-nearest neighbor, and random forest. The Weka 3.8.4 tool [20] was used to perform the machine learning classifications. The numerical results were obtained on a computer with a Linux operating system, 16 GB RAM, and a 2.7 GHz processor. The performance of each classifier was measured in terms of precision, recall, F-measure, and accuracy. A tenfold cross-validation strategy was applied, and the results are reported in Table 2. Moreover, Figure 2 visualizes the precision, recall, F-measure, and accuracy values. The maximum classification accuracy was 93%, which was obtained using random forest with CpG island-based features. The machine learning models used in this study, combined with the proposed features, predicted COVID-19 sequences with high accuracy, which supports the effectiveness of the proposed method.
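For reference, the four reported measures and a tenfold split can be sketched as follows. This is not the Weka procedure itself: the sketch uses contiguous, unshuffled folds without stratification, and the function names are hypothetical. The figure 1334 in the example is the sum of the 1000 COVID-19 and 334 non-COVID-19 sequences described in the Dataset section.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute precision, recall, F-measure, and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}


def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds of nearly equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


if __name__ == "__main__":
    print(binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1))
    print([len(test) for _, test in k_fold_indices(1334, k=10)])
```

In each of the ten rounds, a classifier would be trained on train_idx and scored on test_idx, and the per-fold results would then be aggregated.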

Conclusions
In this study, we classified genome sequences obtained from human hosts as COVID-19 or not using four machine learning methods: support vector machines, naive Bayes, k-nearest neighbor, and random forest. Experimental results showed that the k-nearest neighbor and random forest methods with genome-based features gave remarkable results, reaching 92% and 93% accuracy, respectively. In future studies, we will compare COVID-19 sequences coming from humans to other types of coronavirus sequences, such as those coming from musculus, and propose a similarity-based feature.
Institutional Review Board Statement: This study used a publicly available dataset, was conducted according to the guidelines of the Declaration of Helsinki, and was approved by China National Center for Bioinformation, China.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study that generated the publicly available dataset.
Data Availability Statement: A publicly available dataset was analyzed in this study. The data can be found at 2019nCoVR: https://bigd.big.ac.cn/ncov/?lang=en

Conflicts of Interest: The author declares no conflict of interest.