Development of a Smartphone-Based Expert System for COVID-19 Risk Prediction at Early Stage

COVID-19 has imposed many challenges and barriers on traditional healthcare systems due to the high risk of infection by the coronavirus. Modern electronic devices such as smartphones, combined with information technology, can play an essential role in handling the current pandemic by contributing to different telemedical services. This study has focused on determining the presence of this virus by employing smartphone technology, as it is available to a large number of people. A publicly available COVID-19 dataset consisting of 33 features, all of which can be collected at an in-house facility, has been utilized to develop the aimed model. The chosen dataset has 2.82% positive and 97.18% negative samples, demonstrating a high imbalance of class populations. The Adaptive Synthetic (ADASYN) sampling method has been applied to overcome this class imbalance problem. Ten optimal features are chosen from the 33 available features by employing two different feature selection algorithms, Select K Best and Recursive Feature Elimination (RFE). Mainly, three classification schemes, Random Forest (RF), eXtreme Gradient Boosting (XGB), and Support Vector Machine (SVM), have been applied for the ablation studies, in which the XGB, RF, and SVM classifiers achieved accuracies of 97.91%, 97.81%, and 73.37%, respectively. As the XGB algorithm confers the best results, it has been implemented in an Android-based mobile application and a web application. By analyzing a primary suspect's answers to a 10-item questionnaire, the developed expert system can predict the presence of COVID-19 in the human body. The preprocessed data and codes are available on the GitHub repository.


Introduction
COVID-19 is a highly contagious disease caused by the recently discovered severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It is highly infectious, with alarming characteristics, leading to the current worldwide pandemic [1]. The typical symptoms, such as fever, headache, shortness of breath, dry cough, and loss of smell, can vary from person to person; for instance, the disease sometimes raises no symptoms at all in some groups of people. It is noteworthy to mention that smokers and drug-addicted people are highly affected by this novel virus. However, unaware interaction with surrounding […].

The main contributions of this study are:
1. Excluding redundant features from the utilized dataset by applying two different feature selection algorithms for the ablation studies, Select K Best and Recursive Feature Elimination (RFE);
3. Performing a comprehensive comparison, in terms of different evaluation metrics, among different ML models to select the best-performing classifier for the aimed task;
4. Finding the association between the significant features with the Apriori algorithm;
5. Developing an intelligent application system with both smartphone and web interfaces to prognosticate COVID-19.

Analysis Procedure
The workflow diagram of this study is displayed in Figure 1, where the proposed method has been categorized into several steps. Firstly, the dataset has been preprocessed for the subsequent steps in the proposed framework; this includes missing-value imputation and re-balancing the dataset, followed by train-test splitting. Secondly, the feature selection method retains the essential, non-redundant attributes. Thirdly, the XGB, RF, and SVM classifiers categorize the Covid and non-Covid patients. Finally, an expert system application was developed from the best-performing ML model for predicting COVID-19. The step-by-step methodology for conducting this study is described in greater detail in the following subsections.

Preprocessing
This article deals with a practical COVID-19 dataset. It requires several preprocessing steps for imputing missing values and re-balancing the imbalanced data before engaging in the final ML-based classification. Moreover, such preprocessing is likely to enhance the ML models' performance [18]. The Multivariate Imputation by Chained Equations (MICE) algorithm [19] has been applied to deal with the missing values. After that, the ADASYN algorithm has been used to eliminate the class imbalance of the dataset, whereby an adequate amount of artificial data has been created for the minority class. Finally, the feature selection method provides 21,578 instances with only 10 features.
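As a hedged sketch of the imputation step, the snippet below applies scikit-learn's IterativeImputer, which implements a MICE-style chained-equations strategy; the toy clinical-style matrix and its values are assumptions for illustration, not the paper's data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix (temperature, pulse, respiratory rate) with missing entries.
X = np.array([[37.0, 80.0, 18.0],
              [39.5, np.nan, 22.0],
              [36.8, 72.0, np.nan],
              [np.nan, 95.0, 24.0]])

# MICE-style imputation: each feature with missing values is modelled as a
# function of the other features, cycling until the estimates stabilize.
imputer = IterativeImputer(random_state=0, max_iter=10)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Each originally missing cell is replaced by a regression-based estimate rather than a simple column mean, which is what distinguishes MICE from univariate imputation.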
The dataset used to carry out this research has 11,169 instances, with 2.82% positive and 97.18% negative Covid test results. A significant class imbalance therefore exists in the dataset, creating bias while classifying the target variable. This issue can be mitigated by oversampling the minority class, i.e., generating a sufficient amount of synthetic data for it. There are two renowned techniques to oversample minority classes: (i) the Synthetic Minority Oversampling Technique (SMOTE) and (ii) the Adaptive Synthetic (ADASYN) sampling method, a generalization of the SMOTE algorithm. Both algorithms create virtual data to solve the bias issue caused by the class imbalance of the dataset. However, one disparity is that if an outlying minority-class sample appears within the majority class, SMOTE will create a line bridge with the majority class. Another difference is that, while oversampling, ADASYN takes the density distribution into account to decide the number of synthetic instances generated for each minority sample, concentrating on the samples that are more difficult to learn [20]. So, it is evident that the ADASYN algorithm assists in adjusting the decision boundary adaptively, relying on the problematic samples. Because of this convenience, the ADASYN algorithm has been chosen for solving the class imbalance of the dataset.
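To make the density-aware behaviour concrete, here is a minimal, illustrative ADASYN implementation in NumPy, a sketch of the algorithm described in [20] rather than the library code used in the study:

```python
import numpy as np

def adasyn(X, y, minority=1, k=5, beta=1.0, rng=None):
    """Minimal ADASYN sketch: density-aware oversampling of the minority class."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority]
    m_s, m_l = len(X_min), int(np.sum(y != minority))
    G = int((m_l - m_s) * beta)          # total synthetic samples to create
    if G <= 0 or m_s < 2:
        return X, y

    # r_i: fraction of majority-class points among each minority point's k-NN
    d_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    r = np.empty(m_s)
    for i in range(m_s):
        nn = np.argsort(d_all[i])[1:k + 1]       # skip the point itself
        r[i] = np.mean(y[nn] != minority)
    r = r / r.sum() if r.sum() > 0 else np.full(m_s, 1.0 / m_s)
    g = np.round(r * G).astype(int)              # per-point synthetic counts

    # interpolate towards random minority-class neighbours
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    synth = []
    for i, gi in enumerate(g):
        nn = np.argsort(d_min[i])[1:k + 1]
        for _ in range(gi):
            z = X_min[rng.choice(nn)]
            synth.append(X_min[i] + rng.random() * (z - X_min[i]))
    if synth:
        X = np.vstack([X, synth])
        y = np.concatenate([y, np.full(len(synth), minority)])
    return X, y
```

Because the per-point counts g_i grow with the share of majority neighbours, more synthetic points are placed near the decision boundary, which is the adaptive behaviour described above.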

Classifiers
After the preprocessing, mainly three classification algorithms were applied: XGB, RF, and SVM. These classifiers provide improved results in many studies [21][22][23]. Furthermore, studies found significantly improved results by applying XGB to COVID-19 mortality prediction and prepared a clinically operable Covid decision support system using XGB for clinical staff [23,24]. This motivated applying XGB along with the other state-of-the-art classifiers, RF and SVM. Therefore, XGB is mainly described in detail, together with the important hyperparameters of RF and SVM.
XGB is a comprehensive ML system for tree boosting proposed by Chen and Guestrin [25], which won a Kaggle ML competition in 2015. Gradient Boosting is the base model of XGB, in which trees are added over multiple iterations: at each iteration, a new predictor fits the residuals to correct the previous predictor and optimize the specified loss function, as illustrated in Figure 3. In XGB, regularization is applied to the loss function in order to establish the objective function, which is used to assess the model's effectiveness [22,26].
Suppose there is a dataset with n samples, where X_a is an independent variable with m features, i.e., X_a ∈ R^m. For every X_a there is a corresponding dependent variable y_a, hence y_a ∈ R. A tree ensemble model predicts the dependent variable ŷ_a using n additive functions of the independent variables, as shown in Equation (1):

ŷ_a = Σ_{k=1}^{n} f_k(X_a),  f_k ∈ F,  (1)

where each f_n corresponds to an independent tree structure with leaf scores, and F is the space of trees. Minimizing the above ensemble leads to the objective in Equation (2) [25]:

L = Σ_a l(ŷ_a, y_a) + Σ_k Ω(f_k),  (2)

where l is the loss function and Ω is the term that penalizes the complexity of the model, defined in Equation (3):

Ω(f) = γT + (1/2) λ ∥w∥²,  (3)

where T is the number of leaves in a tree and w is the vector of leaf scores.
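The residual-correction idea behind gradient boosting can be illustrated with a self-contained sketch that boosts depth-1 "stumps" on a toy regression task; the data, number of rounds, and shrinkage value are all illustrative, not the study's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(4 * x) + rng.normal(0, 0.1, 200)

pred = np.zeros_like(y)
lr = 0.5  # shrinkage, analogous to XGBoost's learning_rate
for _ in range(100):
    resid = y - pred                       # residuals left by the ensemble so far
    best = None
    for t in np.linspace(0.05, 0.95, 19):  # brute-force stump threshold search
        left, right = resid[x <= t], resid[x > t]
        if left.size == 0 or right.size == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lmean, rmean = best
    pred += lr * np.where(x <= t, lmean, rmean)  # add the shrunken correction

print(np.mean((y - pred) ** 2))  # training MSE after 100 boosting rounds
```

Each round fits only what the current ensemble gets wrong, which is the mechanism the objective in Equation (2) regularizes in XGB.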
In the training phase, 10-fold cross-validation is employed to conduct experiments on the three main ML classifiers. One fold is used as a testing set in the outer loop, and the remaining nine folds are utilized in the inner loop for the model's training and hyperparameter optimization [18]. The Bayesian optimization algorithm has been employed to optimize the hyperparameters of all the classifiers [23]. Table 1 lists the important hyperparameters tuned using Bayesian optimization; the intuition behind Bayesian optimization is illustrated in the following paragraphs. To maximize the performance of the classifiers employed in the classification task, their important hyperparameters must be optimized. In this study, mainly three classification algorithms, Random Forest (RF), Support Vector Machine (SVM), and eXtreme Gradient Boosting (XGBoost), have been utilized. Each classifier has some distinct hyperparameters, which the Bayesian optimization technique has tuned. For instance, the Gamma (γ) and Cost (C) parameters have been tuned for the SVM classifier, where the Radial-Basis Function (RBF) kernel is controlled by the Gamma parameter. Furthermore, the TSVM with linear kernel and the TSVM with RBF kernel have been employed in this research to compare these two variants with the prevailing SVM [27,28]. When using the TSVM, the non-parallel hyperplanes are controlled by tuning several hyperparameters, such as c1, c2, e1, e2, v1, and v2. A lower Gamma value broadens the decision region, and the Cost parameter controls the penalty for misclassifying training instances.
In the case of XGBoost, seven hyperparameters have been optimized: n_estimators defines the total number of boosting stages; max_depth denotes the maximum depth of each estimator; learning_rate controls the contribution of each tree; gamma represents the minimum loss reduction required to make a further split; min_child_weight signifies the least summation of instance weights required in a child; colsample_bytree represents the column subsample ratio per tree; and n_jobs symbolizes the number of parallel threads. In addition, four hyperparameters for RF (criterion, max_depth, max_features, n_estimators) have been optimized, where criterion estimates the split quality and the maximum number of features considered per split is denoted by max_features.
The reason behind choosing Bayesian optimization is that it performs better than random search, grid search, or manual search techniques for tuning the hyperparameters of the classifiers. The most crucial advantage of Bayesian optimization is that it keeps the previous evaluations in memory and tunes the hyperparameters in a probabilistic manner. The important steps of Bayesian optimization are: (1) define a cost function; (2) define the search space; (3) fix the number of iterations; and (4) choose a search algorithm. In this proposed approach, the 10-fold cross-validation loss has been used as the cost function, the search space has been bounded between 0 and 1, the number of iterations has been fixed, and the Tree-structured Parzen Estimator (TPE) algorithm, as implemented in HyperOpt, has been chosen to carry out the optimization process.

Table 1. Important hyperparameters tuned using Bayesian optimization (excerpt).

Hyperparameter | Short Description
XGB classifier:
n_estimators | Number of boosting rounds; gradients from multiple boosting rounds are used to boost these trees

User Application Design
The best-performing ML classifier from the experimental setting is deployed into a pickle file for developing a user-friendly application. The pickle module implements binary protocols so that a Python object structure can be serialized into a byte stream and later deserialized back into an object hierarchy. In real-world scenarios, the practice of pickling and unpickling is widespread, as it allows one to quickly transfer data from one server/system to another and then store it in a file or database [29]. Then, Flask, a Python web framework, is deployed to integrate the pickled model with the web application. Compared to the Django web framework, Flask is lighter and more explicit for small applications and is often considered more Pythonic [23]. Besides, a user-friendly web application has been designed through which end users can easily input their data to test, in a probabilistic fashion, whether they are affected by COVID-19. A higher probability indicates that the user has a higher risk of being COVID-19 affected and vice versa. High-risk users should follow proper guidelines and take medication for COVID-19 as early as possible.
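The serialize/deserialize round trip can be sketched with the standard-library pickle module; ThresholdModel here is a hypothetical stand-in for the trained XGB object, which in the real application would be unpickled once by the Flask app and queried per request:

```python
import pickle

class ThresholdModel:
    """Hypothetical stand-in for the pickled classifier."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict_proba(self, temperature):
        # toy risk score: the further above the threshold, the higher the risk
        return min(1.0, max(0.0, (temperature - self.threshold) / 5.0))

model = ThresholdModel(threshold=37.0)
payload = pickle.dumps(model)      # serialize the object to a byte stream
restored = pickle.loads(payload)   # rebuild the object hierarchy from bytes

print(restored.predict_proba(39.0))  # 0.4
```

In the deployed application, the byte stream would be written to a model.pkl file once after training, so the web server never has to retrain the classifier.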

Experimented Outcome and Discussions
This study comprised two types of analysis: (i) results related to building the classification models, analyzing their performance, and identifying the essential features and the best classifier (see Section 3.1); and (ii) employing the best classifier to develop an expert system, discussed in Section 3.2.

The Outcome of ML Classifiers
In this study, 70.0% of the data was used to train the models and 30.0% to evaluate them. The K Best and RFE algorithms have been applied to the training dataset, each outputting 10 essential features (see Table 2).
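The two selectors can be sketched with scikit-learn as follows; the synthetic 33-feature dataset and the logistic-regression estimator inside RFE are assumptions for illustration (the estimator the study used inside RFE is not specified here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Stand-in data with 33 features, mirroring the dataset's dimensionality.
X, y = make_classification(n_samples=500, n_features=33, n_informative=10,
                           random_state=0)

# SelectKBest scores each feature independently (here with the ANOVA F-test)
# and keeps the 10 highest-scoring ones.
kbest = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# RFE instead fits a model and recursively drops the weakest features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

print(np.where(kbest.get_support())[0])  # indices kept by SelectKBest
print(np.where(rfe.get_support())[0])    # indices kept by RFE
```

Both selectors should be fitted on the training split only, so no information from the held-out 30% leaks into the feature choice.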
From Table 3, it is noteworthy that the Select K Best feature selection method provides better results than the RFE approach. The performance of the XGB classifier [30] has been significantly enhanced by the Select K Best feature selection process. The XGB classifier demonstrates the highest metric values compared to the other ML classifiers, RF and SVM (see Table 3). In the end, the XGB classifier, in conjunction with the Select K = 10 Best features, has provided the best result in this study. Furthermore, the class-wise COVID-19 prognostication results are displayed in the confusion matrix, which is a visual representation of the classification performance. Figure 4a-c depict the confusion matrices for the XGB, RF, and SVM classifiers, respectively.
In addition, the COVID-19 patient classification has also been tested using two well-known variants of the Twin Support Vector Machine (TSVM) [27,31] because of their high-speed computational capacity. Other advantages of the TSVM are that it ascertains two non-parallel hyperplanes, instead of the single hyperplane of the conventional SVM, and that it can automatically unearth two-dimensional projections of the data. The performance of the TSVM is reported in both tabular and pictorial form: Table 4 delineates the comparative performance of the traditional SVM and its two variants, while Figure 4d,e visualize the confusion matrices of the TSVM with linear kernel and the TSVM with RBF kernel. From the tabular and graphical illustrations, it can be concluded that, although the two variants perform better than the conventional SVM, the performance of XGB far outweighs the rest of the classifiers, including the TSVM. For this reason, the variants of the TSVM have not been utilized in the further analysis of the proposed research work.
The diagonal cells of Figure 4 show the number and percentage of correctly classified samples. For the best-performing XGB classifier, 66 of the COVID-19 cases are incorrectly classified as non-Covid, corresponding to 0.9% of all 7121 examples in the test data. Likewise, 135 of the non-Covid samples are falsely classified as COVID-19, corresponding to 1.9% of all the data. Furthermore, the XGB classifier has attained a 98.1% positive predictive value, outperforming the RF and SVM classifiers by margins of 0.4% and 33.3%, respectively. A comparative analysis of the three confusion matrices shows that the XGB classifier produces fewer false-positive and false-negative COVID-19 identifications, with better true-positive and true-negative counts, when the K = 10 Best feature selection is applied.
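For reference, the reported quantities can be derived from a 2×2 confusion matrix as below; the counts used are illustrative placeholders that merely sum to 7121, not the exact cell values of Figure 4:

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive common scores from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),          # positive predictive value / precision
        "sensitivity": tp / (tp + fn),  # recall / true-positive rate
        "specificity": tn / (tn + fp),
    }

# Illustrative counts only (assumed, not the paper's exact matrix).
m = binary_metrics(tp=3400, fp=135, fn=66, tn=3520)
print(m["ppv"], m["sensitivity"])
```

Reading the four cells off each panel of a confusion matrix and feeding them through such a helper is how the per-classifier comparison above is carried out.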
All the classifiers are further evaluated in terms of ROC curves and their AUC values. Figure 5a visualizes the ROC for the three main classifiers, which shows that the optimized XGB affords better results in terms of ROC compared to RF and SVM. Higher AUC values indicate more robust performance in differentiating between the target classes; the experimental results reveal that the XGB classifier has the highest AUC of 98.0%. The optimized XGB classifier has offered the best results in this study, as is evident from the above outcomes and discussions. A further assessment of the best-performing XGB classifier has been accomplished using a bootstrap ROC, as shown in Figure 5b: the ROC has been replicated 100 times (nboot = 100), and the corresponding results offer a 95% confidence interval. The earlier results are based on a random train-test split, where 70.0% of the dataset was adopted in the training and 30.0% in the testing phase. For additional robustness testing, 10-fold cross-validation (10-fold CV) was considered; this is done 10 times to estimate each model's overall performance. In Table 5, the performances of the three main classifiers using the 10-fold CV are numerically presented. The mean accuracy for the XGB applying the 10-fold CV has the highest value of 96.68%; in contrast, the SVM offers 63.62% and the RF provides 90.77% cross-validation accuracy. The Box and Whisker plot of the three models in Figure 6a shows that the XGB classifier produces the best results, with very few inter-fold variations and the highest mean accuracy. The statistical significance of this study is assessed using the ANOVA test on the 10-fold CV results. Additionally, Multi-comparison Analysis (MCP) has been adopted, which has critical statistical properties; failing to account for multiple comparisons leads to three induction-algorithm pathologies: attribute selection errors, overfitting, and oversearching.
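A compact scikit-learn sketch of the 10-fold protocol follows, run on a synthetic stand-in dataset and a single classifier for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# 10-fold CV: each fold serves once as the held-out test set while the
# remaining nine folds train the model, as in the protocol described above.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=cv)
print(scores.mean(), scores.std())
```

The per-fold scores array is exactly the kind of sample that the Box and Whisker plot and the ANOVA/multi-comparison tests above operate on.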
The statistical analysis yielded a p-value of 4.8 × 10⁻¹⁴ (p < 0.001, statistically significant), revealing statistically significant differences between the classifiers. As is apparent from the multi-comparison test shown in Figure 6b, the optimized XGB result is statistically superior to those of RF and SVM.

The importance of the 10 selected best features, computed with the best-performing XGB classifier, is exhibited in Figure 7a. The most important features are loss of taste and loss of smell, and the least essential feature among the 10 is wheezes. Cumulative importance (CI) ranks the features and describes how much each additional feature improves the performance. The CI of the 10 features is shown in Figure 7b, where the curve reaches almost 95%.
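The cumulative-importance computation can be sketched as follows; the importance scores below are assumed values for illustration, not the actual feature_importances_ of the trained model:

```python
import numpy as np

# Hypothetical importance scores for the 10 selected features.
names = ["loss_of_taste", "loss_of_smell", "fever", "cough", "temperature",
         "pulse", "respiratory_rate", "age", "bronchi", "wheezes"]
imp = np.array([0.28, 0.24, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04, 0.02, 0.01])

order = np.argsort(imp)[::-1]            # most important first
cum = np.cumsum(imp[order]) / imp.sum()  # cumulative importance curve

# number of top features needed to reach 95% of the total importance
k95 = int(np.searchsorted(cum, 0.95) + 1)
print([names[i] for i in order[:2]], k95)
```

Plotting `cum` against the rank yields exactly the kind of curve shown in Figure 7b, with the 95% line marking how many features carry almost all of the signal.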
The violin plot displayed in Figure 8 is also created to visualize the data distribution in terms of fever and cough; it clearly distinguishes between Covid and non-Covid patients [32]. Therefore, an optimized XGB model using the 10 important features (age, temperature, pulse, respiratory rate, bronchi, wheezes, cough, fever, loss of smell, loss of taste) has been used to design the intelligent app discussed in Section 3.2 [33].
Furthermore, from Table 6, it is evident that the model performance depends on the selected features. Without feature selection, the performance of RF and XGB is almost similar, with accuracies of 97.89% and 97.18%, respectively, and Matthews correlation coefficients of 95.79% and 94.37%. The results using the 10 best features are slightly better than this. So, it can be concluded that using the 10 features resulting from SelectKBest in collaboration with the optimized XGB provides the best results. From these results, the violin plots have also been produced (see Figure 8).
The most important rules are derived from the set of 10 attributes, in an attempt to discover relationships between the selected features. Nine rules are extracted based on confidence (35%) and support (95%) thresholds, and Figure 9 depicts the relationships among those variables and the rules for Covid patients.
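A minimal, pure-Python sketch of this support/confidence rule mining over pairs of symptoms follows; the toy transactions and the thresholds used here (support ≥ 0.5, confidence ≥ 0.6) are illustrative and differ from the study's settings:

```python
from itertools import combinations

# Toy symptom transactions (hypothetical patients): each set lists the
# features present for one COVID-positive case.
transactions = [
    {"fever", "cough", "loss_of_taste"},
    {"fever", "loss_of_taste", "loss_of_smell"},
    {"cough", "loss_of_taste", "loss_of_smell"},
    {"fever", "cough", "loss_of_smell", "loss_of_taste"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Apriori-style rule mining over pairs: keep rules A -> B whose support and
# confidence clear the chosen thresholds.
items = sorted(set().union(*transactions))
rules = []
for a, b in combinations(items, 2):
    for ant, con in ((a, b), (b, a)):
        sup = support({ant, con})
        cf = sup / support({ant})
        if sup >= 0.5 and cf >= 0.6:
            rules.append((ant, con, sup, cf))

for ant, con, sup, cf in rules:
    print(f"{ant} -> {con}: support={sup:.2f}, confidence={cf:.2f}")
```

The full Apriori algorithm extends this idea to itemsets of any size by pruning candidates whose subsets already fail the support threshold.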

State-of-the-Art Comparison
The previous discussion reveals that the 10 features selected by the Select K Best method outperform the alternatives when used with XGB. The results are also comparable with other contemporary studies. In Table 7, Awal et al. [23] worked on the same dataset, with 33 features and 21,578 total instances; they applied mainly three different ML algorithms and obtained their best accuracy of 98.63% from XGB with all 33 features, but no expert system was built. On the other hand, with 10 significant features, the expert system developed using the proposed approach achieved 97.91% accuracy with the XGB model, which is quite close to the work of Awal et al. [23]. Kumar et al. [10] also developed their prediction model by applying RF and XGB. Arpaci et al. [7] achieved a best accuracy of 97.7% by applying XGB to a dataset containing three features and 5840 instances. Debjit et al. [8] used a dataset consisting of 20 clinical features and obtained 92.54% accuracy. The proposed system was based on 10 features and 21,578 cases, and the best accuracies obtained are 97.91% from XGB, 97.81% from RF, and 73.37% from SVM.

Mobile and Web Application Development
User-friendly web and mobile applications have been developed from the model, through which end users can easily input their data to test, in a probabilistic fashion, whether they are affected by COVID-19. Here, end users input 10 data points: age, body temperature, pulse rate, respiratory rate, bronchi, wheezes, cough, fever, loss of smell, and loss of taste. A higher probability indicates a higher risk of being affected by COVID-19 and vice versa. High-risk users should follow proper guidelines and medication against COVID-19. The web application and the mobile interface of the proposed framework are presented in Figures 10 and 11, respectively. After the 10 inputs are provided, pressing the Predict button returns the positive and negative probability percentages, as shown in the screenshot of the newly designed smartphone application in Figure 11.

Conclusions
COVID-19 has altered global, social, and economic conditions. The sturdiness of international relations has been tested, whether on a unilateral or multilateral basis. The most obvious consequences of this disease are economic recession, a global governance crisis, trade protectionism, and increased isolationist sentiment. There have been restrictions on international exchanges of people, culture, and travel. Nevertheless, this is only the beginning: the world will be able to maintain stability when faced with similar challenges in the future because people will overcome the pandemic. If COVID-19 is detected early, it may be possible to defeat the virus. To this end, a model has been built for the early detection of COVID-19, allowing end users to quickly test, in a probabilistic fashion, whether they are affected by COVID-19. A higher probability indicates that the user is more likely to be affected by COVID-19 and vice versa. Mainly, three ML algorithms have been utilized, namely XGB, RF, and SVM, to establish this model, which obtained accuracies of 97.91% for XGB, 97.81% for RF, and 73.37% for SVM. Finally, an expert intelligence system has been developed using the outcomes of this study, aimed at predicting and classifying COVID-19 efficiently and accurately from its symptoms. This application can be used by people in any country. In the future, deep learning and transfer learning may be integrated into the proposed approach to make the application more efficient, and the sample size of the dataset can be enlarged to make the analysis more feasible, accurate, and valuable. The current dataset has been stored in the GitHub repository (https://github.com/awalece04ku/Covid-ML/) (accessed on 26 February 2022) for future app usability testing and performance measurement.