Prognostic Model of COVID-19 Severity and Survival among Hospitalized Patients Using Machine Learning Techniques

We conducted a statistical study and developed a machine learning model to triage COVID-19 patients affected during the height of the COVID-19 pandemic in Hong Kong, based on their medical records and test results (features) collected during hospitalization. The correlation between the values of these features is studied against discharge status and disease severity as a preliminary step to identify those features with a more pronounced effect on the patient outcome. Once identified, they constitute the inputs of four machine learning models, Decision Tree, Random Forest, Gradient Boosting and RUSBoost, which predict both the Mortality and Severity associated with the disease. We test the accuracy of the models when the number of input features is varied, demonstrating their stability; i.e., the models are already highly predictive when run over a core set of (6) features. We show that Random Forest and Gradient Boosting classifiers are highly accurate in predicting patients' Mortality (average accuracy ∼99%) as well as in categorizing patients (average accuracy ∼91%) into four distinct risk classes (Severity of COVID-19 infection). Our methodical and broad approach combines statistical insights with various machine learning models, paving the way forward in the AI-assisted triage and prognosis of COVID-19 cases, and is potentially generalizable to other seasonal flus.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), started in Hubei province of China in December 2019 and has since spread worldwide, claiming the lives of more than 5 million people as of January 2022 [1]. The virus and its variants, all possessing high-transmissibility properties, cause a variety of symptoms, ranging from acute respiratory distress to (systemic) organ failure [2]. Medical institutions and professionals all over the world have dealt with never-before-seen emergencies: overcrowded hospitals, scarce medical resources and response systems pushed to their limits. Although diagnostic tests, with variable sensitivity and specificity, have been widely available since 2020, it is still problematic to predict when a new peak of infection will present itself in a population and what measures should be taken to contain the spread while furnishing appropriate medical care. For these reasons, researchers have tried to identify specific features or test results that may be reasonably used as predictors of the Severity of respiratory distress for COVID-19 positive patients as well as of their risk of death [3-14] (see also [15] and references therein).
We study the effect of various pre-treatment features on the final status (outcome) of COVID-19 patients in Hong Kong with the goal of facilitating a faster and more informed response program for triaging infected individuals. The aim of this triaging model is to accurately predict the risk of death or severe COVID-19 infection of a patient at the moment of admission. This way, the most severe infections would take priority in terms of medical care. As a preliminary step, we conduct a statistical analysis based on the computation of correlation coefficients and significance tests to determine the extent of influence of each feature on the patient triaging outcomes. We then perform feature selection using the mutual information as a measure to identify the most predictive features. Finally, the outputs of the feature selection process are passed to the machine learning (ML) model. We will comment on the methods used in our analysis and present the results, including the accuracies for our ML model. For completeness, we will also test the predictive power of our ML model when the full feature list is used in the categorization.

Methods
This is a retrospective observational study of 429 patients admitted between 12 February 2020 and 25 August 2020 to Queen Elizabeth Hospital under suspicion of COVID-19 infection. These patients were subsequently confirmed to be COVID-19 positive via polymerase chain reaction (PCR) testing and retained in the hospital for varying periods of time.
The dataset includes information about the patients, among which their pre-existing medical conditions (comorbidities), test results at admission and amount of oxygen administered. During their stay in the hospital, these patients also received medical treatment based on their health condition and consequential medical suggestion. The nine treatments used were: interferon beta-1b, ribavirin, lopinavir/ritonavir (Kaletra), remdesivir, tocilizumab, dexamethasone, hydrocortisone, prednisolone and convalescent plasma.
Patients are labeled by one of two separate outcome results: Mortality and Severity. The Severity of the disease, related to the severity of respiratory distress, is defined based on the total amount of oxygen given to the patients during their hospital stay: 0-stable case (0 L O2/min); 1-mild case (1-3 L O2/min); 2-serious case (3-6 L O2/min); 3-critical case (>6 L O2/min). The outcome Mortality is self-explanatory and simply distinguishes patients who survived from patients who died (i.e., 0-alive; 1-dead) within a window of a few months.
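The oxygen-based Severity labeling can be sketched as a simple thresholding function. Note that the paper's ranges share the boundary values 3 and 6 L/min; assigning each boundary to the lower class, as done below, is our assumption for illustration:

```python
def severity_class(oxygen_lpm: float) -> int:
    """Map total supplemental oxygen (L/min) to the four Severity labels.

    Boundary values (3 and 6 L/min) are assigned to the lower class here;
    the paper does not state which class the shared boundaries belong to.
    """
    if oxygen_lpm <= 0:
        return 0  # stable
    elif oxygen_lpm <= 3:
        return 1  # mild
    elif oxygen_lpm <= 6:
        return 2  # serious
    else:
        return 3  # critical
```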
An unbiased estimate of the effect of treatments on either outcome cannot be worked out from our dataset (see comments in Section 4); instead, we present below a precise statistical analysis of the relevant discrete and continuous features, establishing which are most associated with either Severity or Mortality. In order to eliminate any spurious correlation between features, which may over- or under-estimate the power of certain features in predicting triage outcomes, we first test the association between the features themselves.
We then consider a machine learning feature selection algorithm based on the mutual information between features and outcomes. The resulting features of this classifier are employed as input variables in well-known machine learning algorithms to predict the Severity and Mortality-chances for each patient.

Statistical Analysis for Feature Selection
To enhance the predictive power of the machine learning algorithm, and given the low number of data points present, we intend to select the subset of the full list of features which can best predict a triaging outcome.
We start by calculating the Spearman's rank correlation coefficient ρ [16], which is particularly suited to extract the measure of monotonic association between discrete and/or continuous features. Next, we proceed by considering a natural distinction between discrete and continuous data and the study of their association with triaging outcomes, which merit separate discussions in the subsequent subsections.
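As an illustration of this first step, the following sketch computes Spearman's ρ with scipy on two synthetic, monotonically related features; the data here are fabricated stand-ins for two of the measured quantities, not the patient records:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical stand-ins for two monotonically related features
# (e.g., WBC and neutrophil counts); values are synthetic.
wbc = rng.normal(7.0, 2.0, size=200)
neutrophil = 0.6 * wbc + rng.normal(0.0, 0.3, size=200)

# Spearman's rank correlation measures monotonic (not just linear)
# association and is robust to outliers and nonlinear rescalings.
rho, p_value = spearmanr(wbc, neutrophil)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.1e}")
```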

Discrete Features
At the moment of hospital admission, patients were asked to give background information about existing comorbidities and other categorical/discrete features: chronic heart disease, hypertension, asthma, chronic kidney disease, diabetes mellitus, hematological malignancies, sex and age. Apart from the feature "age", which we bin into 5 sub-categories, these features are in fact binary, with 0 (1) representing the absence (presence) of a comorbidity or male (female) sex, respectively.
In order to extract the measure of association between these discrete features and the triaging outcomes, we use the well-known Pearson's χ²-test of statistical significance [17], based on the χ² distribution (for more details, we refer the reader to Appendix A.1).
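A minimal sketch of this test, using scipy's `chi2_contingency` on an invented 2x2 contingency table of a comorbidity against a binary outcome (the counts are illustrative only, not the paper's data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = comorbidity (no/yes),
# columns = adverse outcome (no/yes); counts are illustrative only.
table = np.array([[300, 40],
                  [50, 28]])

# Pearson's chi-squared test of independence; scipy applies Yates'
# continuity correction by default for 2x2 tables.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")
print("reject null hypothesis of no association:", p < 0.05)
```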

Continuous Features
The admitted patients also underwent blood tests, from which a variety of continuous feature values were extracted. Specifically, the features examined were: total bilirubin, ALT, creatine kinase, neutrophil, hsTnI (high-sensitivity troponin I), urea, CRP, Hb, WBC, lymphocyte, LDH, creatinine and platelet count. For each feature f in this list, we consider the 2 Mortality labels and the 4 Severity labels, each of which generates a distribution. Here, we will focus on statistical point estimation, i.e., medians and percentile ranges, so as to compare the statistical distributions of features of patients infected with COVID-19 against known point estimates for normal values. Specifically, we will make use of two different metrics which quantify the (normalized) "distance" of the median and interquartile range of the distribution of values of a feature from the known non-infected population's interquartile range. To be more precise, the first metric measures how far the median of a given distribution of COVID-19 infected patients falls from the normal population's interquartile range. Similarly, the second metric measures how much overlap there is between the interquartile range of a distribution of COVID-19 positive patients and the normal population's interquartile range. We give the explicit mathematical form of these metrics in Appendix A.
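The exact formulas are given in Appendix A; the sketch below is one plausible implementation of two such metrics, with the normalization chosen by us for illustration and not taken from the paper:

```python
import numpy as np

def iqr_metrics(values, normal_lo, normal_hi):
    """Sketch of the two distance metrics (the exact forms are in the
    paper's Appendix A; this normalization is an assumption).

    r_a: distance of the patient median from the normal interquartile
         range [normal_lo, normal_hi], normalized by the normal IQR
         width (0 if the median lies inside the range).
    r_b: fraction of the patient IQR lying outside the normal IQR.
    """
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    width = normal_hi - normal_lo
    # how far the median falls outside the normal IQR
    r_a = max(normal_lo - med, med - normal_hi, 0.0) / width
    # overlap between the patient IQR and the normal IQR
    overlap = max(0.0, min(q3, normal_hi) - max(q1, normal_lo))
    r_b = 1.0 - overlap / (q3 - q1)
    return r_a, r_b
```

With a 15% threshold, a feature class distribution would then be flagged as statistically relevant whenever either returned ratio exceeds 0.15.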

Machine Learning for Feature Selection
We employ a powerful feature selector, based on mutual information [18], which quantifies the extent of information gain on the target outcome due to a chosen feature (see Appendix A for details). The implementation of our classifier from scikit-learn, sklearn.feature_selection.mutual_info_classif, based on work by Ross [19], is specifically designed to deal with hybrid discrete and continuous features, as is the case in our analysis. The algorithm runs over the totality of the data, starting from an initial random seed. Given the small size of the dataset, the results vary slightly between runs. We take an average of the results over 100 runs to decrease the bias of the selector.
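A sketch of this averaging procedure on a synthetic stand-in dataset (the real inputs are the patient features and triage outcomes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Hypothetical stand-in for the patient feature matrix and outcome labels
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)

# The k-NN based estimator is stochastic; averaging over repeated runs
# (the paper uses 100) stabilizes the feature ranking.
n_runs = 100
scores = np.mean(
    [mutual_info_classif(X, y, random_state=seed) for seed in range(n_runs)],
    axis=0,
)
ranking = np.argsort(scores)[::-1]  # most informative features first
```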

Machine Learning for Triage-Prediction
We performed a comparative study of several conventional classifiers: Decision Tree, Random Forest [20], Gradient Boosting [21,22] and RUSBoost models [23]. The Decision Tree classifier uses a maximum depth of 400, beyond which the tree does not branch further, while the Random Forest uses 100 estimators. Both boosting models, RUSBoost and Gradient Boosting, use 280 estimators. We chose this parameter space after running multiple trials to maximize accuracy while placing emphasis on reducing the number of false-negative predictions of the machine learning model in the Mortality class.
The models utilize a random 70-30% split between training and test data. Given the small sample and consequent large variances, the accuracy of the algorithms will also vary within a few percentage points. This is what one would expect from an unbiased classifier trained on a small dataset (algorithms trained on small samples of data that produce constant results for complex classification analyses are inevitably biased).
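The training setup described above can be sketched as follows; the data are a synthetic stand-in for the selected patient features, and RUSBoost is omitted to keep the example dependency-free (it lives in the imbalanced-learn package rather than scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the selected feature matrix and Mortality labels
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hyperparameters as reported in the text (max depth 400, 100 trees,
# 280 boosting estimators).
models = {
    "DT": DecisionTreeClassifier(max_depth=400),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GB": GradientBoostingClassifier(n_estimators=280, random_state=0),
}

# Random 70-30% train-test split, re-drawn on every run in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
accs = {}
for name, model in models.items():
    accs[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy = {accs[name]:.3f}")
```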

Data Cleaning
We imposed a large 25% tolerance on missing data, i.e., features with more than 25% unpopulated information were eliminated. For features with unpopulated data below this threshold, we draw synthetic representative data from the appropriate intervals corresponding to the outcome class. The feature for which the most entries were generated is hsTnI (30 synthetic data points).
To further cull irrelevant features, we also applied a variance threshold test with a tolerance of 1%: we simply eliminated all features whose values do not vary significantly within the tolerance (in this case, at most 1% of the patients' data for a specific feature are allowed to vary from a fixed value). Such near-constant features necessarily cannot be good predictors of either triaging outcome.
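One way to express the 1% tolerance for a binary feature is as a variance bound: if at most 1% of entries deviate from a constant value, the variance is at most 0.01 × 0.99. The sketch below uses this interpretation (an assumption on our part) with scikit-learn's `VarianceThreshold`:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical binary comorbidity matrix: the second column is nearly
# constant (only 2 of 400 patients deviate) and should be dropped.
X = np.zeros((400, 2))
X[:200, 0] = 1  # varies freely: variance 0.25
X[:2, 1] = 1    # nearly constant: variance ~0.005

# A binary feature with at most 1% deviating entries has variance at
# most 0.01 * 0.99 -- our reading of the paper's 1% tolerance.
selector = VarianceThreshold(threshold=0.01 * 0.99)
X_kept = selector.fit_transform(X)
```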

Data Augmentation
Since the raw data contain a significantly larger number of patients who did not display serious symptoms (or did not die) than of those severely affected by COVID-19, we use SMOTE to address the problem of imbalanced classification [24]. The Synthetic Minority Oversampling TEchnique (SMOTE) is a data augmentation technique that artificially generates more members of the minority class(es). This method adds new information to our model as it interpolates between examples selected in a particular region of the feature distribution. We set the nearest-neighbour parameter for SMOTE to 3, i.e., the algorithm looks for the 3 nearest neighbors in the minority class to form the convex hull from which the synthetic samples are drawn. It is important to note that this tunes our model to recognize very selective patient traits for both classifications, i.e., it inevitably increases the bias of the predictive algorithm (discussed in the sections below).
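The interpolation at the heart of SMOTE can be sketched in a few lines of numpy. The study presumably uses a standard implementation (e.g., imblearn.over_sampling.SMOTE with k_neighbors=3); the function below is an illustrative reimplementation of the core idea, not the paper's code:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: each synthetic point is a random convex
    combination of a minority-class sample and one of its k nearest
    minority-class neighbours (k=3 in the paper's setup)."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                   # interpolation weight in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because every synthetic sample lies on a segment between two minority samples, the augmented data never leave the convex hull of the minority class, which is the source of the selectivity (and added bias) noted above.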

Results
Among the 429 patients, four patients were discharged against medical advice, so their data will not enter any of our analyses. For seven other patients, data about oxygen therapy were not present: in total, 418 patients are labeled based on the total amount of oxygen given to them during their hospital stay.
Similarly, among the 425 remaining patients, 76 were transferred to other hospitals by necessity, and unfortunately it has not been possible to contact them afterwards. Hence, 349 of the initial 425 patients are labeled by Mortality. In Tables 1 and 2, we show point estimates of the collected data for each of the two label classes, Mortality and Severity. Figure 1 shows the result of Spearman's correlation test applied between the features themselves, without accounting for the outcome. As is immediately apparent, none of the features seem to possess a high degree of monotonic association; the only exception is perhaps represented by neutrophil count and WBC, whose exact relation is, however, unknown to us. More complex measures of dependence between features could be probed if more data were available; hence, for the purposes of this paper, we will consider the features under examination to be statistically independent of each other, knowing that in the worst case the use of both neutrophil count and WBC as inputs of the machine learning algorithms may create slight redundancies.

Discrete Feature Selection
In the following, we present a statistical analysis of the degree of association of the discrete features, namely diabetes mellitus, hypertension, chronic heart disease, chronic kidney disease, asthma, hematological malignancy, age and sex, with the Severity and Mortality outcomes. The last feature, sex, does not seem to be relevant in discriminating between either of the outcomes, as is immediately evident from Tables 1 and 2. Below, in Tables 3-5, we present the contingency tables, for the categorization by Severity, of the discrete features which do not satisfy the null hypothesis of no association between feature and outcome, i.e., of the discrete features which are statistically associated with the Severity outcome. The features diabetes mellitus, hypertension and age are found to be associated with the Severity outcome, while the null hypothesis holds true for chronic heart disease, chronic kidney disease, asthma and hematological malignancy. The results are summarized in Table 6. Next, we consider the association between the discrete features and the Mortality outcome. This time, we find that the null hypothesis can be rejected for both chronic heart disease and age, but not for the other discrete features. The results are shown in Tables 7 and 8 and summarized in Table 9. Hence, the feature age seems to have a high degree of association with both of the outcomes examined, while pre-existing conditions seem to be relevant only for one of the two.

Continuous Feature Selection
In this section, we present, in a different and more precise form, the results already shown in Tables 1 and 2. In Figure 2, we show, for each outcome, Mortality and Severity, and each class distribution M_i and S_i, the median values (red/blue dots) and the interquartile ranges (red/blue vertical segments). The green horizontal lines instead correspond to the interquartile range of values associated with the normal population [25]. If either distance r_a, r_b (see Appendix A) for any class distribution of a feature f is above the threshold of 15%, the feature f is selected as statistically relevant, since its median or interquartile range values are not included in, or do not overlap significantly with, the normal range (obtained from statistics on the non-infected population), testifying how the feature itself seems to be associated with a COVID-19 infection. Among the features selected in this way:
• LDH (S1: r_a = 23.5%, r_b = 118.5%; S2: r_a = 111%, *; S3: r_a = 78%, *);
• Neutrophil (S3: r_b = 57.9%).
In the above list, we have made use of the asterisk symbol * to indicate that, for those distributions, the interquartile range fell fully outside the normal range: in those cases, we presented only the r_a ratio, which was sufficient to select the associated features as statistically significant.
Note that these results serve as an indicator of discriminatory features: another, complementary test is discussed below. Figure 3 shows the results of the mutual information feature selector for both the Mortality and Severity outcomes. The eight most important features selected by the algorithm are given in Table 10. Notice that, aside from age (here treated as a discrete variable), only continuous variables are selected by the mutual information algorithm.

Comparison with Feature Selection Results from Literature
Up until now, the emphasis of our analysis lay on the identification of the features, both discrete and continuous, most relevant for triaging patients by Severity and Mortality. As a way to confirm the validity of our feature selection results, we note that age [8,11-13], CHD [4,11,12], CRP [12,13], neutrophil [4,13] and LDH [7] were also found to be statistically significant features for predicting the Mortality outcome, while age [14], CRP and LDH [3,5,9,14] (among many others) were found to be statistically significant for the Severity outcome. Furthermore, in [15], a variety of results from other studies are summarized, which possess a notable overlap with the statistically relevant features presented above for predicting both outcomes.

Machine Learning-Severity and Mortality Prediction
Keeping in mind the trade-off between accuracy and bias, which depends not only on the type but also on the number of input features, in this section we present the results obtained from our machine learning classifiers used as predictors of the patient outcomes. We present the maximum-accuracy results of a single run of the models in Figures 4 and 5, first using raw data and then raw data balanced by the addition of synthetic data. In these plots, the 10 most significant features have been chosen as inputs of the classifiers.
In Tables 11 and 12, we collect the accuracies of all four algorithms, Gradient Boosting (GB), Decision Tree (DT), Random Forest (RF) and RUSBoost (RUSB), for the prediction of the outcomes Mortality and Severity, respectively, averaged over 100 runs with different 70-30% training-test data splits. The same results are presented in Tables 13 and 14, where SMOTE was used to balance the dataset. These accuracies have been computed using as input the 6 and 10 most predictive features, as well as the full set of (17) features.

Discussion and Conclusions
We analyzed COVID-19 patient test results, taken shortly after hospitalization, together with their medical history, such as their record of comorbidities. A preliminary analysis showed that the preprocessed dataset was complete in its entries but displayed signs of heavy class imbalance, which we resolved by synthetically augmenting the minority classes in the dataset using SMOTE. We discovered that CK, creatinine, CRP, WBC, Hb, hsTnI, LDH, neutrophil and urea levels were among the important features correlated with the Mortality outcome, while CRP, creatinine, Hb, hsTnI, LDH and neutrophil levels were associated with the degree of Severity (respiratory distress) of each patient. These results were reaffirmed by the mutual information feature selector, which also revealed patient age to be highly correlated with both triaging outcomes, and confirmed by comparison with the literature [3-15].
The fact that both outcomes are predicted to be associated with almost the same set of features (see Table 10, and the substantial overlap with the features extracted from the earlier statistical analysis of continuous features) attests to the correctness and consistency of our analysis and results. However, since the lists are not exactly the same, one must decide which set of features to use as input for the machine learning algorithm. A statistical analysis of the accuracies confirms that the use of a larger (union) rather than a smaller (intersection) set of features marginally increases the accuracy (by 5-10%) of our triaging algorithm, but most likely also the bias of the results. Specifically, the machine learning algorithms we use perform very similarly on both classification tasks, and their predictions are stable against randomized choices of training sample. Furthermore, they perform extremely well, as their accuracies testify: without the use of SMOTE, i.e., considering the original dataset with unbalanced outcome labels, the maximal accuracy is in the range [95.5-96.2%] for Mortality and [73.7-79%] for Severity (note that this range corresponds to about 25 (mis)classified cases, especially in the S2 label). The ranges of accuracies when synthetic data are used to balance the labels are instead [98-99.1%] for Mortality and [74.2-94.6%] for Severity. This considerably improves on previous similar studies, which even had access to larger datasets, e.g., [7,10-14].
The analysis above is in principle amenable to a causal treatment, whereby the variables of a problem are represented as nodes and their causal relations (if any) as directed arrows [26]. In the graph-based causal language, a directed arrow typically stems from the initial test results and points both at the treatment employed and at the final outcome of the patient. In our study, however, the treatment was given independently of the initial test results, which might allow us to neglect the arrow connecting the initial test results to the treatment. Furthermore, we note that (un)conscious biases may emerge during the administration of a treatment based on the visible (but not recorded) symptoms of the patient. The same unrecorded symptoms may also have a direct effect on the recorded symptoms, possibly creating a confounding arc that provides an unobserved backdoor pathway between the initial test results and the treatment, irrevocably stopping us from computing an unbiased estimate of the direct causal effect of test results on outcomes (mediated by treatments). While in principle one can identify and even quantify these effects to make useful predictions, the small size of our dataset limited the scope of our analysis.
It is worth noting that our study does not consider cases of long COVID, which would require data collected over longer periods. Similarly, our dataset provides information about the severity of respiratory distress due to COVID-19 during the course of hospitalization, but not about other flu-like symptoms, such as fever and cough, or other more dangerous effects, such as organ failure.
Notwithstanding the limitations imposed by the data on our analysis, the techniques adopted for the statistical and machine learning analyses of the data have shed valuable insight on the most likely features to be associated with severe and lethal COVID-19 infections in the sample of patients considered.
We conclude this discussion by noting that our methodical analysis, exploiting both statistical and machine learning techniques, here used for the prognosis of COVID-19 patients, is also potentially generalizable to other seasonal infections and future pandemic diseases.

Data Availability Statement:
The datasets generated and/or analysed during the current study are not publicly available due to a privacy agreement between Queen Elizabeth Hospital and its patients, but are available, in symbolically encoded form, from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare they have no competing interests.