1. Introduction
The Coronavirus Disease 2019 (COVID-19) pandemic continues to strike the globe with second and third waves of infections, as emerging variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are more transmissible and deadly [
1]. Different countries are vaccinating their populations with several vaccines to reduce the disease burden and mitigate the pandemic but, in this race, not all countries are at the same level [
2,
3]. COVID-19 vaccine production, distribution, and administration have not reached the needed vaccine coverage globally [
2,
3]. Therefore, the spread of emerging SARS-CoV-2 variants is outpacing vaccination campaigns, resulting in a continuing global burden of COVID-19. As of 7 August 2021, there have been approximately 201 million cases worldwide and 4.27 million deaths [
4].
COVID-19 has a spectrum of clinical presentations, ranging from asymptomatic to critically ill patients. According to several studies [
5,
6,
7,
8,
9], the severity of the disease depends mostly on age and comorbid conditions. Moreover, genetic factors are also being studied to identify their relationship with COVID-19 severity [
10]. One particular study found a link between genes encoding blood groups, specifically type A, and serious clinical manifestations [
11,
12]. There are many complications associated with COVID-19, and patients may present with symptoms affecting multiple systems, including the respiratory, cardiovascular, and gastrointestinal (GI) systems, in addition to altered coagulability [
13,
14]. COVID-19 is known to affect the coagulation profile and cardiac biomarkers. The rates of cardiac injury among COVID-19 patients are between 19.7% and 27.8% of admitted cases, and the associated mortality rates are between 23% and 51.2% [
15]. An analysis of coagulopathy, inflammation, and troponin can help to explain the mechanism of myocardial injury. Notably, raised troponin levels among critically ill patients point to cardiac injury and are a sign of poor prognosis, a clear indication that cytokine release syndrome potentially mediates myocardial injury. Longitudinal follow-up suggests a notable divergence between critically ill patients who die and those who do not. Moreover, cardiac injury at admission is related to severe outcomes. C-reactive protein (CRP), measured on the third day, reflects the level of inflammation; CRP is a protein made by the liver and released into the bloodstream in response to inflammation. Interleukin-6 (IL-6) is another inflammatory marker and an indicator of disease severity [
16], and it was found that the IL-6 peak among critically ill survivors falls between the fourth and seventh days [
15]. By contrast, IL-6 increases continuously among those who do not survive. D-dimer, a marker of coagulopathy, remains high in those who do not survive, in contrast to those who do. COVID-19 is also associated with changes in the levels of various circulating inflammatory and coagulation biomarkers, including fibrinogen and D-dimer. D-dimer levels have been observed to be within normal ranges, or only slightly increased, in the early stages of the disease; as the disease progresses in severity, D-dimer levels increase significantly [
17]. Fibrinogen, a protein produced by the liver, is a coagulation biomarker that also increases with inflammation. Creatinine, lactate dehydrogenase (LDH) level, lymphocyte count, D-dimer, troponin, IL-6, and CRP have been shown to be important biomarkers for the severity prognosis of COVID-19. Creatinine is a chemical compound left over from energy-producing processes in the muscles, which a healthy kidney filters out of the blood. LDH is an enzyme involved in energy production, found in almost all cells in the body, and is used to monitor tissue damage associated with a wide range of disorders, including liver disease and interstitial lung disease. An increase in LDH reflects tissue damage, suggesting a viral infection or lung damage such as the pneumonia induced by SARS-CoV-2 [
18].
Assessing COVID-19 severity and prognosis has been of great importance in clinical patient management. Machine learning has played a noteworthy role in detecting COVID-19 using clinical data and chest X-ray and computer tomography images in patients [
19,
20,
21,
22,
23,
24,
25]. Banerjee et al. in [
26] used full blood counts, instead of the traditional identification of symptoms, to recognize COVID-positive cases and found that positive patients exhibit lower counts of leukocytes, platelets, and lymphocytes. Brinati et al. [
27] used routine blood biomarkers to test a sample of 279 COVID-19 patients using machine learning models, achieving accuracy between 82% and 86% and sensitivity between 92% and 95%. Yang et al. [
28] evaluated the use of machine learning in routine laboratory blood tests to predict COVID-19, which offers an opportunity for early detection of the illness in areas where RT-PCR tests are not available. Machine learning was also used to predict mortality and critical events in patients with COVID-19. Rahman et al. [
29] used easily available complete blood count (CBC) parameters to predict the severity of COVID-19 patients; the developed model was validated on an external dataset, reporting very high classification accuracy. Chowdhury et al. [
24] investigated demographic and clinical characteristics and patient outcomes using machine learning tools to identify key biomarkers for predicting individual patient mortality. A nomogram was developed for predicting the mortality risk among COVID-19 patients. Lactate dehydrogenase, neutrophils (%), lymphocytes (%), high-sensitivity C-reactive protein, and age (LNLCA), all acquired at hospital admission, were identified as key predictors of death by the multi-tree XGBoost model. The area under the curve (AUC) of the nomogram for the derivation and validation cohorts was 0.961 and 0.991, respectively. An integrated score was calculated with the corresponding death probability, and COVID-19 patients were divided into three subgroups: low-, moderate-, and high-risk groups. Vaid et al. in [
30] reported that, with the XGBoost classifier, acute kidney injury, elevated LDH, tachypnea, hyperglycemia, higher age, anion gap, and C-reactive protein were the strongest drivers of mortality and critical events. Aladag and Atabey [
31] have attempted to predict mortality risk for critical COVID-19 patients using coagulopathy markers. Terwangne et al. in [
32] showed the predictive accuracy of severity classification of COVID-19 using a model based on Bayesian network analysis with five important parameters: acute kidney injury, age, lactate dehydrogenase (LDH) level, lymphocytes, and activated partial thromboplastin time (aPTT).
Huang et al. [
5] used nine independent risk factors at admission to the hospital to quantify the risk score and stratify the patients into various risk groups in a retrospective, multicenter analysis of 336 confirmed COVID-19 patients and 139 control patients. This research did not use any external validation. The independent relationship between the baseline level of four indicators (Neutrophil to Lymphocyte Ratio (NLR), LDH, D-dimer, and CT score) on admission and the severity of COVID-19 was assessed using logistic regression. The presence of high levels of NLR and LDH in serum could help in the early detection of COVID-19 patients who are at high risk. It was shown that the usage of LDH and NLR together increased detection sensitivity [
6]. This model, however, is based on a CT image-based ranking, which is not available for all patients. In a limited number of hospitalized patients (84) with COVID-19 pneumonia, Liu et al. [
7] suggested combining the NLR and CRP to predict 7-day disease severity. A retrospective cohort of 80 COVID-19 patients treated at Beijing You’an Hospital was analyzed to identify risk factors for serious and even fatal pneumonia and establish a scoring system for prediction, which was later validated in a group of 22 COVID-19 patients [
8]. Age, diabetes, coronary heart disease (CHD), percentage of lymphocytes (LYM percent), procalcitonin (PCT), serum urea, CRP, and D-dimer were found to be correlated with mortality by LASSO binary logistic regression in a total of 2529 COVID-19 patients. The researchers then used multivariable analysis to determine that old age, CHD, LYM percent, PCT, and D-dimer independently posed risks for mortality. A COVID-19 scoring system (CSS) was developed based on the above variables to classify patients into low-risk and high-risk categories with discrimination of AUC = 0.919 and calibration of
p = 0.64 [
9].
Although there have been recent works utilizing machine learning approaches for early mortality prediction of patients using biomarkers [
5,
7,
8,
9,
33,
34,
35,
36,
37,
38], to the best of the authors’ knowledge, no work has developed a generalized and reliable model for both COVID-19 and non-COVID-19 patients. This is the motivation behind this study and is important during a pandemic, when medical personnel are dealing with both types of patients. Identifying and prioritizing patients at high risk is critical for both resource planning and treatment planning. In addition, high-risk patients should be constantly tracked during their hospital stay using a reliable scoring method. Patients at risk, who typically have poor outcomes, require treatment in an intensive care unit (ICU); the proposed tool can identify them, helping to save the lives of a significant number of people during this pandemic. Thus, the novelty of this work lies in the development of a generalized and reliable early mortality risk prediction technique for identifying high-risk patients among both COVID-19 and non-COVID-19 patients. It also adds to the body of knowledge on developing a framework of prognostic models using machine learning approaches. This paper not only develops a nomogram-based scoring technique but also validates its performance on a completely unseen dataset from a different country and population.
The rest of the paper is organized as follows:
Section 2 discusses the methodology of the study by describing the datasets used in this paper, the details of data pre-processing stages for machine learning classifiers, and the nomogram-based scoring technique.
Section 3 discusses the result of the classification models and nomogram-based scoring techniques.
Section 4 discusses the result and validates the performance of the developed nomogram-based scoring technique and, finally, the article is concluded in
Section 5.
2. Methodology
The study consists of two important phases, model development and model validation, using two datasets. Dataset-1 [
39] (Day-0 patient’s data) is used for the prediction model development and Dataset-2 [
40] is used for external validation of the developed model. The code for the machine learning pipeline used in this study can be found in [
41]. As further illustrated in the methodology diagram (
Figure 1), Day-3 and Day-7 patients’ data from Dataset-1 are also used for external validation. Pre-processing, feature selection, and feature reduction were important parts of the feature-engineering task. In the model development phase, the authors divided the training dataset (Day-0 patients’ data from Dataset-1), with the selected features, into training, validation, and testing data. The validation dataset is used for tuning the hyperparameters of the machine learning models, and the testing dataset is used for model evaluation. The best-trained model is used to develop the scoring technique, which classifies patients into three mortality risk categories: Low, Medium, and High. Finally, the developed model is validated using external datasets and the results are reported. The remainder of this section provides details of the datasets, pre-processing techniques, performance metrics for the machine learning models, and the nomogram-based scoring technique.
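The described partitioning can be sketched as follows (the variable names, toy data, and the inner validation fraction are assumptions for this example; they are not the study’s actual pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(384, 5))      # 384 patients, 5 selected biomarkers (toy data)
y = rng.integers(0, 2, size=384)   # 1 = death, 0 = survived

# Hold out 20% for testing, then carve a validation set out of the remainder;
# stratification keeps the death/survival ratio similar across partitions.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval)
```

The validation partition is then used for hyperparameter tuning and the held-out test partition for final model evaluation, as described above.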
2.1. Study Population
In this study, two clinical biomarker datasets from two different countries were used. The first dataset (Dataset-1) was used to develop and validate an early death prediction model, and the second dataset (Dataset-2) was used for external validation. The first dataset was created from the Emergency Department (ED) of a metropolitan academic hospital in Boston during the first wave of the COVID-19 pandemic, from 24 March 2020 to 30 April 2020. The study was carried out with institutional ethical approval [
39]. Patients 18 years or older presenting with clinical concern for acute respiratory illness at the time of hospital admission were included in the study, provided they had at least one of the following conditions: (1) tachypnea (about 22 breaths per minute), (2) oxygen saturation ≤ 92% on room air, (3) a supplemental oxygen requirement, or (4) a positive pressure ventilation requirement. The patients were monitored for up to 28 days after enrollment for the clinical outcome, or discharged if they recovered. The dataset consists of biomarkers for three separate days (Days 0, 3, and 7). Six groups of patients are present among the 384 enrolled patients: the first group (Class 1) comprises patients with a death outcome (49 patients, 12.76%), and the other groups (Classes 2–6) comprise patients who survived (335 patients, 87.24%). Among the 384 patients, 78 (20%) tested SARS-CoV-2 negative and 306 (80%) tested SARS-CoV-2 positive by RT-PCR.
Table 1 shows the description of the first dataset (Dataset-1).
The second dataset (Dataset-2) was collected retrospectively from 375 patients in Wuhan, China, between 10 January and 18 February 2020, to identify valid and relevant clinical markers of mortality risk. Standard case report forms were used to collect medical records, which included epidemiological, demographic, clinical, laboratory, and mortality information. Yan et al. [
18] published the dataset along with their article, and the original study was approved by the Tongji Hospital Ethics Committee. Among the 375 patients, 187 (49.9%) had fever symptoms, while cough, weariness, dyspnea, chest discomfort, and muscular pain were reported for 52 (13.9%), 14 (3.7%), 8 (2.1%), 7 (1.9%), and 2 (0.5%) patients, respectively. Among the 375 COVID-19-positive patients, the 174 who died were labeled ‘1’ and the 201 who survived were labeled ‘0’. Patients’ outcomes, for both COVID-19-positive and COVID-19-negative conditions, are summarized in
Figure 2. There are 76 parameters in the dataset; the parameters common to Dataset-1 and Dataset-2 were used for this study, and the parameters from Dataset-2 were normalized in the same way as they appear in Dataset-1, as shown in
Table 1, so that Dataset-2 can be used as an external validation set.
2.2. Statistical Analysis
Python 3.7 and Stata/MP 13.0 were used to conduct the statistical analysis. Continuous variables, age, and the other biomarkers were reported with the number of missing entries and the frequency of each biomarker in the death and survival groups. A univariate chi-square test was conducted to identify the features that differ significantly between the deceased and survived groups; a difference is considered significant if the
p-value is <0.05. There were 20 features in the original dataset; the top five features identified by the feature selection method as most promising (reported in a later section) are summarized in
Table 2A,B for Dataset-1 and Dataset-2, respectively. The five ranked features were Age, Lymphocyte count, D-dimer, Creatinine, and CRP.
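A univariate chi-square test of this kind can be sketched with SciPy (the contingency counts below are toy values, not the study’s data):

```python
from scipy.stats import chi2_contingency

# Rows: survived vs. died; columns: biomarker below vs. above some cutoff.
table = [[200, 135],   # survived
         [10, 39]]     # died
chi2, p, dof, expected = chi2_contingency(table)
significant = p < 0.05  # the significance criterion used in the study
```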
2.3. Data Preprocessing
While patients’ blood sample data were available for multiple days, the study used first-day data (from Dataset-1) for model training and validation to identify the primary predictors of disease severity. The model also helps to identify patients who need urgent medical support. Clinical data frequently suffer from missing values, which contribute to either biased models or degraded model performance. This problem can be tackled by deleting the corresponding rows of data, but it is stated in [
38] that this simple method of removing rows with missing data can often lead to the loss of important information that would have been useful in the study, and can also lead to skewed estimates. Many standard data imputation techniques are available to handle missing data. The most common technique for clinical data imputation is multiple imputation by chained equations (MICE) [
42]. Based on the other variables present in the dataset, the missing data are estimated using multiple regression models. The technique takes the data type of the missing variables into account before imputing them. Binary variables are predicted using logistic regression, while continuous variables are predicted using predictive mean matching [
38].
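MICE-style imputation can be sketched with scikit-learn’s IterativeImputer (an assumption for illustration; the authors’ exact imputation tooling is not specified here). Each feature with missing entries is iteratively regressed on the remaining features:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy biomarker matrix (e.g., age, creatinine, LDH) with missing entries.
X = np.array([[63.0, 1.2, np.nan],
              [71.0, np.nan, 420.0],
              [55.0, 0.9, 310.0],
              [80.0, 1.8, 560.0]])

# Each missing value is estimated from a regression on the other variables,
# cycling over the features until the estimates stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```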
Supplementary Figure S1 shows the number of missing values for the different features in Dataset-1. Most features are completely populated, while Lymphocyte count, D-dimer, creatinine, LDH, monocyte, CRP, and neutrophils have notable gaps. The spark-line at the right summarizes the general shape of data completeness in the dataset. Imbalanced data can result in a biased model; therefore, the dataset needs to be balanced. The synthetic minority oversampling technique (SMOTE) is a powerful approach for tackling the imbalance problem [
43]. In this study, surviving patients are about seven times more frequent than deceased patients, so SMOTE was used to balance the data.
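The core idea of SMOTE, generating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as follows (a simplified illustration, not the full algorithm; production pipelines typically use the imbalanced-learn implementation):

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating each chosen
    sample toward one of its k nearest minority neighbors (a simplified
    sketch of SMOTE's core idea, not the full algorithm)."""
    rng = rng or np.random.default_rng(0)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every other minority sample.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # position along the segment
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

# Toy minority class (deceased patients) in a two-feature space.
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]])
new = smote_like(X_min, n_new=8)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority data already occupies.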
The twenty features present in Dataset-1 were checked to identify correlations among them. Feature reduction through the removal of highly correlated features can help improve classifier performance [
44].
Supplementary Figure S2 shows the correlation heat map; most features are found to be uncorrelated with each other. The maximum correlation, 0.56, was found between creatinine and the kidney parameter. Therefore, no feature can be removed based on correlation; instead, feature ranking and identification of the best feature combination for stratifying the deceased and survived groups are required.
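A correlation screen of this kind can be sketched with pandas (the variable names, toy data, and the 0.8 drop threshold are illustrative assumptions, not the study’s choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.normal(60, 15, 200),
    "creatinine": rng.normal(1.1, 0.4, 200),
    "ldh": rng.normal(300, 80, 200),
})
# A mildly correlated companion variable, mimicking the creatinine-kidney pair.
df["kidney"] = 0.56 * df["creatinine"] + rng.normal(0, 0.4, 200)

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
```

With this toy data no pair exceeds the threshold, so `to_drop` stays empty, mirroring the paper’s finding that no feature could be removed on correlation grounds.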
2.4. Development and Validation of Classification Model
The authors have investigated different machine learning classifiers: Random Forest [
45], Support Vector Machine (SVM) [
46], K-nearest neighbor (KNN) [
47], XGBoost [
48], Extra-tree [
49] and Logistic regression [
50]. Logistic regression was the best-performing machine learning classifier and has been used in this study (
Table 3). Logistic Regression is also a commonly used model for clinical investigation and is a supervised machine learning method for classification tasks [
50]. This technique is very popular when the goal is to estimate the likelihood of a binary outcome (i.e., survival or death of a patient) [
51]. The logistic function is a sigmoid function that maps continuous inputs to a probability value. The logistic regression classifier is used to classify the data into two classes, Death and Survived, using the ranked features, and the best feature combination is identified both for combined COVID and non-COVID data and for COVID data alone.
Dataset-1 was divided into training and validation sets (80% of the data) and a testing set (20%). Different machine learning models were investigated using five-fold cross-validation. The performance of the different models was evaluated on the test dataset using several performance metrics, including sensitivity, specificity, precision, accuracy, and F1-score, as shown in Equations (1)–(5). The receiver operating characteristic (ROC) curve is used to measure the area under the curve (AUC), separately for single predictors as well as for combinations of them. To determine the ability of the various top-ranked parameters to stratify deceased and survived patients, the AUC values for individual features and their combinations were evaluated. The predictions on the unseen (test) folds were combined to create the overall confusion matrix for the five folds.
True positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) denote, respectively, the number of patients with a death outcome classified as deaths, the number of survived patients identified as survivors, the number of survived patients incorrectly identified as deaths, and the number of deceased patients incorrectly identified as survivors.
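The metrics of Equations (1)–(5) follow directly from these four counts; a minimal sketch with illustrative counts (not the study’s results):

```python
def metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # recall for the death class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, accuracy, f1

# Illustrative counts only, roughly matching the 49/335 class sizes.
sens, spec, prec, acc, f1 = metrics(tp=45, tn=320, fp=15, fn=4)
```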
2.5. Development and Validation of Logistic Regression-Based Nomogram
The study proposes a diagnostic nomogram based on multivariate logistic regression analysis, developed in Stata/MP version 13.0 using Alexander Zlotnik’s nomolog package [
52]. Nomograms are graphic representations of complicated mathematical formulas. Medical nomograms graphically represent a statistical prognostic model that predicts the likelihood of a clinical event, such as cancer recurrence or death, for a specific individual, using biologic and clinical data such as tumor grade and patient age. Each variable is listed separately in a nomogram, with a corresponding number of points allocated to each variable’s magnitude. The total point score across all variables is then matched to an outcome scale [
53]. Logistic (logit) regression estimates its parameters from a binary response. The dependent variable, generally labeled ‘0’ or ‘1’, is the response variable: those who survived are marked ‘0’, while those who died are marked ‘1’. Equation (6) shows the odds, the ratio of the probability (Pr) of death occurring to that of death not occurring (1 − Pr). While the probability can vary from 0 to 1, the odds can vary from 0 to ∞. The logarithm of the odds is a linear combination of one or more independent variables (predictors) in the logistic regression. The independent variables can be binary (e.g., gender) or continuous (e.g., age). The log of the odds is termed the linear predictor (LP), as seen in Equation (7), and can be related to the probability of a particular outcome (e.g., death). Equations (6)–(9) are used to create a relationship between death probability and the key predictors using logistic regression.
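A worked sketch of Equations (6)–(9): the linear predictor LP is a weighted sum of the predictors, and the death probability is its inverse logit. The coefficients below are hypothetical placeholders, not the fitted model’s values:

```python
import math

def death_probability(lp):
    """Invert the log-odds: Pr = exp(LP) / (1 + exp(LP)) = 1 / (1 + exp(-LP))."""
    return 1.0 / (1.0 + math.exp(-lp))

# LP = b0 + b1 * age + b2 * lymphocyte_count + ...  (hypothetical coefficients)
b0, b_age, b_lymph = -8.0, 0.08, -1.5
lp = b0 + b_age * 70 + b_lymph * 0.8   # a 70-year-old with a low lymphocyte count
pr = death_probability(lp)
odds = pr / (1 - pr)                   # Equation (6): Pr / (1 - Pr)
```

Taking the logarithm of the odds recovers LP exactly, which is the linearity assumption of Equation (7).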
The logistic regression-based nomogram was created using the top-ranked independent variables with the best performance. The clinical parameters from Dataset-1’s Day-0 data were utilized for model creation, while the Day-3 and Day-7 data, along with the hospital-admission data from Dataset-2, were used for model validation. Internal calibration curves, with the first dataset, and external calibration curves, with the second dataset, are used to assess the performance of the developed model. To determine the threshold values at which the nomogram is clinically relevant, the study used Decision Curve Analysis (DCA).
2.6. Nomogram-Based Scoring System
Nomograms are widely used for clinical prognosis as they can help simplify statistical predictive models into the probability of an event, i.e., mortality in this study. They are preferred by clinicians due to their user-friendly graphical interfaces [
54]. The nomogram represents each independent variable as a numbered horizontal scale, with the patient’s value placed on that scale. From each of these scales, a vertical line is traced down to a horizontal score axis. On the score axis, the scores from all the independent variables are summed to produce a total score, which is then linked to a death probability on a horizontal axis scaled from 0 to 1. It should be emphasized that, according to the nomogram, a greater score indicates a higher risk of mortality. The model was created using the patients’ Day-0 data; however, it can also be validated longitudinally, predicting death probability from biomarkers acquired later during the patients’ hospital stay.
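The scoring step reduces to a simple thresholding of the total points (the cutoff values below are hypothetical placeholders; the actual ALDCC thresholds are those reported with the nomogram):

```python
def risk_group(total_score, low_cut=120.0, high_cut=180.0):
    """Map a nomogram total score to a mortality risk category.
    The cutoffs are illustrative, not the published ALDCC thresholds."""
    if total_score < low_cut:
        return "Low"
    if total_score < high_cut:
        return "Medium"
    return "High"

# Higher total points correspond to higher death probability on the nomogram.
groups = [risk_group(s) for s in (95.0, 150.0, 210.0)]
```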
4. Discussion
The association between the severity of the disease and the clinical evidence was explored in the current analysis. Based on the data acquired at hospital admission, ten predictors were identified by the logistic regression algorithm as predictors of death probability. Ten different classification models were trained, validated, and evaluated using the Top 1 to Top 10 features. The top five features yielded the highest AUC of 0.94, and their performance metrics were examined. A logistic regression-based nomogram was then developed utilizing these five variables, and an overall score, termed ALDCC, has been proposed for early categorization of death severity. Moreover, the results obtained in this paper are better than those of some recent similar works, as can be seen from
Table 6.
Age has been identified as a primary predictor of death in earlier research on the coronavirus family, including SARS [
59], Middle East respiratory syndrome (MERS) [
60] and COVID-19 [
61]. Immunosenescence and/or various medical problems appear to make older individuals more susceptible to significant COVID-19 disease [
55]. Lymphocyte levels, according to Liu et al. [
62], can aid in the early detection of COVID-19 disease severity. Lymphocytes, a type of immune cell, play a critical role in host defense and infection clearance. Lymphopenia, defined as a decrease in the number of blood lymphocytes, is a common biologic finding in COVID-19 patients and may play a role in disease progression and death [
63]. According to earlier studies, patients with community-acquired pneumonia show considerable immune system activation and/or immunological malfunction, leading to alterations in lymphocyte levels. It has been observed [
64,
65] that altered creatinine levels, due to kidney problems caused by COVID-19, are an indicator of COVID-19 severity and mortality. It was observed in this study that the lymphocyte and creatinine parameters were low for high-risk patients. CRP testing at the time of admission, according to Lu et al. in [
66], can help to predict COVID-19-associated mortality. CRP is an acute-phase protein generated by hepatocytes in response to infection, inflammation, or tissue damage-induced cytokines from leukocytes [
63,
66,
67,
68,
69,
70]. This study found similar results, with higher CRP levels estimated upon admission for COVID-19/non-COVID individuals at high mortality risk, indicating that these patients had severe lung inflammation or, more likely, a subsequent bacterial infection [
61]. Weng et al. [
55] recently indicated that the individual primary predictors associated with death probability were age, lymphocyte count, D-dimer, and CRP, and developed a nomogram for death prediction using these key predictors. In this study, a logistic regression model using the five selected key predictors reported at admission was used to construct a nomogram-based prognostic model that exhibits excellent calibration and discrimination in predicting the probability of death of both COVID-19 and non-COVID-19 patients. An unseen external cohort was used for validation, and the model also showed outstanding performance on the external dataset. Additionally, blood sample data obtained from patients at several points during their hospital stay were analyzed, and the model also performed very well on these longitudinal data. The AUC values for the development, internal, and external validation cohorts were 0.987, 0.999, and 0.992, respectively, which, to the best of our knowledge, is superior to all previous nomogram-based mortality prediction methods.
Furthermore, this nomogram-derived ALDCC score provides a simple, easy-to-understand, and interpretable early warning method for stratifying high-risk patients at admission and thus assisting their clinical management. Using the ALDCC score assessed at admission, all patients were grouped into three risk groups. Patients in the low-risk category can be isolated and handled in an isolation unit, while moderate-risk patients could be treated in a specialized isolation ward. On the other hand, patients in the high-risk group should be closely monitored and, if possible, transferred to critical care facilities or the ICU for emergency treatment.
This research has scope for future improvement. Firstly, the article suggests that clinical data on both COVID-19 and non-COVID-19 patients could be used to aid the estimation of early mortality; the model can be improved much further with the help of a larger dataset. Secondly, unlike the first dataset, where a limited number of parameters are present, given access to a larger feature set (as in Dataset-2), the machine learning model could be used to identify the best features in multi-center, multi-country data to create a more generalized model that can be used in any country.
Being able to predict patients’ risk of mortality is needed for allocating the right resources during a crisis. Indeed, patients with very high mortality risk might not be the target for receiving the highest level of support and might instead need comfort care in a crisis situation, as was seen during the first period of the pandemic in many countries.
Conversely, patients with low mortality risk should not be directed to resource-intensive units such as the ICU and can be treated outside the hospital, easing the strong pressure on healthcare facilities. This tool might also be used in research to evaluate its ability to predict the death of COVID-19 patients prospectively, and it could be refined by including other parameters. The limitation of this kind of tool is that it considers only clinical and biological parameters, does not integrate treatments, and is obviously exposed to bias.