Rapid Assessment of COVID-19 Mortality Risk with GASS Classifiers

Risk prediction models are fundamental to effectively triage incoming COVID-19 patients. However, current triaging methods often have poor predictive performance, are based on variables that are expensive to measure, and often lead to hard-to-interpret decisions. We introduce two new classification methods that can predict COVID-19 mortality risk from the automatic analysis of routine clinical variables with high accuracy and interpretability. The SVM22-GASS and Clinical-GASS classifiers leverage machine learning methods and clinical expertise, respectively. Both were developed using a derivation cohort of 499 patients from the first wave of the pandemic and were validated with an independent validation cohort of 250 patients from the second pandemic phase. The Clinical-GASS classifier is a threshold-based classifier that leverages the General Assessment of SARS-CoV-2 Severity (GASS) score, a COVID-19-specific clinical score that recently showed its effectiveness in predicting COVID-19 mortality risk. The SVM22-GASS model is a binary classifier that non-linearly processes clinical data using a Support Vector Machine (SVM). In this study, we show that SVM22-GASS was able to predict the mortality risk of the validation cohort with an AUC of 0.87 and an accuracy of 0.88, better than most scores previously developed. Similarly, the Clinical-GASS classifier predicted the mortality risk of the validation cohort with an AUC of 0.77 and an accuracy of 0.78, on par with other established and emerging machine-learning-based methods. Our results demonstrate the feasibility of accurate COVID-19 mortality risk prediction using only routine clinical variables, readily collected in the early stages of hospital admission.


Introduction
In late 2019, a new member of the coronavirus family, named Severe Acute Respiratory Syndrome CoronaVirus-2 (SARS-CoV-2), emerged in the Chinese province of Hubei [1] and rapidly spread worldwide, causing the first pandemic by a coronavirus. As of January 2023, the virus has caused more than 670 million cases of infection and has led to over 6.7 million deaths in more than 200 countries [2]. COVID-19, the clinical manifestation of SARS-CoV-2 infection, is undoubtedly the most important global health concern this century has witnessed and has been taking a heavy toll in terms of human, financial, and social resources. Moreover, the appearance of new SARS-CoV-2 strains and the lagging vaccination campaigns hinder the objective of reaching herd immunity, implying that such issues will likely persist in the coming months. Hospitals are particularly affected, especially during periods of increased disease transmission, when the number of patients that simultaneously need treatment increases exponentially. Such waves of patients can quickly deplete hospital resources if not dealt with optimally.
In these circumstances, it is fundamental to estimate the amount of resources incoming COVID-19 patients will likely require during their hospital stay; ideally, this assessment would be made in the early phases of hospital admission. This would ensure that high-risk patients have access to adequate resources and that, in general, hospital resources are not misdirected. General risk scores and comorbidity indices, such as the Charlson Comorbidity Index (CCI) [3], have been shown to be helpful in estimating the prognosis and thus the level of care that COVID-19 patients require [4,5]. Similarly, pneumonia-specific risk scores such as the CURB-6 [6] and the CURB-65 [7] scores have been shown to have even greater prognostic ability [8,9]. Nevertheless, both kinds of risk scores tend to underperform clinical risk scores that are specifically targeted to COVID-19 patients [10][11][12]. This is in part due to the remarkable variability of SARS-CoV-2 manifestations across patients.
The SARS-CoV-2 infection manifests itself heterogeneously: symptoms can range from being completely absent (up to 33% of total cases [13]) to being critical (up to 5% of total cases [14]), in which case they include respiratory failure, high fever, hyperinflammation, and multiorgan dysfunction. The percentage of critical cases increases substantially in the cohorts of hospitalized COVID-19 patients, reaching peaks close to 20% [15]; such percentages are particularly high in older patients with coexisting morbid conditions and cardiovascular diseases.
To move beyond the limitations of general risk scores, in this work, we introduce two COVID-19-specific classifiers: the Clinical-GASS and the SVM22-GASS (GASS = General Assessment of SARS-CoV-2 Severity; SVM = Support Vector Machine). Both classifiers identify high-risk patients by predicting their 30-day mortality outcome and were developed and validated using independent derivation and validation cohorts. The SVM22-GASS classifier is based on an SVM model and is built in a fully data-driven fashion exploiting state-of-the-art machine learning methods. The Clinical-GASS builds on a COVID-19-specific risk score we recently introduced, which proved effective at stratifying the population of hospitalized COVID-19 patients: the GASS score [10].

Study Design and Participant Recruitment
This is an observational cohort study, developed in the two hospitals of Ferrara's territory dedicated to COVID-19 inpatients, "Arcispedale S. Anna" in Cona (Fe) and "Ospedale del Delta" in Lagosanto (Fe). The province of Ferrara is located in the eastern part of the Emilia-Romagna region of Italy, has a population of approximately 350,000 inhabitants, and is characterized by a high proportion of elderly subjects (~26% of the total population is aged >65 years, and nearly 1% >90 years).
This study analyzes the data of two independent cohorts of COVID-19 patients: a derivation cohort, used to develop the classifiers, and a validation cohort, used to validate them. In the machine learning community, such data are commonly referred to as the training and test set, respectively. The derivation cohort comprises the data of 499 patients recruited between March 2020 and June 2020, which had already been partially described and analyzed in our previous study [10]. The validation cohort comprises the data of 250 patients recruited between September 2020 and March 2021; in this period, 450 patients were admitted to our departments, but the data of 200 subjects were not complete for all the relevant variables and were thus discarded.
The collected variables consist of demographic, anamnestic, and laboratory data. The anamnestic data comprise: previous history of smoking, hypertension, ischemic heart disease, heart failure, chronic kidney disease, stroke or TIA (Transient Ischemic Attack), peripheral arterial disease (PAD), chronic obstructive pulmonary disease (COPD), hepatopathy, cancer, dementia, and diabetes. The laboratory data comprise: white blood cell count (WBC), lymphocyte count, creatinine, C Reactive Protein (CRP), procalcitonin, fibrinogen, D-dimer, isoamylase, alanine transferase (ALT), creatine phosphokinase (CPK), lactic dehydrogenase (LDH), ferritin, brain natriuretic peptide (BNP), and HS Troponin I (HS TnI). All the variables were retrieved from the patients' electronic health records with their consent. Baseline symptoms and vital signs were added to their records and used to calculate both their CCI score (Charlson Comorbidity Index) and the GASS score. Note that the variables on which the CCI and GASS scores are based are defined and explained in previous studies [3,10], and are reported in the Supplementary Tables S1 and S2. SARS-CoV-2 infection was confirmed with the reverse transcriptase-polymerase chain reaction (RT-PCR) test. The exclusion criteria were a negative swab test for viral detection and age below 18 years. All patients signed an informed consent designed specifically for the purpose of this study; when patients were unable to sign, consent was obtained from their legal representatives.
Importantly, all data were collected at the first hospital visit or blood examination.

Descriptive Analyses
We evaluated the differences between subjects in terms of the three major COVID-19 outcomes (1. IoC = intensification of care, meant as the need for non-invasive mechanical ventilation or for endotracheal intubation; 2. in-hospital death; 3. 30-day death). The observation period was protracted until the 30th day after hospital admission for those patients who survived the hospitalization; this was possible using records of the local registry office linked to the hospital information system. Patients needing IoC are denoted by the acronym "IoCp", while, for those who did not undergo IoC, we chose the acronym "nIoCp". The groups of patients who died or survived after an observation period of 30 days are denoted by the acronyms "30-ddp" (30-day deceased patients) and "30-dsp" (30-day survived patients), respectively.

Statistical Analysis
Data analyses were performed using SPSS 26.0 software (IBM SPSS Statistics, IBM Corporation) and MATLAB (MATLAB 2020a, The MathWorks, Natick, MA, USA). The normal distribution of the continuous variables was assessed using the Kolmogorov-Smirnov and Shapiro-Wilk tests. Variables not normally distributed were analyzed using non-parametric tests. Categorical variables were summarized using frequencies and percentages, and continuous data were presented as median (interquartile range, IQR). The Mann-Whitney U test was used for continuous variables, while the χ2 test was used for categorical variables. Variables with a p value < 0.05 in the univariate analyses were used to perform multivariate logistic regression analyses. All p values < 0.05 were considered statistically significant.

The Clinical-GASS Classifier
In our previous study [10], we defined a new clinical score, the GASS score, and discussed its ability to stratify the population of hospitalized COVID-19 patients into groups with significantly different 30-day mortality risks. Such a stratification has the potential to improve patient care by signaling to the hospital personnel the patients who are likely to need a mild (GASS < 6), moderate (6 ≤ GASS ≤ 10), or high (GASS > 10) level of care. However, clinical personnel might also benefit from knowing whether the patient at hand has a low or a high risk of dying and is likely to require specialized treatment such as admission to the intensive care unit. This can be accomplished by building a binary classifier that predicts, for example, the 30-day mortality outcome. This is the primary outcome measure we aim to predict with this classifier and with the others considered in the following sections. To assess the potential of the GASS score for such a binary classification task, we built a simple threshold-based classifier. The classifier, named Clinical-GASS, was designed by computing the optimal Receiver Operating Characteristic (ROC) threshold on the training set in terms of the cost/benefit ratio [16]; equal costs were assumed for the misclassification of survivors and non-survivors. Patients with a GASS score below this threshold are classified as likely survivors, while patients with a GASS score above it are considered likely non-survivors. A similar approach was used to build the Charlson Comorbidity Index (CCI) classifier from the CCI score. The CCI is a clinical score that is often used as a baseline method to estimate the mortality risk in the general population.
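The threshold selection described above can be sketched in a few lines. Under equal misclassification costs, the optimal ROC operating point maximizes Youden's J (sensitivity + specificity − 1). The GASS values and outcomes below are synthetic, illustrative data, not study data:

```python
import numpy as np

def optimal_roc_threshold(scores, died):
    """Pick the score cutoff maximizing Youden's J = sensitivity + specificity - 1,
    the optimal ROC operating point under equal misclassification costs."""
    scores, died = np.asarray(scores, float), np.asarray(died, bool)
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t                              # predicted non-survivor
        sens = (pred & died).sum() / died.sum()
        spec = (~pred & ~died).sum() / (~died).sum()
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t

# Illustrative (synthetic) GASS scores and 30-day outcomes:
gass = [3, 5, 8, 12, 14, 4, 11, 16, 7, 13]
died = [0, 0, 0, 1, 1, 0, 0, 1, 0, 1]
t = optimal_roc_threshold(gass, died)
# Patients with GASS >= t would be classified as likely non-survivors.
```

The same procedure applied to CCI scores yields the CCI classifier's cutoff.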

The SVM22-GASS Classifier
Both the Clinical-GASS and the CCI classifiers are built leveraging medical expertise and thus might have a strong appeal to clinicians: they are likely to have full understanding of how the classifiers work, to trust their decisions, and to thus be willing to use them. However, both classifiers only use a limited number of features (Clinical-GASS: n = 11, CCI classifier: n = 19) and combine them with only sums and other simple operations defined by piecewise constant functions. For these reasons, both classifiers are potentially unable to fully exploit useful features contained in the patients' health records and their non-linear interactions, which might be predictive of survival outcome.
To address this issue, we used a Support Vector Machine (SVM) classifier with a Radial Basis Function (RBF) kernel [17]. An RBF-SVM is a binary non-linear classification method that classifies data by non-linearly mapping the feature vectors into an infinite-dimensional space in which data from different groups can be easily separated by a hyperplane. In brief, data x are classified by the decision function f(x), which is given by

f(x) = sign( Σ_{i=1}^{N} α_i y_i k(x_i, x) + b ).

Here, N is the number of support vectors x_i, y_i ∈ {−1, +1} are the corresponding class labels, α_i are weighting coefficients, b is a bias term, and k(·,·) is the kernel function. In our case, we chose the Gaussian Radial Basis Function:

k(x_i, x) = exp( −‖x_i − x‖² / (2σ²) ).

Therefore, the decision function f(x) becomes:

f(x) = sign( Σ_{i=1}^{N} α_i y_i exp( −‖x_i − x‖² / (2σ²) ) + b ).

The SVM classifier was trained using only the data contained in the training set, was tested on the independent test set, and was compared against the CCI and Clinical-GASS classifiers. In the remainder of this article, we will refer to the SVM classifier trained on our training set as SVM22-GASS.
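The decision function of an RBF-SVM can be evaluated directly once the support vectors, coefficients, and bias are known. The sketch below uses the standard formulation with class labels y_i and bias b; all numerical values are toy illustrations, not the fitted SVM22-GASS model:

```python
import numpy as np

def rbf_svm_decision(x, support_vectors, alphas, labels, bias, sigma2):
    """Evaluate f(x) = sign(sum_i alpha_i * y_i * exp(-||x_i - x||^2 / (2*sigma2)) + b)."""
    d2 = np.sum((support_vectors - x) ** 2, axis=1)   # squared distances ||x_i - x||^2
    kernel = np.exp(-d2 / (2.0 * sigma2))             # Gaussian RBF kernel values
    score = np.dot(alphas * labels, kernel) + bias    # weighted kernel sum plus bias
    return 1 if score >= 0 else -1

# Toy support vectors in 2-D (illustrative values only):
sv = np.array([[0.0, 0.0], [1.0, 1.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([-1.0, 1.0])                        # y_i in {-1, +1}
pred_pos = rbf_svm_decision(np.array([0.9, 1.1]), sv, alphas, labels, 0.0, 1.0)
pred_neg = rbf_svm_decision(np.array([0.1, 0.0]), sv, alphas, labels, 0.0, 1.0)
```

Points near the positive support vector receive a positive label, and vice versa, illustrating how the kernel localizes each support vector's influence.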

Data Preprocessing
Before training the SVM classifier, we made sure to have sound training data. This was accomplished in two steps. First, we excluded from the analysis the variables that were measured in less than 50% of the patients. Second, we imputed the remaining missing values with the weighted average of the values of the three most similar instances, with weights inversely proportional to the distances from these. As a distance measure, we used the Euclidean distance for continuous features and the Hamming distance for binary ones.
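The imputation step above can be sketched as a small nearest-neighbor routine. This simplified version handles continuous features only (Euclidean distance; the paper also uses Hamming distance for binary features) and searches neighbors among fully observed rows; the data matrix is illustrative:

```python
import numpy as np

def impute_knn(X, k=3):
    """Fill NaNs with the inverse-distance-weighted mean of the k most similar
    complete rows (Euclidean distance over the jointly observed features)."""
    X = np.asarray(X, float).copy()
    complete = X[~np.isnan(X).any(axis=1)]            # fully observed rows
    for row in X:
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nn = np.argsort(d)[:k]                        # k nearest complete rows
        w = 1.0 / (d[nn] + 1e-9)                      # inverse-distance weights
        row[miss] = (w @ complete[nn][:, miss]) / w.sum()
    return X

# Toy matrix with one missing value (illustrative only):
X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [5.0, 6.0], [1.0, np.nan]])
X_filled = impute_knn(X)
```

The missing entry is filled mostly from the nearest neighbor, since its weight dominates the weighted average.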

Data Augmentation
Learning a classifier from imbalanced training data such as ours (75.7% survivors) is a challenging task and can lead to poor sensitivity to the minority class. To deal with this issue, we balanced the training set using the Synthetic Minority Over-sampling Technique (SMOTE). We used this specific method because of its proven effectiveness and simplicity [18]. In brief, this method generates synthetic instances of the minority class by linearly combining samples of the minority class. Specifically, for each minority-class sample s, one can generate h synthetic samples by first randomly sampling h of its k nearest neighbors, and then randomly perturbing s along the directions of the difference vectors between s and the h samples. In this algorithm, k is a hyperparameter, while h depends on the number of synthetic samples to generate.
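A minimal SMOTE sketch, assuming continuous features and illustrative minority-class data (production code would typically use a maintained implementation such as the one in imbalanced-learn):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(minority, n_new, k=5):
    """Generate n_new synthetic minority samples: pick a random minority
    sample s, one of its k nearest minority neighbors nn, and interpolate
    s + u * (nn - s) with u ~ Uniform(0, 1)."""
    minority = np.asarray(minority, float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        s = minority[i]
        d = np.sum((minority - s) ** 2, axis=1)
        nn_idx = np.argsort(d)[1:k + 1]               # skip s itself
        nn = minority[rng.choice(nn_idx)]
        synthetic.append(s + rng.uniform() * (nn - s))
    return np.array(synthetic)

# Illustrative minority-class points (e.g., non-survivors) in 2-D:
minority = rng.normal(size=(10, 2))
new_samples = smote(minority, n_new=20)
```

Because each synthetic point lies on a segment between two real minority samples, the method interpolates within the minority region rather than duplicating samples.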

Feature Selection
To remove potentially redundant and uninformative features, while reducing the computational cost of training the SVM model, we performed feature selection; this was accomplished by learning a regularized logistic regression model with LASSO penalty [19,20]. The regularization parameter λ_LASSO, which determines the regularization strength and thus the number of selected features, was chosen by minimizing the ten-fold cross-validation deviance. Note that we chose this specific feature selection method due to its well-known cost-effectiveness, its ability to deal with models with both continuous and categorical variables, and the strong performance it exhibited in a previous study [10].
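A sketch of LASSO-based feature selection with scikit-learn, on synthetic stand-in data (2 informative features out of 10). In the study the penalty strength was chosen by minimizing the ten-fold cross-validation deviance; here we fix the penalty (via C) for brevity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: only features 0 and 1 drive the outcome.
X = rng.normal(size=(300, 10))
logits = 3 * X[:, 0] - 3 * X[:, 1]
y = (logits + rng.normal(scale=0.5, size=300) > 0).astype(int)

# L1-penalized (LASSO) logistic regression; smaller C = stronger penalty,
# which zeroes out the coefficients of uninformative features.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_[0])   # indices of retained features
```

The retained indices are then the features passed on to the SVM training stage.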

Parameter Training
To learn the model's parameters, we minimized the hinge loss using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) solver [21]. Importantly, to speed up training and reduce memory requirements, we approximated the Gaussian kernel using the Fastfood [22] random feature expansion method. To deal with potential residual classification biases due to the class imbalance, we adjusted the SVM classification threshold [23]. Specifically, we used the SVM classification scores to compute the optimal operating point of the ROC curve and used the resulting threshold to classify the data.
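The idea behind random feature expansion is that an explicit finite-dimensional map z(x) can approximate the Gaussian kernel as an inner product. The sketch below uses plain random Fourier features, the simpler construction that Fastfood accelerates with structured matrices; the inputs are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, n_features, sigma2):
    """Random Fourier feature map z(x) such that z(x) . z(x') approximates
    the Gaussian kernel exp(-||x - x'||^2 / (2 * sigma2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / np.sqrt(sigma2), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

x = np.array([[0.5, -0.2], [0.1, 0.3]])
Z = rff_map(x, n_features=5000, sigma2=1.0)
approx = Z[0] @ Z[1]                                   # kernel approximation
exact = np.exp(-np.sum((x[0] - x[1]) ** 2) / 2.0)      # exact Gaussian kernel
```

Once the data are mapped through z(·), the kernel SVM reduces to a linear model in the random feature space, which is what makes fast solvers such as L-BFGS applicable.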

Hyperparameter Optimization
Before learning the final model's parameters, to preselect a class of SVM models suitable for the dataset at hand, we performed hyperparameter optimization [24] using Bayesian optimization [25]. Specifically, with this procedure we found the kernel scale σ², the regularization strength λ_SVM, and the dimensionality of the random feature space δ that minimize the five-fold cross-validation hinge loss. Note that we used Bayesian optimization rather than simpler methods such as random or grid search because this method tends to be less sensitive to the choice of the hyperparameter ranges. Furthermore, it generally requires a lower expected number of calls to the cost function, which reduces the computational cost.

Model Evaluation
To evaluate the quality of the model's predictions, we computed the Area Under the ROC Curve (AUC), accuracy, sensitivity, and specificity on the independent test set. The SVM's performance is then compared to that of the other classifiers of interest, namely, the CCI classifier and the GASS classifier. As a baseline model, we also considered the majority classifier, which assigns the majority class to every instance in the dataset, regardless of its features. Importantly, the majority classifier is patient agnostic: it does not use the data associated with a specific patient to determine the risk class. Thus, comparing our classifiers to this baseline model allows us to quickly assess whether they are able to competently extract useful information from patient data, rather than only exploiting the class distribution of the training set.
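The four evaluation metrics can be computed directly from predictions and scores; the AUC admits a rank-based (Mann-Whitney) formulation that avoids explicitly tracing the ROC curve. Toy labels and scores below are illustrative:

```python
import numpy as np

def evaluate(y_true, y_pred, scores):
    """Accuracy, sensitivity, specificity from hard predictions, and AUC from
    continuous scores via the rank-sum (Mann-Whitney) formulation."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    scores = np.asarray(scores, float)
    acc = (y_true == y_pred).mean()
    sens = (y_pred & y_true).sum() / y_true.sum()         # true positive rate
    spec = (~y_pred & ~y_true).sum() / (~y_true).sum()    # true negative rate
    pos, neg = scores[y_true], scores[~y_true]
    # AUC = P(score_pos > score_neg) + 0.5 * P(tie)
    auc = (pos[:, None] > neg).mean() + 0.5 * (pos[:, None] == neg).mean()
    return acc, sens, spec, auc

y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
y_pred = [s >= 0.5 for s in scores]
acc, sens, spec, auc = evaluate(y_true, y_pred, scores)
```

Note that the majority classifier baseline would always predict the majority (survival) class, giving a sensitivity of 0 regardless of its accuracy.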

Model Interpretation
As is often the case with high-dimensional and non-linear machine learning models, once the SVM model is fully trained, it is challenging to gain an intuitive understanding of the mechanism underlying the classifier's decision process. To shed some light on such a mechanism, we estimated the variable importance (VI) [26] of each input feature of the SVM model by computing the model reliance. In brief, this method assigns high reliance to the features that, when perturbed, lead to a strong decrease in classification accuracy. Furthermore, to assess the discriminative power of the most important features selected with this method, we trained a reduced SVM model to classify survivors and non-survivors using only the top-3 features [27] and measured the decrease in classification performance.
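Model reliance can be sketched as permutation importance: shuffle one feature column at a time and measure the resulting drop in accuracy. The classifier below is a toy stand-in that only uses feature 0, not the fitted SVM:

```python
import numpy as np

rng = np.random.default_rng(0)

def model_reliance(predict, X, y, n_repeats=10):
    """Permutation importance: average drop in accuracy when each feature
    column is shuffled; large drops indicate high reliance on that feature."""
    base = (predict(X) == y).mean()
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])      # break feature-target link
            drops.append(base - (predict(Xp) == y).mean())
        importance[j] = np.mean(drops)
    return importance

# Toy classifier that only looks at feature 0 (illustrative only):
predict = lambda X: (X[:, 0] > 0).astype(int)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
vi = model_reliance(predict, X, y)
```

Only the feature the model actually uses receives a non-zero reliance, which is exactly the property exploited to rank the SVM's 38 input features.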

Guidelines and Ethical Approval
For the compilation of this manuscript, we followed STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines for reporting observational studies. The local Ethics Committee "Comitato Etico Indipendente di Area Vasta Emilia Centro (CE-AVEC)" approved the protocol of this study; the protocol code is 712/2020/Oss/AOUFe.

Descriptive Analyses
The distribution of males and females in the overall population was quite homogeneous (54.8% males vs. 45.2% females), while in the group of patients who underwent intensification of care (IoCp) there was a higher percentage of males (65.8% vs. 34.2%; p = 0.021). The median age of the overall population was 72 years (IQR 58-82 years), and in-hospital mortality was recorded for 62 patients (24.8%). The subjects who died during the hospitalization period (deceased) were significantly older than those who were discharged (median age 82 vs. 67 years; p < 0.001). This was also true for the group of patients who died within 30 days (30-ddp) compared with those who survived within the same period (30-dsp) (median age 81 vs. 68 years; p < 0.001).
Not surprisingly, we found differences between the IoCp and nIoCp groups in terms of length of stay (21 vs. 10 days; p < 0.001); significant differences were also found between the 30-ddp and 30-dsp groups (9 days vs. 13 days; p < 0.001).
All data concerning the characteristics of the studied population can be found in Table 1; extended results are provided only for those variables found to be important in the multivariate analyses later. Differences between groups were also found in terms of comorbidities, and all the results are summarized in Table 2. Patients of the IoCp group were more often smokers; in the group of deceased subjects we found more diagnoses of hypertension, ischemic heart disease, heart failure, chronic kidney disease, history of stroke or transient ischemic attack (TIA), peripheral arterial disease (PAD), chronic obstructive pulmonary disease (COPD), localized or hematological cancer, dementia and diabetes with organ damage. Comparison analyses between groups were also performed for each item considered in the Clinical-GASS score and they can be found in the Supplementary Tables S3-S5. All variables found to be significantly different between groups in the univariate analyses were entered into multivariate analyses.
In the logistic regression analyses concerning the need for intensification of care, the strongest predictive variable was the PaO2/FiO2 ratio: both ratios <100 and 100-199 appeared to be protective against the intensification of care (OR 0.11, 95% CI 0.02-0.72; p = 0.021 and OR 0.07, 95% CI 0.02-0.27; p < 0.001, respectively), while male sex and a higher respiratory rate (>30 breaths per minute) appeared to be predictive of a higher need for intensification of care (OR 3.56, 95% CI 1.20-10.56; p = 0.022 and OR 4.76, 95% CI 1.00-22.96; p = 0.050).
The Clinical-GASS score was calculated for each patient, and this allowed us to predict the relative risk of in-hospital and 30-day death based on the analyses of this cohort of subjects. We developed an open-access web tool to help clinicians identify the patients at higher risk of poor COVID-19 outcomes by simply filling out a form with all the required variables. This tool can be found at the following link: https://ml.unife.it/GASS.html and is available for free consultation.
Furthermore, we performed Receiver Operating Characteristic (ROC) curve analyses in order to evaluate the predictive power of the GASS score, compared to the Charlson Comorbidity Index, which has also recently been tested in COVID-19 inpatients [28]. The main results of the classification analysis are reported in Figure 1.

The CCI Classifier
The optimal ROC threshold for the Charlson Comorbidity Index classifier was 11. This means that the optimal way to classify COVID-19 patients based on the CCI alone is to consider as high-risk all the patients with a CCI greater than or equal to this value. However, this classifier performed poorly on the test set, with an Area Under the Curve (AUC) of 0.66 and an accuracy of 0.76 (Figure 1D, grey bars). The poor performance of this classifier is attributable to the complete inability to correctly identify non-survivors (sensitivity = 0), which makes it effectively equivalent to the naïve majority classifier (black bars). We refer the reader to Section 2.4 for further details about the CCI classifier.

The Clinical-GASS Classifier
The optimal ROC threshold for the Clinical-GASS classifier was 13, which lies, as expected, within the range of values of the high mortality risk class (GASS > 10). Overall, the Clinical-GASS displayed satisfying test performance, with an AUC of 0.77 and an accuracy of 0.78 (Figure 1D, orange bars). Importantly, the Clinical-GASS classifier performed better than both the majority and the CCI classifiers, due to its improved ability to identify non-survivors (sensitivity = 0.31). We refer the reader to Section 2.4 for further details about the Clinical-GASS classifier.

The SVM22-GASS Classifier
The minimum-deviance ten-fold cross-validation regularization strength λ_LASSO was equal to 4.9 × 10⁻³. The corresponding LASSO logistic regression model fitted with such a parameter selected a total of 38 features, which are reported in Figure 1A. The best cross-validated hyperparameters of the RBF-SVM classifier trained using such features were σ² = 774.16, λ_SVM = 1.4 × 10⁻⁶, and δ = 7318. The classifier trained with such hyperparameters achieved nearly optimal performance on the training set (AUC ≈ 1, accuracy ≈ 1, Figure 1C, blue bars), which means that the decision hyperplane is able to completely separate survivors from non-survivors in the transformed feature space. Importantly, the SVM classifier achieved very good performance also on the independent test set, with an AUC of 0.87 and an accuracy of 0.88. The ability of the SVM classifier to classify COVID-19 patients is thus superior to that of all the other considered classifiers. Such an improvement is largely attributable to a markedly better ability to correctly identify non-survivors (sensitivity = 0.61). We refer the reader to Section 2.5 for further details about the SVM22-GASS classifier.
Figure 1 caption, panel (D): corresponding measures computed on the test set. Note that the majority class in both the training and test set is the survival class, which was arbitrarily associated with a negative test result; therefore, by always predicting survival, the majority classifier has a sensitivity of 0 and a specificity of 1.

Model Interpretation and Reduced Models
The SVM classifier just described predicts the most likely 30-day mortality outcome of COVID-19 patients by non-linearly processing a selection of 38 features that characterize them. To understand whether some of these features are more important than others, we computed the model reliance [26]. The results of this analysis are reported in Figure 2A. To understand how such features differ between survivors and non-survivors, and whether such differences are statistically significant, we performed Mann-Whitney U tests, using a Bonferroni-adjusted significance value of 0.05 (Figure 2B). Out of the 8 most important features, only 2 do not appear to be significantly different between survivors and non-survivors (CPK and fibrinogen, p value > 0.05). Medians, 95% bootstrap confidence intervals, and interquartile ranges of the top eight features, for the 30-day survival (green) and death (red) groups, can be found in Figure 2, panel B.
The fact that the full SVM model appears to rely mostly on a subset of the full feature set of 38 features, raises the question of whether a reduced model, which has only access to the most informative features, can also accurately predict the COVID-19 outcome. To address this question, we trained a reduced RBF-SVM model to predict the COVID-19 outcome using only the three most informative features, that is White Blood Cell Count (WBC), Lymphocyte Count (LYM), and Brain Natriuretic Peptide (BNP). Choosing exactly three features allows us to visualize the entire feature space in which the classifier operates, and to plot the resulting decision surface.
As an additional baseline, we also trained another RBF-SVM model that had access to only the three least informative features (namely, Creatinine (CRE), Procalcitonin (PCT), and D-dimer (XDP)). The models, which we refer to as SVM-TOP3 and SVM-BOT3, respectively, were trained following the same steps adopted for the full SVM model. The results of this analysis are reported in Figure 3. As expected, both reduced models exhibited worse training and test performance than the full SVM model (Figure 3B,C). Nevertheless, while the SVM-BOT3 classifier exhibited overall poor test performance, comparable to that of the CCI classifier (AUC = 0.7, accuracy = 0.7; Figure 3C), the SVM-TOP3 classifier performed surprisingly well. As a matter of fact, the SVM-TOP3 had better test performance than the Clinical-GASS classifier (AUC = 0.83, accuracy = 0.83; Figure 3C), despite using only three features.
The resulting decision surface (Figure 3A) is, as expected, highly non-linear. The 3D plot of the feature space confirms that, generally, non-survivors seem to have higher serum BNP and lymphocyte count. Interestingly, the white blood cell count appears to become strongly predictive of a non-survival outcome only for very high values (WBC > 10 × 10³ n/mmc), especially when associated with high values of lymphocytes (LYM > 9 × 10² n/mmc).

Discussion
Predicting the mortality risk of hospitalized COVID-19 patients could go a long way toward the optimal allocation of hospital resources. Popular clinical scores are potentially useful in this regard, but have proven to be less accurate than COVID-19-specific methods [10,29]. In this work, we have introduced and validated two new methods for predicting 30-day mortality risk: the Clinical-GASS and the SVM22-GASS. Overall, our results show that both methods are reliable and can thus be used to effectively triage incoming COVID-19 patients using only readily available variables.


Clinical-GASS Score and Risk Factors
The Clinical-GASS score is based on 11 variables that are readily available after the first visit in the Emergency Room, requiring only a venous and an arterial blood sample. Reliable, easy-to-use tools for predicting outcomes in COVID-19 inpatients are increasingly urgent, given the scarcity of human and financial resources, which have been under strain since the beginning of the pandemic.
The retrospective analysis of the first 499 patients hospitalized with COVID-19 in our hospitals in the province of Ferrara revealed a strong relationship between the Clinical-GASS score and COVID-19 mortality (both in-hospital and 30-day rates). Moreover, a slight relationship with the need for intensification of care was observed, but it was confirmed only in the population with a Clinical-GASS score lower than 10 points; in the current study, this held only for extreme values of the score (<5 or >10 points).
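As a minimal illustration of how these cutoffs can be operationalized, the sketch below maps a computed GASS score to a coarse risk tier using the extreme-value thresholds mentioned above (<5 and >10 points). The function name and tier labels are illustrative, not part of the published score.

```python
def gass_risk_tier(score: int) -> str:
    """Map a computed GASS score to a coarse risk tier.

    Cutoffs follow the extreme-value thresholds discussed in the text
    (<5 and >10 points); the tier labels themselves are illustrative.
    """
    if score < 5:
        return "low"
    elif score <= 10:
        return "intermediate"
    else:
        return "high"

# Example: patients at the extremes of the score range
print(gass_risk_tier(3))   # low
print(gass_risk_tier(12))  # high
```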
In our cohort of patients, the in-hospital mortality rate was 24.8%, while the 30-day mortality rate was 23.6%. Our findings showed that patients who underwent intensification of care, needing non-invasive mechanical ventilation or endotracheal intubation, were more often male, had a higher Clinical-GASS score (p < 0.001), and showed worse respiratory performance (higher respiratory rates and lower PaO2/FiO2 ratios). As for laboratory abnormalities, these patients had higher serum inflammatory markers (CRP and ferritin), higher organ-damage markers (isoamylase, ALT, LDH, HS TnI), and a markedly pro-coagulative status (higher serum D-Dimer).
Such findings are consistent with previous studies: male sex has already been linked to poorer chances of survival from COVID-19 and to an increased risk of admission to Intensive Care Units (ICUs) [30]; similarly, the PaO2/FiO2 ratio was already shown to be independently associated with worse COVID-19 outcomes [29,31]. Furthermore, recent studies showed that COVID-19 inpatients commonly present with laboratory abnormalities, and the role of inflammatory and organ-damage markers has also been consistently reported [32,33]. Our analyses also confirmed the role of age. Early Chinese studies showed that the elderly population was bound to encounter worse COVID-19 outcomes [34,35], and such findings were later confirmed in studies with European [36,37] and American patients [38]. Interestingly, the laboratory findings and respiratory performance reported in these studies were similar to those of our IoC group. Additionally, in accordance with previous studies showing that low blood pressure at hospital admission is often associated with worse COVID-19 outcomes [39], we observed higher blood pressure (both systolic and diastolic) in the survivor group. Finally, the negative role of comorbidities (especially cardiovascular) and smoking [40] is also well established [41,42].
The first COVID-19-specific scores were developed in China [11,45] during the first wave of the pandemic. Further scores for the prediction of COVID-19 outcomes were later developed all over the world [46–48], and data from more heterogeneous datasets were collected in order to reduce the potential selection bias stemming from sampling a restricted population of subjects. A case in point is the 4C score, developed by Knight et al. [12]: this score quickly became popular due to its vast derivation and validation cohorts and its ease of use. However, its predictive performance (validation AUC = 0.77) appears to lag behind that of the SVM22-GASS introduced in this work (validation AUC = 0.87, accuracy = 0.88) and to be comparable with that of the Clinical-GASS (validation AUC = 0.77, accuracy = 0.78). Similar arguments can be made with regard to the recently developed "Piacenza score" (validation AUC = 0.78, accuracy = 0.55).
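The validation metrics used throughout these comparisons (AUC and accuracy) can be computed directly from predicted risks and observed outcomes. Below is a minimal NumPy sketch using the rank-based (Mann–Whitney) formulation of the AUC; the toy labels and risk scores are invented for illustration.

```python
import numpy as np

def auc_score(y_true, y_score):
    """Rank-based AUC: probability that a randomly chosen positive case
    is scored above a randomly chosen negative case (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Pairwise comparison of every positive against every negative
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of correct labels after thresholding the risk score."""
    return np.mean((np.asarray(y_score) >= threshold) == np.asarray(y_true))

# Toy example: 30-day mortality labels and predicted risks
y = [0, 0, 1, 1, 0, 1]
p = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
print(auc_score(y, p), accuracy(y, p))
```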
Importantly, the SVM22-GASS bases its prediction on clinical variables that can be quickly and inexpensively retrieved early during hospital admission. This is a significant advantage over other popular multivariate methods that rely on the manual analysis of medical imaging results (e.g., [29]) and yet achieve comparable predictive performance (validation AUC = 0.88). Like most machine-learning-based classifiers, the SVM22-GASS, the 4C, and the Piacenza scores are purely data-driven. This carries an intrinsic limitation: it is often hard to give a "clinical" meaning to the weight assigned to each feature considered in the development of the score. For this reason, in our first work, we decided to develop the GASS score, an informatics tool able to accurately predict COVID-19 outcomes while retaining clinical interpretability. The 11 features were chosen, in fact, based on both the strength of their associations with the outcomes and the clinical importance established for each of them in the current literature.

Interpreting the SVM22-GASS Classifier
The SVM22-GASS predicts the 30-day mortality outcome by non-linearly processing 38 patient features. Our variable importance analysis isolated eight variables as the most important for the model: White Blood Cell count (WBC), Lymphocyte count (LYM), Brain Natriuretic Peptide (BNP), Creatine Phosphokinase (CPK), Lactate Dehydrogenase (LDH), Fibrinogen (FIBR), PaO2/FiO2 Ratio (PFR), and High-Sensitivity Troponin I (TnI). Interestingly, half of them (namely, LYM, BNP, PFR, and TnI) are also part of the Clinical-GASS score, which confirms their strong information content and predictive power. Consistently, a recent meta-analysis found serum BNP to be significantly elevated in critically ill COVID-19 patients [49]; whether this peptide can help discriminate high-risk COVID-19 patients remains unclear and merits further investigation. Similarly, high WBC values, high LDH values, and low LYM values [15] have also recently been linked to high mortality risk.
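One common model-agnostic way to carry out such a variable importance analysis is permutation importance: shuffle one variable at a time and measure the resulting drop in predictive performance. The sketch below illustrates the idea with a stand-in scoring function and synthetic data; it is not necessarily the exact procedure used for the SVM22-GASS.

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_repeats=10, seed=0):
    """Model-agnostic importance: average drop in a performance score
    when a single column is shuffled, breaking its link with the outcome.
    `score_fn(X, y)` is assumed to return a metric where higher is better."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # destroy column j's association with y
            drops.append(baseline - score_fn(Xp, y))
        importances[j] = np.mean(drops)
    return importances

# Toy check: the outcome depends on column 0 only, so shuffling
# column 0 should hurt performance while columns 1 and 2 should not.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
score_fn = lambda X_, y_: np.mean((X_[:, 0] > 0).astype(int) == y_)  # stand-in "model"
imp = permutation_importance(score_fn, X, y)
print(imp.argmax())  # column 0 dominates
```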
On the other hand, variables such as age, sex, respiratory rate, and serum D-Dimer (XDP) do not appear to significantly influence the predictions of the SVM22-GASS classifier. This finding might appear surprising, as their association with poor COVID-19 outcomes is well established. For example, age is among the predominant risk factors for developing the severe form of the disease, due to immuno-senescence and all the physiological modifications related to it [35]. Similarly, serum levels of D-Dimer (XDP) were found to be strongly associated with COVID-19 mortality [50]. Furthermore, the respiratory rate is a useful indicator of potential respiratory dysfunction. Nevertheless, the fact that the SVM22-GASS does not rely on such variables does not mean that they provide no information about the mortality risk; it merely means that the variables that carry greater weight are more informative and/or less noisy.
To validate the results of the variable importance analysis, we trained two additional classifiers: the SVM-TOP3, which had access only to the three most important features (White Blood Cell count, Lymphocyte count, and Brain Natriuretic Peptide), and the SVM-BOT3, which had access only to the three least important features. Interestingly, we observed only a moderate performance decrease of the SVM-TOP3 with respect to the full SVM22-GASS model (AUC: 0.83 vs. 0.87; accuracy: 0.83 vs. 0.88). This suggests that the SVM-TOP3 is also a competent predictor and can be used in cases where not all of the 38 variables are available for a given patient.
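A reduced classifier of this kind can be sketched with scikit-learn as an RBF-kernel SVM over three standardized features. The data below are synthetic stand-ins for the three top features (WBC, LYM, BNP), and the non-linear outcome rule is invented purely for illustration; the real model was fitted on the clinical derivation cohort.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the three top features (WBC, LYM, BNP)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Invented non-linear outcome rule, purely for illustration
y = ((X[:, 0] > 0.8) & (X[:, 1] > 0.3) | (X[:, 2] > 1.0)).astype(int)

# RBF-kernel SVM with feature standardization, a standard combination
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X[:200], y[:200])
acc = clf.score(X[200:], y[200:])
print(f"held-out accuracy: {acc:.2f}")
```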

Potential Limitations
In this observational cohort study, we used independent derivation and validation cohorts to develop and validate our COVID-19-specific classifiers and risk score. However, both cohorts were recruited from the same two hospitals in the Italian Province of Ferrara and included only participants of Caucasian ethnicity. This might lead to overestimating the performance of our methods. Additionally, the sample size we used, albeit comparable to that of many other related studies (e.g., see [48]), is still too small to allow us to draw any final conclusion concerning the generalizability of our approaches.
To deal with these issues, future work will focus on undertaking a larger and multicentered study to validate our approaches on a larger and more heterogeneous cohort.
Additionally, some of the patients included in this study had missing values, which were imputed using a weighted nearest neighbor method. Imputation is not an easy task and might potentially lead to inflated performance measures. However, in real-world settings, it is often a necessary evil that comes with analyzing large and heterogeneous datasets such as ours.
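Weighted nearest-neighbor imputation of this kind can be sketched with scikit-learn's KNNImputer using distance weighting, where closer patients (by the observed variables) contribute more to the imputed value. The toy values and the choice of k below are illustrative, not the study's actual configuration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix of patients x clinical variables with missing entries (NaN)
X = np.array([
    [7.2, 1.1, 120.0],
    [6.8, np.nan, 110.0],
    [15.0, 0.4, 480.0],
    [14.2, 0.5, np.nan],
])

# Distance-weighted k-nearest-neighbor imputation
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imp = imputer.fit_transform(X)
print(X_imp)  # no NaN entries remain
```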
Furthermore, in the current study we did not use information about medical treatments, as this was missing. This implies that we cannot determine whether and how drug prescriptions altered the prognostic trajectories of our patients.
Finally, the fact that our feature selection step selected a subset of the 38 variables suggests that some of the original variables are correlated. Future studies will investigate these correlations in depth to provide further insight into the disease and potentially reduce the number of variables to collect.
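A first step in such a correlation analysis can be sketched as a simple screen of the Pearson correlation matrix for highly correlated variable pairs, which are candidates for dropping one of the two. The variable names, data, and 0.8 threshold below are illustrative.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.8):
    """Flag pairs of variables whose absolute Pearson correlation
    exceeds `threshold`."""
    R = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(R[i, j]) > threshold:
                pairs.append((names[i], names[j], round(R[i, j], 2)))
    return pairs

# Toy data: variable "b" is a noisy copy of "a", while "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])
print(correlated_pairs(X, ["a", "b", "c"]))  # flags only the ("a", "b") pair
```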

Conclusions
This work introduced two reliable classification approaches for the rapid assessment of mortality risk of COVID-19 patients. Importantly, both approaches rely on the automatic analysis of only routine clinical variables that can be readily acquired during hospital admission. The classifiers were developed and validated with independent derivation and validation cohorts and exhibited classification performance comparable to that of popular approaches that leverage information obtained with expensive medical imaging methods (e.g., chest radiography).
Our results demonstrate that the classifiers have the potential to facilitate the triaging of incoming COVID-19 patients, thereby optimizing the allocation of hospital resources.
Clearly, predictive scores in general should not be regarded as the only tools for evaluating patients and their complexity. All scoring systems should be considered an addition to clinical reasoning, laboratory and instrumental examinations, and expert opinion. Clinicians must always remember that any decision should be taken only after obtaining the complete picture of their patients.
In conclusion, we are confident that future work will further validate the proposed methods using larger and more heterogeneous validation cohorts; we will also focus on designing suitable strategies to assess them in hospital settings.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedicines11030831/s1, Table S1: GASS score items; Table S2: Charlson Comorbidity Index; Table S3: GASS score items -differences between groups in terms of Intensification of Care; Table S4: GASS score items -differences between groups in terms of in-hospital mortality; Table S5: GASS score items -differences between groups in terms of 30-day mortality.