Machine and Deep Learning Algorithms for COVID-19 Mortality Prediction Using Clinical and Radiomic Features

Aim: Machine learning (ML) and deep learning (DL) predictive models have been employed widely in clinical settings. Their potential to support clinicians by providing an objective measure that can be shared among different centers enables the possibility of building more robust multicentric studies. This study aimed to propose a user-friendly and low-cost tool for COVID-19 mortality prediction using both an ML and a DL approach. Method: We enrolled 2348 patients from several hospitals in the Province of Reggio Emilia. Overall, 19 clinical features were provided by the Radiology Units of Azienda USL-IRCCS of Reggio Emilia, and 5892 radiomic features were extracted from each COVID-19 patient’s high-resolution computed tomography. We built and trained two classifiers to predict COVID-19 mortality: a machine learning algorithm, or support vector machine (SVM), and a deep learning model, or feedforward neural network (FNN). In order to evaluate the impact of the different feature sets on the final performance of the classifiers, we repeated the training session three times, first using only clinical features, then employing only radiomic features, and finally combining both. Results: We obtained similar performances for both the machine learning and deep learning algorithms, with the best area under the receiver operating characteristic (ROC) curve, or AUC, obtained by exploiting both clinical and radiomic information: 0.803 for the machine learning model and 0.864 for the deep learning model. Conclusions: Our work, performed on large and heterogeneous datasets (i.e., data from different CT scanners), confirms the results obtained in the recent literature. Such algorithms have the potential to be included in a clinical practice framework since they can be applied not only to COVID-19 mortality prediction but also to other classification problems such as diabetes prediction, asthma prediction, and cancer metastases prediction. Our study indicates that the lesion’s inhomogeneity depicted by radiomic features, combined with clinical information, is relevant for COVID-19 mortality prediction.


Introduction
Since 2020, SARS-CoV-2 respiratory syndrome has affected the whole world [1,2]. High-resolution computed tomography (HRCT) was extensively used to diagnose, assess, and monitor the progression of the disease. From this medical imaging technique, it is possible to extract quantitative data, the so-called radiomics. Through this set of numerical descriptors, we can investigate the possible correlation between the information extracted directly from the images' pixel values and the clinical or biological characteristics of the involved patients [3]. As CT scans were a necessary part of the diagnostic pathway during the first period of the COVID-19 pandemic, at least in some centers, we can count CT as a non-invasive tool for enhancing the existing data available to clinicians by means of advanced mathematical analysis [4–9].
A recent review [10] analyzed studies regarding COVID-19 mortality prediction. Among the considered works, only Xiao [38] explored the contribution of radiomics to clinical data in improving mortality prediction. In addition to this, other studies [23,35,39,40] pointed out that radiomic and clinical data combination is a promising way to predict which triage is suitable for each patient.
Bae [17], Varghese [19], and Shiri [22] showed that useful information to predict patient prognosis can also be extracted from 2D radiography image sets. On the other hand, Shiri [42] built a residual network for the detection and quantification of lung pneumonia damage, while Ozturk's model [43] aimed to discriminate between COVID-19, general pneumonia, and no findings.
From recent studies, it is clear that one of the main drawbacks in this emerging field is the small number of enrolled patients, which makes the obtained results less generalizable and prone to overfitting [47].
The future clinical utility of these prediction models is their potential support and aid to the clinician, providing an objective measure that can be shared among different centers, enabling the possibility of building more robust multicentric studies [48]. More importantly, those tools will also play a substantial role in developing and assessing personalized treatments [48,49].
In our work, we built two classifiers to predict COVID-19 mortality: a machine learning algorithm (support vector machine, SVM) and a deep learning model (feedforward neural network, FNN), both trained to discriminate between deceased and non-deceased patients. We chose mortality as the outcome for our evaluation since it showed the lowest bias with respect to other endpoints collectible in the first phase of the COVID-19 outbreak. We enrolled a relevant number of patients (2348), one of the major strengths of this study, from several hospitals in the Province of Reggio Emilia, Italy. Our dataset's intrinsic inhomogeneity could make our results more robust, displaying a lower overfitting risk.

Study Population
The current study is part of a major multicentric study called "Endothelial, neutrophil, and complement perturbation linked to acute and chronic damage in COVID-19 pneumonitis coupled with machine learning approaches", code: COVID-2020-12371808.
The project was approved by the Area Vasta Emilia Nord (AVEN) Ethics Committee (project number 855/2020/OSS/AUSLRE, dated 28 July 2020) and competent authorities, following the EU and national directives and according to the principles of the Helsinki Declaration.
It engaged different Units of Azienda Unità Sanitaria Locale-IRCCS di Reggio Emilia and gathered patients from the whole Reggio Emilia province.
Patients' inclusion criteria were: age > 18 years; a positive reverse transcriptase polymerase chain reaction (RT-PCR) swab; and an HRCT or CXR to confirm the presence of pneumonia between 27 February 2020 and 30 May 2020, with the positive RT-PCR swab dated within 12 days of the CT exam. These criteria identified an initial cohort of 2805 patients. After excluding patients with steroid or biological agent therapies ongoing at diagnosis/baseline, we enrolled a final cohort of 2553 patients. Data collection met the rules of the European General Data Protection Regulation (GDPR) for chest imaging data and data analysis.
We excluded 205 patients during our work due to issues in extracting the radiomic features from the associated segmented volumes. The final dataset encompassed the clinical and radiomic information for 2348 patients. The inclusion criteria applied to the initial cohort are reported in Figure 1. Table 1 shows clinical variables (such as gender and age) and outcome (death) for our cohort of 2348 patients. Patient deaths were clinically attributed to COVID-19 disease [50]. It is worth noting that our dataset presents a significant imbalance regarding the investigated outcome (death by COVID-19): 287 (12%) deceased patients and 2061 (88%) survived patients. This disparity was taken into account during model training [51].
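To make the size of this imbalance concrete, the following minimal sketch computes the outcome prevalence and a set of inverse-frequency class weights from the reported counts. The weighting formula, n_samples / (n_classes * class_count), is one common convention and is an assumption here; the text does not state which scheme was used. The function name `inverse_frequency_weights` is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this 'balanced' convention is illustrative only; the
    study does not specify its weighting scheme.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Outcome counts reported for the final cohort: 1 = deceased, 0 = survived.
labels = [1] * 287 + [0] * 2061

weights = inverse_frequency_weights(labels)
prevalence = labels.count(1) / len(labels)

print(f"prevalence of death: {prevalence:.3f}")
print(f"class weights: {weights}")
```

With these weights, each class contributes the same total weight to a loss function, so errors on the minority (deceased) class are not swamped by the majority class.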

Clinical Features Collection
Clinical data, including age, sex, comorbidities (such as chronic obstructive pulmonary disease (COPD), vascular diseases, heart failure, etc.), and C-reactive protein (CRP) level at the time of hospital admission, were extracted from institutional information systems and patient clinical charts. Radiological data, including the presence of ground glass opacities (GGO), consolidations, and parenchyma involvement (PI) score (<60% or >60%), were extracted from structured reports of the hospital admission CT scan. We removed all clinical features with more than 20% missing values. We replaced the other missing values with the mean or mode of the corresponding feature distribution, depending on whether the variables were continuous or categorical [52]. Eventually, the clinical dataset was composed of the 19 clinical features presented in Table 2.
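The missing-value handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: missing entries are assumed to be encoded as `None`, the function name `impute_clinical` and the toy feature values are hypothetical, and the 20% threshold is taken from the text.

```python
from statistics import mean, mode


def impute_clinical(columns, categorical, max_missing=0.20):
    """Drop features with too many gaps, then fill the remaining gaps.

    columns:     dict mapping feature name -> list of values (None = missing)
    categorical: set of feature names to impute with the mode
                 (all other features are imputed with the mean)
    """
    cleaned = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        n_missing = len(values) - len(observed)
        if n_missing / len(values) > max_missing:
            continue  # feature removed: more than 20% missing values
        fill = mode(observed) if name in categorical else mean(observed)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Toy data (hypothetical values, for illustration only).
toy = {
    "age": [70, None, 50, 60, 80],        # 20% missing: kept, mean-imputed
    "sex": ["F", "M", "M", None, "M"],    # 20% missing: kept, mode-imputed
    "crp": [None, None, 3.1, 4.0, 12.5],  # 40% missing: dropped
}
result = impute_clinical(toy, categorical={"sex"})
print(result)
```

In a train/test setting, the fill values would normally be estimated on the training subset only and then reused on the test subset; the text does not say at which stage imputation was applied.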

Radiomic Features Collection
Eight different scanners were employed to acquire the HRCT images, following a breath-hold acquisition protocol: BrightSpeed, Discovery MI, LightSpeed Pro 32 (GE Medical Systems, Chicago, IL, USA), Brilliance 64, iCT 128, Ingenuity CT (Koninklijke Philips N.V., Eindhoven, the Netherlands), Emotion 16, and SOMATOM Definition Edge (Siemens, Munich, Germany). All devices are periodically tested through standard quality assurance programs according to national and international guidelines [53,54].
We segmented each HRCT image using Coreline Soft (Seoul, Republic of Korea) software, AVIEW (https://www.corelinesoft.com/en/lung-texture/) [55]. AVIEW lung cancer screening (LCS) is a medical imaging DL-based software that provides automated lung nodule detection, segmentation, and analysis from low-dose CT chest images. It segments different damaged lung parenchymal volumes with a thresholding technique, as previously reported [56]. Figure 2 shows two examples of AVIEW outputs from our cohort.
GGOs are hazy areas with slightly increased lung density without obscuration of bronchial and vascular structures [59]. They may be caused by partial air space-filling or interstitial thickening. GGOs are frequently accompanied by other features or patterns, including reticular and/or interlobular septal thickening and consolidation. Consolidation refers to alveolar air being replaced by pathological fluids, cells, or tissues, manifesting as an increase in pulmonary parenchymal density that obscures underlying vessels and airway walls [59]. Recent studies highlighted that GGOs can progress to consolidations, so consolidations can be considered an indication of disease progression. Reticular pattern refers to thickened pulmonary interstitial structures, such as interlobular septa and perilobular lines, manifesting as linear opacities on CT images [59].

Machine Learning Pipeline
In the first step, we kept clinical and radiomic data split into two different datasets, as shown in Figure 3. Then, in each dataset separately, we standardized each feature by subtracting the mean and dividing by the standard deviation of the corresponding feature distribution. We excluded features with a Spearman correlation coefficient >0.98 from our subsequent analyses. Finally, we split our data into two subsets, one for classifier training (70%, n = 1643) and the other for classifier validation (30%, n = 705). We then performed feature selection for the COVID-19 mortality prediction outcome on the training datasets, using the least absolute shrinkage and selection operator (LASSO) with 10-fold cross-validation. From the selected features, we generated a third dataset containing both the clinical and the radiomic selected features. Thus, for our training, we used these three different datasets.
In this work, we chose the support vector machine (SVM) algorithm due to the class imbalance of our data. We trained our SVM model with two subsequent stratified 10-fold cross-validations to fine-tune its hyperparameters, which were then employed in the final training phase. Table 3 shows the set of hyperparameters chosen for the training. In this work, we will call the model trained with radiomic data only the Radiomic Model, the one trained with clinical data only the Clinical Model, and the one trained with the combination of both features the Clinical-Radiomic Model.
We evaluated our model performances on the testing sets (holdout). The SVM training session lasted 2260 s, while the testing required less than 1 s. Table 4 shows the results in terms of four metrics: area under the curve (AUC, which is a measure of the performance of a classification model across all classification thresholds [60]), accuracy (ACC, (1)), sensitivity (SENS, (2)), and specificity (SPEC, (3)).
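The correlation-based exclusion step can be illustrated with a short, dependency-free sketch. Several details are assumptions, since the text only gives the 0.98 threshold: the filter below is greedy (a feature is kept only if it is not too correlated with any feature already kept), it compares the absolute value of the coefficient, and it assumes no feature is constant. The names `spearman` and `drop_correlated` and the toy features are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def drop_correlated(features, threshold=0.98):
    """Greedy filter: keep a feature only if |rho| <= threshold
    against every feature kept so far (assumed selection rule)."""
    kept = []
    for name, values in features.items():
        if all(abs(spearman(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy features (hypothetical values, for illustration only).
toy = {
    "volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "volume_cubed": [1.0, 8.0, 27.0, 64.0, 125.0],  # monotone in volume
    "entropy": [2.0, 1.0, 4.0, 3.0, 5.0],
}
kept = drop_correlated(toy)
print(kept)
```

Because Spearman correlation depends only on ranks, any monotone transform of a feature (such as the cube above) is flagged as redundant, which a Pearson-based filter would not guarantee.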

Deep Learning Pipeline
We developed a feedforward neural network (FNN) model to compare its performance with our machine learning model. The architecture is synthesized in Figure 4.
We used a grid-search strategy to identify the best model configuration and hyperparameters. We chose the optimal model as the one that maximized the result of a five-fold cross-validation, with the ROC AUC as the metric function. The grid search was performed on the training data only. The hyperparameters included in the grid search were: number of layers, number of neurons per layer, dropout probability, number of epochs, learning rate, and batch size. Figure 4 shows the number of layers and neurons. The best dropout probability found was 0, so no dropout was used; the optimal number of epochs was 15, the learning rate was $10^{-3}$, and the best batch size found was 32. The total number of parameters of the network is 143,080. To account for class imbalance, the model loss was weighted to make the model pay more attention to the minority class.
The final network had two branches: one to process the radiomic features (radiomic branch) and the other to process the clinical features (clinical branch). The radiomic branch was made of three consecutive fully connected layers with 100, 50, and 10 neurons each; the clinical branch had only a fully connected layer with ten neurons because the clinical features were significantly fewer than the radiomic ones. The final layers of the two branches were then concatenated and connected to the output layer of the network, which identified the patient's probability of belonging to one of the two classes. Given the independent nature of the two branches, building two separate versions of the original network was also possible: one with the radiomic branch only and the other with the clinical branch only. These were built to evaluate the impact of the different feature sets on the final performance of the network and to compare this to the machine learning results.
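The two-branch layout can be sketched as a plain forward pass. Only the layer widths (100, 50, and 10 neurons for the radiomic branch, 10 for the clinical branch, then concatenation) come from the text. Everything else is an assumption made for illustration: ReLU hidden activations, a single sigmoid output unit, the random initialisation, and the toy input size of 30 radiomic features, which is arbitrary and not meant to reproduce the parameter count reported above. All function names are hypothetical.

```python
import math
import random


def dense(n_in, n_out, rng):
    """Randomly initialised fully connected layer: (weights, biases)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_forward(layer, x, activation):
    """Apply one fully connected layer followed by an activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def build_network(n_radiomic, n_clinical, seed=0):
    rng = random.Random(seed)
    sizes = [n_radiomic, 100, 50, 10]  # radiomic branch widths from the text
    return {
        "radiomic": [dense(a, b, rng) for a, b in zip(sizes, sizes[1:])],
        "clinical": [dense(n_clinical, 10, rng)],  # single 10-neuron layer
        "output": dense(10 + 10, 1, rng),          # acts on the concatenation
    }


def forward(net, radiomic, clinical):
    """Run both branches, concatenate them, and return a probability."""
    r = radiomic
    for layer in net["radiomic"]:
        r = dense_forward(layer, r, relu)
    c = clinical
    for layer in net["clinical"]:
        c = dense_forward(layer, c, relu)
    return dense_forward(net["output"], r + c, sigmoid)[0]


def n_parameters(net):
    """Count weights and biases over all layers."""
    layers = net["radiomic"] + net["clinical"] + [net["output"]]
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


net = build_network(n_radiomic=30, n_clinical=19)  # toy radiomic input size
probability = forward(net, [0.5] * 30, [0.1] * 19)
print(probability, n_parameters(net))
```

Dropping the `"radiomic"` or the `"clinical"` entry, and shrinking the output layer input to 10, gives the single-branch variants used to isolate the contribution of each feature set.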
In this work, we will call the Clinical-Radiomic Model the one with the two branches, the Radiomic Model the one with the radiomic branch only, and the Clinical Model the one with the clinical branch only.
The network was trained on 70% of the data and tested on the remaining 30%. No feature selection was performed on the data used to train and test the network, i.e., all of the available features were employed. All of the features were standardized using the mean and standard deviation values calculated on the training set.
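Standardizing with training-set statistics means the test data never influence the scaling. A minimal sketch for a single feature follows; the use of the population standard deviation and the guard for constant features are assumptions, and the names `fit_scaler` and `transform` are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Estimate mean and standard deviation on the training data only.

    Assumption: population standard deviation; the text does not say
    whether the population or sample estimator was used.
    """
    mu = mean(train_column)
    sigma = pstdev(train_column)
    if sigma == 0:
        sigma = 1.0  # guard: leave constant features unscaled
    return mu, sigma


def transform(column, scaler):
    """Apply a previously fitted scaler to any subset of the data."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in column]


# Toy values (hypothetical) for one feature.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

scaler = fit_scaler(train)          # statistics come from the training set
train_z = transform(train, scaler)
test_z = transform(test, scaler)    # the same statistics are reused here
print(train_z, test_z)
```

The standardized test values are generally not zero-mean or unit-variance, which is expected: they are expressed relative to the training distribution.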
The FNN training session needed about 1 s per epoch, and testing the trained model took less than 1 s.
The results are collected in Table 4 in terms of the aforementioned metrics: area under the curve (AUC) score, accuracy (ACC), sensitivity (SENS), and specificity (SPEC).
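For reference, the four metrics can be computed as in the sketch below. It assumes the standard confusion-matrix definitions, with deceased patients as the positive class, and computes the AUC as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counting one half. The labels, scores, and the 0.5 decision threshold are hypothetical toy values, not study data.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SENS": tp / (tp + fn),  # true positive rate
        "SPEC": tn / (tn + fp),  # true negative rate
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels and scores).
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.1, 0.8, 0.3]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

With an outcome prevalence of about 12%, a classifier that always predicts survival would already reach roughly 88% accuracy with zero sensitivity, which is why AUC and sensitivity are the more informative metrics for this cohort.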

Results
To validate our models, we tested their predictions on a set of data not used during the training session (test set). Table 4 shows the metrics for the test set using only the clinical features, only the radiomic features, or a combination of both.

Discussion
Prediction models in a clinical setting aim to support the physician in decision-making regarding personalized patient care pathways. This framework could help establish the stage of the ongoing disease in a screening scenario (diagnosis prediction) or promptly highlight patients with a worse prognosis who may need a more intensive therapy (prognosis prediction). However, especially in the latter case, their introduction into clinical routine is still delayed by concerns about the models' reliability and accuracy when predicting new cases. Nonetheless, an automated tool is still a fundamental asset to standardize and reduce the intrinsic intra-observer variability that characterizes these processes. Deep learning and machine learning techniques are increasingly leading in predicting the outcome of COVID-19 patients [10]. In fact, recent studies reported good results for predictive models used in COVID diagnosis and the prediction of severity, prognosis, length of hospital stay, intensive care unit admission, or mechanical ventilation modes [3,10,25,27–40,46].
Our results, collected in Table 4, align with the recent literature, showing that our models can achieve good prediction accuracy. The machine learning and deep learning algorithms performed similarly, with the latter reaching a slightly higher AUC for the Clinical-Radiomic Model (0.864 vs. 0.803). We can explain this similarity with both our parametrization strategy for our outcome (binary) and the size of our training cohort: the number of patients and the low complexity of the task (2 classes) favored the machine learning model. On the other hand, the deep learning model, which is usually employed for more complex tasks (multi-class or segmentation tasks) with a bigger training cohort, still performed acceptably given the task, but it could not clearly outperform the other model. Furthermore, our DL algorithm took as input the same features as the ML one, whereas DL models are usually able to manage the whole image data (i.e., pixel values); this could also be the reason why the models' performances were similar. This interesting result can be interpreted as a potential new perspective for our future studies, as we can explore how the DL algorithm performs in more complex tasks and with different inputs. The Clinical-Radiomic Model reached higher AUCs than the other two, both for the ML and DL algorithms, confirming that combining radiomic information with clinical data can improve the model prediction. The Clinical-Radiomic Model obtained the highest AUC score for both algorithms and a significantly higher SENS score. The ACC score is similar for all three models, with the Radiomic and the Clinical Models achieving a slightly higher result than the Clinical-Radiomic Model; however, accuracy is less informative than the other metrics when the data are imbalanced.
Palmisano et al. [26] retrospectively enrolled 1125 COVID-19 adults to develop a user-friendly AI platform for automatic risk stratification of COVID-19 patients. Their results are based on clinical data and CT automatic analysis and are expressed as performances in predicting patient outcomes. Their best model showed an AUC of 0.842, which is slightly lower than the one reached by our deep learning algorithm in the Clinical-Radiomic Model but still consistent with our results. Our results are expected given the present literature [10,20,25,27–33,35–37,46,61–74]. Table 5 collects the performance of the most recent works regarding ML and DL methods for COVID-19 mortality prediction. As can be seen, our results are in line with those presented, considering that our method relies on validation with a holdout set drawn from the same cohort. Moreover, compared with most of these studies, our study had a larger cohort for training the models. Regarding ML models, the future perspective is to validate these data on new patients and gradually introduce prediction tools in the clinical practice to support the physician's decision.
As previously stated, another future direction of this work could be to explore further the potential of the DL algorithm, which proved to work efficiently with previously extracted features. The next step could be to perform a more complex task (e.g., survival analysis) or provide the image data directly to the network without pre-computing features. This course of action could provide several advantages, for example, avoiding the pitfalls typical of models dealing with radiomic features extracted from images rather than the original pixel values, such as repeatability and reproducibility issues.

Conclusions
In this study, we proposed a technique providing a user-friendly and low-cost tool for COVID-19 mortality prediction. We presented a comparison between a machine learning and a deep learning framework that can be used not only for COVID-19 mortality prediction but also for other classification tasks such as diabetes prediction, asthma prediction, and cancer metastases prediction. Moreover, even if the COVID-19 outbreak seems to be under control and the importance of CT images in the diagnosis step has decreased, imaging-based models could still be relevant, for example, in serious cases where CT imaging could remain a viable solution. We obtained similar performances for both ML and DL algorithms, which suggests that both approaches have the potential to be included in a clinical practice framework. According to our results, clinical and radiomic information are found to be predictors of COVID-19 mortality. The best performance was obtained with a combination of clinical and radiomic data, both for ML and DL, with AUCs equal to 0.803 and 0.864, respectively.
Future works should further explore the potential of DL algorithms, for example, directly using image pixel data and gathering data from other centers to validate our results on external data.

Figure 1. A flowchart showing patient inclusion criteria.

Figure 4. Neural network architecture. The network is divided into two separate branches for the independent analysis of the radiomic and clinical features. The radiomic branch has a higher number of hidden layers and neurons because radiomic features are far more numerous than clinical ones. The outputs of the two branches are then concatenated and passed to a final dense layer to produce a comprehensive output.

Table 1. Population features summary.

Table 2. Clinical and radiological features comprising the clinical dataset.

Table 3. Best hyperparameter values tuned during SVM training for death prediction.

Table 4. Performance metrics of the ML classifier (SVM) and DL algorithm for the holdout testing sets (clinical dataset, radiomic dataset, and the set containing both). AUC: area under the curve, ACC: accuracy, SENS: sensitivity, and SPEC: specificity. The best result obtained for each metric is bolded.

Table 5. Performance metrics (AUC, ACC, SENS, SPEC) of recent machine learning and deep learning models for COVID-19 mortality prediction. AUC: area under the curve, ACC: accuracy, SENS: sensitivity, and SPEC: specificity.