Machine Learning to Predict In-Hospital Mortality in COVID-19 Patients Using Computed Tomography-Derived Pulmonary and Vascular Features

Pulmonary parenchymal and vascular damage are frequently reported in COVID-19 patients and can be assessed with unenhanced chest computed tomography (CT), widely used as a triaging exam. Integrating clinical data, chest CT features, and CT-derived vascular metrics, we aimed to build a predictive model of in-hospital mortality using univariate analysis (Mann–Whitney U test) and machine learning models (support vector machines (SVM) and multilayer perceptrons (MLP)). Patients with RT-PCR-confirmed SARS-CoV-2 infection and unenhanced chest CT performed on emergency department admission were included after retrieving their outcome (discharge or death), with an 85%/15% training/test dataset split. Out of 897 patients, the 229 (26%) who died during hospitalization had a higher median pulmonary artery diameter than those who survived (29.0 mm versus 27.0 mm, p < 0.001) and a higher median ascending aortic diameter (36.6 mm versus 34.0 mm, p < 0.001). The best SVM and MLP models considered the same ten input features, yielding areas under the curve of 0.747 (precision 0.522, recall 0.800) and 0.844 (precision 0.680, recall 0.567), respectively. In this model integrating clinical and radiological data, pulmonary artery diameter was the third most important predictor after age and the extent of parenchymal involvement, contributing to reliable in-hospital mortality prediction and highlighting the value of vascular metrics in improving patient stratification.


Introduction
Since the inception of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, chest computed tomography (CT) has been widely used both as a triaging test on emergency department admission [1][2][3], especially when the result of reverse transcriptase-polymerase chain reaction (RT-PCR) testing was not immediately available, and as a diagnostic tool to monitor Coronavirus Disease 2019 (COVID-19) pneumonia [1,2].
Rather than identifying thrombotic phenomena as a byproduct of pulmonary embolism, ever-increasing evidence points to a direct origin of thrombosis in arterial lung vasculature [14,32,33], highlighting the need for a more accurate characterization of these phenomena in their true location [34].
Diagnosis of pulmonary arterial thrombosis requires contrast-enhanced CT angiography, which implies the intravenous administration of iodinated contrast agents [35]. However, alterations of pulmonary vascular metrics also detectable with unenhanced CT, such as the enlargement of the PA and an increased ratio between the diameters of the PA and of the ascending aorta (AA), are known indirect signs of pulmonary hypertension caused by fibrotic or thromboembolic processes [36][37][38][39][40][41][42]. Notably, these pulmonary vascular metrics have been demonstrated to be altered in COVID-19 patients, when compared to values measured on CT scans acquired before the SARS-CoV-2 pandemic, and to carry prognostic implications [25][26][27]. Since unenhanced chest CT is still widely performed in COVID-19 patients, the integration of CT-derived vascular features with other readily available clinical and imaging parameters could improve the stratification of COVID-19 patients on emergency department admission, potentially improving patient management and prognosis.
This multicenter study, conducted in six different hospitals in northern Italy during the first pandemic peak of 2020, aims to evaluate the prognostic power of a machine learning model integrating imaging features of lung parenchyma and vasculature with clinical data of COVID-19 patients routinely retrieved on emergency department admission.
We included patients hospitalized in each of these institutions during the study period (21 February to 30 April 2020) with RT-PCR-confirmed SARS-CoV-2 infection and unenhanced chest CT performed on emergency department admission. Authors from each center reviewed their own institutional electronic databases to retrieve patients' clinical data, including: demographics, symptoms, partial pressure of oxygen in arterial blood (PaO2), comorbidities, smoking history, body mass index, white blood cell, lymphocyte, and platelet counts, and lactate dehydrogenase values. For outcome assessment, censoring was applied on 1 June 2020, when all patients had either been discharged or had died during hospitalization.

Image Acquisition and Analysis
Unenhanced chest CT scans were performed in all six centers in supine position, during a single inspiratory breath-hold condition when possible. Table 1 shows the technical characteristics of the scanners and acquisition parameters.
Radiologists with 7 to 32 years of experience blindly reviewed CT images from their own institution, assessing lung parenchyma and the presence of pleural effusion and of mediastinal lymph nodes with short-axis diameters larger than 1 cm [43]. Lung parenchymal involvement was qualitatively assessed by radiologists on their own institutional digital imaging and communications in medicine (DICOM) viewer, considering the presence of parenchymal involvement signs (i.e., ground-glass opacities (GGOs), consolidations, and crazy paving pattern) and the volumetric extent of parenchymal involvement, graded according to Chung et al. [21]: 0% (absent, 0); 1-25% (minimal, 1); 26-50% (mild, 2); 51-75% (moderate, 3); and over 75% (severe, 4). As described by Wells et al. [44], PA and AA maximum diameters were measured on a single axial slice selected at the level of the pulmonary arterial main trunk bifurcation (Figure 1).


Data Preprocessing and Initial Feature Selection
To build a predictive model of in-hospital mortality, we selected among available features those with no more than 25% missing data. In these selected features, residual missing data were imputed with the mean value of the respective feature. Data were randomly split into a training/validation set (85%) and a testing set (15%). In each set, all features were rescaled to have null mean and unitary variance.
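The preprocessing pipeline above can be sketched as follows; the feature matrix here is synthetic, as the study's actual feature set is not reproduced.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical/CT feature matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))
X[rng.random(X.shape) < 0.1] = np.nan        # scatter ~10% missing values
y = rng.integers(0, 2, size=100)

# Keep only features with at most 25% missing data
keep = np.isnan(X).mean(axis=0) <= 0.25
X = X[:, keep]

# Impute residual missing values with each feature's mean
X = SimpleImputer(strategy="mean").fit_transform(X)

# 85%/15% random split, then rescale each set to zero mean and unit variance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)
X_train = StandardScaler().fit_transform(X_train)
X_test = StandardScaler().fit_transform(X_test)
```

Note that, following the text, each set is standardized independently; in a deployment setting one would typically reuse the training-set statistics to transform the test set.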
To prevent overfitting, we used training/validation data to perform a two-step feature selection process. First, we removed highly correlated features by thresholding the feature correlation matrix, setting the absolute Pearson's correlation threshold value at 0.7. Then, we used the least absolute shrinkage and selection operator (LASSO) to compute feature importance, i.e., the absolute value of the LASSO regression coefficients [45,46]. The best α value was determined using a ten-fold cross-validation process. Selected features were used to develop several machine learning models using both support vector machines (SVM) and multilayer perceptrons (MLP).
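A compact sketch of this two-step selection on synthetic data; the 0.7 correlation threshold and ten-fold cross-validation for α follow the text, while the data and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data with one near-duplicate feature (column 7 ≈ column 0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 7] = X[:, 0] + 0.01 * rng.normal(size=200)
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

# Step 1: drop one feature of every pair with |Pearson r| > 0.7
corr = np.abs(np.corrcoef(X, rowvar=False))
drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if corr[i, j] > 0.7 and j not in drop:
            drop.add(j)
keep = [i for i in range(X.shape[1]) if i not in drop]
X_sel = X[:, keep]

# Step 2: LASSO with alpha chosen by ten-fold CV;
# feature importance = absolute value of the regression coefficient
lasso = LassoCV(cv=10, random_state=0).fit(X_sel, y)
importance = np.abs(lasso.coef_)
selected = [f for f, w in zip(keep, importance) if w > 0]
```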

Support Vector Machines
We initialized the SVM classifier with a linear kernel, C value = 1, and balanced class weights, using all selected features. Then, we performed a systematic hyperparameter grid search to find: the optimal SVM kernel among linear, radial basis function, polynomial, and sigmoid kernels; the best C value among 100 values linearly sampled in the range 1-8; and the best γ value (for nonlinear kernels) among ten logarithmically sampled (base 10) values in the range 10⁻¹⁰-10⁻¹. Each candidate model (n = 310) was validated using a ten-fold cross-validation process, accounting for 31,000 fits. The best model was defined as the one that maximized the average F1 score in validation data; the F1 score was selected as scoring metric to account for class imbalance. Finally, the best model was refitted on the entire training/validation set. Given this initial model and its optimal C (C_opt) and γ (γ_opt) hyperparameters, we performed an importance-based backward feature selection. We optimized SVM hyperparameters for each feature subset using a second systematic grid-search strategy, further fine-tuning C and γ values in narrower ranges. The new search grid was defined by selecting 25 C values linearly sampled in the range C_opt ± 2, while 50 γ values were logarithmically sampled in the range (γ_opt/10, 10γ_opt). Each one of the 1250 candidate models was validated using a ten-fold cross-validation process. Again, the best model (i.e., the one that maximized the average F1 score in validation data) was finally refitted over the entire training/validation set.
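The first grid search can be sketched with scikit-learn's GridSearchCV, which performs the cross-validation, F1-based model selection, and final refit in one call. The grids below are abbreviated for readability (the paper samples 100 C values and ten γ values), and the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic imbalanced dataset standing in for the training/validation set
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.72, 0.28], random_state=0)

# Abbreviated grids: kernel, C (linearly sampled in 1-8),
# gamma (log-sampled in 1e-10 to 1e-1; ignored by the linear kernel)
param_grid = {
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "C": np.linspace(1, 8, 3),
    "gamma": np.logspace(-10, -1, 3),
}
search = GridSearchCV(SVC(class_weight="balanced"),  # balanced class weights
                      param_grid, scoring="f1", cv=10)
search.fit(X, y)                 # refit=True: best model refit on all data
best_model = search.best_estimator_
```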

Multilayer Perceptrons
After selecting all features with a non-zero LASSO coefficient, we further selected features by keeping the most important ones that allowed us to meet the rule of thumb of at least ten events in our training/validation set for each feature included in our model [47,48].
The general MLP architecture is summarized in Table 2. We initialized the hidden unit numbers N and M to 7 and 5, respectively. To prevent dying rectified linear unit (ReLU) issues, we used LeakyReLU [49] as activation function (α = 0.03), while weights were initialized using the method proposed by He et al. [50]. To improve model generalizability and prevent overfitting, we added a dropout layer after each fully connected layer, with a dropout rate equal to 0.2. The last layer of the perceptron comprised a single unit with a sigmoid activation function to encode patient mortality. The initial learning rate was set to 0.01 and then decreased during training using an exponential decay schedule with 1000 steps and a base of 0.5. We set binary cross-entropy as the loss function, adopted the Adam optimizer [51], and set a default batch size of 32. The F1 score, the area under the curve (AUC) at receiver operating characteristic (ROC) analysis, accuracy, precision, and recall were selected as performance metrics. Training stopped if the loss function did not decrease for 25 epochs, after which the best weights corresponding to the loss function minimum were restored. The maximum number of epochs was set to 1000.
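A minimal tf.keras sketch of this setup follows. The input dimensionality and the size of the first hidden layer are assumptions (Table 2 is not fully reproduced here); layers two and three use N units and layers four and five use M units, as in the hyperparameter description.

```python
import numpy as np
import tensorflow as tf

N, M = 7, 5          # initial hidden-unit counts, as in the text
n_features = 10      # assumed input dimensionality (the selected features)

# Assumed five hidden layers: first layer size N (assumption), then N, N, M, M.
# Each Dense layer uses He initialization, LeakyReLU (alpha=0.03), dropout 0.2.
hidden = []
for units in (N, N, N, M, M):
    hidden += [
        tf.keras.layers.Dense(units, kernel_initializer="he_normal"),
        tf.keras.layers.LeakyReLU(alpha=0.03),
        tf.keras.layers.Dropout(0.2),
    ]
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(n_features,))]
    + hidden
    + [tf.keras.layers.Dense(1, activation="sigmoid")]  # mortality probability
)

# Exponential learning-rate decay: starts at 0.01, halves every 1000 steps
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.5)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=schedule),
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

# Early stopping: halt if the loss does not improve for 25 epochs,
# then restore the weights from the loss minimum
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="loss", patience=25, restore_best_weights=True)
# model.fit(X_train, y_train, batch_size=32, epochs=1000, callbacks=[early_stop])
```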

Given this model and its hyperparameters, we performed a systematic grid-search ten-fold cross-validation to optimize the following hyperparameters: the batch size in (4, 16, 32, 64); the dropout rate in (0.1, 0.2, 0.4); the starting learning rate in (0.1, 0.01, 0.001); the number of hidden units in the second and third layers (N) in (5, 7, 10); and the number of hidden units in the fourth and fifth layers (M) in (3, 5, 7). Each one of the 324 candidate MLPs was validated using a ten-fold cross-validation process (3240 fits). The best model (i.e., the one that maximized the average F1 score in validation data) was finally refitted over the entire training/validation set.

Statistical Analysis
Due to their non-normal distribution, assessed with the Shapiro-Wilk test, continuous variables were reported as median with interquartile range (IQR), and categorical variables as counts and percentages. The Mann-Whitney U test was used to compare the distributions of continuous variables between groups during the explorative univariate analysis. Statistical analyses were performed using SPSS v26.0 (IBM SPSS Inc., Chicago, IL, USA), and p values < 0.05 were considered statistically significant. All SVMs and MLPs were developed using Python v3.7.7 [52], scikit-learn v0.23.1 [53], and tensorflow-gpu v2.2.0 [54] on a laptop with an Intel Core i7-6500 CPU (2.50 GHz), 8 GB of RAM, and an Nvidia GeForce 940MX GPU.
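This test sequence can be sketched with SciPy on simulated PA diameters; the group medians (27 and 29 mm) mirror the study's reported values, but the data are synthetic.

```python
import numpy as np
from scipy import stats

# Simulated PA maximum diameters for survivors and non-survivors
rng = np.random.default_rng(1)
pa_survived = rng.lognormal(mean=np.log(27.0), sigma=0.08, size=500)
pa_died = rng.lognormal(mean=np.log(29.0), sigma=0.08, size=200)

# Shapiro-Wilk: a small p-value rejects normality, motivating nonparametric
# summaries (median, IQR) and the Mann-Whitney U test
_, p_shapiro = stats.shapiro(np.concatenate([pa_survived, pa_died]))

# Mann-Whitney U test comparing the two groups' distributions
u_stat, p_value = stats.mannwhitneyu(pa_died, pa_survived,
                                     alternative="two-sided")
median_died = float(np.median(pa_died))
median_survived = float(np.median(pa_survived))
```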

Explorative Univariate Analysis of Pulmonary Vascular Features
The overall median PA maximum diameter was 28 mm (IQR 25-30 mm) and the overall median AA maximum diameter was 34 mm (IQR 32-37 mm). Patients who died during hospitalization showed a significantly higher median PA maximum diameter (29.0 mm versus 27.0 mm in patients who survived, p < 0.001) and a significantly higher median AA maximum diameter (36.6 mm versus 34.0 mm, p < 0.001).

Support Vector Machines and Multilayer Perceptrons
After discarding features with a percentage of missing data higher than 25%, 14 features were selected for further processing (Table 3). After this selection step, residual missing data were clustered in 223 patients with largely incomplete records, who were therefore also removed. Only 674 out of the initial 897 patients (75%) were therefore used to develop all machine learning models. The percentage of missing data ranged from 0% to 5% in the remaining features (median 0.4%, IQR 0.1-3.2%). After the initial random database split, our training/validation dataset was composed of 572 patients, 160 of whom (28%) died during hospitalization, while the testing set accounted for 102 patients, 30 of whom died during hospitalization (29%). No features were excluded after Pearson's correlation matrix thresholding (Figure 2). Obtained LASSO coefficients for all initial features are reported in Figure 3.
Fourteen SVM models were developed using different feature subsets. The best SVM model took 10 features as inputs (Table S1). The best model had a radial basis function kernel, with C equal to 5.924 and γ equal to 6.06 × 10⁻⁵. After the model was refitted on the entire training/validation dataset, the final model performances on testing data were: an F1 score of 0.632, an AUC of 0.747, a precision of 0.522, and a recall of 0.800. All SVMs had radial basis function kernels, their other hyperparameters and performance being reported in Table S1, while the ROC curve of the best SVM model is depicted in panel a of Figure 4.
Due to the high number of events in the training/validation set, MLPs were built using all features with a non-null LASSO coefficient (n = 10). Therefore, all developed MLPs had the same input features as the best SVM model. Among the 324 models resulting from the systematic hyperparameter grid search, the best MLP had the following hyperparameters: N and M equal to 7 and 3, respectively; a batch size of 4; a dropout rate equal to 0.1; and a starting learning rate equal to 0.01. After refitting the best model on the entire training/validation dataset, the final performance on testing data was: an F1 score of 0.618, an AUC of 0.844, a precision of 0.680, and a recall of 0.567. The ROC curve of the best MLP is depicted in panel b of Figure 4.
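As a quick consistency check, the reported F1 scores follow directly from the reported precision and recall values, since F1 is their harmonic mean:

```python
# F1 is the harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R)
def f1_score_from(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

svm_f1 = f1_score_from(0.522, 0.800)   # ≈ 0.632, matching the reported value
mlp_f1 = f1_score_from(0.680, 0.567)   # ≈ 0.618, matching the reported value
```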

Discussion
In this retrospective multicenter study, we developed a predictive model of in-hospital mortality for COVID-19 patients, using clinical and radiological data acquired on emergency department admission.
In a preliminary study on a limited sample size [26], we had already observed that COVID-19 patients have a PA maximum diameter greater than that measured on previous CT scans performed for non-cardiovascular reasons. In that study, an enlarged PA diameter was also associated with death during hospitalization [26]. These observations were confirmed by the present study, already through the explorative univariate analysis of pulmonary vascular features, where we found that patients who died during hospitalization had a significantly higher PA maximum diameter. Further, as shown in Figure 3, the LASSO regression coefficient for the PA maximum diameter was the third highest among investigated predictors of in-hospital death, after age and the extent of lung parenchymal involvement. It is worth noting that lung consolidation, appearing in the later stages of COVID-19 pneumonia, seems to be associated with a better prognosis, as highlighted by its negative LASSO regression coefficient. This apparently counterintuitive result, already hinted at by a few studies [60,61], can be explained by considering that extensive parenchymal involvement in COVID-19 patients most frequently leads to in-hospital death, leaving no time for pulmonary damage to progress to consolidation. Indeed, consolidation was present in almost 50% of patients from our cohort but was sparsely distributed and was associated with GGOs in 91% of cases, pointing out how hospital admission occurred at relatively early stages of COVID-19 pneumonia.
The prognostic role of the PA maximum diameter also shows how the early microvascular damage caused by SARS-CoV-2 infection may represent one of the most insidious aspects of severe COVID-19 presentations [10]. Notably, SARS-CoV-2-induced microthrombosis in distal branches of the PA can start the disruption of the already fragile coagulation profile of these patients [5,8,9], but can be difficult to detect even on contrast-enhanced CT scans [23]. Conversely, the PA diameter is an indirect metric of the thromboembolic profile of COVID-19 patients that can be easily obtained from unenhanced CT scans. Its integration into a reliable prognostic model further highlights how features extracted from CT scans performed without the administration of a contrast agent carry a substantial prognostic potential in COVID-19 patients, as already hinted at by other studies [62,63].
This study was performed at the early stage of the SARS-CoV-2 pandemic, when prophylactic low-molecular-weight heparin (LMWH) was not yet routinely administered. This represents the main clinical limitation of our study, preventing the investigation of the impact of appropriate treatment targeting vascular damage on COVID-19 patients' prognosis. Moreover, shortages of ICU beds affecting all hospitals in northern Italy during the first pandemic peak [64] frequently restricted ICU admission to relatively younger patients without comorbidities, preventing us from assessing ICU admission predictive models, since these would have been biased by such patient selection. Of note, this logistical stress on healthcare systems also precluded the inclusion of young patients (who were mostly treated at home), hindered data collection on admission, and likely impacted the prognosis of the included patients: all these aspects must be carefully considered when evaluating the generalizability of our results. From a technical point of view, the main limitation of our study is the lack of a longitudinal and independent test set, as we randomly sampled test patients from the original multicenter cohort. Since we primarily aimed to maximize the heterogeneity of the training set, using retrospective data collected from all centers, a robust generalizability assessment will need to be conducted prospectively on an independent test set. Another technical limitation, although common in the published literature [25,37-39,44], is the manual measurement of the diameters of the great vessels on axial scans: a better approach could be to measure vessel areas on oblique reconstructions reformatted along the vessel axis [42].
Moreover, we focused on a single feature selection technique (i.e., the LASSO) and on a few machine learning models (i.e., SVMs and MLPs) to develop the in-hospital mortality predictor, with a comparably small batch size in MLPs, although supported by the results of the systematic hyperparameter search. Finally, the use of the mean feature value as an imputation technique may also lead to suboptimal performance. Indeed, the use of multiple imputation or machine learning-based imputation techniques could improve the predictive power of the developed classifiers.
In conclusion, our study shows that a predictive model integrating simple clinical and radiological data acquired on admission, such as the PA/AA ratio on unenhanced CT scans, can predict in-hospital death in COVID-19 patients, highlighting the major role of pulmonary vascular involvement in the stratification of COVID-19 patients.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/jpm11060501/s1, Table S1: Input features, hyperparameters, and performance of the support vector machines developed as a result of the LASSO-importance-based backward feature selection.

Data Availability Statement: We will be able to make available deidentified individual participant data that underlie the results reported in this article, beginning 12 months and ending 24 months following article publication. We will consider requests from researchers who will provide a methodologically sound proposal and will share data needed to achieve the aims declared in the approved proposal. Proposals may be submitted up to 36 months following article publication. After 24 months the data will be available in the data warehouse of the coordinating center but without investigator support other than deposited metadata. Information regarding submitting proposals and accessing data may be obtained by email inquiry to the first/corresponding author at andrea.cozzi1@unimi.it.
Conflicts of Interest: S. Schiaffino declares to have received travel support from Bracco Imaging and to be a member of the speakers' bureau for General Electric Healthcare. D. Fleischmann declares to have received a research grant from Siemens AG, to be a shareholder of Segmed, Inc., and to have ownership interests in Ischemia View, Inc. F. Sardanelli declares to have received grants from, or to be a member of the speakers' bureau/advisory board for, Bayer Healthcare, Bracco Group, and General Electric Healthcare. All other authors declare that they have no conflicts of interest and that they have nothing to disclose.