Cardiovascular and Renal Comorbidities Included into Neural Networks Predict the Outcome in COVID-19 Patients Admitted to an Intensive Care Unit: Three-Center, Cross-Validation, Age- and Sex-Matched Study

Here, we performed a multicenter, age- and sex-matched study to compare the efficiency of various machine learning algorithms in the prediction of COVID-19 fatal outcomes and to develop sensitive, specific, and robust artificial intelligence tools for the prompt triage of patients with severe COVID-19 in the intensive care unit setting. In a challenge against other established machine learning algorithms (decision trees, random forests, extra trees, k-nearest neighbors, and gradient boosting: XGBoost, LightGBM, and CatBoost), with multivariate logistic regression as a reference, neural networks demonstrated the highest sensitivity, sufficient specificity, and excellent robustness. Further, neural networks based on coronary artery disease/chronic heart failure, stage 3–5 chronic kidney disease, blood urea nitrogen, and C-reactive protein as predictors exceeded 90% sensitivity and 80% specificity, reaching an AUROC of 0.866 at primary cross-validation and 0.849 at secondary cross-validation on virtual samples generated by a bootstrapping procedure. These results underscore the impact of cardiovascular and renal comorbidities in the context of the thrombotic complications characteristic of severe COVID-19. As the aforementioned predictors can be obtained from case histories or measured inexpensively at admission to the intensive care unit, we suggest that this predictor composition is useful for the triage of critically ill COVID-19 patients.


Introduction
The COVID-19 pandemic has been a tremendous challenge for healthcare and society, being a confirmed cause of >636 million cases and >6.6 million deaths worldwide [1]. However, this healthcare burden led to the unprecedented accumulation of big clinical data, which have been processed and translated into artificial intelligence (AI) tools for the prognostication of COVID-19 [2][3][4]. Most of the machine learning (ML) studies (≈75%) focused on chest X-ray images and chest computed tomography data [5][6][7].

Abbreviations: M-male, F-female, FDR-false discovery rate, Me-median, IQR-interquartile range, AH-arterial hypertension, DM-diabetes mellitus, CAD-coronary artery disease, CHF-chronic heart failure, COPD-chronic obstructive pulmonary disease, CKD-chronic kidney disease, WBC-white blood cell count, NE#-neutrophil count, LY#-lymphocyte count, NLR-neutrophil-to-lymphocyte ratio, PLT-platelet count, BUN-blood urea nitrogen, sCr-serum creatinine, GFR-glomerular filtration rate, AST-aspartate aminotransferase, ALT-alanine aminotransferase, FPG-fasting plasma glucose, CRP-C-reactive protein.

Figure 1. Study design. The patients (n = 350) enrolled from three centers (Research Institute for Complex Issues of Cardiovascular Diseases, n = 100; Kuzbass Regional Infectious Diseases Clinical Hospital, n = 106; Kuzbass Regional Clinical Hospital, n = 144) were pre-matched (1:1) by age, sex (male or female), and outcome (in-hospital death or hospital discharge). This patient dataset was employed for ML by a number of algorithms (decision trees, random forests, extra trees, neural networks, k-nearest neighbors, and gradient boosting: XGBoost, LightGBM, and CatBoost), with multivariate logistic regression as a reference. In total, we assessed 14 continuous variables (age, WBC, NE#, LY#, NLR, PLT, BUN, sCr, GFR, AST, ALT, FPG, CRP, and D-dimer) and 6 binary variables (sex and past/present medical history of AH, DM, CAD/CHF, COPD/asthma, and stage 3-5 CKD), all measured at admission to the ICU. The outcome was binary (in-hospital death or hospital discharge).
ML and cross-validation were performed either on a general dataset (70:30 learning:cross-validation samples proportion) or on all combinations of two sub-datasets from separate hospitals (Research Institute for Complex Issues of Cardiovascular Diseases and Kuzbass Regional Infectious Diseases Clinical Hospital, n = 206; Research Institute for Complex Issues of Cardiovascular Diseases and Kuzbass Regional Clinical Hospital, n = 244; Kuzbass Regional Infectious Diseases Clinical Hospital and Kuzbass Regional Clinical Hospital, n = 250) using the third dataset (Kuzbass Regional Clinical Hospital, n = 144; Kuzbass Regional Infectious Diseases Clinical Hospital, n = 106; Research Institute for Complex Issues of Cardiovascular Diseases, n = 100, respectively) as a cross-validation sample. The efficiency of the ML algorithms and tools was evaluated by AUROC, percent of correct predictions (%sensitivity and %specificity), and the range of these parameters between the distinct study centers.
The work with ML algorithms was conducted in PyCharm, an integrated development environment, using the Python 3 programming language and the NumPy, scikit-learn, Pandas, Matplotlib, CatBoost, XGBoost, and mljar-supervised libraries. In the process of ML, we employed the following techniques:

1. Data preprocessing: missing data were automatically imputed with median column values. Categorical data were replaced with binary values, and multi-categorical variables were converted into dummy variables.
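As an illustration of this preprocessing step, a minimal sketch with pandas (toy values, not the study data) might look as follows:

```python
import pandas as pd

# Toy admission records with a missing BUN value and a binary categorical
# column (illustrative values only, not taken from the study dataset).
df = pd.DataFrame({
    "age": [63, 71, 58],
    "BUN": [8.2, None, 11.5],
    "sex": ["M", "F", "M"],
})

# Impute missing continuous values with the column median.
df["BUN"] = df["BUN"].fillna(df["BUN"].median())

# Encode the binary categorical variable as 0/1.
df["sex"] = (df["sex"] == "M").astype(int)

# A multi-categorical variable would be expanded into dummy columns, e.g.:
# df = pd.get_dummies(df, columns=["ckd_stage"])
```

The median of the observed BUN values (8.2 and 11.5) fills the gap, so no model sees missing data downstream.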

2. Power normalization: as some ML algorithms are sensitive to the data distribution, we applied power transforms, a technique for transforming numerical input or output variables to follow a Gaussian or more Gaussian-like probability distribution. This approach reduced data variability and skewness. The power transformation used in our study was based on the Yeo-Johnson transformation [40].
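A minimal sketch of such a Yeo-Johnson power transform, using scikit-learn's PowerTransformer on synthetic right-skewed values (the study's exact implementation is not shown, so this is illustrative only):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed lab values (e.g., CRP-like), for illustration only.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.8, size=(500, 1))

# Yeo-Johnson (the default method) maps the data toward a Gaussian-like
# distribution; unlike Box-Cox it also accepts zero and negative values.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_t = pt.fit_transform(x)

print("skew before:", float(skew(x.ravel())))
print("skew after:", float(skew(x_t.ravel())))
```

After the transform, the sample skewness is much closer to zero, which is what makes distance- and gradient-based learners behave better.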

3. Hyperparameter tuning: model hyperparameters were tuned by the mljar-supervised "Compete" algorithm, which uses hill climbing to fine-tune the final models.
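The hill-climbing idea behind such tuning can be shown with a toy, self-contained sketch; the objective below is a stand-in for cross-validated AUROC (in the study, mljar-supervised performs this search over real model hyperparameters, not a closed-form function):

```python
import random

def score(c):
    # Toy objective standing in for cross-validated AUROC; maximum at c = 3.0.
    return -(c - 3.0) ** 2

random.seed(42)
c = 0.0      # starting hyperparameter value
step = 1.0   # maximum perturbation per iteration

for _ in range(200):
    # Propose a random nearby candidate and keep it only if it improves
    # the objective - the essence of hill climbing.
    candidate = c + random.choice([-step, step]) * random.random()
    if score(candidate) > score(c):
        c = candidate

print("tuned value:", c)  # ends up close to the optimum at 3.0
```

Each accepted move strictly improves the objective, so the search converges toward the local optimum without any gradient information.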

4. Multicenter cross-validation: the general dataset was divided into two parts, the learning dataset (comprising two sub-datasets from separate hospitals) and the test dataset (the sub-dataset from the remaining hospital). This procedure was performed for all combinations of learning and test datasets: (1) learning dataset: Research Institute for Complex Issues of Cardiovascular Diseases and Kuzbass Regional Infectious Diseases Clinical Hospital (n = 206), cross-validation dataset: Kuzbass Regional Clinical Hospital (n = 144); (2) learning dataset: Research Institute for Complex Issues of Cardiovascular Diseases and Kuzbass Regional Clinical Hospital (n = 244), cross-validation dataset: Kuzbass Regional Infectious Diseases Clinical Hospital (n = 106); (3) learning dataset: Kuzbass Regional Infectious Diseases Clinical Hospital and Kuzbass Regional Clinical Hospital (n = 250), cross-validation dataset: Research Institute for Complex Issues of Cardiovascular Diseases (n = 100).
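This leave-one-center-out procedure can be sketched with scikit-learn's LeaveOneGroupOut splitter; the features, outcomes, and center labels below are simulated stand-ins for the study data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic three-center dataset: "center" plays the role of the hospital
# label; features and outcomes are random and for illustration only.
rng = np.random.default_rng(1)
n = 350
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)
center = rng.choice([0, 1, 2], size=n)

# Train on two centers, validate on the held-out third, for all combinations.
aucs = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=center):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]
    held_out = int(center[test_idx][0])
    aucs[held_out] = roc_auc_score(y[test_idx], proba)

print(aucs)  # one AUROC per held-out center; the spread gauges robustness
```

The spread of the per-center AUROCs is exactly the between-center range used later as a robustness metric.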

5. Evaluation metrics: we used AUROC (the primary metric for optimization), %sensitivity, %specificity, and the range (variability) of these parameters between the distinct study centers. For binary classification, we used the default probability threshold of 0.5 (irrespective of the number of cross-validation folds) and then calculated sensitivity and specificity. We deliberately excluded optimization of the probability threshold to ensure an equal evaluation for each cross-validation fold. This custom cross-validation strategy comprised training on the data from two clinics and cross-validation on the dataset from the remaining clinic, testing all three possible combinations. As a consequence, we obtained three values (one per cross-validation fold) for each of the selected metrics (sensitivity and specificity), all computed at a unified probability threshold (0.5). These metrics provided transparency and permitted clinical interpretation of algorithm efficiency.
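Computing sensitivity and specificity at the fixed 0.5 threshold can be sketched as follows (the labels and predicted probabilities are illustrative, not study outputs):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative outcomes (1 = in-hospital death) and predicted probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
proba = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.8, 0.3])

# Fixed 0.5 threshold, as in the study (no per-fold threshold tuning).
y_pred = (proba >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # correct predictions among deaths
specificity = tn / (tn + fp)  # correct predictions among survivors
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Keeping the threshold at 0.5 for every fold makes the three per-center sensitivity/specificity values directly comparable.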

6. Model selection: out of all models, we selected those having the highest AUROC, %sensitivity, and %specificity (9 models in total, one for each ML algorithm: decision trees, random forests, extra trees, neural networks, k-nearest neighbors, gradient boosting (XGBoost, LightGBM, and CatBoost), and multivariate logistic regression as a reference). Optimal parameters for the best models developed by each machine learning approach are provided in Table S1.

7. Feature importance analysis: during the ML, we applied the SHAP (SHapley Additive exPlanations) technique [41], a game-theoretic approach that explains the contribution of each feature to an individual predicted value (i.e., measures the impact of each factor on the output of any ML model used). For each of the selected models (n = 9, as described above), we quantified the feature importance within the [−1, 1] interval. Further, we applied the Predictor Screening tool of the STATISTICA 13 software (TIBCO Software, Palo Alto, CA, USA).

8. In addition to the PyCharm integrated development environment, we also used the STATISTICA Automated Neural Networks (SANN) tool, which automatically generates, evaluates, and exports neural networks employing a multilayer perceptron architecture according to the input variables. The screening of the most efficient neural networks was performed manually. When using this approach, ML and cross-validation were carried out on a general dataset (70:30 learning:cross-validation samples proportion). In addition, the most efficient neural networks underwent cross-validation on four virtual patient samples generated by bootstrapping, a statistical procedure that resamples a single dataset by repeatedly drawing samples from the source data with replacement to create a simulated dataset.
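The generation of virtual validation samples by bootstrapping can be sketched with scikit-learn's resample utility (the arrays below are synthetic stand-ins for the study's validation data):

```python
import numpy as np
from sklearn.utils import resample

# Synthetic validation data standing in for the study's hold-out sample.
rng = np.random.default_rng(7)
X_val = rng.normal(size=(105, 4))
y_val = rng.integers(0, 2, size=105)

# Draw n records with replacement to create each virtual sample of the
# same size; four such samples mirror the study's bootstrapping setup.
boot_samples = [
    resample(X_val, y_val, replace=True, n_samples=len(y_val), random_state=s)
    for s in range(4)
]

# Each virtual sample keeps the original size but repeats/omits some records.
Xb, yb = boot_samples[0]
print(Xb.shape, yb.shape)
```

Because sampling is with replacement, roughly a third of the original records are absent from any given virtual sample, giving the model a slightly perturbed population to be re-validated on.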
Statistical analysis was performed using GraphPad Prism 8 (GraphPad Software, San Diego, CA, USA) and STATISTICA 13 (TIBCO Software, Palo Alto, CA, USA). For descriptive statistics, data are presented as the median, 25th and 75th percentiles, and range. Proportions were compared by Pearson's chi-squared test with Yates's correction for continuity. Two independent groups were compared by the Mann-Whitney U-test. Three independent groups were compared by the Kruskal-Wallis test with post hoc calculation of the false discovery rate (FDR) by the two-stage linear step-up procedure of Benjamini, Krieger, and Yekutieli. Correlation analysis was performed using Spearman's rank correlation coefficient. P values, or q values where FDR was applied (q values are the adjusted p values obtained using an optimized FDR approach), ≤0.05 were regarded as statistically significant.
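A minimal sketch of this nonparametric workflow (Mann-Whitney U-test for two-group comparison, Spearman's rank correlation) using SciPy on synthetic two-group data:

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

# Synthetic, illustrative lab values for survivors vs. deceased patients
# (not study data); a clear location shift makes the test significant.
rng = np.random.default_rng(3)
survivors = rng.normal(loc=5.0, scale=2.0, size=100)
deceased = rng.normal(loc=9.0, scale=2.0, size=100)

# Rank-based two-group comparison (no normality assumption).
u_stat, p_value = mannwhitneyu(survivors, deceased)

# Rank-order correlation between two continuous variables.
rho, p_corr = spearmanr(survivors, deceased)

print(f"Mann-Whitney p={p_value:.3g}, Spearman rho={rho:.2f}")
```

Both tests operate on ranks, so they match the paper's choice of median/IQR descriptive statistics for skewed clinical variables.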

Univariate Analysis Identifies Cardiovascular Comorbidity, Immune Cell Counts, Kidney Dysfunction Markers, C-Reactive Protein, and D-Dimer Levels as the Potential Predictors of COVID-19-Related Death at the Stage of ICU Admission
To screen for the putative risk factors of COVID-19-related death at patient admission to the ICU, we first carried out a univariate risk factor analysis (Mann-Whitney U-test). Cardiovascular comorbidity (i.e., a past/present medical history of AH and CAD/CHF), the immune cell parameters of the complete blood count (increased WBC and NE#, reduced LY#, and elevated NLR), kidney dysfunction markers (augmented BUN and sCr and reduced GFR), and increased CRP and D-dimer were significantly associated with in-hospital death (Table 2). In contrast, diabetes mellitus, chronic obstructive pulmonary disease or asthma, stage 3-5 CKD, platelet count, alanine aminotransferase, and fasting plasma glucose did not reach statistical significance (Table 2).
Next, we searched for associations between the predictors to facilitate their further manual selection in the process of ML. For this task, we performed Spearman's rank-order correlation analysis and found statistically significant moderate correlations between BUN and WBC (r = 0.31) and between BUN and NE# (r = 0.32, Table S2 and Figure 2), suggesting an association between kidney dysfunction and the systemic inflammatory response. Strong correlations between WBC and NE# (r = 0.92) and between sCr and GFR (r = −0.89), as well as correlations between NLR and NE# (r = 0.67), NLR and LY# (r = −0.72), sCr and BUN (r = 0.62), and AST and ALT (r = 0.66), confirmed the technical validity of the correlation analysis and the overall data integrity (Table S2 and Figure 2).
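Such a rank-order correlation matrix can be computed with pandas; the values below are synthetic, with column names mirroring the study's variables, and the coupling strengths are chosen only to reproduce the qualitative pattern (NE# tracking WBC closely, BUN more weakly):

```python
import numpy as np
import pandas as pd

# Synthetic predictors: neutrophils track WBC closely; BUN is linked weakly.
rng = np.random.default_rng(5)
wbc = rng.normal(9, 3, 200)
ne = 0.7 * wbc + rng.normal(0, 1, 200)
bun = 0.3 * wbc + rng.normal(8, 2, 200)

# Spearman rank-order correlation matrix across all predictor pairs.
corr = pd.DataFrame({"WBC": wbc, "NE#": ne, "BUN": bun}).corr(method="spearman")
print(corr.round(2))
```

The strong WBC-NE# entry versus the moderate WBC-BUN entry reproduces the kind of pattern used in the paper to separate technically expected correlations from pathophysiologically informative ones.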

Neural Networks Represent the Most Reliable and Efficient Algorithm for the Prognostication of Patients Admitted to ICU with Severe COVID-19
Comparison of AUROC demonstrated significant differences in efficiency between the distinct ML algorithms (Figure 3 and Table 3). The highest efficiency was shown by CatBoost (AUROC = 0.879), random forests (AUROC = 0.863), and neural networks (AUROC = 0.860), whereas decision trees (AUROC = 0.724) and k-nearest neighbors (AUROC = 0.784) had the lowest predictive value (Figure 3 and Table 3). In addition to the average AUROC, we carried out multicenter cross-validation and found considerable variability of AUROC between the different centers (Table 3). The narrowest range was demonstrated by multivariate logistic regression (the reference technique, 0.014), k-nearest neighbors (0.044), XGBoost (0.058), and CatBoost (0.059, Table 3). Most of the models reached the highest AUROC when learned on the samples recruited in the Kuzbass Regional Infectious Diseases Clinical Hospital and Kuzbass Regional Clinical Hospital (n = 250) and cross-validated on the sample enrolled in the Research Institute for Complex Issues of Cardiovascular Diseases (n = 100), underscoring the substantial heterogeneity between the study centers (Figure 4). Besides AUROC, a clinically relevant model must demonstrate equal or close sensitivity and specificity. Both sensitivity and specificity were significantly higher when predictive models were learned on the samples collected from the Research Institute for Complex Issues of Cardiovascular Diseases and Kuzbass Regional Infectious Diseases Clinical Hospital (n = 206) and cross-validated on the sample obtained in the Kuzbass Regional Clinical Hospital (n = 144), confirming the heterogeneity between the samples enrolled in different centers (Figure 5). Among all ML algorithms, neural networks had the highest average sensitivity and the third-highest average specificity, suggesting potentially high predictive value as compared with other algorithms (Figure 5 and Table 4).
Importantly, neural networks showed the highest robustness in relation to the sample heterogeneity across different study centers (i.e., the least range between the sensitivities and specificities obtained in distinct centers, Table 4). We further screened the ensembles of neural networks using the SANN tool of the STATISTICA software to find the combination of the most sensitive and specific predictors and the most efficient instrument to predict the fatal outcome in patients admitted to the ICU because of severe COVID-19. Manual screening identified CAD/CHF, stage 3-5 CKD, BUN, and CRP as the best composition of predictors (Table S3). Out of 15 manually selected neural networks with a percent of correct predictions >80% and AUROC >0.8, four showed an AUROC of ≥0.85 (0.850, 0.853, 0.861, and 0.866), two showed sensitivity >90% (#9 and #12), and one demonstrated both sensitivity and specificity >80% (#1) at primary cross-validation (30% of the general dataset, Table S3).
Additional cross-validation of this neural network ensemble using the bootstrapping approach (4 virtual cross-validation samples) testified to their high predictive value, as the neural network with AUROC = 0.866 at primary cross-validation had AUROC = 0.849 and almost 80% correct predictions on average across the bootstrapping samples (Table S3). The other neural networks developed by the manual predictor screening approach also showed AUROC ≥ 0.8, and 10 of 15 demonstrated >80% correct predictions after bootstrapping cross-validation (Table S3).
Next, we ranked all predictors using the Predictor Screening tool of the STATISTICA software. Among all predictors, CAD/CHF, CRP, LY#, and NLR were suggested as having the highest impact on mortality (information value > 0.3 and Cramer's V > 0.25, Table 5). However, only one of the top 10 neural networks selected by the automated predictor screening (#2, AUROC = 0.824, Table S4) matched the lowest AUROC among the neural networks developed with the manual predictor screening (#6, AUROC = 0.823, Table S3) at primary cross-validation, and none of them yielded AUROC > 0.8 or 80% correct predictions at bootstrapping cross-validation (Table S4).
The combination of the manually and automatically screened predictors (CAD/CHF, stage 3-5 CKD, BUN, CRP, LY#, and NLR) did not improve the sensitivity and specificity of the generated neural networks, as none of them reached an AUROC of 0.85 at primary cross-validation (Table S5). To summarize, neural networks were the most efficient (in particular, the most sensitive) ML algorithm for predicting the fatal outcome in age- and sex-matched patients with severe COVID-19 admitted to the ICU. Another advantage of neural networks was their robustness to the heterogeneity between the samples enrolled in distinct study centers. The AUROC of some neural networks exceeded 0.85 at primary cross-validation and reached almost similar values on virtual samples generated from the existing validation dataset by a bootstrapping procedure. The most valuable predictors at the time of ICU admission were CAD/CHF, stage 3-5 CKD, BUN, and CRP. As all of these parameters can be measured and analyzed rapidly, and BUN and CRP are relatively inexpensive, such a composition of predictors may be highly efficient for the triage of patients with severe COVID-19 in the ICU if applied in a neural network context.

Discussion
Although many ML tools, including neural networks, have been developed and implemented for the prediction of COVID-19 outcomes [42][43][44][45][46], multicenter, age- and sex-matched studies are still rarely encountered. While most studies have focused on chest X-ray and computed tomography images [5][6][7], the use of routine clinical data (e.g., existing comorbidities or a past/present medical history of cardiovascular events) and basic hematological and biochemical parameters embedded into a decision support system offers an opportunity for the rapid triage of SARS-CoV-2-positive patients with regard to their admission to the ICU or the choice of treatment modalities within the ICU. The use of ML instruments removes subjectivity from the triage process as compared with the analysis of symptoms or complaints. However, the development of efficient AI tools is largely dependent on the study sample (including the treatment protocols, which vary between hospitals) and on the prevalence of the risk factors in the respective communities and patient cohorts. Further, ascending age and skewed recruitment of male or female patients inevitably create a bias, as a number of comorbidities are more frequent in the elderly or in men or women. To avoid the aforementioned drawbacks, ML studies must be performed in distinct centers, and patient cohorts should be matched by age and sex in order to reveal the independent pathophysiological factors defining the risk of the fatal outcome.
The efficiency of ML classifier algorithms is typically estimated using AUROC (an ultimate metric combining sensitivity and specificity), sensitivity and specificity separately, and the range of AUROC, sensitivities, and specificities obtained in distinct study centers. The latter metric can be considered a measure of robustness, as such decision support systems are designed to be applied in different clinics and, therefore, must avoid overfitting. In our study, we focused both on the selection of the best ML algorithm and on the development of best-in-class AI tools. The two-stage design of the study can be described as follows: first, we performed 3-fold cross-validation (where each fold represents one center) for all tested ML algorithms (decision trees, random forests, extra trees, neural networks, k-nearest neighbors, and gradient boosting: XGBoost, LightGBM, and CatBoost), using multivariate logistic regression as a reference. Then, we selected the best ML tool in terms of average sensitivity, specificity, and robustness in relation to the sample heterogeneity across different study centers. To find a combination of the most sensitive and specific predictors and to develop efficient ML tools to predict the fatal outcome, we applied a 70:30 split validation in the second stage of our study.
Screening of broadly established ML algorithms (decision trees, random forests, extra trees, neural networks, k-nearest neighbors, and gradient boosting: XGBoost, LightGBM, and CatBoost), with multivariate logistic regression as a reference, revealed that neural networks exhibit the highest sensitivity, good specificity, and excellent robustness (a negligible range of sensitivities and specificities obtained in different study centers). Moreover, selected neural networks were able to exceed an AUROC of 0.85, 90% sensitivity, and 80% specificity. The most valuable predictors included a past/present medical history of CAD/CHF and/or stage 3-5 CKD, BUN, and CRP; therefore, the cost of such a screening analysis does not exceed 15 USD or EUR, which is important for healthcare economics, in particular in middle- and low-income countries. Having pinpointed the age- and sex-independent importance of cardiovascular and renal comorbidities in the prediction of COVID-19 fatal outcomes in the ICU setting, we suggest that the incorporation of other clinical factors, such as age or body mass index, into the neural networks may further improve their sensitivity and specificity, as the pathophysiological background of such a predictive model has been clarified in this study. In addition, the identification of a specific endothelial dysfunction marker, which is currently lacking [47][48][49], may finalize the model, as prothrombotic and pro-inflammatory activation of endothelial cells represents a characteristic feature of COVID-19 [50][51][52][53].
In comparison with neural networks, random forests and CatBoost (one of the gradient boosting algorithms) showed a higher average AUROC but also a considerably wider range in sensitivity and specificity, overall suggesting lower robustness. This observation emphasizes the importance of conducting multicenter rather than single-center studies. However, most of the existing predictive models have been developed in single-center studies and therefore suffer from a high risk of bias, sharing common shortcomings such as inadequate sample size, poor or unclear handling of missing data, and weak cross-validation [18,54]. Even multicenter studies often do not report the variability between centers, although some of them did [21,22,24,55]. Having ranked all machine learning algorithms by sensitivity and specificity, we found that extra trees had an average rank similar to neural networks and a higher specificity, whilst neural networks demonstrated higher sensitivity, which is presumably more important in the context of prompt triage in the ICU setting.
Although we applied multicenter enrollment that permitted primary cross-validation between the study centers, performed a bootstrapping procedure for additional virtual cross-validation when using the general dataset, and matched the patient samples with distinct outcomes by age and sex, our study still has certain shortcomings. First, as we recruited patients from 2020 (when anti-SARS-CoV-2 vaccines had not yet been implemented, as Sputnik V was deployed on 27 November 2020 and widely distributed from January 2021) to 2022 (when the vaccination rate exceeded 50%), the vaccination status could impact the relative contribution of the predictors. Second, the sample size (n = 350) was limited because of the necessity of age and sex matching, and resampling of patients regardless of age could increase the sensitivity and specificity of our neural networks; this can be considered an aim for further study. Third, although all datasets included age- and sex-matched patients, the total number of matched pairs was limited (n = 175). An independent validation of the proposed models would be important for an unbiased evaluation of the machine learning algorithms' performance. Fourth, all study centers were located in the same geographic region, restricting the generalizability of the developed models.

Conclusions
We conclude that neural networks represent an optimal ML algorithm to predict the fatal outcome in age- and sex-matched patients with severe COVID-19 admitted to the ICU, as they showed the highest sensitivity (>90%), sufficient specificity (>80%), good AUROC (>0.86 at primary cross-validation and ≈0.85 at virtual cross-validation using a bootstrapping technique), and excellent robustness to the heterogeneity between the study centers. The most valuable predictors of COVID-19-related death at the time of ICU admission were CAD/CHF, stage 3-5 CKD, BUN, and CRP. We suggest that this composition of predictors might be applicable for the expedient triage of patients with severe COVID-19 in the ICU if embedded into a neural network (as shown in this study).