COVID Mortality Prediction with Machine Learning Methods: A Systematic Review and Critical Appraisal

More than a year has passed since the report of the first case of coronavirus disease 2019 (COVID), and increasing deaths continue to occur. Minimizing the time required for resource allocation and clinical decision making, such as triage, choice of ventilation modes and admission to the intensive care unit is important. Machine learning techniques are acquiring an increasingly sought-after role in predicting the outcome of COVID patients. Particularly, the use of baseline machine learning techniques is rapidly developing in COVID mortality prediction, since a mortality prediction model could rapidly and effectively help clinical decision-making for COVID patients at imminent risk of death. Recent studies reviewed predictive models for SARS-CoV-2 diagnosis, severity, length of hospital stay, intensive care unit admission or mechanical ventilation modes outcomes; however, systematic reviews focused on prediction of COVID mortality outcome with machine learning methods are lacking in the literature. The present review looked into the studies that implemented machine learning, including deep learning, methods in COVID mortality prediction thus trying to present the existing published literature and to provide possible explanations of the best results that the studies obtained. The study also discussed challenging aspects of current studies, providing suggestions for future developments.


Introduction
More than a year has passed since the report of the first case of coronavirus disease 2019 (COVID), and many deaths continue to occur. Despite the development of different vaccines by different pharmaceutical companies, many problems related to mass production and distribution across the world still persist. These problems are accompanied by political and economic constraints that may further limit vaccine access [1]. For these reasons, pandemic containment is a hard task, resulting in increased deaths. At the time of writing, the worldwide SARS-CoV-2 numbers reported by the World Health Organization (Geneva, Switzerland) (https://covid19.who.int/, 31 May 2021) include 173,005,553 people infected with SARS-CoV-2, 3,727,605 deaths and 1,900,955,505 vaccine doses administered. The many hospitalizations caused by the rapid spread of the virus have required an improvement of patient management throughout the healthcare system. In this context, it is important to minimize the time required for resource allocation and clinical decision making, such as triage, choice of ventilation modality and admission to the intensive care unit. Currently, baseline machine learning (ML) and deep learning (DL) techniques are widely accepted thanks to their ability to obtain information from the input data without "a priori" definitions [2]. These approaches can be efficiently tested in healthcare applications such as diagnosis of diseases, analysis of medical images, collection of big data, research and clinical trials, management of smart health records and prediction of outbreaks [3]. Consequently, DL models are capable of solving complex tasks in the intricate clinical field [4]. ML is acquiring an increasingly sought-after role in predicting the outcome of COVID patients [3,5–7]. For instance, a mortality prediction model could rapidly and effectively help clinical decision-making for COVID patients at imminent risk of death.
Recent studies reviewed predictive models for SARS-CoV-2 diagnosis and severity, length of hospital stay, intensive care unit (ICU) admission, mechanical ventilation modality outcomes [8][9][10][11][12], highlighting pitfalls of the machine and deep learning methods based on imaging data [13]; however, systematic reviews focused on prediction of COVID mortality outcome with ML methods, including DL techniques, are lacking in the literature.
The aim of this review is to discuss the current state of the art of ML methods to predict COVID mortality by: (1) summarizing the existing published literature on baseline ML- and DL-based COVID mortality prognosis systems based on medical evaluations, laboratory exams and Computed Tomography (CT); (2) presenting relevant information including the type of data employed, the data splitting technique, the proposed ML methodology and evaluation metrics; (3) providing possible explanations of the best results obtained; (4) discussing challenging aspects of current studies and providing suggestions for future developments.

Literature Review Methods
This systematic review considers the state of the art in ML and DL as applied to COVID mortality prediction. We performed a MEDLINE search on PubMed on 26 May 2021 using the terms "machine learning covid survival" (146 results), "machine learning covid mortality" (131 results), "deep learning covid survival" (49 results), "deep learning covid mortality" (45 results) and additional similar terms. The search results were filtered to remove: duplicates, ML approaches for SARS-CoV-2 diagnosis or prognosis besides mortality, preprint works, abstract works, papers that deviated from our purpose. We try to shed some light on peculiar characteristics of these studies in terms of: (i) data source, (ii) data partitioning, (iii) class of features, (iv) implemented features ranking method, (v) implemented ML technique, (vi) metrics evaluated for performance assessment.

Data Source
We emphasized the study location and whether the dataset of each study was public or private, single site or multicenter.

Data Partitioning
We focused on the type of model validation that each study used to split data into train and test groups. In particular, we chose to report the number of subjects used for the train and test sets, and the corresponding numbers of survived and non-survived subjects. Additionally, we categorized validation types as internal, external, merged and prospective (the latter being either internal prospective or merged prospective). We referred to internal validation when a study subdivided a single-site database into train and test groups, and to external validation when a study trained and tested the model using data from independent cohorts obtained from different sites. We referred to merged validation for studies that combined data from different sites into a single database to split into train and test groups, or that used multisite publicly available epidemiological datasets. Finally, we indicated prospective validation when studies implemented a temporal validation, assessing temporal generalizability. In internal prospective validation, data of patients hospitalized in a first timeframe were used for training and data of patients admitted to the same hospital at a different time were used for testing. In contrast, prospective merged validation relied on multisite data to train the model and on multisite data collected in a subsequent timeframe for testing.
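The validation categories above can be illustrated with a minimal sketch. The record layout, site names and cutoff date below are hypothetical and serve only to show how the splits differ; they are not taken from any reviewed study.

```python
from datetime import date

# Hypothetical patient records: (patient_id, site, admission_date, died)
records = [
    ("p1", "A", date(2020, 3, 1), 0),
    ("p2", "A", date(2020, 3, 15), 1),
    ("p3", "A", date(2020, 5, 2), 0),
    ("p4", "B", date(2020, 3, 20), 0),
    ("p5", "B", date(2020, 5, 10), 1),
    ("p6", "B", date(2020, 6, 1), 0),
]

def external_split(rows, train_sites, test_sites):
    """External validation: train and test cohorts come from different sites."""
    train = [r for r in rows if r[1] in train_sites]
    test = [r for r in rows if r[1] in test_sites]
    return train, test

def prospective_split(rows, cutoff, sites=None):
    """Temporal validation: train on admissions before `cutoff`, test on the rest.

    sites=None pools every site (prospective merged validation);
    passing a single site gives internal prospective validation.
    """
    pool = [r for r in rows if sites is None or r[1] in sites]
    train = [r for r in pool if r[2] < cutoff]
    test = [r for r in pool if r[2] >= cutoff]
    return train, test

def class_counts(rows):
    """Number of survived and non-survived subjects in a cohort."""
    died = sum(r[3] for r in rows)
    return {"survived": len(rows) - died, "non_survived": died}
```

Internal and merged validation would then split a single-site or a pooled multisite cohort, respectively, without reference to admission time.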

Class of Features
We expected to collect papers with both clinical and imaging features. In the latter, we included hand-crafted features extracted with radiomic analysis and features learned by convolutional neural networks (CNN). Clinical features comprise demographics (e.g., age, sex, race), comorbidities (e.g., diabetes, heart disease), symptoms (e.g., cough, fever), vital signs (e.g., heart rate, oxygen saturation), laboratory values (e.g., glucose, creatinine, haemoglobin), and disease treatment and clinical course (e.g., artificial ventilation, length of hospital stay, drugs). Clinical features can be classified as binary (yes/no: 0/1) or continuous (numerical values). We considered features as binary when studies associated them with 0/1 values or dichotomized a continuous feature's value into a binary form, defining a numerical range and setting the feature to 1 if the value is within that range and to 0 otherwise. We referred to continuous features when studies used predictors (features used for prediction tasks) as continuous variables or converted binary features into continuous ones.
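The dichotomization rule described above can be written in a few lines. The range used in the example is purely illustrative and is not a clinical threshold drawn from the reviewed studies.

```python
def dichotomize(value, low, high):
    """Return 1 if `value` lies in the closed range [low, high], else 0."""
    return 1 if low <= value <= high else 0

# Hypothetical example: flag oxygen saturation values inside an
# illustrative range (the bounds are assumptions, not clinical guidance).
spo2_values = [98.0, 91.5, 88.0]
flags = [dichotomize(v, 0.0, 92.0) for v in spo2_values]
```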

Implemented Features Ranking Method
To build a reliable classification model, the feature set should contain as much useful information as possible with as few features as possible. It is necessary to filter out irrelevant and redundant features by choosing a subset of relevant features, in order to avoid over-fitting and tackle the problem of dimensionality [14]. Feature ranking (or selection or reduction) techniques are a good approach for reducing the dimensionality of the feature space [15]. Feature ranking improves the understanding of features and reduces the computational cost, increasing the efficiency of the classification. Since Shapley Additive Explanation (SHAP) and the least absolute shrinkage and selection operator (LASSO) logistic regression algorithm are widely used methods for model interpretation and feature selection in survival studies [16][17][18][19], we highlighted whether the studies used these methods or others. In particular, SHAP is a method to explain individual predictions by computing the contribution of each feature to the prediction. LASSO is a regression method that adds an L1 penalty on the model coefficients, shrinking some of them exactly to zero and thereby performing feature selection.
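For reference, the two methods can be stated in their standard textbook forms. The notation below ($n$ subjects, $d$ features, penalty weight $\lambda$, feature set $F$, value function $f$) is introduced here for illustration and does not come from the reviewed studies.

```latex
% LASSO-penalized logistic regression: negative log-likelihood plus an L1 penalty
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log \pi_i + (1-y_i)\log(1-\pi_i)\Big]
  + \lambda \sum_{j=1}^{d} |\beta_j|,
\qquad
\pi_i = \frac{1}{1+e^{-(\beta_0 + x_i^{\top}\beta)}}

% Shapley value of feature j: its average marginal contribution over all subsets S
\phi_j = \sum_{S \subseteq F \setminus \{j\}}
  \frac{|S|!\,\big(|F|-|S|-1\big)!}{|F|!}\,
  \Big[f\big(S \cup \{j\}\big) - f(S)\Big]
```

Larger values of $\lambda$ drive more coefficients $\beta_j$ to zero, which is what makes LASSO usable as a feature selector; $\phi_j$ is the quantity that SHAP estimates for each feature of an individual prediction.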

Implemented ML Techniques
With the aim of identifying the most used and best-performing methods, we focused on the prediction technique used in each work, highlighting whether it belonged to baseline ML or advanced DL algorithms. Since the literature offers many implementable and customizable algorithms, we expected to find several different methods employed in the works included in this review. However, we expected these techniques to be attributable to one of the following four classes, according to their underlying algorithm: (i) regression, (ii) classifier, (iii) neural network and (iv) ensemble learners. In particular, we included in regression the algorithms that estimate the model function from the input features to a numerical or continuous output. In classifiers, we included the algorithms that estimate the model function from the input features to a discrete or categorical output. In neural networks, we included architectures inspired by the neurons in the brain. Finally, we considered as ensemble learners the models that combine several base models. In addition to the algorithm, we aimed to identify the K-fold cross-validation used in each work, a statistical method used to estimate the skill of a model, with k referring to the number of folds into which the data are split, each fold being used once for validation to reinforce the validity of the model.
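As a minimal sketch of K-fold cross-validation, the function below produces the train/test index pairs for k folds. The function name, the shuffling and the seed are illustrative choices, not a description of any reviewed study's pipeline.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]     # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

A model would be trained on each `train` index set and scored on the matching `test` set, with the k scores then averaged; repeating the procedure with different seeds gives repeated K-fold cross-validation.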

Metrics
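We highlighted the measures that each selected study reported to evaluate model performance, including accuracy (ACC), area under the receiver operating characteristic curve (AUC-ROC), area under the precision-recall curve (AU-PRC), sensitivity (SENS), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), F1-score, Matthews correlation coefficient (MCC) and balanced accuracy (B-ACC).
ACC is the proportion of correctly classified subjects:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
SENS is the ability of a model to detect a true positive:
$$\mathrm{SENS} = \frac{TP}{TP + FN}$$
SPEC is the ability of a model to detect a true negative:
$$\mathrm{SPEC} = \frac{TN}{TN + FP}$$
PPV is the ability of a model to avoid categorizing people as having the condition when in fact they do not:
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
NPV is the ability of a model to avoid categorizing people as not having the condition when in fact they do:
$$\mathrm{NPV} = \frac{TN}{TN + FN}$$
F1-score is defined as the harmonic mean of precision (PPV) and recall (SENS):
$$\mathrm{F1} = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{SENS}}{\mathrm{PPV} + \mathrm{SENS}}$$
MCC is a measure not influenced by the unbalanced nature of a dataset:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
B-ACC is a metric that evaluates binary classifier performance while accounting for dataset imbalance:
$$\mathrm{B\text{-}ACC} = \frac{\mathrm{SENS} + \mathrm{SPEC}}{2}$$
where TP, FP, TN and FN are respectively true positive, false positive, true negative and false negative.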
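The confusion-matrix metrics listed above can be computed together, as in the following minimal sketch. The function name and the example counts are illustrative, and the code assumes that no denominator is zero.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute confusion-matrix metrics; assumes every denominator is non-zero."""
    sens = tp / (tp + fn)                      # sensitivity (recall)
    spec = tn / (tn + fp)                      # specificity
    ppv = tp / (tp + fp)                       # positive predictive value (precision)
    npv = tn / (tn + fn)                       # negative predictive value
    acc = (tp + tn) / (tp + fp + tn + fn)      # accuracy
    f1 = 2 * ppv * sens / (ppv + sens)         # harmonic mean of PPV and SENS
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    b_acc = (sens + spec) / 2                  # balanced accuracy
    return {"ACC": acc, "SENS": sens, "SPEC": spec, "PPV": ppv,
            "NPV": npv, "F1": f1, "MCC": mcc, "B-ACC": b_acc}

# Hypothetical imbalanced test set: 13 non-survivors among 100 patients
metrics = binary_metrics(tp=8, fp=2, tn=85, fn=5)
```

In this hypothetical example ACC is 0.93 while SENS is only 8/13, which illustrates why accuracy alone can be misleading on imbalanced mortality data.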

Literature Review Results
Twenty-four papers were included for discussion in this work. Out of these, 3 were DL papers, 17 traditional ML papers and 4 hybrid papers.

Data Source
Public datasets were used by 2/24 papers [20,21] (Supplementary Table S1). Private data were used in 22/24 papers, with 9/24 using data from a single site and 13/24 using multicentric data. A total of 22/24 studies used data from a single country: 8/24 from China, 8/24 from the United States, 2/24 from the United Kingdom, 2/24 from Korea and 2/24 from Italy. The remaining 2/24 studies used data from more than one country: Italy, Spain and the United States [22], and Iran and the United States [23] (Table 1).
A total of 7/24 studies performed a merged validation, particularly 4 of these combined data from different sites producing a single database [22,28,41,43] and 3 of these used multisite publicly available epidemiological datasets [20,21,44].
A total of 2/24 studies implemented internal prospective validation [24,38] and 3/24 studies implemented a prospective merged validation [24,40,43]. developed an artificial intelligence (AI) framework using deep neural networks to segment lung lobes and pulmonary opacities, and baseline ML methods to predict mortality based on radiological severity scores (accounting for the volume ratio of pulmonary opacities in each lung lobe).
In Table 2 we show features type and class, feature ranking techniques, features dimension reduction included in each study and the most important features derived.

Implemented Features Ranking Method
Most studies used a high number of starting features [24,27,29–33,35,37–39]. We found 8/24 articles in which the SHAP method was used to optimize survival prediction in COVID [22,24–26,29,32,39–41,44]. Vaid et al. demonstrated that interactions between features had a weak contribution to outcome prediction compared to the importance of each feature individually [24]. On the contrary, Abdulaal et al. used SHAP analysis to assess the contribution of patient variables to the mortality prediction, with no feature reduction [25,26]. A similar approach was employed by other studies [22,29,32]. Subudhi et al. tested 18 models and applied the SHAP technique to temporally distinct patients, to compare the important features selected on the different validation cohorts [40]. In the other works, the most relevant features were selected with LASSO [24,33–35]. Ko [38]. With DL models, feature selection can be implemented by combining available features, as shown by Zhu et al. [31], to obtain the optimal number of features necessary for classification. Three articles did not apply any feature selection before the prediction algorithm [20,21,36].

Implemented ML Techniques
Following the classification of ML methods introduced above, in this section we report the particular algorithm implemented in each study and describe its most relevant characteristics:

Regression Methods
A total of 8/24 papers evaluated the performance of logistic regression (LR) and compared it with the performance of other ML tools [20,22–24,30,35,37,43]. According to Li et al., LR models are superior to other methods in terms of high-speed calculation and easy-to-interpret results, which might enhance their clinical applications [30]. Furthermore, Li et al. developed a novel LR modeling method that ensured the training of optimal predictors only (the adopted method for feature selection will be explored in-depth in the next paragraph) [30]. LR is one of the traditional regression techniques, widely used to observe the risk conditions among exposure and disease [45].
Abdulaal et al. implemented a Cox regression model, an algorithm used for multivariable analysis, starting from predictors chosen in concordance with previous literature. In the training phase, the variables not significantly associated with mortality were eliminated [26]. The model assessment was performed with the final Cox model, retrained on the entire training set. A total of 2/24 studies [24,34] implemented LASSO [19]. An et al. tested several machine-learning algorithms, including LASSO [34]. Likewise, Vaid et al. employed LASSO for data training [24]. One study (1/24) proposed a partial least squares (PLS) regression model [37], which is a very versatile, supervised method purposely used to address the issue of making good predictions in multivariate problems [34]. PLS is based on a mixture of linear regression models [46], to reduce the complexity of high-dimension class prediction problems.
The studies by An et al. and Subudhi et al. also included k-nearest neighbors (KNN) [34,40], considered the oldest and simplest method for pattern classification [48]. Moreover, An et al. reported the features associated with mortality as input data on multivariate Cox regression [34]. [25]. Information was then fed to two densely connected hidden layers, consisting of one-dimensional vectors placed in cascade. Each layer had the aim of creating increasingly meaningful representations of the input data before attempting outcome prediction. Zhu et al. implemented a deep neural network of six fully connected dense layers, whose input layer had 53 features, for predicting survival [31]. In this work, feature ranking was performed with the permutation importance methodology, training 6-layer DNNs with 5-fold cross-validation. Once the top five clinical variables were selected, the neural network was reduced to a simple 2-layer DNN, to prevent overfitting.
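Permutation importance, mentioned above for feature ranking, can be sketched generically as follows. This is not the implementation of Zhu et al.: the toy model, the data and the use of accuracy as the score are assumptions made only to show the idea, which is that a feature is important if shuffling its column degrades the score.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model output equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)                       # break the feature-label link
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical classifier that looks only at feature 0."""
    return int(row[0] > 0.5)

# Hypothetical data: feature 0 determines the label, feature 1 is ignored
X = [[0.1, 3.0], [0.2, 7.0], [0.3, 1.0], [0.4, 9.0],
     [0.6, 2.0], [0.7, 8.0], [0.8, 4.0], [0.9, 6.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(toy_model, X, y)
```

Features would then be ranked by decreasing importance and the top-ranked ones retained for a smaller model.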
One paper (1/24) [21] tested convolutional neural networks for unsupervised feature extraction from CT images, to predict patient mortality. Ning et al. developed three different algorithms: a 13-layer CNN for CT slice-based predictions, a 7-layer DNN for predictions based on clinical features and, finally, the PLR algorithm, a regression model that integrated the predictions from CT slices and clinical features into one score for predicting mortality outcome [21]. Randomization and parameter optimization of the neural networks on the training dataset were performed ten times, and the model with the highest accuracy was retained as the final model. Moreover, to avoid overfitting, the authors opted for the dropout method, which randomly "drops out" neurons from the neural network during training. ReLU activation functions were set for both architectures, to activate the output of a neuron. For each method, a 10-fold cross-validation was executed ten times.
Two papers (2/24) [36,40] used the multilayer perceptron (MLP) technique. In the literature, MLP is considered a powerful machine-learning tool for medical prediction purposes, such as survival [49]. Although datasets with different origins were employed, in the study of Vaid et al. each MLP model was built with the same architecture. The MLP architecture consisted of an input layer, three hidden layers (40, 10 and 2 units, respectively) and an output layer. In this article, the authors tried to solve the problem of data governance and privacy by training the algorithms collaboratively without exchanging the data itself, a technique known as Federated Learning (FL). The federated model was able to read the model parameters instead of raw data, thus fulfilling privacy requirements.
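The idea of sharing parameters rather than data can be illustrated with a minimal aggregation step. The text above does not state which aggregation rule was used, so the sample-size-weighted average below (in the style of federated averaging) is an assumption, and the function and argument names are hypothetical.

```python
def federated_average(site_params, site_sizes):
    """Sample-size-weighted average of per-site parameter vectors.

    Only parameters and cohort sizes leave each site; raw patient data do not.
    The weighting rule is an assumption made for illustration.
    """
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [
        sum(params[k] * n for params, n in zip(site_params, site_sizes)) / total
        for k in range(n_params)
    ]

# Hypothetical example: two sites with 1 and 3 patients, two parameters each
global_params = federated_average([[1.0, 2.0], [3.0, 6.0]], [1, 3])
```

In a full federated loop, the averaged parameters would be sent back to the sites for another round of local training.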

Ensemble
Five studies (5/24) implemented the XGBoost algorithm, one of the most popular ensemble methods for binary classification in ML [22,24,35,38,39]. This classifier relies on a recursive tree-based decision system, accommodating nonlinearity and interactions between predictors, with high performance on data [24]. In Bertsimas et al., XGBoost was chosen thanks to its capability of reducing the system complexity [22]. Vaid et al. decided to use the same algorithm with a first dataset containing missing subjects' data values, and a second dataset, in which features with >30% missing values were dropped and k-nearest neighbors were used to impute missing data in the remaining feature space [24]. Yan et al. carried out a different feature selection, obtaining the final six significant features and using them to train the model defined as "simple-tree XGBoost" [38]. A total of 3/24 studies chose the Gradient Boosting Decision Tree (GBDT) algorithm [30,42,43]. The biggest difference with XGBoost is that the latter uses a more regularized model objective function to prevent overfitting. The studies compared this algorithm with other non-ensemble learners, including LR, SVM and neural networks. Yu [41,50]. A total of 7/24 works decided to test the Random Forest (RF) ensemble algorithm [20,23,28,34,37,40,42]. RF is another ensemble learning model characterized by multiple decision trees and considered as one of the best-performing learning algorithms [28]. Two studies used RF to select the predictors before the final training [34,37]. Gao et al. developed an ensemble model based on the best performance obtained from baseline ML models, including LR, SVM, KNN, GBDT, and NN, on an internal validation cohort with 10-fold cross-validation to tune model parameters [33].
To improve the model's ability to recognize minority categories, they raised the weights of the minority class in the model, increasing the penalty for misclassifying minority categories during training. Once the best predictive performance was achieved, an ensemble model derived from four baseline models (LR, SVM, GBDT, and NN) was proposed for prediction by manually assigning weights to each individual estimator. In order to improve the mortality prediction, Ko et al. created a new ensemble AI model combining a 5-layer DNN and an RF model, named EDRnet (ensemble learning model based on DNN and RF models) [27]. The structure included a DNN architecture with an input layer of 30 features, including 28 biomarkers, age and gender. The input layer was fed into three consecutive dense layers consisting of 30, 16 and 8 neurons, respectively. To avoid overfitting, the authors applied the dropout method. Finally, the last fully connected layer was fed into a softmax layer, providing probabilities for patient mortality as output. Separately, the authors trained an RF model using a maximum feature number of five. Soft voting was implemented to obtain the final predicted mortality probability value, starting from the DNN and RF results. In particular, soft voting consists of averaging the two probability values p(DNN) and p(RF): if the average is greater than or equal to 0.5, the prediction result represents death; otherwise, it represents survival. As already mentioned, Subudhi et al. are the only ones to test a very high number of baseline ML algorithms (18), including the known ensemble learners GBDT, XGB and RF, and others such as the AdaBoost classifier, Bagging classifier, Extra trees classifier and Gaussian process classifier.
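The soft-voting rule described above reduces to a short function. The function name and the example probabilities are illustrative; the averaging and the 0.5 threshold follow the description in the text.

```python
def soft_vote(p_dnn, p_rf, threshold=0.5):
    """Average the two predicted mortality probabilities and apply the threshold."""
    p = (p_dnn + p_rf) / 2
    label = "death" if p >= threshold else "survival"
    return p, label

# Hypothetical model outputs for two patients
print(soft_vote(0.75, 0.25))   # average 0.5 reaches the threshold
print(soft_vote(0.25, 0.5))    # average 0.375 stays below it
```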

Metrics
None of the articles chose to evaluate ACC, AUC-ROC, AU-PRC, SENS, SPEC, PPV, NPV, MCC, and B-ACC altogether. Ten-fold cross-validation was the most frequent cross-validation method [21,24–26,28,33,34,37,39]. In Table 3 we report the best-performing ML techniques and the corresponding metric values for each paper, detailing the kind of validation to which each result refers. For works lacking a performance report, we did not show any values in the table.

Discussion
Few studies attempted COVID survival analysis with statistical methods [34,51–55]. We decided to focus our review on mortality prediction through ML techniques, which are able to fit nonlinear and complex interaction effects between predictors [56]. In particular, ML improved predictability compared to other statistical methods for survival prediction in various practical domains [56,57]. Variability in dataset dimensions, experimental methods and feature choices limits the comparison of the selected studies.

Datasets
The studies included in this review share several limitations. First, the number of patients available for testing might be considered small, affecting the significance of the results. Additionally, deceased cases are often a minority compared to the ones alive. The few datasets that are publicly available are subject to the possible risk of institutional bias [13] due to the lack of information about exclusion criteria. An additional bias could be related to the impossibility of knowing whether patients are truly SARS-CoV-2 positive, due to the unclear definition of patient recruitment [13]. In addition, most studies were blind to patients who were admitted for clinically suspected SARS-CoV-2 and tested positive for the virus but died due to unrelated morbidities. Since the SARS-CoV-2 mortality rate of 3.6% (https://coronavirus.jhu.edu/data/mortality, 31 May 2021) implies class imbalance (Table 1), unbalanced data selection may positively or negatively affect the performance of the training and testing process [24,26,28,43]. It is known from the literature that a representative sample is required for a stable model [58]. Nevertheless, these good results may be due to the adopted methods (neural networks, SVM, ensemble algorithms) that are known from previous literature to achieve high performance on unbalanced datasets adjusted with oversampling or undersampling techniques [59][60][61][62]. Subudhi et al. adopted random undersampling, comparing the excluded patients of the majority class with the included patients to ensure that none of the features were significantly different (p ≥ 0.05) [40].
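Random undersampling of the majority class can be sketched as below. This is a generic illustration and not the procedure of Subudhi et al.; the function name, the seed and the data are assumptions. The excluded rows are returned so that they can be compared with the included ones, as described above; the statistical comparison itself is not shown.

```python
import random

def random_undersample(rows, label_of, seed=0):
    """Keep an equal number of rows per class by randomly dropping the surplus.

    Returns (kept, excluded) so that excluded majority-class rows can be
    compared with the included ones.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(members) for members in by_class.values())
    kept, excluded = [], []
    for members in by_class.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        kept += shuffled[:n_min]
        excluded += shuffled[n_min:]
    return kept, excluded

# Hypothetical cohort: (patient_index, died) with 3 deaths among 20 patients
cohort = [(i, 1 if i < 3 else 0) for i in range(20)]
kept, excluded = random_undersample(cohort, lambda row: row[1])
```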

Demography
Although the implemented models are representative of hospitalized patients with confirmed SARS-CoV-2 infection and relative outcome within the geographic remit of the study site, caution should be used when generalizing to other populations. In particular, results may not generalize to populations with different geographical and socioeconomic conditions, or with differences in national health services or insurance-based health expenses. A merged database and a prospective validation could be useful for generalizing to a target population. Furthermore, caution should be exercised when management practices change or COVID pathogenesis evolves [40].

Accuracy and AUC-ROC
Looking at the performance measures of the developed models, only a few achieved ACC > 90% on at least one validation technique [24,27,33,35,43]. Moreover, studies that compared non-ensemble and ensemble learners showed the best performance with ensemble models [22,24,27,30,33,35,39,40,42,43]. This is in line with ensemble learning being recognized as superior to individual models in terms of prediction performance [57]. Moreover, ensemble models are less prone to overfitting issues compared to individual classifiers [63,64]. A total of 3/24 studies reported ACC > 80% and AUC-ROC > 80%, but no information on K-fold cross-validation was available. K-fold cross-validation is important to achieve more accurate results with a limited amount of data [65]. Moreover, Wong et al. suggested repeating K-fold cross-validation several times in order to obtain a reliable accuracy estimate [66]. A total of 7/24 studies reported prediction performance with internal and external validation, contributing to the generalization of the trained model to a wide target population [22–24,27,30,33,35,37].

Other Metrics
In most of the selected studies SENS and SPEC, which provide information about the ability to detect deceased or cured cases, exceeded 70% on at least one validation [25,27,29,30,34,36,37,39,42]. Only a limited number of studies (9/24) indicated predictive values (PPV and NPV) [21,22,25,26,30,32,34,39,43], despite these being considered important information for performance prediction assessment, on a par with sensitivity and specificity [67]. Considering previous observations, the use of only the most common metrics, such as AUC-ROC and ACC, limits the model validation. All the metrics mentioned above are required in a machine learning study to allow a complete view of the performance of the final model, so as to assess whether the result truly represents a well-performing model given the size of the dataset and its imbalance.

Mortality Prediction within Different Times
Three studies tested model performance for death prediction within different times from admission: 3, 5, 7 hospital days in the study by Vaid et al.; 0, 10, 35 days in the study by Yan et al.; 7 days after admission and 7 days prior to discharge in Stachel et al. [24,38,43]. Vaid et al. achieved a better prediction (in terms of ACC, AUC-ROC) of mortality events within 3 days from admission, suggesting a role of the ensemble learner in the identification of patients at immediate mortality risk [24]. Additionally, Yan et al. and Stachel et al. reported that the ACC value of the prediction increases closer to the patient's outcome, suggesting that deteriorations of patient's conditions could give an early warning to the clinicians [38,43]. According to Ikemura et al., it would also be interesting to test models performance for predicting the death of patients within two distant timeframes (e.g., the first week of admission and the fourth week after admission) [39].

Models Validation
Vaid et al. and Subudhi et al. reported that prospectively merged validation performance dropped compared with the internal validation [24,40]. The interest in the development and validation of prediction models in clinical settings is growing [68,69], but 14/24 studies of our review reported prediction performance with internal validation only [20,21,25,26,28–32,34,36,38,39,42]. Furthermore, external validation is a rigorous key step before disseminating the prediction model in a clinical setting [70,71]. Since the aim of the reported predictive models is to inform patients and carers about a mortality outcome, it is essential that predictions be well-calibrated on a target population [72,73]. In this context, an external validation could contribute to extending this target population and to generalizing the model. For this reason, calibration measures (e.g., the Z-statistic) should be considered [74] in addition to metrics that assess discrimination between classes, such as AUC-ROC and ACC.
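To make the distinction between calibration and discrimination concrete, the sketch below shows two generic calibration checks: the Brier score and a binned comparison of predicted and observed mortality. These are not the Z-statistic mentioned above; they are simpler, commonly used checks chosen for illustration, and the function names and data are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0/1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def calibration_table(y_true, y_prob, n_bins=5):
    """Per probability bin: (bin index, count, mean predicted, observed rate)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[b].append((y, p))
    table = []
    for b, items in enumerate(bins):
        if items:                              # skip empty bins
            mean_pred = sum(p for _, p in items) / len(items)
            obs_rate = sum(y for y, _ in items) / len(items)
            table.append((b, len(items), mean_pred, obs_rate))
    return table

# Hypothetical outcomes (1 = died) and predicted mortality probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.0, 0.25, 0.75, 1.0]
score = brier_score(y_true, y_prob)
table = calibration_table(y_true, y_prob, n_bins=2)
```

A well-calibrated model shows mean predicted probabilities close to the observed rates in every bin, which a high AUC-ROC alone does not guarantee.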

Clinical Features Predictive Ability
The ability to enhance prognostication through the integration of biomarkers in clinical practice moves the medical field towards personalized medicine, as well as improving treatment strategies [75]. In this review, we identified the most significant biomarkers through feature ranking techniques. Our analysis revealed that all the models were fed with binary and continuous features and all studies included laboratory parameters with a single exception [33]. Due to the retrospective nature of the studies, some implemented models do not include potentially important predictors of mortality outcome, such as comorbidities, vital signs, treatment, laboratory and radiological features. In addition, several studies have missing features for some subjects. Missing values are a challenging problem in SARS-CoV-2 baseline ML and DL model development [76][77][78]. In particular, some of the variables might be deleted during data pre-processing, with the consequence of underestimating their role in predicting patient outcomes [37]. To overcome these limitations, it would be necessary to standardize relevant features in a prospective multicenter analysis. Among the features with higher ranking are age, CRP and LDH (Table 2). A significant association between older age and SARS-CoV-2 infection mortality was observed in other literature without ML [79]. Moreover, the serum LDH level was found to be an independent risk factor for both severity and mortality in COVID patients [80]. Rastad et al. reported the CRP level as an additional risk factor [81]. Although the studies showed some variability in the feature extraction techniques, most of them revealed a highly significant association between the features age, CRP and LDH and mortality [22,27,30,35,36,38,39,41].
Among the experiments that use ensemble methods [22,24,27,28,33,35,38–43], the ones using the features CRP, LDH and age (after feature ranking) obtained the best-performing results [24,27,35]. Moreover, using these features, Vaid et al. compared an ensemble model with non-ensemble models (LR and LASSO), obtaining the best performance with the former. This finding highlights the predictive power of the combination of highly predictive features (age, CRP and LDH) and ensemble models. Since the sample is often imbalanced, with a relative minority of COVID positive mortalities, it might be useful to create a worldwide database with a well-balanced number of survivors and non-survivors, for the generalization of results and of the most important extracted features. Finally, it is important to note that the SARS-CoV-2 pandemic is unusual and evolving. Therefore, a real-time update of model prediction capabilities would be required.

Images Features Predictive Ability
Only two of the studies had information regarding radiologic images. Imaging may also be an important prognostic factor. The results obtained from Ning et al. and Fang et al. in terms of accuracy (78% reported by Ning et al.) and AUC-ROC (85% and 73.6% respectively) [21,23] [38], chest CT images play an important role in the diagnosis, monitoring and severity stratification of COVID [44]. The results reported by Ning et al. and Fang et al. showed that image-related features are not the best performing [21,23]. However, datasets combining clinical features and images should be created and studied further, to fully exploit the benefits of integrating clinical and imaging features. Although different studies used X-rays for predicting mortality, with both radiologist-assessed [53,54] and AI-assessed [55] disease severity scores, to our knowledge there are no studies that applied these predictors to ML methods in the evaluation of mortality. Further studies could evaluate the usefulness of this application.

Conclusions
This systematic review specifically considers the state of the art in ML and DL as applied to COVID mortality prediction. Both binary and multi-class features are considered throughout the review. We summarized the developed models considering data source, data partitioning, class of features, ML technique and evaluation metrics for performance assessment. Clinical features are used in all studies for data samples, while only one paper currently has CT image features. Most of the studies presented an imbalanced number of survived and non-survived cases. We found some best practices that studies could follow for developing optimal ML models: (1) the use of a high-quality dataset with a large, balanced number of samples; (2) the implementation of an ensemble of different ML methodologies; (3) the inclusion of clinical features from different classes, including age, CRP and LDH values; (4) the reporting of as many metrics as possible to give a complete view of model performance, including both the most common metrics, such as AUC-ROC and ACC, and other important metrics for performance prediction assessment, such as SENS, SPEC, PPV and NPV.
The considerations in this review may help to develop further studies to predict mortality in COVID patients, both adults and children, although children and young people remain at low risk of COVID mortality [82]. Moreover, suggestions collected in this study could also be useful to predict prognoses other than mortality (e.g., intubation and length of hospital stay).