Predicting In-Hospital Mortality in Severe COVID-19: A Systematic Review and External Validation of Clinical Prediction Rules

Multiple prediction models for the risk of in-hospital mortality from COVID-19 have been developed, but not applied to patient cohorts different from those from which they were derived. The MEDLINE, EMBASE, Scopus, and Web of Science (WOS) databases were searched. Risk of bias and applicability were assessed with PROBAST. Nomograms whose variables were available in a well-defined cohort of 444 patients from our site were externally validated. Overall, 71 studies that derived a clinical prediction rule for mortality from COVID-19 were identified. Predictive variables consisted of combinations of patients' age, chronic conditions, dyspnea/tachypnea, radiographic chest alterations, and analytical values (LDH, CRP, lymphocytes, D-dimer), as well as markers of respiratory, renal, liver, and myocardial damage, which were major predictors in several nomograms. Twenty-five models could be externally validated. Areas under the receiver operating characteristic curve (AUROC) for predicting mortality ranged from 0.71 to 1 in derivation cohorts; C-index values ranged from 0.823 to 0.970. Overall, 37/71 models provided very-good-to-outstanding test performance. Externally validated nomograms provided lower predictive performance for mortality than in their respective derivation cohorts, with AUROC values of 0.654 to 0.806 (poor to acceptable performance). We conclude that available nomograms were limited in predicting mortality when applied to populations different from those from which they were derived.


Introduction
The first cases of pneumonia caused by a new coronavirus [1] were reported just over three years ago in Wuhan, China [2]. Coronavirus disease 2019 (COVID-19) has since spread globally to constitute a public health emergency of international concern [3]. Although most patients had mild or moderate symptoms, a proportion of severely ill patients progressed rapidly to acute respiratory failure, with mortality reaching 49% [4]. Early identification and supportive care could effectively reduce the incidence of critical illness and in-hospital mortality. Hence, from the early stages of the pandemic, many risk-prediction models, or nomograms, were developed [5] by integrating demographic, clinical, and exploratory findings during early contact with health care systems. However, the merits of most available tools remain unclear, since many were developed to predict a diverse mix of complications (including aggravated disease, need for invasive ventilation, or admission to the ICU) in addition to in-hospital mortality; furthermore, most had not been applied to patient cohorts different from those from which they were derived.
External validation is essential before implementing nomograms in clinical practice [6]; however, almost no prognostic model for in-hospital mortality from COVID-19 has yet been validated.
In this research, we aimed to systematically review and critically appraise all currently available prediction models for in-hospital mortality caused by COVID-19. We also aimed to compare prediction performances by retrospectively applying nomograms to a well-defined series of severe patients admitted to our hospital.

Eligibility Criteria and Searches
This review was conducted and is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines; a study protocol was registered with PROSPERO [CRD42020226076]. The MEDLINE, EMBASE, Scopus, and Web of Science (WOS) databases were searched for literature published up to 25 August 2021 (Table S1). No restrictions were applied to language or methodological design. No restrictions were placed on the prediction horizon (how far ahead the model predicts) within the admission period, or on countries or study settings. Additional relevant papers were identified by screening the reference lists of included documents. Literature searches were repeated on 20 April 2022 to retrieve the most recent papers and provide updated results.
Studies were included if they (a) described the development/derivation and/or validation of a multivariable tool designed to predict risk of in-hospital mortality in patients with a confirmed diagnosis of severe COVID-19 infection, (b) provided the sensitivity and specificity of the tool or gave sufficient data to allow these metrics to be calculated, and (c) defined the variables or combination of variables used to predict the risk of mortality from COVID-19.

Study Selection
Two reviewers (MM-M and AJL) independently screened titles and abstracts against eligibility criteria. Potentially eligible papers were obtained, and the full texts were independently examined by the same reviewers. Any disagreements were resolved through discussion.

Data Extraction and Synthesis
Data extraction of included articles was undertaken by two independent reviewers (MM-M and AJL) and checked against full-text papers by a third reviewer (AAA) to ensure accuracy. Using a predeveloped template based on the CHARMS (critical appraisal and data extraction for systematic reviews of prediction modeling studies) checklist [7], information was extracted on study characteristics, source of data, participant eligibility and recruitment method, sample size, method for measurement of outcome, number, type, and definition of predictors, number of participants with missing data for each predictor and handling of missing data, modeling method, model performance, and whether a priori cut-off points were used. In addition, we assessed the method used for testing model performance, and the final and other multivariable models.
For all in-hospital mortality prediction models, the area under the receiver operating characteristic curve (AUROC) or the concordance (C) index was used to compare discrimination (the ability of a tool to distinguish patients who died from COVID-19 from those who did not). Due to the marked heterogeneity of the included studies in terms of study design, populations, variable definitions, selection of predictors, use of different tool thresholds, and variable modeling performance, we were unable to perform any meta-analyses. Instead, performance characteristics were summarized in tabular form and a narrative synthesis approach was used.
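For readers reproducing this kind of comparison, discrimination can be computed without any modeling library: the AUROC equals the probability that a randomly chosen patient who died received a higher predicted risk than a randomly chosen survivor, with ties counted as one half. The following sketch is purely illustrative and is not taken from any of the included studies:

```python
from itertools import product

def auroc(y_true, scores):
    """Rank-based AUROC: the probability that a randomly chosen death
    (y == 1) received a higher predicted risk than a randomly chosen
    survivor (y == 0); tied scores count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]  # patients who died
    neg = [s for y, s in zip(y_true, scores) if y == 0]  # patients who survived
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy example: two survivors, two deaths, with predicted risks
# → 3 of 4 death/survivor pairs are correctly ordered, AUROC = 0.75
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

This pairwise formulation is mathematically equivalent to integrating the ROC curve and makes the interpretation of values such as 0.654 (poor) versus 0.97 (outstanding) concrete: the score is the chance the model ranks a death above a survivor.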

Risk of Bias Assessment
PROBAST (prediction model risk of bias assessment tool) was used to assess the risk of bias and applicability of the included studies [8]. PROBAST assesses both the risk of bias and concerns regarding applicability of a study that develops or validates a multivariable diagnostic or prognostic prediction model. It includes 20 signaling questions across 4 domains (participants, predictors, outcome, and analysis). Each domain was rated as having "high", "low", or "unclear" (where insufficient information was provided) risk of bias. Two reviewers (A.A. and A.J.L.) independently assessed each study. Ratings were compared and disagreements were resolved by consensus.

External Validation of Included Clinical Prediction Rules
External validation was carried out as long as the variables integrated in a model were available among those registered in an external validation cohort of patients. No other restrictions were placed on the type of variable that could be included in a tool.
Data on 444 adults with confirmed SARS-CoV-2 infections, who were admitted to our hospital due to severe COVID-19 between 26 February and 31 May 2020 and for whom a 90-day follow-up period was available, were used for the external validation of selected clinical prediction rules. Detailed methods and the clinical and demographic characteristics of this patient series have been previously described [9,10]. All patients were independent from the data used in the derivation of any of the included clinical prediction rules. Data from the validation cohort were recoded to reproduce the predictors and primary outcomes of each included clinical prediction rule, and modifications were made to match available data. The same point assignments and cut-off values provided by the original derivation cohorts were used for the external validation analyses.
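The recoding step can be illustrated with a deliberately simplified, hypothetical points-based rule. The point values, thresholds, and cut-off below are invented for illustration only and do not correspond to any of the validated nomograms:

```python
# Illustrative sketch (NOT any published nomogram): recoding a
# validation-cohort record into a points-based score and applying the
# derivation cohort's cut-off unchanged, as was done in this study.
# All point values and thresholds below are hypothetical.

def nomogram_points(age: float, crp: float, ldh: float) -> int:
    """Assign points per predictor, mimicking a points-based nomogram."""
    points = 0
    points += 2 if age >= 65 else 0   # hypothetical age threshold (years)
    points += 1 if crp >= 100 else 0  # hypothetical CRP threshold (mg/L)
    points += 1 if ldh >= 350 else 0  # hypothetical LDH threshold (U/L)
    return points

def classify(points: int, cutoff: int = 3) -> str:
    """Apply the original cut-off without recalibration."""
    return "high risk" if points >= cutoff else "low risk"

# Hypothetical patients from a validation cohort
print(classify(nomogram_points(age=72, crp=150, ldh=400)))  # high risk
print(classify(nomogram_points(age=50, crp=20, ldh=200)))   # low risk
```

Keeping the original point assignments and cut-offs fixed, rather than refitting them, is what makes this a true external validation: any loss of discrimination in the new cohort is attributable to the model, not to recalibration.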
The search strategy in the different databases is detailed in Table S1. The characteristics of the included clinical prediction rules are shown in Table S2.
Study data were collected between 1 January and 20 May 2020, during the first wave of the pandemic; the earliest data were provided by Chinese hospitals, all of them prior to 31 March 2020. The latest admissions were recorded in hospitals in the US and Mexico, all of them between the beginning of March and the end of May 2020.
Overall, data from 317,840 SARS-CoV-2-infected patients were included in the derivation cohorts of the 71 predictive models, 36,882 of whom died. The percentage of deaths varied widely between studies, from 2.2% to 48.9%.
Complete data on predictors were reported in all 71 studies, and formulas to calculate mortality risk were provided or could be extracted in 32 studies (Table S2). The authors of three additional studies for which we were unable to find the formula [2,31,32] were contacted twice by email but did not respond. In the remaining studies, the formula could not be provided, as these predictive models were derived from complex techniques (decision trees or machine learning). Eight predictive models provided an online tool to automatically predict outcomes [15-17,26,48,50,52].
The most frequently used prognostic variables for mortality (included at least five times among the different nomograms) were age (in 53 models), diabetes mellitus (in 11 models), chronic lung disease (COPD or asthma) (in 8 models), heart disease or cardiac failure (in 13), chronic kidney insufficiency (in 10), hypertension (in 7), and chronic liver disease (in 5 models). Comorbidity, defined either as the number of conditions selected from a predefined list or as the Charlson index, was recognized as a determinant of mortality in nine additional models.
Clinical predictors at admission included in final models consisted of dyspnea/tachypnea (14 models) and radiographic chest alterations (7 models). Analytical variables at admission identified as predictors included serum lactate dehydrogenase (LDH) in 24 prediction rules, C-reactive protein in 28 models, lymphocytes (either the absolute number per µL or the neutrophil-to-lymphocyte ratio) in 23 models, renal function (defined in terms of urea, BUN, serum creatinine, or glomerular filtration rate) in 20 models, and respiratory function parameters (peripheral O2 saturation, supplemental O2 at admission, PaO2/FiO2, or alveolar-arterial oxygen gradient) in 23 prediction models. D-dimer was included in 16 models and platelet count in 8 additional ones. Markers of liver injury (elevated bilirubin or aminotransferase levels) and myocardial damage (troponin I, myoglobin, or creatine phosphokinase) were included as predictors of mortality in 17 and 7 nomograms, respectively. Details on all predictors included in final models are provided in Table S2.

Risk of Bias
Twenty-five studies were at high risk of bias (ROB) according to assessment with PROBAST, and a further twenty-three showed an unclear ROB (Figure 2 and Table S3). This suggests that their performance in predicting in-hospital mortality caused by COVID-19 is probably lower than that reported. Fifty-seven studies (80.3%) were evaluated as having low ROB for the participants domain, indicating that the participants enrolled were representative of the models' targeted populations. All but 19 studies had low ROB for the predictor domain (the remainder being unclear), indicating that predictors were available at the models' intended time of use, clearly defined, and independent of mortality. There were concerns about bias induced by outcome measurement in ten studies, especially due to a lack of information on the time interval between predictor assessment and outcome determination as a result of registering data from infected outpatients, or for confusingly counting losses to follow-up as deceased patients. Twenty-two studies were evaluated as high ROB for the analysis domain, mainly because calibration was not assessed or because of the risk of model overfitting when complex modeling strategies were used. The applicability of the different CPRs was also assessed. High ROB in the population domain was mainly due to the inclusion of patients without severe COVID-19 in the original derivation cohorts. Studies that derived CPRs for disease-progression outcomes were evaluated as having unclear ROB. Predictors derived from patient cohorts with small numbers of deceased patients determined their unclear applicability. Predictors obtained from patients in the ICU were considered to have high ROB regarding their applicability to our systematic review. Combined outcomes (e.g., disease progression) were considered of high or unclear ROB when only a minority of deceased patients was included in the derivation cohorts.

Evaluation of Tool Performance in Predicting COVID-19-Related Mortality
Studies that predicted mortality in derivation cohorts reported AUROC values between 0.701 and 1. C-index values ranged between 0.823 and 0.970. These values were over 0.9 (very good test) and over 0.97 (outstanding test) in 31 and 6 of the 71 predictive models, respectively. When provided, sensitivity values for the cut-off points chosen by the different authors in derivation cohorts ranged between 32% and 98.4%, with corresponding specificity ranging from 100% down to 38.6%.
In all 11 studies that externally validated their clinical prediction rules in patients from different institutions, the observed prediction performance in each validation subcohort was quite similar to that in the respective derivation cohort [11,13,26,38,48,52,55,65,70,74,75]; for instance, the values were 0.943 and 0.878, respectively, in the study by Zhang et al. [55]. In addition, Li developed a CPR in a hospital in Wuhan, China, which provided a C-index of 0.97 in the derivation cohort; it was 0.96 in the internal validation cohort recruited in a second hospital in Wuhan and 0.92 when externally validated in a third neighboring hospital [70]. Similar results were found by He et al. in their study involving patients from three hospitals in Hubei province, China [65].
Notably, external validation was mostly performed on patient cohorts from the same cities or countries in which the prediction rules were derived. However, Rahman et al. derived a prognostic model from 375 COVID-19 patients admitted to Tongji Hospital, China, with an AUROC value of 0.961; when it was validated in an external cohort of 103 patients from Dhaka Medical College, Bangladesh, the AUROC value was 0.963 [75].
Sample sizes for data that were used in external validation of each clinical prediction rule were smaller than the respective derivation cohort.

External Validation in the Same New Cohort of Patients
Predicting variables included in 25 of the 71 prognostic models identified in our systematic review were present in our local cohort of patients [11,12,15,16,22,25,26,28,30,34,35,37,39,40,45,47-50,56,60,62,71,81], thus allowing external validation of their performance in a separate clinical setting [9] to assess how each rule could be used in real life. Prediction rules were validated on subcohorts of 208 to 444 patients (depending on the number of cases with all the variables incorporated in each model). For all 25 prognostic models, prediction performance for mortality was notably lower than in the respective derivation or internal validation cohorts, with AUROC values ranging from 0.654 to 0.806 (poor to acceptable performance). No nomogram was considered good or outstanding in predicting in-hospital mortality in our own cohort of patients. Wide variability in sensitivity (ranging from 15.5% to 100%) and specificity (1.3% to 98.5%) was found when the best cut-off values (as provided by the original authors) were selected (Table 1). Figure 3. Comparison of receiver operating characteristic curves for low-risk-of-bias predictive models for in-hospital mortality [11,16,30,45,50,62,71], applicable to our external validation population.

Discussion
In this systematic review, we identified, retrieved, and critically appraised 71 individual studies that developed prediction models to support the prognostication of death among patients with COVID-19. To our knowledge, this is the first systematic assessment and comparison of the prognostic performances of existing clinical prediction rules for the risk of in-hospital mortality caused by severe COVID-19. All models were developed during the first wave of the pandemic and reported very-good-to-outstanding predictive performances in derivation and internal validation cohorts.
Predictive tools comprised simple nomograms based on analytical values; nomograms that included symptoms, analytical values, and imaging tests; and more complex diagnostic prediction models that incorporated symptoms, test results, and comorbid conditions. The predictive factors included in the different nomograms varied widely among studies, but many have been repeatedly associated with poor prognosis in COVID-19. Thus, advanced age, COPD, heart disease, hypertension, chronic kidney failure, and diabetes were positively associated with risk of death in at least five nomograms and have also been related to progression to severe disease [83]. In contrast, male sex, smoking history, and obesity were only exceptionally included in nomograms, despite being identified as risk factors for progression to severe disease and death in COVID-19 patients in some studies [84-86]. Markers of organ failure, including pneumonia, respiratory insufficiency, and ischemic cardiac or liver injury, were repeatedly included in nomograms and have been related to poor prognosis in independent research [87-89]. Inflammatory markers such as C-reactive protein and D-dimer, as well as lymphopenia, thrombocytopenia, and elevated LDH, were recognized by the earlier literature on COVID-19 [90] and by the most recent research [9]. Models developed using data from different countries agreed on including common predictive analytical values, despite most nomograms being developed in China.
The methods used to develop the available prognostic models varied greatly in terms of modeling technique, methodology, and rigor of construction. Only 15 were assessed as having low ROB for both development and applicability. The prognostic performance of most tools was evaluated solely within the study datasets, with internal validation carried out on a subset of the original cohorts, thus reproducing the AUROC values obtained in the derivation subsets. Internal validation on the same cohort used for derivation usually overestimates the performance of scores [91]. For relatively small datasets, such as those used to derive most of the nomograms, internal validation by bootstrap techniques might not be sufficient or indicative of the model's performance in future patients [6], despite demonstrating the stability and quality of the predictors selected within the same cohort. Only 11 tools were externally validated in other participants in the same article. However, these validations almost always involved patients from the same or neighboring cities as those in the derivation cohort, who were therefore likely to have similar characteristics, thus showing overlapping results.
Until now, only one nomogram had been externally evaluated for its capacity to predict COVID-19-related in-hospital mortality in a different series [49]. The predictive performance of this nomogram, developed in Mexico, decreased markedly when applied to a different patient dataset from the same country [16], with the AUROC falling from 0.823 to 0.690. A still unpublished prediction rule derived from patients in Wuhan, China, provided a C-index for death of 0.91, which decreased only to 0.74 when externally validated in patients admitted to hospital in London (UK) [92]. To our knowledge, this research is the first to externally validate prediction rules for in-hospital mortality caused by COVID-19 collectively, and it provides further evidence of their limited performance when applied to different clinical settings. The study by Rahman et al. [75] represents a second attempt at external validation; better results were produced, but these were affected by high risks of bias. It can therefore be suggested that each nomogram developed to predict mortality from COVID-19 should be applied in the same clinical setting from which it was derived.
The main strengths of this study include its systematic search of the multiple literature databases that index the main results of research on COVID-19; the fact that the search is up to date; and the critical appraisal of the methods and ROB of the studies retrieved. The different nomograms have been analyzed in detail, and formulas allowing the estimation of mortality risk in any population were provided whenever possible. Finally, the predictive performance of nomograms based on demographic, clinical, and analytical parameters available in usual clinical practice was evaluated in a single external series of COVID-19 patients admitted to hospital. Some weaknesses of our study should be acknowledged. These mainly result from the heterogeneity of the source documents, which derive from different populations with variable disease severity, cared for in not necessarily comparable healthcare settings. While certain clinical and laboratory variables were identified as contributing objectively to mortality in most studies, many others varied widely among the different prediction models. As several nomograms included variables that are not routinely used in clinical practice, we could not externally validate them. Additionally, we did not retrieve studies available only as preprints, which might improve after peer review. Finally, no prediction model was derived from or validated in patients infected with COVID-19 during the second and successive waves of the pandemic, and the models' current usefulness has not been evaluated.

Conclusions
To conclude, our research demonstrates the limitations of prognostic rules for the risk of mortality from COVID-19 when applied to populations different from those from which they were derived. Demographic, clinical, and analytical determinants of mortality risk are influenced and modulated by many factors inherent to each clinical setting, which are not easily controllable or reproducible. Once the main determinants of COVID-19-related mortality at hospital admission have been identified, the best predictive models may be those developed in each particular clinical setting.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/biomedicines10102414/s1. Table S1: Search strategies carried out in four bibliographic databases of documents that report on clinical prediction models for hospital mortality caused by COVID-19. Table S2: Overview of prediction models for mortality risk from COVID-19 identified in a systematic review of the literature, and performance of each model in derivation cohorts. Formulas to calculate mortality risk to be applied to any population were provided or extracted from original documents. Table S3: Risk of bias assessment (using PROBAST) based on four domains across the studies that developed and/or validated prediction models for in-hospital mortality due to coronavirus disease 2019. Informed Consent Statement: The Ethics Committee waived the need for obtaining informed consent from patients admitted to our hospital due to COVID-19 during the first wave of the pandemic and who were used to validate the different clinical prediction rules.

Data Availability Statement:
The data that support the findings of external validation of clinical prediction rules for this study are available from the corresponding author upon reasonable request.