External Validation of COVID-19 Risk Scores during Three Waves of Pandemic in a German Cohort—A Retrospective Study

Several risk scores were developed during the COVID-19 pandemic to identify patients at risk for critical illness as a basic step to personalizing medicine even in pandemic circumstances. However, the generalizability of these scores with regard to different populations, clinical settings, healthcare systems, and new epidemiological circumstances is unknown. The aim of our study was to compare the predictive validity of qSOFA, CRB65, NEWS, COVID-GRAM, and 4C-Mortality score. In a monocentric retrospective cohort, consecutively hospitalized adults with COVID-19 from February 2020 to June 2021 were included; risk scores at admission were calculated. The area under the receiver operating characteristic curve and the area under the precision–recall curve were compared using DeLong’s method and a bootstrapping approach. A total of 347 patients were included; 23.6% were admitted to the ICU, and 9.2% died in hospital. NEWS and 4C-Score performed best for the outcomes ICU admission and in-hospital mortality. The easy-to-use bedside score NEWS proved able to identify patients at risk for critical illness, whereas the more complex COVID-19-specific scores 4C and COVID-GRAM were not superior. Decreasing mortality and ICU-admission rates affected the discriminatory ability of all scores. A further evaluation of risk assessment is needed in view of the new and rapidly changing epidemiological situation.


Introduction
COVID-19 spread around the world at alarming speed, burdening healthcare systems and hospitals. A large number of COVID-19 patients in life-threatening conditions had to be treated in hospitals that were provisionally set up, often by less experienced staff, indicating the need for an easy and objective clinical scoring model to identify high-risk patients [1]. Furthermore, risk scores are the first step in personalizing medicine to stratify individual patients to medical treatment. Several COVID-19-specific risk scores were developed and validated during the first wave in different populations and clinical settings. The COVID-GRAM risk score (GRAM: Guangzhou Institute of Respiratory Health Calculator at Admission) was developed in a large cohort of 1590 patients from 575 hospitals in China [2] with the aim of predicting critical illness in COVID-19 patients, defined as admission to ICU (intensive care unit), mechanical ventilation, or death. The risk score summarizes 10 variables based on a logistic model and, owing to its complexity, requires an online tool for usage ([2], see Appendix A). In 2020, the 4C-Mortality Index (Coronavirus Clinical Characterisation Consortium) was developed by including 260 hospitals and 35,463 patients in England, Scotland, and Wales to predict in-hospital mortality [3]. In contrast to the younger patient cohort (mean age of 49 years) with a lower mortality rate of 3.2% in which the COVID-GRAM risk score was developed, the 4C-Mortality Index was developed in an older cohort (mean age of 73 years) with a higher in-hospital mortality rate of 32.2% [3]. Such differences call the generalizability of these scores into question and warrant further validation in different populations and various clinical settings.
Moreover, several preexisting risk scores developed for other infectious diseases and intensive care medicine, such as NEWS (National Early Warning Score), CRB-65 (confusion, respiratory rate, blood pressure, age ≥65 years), and qSOFA (quick sequential organ failure assessment), are also in use, though their predictive value among COVID-19 patients remains unclear. For example, during the study period, NEWS was routinely used at the University Hospital Tübingen as a risk assessment tool to initiate a timely clinical response for COVID-19 patients, such as nurse-doctor contact or notification of the ICU. Though several authors examined the predictive performance of some of the scores in COVID-19 risk stratification [4][5][6][7][8][9][10], the performance of multiple scores has not been compared in the same patient population. Over the course of the pandemic, epidemiological circumstances and therapeutic options changed rapidly: remdesivir received conditional approval for the treatment of COVID-19 patients in July 2020, and the benefit of dexamethasone was proven by the RECOVERY Collaborative Group [11]. In December 2020, new SARS-CoV-2 variants with high transmissibility and potential immune escape were reported in the United Kingdom and South Africa and were declared variants of concern by the WHO [12]. In 2021, the European Union licensed several COVID-19 vaccines, which had shown high efficacy and safety in clinical trials. The progress of immunization and vaccination, different COVID-19 variants, various healthcare systems, and the increasing efficacy of treatment have had a great impact on COVID-19 mortality risk. Nevertheless, a residual risk of critical illness has been reported even among the vaccinated population [4]. To date, no risk score is recommended in the German COVID-19 guidelines [13].
Against this background, we conducted a retrospective cohort study with the primary objective of evaluating the performance of common clinical scores (NEWS, qSOFA, and CRB-65) and COVID-19-specific clinical scoring models (COVID-GRAM, 4C-Mortality score) for the prediction of ICU admission and in-hospital mortality and comparing their predictive performance in a cohort of hospitalized COVID-19 patients in Germany. To the best of our knowledge, this is the first study to compare all of the abovementioned scores in a defined population and to compare their discriminative abilities across different waves of the pandemic.

Study Design and Setting
We conducted a monocentric retrospective cohort study at the University Hospital Tübingen (tertiary care hospital) located in Tübingen, Germany, during the first three waves of the COVID-19 pandemic.

Study Population
The study included patients ≥18 years admitted to the university hospital due to COVID-19 between 1 March 2020 and 30 May 2021. SARS-CoV-2 infection was confirmed by positive real-time polymerase chain reaction (RT-PCR). Exclusion criteria were as follows:

• Patients not hospitalized for COVID-19 disease;
• Patients with an advance directive specifying DNR/DNI (do not resuscitate/do not intubate) status;
• Patients transferred to our ICU from other hospitals, for example, due to the need for extracorporeal membrane oxygenation (ECMO).

Definition of Cohorts
The study period was divided into three cohort periods according to the classification of the waves by the Robert-Koch Institute [14] and based on key epidemiological factors: the number of COVID-19 cases, the prevalence of different SARS-CoV-2 variants in Germany and at the University Hospital Tübingen, and the availability of standardized specific therapy and vaccination. The prevalence of virus variants was assessed from our own data when genotyping was available and otherwise from published evidence [12,15]. COVID-19 vaccination rates in Germany were assessed according to the results of the COVIMO study group [16]. The study population was thus divided into cohort 1 from 1 March to 30 June 2020, cohort 2 from 1 July 2020 to 7 March 2021, and cohort 3 from 8 March to 30 May 2021 (see Table 1). We deviated slightly from the Robert-Koch Institute's classification of the COVID-19 waves because of the low number of hospital admissions in summer (only 7 patients fulfilled the inclusion criteria between July and September 2020).

Data Collection and Score Validation
We retrospectively collected clinical, demographical, and outcome data for each cohort by using the clinical information and documentation systems. Comorbidities included were chronic respiratory disease, cardiovascular disease, chronic liver disease, chronic kidney disease, HIV (human immunodeficiency virus) infection or AIDS (acquired immunodeficiency syndrome), organ transplantation, diabetes mellitus, malignancy, and chronic neurological conditions. Demographical and epidemiological data collected were age, sex, body mass index, COVID-19 vaccination, and DNR/DNI status. The scores studied were the Quick Sequential Organ Failure Assessment score (qSOFA), National Early Warning Score (NEWS), CRB-65 score, COVID-GRAM risk score, and 4C-Mortality Score (see Table 2). Each of the scores was separately calculated for each patient using the admission data. In case of missing values at admission, we decided to collect the earliest parameter on the day of admission or day 1 after admission to improve the power of the study. This included the laboratory parameter serum urea concentration and radiological data. In case of missing direct bilirubin, we used total bilirubin. Glasgow Coma Scale was retrospectively calculated (for the workflow, see supplementary Figure S1).
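To make the score calculation concrete, the following minimal sketch computes the two simplest bedside scores, qSOFA and CRB-65, from admission values using their commonly published criteria. It is an illustration only, not the code used in this study: the class `AdmissionVitals`, the function names, and the example patient are hypothetical, and deriving confusion from a Glasgow Coma Scale below 15 is an assumption made here for simplicity. A score is returned as `None` when any required value is missing, mirroring the rule that a score was not calculated in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdmissionVitals:
    """Hypothetical container for admission values; None marks a missing value."""
    age_years: Optional[int]
    respiratory_rate: Optional[int]  # breaths per minute
    systolic_bp: Optional[int]       # mmHg
    diastolic_bp: Optional[int]      # mmHg
    gcs: Optional[int]               # Glasgow Coma Scale, 3 to 15


def qsofa(v: AdmissionVitals) -> Optional[int]:
    """qSOFA (0 to 3): one point each for RR >= 22, SBP <= 100, GCS < 15."""
    needed = (v.respiratory_rate, v.systolic_bp, v.gcs)
    if any(x is None for x in needed):
        return None  # score is not calculated when a component is missing
    return int(v.respiratory_rate >= 22) + int(v.systolic_bp <= 100) + int(v.gcs < 15)


def crb65(v: AdmissionVitals) -> Optional[int]:
    """CRB-65 (0 to 4): confusion, RR >= 30, SBP < 90 or DBP <= 60, age >= 65."""
    needed = (v.age_years, v.respiratory_rate, v.systolic_bp, v.diastolic_bp, v.gcs)
    if any(x is None for x in needed):
        return None
    confusion = v.gcs < 15  # assumption: confusion approximated by GCS < 15
    low_bp = v.systolic_bp < 90 or v.diastolic_bp <= 60
    return int(confusion) + int(v.respiratory_rate >= 30) + int(low_bp) + int(v.age_years >= 65)


# Invented example patient, not taken from the study data.
patient = AdmissionVitals(age_years=70, respiratory_rate=24,
                          systolic_bp=95, diastolic_bp=65, gcs=15)
print(qsofa(patient), crb65(patient))
```

NEWS, COVID-GRAM, and the 4C-Mortality Score follow the same pattern but need more inputs (oxygen saturation, laboratory values, comorbidities, radiology), which is why they are not reproduced here.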

Statistical Analysis
After the inclusion of patients by the above-defined inclusion and exclusion criteria, in a second step, patients with missing values were excluded. In the case of at least one missing value, we did not calculate the specific score (Supplementary Table S1, for characteristics of excluded patients, see Supplementary Table S2). The primary outcomes were endpoints of critical illness defined by ICU admission and in-hospital mortality. Discriminative indices of the selected scores including sensitivity, specificity, and positive and negative predictive values were calculated. Confidence intervals were assessed via a bootstrapping method. The discriminative abilities of the scores were assessed and compared by using the area under the receiver operating characteristic curve (AU-ROC) and the area under the precision-recall curve (AU-PRC). Equality of the AU-ROCs was tested using DeLong's method and a bootstrapping method as well [17,18]. We decided to show the bootstrap findings in the results. To correct for multiple testing, we used Bonferroni-Holm adjusted p-values. An adjusted p-value of 0.05 or less was regarded as significant. Data were analyzed by R Version 4.1.2 using the packages readxl, pROC, precrec, boot, and ggplot2 [19][20][21][22]. We reported continuous variables as mean with the first and third quartiles and categorical variables as a number with the percentage of the cohort. Continuous variables were compared using the Mann-Whitney U-test, and categorical variables were compared by the use of Fisher's exact test.
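The analysis itself was run in R with pROC and boot. As a language-neutral illustration of what an AUC with a bootstrap confidence interval means, the sketch below computes the ROC-AUC as the probability that a randomly chosen patient with the outcome has a higher score than a randomly chosen patient without it (ties counted as one half) and derives a percentile bootstrap interval by resampling patients with replacement. The function names, the number of resamples, the seed, and the toy data are assumptions made for this example; they are not the settings used in the study.

```python
import random


def roc_auc(scores, outcomes):
    """ROC-AUC via pairwise comparison of event and non-event patients."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    # A pair counts 1 if the event patient scores higher, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the ROC-AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients
        ys = [outcomes[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one outcome class
            stats.append(roc_auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented toy data: score at admission and outcome (1 = event, 0 = no event).
scores = [1, 3, 2, 5, 7, 4, 6, 8]
outcomes = [0, 0, 1, 0, 1, 0, 1, 1]
print(roc_auc(scores, outcomes))       # 13 of 16 pairs are concordant
print(bootstrap_ci(scores, outcomes))
```

Comparing two scores on the same patients works analogously: the difference of the two AUCs is recomputed in each resample, so that the pairing of scores within a patient is preserved.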

Comparison of Cohorts
The ICU-admission and in-hospital-mortality rates significantly decreased from the first to the second wave (p-values of 0.002 and 0.044, respectively). The mean oxygen saturation at admission was significantly higher in the second cohort (p-value of 0.034), whereas documented fever decreased (p-value of 0.036). We did not find significant differences between the first and the second wave considering age, sex, and comorbidities. Due to the low sample size of the third-wave cohort, statistical tests involving this cohort had lower power. Only the mean respiratory rate was significantly lower in the third cohort compared with the first cohort (p-value of 0.04). Nevertheless, we observed a decreasing trend in the ICU-admission rate, in-hospital mortality rate, and average age between the second and third cohorts as well.

Predictive Performance of the Scores
We included the whole study population in the comparison of the predictive performance of the scores. Due to the low sample size of cohort three, statistical tests for the comparison between cohort 3 and the other two cohorts had lower power (Figure 1). Therefore, the third cohort was excluded from the comparison between the cohorts (discriminatory indices, Supplementary Table S3). Overall, the NEWS model performed best regarding ICU admission according to the ROC-AUC (0.83; CI 0.76-0.88), whereas the 4C-Score showed a higher PR-AUC (0.64; CI 0.50-0.78) than NEWS. The ROC-AUC of NEWS differed significantly from that of qSOFA (0.70; CI 0.64-0.77); however, differences from the other models were not statistically significant. The 4C-Score had the highest ROC-AUC (0.81; CI 0.69-0.90) with regard to in-hospital mortality. Again, differences from the other scores were not statistically significant. qSOFA, CRB-65 score, and COVID-GRAM performed worse (Figure 2).
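ROC-AUC and PR-AUC can rank scores differently because the precision-recall curve depends on the event rate: a score without any discriminative ability has an expected precision equal to the prevalence of the outcome, whereas its expected ROC-AUC is 0.5 regardless of prevalence. The sketch below illustrates this with average precision, a common step-wise estimator of the area under the precision-recall curve. It is not the interpolated estimate produced by the precrec package used in the study, and the function name and toy data are invented for illustration; tied scores are not given special treatment.

```python
def average_precision(scores, outcomes):
    """Step-wise estimate of the area under the precision-recall curve."""
    n_events = sum(outcomes)
    if n_events == 0:
        raise ValueError("at least one event is required")
    # Rank patients from the highest to the lowest score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, outcome) in enumerate(ranked, start=1):
        if outcome == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each event
    return total / n_events


# Invented toy data: score at admission and outcome (1 = event, 0 = no event).
scores = [1, 3, 2, 5, 7, 4, 6, 8]
outcomes = [0, 0, 1, 0, 1, 0, 1, 1]

prevalence = sum(outcomes) / len(outcomes)  # expected precision without any skill
print(prevalence, average_precision(scores, outcomes))
```

When reading a PR-AUC, the outcome prevalence in the analyzed cohort is therefore the natural reference value, in the same way that 0.5 is the reference for the ROC-AUC.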


Comparison between the Cohorts
The performance of NEWS and 4C-Score was better in cohort 1 than in cohort 2. The NEWS had a ROC-AUC of 0.88 (CI 0.80-0.94) in the first cohort and 0.71 (CI 0.60-0.81) in the second cohort with regard to ICU admission, whereas the ROC-AUC was 0.75 (CI 0.62-0.86) in the first cohort and 0.73 (CI 0.53-0.89) in the second cohort with regard to in-hospital mortality. Differences were statistically significant between the cohorts concerning ICU admission (p-value of 0.011). However, differences were not statistically significant regarding in-hospital mortality. The 4C-Score showed a ROC-AUC of 0.84 (CI 0.73-0.93) in the first cohort and 0.58 (CI 0.42-0.72) in the second cohort concerning ICU admission. The ROC-AUC was 0.87 (CI 0.78-0.94) in the first cohort and 0.59 (CI 0.33-0.84) in the second cohort concerning in-hospital mortality. ICU admission differences between the cohorts were statistically significant (p-value of 0.002), as well as in-hospital mortality differences (p-value of 0.045). We did not find any significant difference within the cohorts regarding qSOFA, CRB-65, and COVID-GRAM (details in supplementary Figure S2).
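As stated in the Statistical Analysis section, p-values were corrected for multiple testing with the Bonferroni-Holm procedure. A minimal sketch of that step-down adjustment is given below; it follows the standard definition (sorted p-values multiplied by the number of remaining hypotheses, made monotone, and capped at 1). The function name and the example p-values are hypothetical and are not results of this study.

```python
def holm_adjust(p_values):
    """Return Bonferroni-Holm adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must not decrease along the sorted sequence.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Invented raw p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
for p, p_adj in zip(raw, holm_adjust(raw)):
    print(p, round(p_adj, 3), "significant" if p_adj <= 0.05 else "not significant")
```

In this toy example only the smallest p-value stays below 0.05 after adjustment, although all three raw values are below that threshold.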

Discussion
We evaluated the performance of various COVID-19-specific as well as commonly used risk scores in a retrospective cohort of COVID-19 patients over the course of three waves of the pandemic. In our study, the COVID-specific scores COVID-GRAM and 4C-Score failed to show significant superiority compared with the NEWS model, especially for the prediction of ICU admission. Liang et al. reported an AUC of 0.88 (CI 0.85-0.91) in the derivation cohort of COVID-GRAM for the composite endpoint of ICU admission, need for invasive ventilation, and death [2]. Our findings did not reproduce this, with an AUC of 0.75 for ICU admission and 0.65 for in-hospital mortality. Knight et al. reported an AUC of 0.79 (CI 0.78-0.79) in the derivation cohort of the 4C-Score for the endpoint in-hospital mortality [3]. In our first cohort, which might reflect the original derivation cohort, we found an AUC of 0.84 for ICU admission and 0.87 for in-hospital mortality, thus confirming their results. Using COVID-GRAM and the 4C-Mortality score in clinical assessment is more complex than using NEWS: both use more variables and require laboratory parameters and information on relevant comorbidities, COVID-GRAM additionally requires radiological assessment, and COVID-GRAM needs to be calculated with a calculator available online. NEWS includes more vital signs than either COVID-specific score, and the assessment can easily be made at the bedside. In everyday clinical practice, this is a clear advantage. The different baseline characteristics of the derivation cohorts (especially the average age, ICU admission rates, and mortality rates) and differences in epidemiological conditions (e.g., healthcare systems, need for triage) might explain the differences in performance in our cohort. It has been demonstrated before that the impact of changing vital sign categories on prognosis in terms of mortality is larger in older patients [23].
This is particularly relevant for the derivation cohort of COVID-GRAM, with a mean age of 48.9 years, in comparison with our cohort, with a mean age of 65 years. 4C performed better in our cohort than COVID-GRAM but did not show significant superiority to NEWS. Interestingly, we could not reproduce the AUC findings of the COVID-GRAM. This underlines the importance of evaluating risk scores in different settings. Our findings are in line with a previous study conducted in Italy, which also described the differences between COVID-GRAM (AUC 0.785, CI 0.723-0.838), 4C-Score (AUC 0.799, CI 0.738-0.851), and NEWS (AUC 0.764, CI 0.700-0.819) as not statistically significant with regard to all-cause in-hospital death [7]. Another investigation found that NEWS2 (an advanced version of the NEWS examined in this study, including hypercapnic respiratory failure instead of oxygen saturation) predicted critical illness in COVID-19 better than COVID-GRAM (AUC of 0.87, CI 0.80-0.93 for NEWS2 and of 0.77, CI 0.68-0.85 for COVID-GRAM) [5]. CRB-65 and qSOFA, commonly used risk scores for pneumonia and sepsis, were both clearly outperformed by NEWS. This highlights the COVID-specific importance of monitoring oxygen saturation in addition to the respiratory rate, possibly influenced by the silent hypoxemia that is characteristic of severe COVID-19 [24].
The in-hospital-mortality and ICU-admission rates significantly decreased during the study period across the three defined cohorts. The ICU-admission rate in our cohort in particular is comparable with national data: according to data from a German federal hospital payment institute, the national ICU-admission rate of hospitalized patients was 30% in the first wave and 14% in the second wave [25], compared with 37% and 16%, respectively, in our cohort. The in-hospital mortality in our cohort was 17% in the first wave and 5% and 0% in the second and third waves, respectively. In a study aggregating health insurance data of 158,490 German patients, the in-hospital mortality in the three waves was 22.2%, 21.7%, and 14.8% [26]. Of note, significant differences between hospitals were seen in that study.
One reason for the difference that can be discussed is a significantly lower mean age in our cohort than in the above-mentioned cohort (67 and 65 years versus 72 and 74 years in the first and second wave, respectively). Our hospital has a specific expertise in ARDS treatment, which might have modified the treatment outcome. For the third wave, patient numbers in our cohort were too low to provide statistically significant numbers.
A decrease in the in-hospital-mortality and ICU-admission rates has also been described in other countries, mostly interpreted as the composite effect of better treatment and the protective effect of immunization and vaccination within the population [27]. Given the cohort definition that we chose, which covers three different "waves" of patients with different treatment options and availability of vaccination, the same can be assumed for our data. The effect of the vaccination rollout during cohort 3 is presumably reflected in the reduced average age of the mostly unvaccinated hospitalized patients, since vaccination was initially prioritized for older patients and those with severe comorbidities. As the in-hospital-mortality and ICU-admission rates decreased, the discriminatory ability of the scores also decreased. Due to the small sample size, we did not calculate the AUC and PR-AUC of the third cohort, but it can be assumed that the trend continues with further decreasing in-hospital-mortality and ICU-admission rates.
A further evaluation of risk assessment according to rapidly changing epidemiological circumstances is needed, especially considering new variants, treatment, and prevention options. According to our findings, a risk stratification of patients at admission could be recommended in Germany using the simple bedside score NEWS.

Limitations
Due to the retrospective nature of the study, we had to exclude patients with missing variables, thus possibly creating a bias toward more severely ill patients, for whom more parameters are documented. Furthermore, the sample size in the third cohort was considerably smaller than those in the other cohorts. The study was conducted at a university hospital (tertiary care hospital and center for extracorporeal membrane oxygenation), creating a further possible bias. Since the analysis covered the period until June 2021, the succeeding variants of concern Delta and Omicron are not reflected in our study. On the other hand, the hospitalization, mortality, and ICU-admission rates declined steadily, especially with the Omicron variants, so it can be assumed that the discriminatory ability of the scores decreases further. The strength of our study is a broad comparison of several widely used risk scores in a German population. Furthermore, we provide a comparison of score performance over the course of the pandemic with changing epidemiological and therapeutic conditions.

Conclusions
Interestingly, in our hospitalized cohort, the COVID-specific risk scores COVID-GRAM and 4C were not superior to NEWS, especially in predicting ICU admission, even though they use additional risk factors acknowledged for severe COVID-19 disease. As a simple bedside scoring system, NEWS proved to be a useful tool to identify hospitalized patients at risk for critical illness in our study population. This finding is especially important in real-life clinical circumstances, and the use of the NEWS score for the risk stratification of patients at hospital admission should be discussed. Decreasing mortality rates and ICU-admission rates affected the discriminatory ability of all scores, so further investigation will be needed to keep pace with the highly dynamic epidemiological situation.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jpm12111775/s1, Figure S1: Schematic workflow of study; Figure S2: Plots of area under receiver operating characteristics of (a) qSOFA, (b) COVID-GRAM, (c) NEWS, (d) CRB-65, and (e) 4C-Score with regard to ICU admission and in-hospital mortality in the first and second cohorts; Table S1: Missing variables in calculation of scores; Table S2: Characteristics of excluded patients; Table S3: Discriminatory indices at several cutoffs of scores.

Informed Consent Statement:
Patient consent was waived according to German law and the Institutional Review Board due to the importance of this study subject and the retrospective study design. Clinical routine data were retrospectively collected and pseudonymized for analysis.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. Since informed patient consent was waived according to German law, the data are not publicly available.

Conflicts of Interest:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.