1. Introduction
Multiple studies have shown that early detection of deterioration and early intervention in ward patients can significantly improve patient outcomes and potentially save lives [1,2]. Early identification of deterioration followed by timely intervention can also shorten a patient’s length of stay in the hospital and avoid intensive care unit (ICU) transfers, leading to reduced resource utilisation and substantial cost savings for the hospital [3,4].
Deteriorating physiological parameters are often present in patients around 24 h prior to ICU transfer or in-patient death, and may sometimes go unnoticed [5]. To avoid this, various early warning systems (EWS) are used worldwide to track patient vitals and detect clinical deterioration. The National Early Warning Score 2 (NEWS2) was developed based on expert opinion, incorporating clinical experience, to identify key physiological parameters indicative of patient deterioration. It is the standard system recommended by the Royal College of Physicians and is used for monitoring deterioration in adult patients across NHS England [6]. The NEWS2 score is calculated from the patient’s vital signs and does not include other potentially important variables, such as the patient’s age, frailty, and laboratory test results.
Despite its widespread use, both published evidence and clinician experience indicate that NEWS2 generates a high number of false positives. Rather than signalling only acute deterioration, the system also repeatedly alerts for persistently abnormal but clinically stable observations. For example, a patient admitted with a chronically high respiratory rate may continuously trigger a high NEWS2 score despite showing no new clinical decline. Such repeated alerts place a substantial cognitive and emotional burden on clinical staff. Frequent false alarms increase workload, interrupt clinical tasks, and erode trust in early warning systems, leading to clinicians discounting or delaying responses to alerts. This phenomenon of alert fatigue has been widely recognised across clinical decision support systems [7]. Unnecessary escalation due to inappropriate alerting may also negatively affect patients. Recurrent monitoring, overnight observations, and repeated clinical reviews can disrupt sleep, increase anxiety, and contribute to a perception of clinical instability even in the absence of true deterioration.
There is potential to improve early detection of deterioration by using real-time patient-specific data collected by the Electronic Patient Record (EPR) system. Indeed, alternative EWS have recently been developed and incorporated alongside NEWS2 in healthcare settings, such as the Rothman Index [8], which takes a data-driven approach using statistical modelling to create a composite score, and eCARTv5 [9], which uses discrete-time survival analysis to predict deterioration. Machine learning techniques have also gained popularity in attempts to create a system that tracks patient data in real time and raises an alert when deterioration is detected. For example, the Advanced Alert Monitor score was created using a discrete-time logistic regression model [10].
The objective of this study was to build upon this literature of bespoke EWS by developing a novel prognostic tool to detect patient deterioration in real time, issue alerts with higher precision than NEWS2, and thereby reduce alert fatigue burden for clinicians. This report outlines the data pre-processing steps, model development, and comparison of its performance to NEWS2. Additionally, the report will address the limitations, potential biases, and future enhancements to improve the model’s predictive capabilities.
2. Materials and Methods
2.1. Study Setting
This retrospective observational cohort study was conducted at Wrightington, Wigan and Leigh (WWL) Teaching Hospitals NHS Foundation Trust, UK, using data from adult inpatients admitted to the general medical and surgical wards of the Royal Albert Edward Infirmary, Wigan. The study received institutional approval in accordance with WWL research governance and ethical regulations (study reference: Corp/SE/2025-26/14). This study was reported in accordance with the STROBE guidelines for observational cohort studies and the TRIPOD guidelines for prognostic model development and validation.
2.2. NEWS2
The National Early Warning Score 2 (NEWS2) is the system recommended across NHS England and is currently used to assess the severity of a patient’s condition and detect potential clinical deterioration [6]. It assigns scores based on six physiological parameters: respiratory rate, oxygen saturation, systolic blood pressure, heart rate, level of consciousness, and temperature. Each parameter is scored between 0 and 3 based on deviation from normal values. A total NEWS2 score is calculated by summing the scores for each parameter; the higher the total NEWS2 score, the greater the clinical risk.
The frequency of monitoring patient vital signs increases as the NEWS2 score rises. A NEWS2 score of 7 or higher warrants an emergency response and continuous monitoring of vital signs. In this study, NEWS2 thresholds of 7 and above are used to evaluate performance against the proposed model.
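The aggregation described above can be sketched as follows. This is a minimal illustration based only on the description in this section: the parameter names and function names are hypothetical, and the per-parameter scoring bands of the published NEWS2 chart are deliberately not reproduced here, so sub-scores are taken as inputs.

```python
from typing import Mapping

# Parameter names are illustrative labels for the six parameters listed above.
NEWS2_PARAMETERS = (
    "respiratory_rate",
    "oxygen_saturation",
    "systolic_bp",
    "heart_rate",
    "consciousness",
    "temperature",
)

# A total of 7 or higher warrants an emergency response (see text).
EMERGENCY_THRESHOLD = 7


def news2_total(subscores: Mapping[str, int]) -> int:
    """Sum the per-parameter sub-scores (each between 0 and 3)."""
    total = 0
    for name in NEWS2_PARAMETERS:
        score = subscores[name]
        if not 0 <= score <= 3:
            raise ValueError(f"{name} sub-score must be 0-3, got {score}")
        total += score
    return total


def is_emergency(total: int) -> bool:
    """True when the total reaches the emergency-response threshold."""
    return total >= EMERGENCY_THRESHOLD
```

Keeping the band lookup separate from the summation means the same aggregation code can be reused if the scoring bands are supplied from a validated source.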
2.3. Data Collection
Data were sourced from the EPR (Sunrise, Altera Digital Health, Raleigh, NC, USA) at WWL, a comprehensive electronic health record platform that stores detailed patient information. The EPR contains patient demographic information (e.g., sex, age), clinical information (e.g., specialty, frailty), vital signs, laboratory results, and time-stamped records of key events such as admissions, discharges, ward transfers, surgery, etc. Patient observations and events are stored in the EPR in real time as they are recorded in the ward.
The study population comprised all adult patients (age ≥ 18 years) admitted to acute medical and surgical wards between 1 January 2022 and 31 October 2024. Only patients with a completed hospital stay were included in the analysis. Maternity patients and patients on end-of-life care were excluded.
We included three categories of outcome marker for patient deterioration: (i) transfer to the Intensive Care Unit (ICU) or High Dependency Unit (HDU), excluding patients admitted directly from Accident and Emergency (A&E); (ii) unplanned surgery, defined as emergency surgical intervention due to clinical deterioration; and (iii) in-patient death, defined as mortality occurring during the hospital stay.
For model development, clearly identified unplanned surgeries were included in the training outcome to reflect true deterioration and to modestly mitigate class imbalance by increasing the number of positive events. However, we did not use unplanned surgery when testing performance because the dataset did not reliably distinguish planned from deterioration-driven procedures. Only cases that could be confidently verified as unplanned and deterioration-related were incorporated during training; all others were excluded from the test outcome.
The study size was determined by the availability of routinely collected electronic health record data during the study period. We included all eligible adult inpatients to maximise the number of deterioration events, to capture a wide case-mix of patients, and to ensure stable model estimation. The start of the study period was selected to minimise the confounding effects of post-COVID operational recovery, while the end date corresponded to the latest date for which complete, validated data were available.
The data were split into training and validation cohorts using a temporal design to enable conservative model evaluation (Table 1). The training set included patients admitted between 1 January 2022 and 31 July 2024 (n = 32,787; deterioration rate = 6.2%), while the temporal validation set included patients admitted between 1 August 2024 and 31 October 2024 (n = 2577; deterioration rate = 1.9%). The validation cohort was defined as the most recent calendar period to mimic prospective model deployment, rather than through random sampling. As a result, the validation period, being more recent, had a lower deterioration rate than the training period, thereby providing a more conservative estimate of performance.
The prospective test cohort comprised patients admitted between 1 November 2024 and 31 January 2025 (n = 2625; deterioration rate = 3.3%). To assess the stability of the model performance over a longer timeframe, we conducted an informal temporal validation by extending the test period to 30 September 2025 (n = 11,048; deterioration rate = 2.8%) (Table A3).
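A temporal split of this kind can be expressed as a simple lookup on admission date. The sketch below uses the cohort boundaries stated above; the function name and the cohort labels are illustrative, and the informal extension to 30 September 2025 is left out for brevity.

```python
from datetime import date
from typing import Optional

# Cohort boundaries (inclusive) taken from the admission dates in the text.
COHORTS = (
    ("training", date(2022, 1, 1), date(2024, 7, 31)),
    ("validation", date(2024, 8, 1), date(2024, 10, 31)),
    ("test", date(2024, 11, 1), date(2025, 1, 31)),
)


def assign_cohort(admission_date: date) -> Optional[str]:
    """Return the cohort for an admission date, or None if outside all periods."""
    for label, start, end in COHORTS:
        if start <= admission_date <= end:
            return label
    return None
```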
2.4. Data Preprocessing
The data were extracted from the EPR as three separate tables: vital signs, laboratory tests, and clinical indicators. Following independent cleaning and addressing missing values, the tables were merged using the patient’s Visit ID as the unique identifier. Missing data were handled in the preprocessing pipeline without excluding patients, thereby reducing the risk of complete-case bias.
Each table contained multiple rows per patient, representing observations recorded at different time points during their hospital stay. To standardise the timing, observation times were rounded to the nearest hour. In cases where multiple observations were recorded within the same hour, the mean or maximum value was used, depending on the nature of the variable (for further details, see Appendix A.1). For each hour, missing values were initially filled forward using the most recent available observation to ensure temporal continuity. For any remaining missing values after forward-filling, a nearest neighbour imputation method was applied. This method leveraged data from similar patients, identified based on shared non-missing attributes, to estimate missing values.
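The hourly alignment and forward-fill steps can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the study's pipeline: the function names are hypothetical, timestamps exactly on the half hour are rounded up, and the nearest neighbour imputation step is not shown.

```python
from datetime import datetime, timedelta
from statistics import mean


def round_to_hour(ts: datetime) -> datetime:
    """Round a timestamp to the nearest hour (assumption: :30 rounds up)."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    if ts - floored >= timedelta(minutes=30):
        return floored + timedelta(hours=1)
    return floored


def hourly_series(observations, aggregate=mean):
    """Build a gap-free hourly series from (timestamp, value) pairs.

    Values falling in the same rounded hour are combined with `aggregate`
    (mean by default; pass `max` for variables where the maximum is used).
    Hours without an observation carry forward the most recent value.
    """
    buckets = {}
    for ts, value in observations:
        buckets.setdefault(round_to_hour(ts), []).append(value)
    if not buckets:
        return []
    hour, last_hour = min(buckets), max(buckets)
    series, previous = [], None
    while hour <= last_hour:
        if hour in buckets:
            previous = aggregate(buckets[hour])
        series.append((hour, previous))  # forward fill when the hour is empty
        hour += timedelta(hours=1)
    return series
```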
Among vital signs, missing data rates were all below 0.05%, while supplemental oxygen percentage had 2.16% missing data. In laboratory tests, BNP (B-type Natriuretic Peptide) had the highest rate of missingness at 92%, whereas all other laboratory tests had missing rates below 15%. For laboratory tests with a high proportion of missing values, a normal reference value was imputed, based on the assumption that these tests were not performed because they were not clinically indicated. The higher rate of missing data in laboratory tests was primarily because not all tests are routinely performed for every patient.
Despite its high missingness, BNP was retained as a predictor because it reflects cardiac and haemodynamic stress frequently associated with deterioration and its absence often reflects lack of clinical indication rather than data quality loss [11]. However, to ensure robust model performance, we performed a sensitivity analysis by refitting the model without BNP. Model performance and hazard estimates for key covariates were materially unchanged, indicating that BNP is not required for robust model performance.
For clinical indicators such as palliative care flag and pain scores, missing values were assumed to indicate the absence of the condition or attribute. For instance, a missing palliative care flag was treated as zero, assuming the patient was not in palliative care. For frailty scores, which had 13% missing data, the median frailty score for patients within the same age group was used as the imputed value.
Appendix A.2 provides a detailed overview of the imputation techniques used for each variable with missingness, along with the percentage of missing data.
2.5. Feature Engineering
Predictors were selected based on their established relevance in previous early warning systems (EWS) and on clinical consensus regarding variables that reflect evolving physiological deterioration [8,9,10]. The final set of predictors comprised vital signs, laboratory tests, and clinical indicators that are routinely and reliably recorded in the EPR. Each variable was included both as its current value and, where clinically meaningful, as derived features representing the minimum, maximum, and trend over the preceding 12 h to capture temporal dynamics. The NEWS2 score and its 12-h trend were also included given their widespread clinical use and known association with deterioration risk. Table 2 lists all predictors used for model training.
Diagnosis and comorbidity are important predictors of patient outcomes, but these variables evolve during admission and were not consistently available in structured form. During hospital stay, comorbidities and diagnoses are often recorded only in ward round notes, which are unstructured free text and lack standardisation, making reliable extraction difficult. Final coded diagnoses and comorbidities are typically assigned retrospectively by clinical coders after discharge. For these reasons, and to reduce potential sources of bias due to inconsistencies in extraction, diagnosis and comorbidity were excluded from model training.
For completeness, we minimised bias by including all eligible patients during the study period, thereby reducing the risk of selection bias. Missing data were handled using a structured imputation pipeline to avoid complete-case bias. Diagnosis and comorbidity variables were excluded due to inconsistent availability in structured form, mitigating potential misclassification bias. Finally, a temporally separated validation dataset reduced the risk of optimistic performance estimates.
2.6. Model Selection
The training dataset exhibited significant class imbalance, with only 6.2% of patients experiencing a deterioration event. We evaluated several classification approaches, including logistic regression, random forest [12], and XGBoost [13], alongside common class-balancing techniques (oversampling, undersampling, and synthetic data generation with SMOTE [14]). These methods yielded inferior performance compared with NEWS2 and were therefore not pursued further. The best performance was achieved with a Cox proportional hazards model incorporating time-varying covariates. This approach naturally accounts for censoring, a frequent issue in clinical data when deterioration has not occurred by the end of observation, and leverages the time-to-event structure of the data to provide more nuanced predictions than traditional classifiers.
2.7. Cox Proportional Hazard Model with Time-Varying Covariates
The Cox proportional hazards model is a widely used statistical approach in survival analysis, particularly suited for time-to-event data; however, to our knowledge, it has not been applied to clinical deterioration prediction. Unlike traditional classification models, which provide binary predictions, the Cox model estimates the hazard or risk of an event occurring at a given time, allowing for a more dynamic assessment of patient deterioration [15]. Time-varying covariates were incorporated to ensure that risk estimates reflected the evolving physiological state of patients rather than static baseline values [16]. This allowed the Cox model to update the hazard of deterioration dynamically as new observations were recorded. The implications of using time-varying covariates for distinguishing stable chronic abnormalities from true deterioration, and how this compares with the behaviour of NEWS2, are considered further in the Discussion.
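For reference, the standard textbook form of a Cox model with time-varying covariates can be written as below. The notation is generic and is not taken from the study itself: h_0(t) denotes the baseline hazard, x(t) the covariate vector observed at time t, β the coefficient vector, and η(t) the linear predictor.

```latex
h\bigl(t \mid \mathbf{x}(t)\bigr) = h_0(t)\,\exp\!\bigl(\boldsymbol{\beta}^{\top}\mathbf{x}(t)\bigr),
\qquad
\eta(t) = \boldsymbol{\beta}^{\top}\mathbf{x}(t)
```

Because x(t) is re-evaluated whenever new observations arrive, the hazard changes over time through the covariates, while the coefficients β remain fixed.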
A key assumption of the Cox model is that predictors maintain a linear relationship with the logarithm of the hazard function. However, many physiological variables follow nonlinear trajectories over time, which can violate this assumption. To better capture temporal dynamics while preserving linearity in the model, we performed feature engineering: each vital sign was expanded into three additional features (minimum, maximum, and trend over the last 12 h). For example, in addition to the current respiratory rate, we included the maximum and minimum values observed in the past 12 h, and a trend feature defined as the difference between the latest and 12-h-earlier measurements.
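The 12-h derived features described above can be sketched as follows. This is an illustrative implementation, not the study's code: it assumes a gap-free hourly series (oldest value first), treats the current value as part of the window for the minimum and maximum, and falls back to the earliest available value for the trend when fewer than 12 h of history exist. The feature names are hypothetical.

```python
def window_features(hourly_values, window_hours=12):
    """Derive current, minimum, maximum, and trend features for one variable.

    hourly_values: one value per hour, oldest first, latest last.
    Trend is the latest value minus the value `window_hours` earlier
    (or minus the earliest available value when history is shorter).
    """
    if not hourly_values:
        raise ValueError("at least one observation is required")
    # window_hours + 1 points span exactly `window_hours` hours.
    window = hourly_values[-(window_hours + 1):]
    current = window[-1]
    return {
        "current": current,
        "min_12h": min(window),
        "max_12h": max(window),
        "trend_12h": current - window[0],
    }
```

For a respiratory rate that climbs from 16 to 28 over 12 h, the trend feature is 12, whereas a patient who stays at 28 throughout has a trend of 0 despite the same current value.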
In addition, we explored functional transformations of the covariates (logarithmic, quadratic, and spline terms) to address potential nonlinearities and evaluated time-dependent coefficients using interactions with time terms. These did not improve model performance, so the engineered features were retained but no further transformations were applied.
Backward feature selection was used to identify the optimal set of predictors for the model. The selection process utilised the Akaike Information Criterion (AIC) to balance model complexity and fit. Features were iteratively removed, and the model was re-evaluated at each step to ensure that the final set of predictors maximised predictive performance. The final Cox model was fitted using these transformed predictors (Table 2). Models were deployed using R version 4.2.2 with the survival package [17].
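Backward elimination guided by AIC can be sketched generically as below. The sketch is written in Python for illustration only, whereas the study used R; `aic_of` is a hypothetical callable that fits a model on a given feature set and returns its AIC, and the stopping rule (stop when no single removal lowers the AIC) is an assumption, since the text does not specify one.

```python
from typing import Callable, FrozenSet, Iterable


def backward_select(
    features: Iterable[str],
    aic_of: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Greedy backward elimination: drop one feature at a time while AIC falls."""
    selected = frozenset(features)
    best_aic = aic_of(selected)
    while len(selected) > 1:
        # Evaluate the AIC of every model with exactly one feature removed.
        candidates = [(aic_of(selected - {f}), f) for f in sorted(selected)]
        candidate_aic, dropped = min(candidates)
        if candidate_aic >= best_aic:
            break  # no removal improves the criterion
        selected, best_aic = selected - {dropped}, candidate_aic
    return selected
```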
The final model was tested for violations of the proportional hazard assumption using Schoenfeld residuals. Several predictors demonstrated evidence of non-proportional hazards, which is expected in short-term, dynamic physiological data. Consistent with recent studies showing that moderate proportional-hazards violations generally have little impact on the predictive calibration of Cox-based models [18], and given that our primary aim is accurate 12-h risk prediction rather than estimation of time-constant effects, we prioritised discrimination and out-of-sample performance over strict proportional hazard adherence. Accordingly, estimated hazard ratios should be interpreted as average short-term associations with deterioration risk within the prediction window, rather than as constant effects operating over a patient’s entire hospital stay.
2.8. Validation and Alert Threshold
To utilise the model’s risk output in predicting patient deterioration and to create an intuitive clinical tool, we designed a tiered “traffic light” alert system, in which amber and red alerts represented increasing levels of risk. Predictions were generated from the most recent patient observations, and alerts were triggered when predicted risk exceeded the pre-defined amber or red thresholds. The model outputs a continuous, unitless prognostic score corresponding to the Cox model’s linear predictor, which increases monotonically with predicted short-term risk of deterioration and is recalculated dynamically as new patient observations become available.
The model was designed to predict deterioration 0–24 h before the event. Each prediction was assigned to either the 0–12 h or 12–24 h window. Accuracy metrics (recall, precision, and F1 score) were calculated across a range of model and NEWS2 thresholds (7–11). Predictions were classified as true positives (TP), false positives (FP), false negatives (FN), or true negatives (TN) according to the following rules: a TP was a correct prediction of deterioration within the 0–24 h window; a FP was a prediction of deterioration when none occurred; a FN was a missed deterioration within the 0–12 h window; and a TN was a correct prediction of no deterioration. Importantly, predictions below the threshold in the 12–24 h window were not considered false negatives, as the 0–12 h window was regarded as the hard limit for required predictions, while the 12–24 h window allowed for early warning without penalty. The 12-h threshold was selected to align with NICE guidance for inpatient monitoring, which recommends a minimum observation frequency of once every 12 h for patients with a NEWS2 score of 0 [19]. Operationally, this guidance implies that updated physiological observations should be available at least within each 12 h cycle for stable inpatients. As such, failure to raise an alert within 12 h of deterioration was considered a missed prediction, whereas alerts generated 12–24 h in advance were treated as early warnings without penalty.
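The classification rules can be sketched in code as below. This is one interpretation of the rules, not the study's implementation: it reads a true positive as an alert followed by deterioration anywhere in the 0–24 h horizon, and it labels a non-alert in the 12–24 h window as "unpenalised" and leaves it out of the metrics, which is an assumption because the text only says such predictions were not counted as false negatives. Function names are hypothetical.

```python
from collections import Counter
from typing import Optional


def classify(alert: bool, hours_to_event: Optional[float]) -> str:
    """Label one prediction; hours_to_event is None when no event follows."""
    event_within_24h = hours_to_event is not None and 0 <= hours_to_event <= 24
    if alert:
        return "TP" if event_within_24h else "FP"
    if event_within_24h:
        # Only misses inside the 0-12 h window count against the model.
        return "FN" if hours_to_event <= 12 else "unpenalised"
    return "TN"


def metrics(predictions):
    """Compute precision, recall, and F1 from (alert, hours_to_event) pairs."""
    counts = Counter(classify(alert, hours) for alert, hours in predictions)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Applying the same functions to model alerts and to NEWS2 alerts keeps the comparison between the two on identical rules.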
For comparability, the same classification rules were applied to both the model and NEWS2. Model and NEWS2 performance were assessed against ICU transfers and in-patient deaths. To facilitate comparison with NEWS2, model thresholds were mapped to approximate NEWS2 scores by aligning precision and recall across the validation dataset, allowing equivalence between model outputs and NEWS2 thresholds (Table A2). The mapping demonstrated that a model risk score of 20 yielded performance comparable to NEWS2 ≥ 7. For operational deployment we selected more conservative alert thresholds to prioritise precision and reduce alert burden. The amber threshold was therefore set at a model risk score of 50, as a step above the NEWS2 equivalent performance, while the red threshold was set at a risk score of 80 to maximise precision.
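The tiered alert logic amounts to a pair of threshold comparisons, sketched below. The threshold values come from the text; the function name, the tier labels, and the choice of inclusive comparisons (a score exactly at the threshold triggers the alert) are assumptions.

```python
AMBER_THRESHOLD = 50  # amber alert threshold on the model risk score
RED_THRESHOLD = 80    # red alert threshold on the model risk score


def alert_tier(risk_score: float) -> str:
    """Map a model risk score to an alert tier ("red", "amber", or "none")."""
    if risk_score >= RED_THRESHOLD:
        return "red"
    if risk_score >= AMBER_THRESHOLD:
        return "amber"
    return "none"
```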
Importantly, the comparison between the model and NEWS2 performance is based solely on ICU transfers and in-patient deaths. Emergency surgeries were excluded from the analysis as they range from minor to major procedures and were not labelled within the dataset. Including surgeries would require careful filtering on a case-by-case basis, making them unsuitable for this analysis.
2.9. Proximity of Alerts to Event
A primary aim of model development was to reduce clinician alert fatigue by producing a higher-precision model. To assess this, we measured the proportion of alerts by lead time to deterioration. For each alert, we located the next deterioration event for that patient and computed the time difference in hours between the alert timestamp and the event timestamp. If multiple events occurred, we used the first one after the alert. This gave the proportion of alerts as a function of hours before the deterioration event. Alerts not followed by any deterioration during the admission were treated as no-event alerts and excluded from this figure; they are counted as false positives.
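The lead-time calculation can be sketched as follows. Function names are hypothetical, and an event recorded at exactly the alert time is treated here as following the alert with a lead time of zero, which is an assumption.

```python
from bisect import bisect_left
from datetime import datetime
from typing import Iterable, Optional, Sequence


def lead_time_hours(alert_time: datetime, event_times: Iterable[datetime]) -> Optional[float]:
    """Hours from an alert to the first event at or after it, else None."""
    events = sorted(event_times)
    idx = bisect_left(events, alert_time)
    if idx == len(events):
        return None  # no-event alert: counted as a false positive elsewhere
    return (events[idx] - alert_time).total_seconds() / 3600


def proportion_within(lead_times: Sequence[Optional[float]], max_hours: float) -> float:
    """Share of event-followed alerts whose lead time is at most max_hours."""
    followed = [lt for lt in lead_times if lt is not None]
    if not followed:
        return 0.0
    return sum(lt <= max_hours for lt in followed) / len(followed)
```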
4. Discussion
This study set out to develop an early warning system (EWS) capable of predicting patient deterioration with greater precision than the conventional NEWS2 score. Using a Cox proportional hazards model with time-varying covariates, we sought to capture the dynamic nature of patient physiology and provide calibrated alerts. In our live hospital deployment, NEWS2 demonstrated higher recall, identifying a larger proportion of deteriorating patients, but this came at the expense of markedly lower precision and a substantially higher alert burden (Figure 4). In contrast, our model’s red alert threshold achieved a threefold improvement in precision over NEWS2 (60% vs. 16%), while amber alerts provided intermediate recall. This trade-off is clinically important: frequent false alarms contribute to increased workload, unnecessary disruption to patients, erosion of trust in EWS, and alert fatigue, a well-documented concern in clinical monitoring [7]. By reducing unnecessary alerts while maintaining temporal proximity to true events, our model may support more timely and actionable clinical responses. Although the planned evaluation period concluded in January 2025, an informal extension of the analysis through September 2025 demonstrated stable model performance and maintained superiority in precision over NEWS2 (Figure A1 and Figure A2), suggesting robustness to evolving case-mix and operational conditions.
A key advantage of the Cox framework with time-varying covariates is its ability to model deterioration relative to an evolving patient baseline. Unlike NEWS2, which repeatedly triggers alerts for persistently abnormal but stable parameters (e.g., patients admitted with chronically high respiratory rate), our model updates hazard estimates only when meaningful changes occur. Indeed, Figure 3 illustrates this behaviour: the model-derived risk score returned to a new baseline below the red-alert tier after clinical stabilisation, suggesting our model more accurately tracks true deterioration than the static score of NEWS2. In our setting, 82% of red alerts and 74% of amber alerts occurred within 24 h of deterioration, compared to only 43% of NEWS2 alerts, many of which fired more than 84 h before the event (Figure 4b and Figure A2).
Direct comparisons with other published early warning models, such as eCART and the Advanced Alert Monitor (AAM), are constrained by differences in study design, patient populations, and endpoints. Nevertheless, these studies provide useful context. For example, eCARTv5 achieved an AUROC of 0.895, outperforming AAM (AUROC 0.82) in predicting deterioration [9,10]. However, AUROC is limited in imbalanced datasets like ours (3.3% event rate) because it is heavily influenced by the abundance of true negatives. Precision–recall (PR) curves provide a more informative assessment in this context [20,21]. At comparable levels of precision, our model achieved higher recall than both eCARTv5 and AAM in the high-precision operating range (e.g., 42% precision with 18% recall at the amber threshold). These results suggest our model performs particularly well in scenarios where reducing false alarms is a priority, even if this comes at the cost of lower recall at higher thresholds.
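The effect of class imbalance on precision can be made concrete with a short calculation. The sketch below applies the standard relationship between sensitivity, specificity, prevalence, and precision; the sensitivity and specificity values in the usage note are hypothetical and are not results from this study.

```python
def precision_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (positive predictive value) implied by the three rates."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)
```

With a 3.3% event rate, a hypothetical classifier with 80% sensitivity and 90% specificity yields a precision of roughly 21%, whereas the same classifier at 50% prevalence yields roughly 89%. Sensitivity and specificity, and hence the ROC curve, are unchanged between the two cases, which is why PR-based metrics are more revealing when events are rare.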
From a clinical perspective, the high precision of the red alert threshold suggests that our model may be particularly useful in confirming deterioration, helping clinicians focus on patients with the highest probability of imminent deterioration. However, the lower recall at this threshold indicates that some deteriorations will be missed if the red alert is used in isolation. Integrating both amber and red tiers offers a stratified approach: amber alerts can prompt increased monitoring, while red alerts provide high-confidence triggers for emergency intervention. Importantly, this positions our tool not as a replacement for NEWS2 but as a synergistic layer—NEWS2 provides broad recall, while our model’s tiered alerts sharpen precision, supporting escalation decisions and reducing alert fatigue.
Local feedback from the Critical Care Outreach Team and acute medical consultants at WWL supports this interpretation, highlighting the potential of combining NEWS2 and model-based alerts in daily practice. Clinicians reported that the model’s alerting behaviour aligned with their clinical perception of deterioration and that the interpretation was intuitive. In particular, they appreciated the model’s behaviour following deterioration and clinical stabilisation, in which predicted risk returned to a new baseline, which was perceived as distinct from NEWS2’s tendency to repeatedly flag persistently abnormal but stable physiology. This behaviour may support more meaningful clinical responses by distinguishing new deterioration from previously recognised risk, thereby reducing redundant alerts and reinforcing clinician confidence in escalation decisions.
Notably, when the analysis was informally extended to 30 September 2025, the model demonstrated consistent performance and outperformed NEWS2 across all clinically relevant emergency response thresholds (Figure A1). Specifically, a model threshold of ≥20 surpassed the NEWS2 threshold of ≥7, and this advantage persisted at every NEWS2 score > 7. These findings suggest that, as the model was exposed to a broader and more recent cohort, its discrimination improved relative to NEWS2, strengthening confidence in its clinical utility. While NEWS2 may still serve as a universal early screening tool, our model’s stability and superior performance across all operating points indicate that it could ultimately supersede or refine the NEWS2 role in escalation pathways rather than merely complement it.
An additional strength of our approach lies in its potential to identify pre-symptomatic features of deterioration. By incorporating temporal features (e.g., trends, minima, maxima) and dynamically updating covariates, the model may detect subtle shifts in physiology before overt clinical thresholds are breached. This raises the possibility of intervention earlier in the deterioration trajectory, potentially improving patient outcomes. With the addition of richer variables such as continuous monitoring data, treatment records, or natural language processing of clinician notes, this capacity for pre-symptomatic detection could be further enhanced.
While promising, our study has limitations. Our deterioration definition was limited to ICU transfer and in-patient death; potential competing events such as cardiac arrest or defibrillation outside ICU/HDU, initiation of vasopressors/inotropes, etc., were excluded and should be explored in future work. The dataset was imbalanced, with only a small proportion of patients experiencing deterioration, although this reflects the clinical reality. Additionally, while the Cox model is interpretable and transparent compared to many machine learning approaches, assumptions such as proportional hazards require careful consideration and further testing.
The study was conducted in a single NHS Trust, and external validation is needed to assess generalisability across other hospitals and patient populations. Further, diagnoses and comorbidities were excluded as a pragmatic decision due to challenges in their timely and reliable extraction from data sources, which may influence model transportability across institutions. Model performance may be affected when applied to patient populations with markedly different diagnostic or comorbidity profiles from those of our training cohort. Conversely, reliance on routinely recorded physiological variables may enhance generalisability between hospitals by reducing dependence on local coding practices and retrospectively assigned diagnoses.
Multicentre evaluation across NHS Trusts with differing patient demographics, clinical workflows, and escalation practices is a necessary next step to formally assess transportability in diverse real-world settings. In parallel, successful clinical impact will depend on effective operationalisation within routine workflows. Novel early warning systems may face barriers to adoption, including alert fatigue, workflow disruption, and clinician trust. Early operational experience suggests positive clinician engagement with our novel EWS. Future operationalisation-focused work should extend this start by evaluating usability, acceptability, and behavioural impact when integrating the model alongside NEWS2 in daily practice.