1. Introduction
Clinical data are highly heterogeneous, primarily because real-world data (RWD) are observational and lack the structured format found in controlled environments such as randomized controlled trials [1]. In addition, the completeness of RWD depends on the clinical setting and the individual patient case [2]. For instance, some patients may have extensive records owing to frequent monitoring, whereas others may have sparse data owing to fewer interactions with healthcare providers.
Two challenges arise from the heterogeneity of clinical data: missing data and informative presence [3]. Missing data are common in clinical datasets and not only add to heterogeneity but also affect the reliability and accuracy of predictive models [4,5,6]. The rate of missingness in clinical data can vary widely owing to factors such as incomplete patient records, loss to follow-up, or errors in data collection. Understanding the mechanisms underlying this missingness is essential to addressing it appropriately. Missing data mechanisms are typically categorized into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR occurs when missingness is entirely random, MAR occurs when missingness depends only on observed data, and MNAR occurs when missingness is related to the unobserved values themselves [7,8].
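To make the distinction between the three mechanisms concrete, the following minimal Python sketch simulates each one for a hypothetical lactate measurement; the variable names, distributions, and missingness rates are illustrative assumptions, not values drawn from the study data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
lactate = rng.normal(2.0, 1.0, n)    # true values, not always observed
severity = rng.normal(0.0, 1.0, n)   # a fully observed covariate

# MCAR: missingness is independent of observed and unobserved data
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on the observed covariate
mar_mask = rng.random(n) < 1.0 / (1.0 + np.exp(-severity))

# MNAR: missingness depends on the unobserved value itself
mnar_mask = rng.random(n) < 1.0 / (1.0 + np.exp(-(lactate - 2.0)))

# Applying a mask yields the dataset a model would actually see
observed = np.where(mnar_mask, np.nan, lactate)
```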
Another important concept related to missing data and heterogeneity that warrants attention is informative presence: the act of observing a data point is itself meaningful, because the presence of data can carry significant information about a patient’s condition. However, despite its potential importance, a review by Goldstein et al. encompassing 107 clinical predictive modeling studies found that none formally investigated the influence of informative presence on predictive models [3].
Given the widespread heterogeneity that can adversely affect the performance of AI predictive models [9,10,11], especially deep learning-based models, it is crucial to understand how these factors impact model performance. Developing robust clinical decision support systems (CDSS) that can reliably predict patient outcomes requires a thorough examination of how data heterogeneity affects these models.
The VitalCare–Major Adverse Event Score (VC-MAES) is a proprietary AI CDSS model that employs a deep learning algorithm to predict clinical deterioration events, such as unplanned ICU transfers, cardiac arrests, or deaths, in general-ward patients up to 6 h before they occur. VC-MAES requires six essential parameters, namely age and five vital signs, to calculate the risk score (MAES), which quantifies the risk of clinical deterioration. If available, the model also uses 13 additional parameters, including SpO2, the Glasgow Coma Scale (GCS), and laboratory data, to enhance its predictive accuracy.
In this study, we evaluated the performance of the VC-MAES model with varying levels of clinical data availability. Specifically, the effectiveness of the model was assessed when informative data were deliberately omitted by inducing missingness. Additionally, we examined the model performance when missing data were imputed using either mean-based random values drawn from the corresponding patient group or multiple imputation by chained equations (MICE), simulating laboratory values that might have been observed if the physicians had chosen to order those tests. These results were compared to those obtained using the default normal-value imputation method.
2. Materials and Methods
2.1. Study Setting
This study used real-world data from a prospective observational external validation study conducted at Keimyung University Dongsan Hospital (KUDH) in Daegu, Republic of Korea. Data were collected from patients admitted to six general wards in the Internal Medicine (IM) and Obstetrics and Gynecology (OBGYN) departments between June 2023 and January 2024. Patient monitoring continued until discharge or 30 April 2024, whichever occurred first. The model generated predictions in real time; however, these predictions were neither disclosed to healthcare providers nor incorporated into clinical decision-making processes.
The study adhered to the ethical guidelines of the Declaration of Helsinki, 1975, was approved by the Institutional Review Board (IRB) of KUDH (IRB No. 2022-12-081), which waived the requirement for informed consent, and was registered with the Clinical Research Information Service (CRIS), operated by the National Institute of Health under the Korea Disease Control and Prevention Agency (CRIS Registration Number: KCT0008466).
2.2. Study Population
The study population comprised adult patients (≥19 years) who had complete documentation of basic vital signs—systolic and diastolic blood pressure (SBP and DBP), heart rate (HR), respiratory rate (RR), and temperature. Patients who were transferred directly to the ICU from the emergency department (ED) or operating room were considered to have planned ICU transfers and were excluded from the study.
2.3. Data Collection
All relevant patient demographic and clinical information, such as vital signs, code status, surgery start and end times, medication orders, and laboratory results, was obtained from electronic health records (EHRs). Additional details regarding ICU admission and discharge, time of death, and CPR initiation and termination were also collected.
2.4. VC-MAES Model Architecture and Data Handling
VC-MAES is a proprietary predictive model that employs a deep learning algorithm to handle time-series clinical data. This binary classification model, built on a bidirectional long short-term memory (biLSTM) framework, was designed to predict the likelihood of major adverse events in general-ward inpatients. It incorporates two types of input data: (1) dynamic features represented as time-series data, sampled hourly to capture vital signs and blood test results, and (2) static features. The dynamic features were processed using the biLSTM network, whereas the static features were handled by fully connected layers. The outputs from the biLSTM and the fully connected layers were then merged and fed into additional classification layers for the final predictions. A schematic diagram of the VC-MAES model architecture is presented in Supplementary Figure S1. Further details of this classification model are provided by Sung et al. [12].
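Because the model itself is proprietary, the following PyTorch sketch illustrates only the general two-branch pattern described above (a biLSTM over hourly dynamic features and fully connected layers over static features, merged into classification layers); the class name, layer sizes, and hidden dimensions are assumptions, not the actual VC-MAES implementation.

```python
import torch
import torch.nn as nn

class TwoBranchClassifier(nn.Module):
    """Illustrative two-branch network: a biLSTM over hourly dynamic
    features plus fully connected layers over static features, merged
    into classification layers that output an adverse-event logit."""

    def __init__(self, n_dynamic: int, n_static: int, hidden: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(n_dynamic, hidden, batch_first=True,
                              bidirectional=True)
        self.static_fc = nn.Sequential(nn.Linear(n_static, hidden), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # risk logit; sigmoid gives a probability

    def forward(self, x_seq, x_static):
        # x_seq: (batch, hours, n_dynamic); x_static: (batch, n_static)
        _, (h_n, _) = self.bilstm(x_seq)            # h_n: (2, batch, hidden)
        seq_repr = torch.cat([h_n[0], h_n[1]], dim=1)
        merged = torch.cat([seq_repr, self.static_fc(x_static)], dim=1)
        return self.classifier(merged)
```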
The model was trained using a comprehensive dataset from the Yonsei Severance Hospital in Seoul, Republic of Korea, encompassing over 300,000 hospitalizations across more than 35 medical and surgical specialties from 2013 to 2017. The primary objective of VC-MAES is to predict clinical deterioration events within a six-hour window for patients in medical-surgical wards.
VC-MAES generates risk scores on a scale of 0–100 based on six core inputs: five standard vital signs (SBP, DBP, HR, RR, and temperature) and patient age. Higher scores indicate an increased likelihood of adverse events within the subsequent 6 h. When the 13 additional parameters are available (oxygen saturation, GCS, total bilirubin, lactate, creatinine, platelets, pH, sodium, potassium, hematocrit, white blood cell count, bicarbonate, and C-reactive protein), the model incorporates them to calculate a more comprehensive MAES.
To address missing data, the system employs the last observation carried forward (LOCF) method [13,14], whereby missing values are replaced with the most recently observed values. When no prior data are available, the system defaults to a normal-value imputation method that incorporates standard reference values (Supplementary Table S1).
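A minimal pandas sketch of this two-step policy is shown below; the column names (`stay_id`, `time`) and reference values are hypothetical placeholders, since the model’s actual normal values are those listed in Supplementary Table S1.

```python
import pandas as pd

# Illustrative reference values only; the model's actual normal values
# are listed in Supplementary Table S1.
NORMAL_VALUES = {"lactate": 1.0, "ph": 7.40, "creatinine": 0.9}

def locf_then_normal(df: pd.DataFrame) -> pd.DataFrame:
    """Step 1: carry the last observed value forward within each stay.
    Step 2: fill values with no prior observation using normal values."""
    lab_cols = list(NORMAL_VALUES)
    out = df.sort_values(["stay_id", "time"]).copy()
    out[lab_cols] = out.groupby("stay_id")[lab_cols].ffill()
    return out.fillna(value=NORMAL_VALUES)
```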
Owing to institutional practices restricting mental status evaluations to ICU settings, all GCS values were standardized to 15 for the MAES calculations. Additionally, when determining the National Early Warning Score (NEWS) and the Modified Early Warning Score (MEWS), the level of consciousness was consistently recorded as “Alert”, corresponding to an Alert, Verbal, Pain, Unresponsive (AVPU) score of 0.
2.5. Statistical Analysis and Model Performance Evaluation
We compared the characteristics of the patients in the control and event groups using chi-square tests for categorical variables and t-tests or Wilcoxon rank-sum tests for continuous variables, as appropriate.
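A sketch of these group comparisons using scipy is given below; the column names (`event`, `sex`, `age`, `lactate`) are hypothetical stand-ins for the study variables.

```python
import pandas as pd
from scipy import stats

def compare_groups(df: pd.DataFrame) -> dict:
    """Event vs. control comparisons; column names are hypothetical."""
    ev, ctrl = df[df["event"] == 1], df[df["event"] == 0]
    # Chi-square test for a categorical variable
    _, p_sex, _, _ = stats.chi2_contingency(pd.crosstab(df["event"], df["sex"]))
    # Welch t-test for an approximately normal continuous variable
    _, p_age = stats.ttest_ind(ev["age"], ctrl["age"], equal_var=False)
    # Wilcoxon rank-sum test for a skewed continuous variable
    _, p_lac = stats.ranksums(ev["lactate"].dropna(), ctrl["lactate"].dropna())
    return {"sex": p_sex, "age": p_age, "lactate": p_lac}
```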
Model performance was assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision–recall curve (AUPRC) across three distinct scenarios.
In the first scenario, to eliminate the informative presence in the data, we intentionally induced missingness by omitting actual data points, except for the five essential vital signs and age information. We then calculated the AUROC using only vital signs and age and compared the outcomes with those from NEWS and MEWS.
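Conceptually, this first scenario amounts to blanking every optional input so that only the six core features remain, with the withheld fields then handled by the model’s default normal-value imputation. A hypothetical sketch (column names assumed):

```python
import numpy as np
import pandas as pd

CORE = ["sbp", "dbp", "hr", "rr", "temp", "age"]   # hypothetical names
KEYS = ["stay_id", "time", "event"]                # bookkeeping columns

def induce_missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Blank every column except the six core inputs and bookkeeping
    keys, removing any informative presence carried by optional data."""
    out = df.copy()
    optional = [c for c in out.columns if c not in CORE + KEYS]
    out[optional] = np.nan
    return out
```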
In the second scenario, we reintroduced the available SpO2 and laboratory data to assess how the actual clinical data points influenced the performance of the model.
In the third scenario, to simulate the availability of all 11 laboratory test results, missing values at admission were imputed by randomly drawing values within one standard deviation (SD) of the mean observed values. This approach served as our primary comparison method and was applied separately to the event and non-event groups to more accurately reflect each group’s underlying distribution. To further validate this primary result, we conducted a secondary analysis comparing normal-value imputation with MICE, an advanced method shown in a previous study to outperform mean-based imputation [15].
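The two forced-imputation strategies could be sketched as follows. The paper does not specify the sampling distribution for the mean ± SD draw, so a uniform draw is assumed here, and scikit-learn’s IterativeImputer is used as a MICE-style stand-in rather than the study’s exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_group_mean_sd(df, lab_cols, group_col="event", seed=0):
    """Fill missing labs with draws from [mean - SD, mean + SD], computed
    separately within the event and non-event groups. A uniform draw is
    assumed; the study does not state the exact distribution."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for grp in out[group_col].unique():
        in_grp = out[group_col] == grp
        for col in lab_cols:
            mu = out.loc[in_grp, col].mean()
            sd = out.loc[in_grp, col].std()
            miss = in_grp & out[col].isna()
            out.loc[miss, col] = rng.uniform(mu - sd, mu + sd, miss.sum())
    return out

def impute_mice_like(df, lab_cols, seed=0):
    """MICE-style chained-equations imputation via IterativeImputer."""
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    out = df.copy()
    out[lab_cols] = imp.fit_transform(out[lab_cols])
    return out
```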
Furthermore, to investigate whether missing data patterns alone could provide an “informative presence” that enhanced model performance, we developed two experimental models with the same VC-MAES architecture but different input features. One model relied solely on five vital signs (SBP, DBP, HR, RR, and temperature) and age, while the other model incorporated those same vital signs plus a binary lab ordering indicator, denoting whether laboratory tests were ordered, without using the actual laboratory values.
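Constructing such an indicator is straightforward; a hypothetical sketch, assuming `lab_cols` holds the laboratory columns:

```python
import pandas as pd

def lab_ordering_indicator(df: pd.DataFrame, lab_cols: list) -> pd.Series:
    """Binary flag: 1 if any laboratory value was recorded for the row
    (i.e., tests were ordered that hour), 0 otherwise. The flag, not the
    lab values themselves, is fed to the second experimental model."""
    return df[lab_cols].notna().any(axis=1).astype(int)
```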
We evaluated and compared the AUROC for each scenario to determine the impact of varying levels of missing data and informative presence on the predictive performance of the model. The DeLong test was used to compare the AUROCs of the predictive models.
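AUROC and AUPRC can be computed with scikit-learn as shown below. The study compares AUROCs with the DeLong test; the paired bootstrap in this sketch is purely an illustrative stand-in for contrasting two models on the same cases, not the authors’ procedure.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def bootstrap_auroc_diff(y, p_a, p_b, n_boot=2000, seed=0):
    """Paired bootstrap CI for the AUROC difference of two models on the
    same cases. Illustrative only; the study itself uses the DeLong test."""
    rng = np.random.default_rng(seed)
    y, p_a, p_b = map(np.asarray, (y, p_a, p_b))
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():   # resample needs both classes
            continue
        diffs.append(roc_auc_score(y[idx], p_a[idx])
                     - roc_auc_score(y[idx], p_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), (lo, hi)

# AUPRC for a single model: average_precision_score(y, p_a)
```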
3. Results
3.1. Baseline Characteristics
A total of 6478 initial patient encounters, representing 4846 patients admitted to general wards, were screened. After applying the exclusion criteria, 439 cases were excluded: 423 owing to direct ICU admissions from either the ED or operating room and 16 owing to incomplete scoring data. The final analysis included 6039 cases involving 4447 individuals.
Adverse events (AEs) occurred in 217 cases, representing 3.6% of the total. These AEs comprised 102 unplanned ICU transfers, 13 cardiac arrests, and 102 deaths.
The demographic profile revealed a mean age of 53.10 years, with notable differences between groups: patients experiencing AEs were significantly older (mean age, 74.04 years) than those without complications (mean age, 52.32 years; p < 0.001).
The study population demonstrated a marked sex imbalance, with 78.90% of the patients being female. This skew is attributable to the departmental distribution, as the OBGYN unit accounted for approximately 64% of the study cases, resulting in a predominantly female sample.
Baseline demographic characteristics and vital signs of the event and control groups are summarized in Table 1.
3.2. Missing Laboratory Test Frequencies
Table 2 illustrates the differences in the frequency of missing laboratory test results between the non-event and event groups. The non-event group (n = 5822) generally exhibited a higher proportion of missing values across several laboratory measures. For instance, over 80% of the lactate and pH data were missing in the non-event group, compared to only 7.37% and 6.91%, respectively, in the event group. In contrast, the event group (n = 217) had substantially fewer missing values for multiple laboratory parameters, often approaching zero. Regarding the laboratory contribution to the MAES, hematocrit was identified as the most important laboratory feature associated with events, followed by HCO3−, platelet count, and lactate. A detailed illustration of these feature-importance results is provided in Supplementary Figure S2.
3.3. Vital-Signs-Only Performance with Induced Missingness
When VC-MAES utilized only vital signs, treating all other data as missing and imputing normal values via the model’s default method, it achieved an AUROC of 0.896 (95% CI: 0.887–0.904) and an AUPRC of 0.336 (95% CI: 0.312–0.361). In comparison, NEWS and MEWS had AUROCs of 0.797 (95% CI: 0.785–0.809) and 0.722 (95% CI: 0.708–0.737), with corresponding AUPRCs of 0.125 (95% CI: 0.110–0.140) and 0.079 (95% CI: 0.069–0.090), respectively. These results indicate that VC-MAES outperformed both NEWS and MEWS in terms of both AUROC and AUPRC (Figure 1).
Upon reintroducing the full set of originally collected clinical data, previously treated as missing, the VC-MAES performance improved significantly compared with the vital-signs-only scenario. The AUROC increased from 0.896 to 0.918 (95% CI: 0.912–0.924; DeLong test p < 0.001), and the AUPRC increased from 0.336 (95% CI: 0.312–0.361) to 0.352 (95% CI: 0.329–0.373), further enhancing the model’s predictive capabilities.
3.4. Performance with Forced Mean-Based Imputation and MICE Imputation
When missing values were imputed using values randomly drawn from within one SD of the mean observed values for the corresponding non-event and event groups, the AUROC decreased to 0.883 (95% CI: 0.875–0.891) compared with 0.918 (95% CI: 0.912–0.924) under the model’s default normal-value imputation method (DeLong test, p < 0.001).
A similar trend was observed in the AUPRC, which declined from 0.352 (95% CI: 0.329–0.373) under the default method to 0.307 (95% CI: 0.285–0.329) under forced mean-based imputation (DeLong test, p < 0.001).
Furthermore, at a fixed sensitivity of 80%, the specificity under the default imputation method was 85.5%, which decreased to 78.3% under forced mean-based imputation. These results indicate that mean-based random-value replacement negatively affects the discriminative ability of the model.
When analyzing the model’s prediction scores (MAES), the average score of the event group was 52.6 with normal-value imputation and remained similar at 52.8 with forced mean-based imputation. In the non-event group, however, the average score increased from 7.7 with normal-value imputation to 10.6 with forced mean-based imputation. The distribution of the average scores in each group under each imputation method is shown in Supplementary Figure S3.
In the secondary analysis using the MICE imputation method, the AUROC declined further to 0.827 (95% CI: 0.824–0.830), compared with both the model’s default normal-value imputation and the forced mean-based imputation method.
Supplementary Figure S4 illustrates the AUROC achieved by each method.
The baseline laboratory values are presented in Table 3, comparing the raw data without imputation and the forced mean-based imputation approach for both the non-event and event groups. The performance of VC-MAES under these two methods is illustrated in Figure 2.
3.5. Performance with a Model Incorporating a Lab-Ordering Pattern
When comparing a model that relied solely on five vital signs and age with a second model that also included a binary lab-ordering indicator, denoting whether laboratory tests were ordered without using the actual lab values, the latter model demonstrated superior performance. Specifically, the model incorporating the lab-ordering pattern achieved an AUROC of 0.849 (95% CI: 0.846–0.851), compared with 0.831 (95% CI: 0.828–0.834) for the vital-signs-only model (Supplementary Figure S5).
4. Discussion
In this study, we evaluated the performance of the VC-MAES model under varying levels of clinical data availability by systematically adjusting the extent of missing data. Even when limited to vital signs and age, the model demonstrated robust predictive performance, outperforming established early warning scores such as NEWS and MEWS. Incorporating additional real-world clinical data, including SpO2 and laboratory values, further enhanced predictive accuracy. In contrast, forcing the imputation of missing values through mean-based estimates from both event and non-event groups or MICE reduced performance relative to the model’s default normal-value imputation. This decline underscores that ignoring the underlying reasons why tests were not ordered can obscure clinically significant signals about patient status, suggesting the value of preserving informative presence and emphasizing the importance of carefully handling missing data patterns.
In real-world clinical practice, data availability varies owing to factors such as loss to follow-up, documentation errors, practice patterns, resource constraints, and patient preferences. When data are absent, they are typically considered “missing”. However, this term can be misleading because it suggests that data are required but are simply unavailable. Certain tests may not have been indicated or were deemed unnecessary, thereby illustrating the concept of informative observations. The presence of data provides meaningful information about the patient’s condition and the clinical decision-making process [3].
Our study exemplified this finding. For example, we observed substantial differences in the frequency of missing laboratory tests between the non-event and event groups. Patients who experienced adverse events had far fewer missing laboratory values (e.g., lactate and pH), likely because clinicians suspected potential deterioration and ordered more comprehensive testing. In contrast, patients in the non-event group often lacked these tests, reflecting the clinical judgment that they were stable and did not require further investigation. Such differences support the notion that the absence of data is not merely a void but an informative pattern tied closely to clinical decision-making. This interpretation is further supported by our experiment comparing two models: one using only vital signs and another incorporating both vital signs and a lab-ordering pattern. The model that included the lab-ordering pattern outperformed the vital-signs-only model, despite not using actual laboratory values. This finding underscores the significance of “patterns” in missing or present data for predictive accuracy.
We specifically examined the model’s performance under three scenarios: (1) using only the minimum required data (five vital signs and age), (2) reintroducing the originally collected data, with any remaining gaps handled through normal-value imputation, and (3) imputing missing data either by drawing random values from within one SD of the mean observed values of the event and non-event groups or by using MICE, to approximate the values that might have been observed had clinicians chosen to order those tests.
The VC-MAES model demonstrated a strong predictive capability using only vital signs and age, achieving an AUROC of 0.896 and outperforming both NEWS (0.797) and MEWS (0.722). This suggests that even a limited set of essential features can yield effective predictions, possibly because the model’s deep learning architecture captures complex patterns that conventional scoring systems may miss [16,17].
Upon reintroducing the previously withheld clinical data, the performance of the model improved significantly, with the AUROC increasing from 0.896 to 0.918. This substantial enhancement underscores the importance of obtaining comprehensive clinical information. While basic vital signs provide a strong predictive foundation, incorporating additional real-world parameters can improve the ability of the model to identify at-risk patients, which is consistent with previous studies [18,19].
Conversely, forcibly imputing missing data with mean-based values reduced the model’s AUROC from 0.918 to 0.883, indicating that this approach failed to account for the valuable clinical context inherent in missingness patterns [20,21,22]. The observed decline in performance was largely driven by an increase in false-positive rates within the non-event group, which had a significantly higher proportion of missing laboratory data. As a result, these patients were more susceptible to forced laboratory imputation, leading to substantial changes in average MAES scores and their distribution, as noted in Section 3. In the secondary analysis, MICE imputation yielded an AUROC of 0.827, which was lower than that obtained with the model’s default normal-value imputation. This finding aligns with our primary results using mean-based imputation and further underscores the importance of preserving clinically meaningful data patterns.
Normal-value imputation is designed to mirror a common clinical assumption: if a test was not ordered, it may be because nothing appeared abnormal enough to warrant it, implying that the patient might be “normal” for that parameter. By providing plausible baseline measurements rather than arbitrary replacements, normal-value imputation aligns more closely with clinicians’ decision-making processes [23]. Although no imputation method should serve as the sole basis for clinical decision-making, this clinically informed approach may help preserve the meaningful structure derived from clinical judgment.
Clinicians order or forgo tests based on their professional assessments. The absence of certain measurements often indicates that no additional testing was deemed necessary, suggesting patient stability. In contrast, frequent or specialized testing may signal mounting concern about potential deterioration. This clinical assessment remains a cornerstone of medical practice, delivering substantial value in patient care and diagnosis by ensuring that care is both efficient and effective. Ignoring this and treating missing data merely as random gaps, rather than as meaningful indicators grounded in clinical decision-making, obscures these nuances and diminishes a model’s capacity to accurately reflect patient status [24]. Additionally, from a practical standpoint, our findings suggest that models such as VC-MAES can be implemented effectively without exhaustive data collection or overly aggressive imputation strategies.
This study has the following limitations. First, it was conducted at a single center in South Korea, which may limit the generalizability of the findings to institutions with different patient populations or clinical practices. However, the novelty of this work lies primarily in investigating informative presence in a real-world dataset and comparing various imputation strategies. Additionally, multicenter external validations are currently underway, and the findings from this study will be assessed across broader datasets to strengthen and confirm their applicability in diverse clinical settings.
Second, this study did not assess the impact of varying proportions of missing data or informative presence on model robustness. In clinical environments, the amount and distribution of data can vary significantly, and understanding how these factors affect the model performance is crucial for ensuring generalizability and reliability.
Future studies should explore the generalizability of these findings across different clinical settings and populations. Additionally, developing advanced imputation methods that acknowledge the nonrandom nature of missing data and preserve informative patterns could further improve predictive performance.