In-Advance Prediction of Pressure Ulcers via Deep-Learning-Based Robust Missing Value Imputation on Real-Time Intensive Care Variables

Pressure ulcers (PUs) are a prevalent skin disease affecting patients with impaired mobility and those in other high-risk groups. These ulcers increase patients' suffering, medical expenses, and the burden on medical staff. This study introduces and validates a clinical decision support system for predicting real-time PU occurrences in the intensive care unit (ICU) using MIMIC-IV and in-house ICU data. We develop various machine learning (ML) and deep learning (DL) models for predicting PU occurrences in real time using MIMIC-IV and validate them on both MIMIC-IV and the Kangwon National University Hospital (KNUH) dataset. To address the challenge of missing values in time series, we propose a novel recurrent neural network model, GRU-D++. This model outperformed the other experimental models, achieving an area under the receiver operating characteristic curve (AUROC) of 0.945 for on-time prediction and 0.912 for 48 h in-advance prediction. Furthermore, in the external validation on the KNUH dataset, the fine-tuned GRU-D++ model demonstrated superior performance, achieving an AUROC of 0.898 for on-time prediction and 0.897 for 48 h in-advance prediction. The proposed GRU-D++, designed to exploit temporal information and handle missing values, stands out for its predictive accuracy. Our findings suggest that this model can significantly alleviate the workload of medical staff and prevent the worsening of patient conditions by enabling timely interventions for PUs in the ICU.


Introduction
Pressure ulcers (PUs) are prevalent skin injuries in patients who remain immobile or cannot change positions for extended periods. PUs arise from prolonged pressure on bony prominences, such as the back of the head, shoulders, elbows, and heels, or from blood circulation disorders [1][2][3]. In the early stages of PUs, simple interventions such as wound dressings or patient repositioning suffice. However, in advanced cases, the patient's condition may deteriorate significantly, and surgical procedures become necessary. Such interventions increase medical expenses [4,5] and the workload of intensive care unit (ICU) staff [6].
In the ICU, various risk assessment tools, including the Braden scale, Gosnell scale, Norton scale, and Waterlow score, are used to gauge PU risk [7][8][9][10][11][12]. Recently, machine learning (ML)-based PU prediction systems have been developed [13][14][15][16][17][18] to predict PU occurrences effectively. These systems employ logistic regression, random forests, boosting machines, or multi-layer perceptrons. However, they suffer from two major problems: (1) they cannot handle time series data even though real-world data are time series, and (2) they cannot appropriately handle missing values even though real-world data contain a massive number of missing values.
To overcome these problems, we develop recurrent neural network (RNN)-based PU prediction systems. The RNN is a deep learning (DL) method for time series data. We employ five different RNNs. The simple RNN, gated recurrent unit (GRU) [19], and long short-term memory (LSTM) [20] are representative RNN models. However, they are not suited to handling missing values. Therefore, we also employ the GRU with decay (GRU-D) [21], which is specialized for imputing missing values. Furthermore, we propose an enhanced version of GRU-D, named GRU-D++, suitable for time series with a high missing-value rate.
In our empirical experiments, the simple RNN, GRU, and LSTM outperform traditional ML systems, which indicates the importance of time series information for PU prediction. GRU-D and GRU-D++ outperform the other RNN systems, demonstrating that their missing-value imputation mechanisms are very effective. In addition, we conduct further experiments with GRU-D++ using different numbers of input variables, data rescaling, and model fine-tuning. These experiments can provide meaningful insights to researchers who want to employ GRU-D++ in other medical centers. For reproducibility, we have publicly released the source code of GRU-D++ at https://github.com/gim4855744/GRU-Dpp (accessed on 6 November 2023).

Study Population
We used the Medical Information Mart for Intensive Care IV (MIMIC-IV) [22], a large public database containing de-identified information on patients admitted to the Beth Israel Deaconess Medical Center (BIDMC), as the internal dataset. Note that, although MIMIC-IV is a public database, use of the full MIMIC-IV requires approval from PhysioNet. Only ICU patients were included in this study. We classified patients who had records of PUs, or who had been assigned PU grades in the nursing records, as the PU group. The remaining patients were classified as the non-PU group. Patients with PUs before ICU admission were excluded from this study.
The external validation dataset was collected from Kangwon National University Hospital (KNUH) in the Republic of Korea. We included adult patients (age ≥ 18) who were admitted to the ICU between January 2016 and August 2022 and had at least 48 h of records after admission. This study was approved by the Institutional Review Board of KNUH (IRB, KNUH-2022-09-013-00).

Data Collection
We used 48 variables. Our input variables included patient demographics, vital signs, laboratory findings, medication and treatment information, underlying diseases, the Braden scales, and the sedation scale. In MIMIC-IV, some patients have the Richmond Agitation-Sedation Scale (RASS), and the others have the Riker Sedation-Agitation Scale (SAS). Therefore, we converted RASS to SAS. Figure 1 shows the list of our input variables. In this study, we use real-world datasets collected hourly. Therefore, some variables have large amounts of missing values; for instance, the laboratory values pH and lactate are missing 72% and 80% of the time, respectively, whereas vital signs such as SBP, DBP, and MBP are only 0.6% missing.

Data Preprocessing
We performed the following preprocessing steps to prepare patients' time series:

• We sampled patient data hourly from ICU admission to discharge;
• If multiple samples existed within an hour, we took the average value for continuous variables, the most negative value for categorical variables, and binarized medication information;
• We applied the interquartile range method to discard outliers.

Figure 2 displays a flow chart of the preprocessing steps.
The input variables of a dataset have distinct scales, which hinders effective training of ML models. Thus, we performed min-max scaling to the range [−1, 1]. To impute missing values, we applied forward filling followed by mean filling. This imputation was not performed for GRU-D and GRU-D++ because they impute missing values internally.
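The scaling and imputation steps above can be sketched as follows. This is a minimal pandas-based sketch, assuming one row per hour and one column per variable; the function name and data layout are illustrative, not the authors' implementation.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, lower: float = -1.0, upper: float = 1.0) -> pd.DataFrame:
    """Min-max scale each column to [lower, upper], then impute
    missing values by forward filling followed by mean filling."""
    # Min-max scaling to [-1, 1] per variable.
    mins, maxs = df.min(), df.max()
    scaled = (df - mins) / (maxs - mins) * (upper - lower) + lower
    # Forward fill carries the last observation onward in time...
    filled = scaled.ffill()
    # ...and mean fill covers variables that are missing from the start.
    return filled.fillna(filled.mean())
```

Note that this order matters: forward filling first preserves temporal continuity, and mean filling only touches the gaps that no earlier observation can cover.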

Prediction Models
In this study, we compare nine ML and DL models. Logistic regression (LR), decision tree (DT), random forest (RF), and extreme gradient boosting machine (XGBoost) are representative machine learning models that have been widely used in previous studies. However, they can only take 1 h of data as input and cannot capture time-varying information. Therefore, they are not suitable for handling real-time data.
The RNN, GRU, and LSTM are deep learning models for time series data. Since they take information from the previous time step as input again, they can capture time-varying information. Due to this advantage, they are widely used in real-time applications. However, missing values are another challenge in the medical domain. Real-world electronic health records (EHRs) contain many missing values, which must be imputed appropriately before being input into a machine learning model.
GRU-D automatically imputes missing values internally and has demonstrated good performance on EHR datasets compared to other ML and DL models with traditional imputation methods such as mean filling. This result indicates that the imputation mechanism of GRU-D is effective. However, we found that GRU-D still has a problem on datasets with high missing-value rates. GRU-D requires the last observation of a variable to generate an imputation value, but the last observation is unavailable when the first value of a variable is missing.
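The decay-based input imputation of GRU-D [21] can be sketched for a single time step as follows. The variable names and the fixed (rather than trained) decay parameters are illustrative simplifications of the published mechanism.

```python
import numpy as np

def grud_impute(x, m, delta, x_last, x_mean, w, b):
    """GRU-D-style input imputation for one time step (a sketch).
    x: raw inputs, m: observation mask (1 = observed),
    delta: hours since each variable was last observed,
    x_last: last observed values, x_mean: empirical means,
    w, b: decay parameters (trainable in GRU-D, fixed here)."""
    # The decay rate shrinks toward 0 as the gap since the last
    # observation grows, pulling the estimate toward the mean.
    gamma = np.exp(-np.maximum(0.0, w * delta + b))
    estimate = gamma * x_last + (1.0 - gamma) * x_mean
    # Observed entries pass through; missing ones use the estimate.
    return m * x + (1.0 - m) * estimate
```

The failure mode described above is visible here: when a variable has never been observed, `x_last` does not exist, which is exactly the case GRU-D++ is designed to handle.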
In recent years, various missing value imputation methods for multivariate time series have been proposed [23][24][25][26][27]. Notably, many of these methods adopt a two-phase learning process [23][24][25][26][27], wherein the entire dataset is first imputed using an imputation model, followed by the training of a classification model. However, this two-phase approach is time-consuming for both training and inference. Some recent methods utilize generative adversarial networks (GANs) [24][25][26], but training GANs is notoriously difficult.
To overcome this problem, we developed a novel deep learning model named GRU-D++. GRU-D++ is an end-to-end (imputation and classification are trained and inferred simultaneously), RNN-based model that is practical in real-world scenarios. In addition, we focused on making GRU-D++ handle high missing-rate datasets effectively. To the best of our knowledge, GRU-D++ is the first model that explicitly considers high missing-rate data. Details of GRU-D++ can be found in Appendix A. Figure 3 depicts an overview of our proposed real-time in-advance PU prediction system. Here, x_t represents the input variables updated every hour, and x̂_t represents the imputed complete variables. Furthermore, y_t is the output indicating whether a PU occurs, which was used for training and evaluating our model. The definition of y_t changes according to the in-advance prediction setting. For example, in 24 h in-advance prediction, if a PU occurs 60 h after admission, the first "yes" label should be y_36.
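The label definition described above can be sketched as follows; the function name and the hour-indexed list representation are illustrative assumptions.

```python
def make_labels(occurrence_hour, horizon, n_hours):
    """Hourly labels y_t for in-advance prediction (a sketch).
    The label turns positive `horizon` hours before the recorded
    PU occurrence; occurrence_hour is None for non-PU stays."""
    labels = [0] * n_hours
    if occurrence_hour is not None:
        # With a 24 h horizon and occurrence at hour 60,
        # the first positive label is y_36 (60 - 24).
        first_positive = max(0, occurrence_hour - horizon)
        for t in range(first_positive, n_hours):
            labels[t] = 1
    return labels
```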

Baseline Characteristics
The internal cohort, MIMIC-IV, consists of 67,175 patients, whereas the external cohort, KNUH, consists of 6876 patients.In the internal cohort, we used 53,740 patients for training and 13,435 for validation.Table 1 displays the baseline characteristics of both the internal and external cohorts.

Predictive Performances
We used the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) to evaluate the experimental models. AUROC is a common evaluation metric for binary classification. AUPRC is similar to AUROC but is more informative under class imbalance. We evaluated the performances not only at the PU occurrence time but also for early prediction (i.e., 12, 24, and 48 h in-advance predictions).
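With scikit-learn, the two metrics can be computed as follows; the labels and scores are toy values, and average precision is used as the standard approximation of AUPRC.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy hourly predictions: 1 = PU, 0 = non-PU.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.05]

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)
```

In practice, such metrics would be computed per prediction horizon (on-time, 12, 24, and 48 h in advance) over all patient-hours in the validation set.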
Table 2 presents the performances of the experimental models on the internal validation set. The results indicate that DT performed poorly, with an AUROC of 0.569 at the PU occurrence time, whereas the other ML models, LR, RF, and XGBoost, exhibited higher performances, with AUROCs of 0.818, 0.818, and 0.814, respectively, at the PU occurrence time. All RNN models outperformed the ML models. In particular, GRU exhibited an AUROC of 0.918 at the PU occurrence time. GRU-D outperformed the other RNNs. Furthermore, GRU-D++ exhibited the best performance in all experimental settings, indicating the superiority of its imputation mechanism.
We also evaluated the Braden scale, which is widely used to measure PU risk in the ICU. The Braden score yielded a lower AUROC than the AI models at all prediction times (Table 3). LR: logistic regression, DT: decision tree, RF: random forest, XGBoost: extreme gradient boosting machine, RNN: recurrent neural network, GRU: gated recurrent unit, LSTM: long short-term memory, GRU-D: gated recurrent unit with decay.

We used 48 variables in the experiments, but all of these variables may not be available in other medical centers. Since the practicality of the prediction system is essential, we trained and evaluated GRU-D++ again using only the ten variables with the highest SHAP values. These top ten variables are displayed in Figure 4. In Table 3, GRU-D++ 10 denotes GRU-D++ with the top ten variables. GRU-D++ 10 outperformed the other RNN models trained with 48 variables, even though it uses only ten variables. Additionally, we trained GRU-D++ with 42 of the 48 variables, excluding the six variables corresponding to the Braden scales; the results are presented as GRU-D++ (w/o Braden) in Table 3.

Table 4 shows the performances of GRU-D++ on the external validation set. GRU-D++ achieved a promising performance (an AUROC of 0.807 at the PU occurrence time). To improve its performance on the external validation set, we rescaled the external set to have a distribution similar to that of the internal set. Consequently, GRU-D++ (rescale) exhibited a significantly higher performance than GRU-D++ (AUROC 0.895 vs. 0.807 at the PU occurrence time). In addition, we evaluated the performance of GRU-D++ after fine-tuning. Specifically, we randomly split the external set into training (10%), model selection (10%), and validation (80%) sets. As a result, GRU-D++ (fine-tune) exhibited the highest performances on the external set. A comparison of all experimental models on the external set is presented in Appendix A.
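The distribution-matching rescaling described above could be realized, for example, by mapping each external-cohort variable onto the internal cohort's range. This min-max-based sketch is an assumption for illustration; the exact recipe used for GRU-D++ (rescale) is not specified here.

```python
import numpy as np

def rescale_external(x_ext, ext_min, ext_max, int_min, int_max):
    """Map an external-cohort variable onto the internal cohort's
    scale (a sketch). Values are first normalized by the external
    min/max, then re-expressed on the internal min/max range."""
    unit = (x_ext - ext_min) / (ext_max - ext_min)
    return unit * (int_max - int_min) + int_min
```

Applied per variable before inference, this keeps the model's learned imputation statistics (which reflect the internal cohort) meaningful on external inputs.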

Analysis of Predictive Performances
We conducted extensive experiments using various ML and DL models to predict PU occurrences. DT exhibited poor predictive performance because the model is susceptible to overfitting; that is, it cannot accurately predict unseen data. By contrast, the other baseline models, LR, RF, and XGBoost, exhibited good performances because of their strong regularization abilities. This is a well-known phenomenon, and previous studies have demonstrated that XGBoost has good predictive power.
The RNN models, the simple RNN, GRU, and LSTM, demonstrated considerably higher performances than the ML models, highlighting the importance of considering time-varying information. Many researchers still widely use boosting machines in their studies, but we showed that RNN models are very helpful for time series data. In addition, GRU and LSTM significantly outperformed the simple RNN, which is unsuitable for capturing long-term temporal information. This result suggests that patients' long-term information is important in predicting PU occurrences. GRU outperformed LSTM, which is a well-known phenomenon: LSTM requires more data samples than GRU and is susceptible to overfitting.
GRU-D and GRU-D++ significantly outperformed the other ML and RNN models, demonstrating that conventional imputation techniques are insufficient for handling missing values and that the trainable imputation methods of GRU-D and GRU-D++ are effective. In addition, GRU-D++ outperformed GRU-D, which indicates that GRU-D++ handles datasets with high missing-value rates more effectively than GRU-D. We also evaluated GRU-D++ with fewer variables (i.e., the ten with the highest SHAP scores) and found that it outperformed the other RNNs trained with 48 variables. This is an important finding because other medical centers may be unable to collect all 48 variables.
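Selecting the ten variables with the highest SHAP scores, given precomputed attributions (e.g., from the shap library), can be sketched as follows; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def top_k_variables(shap_values, names, k=10):
    """Rank variables by mean absolute SHAP value (a sketch).
    shap_values: (n_samples, n_variables) array of precomputed
    per-sample attributions; names: variable names in column order."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1][:k]  # descending importance
    return [names[i] for i in order]
```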
We evaluated GRU-D++ on the external validation set and found that it exhibited good predictive performance (AUROC > 0.8). An issue with directly applying GRU-D++ trained on the internal set to the external validation set is that its imputation mechanism was trained to predict the internal set accurately, yet the distributions of the internal and external sets may differ. Thus, the imputation mechanism may fail to generate accurate imputation values on the external validation set, which can deteriorate the predictive performance of GRU-D++. To address this problem, we rescaled the external validation set to match the distribution of the internal set as closely as possible. GRU-D++ with rescaling exhibited a considerably improved performance over basic GRU-D++, suggesting that, via rescaling, GRU-D++ can achieve predictive performances close to those on the internal validation set when applied to other external sets (e.g., datasets from other medical centers). Furthermore, we retrained (fine-tuned) GRU-D++ with a small part of the external validation set. It demonstrated superior predictive performance with only a few data samples, making it effective for other medical centers.

Clinical Findings
Similar to previous studies, we developed PU prediction systems using various ML and DL models, classifying data into PU and non-PU groups. However, whereas previous studies [3,8] predicted PU occurrences using the average value of each variable as input, we used time series data. This study overcomes the performance degradation problem of existing studies through time series information and a novel missing value imputation mechanism. Our system is very effective for real-time in-advance prediction.
The MIMIC-IV dataset used for model development has been widely used in various studies because of its diverse variables. However, the dataset has a disproportionate representation: Whites and Blacks account for approximately 78.56% of the total dataset, whereas Asians make up only approximately 2.94%. Due to this racial imbalance, applying the results of studies performed only with MIMIC-IV to Asians is inappropriate. One advantage of this study is that it presents a prediction system suitable for diverse ethnicities by validating the system developed on MIMIC-IV with Asian data (KNUH).
An analysis of the SHAP values revealed that, as expected, the Braden scores were among the top ten essential variables for predicting PU occurrences. Interestingly, PU prediction using only the 42 variables without the Braden scores showed only a 1.2-1.4% degradation in AUROC relative to that with 48 variables (Table 3). This finding indicates that the 42 variables contain valuable information about PU occurrences. Furthermore, GRU-D++ allows for accurate predictions without the Braden scales, given that these scales are inherently measured based on the patient's condition as reflected in the 42 variables. Clinically, this could potentially reduce the workload of nurses who measure and record the Braden score. We anticipate that this model could be a valuable tool for future clinical use.

Conclusions
PUs adversely affect patient health and increase the workload of medical staff in the ICU. Therefore, early prediction of PUs is crucial. This study compared various ML and DL models to develop accurate PU prediction systems. As a result, we developed the GRU-D++ model, which is robust to high missing-value rates. This model demonstrated better performance than the Braden scale and maintained high performance even without the Braden scale. It is expected to contribute significantly to reducing the workload of medical staff in the future.
GRU-D++ can be helpful to other researchers aiming to predict PU occurrences accurately. Furthermore, in future work, we plan to develop a system that also predicts the region and grade of PU occurrences.

Figure 1. List of input variables.

Figure 2. Flow chart of the preprocessing steps.

Figure 3. Overview of the proposed in-advance PU prediction system. The blue cell indicates the patient's admission time, and the red cell indicates the PU occurrence time. Here, x_t represents the input variables updated every hour, and x̂_t represents the imputed complete variables. Furthermore, y_t is the output indicating whether a PU occurs, which was used for training and evaluating our model. The definition of y_t changes according to the in-advance prediction setting. For example, in 24 h in-advance prediction, if a PU occurs 60 h after admission, the first "yes" label should be y_36.

Figure 4. SHAP values of the top ten most influential variables.

Table 2. Performance comparison of various models on the internal validation set.

Table 3. Performance comparison of different input variables on the internal validation set. GRU-D++ 10: GRU-D++ trained with the top ten important variables; GRU-D++ (w/o Braden): GRU-D++ trained without the Braden scales.

Table 4. Performances of GRU-D++ on the external validation set.

Table A1. Performance comparison on the external validation set.

Table A2. Performance comparison on the external validation set with rescaling.

Table A3. Performance comparison on the external validation set with fine-tuning.

Table A4. Performance comparison on the external validation set without Braden scales.

Table A5. Performance comparison on the external validation set without Braden scales after rescaling.