Article

D4Care: A Deep Dynamic Memory-Driven Cross-Modal Feature Representation Network for Clinical Outcome Prediction

Binyue Chen and Guohua Liu *

1 College of Electronic Information and Optical Engineering, Nankai University, Tianjin 300350, China
2 Tianjin Key Laboratory of Optoelectronic Sensor and Sensing Network Technology, Tianjin 300350, China
3 General Terminal IC Interdisciplinary Science Center, Nankai University, Tianjin 300350, China
4 Engineering Research Center of Thin Film Optoelectronics Technology, Ministry of Education, Nankai University, Tianjin 300350, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6054; https://doi.org/10.3390/app15116054
Submission received: 19 April 2025 / Revised: 20 May 2025 / Accepted: 22 May 2025 / Published: 28 May 2025

Abstract

With the advancement of information technology, artificial intelligence (AI) has demonstrated significant potential in clinical prediction, helping to improve the level of intelligent medical care. Current approaches primarily rely on patients’ time series data and clinical notes to predict health status, typically by simply concatenating cross-modal features. They not only ignore the inherent correlation between cross-modal features, but also fail to analyze the collaborative representation of multi-granularity features from diverse perspectives. To address these challenges, we propose a deep dynamic memory-driven cross-modal feature representation network for clinical outcome prediction. Specifically, we use a Bi-directional Gated Recurrent Unit (BiGRU) network to capture dynamic features in time series data and a dual-view feature encoding model with sentence-aware and entity-aware capabilities to extract clinical text features from global semantic and local concept perspectives, respectively. Furthermore, we introduce a memory-driven cross-modal attention mechanism, which dynamically establishes deep correlations between clinical text and time series features through learnable memory matrices, and a memory-aware constrained layer normalization to alleviate the challenges of multi-modal feature heterogeneity. We also use gating mechanisms and dynamic memory components to enable the model to learn feature information across different historical and current patterns, further improving performance, and we apply integrated gradients for feature attribution analysis to enhance the model’s interpretability. Finally, we evaluate the model on the MIMIC-III dataset; the experimental results demonstrate that it outperforms current advanced baselines in clinical outcome prediction tasks. Notably, our model maintains high predictive accuracy and robustness even when faced with imbalanced data, offering a new perspective for researchers in the field of AI medicine.

1. Introduction

In recent years, deep learning-based medical assistance has attracted significant attention due to its powerful capabilities in data fusion and feature extraction [1,2,3,4,5]. Healthcare big data has accumulated substantial volumes of multi-modal medical data, including medical images and electronic health records (EHRs). In-depth analysis of these data helps to uncover their core value and can provide auxiliary support for medical decision-making. EHRs, serving as the primary repository of patients’ health conditions and treatment processes, mainly consist of time series data generated by vital signs monitoring devices and clinical notes written by clinicians, and play a crucial role in clinical decision support systems [6,7,8,9,10,11,12]. Recently, AI-assisted clinical prediction based on EHRs has achieved remarkable progress [13,14,15,16,17,18,19]. Constructing predictive models that integrate patients’ multi-source medical features makes it possible to continuously monitor patients’ health trends and provide decision support for clinicians [20,21,22,23,24,25,26]. Previous approaches mainly focused on single-modality modeling, utilizing either temporal data or clinical notes for clinical prediction; some studies have used time series data to predict ICU mortality or length of stay [27,28,29,30]. However, these methods fail to exploit the complementarity of multi-modal data, resulting in limited prediction performance. Advances in multi-modal learning have led to new paradigms in clinical prediction: studies show that learning a joint representation of patients’ temporal and clinical text features can significantly enhance the accuracy of clinical predictions, thereby providing stronger support for clinical decision-making [31,32,33].
Although existing research has made significant progress, several limitations remain. First, the unstructured nature of clinical notes, combined with the spatiotemporal heterogeneity of diverse data modalities, increases the complexity of data processing and limits a model’s ability to capture cross-modal feature correlations. Second, existing methods mostly rely on simplistic feature concatenation strategies and thus fail to account for the latent semantic associations among multi-modal features. Third, clinical diagnosis requires a comprehensive analysis of the contextual relationships between symptom descriptions, yet current research mostly focuses on word-level analysis and lacks modeling of semantic associations between sentences. Fourth, EHR data exhibit significant information redundancy, and directly using all clinical knowledge as input may introduce noise. Finally, existing methods usually model static data, failing to consider the dynamic evolution of patients’ health status and ignoring the influence of historical states.
To mitigate the aforementioned problems, we propose D4Care, a novel clinical outcome prediction model. First, we use a BiGRU network to capture time series features. Second, we adopt a dual-view text feature extraction strategy that learns the global semantic associations of clinical notes through a sentence-aware attention model and obtains fine-grained structured features by extracting medical entities from the notes. This approach not only maintains the semantic integrity of the text but also enhances the model’s capacity to identify critical information, thereby improving its performance. Furthermore, we design a memory-driven cross-modal attention mechanism to dynamically learn the deep associations between the features of different modalities. At the same time, we introduce a memory-aware constrained layer normalization to handle the heterogeneity of the data and model the dependencies between historical and current patterns. In addition, we use gating mechanisms and dynamic memory components to learn and update critical features. Finally, with all modules working in concert, D4Care effectively extracts patients’ cross-modal features to achieve accurate prediction. The overall framework of the proposed model is shown in Figure 1.

2. Methods

In this section, we introduce the proposed model in detail. D4Care consists of a time series model (TS), a dual-view feature encoding model (DVFE), and a memory-driven cross-modal attention model (MDCA). The TS model uses a BiGRU network to encode time series features and extract patients’ latent health information; the DVFE model applies a dual-view strategy with sentence-aware and entity-aware capabilities to extract clinical text features from global semantic and local concept perspectives, respectively; and the MDCA model integrates a dynamic memory model (DM) and a memory-aware constrained layer normalization (MacLN), so that the model can not only dynamically adjust the normalization parameters according to the memory state to adapt to different input features, but also take historical pattern information into account, thereby enhancing the joint expression of cross-modal features.

2.1. Time Series Model

In this paper, we use a dataset containing N patient samples, denoted as $\{(X^{(i)}, C^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $X^{(i)}$ and $C^{(i)}$ represent the time series data and clinical text of the i-th patient, respectively, and $y^{(i)}$ represents the ground truth. The time series features over the total time steps T are denoted as $X^{(i)} = [X_1, X_2, \ldots, X_T]$ with $X_t^{(i)} = [x_{t,1}, x_{t,2}, \ldots, x_{t,k}]$, and the clinical notes are denoted as $C^{(i)} = [c_1, c_2, \ldots, c_{N_c}]$, where k is the number of monitored features and $N_c$ is the number of clinical notes. This paper uses the BiGRU [34] network as the time series feature encoder to extract latent health information from the time series data. BiGRU utilizes forward and backward passes over the sequence to fully capture the dependencies in time series data, as shown in Equation (1):
$$\mathrm{BiGRU}(X) = \{h_1, h_2, \ldots, h_T\} \tag{1}$$
where $h_t$ denotes the hidden-state representation at time step t. This bidirectional feature extraction mechanism takes into account the patient’s past and future health trends and provides temporal feature representations for subsequent cross-modal fusion.
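As an illustration, a minimal PyTorch sketch of such a BiGRU encoder is given below; the hidden size, feature count, and sequence length are illustrative assumptions rather than the paper’s actual configuration.

```python
import torch
import torch.nn as nn

class TimeSeriesEncoder(nn.Module):
    """Minimal BiGRU encoder sketch: maps (batch, T, k) monitoring
    features to per-step hidden states {h_1, ..., h_T}."""

    def __init__(self, num_features: int, hidden_size: int = 64):
        super().__init__()
        self.bigru = nn.GRU(
            input_size=num_features,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,  # forward + backward passes over the series
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h has shape (batch, T, 2 * hidden_size): forward and backward
        # states are concatenated at every time step.
        h, _ = self.bigru(x)
        return h

# Toy usage: 8 patients, T = 24 hourly steps, k = 17 monitored vitals
# (all sizes are assumptions for illustration).
encoder = TimeSeriesEncoder(num_features=17)
h = encoder(torch.randn(8, 24, 17))
print(h.shape)  # torch.Size([8, 24, 128])
```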

2.2. Dual-View Feature Encoding Model

In addition to time series features, we utilize medically relevant information extracted from clinical notes to enhance the performance of clinical outcome prediction. We design a DVFE model to extract the global semantic information and local structural information of clinical notes from the sentence-level and entity-level perspectives, respectively. Specifically, we use a sentence-aware attention model (SAA) to obtain the latent features and contextual associations of clinical notes from a global perspective. A clinical text consists of $N_s$ sentences, represented as $S = \{s_1, s_2, \ldots, s_{N_s}\}$, where $s_i$ denotes the i-th sentence. Here, we use ClinicalBERT [35], a model pre-trained on clinical corpora, to extract the embedding of the [CLS] token of each sentence as its global semantic representation, as shown in Equation (2):
$$E_{s_i} = \mathrm{ClinicalBERT}(s_i)_{[\mathrm{CLS}]} \tag{2}$$
Furthermore, we use the self-attention mechanism to learn associations between sentences and apply weighted aggregation to obtain context-enhanced sentence representations. Finally, the global representation of the clinical text is obtained by max pooling, as shown in Equation (3):
$$\tilde{C}_s = \text{Max-pooling}(\mathrm{TransfE}(E_{s_1}, E_{s_2}, \ldots, E_{s_{N_s}})) \tag{3}$$
Here, $\mathrm{TransfE}(\cdot)$ denotes a Transformer encoder. Sentence-level modeling based on the attention mechanism can effectively identify cross-sentence semantic associations in clinical texts and improves the text representation through importance weighting.
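The following sketch illustrates how this sentence-aware pipeline could be assembled with standard libraries; the checkpoint name (emilyalsentzer/Bio_ClinicalBERT) is one publicly available clinical BERT standing in for the ClinicalBERT weights referenced above, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Per-sentence [CLS] embeddings from a clinical BERT checkpoint
# (illustrative choice, not necessarily the paper's exact weights).
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
bert = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

sentences = ["Patient reports chest pain.", "Afebrile, vitals stable."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # [CLS] token embedding of each sentence -> (N_s, 768)
    cls_embeddings = bert(**batch).last_hidden_state[:, 0, :]

# Self-attention across sentences, playing the role of TransfE in Eq. (3).
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
transf_e = nn.TransformerEncoder(encoder_layer, num_layers=1)
contextual = transf_e(cls_embeddings.unsqueeze(0))   # (1, N_s, 768)

# Max pooling over sentences -> global text representation C~_s.
c_s = contextual.max(dim=1).values                   # (1, 768)
```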
From the entity-level perspective, we use Med7 [36] as a medical entity extractor to obtain medically relevant keywords containing structured information from clinical notes. Med7 is a clinical named entity recognition model pre-trained on annotated EHR data. We use Med7 to extract a series of medical entities from clinical notes and then use Bio + ClinicalBERT (BCB) [37] to embed these entities, as shown in Equation (4):
$$\tilde{C}_e = \mathrm{BCB}(\mathrm{Med7}([c_1, c_2, \ldots, c_{N_c}])) = \{e_1, e_2, \ldots, e_{N_e}\} \tag{4}$$
where $e_i$ denotes a medical entity embedding and $N_e$ represents the number of entities.
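A hedged sketch of this entity-level branch: Med7 is distributed as a spaCy pipeline (the package name en_core_med7_lg is the publicly released model; installation details may differ), and the extracted entity strings can then be embedded with a clinical BERT checkpoint. The model names and the example note are illustrative.

```python
import spacy
import torch
from transformers import AutoTokenizer, AutoModel

# Med7 as a spaCy pipeline; labels include DRUG, STRENGTH, ROUTE, etc.
med7 = spacy.load("en_core_med7_lg")
note = "Magnesium sulfate 400 mg/5 mL PO bid for the next 5 days."
entities = [ent.text for ent in med7(note).ents]   # entity strings from the note

# Embed each entity with a clinical BERT (stand-in for BCB in Eq. (4)).
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
bcb = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
batch = tokenizer(entities, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # One [CLS] embedding per extracted entity -> C~_e = {e_1, ..., e_N_e}.
    entity_embeddings = bcb(**batch).last_hidden_state[:, 0, :]
```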

2.3. Memory-Driven Cross-Modal Attention Model

Clinical practice shows that patients’ clinical notes and health monitoring time series data have significant clinical correlation. When making diagnosis and treatment decisions, clinicians usually need to combine data from these two different modalities to comprehensively analyze the patient’s health status. Therefore, we propose a memory-driven cross-modal attention model (MDCA), which can adaptively learn fine-grained correlation features between cross-modal data. Specifically, MDCA enables cases with similar clinical patterns to share critical information during feature fusion, thereby enhancing disease prediction performance through dynamic cross-modal interactions. The architecture of the MDCA model is illustrated in Figure 2.
The MDCA model is mainly composed of a multi-subspace attention model, a dynamic memory model (DM), and a memory-aware constrained LN model (MacLN). We first feed the time series data and clinical text embeddings into Dense layers for linear mapping; the feature sequences are then mapped into the same dimensional space through the LeakyReLU function, as expressed in Equation (5):
$$\tilde{X}_t = \mathrm{LeakyReLU}(\lambda_a X_t), \quad \tilde{C}_e = \mathrm{LeakyReLU}(\lambda_e C_e) \tag{5}$$
Here, $\lambda_a$ and $\lambda_e$ are learnable mapping parameters, and $\tilde{X}_t$ is the mapped time series representation. The multi-modal features are then fed into the multi-subspace attention layer to learn correlations between the different modalities. $Q$, $K$, and $V$ denote three feature channels, with $K = \tilde{X}_t \theta_K$, $V = \tilde{X}_t \theta_V$, and $Q = \tilde{C}_e \theta_Q$. Each feature channel has independent mapping parameters, which increases the expressiveness of the features. The model then learns the interdependence between cross-modal features by computing attention weights between the modalities in each head, as shown in Equation (6):
$$\tilde{H}_1 = \sigma\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \quad \tilde{H}_m = [\tilde{H}_1 \,\|\, \tilde{H}_2 \,\|\, \cdots \,\|\, \tilde{H}_n]\,\lambda_m \tag{6}$$
where $\sigma(\cdot)$ is the softmax function, $d_k$ is the dimension of K, $\|$ denotes the concatenation operation, $\lambda_m$ is a learnable linear mapping parameter, $\tilde{H}_1$ is the output of a single attention head, and $\tilde{H}_m$ is the output of the final multi-head attention. This design enables the model to learn dynamic associations between time series signals and clinical text in parallel across different feature subspaces, significantly enhancing cross-modal feature interaction.
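A minimal sketch of this cross-modal attention, assuming PyTorch’s built-in multi-head attention as the multi-subspace layer: entity features act as queries over time series keys and values, following Equations (5) and (6). All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch: clinical entity features (Q) attend over time series
    features (K, V), per Eqs. (5)-(6). Sizes are illustrative."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Linear mapping + LeakyReLU into a shared space (Eq. (5)).
        self.map_ts = nn.Sequential(nn.Linear(128, dim), nn.LeakyReLU())
        self.map_txt = nn.Sequential(nn.Linear(768, dim), nn.LeakyReLU())
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_ts, c_ent):
        q = self.map_txt(c_ent)   # (batch, N_e, dim), from entities
        kv = self.map_ts(x_ts)    # (batch, T, dim), from the time series
        out, _ = self.attn(query=q, key=kv, value=kv)
        return out                # entity features enriched with temporal context

mdca = CrossModalAttention()
fused = mdca(torch.randn(8, 24, 128), torch.randn(8, 10, 768))
print(fused.shape)  # torch.Size([8, 10, 128])
```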
In addition, this paper introduces a residual connection structure to further improve the model’s ability to capture multi-scale features and to alleviate the vanishing gradient problem, as shown in Figure 2. The features processed by the residual connection are then fed into the MacLN for feature optimization. The MacLN integrates the patient’s historical information into the normalization process to improve feature representation. Due to the heterogeneity of cross-modal data, classic layer normalization encounters scale constraints and parameter offset problems during feature fusion. The designed MacLN has the following advantages: (1) it dynamically adjusts the normalization parameters based on the memory state, adapting to different input features; (2) it considers historical pattern information and enhances the joint expression of cross-modal features. The principle is shown in Figure 3.
At time step t, the output $m_t$ of the DM model passes through a multi-layer perceptron to obtain the predicted offsets $\Delta\eta_t$ and $\Delta\phi_t$ for the normalization parameters $\eta_t$ and $\phi_t$, respectively, as shown in Equation (7):
$$\Delta\eta_t = \mathrm{MLP}(m_t), \quad \Delta\phi_t = \mathrm{MLP}(m_t) \tag{7}$$
Here, $\eta_t$ and $\phi_t$ are the key parameters used to adjust the bias during feature representation learning; they are updated by Equation (8):
$$\eta_t \leftarrow \eta_t + \Delta\eta_t, \quad \phi_t \leftarrow \phi_t + \Delta\phi_t \tag{8}$$
Finally, the normalized features $A_{norm}$ are fused with the updated parameters $\eta_t$ and $\phi_t$ to obtain the final output of MacLN, as shown in Equation (9):
$$F_{out} = \eta_t + \phi_t \, A_{norm} \tag{9}$$
where $A_{norm}$ is given by Equation (10):
$$A_{norm} = \frac{A - \mu(A)}{\sigma(A)} \tag{10}$$
where $\mu(A)$ and $\sigma(A)$ denote the mean and standard deviation of the vector A, respectively.
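The following is one way such a memory-aware layer normalization might be realized, mirroring Equations (7)-(10); the MLP shapes and the tanh-bounded offsets are assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class MacLN(nn.Module):
    """Sketch of memory-aware constrained layer normalization: the memory
    state m_t predicts offsets for the scale/shift parameters (Eqs. (7)-(8)),
    which are then applied to normalized features (Eqs. (9)-(10))."""

    def __init__(self, dim: int, mem_dim: int):
        super().__init__()
        self.eta = nn.Parameter(torch.zeros(dim))   # base shift
        self.phi = nn.Parameter(torch.ones(dim))    # base scale
        self.delta_eta = nn.Sequential(nn.Linear(mem_dim, dim), nn.Tanh())
        self.delta_phi = nn.Sequential(nn.Linear(mem_dim, dim), nn.Tanh())

    def forward(self, a: torch.Tensor, m_t: torch.Tensor) -> torch.Tensor:
        # Standard normalization over the feature dimension (Eq. (10)).
        a_norm = (a - a.mean(-1, keepdim=True)) / (a.std(-1, keepdim=True) + 1e-5)
        # Memory-conditioned parameter updates (Eqs. (7)-(8)).
        eta_t = self.eta + self.delta_eta(m_t)
        phi_t = self.phi + self.delta_phi(m_t)
        # Fuse normalized features with the updated parameters (Eq. (9)).
        return eta_t + phi_t * a_norm

layer = MacLN(dim=128, mem_dim=64)
out = layer(torch.randn(8, 24, 128), torch.randn(8, 1, 64))
```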
The DM model designed in this paper enables the network to adaptively learn feature distributions according to the time series state and sample characteristics by adjusting the scaling and offset parameters of layer normalization in real time, thereby enhancing the modeling of long-term dependencies and improving the recognition of sequence patterns. Specifically, we design a dynamic memory matrix $M_t$ to represent state transitions in the feature fusion process, where each memory slot encodes specific clinical pattern information and stores feature representations related to key medical concepts. At time step t, the memory state of the previous step serves as the query channel Q and is concatenated with the previous output feature $z_{t-1}$ to form an enhanced feature representation, with each channel projected through independent learnable parameters. The feature sequence is then fed into the multi-subspace attention model for cross-modal interaction modeling, as shown in Equation (11):
$$Q = \lambda_Q M_{t-1}, \quad K = \lambda_K [M_{t-1} \,\|\, z_{t-1}], \quad V = \lambda_V [M_{t-1} \,\|\, z_{t-1}] \tag{11}$$
Here, $\|$ denotes the concatenation operation, and $\lambda_Q$, $\lambda_K$, $\lambda_V$ are learnable channel mapping parameters. Applying different mappings to the same input can be viewed as projecting it into different feature spaces, which improves the flexibility and robustness of the model. The model then uses the feature channels Q and K to learn associations between different patterns, as shown in Equation (12):
$$S_{ATT} = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{12}$$
where d is the dimension of K, and $S_{ATT}$ is the output of a single attention head. Finally, multiple single-subspace attention heads together form the multi-subspace attention mechanism, as shown in Equation (13):
$$L = \mathrm{Concat}(S_{ATT}^{1}, S_{ATT}^{2}, \ldots, S_{ATT}^{n})\,\lambda_L \tag{13}$$
Here, L is the output of the multi-subspace attention model, $S_{ATT}^{n}$ is the output of the n-th subspace, and $\lambda_L$ denotes the mapping parameters of the different subspaces. The multi-head attention mechanism processes the memory matrix in a parallel architecture and realizes cross-modal association modeling through multi-subspace collaborative feature extraction. Since the DM uses a cyclic feature encoding mechanism, it may cause vanishing or exploding gradients; we therefore introduce a residual connection structure in the DM, as shown in Figure 2. The identity mapping across time steps helps improve the stability of the model. The principle is shown in Equation (14):
$$H_t = \mathrm{MLP}(L + M_{t-1}) + L + M_{t-1} \tag{14}$$
where $\mathrm{MLP}(\cdot)$ represents a multi-layer perceptron.
To enhance the model’s attention to key disease features, we introduce a dynamic information gate mechanism (DGM) in the DM to dynamically select and update features through a forget gate and an input gate. The forget gate and input gate constrain the inputs $z_{t-1}$ and $M_{t-1}$, respectively, so that the model can adaptively refine the memory content and accurately capture the core feature patterns of disease diagnosis. The structure of the DGM model is shown in Figure 4.
The forget gate is responsible for regulating the retention and discarding of information in the memory unit, while the input gate determines the writing and updating of new information, as shown in Equations (15) and (16).
$$i_t = V_i \tanh(M_{t-1}) + U_i Z_{t-1} + b_i \tag{15}$$
$$f_t = V_f \tanh(M_{t-1}) + U_f Z_{t-1} + b_f \tag{16}$$
where $i_t$ and $f_t$ represent the outputs of the input gate and the forget gate at time step t, respectively; $V_i$, $V_f$, $U_i$, $U_f$ are trainable weight parameters; $Z_{t-1}$ is the matrix obtained by broadcasting $z_{t-1}$; and $b_i$ and $b_f$ are bias terms.
Furthermore, the input gate and forget gate use sigmoid activations to generate gating weights in the [0, 1] interval, which regulate the retention intensity of historical memory states and the integration degree of new information, respectively. Ultimately, the update of the memory unit state is jointly determined by the gated filtering of historical memory and the gated encoding of candidate information, as shown in Equation (17):
$$M_t = \sigma(f_t) \odot H_t + \sigma(i_t) \odot M_{t-1} \tag{17}$$
Here, $\sigma(\cdot)$ represents the sigmoid function, $\odot$ denotes the Hadamard product, and $M_t$ is the dynamic memory output at time step t, which is fed into all MacLN models.
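A sketch of the gated memory update under the equations above; the linear parameterization of the gates mirrors Equations (15)-(17), while the dimensions and module layout are assumptions.

```python
import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    """Sketch of the DGM update: input and forget gates computed from the
    previous memory M_{t-1} and output z_{t-1} (Eqs. (15)-(16)), then a
    Hadamard-product blend of new content H_t and old memory (Eq. (17))."""

    def __init__(self, dim: int):
        super().__init__()
        self.V_i = nn.Linear(dim, dim, bias=False)
        self.U_i = nn.Linear(dim, dim)   # bias plays the role of b_i
        self.V_f = nn.Linear(dim, dim, bias=False)
        self.U_f = nn.Linear(dim, dim)   # bias plays the role of b_f

    def forward(self, m_prev, z_prev, h_t):
        i_t = self.V_i(torch.tanh(m_prev)) + self.U_i(z_prev)  # input gate logits
        f_t = self.V_f(torch.tanh(m_prev)) + self.U_f(z_prev)  # forget gate logits
        # Eq. (17): sigmoid gates blend new content with retained memory.
        return torch.sigmoid(f_t) * h_t + torch.sigmoid(i_t) * m_prev

gate = DynamicGate(dim=128)
m_t = gate(torch.randn(8, 4, 128), torch.randn(8, 4, 128), torch.randn(8, 4, 128))
```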
Finally, the output of the first MacLN model is fed into a feedforward network for further processing, ultimately generating temporally aware cross-modal feature embeddings. These multi-modal embeddings $\tilde{E}_{mul}$ are then concatenated with the sentence-level semantic embeddings $\tilde{C}_s$ to form the final fused representation $\tilde{E}_o = \tilde{C}_s \,\|\, \tilde{E}_{mul}$, where $\|$ denotes the concatenation operation. This hierarchical feature fusion strategy not only retains the fine-grained semantic information of the original input but also integrates cross-modal temporal dynamics, providing rich feature representations for the prediction tasks.

2.4. Clinical Prediction

Finally, we construct an end-to-end clinical outcome prediction model based on the multi-perspective multi-modal fused features. The concatenated multi-modal features are fed into a feedforward neural network to compute class prediction probabilities, as shown in Equation (18):
$$\{\hat{y}^{(1)}, \hat{y}^{(2)}, \ldots, \hat{y}^{(N_c)}\} = \mathrm{FFNN}(\tilde{E}_o; \theta) \tag{18}$$
where $N_c$ denotes the number of clinical outcome labels, $\mathrm{FFNN}(\cdot)$ represents the feedforward neural network, and $\theta$ denotes its learnable parameters. We perform end-to-end joint training by minimizing the cross-entropy loss between the predictions and the true labels, as shown in Equation (19):
$$\mathcal{L} = -\frac{1}{N_c} \sum_{i=1}^{N_c} y^{(i)} \log \hat{y}^{(i)} \tag{19}$$
where $y$ and $\hat{y}$ represent the ground truth and the predicted value, respectively.
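A compact sketch of the prediction head and loss described by Equations (18) and (19); we use the numerically stable logits form of binary cross-entropy, and all sizes are placeholders rather than the paper’s configuration.

```python
import torch
import torch.nn as nn

# Fused representation E~_o -> FFNN -> one logit per clinical outcome label.
fused_dim, num_labels = 896, 4           # e.g., 4 outcome tasks (assumed sizes)
ffnn = nn.Sequential(
    nn.Linear(fused_dim, 256), nn.ReLU(),
    nn.Linear(256, num_labels),
)
criterion = nn.BCEWithLogitsLoss()       # stable cross-entropy on logits

e_o = torch.randn(8, fused_dim)          # stand-in for the fused features
y = torch.randint(0, 2, (8, num_labels)).float()
loss = criterion(ffnn(e_o), y)
loss.backward()                          # end-to-end joint training step
```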

3. Experiment and Discussion

In this section, we describe the experimental data, evaluate the performance of the proposed model, and analyze and discuss the experimental results. The proposed approach is implemented in PyTorch 1.6.0 on a workstation with an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA, Santa Clara, CA, USA).

3.1. Data Description

We conducted experiments on the MIMIC-III dataset, a publicly available, de-identified critical care database that contains comprehensive clinical time series data and rich clinical notes for 46,520 patients, including vital sign monitoring signals, clinical notes, ICD-9 codes, disease severity scores, and clinical diagnoses. Following MIMIC-Extract [38], we screened the raw data, using the first 24 h of each patient’s data and considering only clinical records covering at least 30 h. We preprocessed the samples as follows: missing values in the monitoring data were filled by mean interpolation, and all numerical features were standardized with the Z-score method. Note categories largely irrelevant to diagnosis, such as case management, rehabilitation services, and nutrition, were excluded, and patient samples without clinical records were removed. The final cohort contains 21,080 patient samples, divided into a training set (80%), a validation set (10%), and a test set (10%). The distribution of the dataset is shown in Table 1 and Table 2.
We use the pre-trained clinical NER model Med7 to extract medically relevant keywords to improve the prediction performance of the clinical tasks. The seven types of medical entities extracted from clinical notes are shown in Table 3.
Following MIMIC-Extract, we define the target tasks as follows: (1) In-hospital mortality: patients who die during the hospital stay after ICU admission. (2) In-ICU mortality: patients who die during the ICU stay after ICU admission. (3) Length of stay > 3 (LOS > 3): patients who stay in the ICU longer than 3 days. (4) Length of stay > 7 (LOS > 7): patients who stay in the ICU longer than 7 days. Since the dataset suffers from class imbalance, we use AUROC and AUPRC to evaluate performance on these imbalanced tasks: AUROC is a commonly used robust metric for imbalanced datasets, and AUPRC excludes true negatives from its calculation, making it particularly suitable for datasets with abundant negative instances. To evaluate the model more comprehensively, we also compare the F1-scores of the different models on the same test set. We run all experiments five times with different initializations and report the mean and standard deviation of the results. The comparison models include MTL [39], MDCNN [40], CMN [41], BCB + LSTM [42], PM2F2N [43], and DESAM-cp [44]. The performance of the different models is shown in Table 4 and Table S1.
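For reference, these metrics can be computed with scikit-learn as sketched below; the arrays are toy placeholders, and average precision is used as the usual AUPRC estimator.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

# y_score would be the model's predicted probabilities on the test set;
# the values below are placeholders for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.1, 0.9])

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)    # AUPRC estimator
f1 = f1_score(y_true, (y_score >= 0.5).astype(int)) # thresholded F1
print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}  F1={f1:.3f}")
```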

3.2. Results and Analysis

The experimental results show that the proposed model significantly outperforms all baseline models across all prediction tasks. For in-hospital mortality prediction, D4Care achieves improvements of 2.7–7.8% in AUROC and 3–15.8% in AUPRC over the baselines. Compared with the best comparative model, DESAM-cp, AUROC and AUPRC increase by 2.7% and 3%, respectively, indicating that the proposed model can effectively extract multi-modal features for clinical outcome prediction. To evaluate performance comprehensively across multiple clinical prediction tasks, we use a radar chart visualization, shown in Figure 5. D4Care exhibits consistent advantages across all prediction tasks: its curve on the radar chart encloses those of the other models and extends evenly across all dimensions without any performance dips. Specifically, in the mortality prediction tasks, the model achieves AUROC scores of 0.9203 and 0.9189, with standard deviations below 0.005, demonstrating high sensitivity and reliability in critical care risk identification. In the length-of-stay tasks, the prediction performance for short-term hospitalization (LOS > 3) is particularly strong (AUROC = 0.8484). Notably, the model’s gains show task-specific characteristics: the improvement in mortality prediction is mainly reflected in the accurate recognition of minority-class samples (AUPRC increased by 4.7%), while in length-of-stay prediction the model significantly surpasses the comparison models by integrating cross-modal features.
From the experimental results, we can see that the CNN-based models (MTL, MDCNN, and CMN) rely on CNNs to extract clinical features but are constrained by the long-term dependency problem, making it difficult to learn contextual relationships across long sentences and limiting their prediction performance. The attention-based models (BCB + LSTM, PM2F2N, and DESAM-cp) all achieve good performance on the same test set. We speculate that this is because they use attention to extract feature information, allowing the models to better learn the interdependence between features. This also supports the rationale for the cross-modal attention mechanism in our MDCA model, which helps further optimize the cross-modal interaction between clinical notes and time series data. Although attention-based models can achieve better performance through feature dependency learning, BCB + LSTM does not fully consider the dependencies between sentences, and simply averaging sentence embeddings limits its ability to capture key information. We use a BiGRU network to process time series data: BiGRU processes the sequence in both forward and backward directions and can capture past and future information simultaneously. This bidirectional information flow enables it to understand the dependencies in the time series more comprehensively, thereby providing more accurate feature representations for clinical outcome prediction. LSTM mainly processes data in the forward direction; although it can capture long-term dependencies, it may miss key features due to the lack of backward information flow. In medical data, a patient’s health status is affected not only by past medical history but also by subsequent examination and treatment trends, so this property of BiGRU is particularly important for capturing changes in patient health. The experimental results show that BiGRU achieves excellent performance in multiple clinical outcome prediction tasks, further verifying its theoretical advantages. However, BiGRU requires more computing resources and training time, and may face vanishing or exploding gradients when processing very long sequences; in future work, strategies such as gradient clipping can be considered to alleviate these problems.
From the perspective of feature learning, PM2F2N and CMN extract key medical features and combine them with time series data, improving performance to some extent, but with certain limitations. First, such models only focus on discrete medical entities and fail to fully learn the fine-grained interactions between cross-modal data. Second, because clinical notes contain a large number of entity categories, training is prone to over-reliance on specific entity types. Finally, these models ignore the structural associations and temporal dependencies between entities, which ultimately limits their prediction performance. Based on this analysis, this paper proposes a dual-granularity attention mechanism that captures key medical concepts through entity-level features from a local perspective and models global semantic relationships using sentence-level features from a global perspective. Compared with single-granularity modeling, this local–global collaborative feature learning mechanism can more comprehensively learn the hierarchical semantic information of clinical notes, thereby significantly improving prediction performance. The experimental results show that a multi-granularity modeling strategy that considers both entity details and contextual relationships is more in line with the cognitive process of clinical diagnosis.
Thanks to its sentence-level self-attention mechanism for learning the joint expression of features, DESAM-cp is second only to our model, which both confirms the importance of learning global feature interdependence and provides a basis for the design of our model. However, DESAM-cp does not surpass our model. We speculate that this is because our model constructs an “entity–sentence” multi-granularity feature learning mechanism that inherits the sentence-level modeling advantages of DESAM-cp while introducing an entity-level attention mechanism to more completely simulate the cognitive process of clinical diagnosis. Second, compared with DESAM-cp, we introduce a dynamic memory component that enables the model to jointly represent a patient’s current clinical manifestations and historical disease evolution; this temporal perception is closer to the real clinical decision-making process. In addition, due to the heterogeneity of multi-modal data and the complexity of the parameter space, simple heterogeneous feature concatenation cannot effectively improve the efficiency of feature fusion. We therefore design MacLN to learn representations of heterogeneous inputs, which effectively alleviates the challenges of multi-modal data fusion. At the same time, MacLN continuously tracks and integrates historical features through the dynamic memory mechanism and encodes temporal dependencies into the feature representation, simulating a doctor’s progressive diagnostic reasoning and verifying the rationality of MacLN. Furthermore, the DGM module uses an attention-weight-based feature selection mechanism, enabling the model to ignore noise, automatically focus on diagnosis-related features, and adaptively update memory information. Working together, these modules allow the model to fully understand a patient’s complete medical history, accurately integrate heterogeneous features, and capture deep cross-modal correlations, thereby improving the accuracy of clinical predictions.

3.3. Performance on Imbalanced Data

In the MIMIC-III dataset, the proportion of hospitalized patients who died is much lower than that of surviving patients, i.e., the samples are imbalanced. To further verify the robustness of our model under imbalanced conditions, we constructed sub-datasets with different class ratios. For the in-hospital and in-ICU mortality prediction tasks, we took the 1860 patients who died and the 19,240 patients who survived, and randomly down-sampled the surviving (majority-class) patients to obtain death-to-survival ratios of 1:1, 1:3, 1:5, 1:7, and 1:10. Finally, we selected PM2F2N and DESAM-cp as baseline models for comparison. The experimental results are shown in Figure 6.
As shown in Figure 6, as the degree of data imbalance increases (from 1:1 to 1:10), the AUROC and AUPRC of all models trend downward, and the degree of decline is positively correlated with the degree of imbalance. PM2F2N decreases most significantly, followed by DESAM-cp, while the proposed model declines the slowest, showing a clear robustness advantage. We speculate that this difference is mainly due to two core designs: first, the MacLN effectively stabilizes feature learning under different data distributions through its adaptive parameter adjustment mechanism; second, the cross-modal attention mechanism can automatically focus on the key features of minority-class samples, preventing the model from being dominated by the majority class. It is worth noting that even under extreme imbalance (1:10), the proposed model still maintains high AUROC and AUPRC values, significantly better than PM2F2N and DESAM-cp, demonstrating the practical value of the proposed method in real clinical scenarios.

3.4. Ablation Study

To evaluate the effectiveness of the modules designed for D4Care, we conducted ablation experiments on the test set. The experimental results are shown in Table 5, Table S2, and Figure 7.
We evaluate the individual impact and joint contribution of the SAA, MDCA, DM, and MacLN modules on model performance. In addition, we compare performance with single-modal and multi-modal data; Only-TS means that only time series data are used. The ablation results show that D4Care outperforms all its variants in all tasks, which demonstrates the effectiveness of the proposed modules. Compared with Only-TS, which models only time series data, the cross-modal fusion model shows superior performance. When the model does not use the MDCA to learn the joint expression of cross-modal features but simply fuses the temporal features and medical entities, performance drops by 1–4%, which shows that the MDCA designed in this paper helps the model learn the latent correlations between cross-modal features and enhances the joint expression of multi-scale features. When the model does not consider dependencies between sentences, i.e., lacks the SAA, performance also decreases. We speculate that simple average pooling over sentences leads the model to learn generic features while ignoring latent key information and important relationships in the text. The SAA allows the model to learn feature mappings in different subspaces, which helps capture correlations between sentences and lets the model attend more closely to fine-grained key features in clinical notes. Moreover, by learning the global semantic information and local structural features of the text from different perspectives, the model further improves the joint expression of features, thereby improving clinical prediction performance.
We also analyze the MacLN and DM modules. The experimental results show that both improve the model’s prediction performance to a certain extent. If the historical information in the memory is not introduced into layer normalization, performance declines. We speculate that this is because the DM uses attention and gating mechanisms to account for a patient’s important historical information, so the model attends not only to the patient’s current health state but also to their history, in line with a doctor’s reasoning process. Similarly, when the MacLN is removed, performance is also limited. We conclude that this is because MacLN can dynamically adjust the normalization scale and offset parameters according to the key information output by the DM, adapting to different inputs and contexts and thereby improving the stability and generalizability of the model.

3.5. Interpretability Analysis

To improve the interpretability of the model, this paper uses the integrated gradients (IG) [45] method to analyze the impact of input features on the prediction. Integrated gradients is a technique for explaining the predictions of deep neural networks: it linearly interpolates between the input and a reference point and integrates the model’s gradients along this path to obtain an importance score for each feature. The higher the IG value, the more important the feature is to the prediction. We calculated the IG values of all words in the clinical notes of the test set and ranked the features accordingly. As shown in Table 6, we select the top 18 clinically significant terms to analyze their impact on mortality prediction performance.
As shown in Table 6, clinically significant words such as “pain”, “fever”, and “cough” exhibit high IG values. These terms represent common symptoms in ICU care and are strongly correlated with the severity of a patient’s condition, making them important indicators for model predictions. Other terms, such as “seizure”, are often obvious manifestations of acute clinical events and also carry prognostic significance for mortality prediction. In contrast, common words in clinical notes, such as “will”, “possible”, and “the”, lack specialized medical information and provide little useful semantic value. By computing IG values for different words and analyzing the top 18 clinically significant contributors to mortality prediction, the model can provide a rational and reliable basis for its predictions, assisting clinicians in making more accurate diagnostic and treatment decisions.
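One practical way to compute such attributions is Captum’s IntegratedGradients, sketched below under the assumption that the model maps token embeddings to an outcome logit; the helper function and its arguments are illustrative, not the authors’ exact procedure.

```python
import torch
from captum.attr import IntegratedGradients

def attribute_note(model, token_embeddings, target_class=0):
    """Hedged sketch: IG attribution over token embeddings with a zero
    baseline; summing over the embedding dimension gives one word-level
    importance score per token."""
    ig = IntegratedGradients(model)
    baseline = torch.zeros_like(token_embeddings)  # reference point
    attributions = ig.attribute(
        token_embeddings, baselines=baseline, target=target_class, n_steps=50
    )
    # Sum over the embedding dimension -> one IG score per word.
    return attributions.sum(dim=-1)
```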

4. Conclusions

To address the key challenges in clinical outcome prediction, we propose D4Care, a novel deep dynamic memory-driven cross-modal feature representation network. Specifically, we use a BiGRU network to capture dynamic features in time series data and a dual-view feature encoding model with sentence-aware and entity-aware capabilities to extract clinical text features from global semantic and local concept perspectives, respectively. Furthermore, we introduce a memory-driven cross-modal attention mechanism, which dynamically establishes deep correlations between clinical text and time series features through learnable memory matrices, and a memory-aware constrained layer normalization to alleviate the challenges of cross-modal feature heterogeneity. We also use gating mechanisms and dynamic memory components to enable the model to learn feature information across different historical and current patterns, further improving performance. Experimental results on the MIMIC-III dataset show that the proposed model outperforms existing advanced models in the prediction tasks, and that it maintains high accuracy and robust performance even under imbalanced sample conditions. In future work, we will add significance testing and study model complexity, training cost, inference time, and system latency to enhance the model’s comprehensive auxiliary ability in clinical applications. We will also further explore case-level interpretability analysis to improve the reliability and practicality of the model.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app15116054/s1, Table S1: Performances of different models on the same testing set (F1-score); Table S2: The ablation experiment results of D4Care (F1-score).

Author Contributions

Conceptualization, B.C.; methodology, B.C.; software, B.C.; validation, B.C.; formal analysis, B.C.; data curation, B.C.; writing—original draft preparation, B.C.; writing—review and editing, G.L.; supervision, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI: Artificial intelligence
BiGRU: Bi-directional Gated Recurrent Unit
BCB: Bio + ClinicalBERT
EHRs: Electronic health records
DM: Dynamic memory model
DGM: Dynamic information gate mechanism
DVFE: Dual-view feature encoding model
MacLN: Memory-aware constrained layer normalization
MDCA: Memory-driven cross-modal attention model
NER: Named entity recognition
SAA: Sentence-aware attention model
TS: Time series model

References

1. Yang, C.; Kors, J.A.; Ioannou, S.; John, L.H.; Markus, A.F.; Rekkas, A.; Ridder, M.A.J.d.; Seinen, T.M.; Williams, R.D.; Rijnbeek, P.R. Trends in the conduct and reporting of clinical prediction model development and validation: A systematic review. J. Am. Med. Inform. Assoc. 2022, 29, 983–989.
2. Awad, A.; Bader-El-Den, M.; Mcnicholas, J.; Briggs, J. Early Hospital Mortality Prediction of Intensive Care Unit Patients Using an Ensemble Learning Approach. Int. J. Med. Inform. 2017, 108, 185–195.
3. Zhang, Q.; Chen, B.; Liu, G. Artificial intelligence can dynamically adjust strategies for auxiliary diagnosing respiratory diseases and analyzing potential pathological relationships. J. Breath Res. 2023, 17, 046007.
4. Chaudhry, B.; Wang, J.; Wu, S.; Maglione, M.; Mojica, W.; Roth, E.; Morton, S.C.; Shekelle, P.G. Systematic Review: Impact of Health Information Technology on Quality, Efficiency, and Costs of Medical Care. Ann. Intern. Med. 2006, 144, 742–752.
5. Burger, M.; Rätsch, G.; Kuznetsova, R. Multi-modal Graph Learning over UMLS Knowledge Graphs. In Proceedings of the Machine Learning for Health (ML4H), New Orleans, LA, USA, 10 December 2023; PMLR: Birmingham, UK, 2023; pp. 52–81.
6. Gomes, B.; Pilz, M.; Reich, C.; Leuschner, F.; Konstandin, M.; Katus, H.A.; Meder, B. Machine learning-based risk prediction of intrahospital clinical outcomes in patients undergoing TAVI. Clin. Res. Cardiol. 2020, 110 (Suppl. S1), 343–356.
7. Zhang, X.; Qian, B.; Li, Y.; Liu, Y.; Chen, X.; Guan, C.; Li, C. Learning robust patient representations from multi-modal electronic health records: A supervised deep learning approach. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), Alexandria, Egypt, 29 April–1 May 2021; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2021; pp. 585–593.
8. Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. DeepCare: A Deep Dynamic Memory Model for Predictive Medicine; Springer: Cham, Switzerland, 2016.
9. Huang, S.C.; Pareek, A.; Seyyedi, S.; Banerjee, I.; Lungren, M.P. Fusion of medical imaging and electronic health records using deep learning: A systematic review and implementation guidelines. npj Digit. Med. 2020, 3, 136.
10. Soenksen, L.R.; Ma, Y.; Zeng, C.; Boussioux, L.; Carballo, K.V.; Na, L.; Wiberg, H.M.; Li, M.L.; Fuentes, I.; Bertsimas, D. Integrated multimodal artificial intelligence framework for healthcare applications. npj Digit. Med. 2022, 5, 149.
11. Yang, B.; Wu, L. How to Leverage Multimodal EHR Data for Better Medical Predictions? arXiv 2021.
12. Ma, M.; Ren, J.; Zhao, L.; Tulyakov, S.; Wu, C.; Peng, X. SMIL: Multimodal Learning with Severely Missing Modality. arXiv 2021.
13. Msosa, Y.J.; Grauslys, A.; Zhou, Y.; Wang, T.; Buchan, I.; Langan, P.; Foster, S.; Walker, M.; Pearson, M.; Folarin, A.; et al. Trustworthy Data and AI Environments for Clinical Prediction: Application to Crisis-Risk in People With Depression. J. Biomed. Health Inform. 2023, 27, 11.
14. Yang, Z.; Mitra, A.; Liu, W.; Berlowitz, D.; Yu, H. TransformEHR: Transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat. Commun. 2023, 14, 7857.
15. Van Aken, B.; Papaioannou, J.M.; Mayrdorfer, M.; Budde, K.; Gers, F.; Loeser, A. Clinical outcome prediction from admission notes using self-supervised knowledge integration. arXiv 2021, arXiv:2102.04110.
16. Chandak, P.; Huang, K.; Zitnik, M. Building a knowledge graph to enable precision medicine. Sci. Data 2023, 10, 67.
17. Sauer, C.M.; Chen, L.C.; Hyland, S.L.; Girbes, A.; Elbers, P.; Celi, L.A. Leveraging electronic health records for data science: Common pitfalls and how to avoid them. Lancet Digit. Health 2022, 4, e893–e898.
18. Jiang, P.; Xiao, C.; Cross, A.; Sun, J. GraphCare: Enhancing Healthcare Predictions with Personalized Knowledge Graphs. arXiv 2023, arXiv:2305.12788.
19. Zhao, Y.; Hong, Q.; Zhang, X.; Deng, Y.; Wang, Y.; Petzold, L. Bertsurv: Bert-based survival models for predicting outcomes of trauma patients. arXiv 2021, arXiv:2103.10928.
20. Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Sci. Rep. 2016, 6, 26094.
21. Ye, X.; Wu, J.; Mou, C.; Dai, W. Medlens: Improve mortality prediction via medical signs selecting and regression. In Proceedings of the 2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI), Taiyuan, China, 26–28 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 169–175.
22. Jain, S.; Burger, M.; Rätsch, G.; Kuznetsova, R. Knowledge Graph Representations to enhance Intensive Care Time-Series Predictions. arXiv 2023, arXiv:2311.07180.
23. Zhang, K.; Niu, K.; Zhou, Y.; Tai, W.; Lu, G. MedCT-BERT: Multimodal Mortality Prediction using Medical ConvTransformer-BERT Model. In Proceedings of the 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), Atlanta, GA, USA, 6–8 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 700–707.
24. Thilak, V.; Huang, C.; Saremi, O.; Dinh, L.; Goh, H.; Nakkiran, P.; Susskind, J.M.; Littwin, E. LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures. arXiv 2023, arXiv:2312.04000.
25. Niu, K.; Zhang, K.; Peng, X.; Xiao, N. Deep multi-modal intermediate fusion of clinical record and time series data in mortality prediction. Front. Mol. Biosci. 2023, 10, 1136071.
26. Chen, Z.; Song, Y.; Chang, T.H.; Wan, X. Generating radiology reports via memory-driven transformer. arXiv 2020, arXiv:2010.16056.
27. Lyu, W.; Dong, X.; Wong, R.; Zheng, S.; Abell-Hart, K.; Wang, F.; Chen, C. A multimodal transformer: Fusing clinical notes with structured EHR data for interpretable in-hospital mortality prediction. In Proceedings of the AMIA Annual Symposium, Washington, DC, USA, 5–9 November 2022; American Medical Informatics Association: Bethesda, MD, USA, 2022; Volume 2022, p. 719.
28. An, Y.; Li, R.; Chen, X. Merge: A multi-graph attentive representation learning framework integrating group information from similar patients. Comput. Biol. Med. 2022, 151, 106245.
29. Sun, C.; Chen, D.; Jin, X.; Xu, G.; Tang, C.; Guo, X.; Tang, Z.; Bao, Y.; Wang, F.; Shen, R. Association between acute kidney injury and prognoses of cardiac surgery patients: Analysis of the MIMIC-III database. Front. Surg. 2023, 9, 1044937.
30. Park, Y.; Ho, J.C. Califorest: Calibrated random forest for health data. In Proceedings of the ACM Conference on Health, Inference, and Learning, New York, NY, USA, 2–4 April 2020; pp. 40–50.
31. Xia, Z.; Xu, P.; Xiong, Y.; Lai, Y.; Huang, Z. Survival Prediction in Patients with Hypertensive Chronic Kidney Disease in Intensive Care Unit: A Retrospective Analysis Based on the MIMIC-III Database. J. Immunol. Res. 2022, 2022, 3377030.
32. An, Y.; Cai, G.; Chen, X.; Guo, L. PARSE: A personalized clinical time-series representation learning framework via abnormal offsets analysis. Comput. Methods Programs Biomed. 2023, 242, 107838.
33. Liu, R.; Gutiérrez, R.; Mather, R.V.; Stone, T.A.D.; Mercado, L.A.S.C.; Bharadwaj, K.; Johnson, J.; Das, P.; Balanza, G.; Uwanaka, E.; et al. Development and prospective validation of postoperative pain prediction from preoperative EHR data using attention-based set embeddings. npj Digit. Med. 2023, 6, 209.
34. Ni, P.; Li, Y.; Zhu, J.; Peng, J.; Dai, Z.; Li, G.; Bai, X. Disease diagnosis prediction of EMR based on BiGRU-ATT-CapsNetwork model. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6166–6168.
35. Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2019.
36. Kormilitzin, A.; Vaci, N.; Liu, Q.; Nevado-Holgado, A. Med7: A transferable clinical natural language processing model for electronic health records. Artif. Intell. Med. 2021, 118, 102086.
37. Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. arXiv 2019, arXiv:1904.03323.
38. Wang, S.; Mcdermott, M.B.A.; Chauhan, G.; Ghassemi, M.; Hughes, M.C.; Naumann, T. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III. In Proceedings of the ACM Conference on Health, Inference, and Learning, Toronto, ON, Canada, 2–4 April 2020.
39. Si, Y.; Roberts, K. Deep patient representation of clinical notes via multi-task learning for mortality prediction. AMIA Summits Transl. Sci. Proc. 2019, 2019, 779.
40. Khadanga, S.; Aggarwal, K.; Joty, S.; Srivastava, J. Using clinical notes with time series data for ICU management. arXiv 2019, arXiv:1909.09702.
41. Bardak, B.; Tan, M. Improving clinical outcome predictions using convolution over medical entities with multimodal learning. Artif. Intell. Med. 2021, 117, 102112.
42. Deznabi, I.; Iyyer, M.; Fiterau, M. Predicting in-hospital mortality by combining clinical notes with time-series data. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 4026–4031.
43. Zhang, Y.; Zhou, B.; Song, K.; Sui, X.; Zhao, G.; Jiang, N.; Yuan, X. PM2F2N: Patient multi-view multi-modal feature fusion networks for clinical outcome prediction. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 1985–1994.
44. Lee, S.; Jang, G.; Kim, C.; Park, S.; Yoo, K.; Kim, J.; Kim, S.; Kang, J. Enhancing Clinical Outcome Predictions through Auxiliary Loss and Sentence-Level Self-Attention. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Türkiye, 5–8 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1210–1217.
45. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; PMLR: Birmingham, UK, 2017; pp. 3319–3328.
Figure 1. The overall framework of the proposed model.
Figure 2. The architecture of the MDCA model.
Figure 3. The architecture of the MacLN model.
Figure 4. The architecture of the DGM model.
Figure 5. Radar chart of model performance metrics.
Figure 6. Performance of different models on imbalanced data.
Figure 7. Heatmap of ablation experiment results.
Table 1. Statistical information of the MIMIC-III dataset.

| Data | # Patient | # Hospital | # ICU |
| MIMIC-III (>15 years old) | 38,597 | 49,785 | 53,423 |
| MIMIC-Extract | 34,472 | 34,472 | 34,472 |
| MIMIC-Extract (at least 24 + 6 (gap) hours) | 23,937 | 23,937 | 23,937 |
| Final cohort | 21,080 | 21,080 | 21,080 |
Table 2. Class distribution of the final cohort used in this paper.

| Type | Mortality: In-Hospital | Mortality: In-ICU | LOS > 7 | LOS > 3 |
| Ratio | 89.5%:10.5% | 93%:7% | 56.8%:43.2% | 92.1%:7.9% |
Table 3. The statistics of entities extracted from clinical notes.

| Entity Type | Total Entity | Unique Entity | Example |
| Drug | 742,231 | 18,204 | Magnesium |
| Strength | 152,234 | 10,680 | 400 mg/5 mL |
| Route | 207,876 | 1192 | PO |
| Dosage | 126,756 | 7230 | 30 mL |
| Form | 40,885 | 597 | suspension |
| Frequency | 71,285 | 3279 | bid |
| Duration | 5830 | 1185 | next 5 days |
Table 4. Performances of different models on the same testing set (mean ± standard deviation).

| Model | In-Hospital Mortality AUROC | In-Hospital Mortality AUPRC | In-ICU Mortality AUROC | In-ICU Mortality AUPRC | LOS > 7 AUROC | LOS > 7 AUPRC | LOS > 3 AUROC | LOS > 3 AUPRC |
| MTL | 0.8623 (±0.0143) | 0.5243 (±0.0140) | 0.8611 (±0.0123) | 0.4712 (±0.0088) | 0.8211 (±0.0012) | 0.6423 (±0.0123) | 0.7605 (±0.0043) | 0.8113 (±0.0034) |
| MDCNN | 0.8423 (±0.0042) | 0.5067 (±0.0051) | 0.8402 (±0.0044) | 0.4548 (±0.0062) | 0.8102 (±0.0012) | 0.6236 (±0.0052) | 0.7385 (±0.0095) | 0.7925 (±0.0026) |
| CMN | 0.8678 (±0.0092) | 0.5403 (±0.0012) | 0.8670 (±0.0012) | 0.5032 (±0.0014) | 0.8402 (±0.0012) | 0.6332 (±0.0016) | 0.7608 (±0.0048) | 0.8308 (±0.0053) |
| BCB + LSTM | 0.8850 (±0.0021) | 0.5860 (±0.0048) | 0.8670 (±0.0012) | 0.5322 (±0.0036) | 0.8335 (±0.0019) | 0.6520 (±0.0085) | 0.7886 (±0.0023) | 0.8210 (±0.0019) |
| PM2F2N | 0.8827 (±0.0034) | 0.6178 (±0.0034) | 0.8834 (±0.0023) | 0.5750 (±0.0028) | 0.8609 (±0.0008) | 0.6974 (±0.0011) | 0.8135 (±0.0010) | 0.8492 (±0.0036) |
| DESAM-cp | 0.8933 (±0.0016) | 0.6344 (±0.0023) | 0.8968 (±0.0035) | 0.5822 (±0.0021) | 0.8549 (±0.0094) | 0.6820 (±0.0028) | 0.8043 (±0.0028) | 0.8345 (±0.0032) |
| D4Care | 0.9203 (±0.0028) | 0.6645 (±0.0014) | 0.9189 (±0.0045) | 0.6038 (±0.0033) | 0.8820 (±0.0056) | 0.7048 (±0.0012) | 0.8484 (±0.0011) | 0.8557 (±0.0022) |
Table 5. The ablation experiment results of D4Care (“w/o” means not included; “Only-TS” means only time series data are used, while the other variants use multi-modal data).

| Model | In-Hospital Mortality AUROC | In-Hospital Mortality AUPRC | In-ICU Mortality AUROC | In-ICU Mortality AUPRC | LOS > 7 AUROC | LOS > 7 AUPRC | LOS > 3 AUROC | LOS > 3 AUPRC |
| Only-TS | 0.8420 (±0.0065) | 0.5002 (±0.0082) | 0.8436 (±0.0032) | 0.4587 (±0.0010) | 0.8010 (±0.0013) | 0.6189 (±0.0034) | 0.7432 (±0.0027) | 0.7824 (±0.0031) |
| w/o MDCA | 0.8940 (±0.0008) | 0.6358 (±0.0014) | 0.8964 (±0.0024) | 0.5702 (±0.0027) | 0.8501 (±0.0092) | 0.6822 (±0.0023) | 0.8110 (±0.0024) | 0.8322 (±0.0019) |
| w/o SAA | 0.9147 (±0.0011) | 0.6448 (±0.0027) | 0.9045 (±0.0031) | 0.5842 (±0.0018) | 0.8719 (±0.0010) | 0.6914 (±0.0011) | 0.8303 (±0.0015) | 0.8446 (±0.0011) |
| w/o DM | 0.9138 (±0.0017) | 0.6460 (±0.0028) | 0.9092 (±0.0012) | 0.5851 (±0.0008) | 0.8752 (±0.0016) | 0.6920 (±0.0044) | 0.8326 (±0.0023) | 0.8490 (±0.0017) |
| w/o MacLN | 0.9132 (±0.0082) | 0.6401 (±0.0056) | 0.9032 (±0.0019) | 0.5788 (±0.0021) | 0.8718 (±0.0032) | 0.6811 (±0.0024) | 0.8267 (±0.0020) | 0.8434 (±0.0024) |
| D4Care | 0.9203 (±0.0028) | 0.6645 (±0.0014) | 0.9189 (±0.0045) | 0.6038 (±0.0033) | 0.8820 (±0.0056) | 0.7048 (±0.0012) | 0.8484 (±0.0011) | 0.8557 (±0.0022) |
Table 6. The top 18 words used for model interpretability analysis calculated by IG.

| IG Rank | Clinically Significant Words | Common Words |
| 1 | pain | the |
| 2 | fever | with |
| 3 | cough | medical |
| 4 | respiratory | diagnosis |
| 5 | pneumonia | year |
| 6 | heart | old |
| 7 | brain | will |
| 8 | clear | not |
| 9 | mental | possible |
| 10 | conditions | diseases |
| 11 | failure | from |
| 12 | drug | history |
| 13 | pulses | admitted |
| 14 | sob | end |
| 15 | acute | let |
| 16 | insulin | visit |
| 17 | increasing | unspecified |
| 18 | seizure | status |
