1. Introduction
Clinical practice now accumulates data at a scale that was not previously available, driven by the widespread adoption of electronic health records (EHRs) and bedside physiological monitoring systems. These data reflect changes in patient status, diagnostic processes, treatment decisions, and intervention trajectories at multiple levels. When used appropriately, they can support more precise clinical judgment. At the same time, in settings such as the intensive care unit (ICU), where patient conditions may change rapidly and timely response is critical, it is difficult for clinicians to immediately integrate and interpret all available information. As a result, the demand for data-driven clinical decision support systems (CDSSs) continues to increase. Recent artificial intelligence-based CDSSs have expanded in response to this demand, with growing emphasis on the early detection of diagnostic and therapeutic risks and the prediction of intervention timing and necessity [1]. However, real-world EHR-based clinical data have both structural and representational limitations. Numerical measurements are distributed across multiple tables, are observed at irregular intervals, and frequently contain missing values. Diagnostic and progress records are also often recorded in unstructured forms, including free text and abbreviations. In particular, the International Classification of Diseases (ICD) [2], which provides a standardized coding framework for diseases and clinical conditions, is a central mechanism for normalizing disease states and medical activities. In actual documentation practice, however, codes are often overlapping, incomplete, or only partially recorded, resulting in an imperfect label structure. As the ICD system becomes more fine-grained, automated ICD coding becomes an extreme multi-class classification problem involving thousands to tens of thousands of categories. Simple rule-based methods and conventional classifiers therefore face clear limitations in both performance and scalability. Although a broad range of studies has addressed automated ICD coding through pretrained language models, code-structure learning, and improved document-level semantic representation, studies on CDSSs that integrate such diagnostic information with ICU time-series-based intervention prediction remain limited.
Interventions in the ICU include therapeutic actions such as mechanical ventilation, medication administration, and procedures, all of which are directly related to patient safety and prognosis. Models that predict whether an intervention will be performed, or when a transition in intervention status will occur, have become an important component of CDSSs because they can identify patient deterioration in advance and provide clinicians with more time to respond. In particular, the initiation, discontinuation, and maintenance of mechanical ventilation are closely associated with major prognostic indicators such as reintubation, ICU length of stay, mortality, and ventilation-related complications. For this reason, estimating the likelihood of a ventilation transition within a future time window, rather than merely determining the current ventilation status at each time point, has independent clinical value as a prediction task. Patients at high risk of unplanned ventilation initiation may benefit from advance preparation of staff and equipment, along with additional clinical assessment, which may reduce emergency intubation events. Patients with a high probability of ventilation discontinuation may also benefit from a more active re-evaluation of extubation timing in order to reduce the risks associated with unnecessary continuation of mechanical ventilation, including pulmonary complications, muscle atrophy, and delirium. Nevertheless, ICU time-series data are characterized by irregular measurement intervals, low observation density, and substantial sparsity, while intervention events themselves are also highly imbalanced. These properties make it difficult to reliably distinguish intervention states using only the temporal patterns of physiological signals and weaken decision boundaries, particularly for rare transition states such as initiation and discontinuation. 
Motivated by these issues, this study defines an intervention transition prediction task that integrates ICU time-series data with diagnostic information and predicts, at each time point, whether a mechanical ventilation transition will occur within a predefined prediction horizon or whether the current state will be maintained using a four-class formulation. Effective intervention prediction for clinical decision support therefore requires more than temporal pattern learning alone; it also requires a method that incorporates disease and diagnostic context as conditioning information and systematically improves discrimination in rare-state intervals.
To address this need, we propose a diagnosis-time-series-intervention integrated clinical decision support system for use in the ICU. The proposed framework is designed not only to summarize retrospective patient data or classify the current ventilation state, but also to continuously estimate the patient-specific risk of mechanical ventilation transition along the temporal axis, thereby directly supporting decisions on intervention timing. The framework consists of a two-stage prediction structure. First, we design an automated ICD coding module that infers representative admission-level ICD labels by combining diagnosis code sequences extracted from electronic health records with patient-level static and numerical features. This module incorporates the practical structure of clinical records into the learning process through a partial-label learning strategy that reflects incomplete diagnostic labels, extreme class imbalance, and weak supervision in which the ground-truth label is assumed to exist within a candidate set. Second, we construct an ICU time-series tensor that includes the automated diagnostic labels as categorical context variables, and combine a multi-branch temporal convolutional network (TCN) [3], which reflects the clinical grouping of numerical features, with an ICD-based gating mechanism to predict mechanical ventilation status in four categories. This design enables conditional intervention prediction, such that identical physiological signal patterns may receive different clinical interpretations depending on the diagnostic context.
Although the present study focuses on mechanical ventilation prediction in the ICU as a clinically important application setting, the proposed framework is not limited to this task alone. Its central idea is to use higher-level diagnostic context to condition the interpretation of lower-level time-series patterns, allowing the same physiological trajectory to be interpreted differently depending on the underlying clinical state. In this sense, the proposed “context + time-series” modeling strategy is readily extendable to other ICU intervention prediction problems, such as vasopressor administration, renal replacement therapy, or other treatment-transition events that also depend on both evolving physiological signals and broader disease context. More broadly, the same principle may be applicable to other sequential decision problems, including non-medical domains, whenever temporal patterns must be interpreted in light of higher-level contextual information. The main contributions of this study are as follows.
A hierarchical intervention prediction framework based on diagnostic context: We inject admission-level diagnostic context inferred from automated ICD coding into ICU time-series intervention prediction as a conditioning variable. This enables the model to assign different clinical meanings to the same physiological signal pattern according to the patient’s disease context, and implements an integrated structure in which higher-level diagnostic information modulates lower-level time-series interpretation. This design is not restricted to mechanical ventilation prediction, but is applicable in principle to other intervention prediction tasks that require contextualized interpretation of temporal clinical signals.
Robust diagnostic distribution estimation based on Adaptive CLPL: We formulate automated ICD coding as a candidate-set-based partial-label learning problem. To address the extreme multi-class and long-tail characteristics of ICD coding, we introduce an Adaptive CLPL learning strategy that stably estimates representative diagnostic distributions from structured EHR data alone and produces probabilistic clinical context representations that can be used in downstream tasks.
Mechanical ventilation prediction optimized for rare transition event detection: We redefine mechanical ventilation prediction as a four-class ventilation state transition prediction problem rather than a simple state classification task. By combining feature-group-based representation learning with imbalance-aware training, the proposed method improves discrimination for clinically important rare transition events such as ONSET and WEAN and increases the practical utility of intervention timing support in real ICU settings.
2. Related Works
2.1. Clinical Decision Support System
A clinical decision support system (CDSS) is an information system that assists in diagnostic and therapeutic decision-making by combining patient-specific clinical data with medical knowledge, including rules, guidelines, and evidence [4]. Early CDSSs were often implemented as standalone or rule-based systems that provided recommendations, alerts, reminders, and order support, but their broader adoption was limited because they required separate data entry and were not well aligned with routine clinical workflows [5,6]. As hospital information systems and electronic health records (EHRs) became more widely integrated, CDSSs increasingly shifted toward point-of-care systems embedded in real clinical environments, where they could support medication safety, duplicate order checking, and follow-up management more directly [7,8,9].
More recently, CDSS research has expanded from rule-based support toward data-driven learning systems trained on large-scale clinical data [5,6]. Deep learning-based CDSSs have reported strong performance in areas such as medical imaging, physiological signal analysis, and medication-related safety monitoring [10,11,12,13]. At the same time, recent studies have increasingly moved beyond isolated single-task systems and have explored architectures that connect multiple prediction stages or share information across modules [14,15]. Within this broader trend, an important open question is how higher-level diagnostic information can be linked to lower-level time-series interpretation in a clinically meaningful way. The proposed Context-Adaptive Gated Embedding (CAGE) framework addresses this question by constructing a diagnosis-aware integrated CDSS in which diagnostic context inferred from automated ICD coding is reinjected into downstream intervention prediction.
2.2. Automated ICD Coding
Automated ICD coding is not a simple document classification problem, because ICD codes exhibit hierarchical structure, co-occurrence patterns, and extreme imbalance across a very large label space. Early studies relied on rule-based, information-retrieval, or clinical NLP approaches. Representative examples include systems that combined structured retrieval with clinical text search [16] and NLP-based coding pipelines that explicitly handled contextual modifiers such as negation, temporality, and certainty [17]. Later work moved toward statistical and multilabel learning approaches, including structured feature fusion [18], semantic similarity modeling [19,20], label-aware embedding models [21], and sequence generation formulations [22]. Recent ICD coding research has been dominated by deep multilabel and multimodal models that combine text, structured variables, and code-structure information. Examples include attention-based multilabel models such as CAML [23], multimodal fusion models [24], hierarchical attention models [25], and graph- or geometry-aware approaches such as HyperCore [26]. These studies have substantially improved coding performance, but several challenges remain central in ICU settings: extreme long-tail label distributions, unstable learning under partial-label or under-coded supervision, and limited use of predicted diagnostic information in downstream clinical decision support.
For the present study, the key issue is not only how to improve ICD prediction accuracy itself, but also how to obtain a diagnostic representation that can function as higher-level clinical context for downstream intervention prediction. Because prior work does not provide a standard comparison set tailored to this ICU diagnostic-context setting, the comparative experiments in Section 4 are organized around representative model families with different inductive biases, including recurrent, Transformer-based, continuous-time, and structured-feature interaction models. This comparison is intended to test which type of representation is most suitable for stable diagnostic-context formation under large label spaces, structured ICU inputs, and partial-label supervision. In contrast to prior work that treats ICD codes as terminal outputs, CAGE uses ICD prediction as an intermediate representation learning stage and converts the predicted diagnostic distribution into admission-level context for downstream time-series interpretation.
2.3. Extreme Multi-Class Classification
A major difficulty in automated ICD coding is the size of the label space. ICD-11 contains tens of thousands of entities, and even partially covered EHR datasets often contain hundreds to thousands of clinically relevant codes. Such settings are commonly described as extreme multi-class or extreme multi-label classification problems. In these settings, naive one-vs.-all approaches become computationally expensive and tend to perform poorly on rare labels. To address this issue, prior work has studied hierarchical classification and scalable label-partitioning methods that reduce computational burden while preserving semantic structure. Representative examples include LOMtree [27], which combines online learning with purity-balanced hierarchical splitting, and more recent approaches that jointly learn category similarity and hierarchical structure from data [28,29].
These studies are directly relevant to ICD coding because they clarify how large label spaces, long-tail distributions, and structural dependencies interact during learning. However, most existing XMC approaches still treat labels as final classification targets and focus primarily on improving prediction efficiency or top-level accuracy. In this study, the large ICD label space is instead used as the basis for constructing a probabilistic diagnostic context representation. Accordingly, the later experiments compare model families that differ not only in scalability, but also in how they capture temporal dynamics, sequential dependencies, or structured feature interactions under large label spaces. In this sense, CAGE extends the role of extreme multi-class prediction from endpoint classification to a context construction stage for downstream clinical prediction.
2.4. Intervention Prediction
Intervention prediction aims to estimate future diagnostic or therapeutic actions from evolving patient data and is an important function of ICU-oriented CDSSs. Earlier studies often focused on specific tasks such as ventilation weaning or extubation risk using classical machine learning methods, including support vector machines, logistic regression, and tree-based models [30,31,32,33]. As larger ICU datasets became available, later work increasingly adopted deep learning models that explicitly model temporal structure. A representative study [34] compared logistic regression, LSTM, and CNN models for forward-facing ICU intervention prediction and defined clinically meaningful intervention states such as onset, weaning, stay on, and stay off across multiple interventions, including invasive ventilation and vasopressors [35]. This line of work established an important baseline framework for intervention prediction from ICU time-series data. Subsequent studies extended this direction by using sequence models and graph-based architectures to capture more complex temporal and inter-variable dependencies [35,36,37]. These models improved predictive performance for mechanical ventilation-related tasks and other ICU interventions, but most of them still relied primarily on physiological time-series patterns themselves. As a result, disease background and diagnostic context were often treated only implicitly or were not reinjected into the prediction process in an explicit hierarchical manner. In addition, irregular sampling and event imbalance remain central obstacles for robust deployment in real ICU environments.
To reflect these methodological differences, the comparative experiments in Section 4 consider representative baseline families for intervention prediction, including classical statistical classifiers, sequential deep learning models, and graph-based extensions. The aim is not only to compare predictive performance, but also to examine whether explicit diagnostic-context conditioning provides an advantage over models that rely mainly on temporal signals alone. In contrast to prior approaches, CAGE integrates admission-level diagnostic context with time-series intervention prediction through diagnosis-aware gating, allowing the same physiological pattern to be interpreted differently depending on the patient’s disease background.
3. Methods
The proposed CAGE (Context-Adaptive Gated Embedding) network is a hierarchically integrated framework in which admission-level diagnostic context conditionally modulates the interpretation of time-series features. Rather than treating diagnostic estimation and intervention prediction as two independent procedures, CAGE links them within a single end-to-end training workflow in which diagnostic context is learned as a higher-level conditioning signal for downstream temporal prediction.
As shown in Figure 1, the overall CAGE framework consists of three functional stages, while Algorithm 1 summarizes how these stages are jointly executed during training. First, for each ICU stay, irregular numerical measurements are organized into a fixed-window three-channel tensor that preserves observed values, missingness patterns, and temporal freshness through VAL, MSK, and DELTA encoding, while the prediction target is formulated as a four-class ventilation transition outcome. Second, the admission-level diagnostic context is estimated from structured clinical information under a partial-label ICD coding setting, and the resulting probability distribution is transformed into a continuous diagnostic embedding. Third, this higher-level diagnostic context is used to generate a diagnosis-aware gate that modulates TCN-based temporal features extracted from the numerical time-series branch. The final prediction is then produced from this context-adjusted representation, allowing the same physiological pattern to be interpreted differently depending on the patient’s disease background. During training, diagnostic context extraction and intervention prediction are optimized jointly through a combined objective composed of the diagnostic-context loss and the intervention prediction loss. The following subsections describe these three stages in detail: Section 3.1 presents the clinical task formulation and data representation, Section 3.2 explains diagnostic context extraction, and Section 3.3 describes the context-adaptive intervention prediction mechanism.
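The second and third stages can be illustrated with a minimal sketch. It assumes that the diagnostic embedding is a probability-weighted average of per-code embedding vectors and that the gate is a sigmoid of a linear map applied element-wise to the temporal features. The function names, the sigmoid form, and the toy numbers are illustrative assumptions and are not taken from the CAGE implementation.

```python
import math

def context_embedding(probs, code_embeddings):
    """Probability-weighted average of per-code embedding vectors."""
    dim = len(code_embeddings[0])
    return [sum(p * emb[j] for p, emb in zip(probs, code_embeddings))
            for j in range(dim)]

def diagnosis_gate(context, weights, bias):
    """Assumed gate form: sigmoid(W c + b), one value in (0, 1) per feature."""
    gate = []
    for row, b in zip(weights, bias):
        z = sum(w * c for w, c in zip(row, context)) + b
        gate.append(1.0 / (1.0 + math.exp(-z)))
    return gate

def apply_gate(temporal_features, gate):
    """Element-wise modulation of temporal features by the gate."""
    return [h * g for h, g in zip(temporal_features, gate)]

# Toy example: 3 ICD classes, 2-dimensional code embeddings, 3 temporal features.
probs = [0.7, 0.2, 0.1]                       # predicted diagnostic distribution
code_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = context_embedding(probs, code_embeddings)

weights = [[2.0, -1.0], [0.0, 0.0], [-3.0, 1.0]]  # hypothetical gate parameters
bias = [0.0, 0.0, 0.0]
gate = diagnosis_gate(context, weights, bias)

features = [1.0, 4.0, -2.0]                   # stand-in for TCN output features
gated = apply_gate(features, gate)
```

With a different diagnostic distribution, the same `features` vector would be scaled differently, which is the sense in which an identical physiological pattern can receive a different interpretation.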
3.1. Clinical Task Formulation and Data Representation
This section formulates the ICU prediction problem as a mathematical modeling problem that reflects a clinically meaningful decision-making setting. ICU electronic health records are not fully observed regularly sampled time-series; rather, they are closer to irregular observation sets in which measurement intervals differ across variables and missingness occurs structurally. The prediction target should also be defined not as a binary classification of the current state, but as the identification of future transitions in intervention status within a predefined time horizon. Accordingly, CAGE is designed to preserve irregularity and information freshness in the input time-series, while explicitly separating clinically meaningful transition events, rather than ventilation status itself, as the prediction targets. To this end, this section describes, in sequence, the sample construction principle that guarantees patient-level independence, the three-channel representation of irregular time-series data, and the formulation of the four-class ventilation state transition prediction task.
Algorithm 1. Training workflow of the CAGE framework
Require: raw patient cohort, candidate ICD label set L, learning rate
Ensure: trained model parameters
1: … ▹ Patient-level independence
2: …
3: for each training epoch do
4: for each sample do
5: …
6: … ▹ Partial-label supervision
7: …
8: …
9: …
10: … ▹ Context-adaptive gating
11: … ▹ Generate logits
12: … ▹ Class probabilities
13: …
14: …
15: …
16: …
17: end for
18: end for
19: return trained model parameters
3.1.1. Patient-Level Sample Definition and Leakage Control
The basic unit of analysis in this study is neither the patient level nor the admission level, but a sample defined with respect to a single ICU stay. Each sample is represented by a tuple consisting of the patient identifier, the hospital admission identifier, and the ICU stay identifier.
This ICU-stay-level formulation makes it possible to align time-series observations, diagnostic information, and intervention timing within the same clinical event unit, and ensures that the diagnostic context extraction layer and the intervention prediction layer share a consistent reference unit. Consistent data extraction and standardization were performed by integrating MIMIC-Extract [38] and MIMIC-Code [39], thereby maintaining consistently defined inputs and labels for reproducibility. However, in large-scale EHR environments such as MIMIC-III [40], the same patient may have multiple admissions or repeated ICU stays. If random splitting is performed at the admission level or ICU-stay level, records from the same patient may appear simultaneously in the training and validation or test sets. Under such a setting, the model may memorize chronic underlying diseases, long-term physiological characteristics, or repeatedly observed diagnostic histories of individual patients. As a result, the reported performance may overestimate patient re-identification ability rather than true generalization performance.
To prevent such identifier-based contamination, the training, validation, and test splits were constructed to satisfy patient-level independence. In addition, when multiple ICU stays were available for the same patient, only the earliest ICU stay was retained in the cohort so that each patient contributed only one independent sample. Accordingly, the analyzable cohort was restricted from the full raw sample set to cases satisfying three conditions. First, only adult patients were included. Second, only the first ICU stay was used, to remove dependency and information redundancy arising from repeated stays of the same patient. Third, length of stay (LOS) was restricted to more than 24 h and no more than 240 h in order to reduce distortion caused by extremely short or long hospitalization periods. This cohort definition secures a sufficient observation window while preventing extreme cases from dominating the learning process. By explicitly incorporating the patient-level independence constraint into the evaluation design, this study ensures that the reported performance reflects genuine patient-level generalization.
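The cohort rules and the patient-level split can be sketched as follows. The record keys `subject_id`, `hadm_id`, `icustay_id`, and `intime` mirror MIMIC-III column names, while `age`, `los_h`, the `min_age` threshold, the split fractions, and the order in which the filters are applied are assumptions made for illustration, since the text does not specify them here.

```python
import random

def select_cohort(stays, min_age, min_los_h=24.0, max_los_h=240.0):
    """Keep each patient's earliest ICU stay, then apply age and LOS filters."""
    first_stay = {}
    for stay in stays:
        pid = stay["subject_id"]
        if pid not in first_stay or stay["intime"] < first_stay[pid]["intime"]:
            first_stay[pid] = stay
    # LOS must exceed 24 h and be at most 240 h.
    return [s for s in first_stay.values()
            if s["age"] >= min_age and min_los_h < s["los_h"] <= max_los_h]

def split_by_patient(cohort, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole patients (never individual stays) to train/val/test."""
    patient_ids = sorted({s["subject_id"] for s in cohort})
    random.Random(seed).shuffle(patient_ids)
    n_train = int(fractions[0] * len(patient_ids))
    n_val = int(fractions[1] * len(patient_ids))
    groups = {
        "train": set(patient_ids[:n_train]),
        "val": set(patient_ids[n_train:n_train + n_val]),
        "test": set(patient_ids[n_train + n_val:]),
    }
    return {name: [s for s in cohort if s["subject_id"] in ids]
            for name, ids in groups.items()}

# Toy records; all values are made up for illustration.
stays = [
    {"subject_id": 1, "hadm_id": 10, "icustay_id": 100, "intime": 5, "age": 64, "los_h": 72.0},
    {"subject_id": 1, "hadm_id": 11, "icustay_id": 101, "intime": 9, "age": 65, "los_h": 30.0},
    {"subject_id": 2, "hadm_id": 12, "icustay_id": 102, "intime": 3, "age": 40, "los_h": 12.0},
    {"subject_id": 3, "hadm_id": 13, "icustay_id": 103, "intime": 4, "age": 12, "los_h": 48.0},
    {"subject_id": 4, "hadm_id": 14, "icustay_id": 104, "intime": 7, "age": 70, "los_h": 240.0},
    {"subject_id": 5, "hadm_id": 15, "icustay_id": 105, "intime": 8, "age": 55, "los_h": 24.0},
]
cohort = select_cohort(stays, min_age=18)  # min_age=18 is an assumed threshold
splits = split_by_patient(cohort)
```

Because splitting operates on patient identifiers, no patient can appear in more than one partition, even if the first-stay restriction were later relaxed.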
3.1.2. Irregular Time-Series Encoding
A central property of ICU electronic health records is irregular sampling and structural missingness. Physiological signals and laboratory measurements are collected at different frequencies across variables, and missingness at a given time point may reflect not only absent documentation but also clinical judgment, such as the absence of need for additional testing or a stable patient condition. Imputing missing values by simple methods such as mean substitution may therefore distort both the clinical meaning carried by missingness and the temporal freshness of information. To reflect this property, this study uses a three-channel vector for each variable d and time point t, jointly encoding the measured value, the observation indicator, and the elapsed time since the last observation.
Here, the first channel is the normalized measured value (VAL), the second is the observation mask (MSK) indicating whether variable d was actually observed at time t, and the third is the elapsed time since the last observation (DELTA). Among these components, the DELTA channel plays a role beyond simple missingness marking. Even when two observations have the same value, a recently measured value and a value measured a long time earlier may differ substantially in how well they reflect the current patient state. Therefore, DELTA explicitly provides information freshness, allowing the model to learn the decreasing reliability of stale observations. The DELTA channel is defined as follows. If variable d has been observed before time t, the elapsed time is computed from its most recent observation time. Otherwise, to avoid an undefined DELTA value when the variable has never been observed within the current observation window, it is initialized as the elapsed time from the start of the window. This initialization provides numerical stability and also allows variables that have not yet been measured at all to be represented with a consistent temporal encoding.
The actual model input is constructed by stacking this three-channel representation over a fixed-length observation window. For each sample i and reference time point t, a sliding observation window of length T is extracted to form a dynamic time-series tensor of shape T × D × 3. Here, T denotes the observation window length, and D denotes the number of dynamic numerical variables. The last dimension corresponds to VAL, MSK, and DELTA, respectively. As a result, this tensor is not a simple value matrix, but a multi-channel time-series tensor that preserves observed values, missingness patterns, and temporal freshness simultaneously. This representation retains the irregular structure of ICU data while providing a structured input format that can be directly processed by the subsequent TCN-based temporal representation learner.
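A minimal sketch of the VAL, MSK, and DELTA encoding is given below, assuming an hourly grid. Two choices are assumptions that the text leaves open: the VAL channel carries the last observed value forward (and is 0.0 before the first observation), and DELTA is 0 at a step where the variable is observed. The function names are hypothetical.

```python
def encode_variable(observations, window_length):
    """Encode one variable over steps 0..window_length-1.

    observations maps a time step to a normalized value.
    Returns a list of (VAL, MSK, DELTA) triples.
    """
    channels = []
    last_value = 0.0      # assumed placeholder before the first observation
    last_step = None
    for t in range(window_length):
        if t in observations:
            last_value = observations[t]
            last_step = t
            channels.append((last_value, 1, 0.0))
        elif last_step is None:
            # Never observed so far: elapsed time from the start of the window.
            channels.append((0.0, 0, float(t)))
        else:
            # Carry the last value forward and report how stale it is.
            channels.append((last_value, 0, float(t - last_step)))
    return channels

def build_tensor(series_by_variable, window_length):
    """Stack per-variable encodings into a nested list of shape T x D x 3."""
    per_variable = [encode_variable(obs, window_length)
                    for obs in series_by_variable]
    return [[list(per_variable[d][t]) for d in range(len(per_variable))]
            for t in range(window_length)]

# Toy example with two variables over a 5-step window.
heart_rate = {0: 0.2, 3: -0.5}   # observed at steps 0 and 3
lactate = {2: 1.1}               # first observed at step 2
tensor = build_tensor([heart_rate, lactate], window_length=5)
```

In this toy window, the lactate entry at step 1 has MSK 0 and DELTA 1.0 because the variable has not yet been measured, while the heart-rate entry at step 2 keeps the value from step 0 with DELTA 2.0.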
3.1.3. 4-Class Intervention Transition Formulation
This study formulates intervention prediction not as a binary classification of whether mechanical ventilation is currently being applied, but as a prediction problem over future ventilation state transitions. A setting that only determines ventilation status at a single time point is insufficient to capture clinically important transition events such as future ventilation initiation or discontinuation. In contrast, ICU decision-making is more directly concerned with detecting the direction of near-future intervention change in advance, rather than only the present state itself. For this reason, this study defines a four-class transition prediction task in which the input consists of the observation window up to the current time point t, and the target is the ventilation state pattern over a future prediction interval that begins after a prediction lag. The target label takes one of four classes: ONSET, WEAN, STAY ON, and STAY OFF. The class is assigned from the binary ventilation state (ventilated or non-ventilated) observed over the future prediction interval.
This definition is important because the label is determined by the pattern of state change over the entire future interval rather than by the state value at a single time point. ONSET refers to cases in which a transition from the non-ventilated state to the ventilated state occurs at least once within the future interval, while WEAN refers to the opposite transition. STAY ON and STAY OFF correspond to cases in which the entire future interval remains fully ventilated or fully non-ventilated, respectively. The model is therefore trained not merely to classify the current ventilation state, but to predict the direction of near-future intervention change. This formulation is not simply a matter of label engineering; it is a task definition that explicitly isolates clinically important transition timing as a target for prediction in real ICU settings. In particular, ONSET and WEAN are intrinsically much rarer than STAY ON or STAY OFF, which induces severe class imbalance. This imbalance is nevertheless part of the problem setting that must be accepted in order to identify rare but clinically important events explicitly.
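The labeling rule can be sketched as a small function over a binary ventilation sequence. The `lag` and `horizon` values are hypothetical, and the handling of a mixed window (using the direction of the first state change) is an assumption, since the text does not state how windows containing both transition directions are resolved.

```python
ONSET, WEAN, STAY_ON, STAY_OFF = "ONSET", "WEAN", "STAY ON", "STAY OFF"

def transition_label(vent_states, t, lag, horizon):
    """Label time t from binary states in [t + lag, t + lag + horizon)."""
    window = vent_states[t + lag: t + lag + horizon]
    if len(window) < horizon:
        return None  # not enough future data to define a label
    if all(s == 1 for s in window):
        return STAY_ON
    if all(s == 0 for s in window):
        return STAY_OFF
    # Mixed window: at least one state change exists; use the first one.
    for prev, cur in zip(window, window[1:]):
        if prev != cur:
            return ONSET if cur == 1 else WEAN
    return None  # unreachable for binary input, kept for safety

# Toy hourly ventilation sequence: off, then on for four hours, then off.
states = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
labels = [transition_label(states, t, lag=2, horizon=3)
          for t in range(len(states))]
```

In this toy sequence only four of the eight labeled time points are transitions, and in real ICU stays the stable classes dominate far more strongly, which is the class imbalance the text refers to.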
The sample construction principle, input representation, and output target definition introduced in this section form the foundation of the overall CAGE architecture. At the sample construction stage, patient-level independence was enforced to prevent identifier-based contamination. On the input side, the irregularity and structural missingness of ICU electronic health records were preserved through a three-channel time-series representation. On the output side, the mechanical ventilation problem was reformulated as a transition-centered four-class prediction task. Based on this design, the next section describes how admission-level diagnostic information is extracted not as a simple code prediction output, but as higher-level clinical context for downstream intervention prediction.
3.2. Diagnostic Context Extraction via Partial Label Learning
This section treats automated ICD coding not as a simple multi-class classification problem, but as a process for quantitatively extracting higher-level clinical context at the ICU-stay level. Intervention prediction cannot be sufficiently explained by low-level time-series changes alone. Even the same physiological signal pattern may have very different clinical meanings depending on the disease background and admission context under which it is observed. The purpose of the diagnostic context extraction stage is therefore not to predict a single final diagnosis code itself, but to form a continuous representation that summarizes the patient’s disease background at the admission level and passes it to the downstream intervention prediction stage. To this end, this study combines a partial-label learning setting, high-order interaction modeling, and a loss design suitable for extreme multi-class environments to estimate ICU diagnostic distributions in a stable manner, and then transforms them into probability-weighted diagnostic embeddings for downstream intervention prediction. From the perspective of the overall framework, this output is not an isolated ICD prediction result, but an admission-level diagnostic context variable shared across the full CAGE architecture.
3.2.1. ICD Prediction as a Partial Label Problem
In EHR settings, a single admission or ICU stay is usually associated with multiple ICD codes, and it is often difficult to determine which of them is directly associated with a specific time-series pattern. This is especially true in early prediction settings, where the final diagnosis may not yet have been fully established, and where clinical documentation itself is often completed or refined after treatment has ended. Accordingly, the traditional multi-class classification setting, which assumes a single ICD code as the unique correct answer for an observed input, does not adequately reflect the diagnostic uncertainty and delayed documentation structure of ICU environments. To address this issue, this study adopts a partial-label learning setting in which the candidate set of clinically possible diagnosis codes is used as the supervision signal for sample i. Specifically, the candidate set for sample i is the set of ICD codes observed for that sample, which is a subset of the full ICD class space. Under this setting, the goal of the model is not to commit strongly to a single class, but to assign high scores and probability mass within the candidate set while learning separation from classes outside it. This formulation provides a more realistic form of supervision that reflects the inherent multiplicity and incompleteness of ICU diagnostic information. Given that the ultimate objective of this study is to extract diagnostic context for downstream intervention prediction, it is more important to obtain a stable and structured probability distribution over the candidate region than to achieve an exact match for a single code. In other words, this stage functions not as a final diagnostic classifier, but as a higher-level representation learner that probabilistically summarizes admission-level disease context.
3.2.2. High-Order Feature Interaction Modeling
Diagnostic prediction is less a matter of estimating the independent effect of each variable than of interpreting combinatorial interactions among heterogeneous clinical variables; for example, the same decrease in blood pressure may have a different diagnostic meaning depending on age, underlying disease, oxygen saturation, abnormal laboratory findings, and treatment history. Automated ICD coding therefore requires a base architecture that can efficiently model high-order interactions among structured clinical variables. To this end, this study estimates diagnostic distributions using a structure that combines explicit cross-feature learning with nonlinear representation learning. By jointly using feature crossing and deep representation learning, this structure can effectively capture important interaction patterns in environments such as the ICU, where the data are strongly characterized by tabular-like clinical representations. Let $x_i$ denote the input feature vector for sample $i$. The model outputs the following logit for each ICD class $k$:
$$z_{i,k} = f_{\theta}(x_i)_k$$
The resulting logit vector is then transformed into a probability distribution over the full class space by applying softmax:
$$p_{i,k} = \frac{\exp(z_{i,k})}{\sum_{j \in \mathcal{Y}} \exp(z_{i,j})}$$
Here, $p_{i,k}$ denotes the posterior probability that sample $i$ belongs to class $k$, and $\mathcal{Y}$ is the full ICD class space. An important point is that this probability distribution is not merely an intermediate result for top-1 prediction. Rather, it is the key input used in the next stage to generate the diagnostic context embedding, and the relative probability structure across classes is itself meaningful. In this study, DCNv2 [41] is used as the base estimator for high-order interaction modeling. The central point of this section, however, is not the use of a specific backbone itself, but the modeling of interactions among structured clinical variables in a form suitable for diagnostic context estimation.
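To make explicit feature crossing concrete, the following minimal sketch implements a single cross layer in the commonly cited DCNv2 form, $x_{l+1} = x_0 \odot (W x_l + b) + x_l$, followed by a numerically stable softmax over class logits. It uses plain Python lists rather than the PyTorch implementation used in the experiments, and the function names, toy weights, and dimensions are illustrative assumptions, not the configuration of this study.

```python
import math

def cross_layer(x0, xl, weight, bias):
    """One cross layer: x0 * (W @ xl + b) + xl, computed element-wise on lists."""
    projected = [sum(w * v for w, v in zip(row, xl)) + b
                 for row, b in zip(weight, bias)]
    return [a * p + r for a, p, r in zip(x0, projected, xl)]

def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy input with three features (hypothetical values, not from the paper).
x0 = [1.0, 2.0, 3.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x1 = cross_layer(x0, x0, identity, [0.0, 0.0, 0.0])  # equals x0 * x0 + x0
toy_probs = softmax(x1)  # treating x1 as class logits purely for illustration
```

With an identity weight matrix the layer output contains the squared features plus the residual input, which shows how each cross layer raises the interaction order by one while keeping the lower-order terms.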
3.2.3. Adaptive CLPL for Extreme Multi-Class Environment
ICD prediction is more difficult than general multi-class learning because it involves both a very large class space and an extreme long-tail distribution. A small number of frequent codes account for a substantial portion of the full sample set, whereas most codes are observed only rarely. In addition, supervision is given not as a single correct answer but as a candidate set. Under these conditions, direct application of standard cross-entropy loss or a simple softmax-based ranking loss may lead to excessive computational cost and may fail to adequately reflect consistency under partial-label supervision. To address this problem, this study proposes an Adaptive CLPL loss that combines the computational efficiency of Adaptive Softmax [42] with the partial-label consistency of Convex Learning from Partial Labels (CLPL) [43]. The main idea is to decompose the full class space into a frequency-based head set $\mathcal{H}$ and tail set $\mathcal{T}$, and to compute the approximate negative penalty for the tail region using only a subset of sampled negative classes $\mathcal{S}_i \subseteq \mathcal{T} \setminus Y_i$. This design avoids the need to consider all negative classes at every step in long-tail settings, while preserving the central requirement of candidate-set-based learning in partial-label settings. Specifically, let $Y_i$ denote the candidate set for sample $i$, let $z_{i,k}$ denote the logit of class $k$ for sample $i$, and let $m$ denote the number of sampled tail negatives. The Adaptive CLPL loss is then defined as follows:
$$\mathcal{L}_i = \psi\!\left(\frac{1}{|Y_i|}\sum_{k \in Y_i} z_{i,k}\right) + \sum_{k \in \mathcal{H} \setminus Y_i} \psi\!\left(-z_{i,k}\right) + \frac{|\mathcal{T} \setminus Y_i|}{m}\sum_{k \in \mathcal{S}_i} \psi\!\left(-z_{i,k}\right)$$
The first term is the candidate term, which increases the average score of classes inside the candidate set. This is well aligned with partial-label supervision because it encourages the candidate region as a whole to receive relatively high scores, rather than selecting a single candidate. The second term is the head negative term, which suppresses the scores of classes outside the candidate set in the frequent-class region. The third term is the tail negative term, which assigns penalties to a sampled subset of negative classes instead of directly computing over the full rare-class region, and rescales the contribution by the factor $|\mathcal{T} \setminus Y_i| / m$ so that the expected scale is preserved. Here, $\psi$ is a surrogate loss function. In this study, a monotone decreasing logistic-form function is used so that positive candidate scores are increased and negative scores are decreased.
The significance of this loss function extends beyond computational reduction alone. Adaptive CLPL is structurally designed to push up the candidate set while suppressing classes outside it according to the statistical characteristics of the head and tail regions. It therefore reflects both partial-label consistency and computational efficiency in extreme multi-class environments, and constitutes the main mathematical contribution of this study in diagnostic context extraction. In particular, this loss more directly supports the formation of a stable probability distribution suitable for downstream conditioning than the selection of a single final ICD code.
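The three-term structure described above (candidate term, head negative term, sampled and rescaled tail negative term) can be sketched for a single sample as follows. This is a hedged illustration in plain Python, not the authors' implementation: the logistic surrogate $\psi(u) = \log(1 + e^{-u})$, uniform sampling of tail negatives, the rescaling factor (number of tail negatives divided by number of sampled negatives), and all function and argument names are assumptions.

```python
import math
import random

def psi(u):
    """Logistic surrogate log(1 + exp(-u)); monotone decreasing in u."""
    if u >= 0:
        return math.log1p(math.exp(-u))
    return -u + math.log1p(math.exp(u))  # stable branch for negative u

def adaptive_clpl_loss(logits, candidates, head, tail, num_samples, rng):
    """Per-sample loss; candidates, head, tail are sets of class indices.

    Assumes the candidate set is non-empty.
    """
    # Candidate term: push up the mean score over the candidate set.
    cand_mean = sum(logits[k] for k in candidates) / len(candidates)
    loss = psi(cand_mean)

    # Head negative term: every frequent class outside the candidate set.
    loss += sum(psi(-logits[k]) for k in sorted(head - candidates))

    # Tail negative term: sample negatives uniformly, then rescale so the
    # expected value matches the sum over all tail negatives.
    tail_negatives = sorted(tail - candidates)
    if tail_negatives:
        m = min(num_samples, len(tail_negatives))
        sampled = rng.sample(tail_negatives, m)
        scale = len(tail_negatives) / m
        loss += scale * sum(psi(-logits[k]) for k in sampled)
    return loss

# Toy example with six classes (hypothetical values).
example_logits = [2.0, -1.0, 0.5, -2.0, -0.5, 0.0]
example_loss = adaptive_clpl_loss(
    example_logits, {0, 2}, {0, 1}, {2, 3, 4, 5},
    num_samples=2, rng=random.Random(42),
)
```

When `num_samples` is at least the number of tail negatives, the sketch reduces to the unsampled CLPL-style loss over all classes, which is a convenient sanity check for any implementation of the sampled variant.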
3.2.4. Probability-Weighted Diagnostic Embedding
The output of the diagnostic context extraction stage is not a single final predicted class, but a probability distribution over the full ICD class space. Accordingly, this distribution is not reduced to a single hard label through $\arg\max_k p_{i,k}$, but is converted into a continuous diagnostic context vector by computing the probability-weighted sum of ICD class embeddings. The diagnostic embedding for sample $i$ is defined as follows:
$$\mathbf{d}_i = \sum_{k \in \mathcal{Y}} p_{i,k}\,\mathbf{e}_k$$
Here, $p_{i,k}$ is the predicted probability of ICD class $k$ for sample $i$, and $\mathbf{e}_k$ is the learnable embedding vector of ICD class $k$. This formulation places greater weight on diagnosis candidates in which the model has higher confidence, while still partially reflecting candidates that remain uncertain but are not excluded. As a result, the generated $\mathbf{d}_i$ is not the output of a single-code decision, but a smooth summary in continuous space of the admission-level disease background of sample $i$. Such a probability-weighted embedding can preserve multiple possible clinical interpretations even in early prediction settings where the diagnosis has not yet fully converged, and can provide richer and more stable information than a single hard label when used as higher-level conditioning context in downstream intervention prediction. In CAGE, diagnostic context is therefore not a symbolic tag that fixes the patient to one disease category, but a soft clinical context that reflects the mixture of multiple possible diagnostic states. This embedding is shared across time windows belonging to the same admission or ICU stay, and is later used in the context-adaptive gating mechanism as a higher-level signal that modulates the interpretation of lower-level numerical time-series data. It therefore functions not as the endpoint of ICD coding, but as a bridge that transforms diagnostic distributions into a form that can be connected to intervention prediction.
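The contrast between a probability-weighted embedding and a hard-label lookup can be illustrated with a small sketch. The embedding table, the probabilities, and the function names below are hypothetical; in the actual model the embeddings are learned parameters and the probabilities come from the softmax over ICD classes.

```python
def diagnostic_embedding(probs, class_embeddings):
    """Probability-weighted sum of class embeddings (soft context vector)."""
    if len(probs) != len(class_embeddings):
        raise ValueError("one probability is required per class embedding")
    dim = len(class_embeddings[0])
    return [sum(p * emb[j] for p, emb in zip(probs, class_embeddings))
            for j in range(dim)]

def hard_label_embedding(probs, class_embeddings):
    """Embedding of the single most probable class (argmax lookup)."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return list(class_embeddings[best])

# Three toy ICD classes with two-dimensional embeddings (hypothetical).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_probs = [0.5, 0.3, 0.2]

soft_context = diagnostic_embedding(toy_probs, toy_embeddings)
hard_context = hard_label_embedding(toy_probs, toy_embeddings)
```

In this toy case the hard lookup returns only the first class embedding, whereas the soft context keeps a nonzero second coordinate contributed by the two less probable classes, which is the behavior the text describes as preserving multiple possible clinical interpretations.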
Taken together, this stage reframes automated ICD coding not as a terminal classification objective, but as a means of constructing an admission-level diagnostic context representation. Through Adaptive CLPL and probability-weighted embedding, the resulting output is transformed into a continuous context signal that can be passed to the downstream intervention prediction stage.
3.3. Context-Adaptive Intervention Prediction
The diagnostic context extracted in the previous section is not the final objective in itself, but a higher-level clinical representation used to conditionally reinterpret downstream intervention prediction. This section describes the mechanism by which this diagnostic context is combined with time-series features to produce the final intervention prediction. The core idea of CAGE is not a simple parallel combination of the two representations. Instead, the higher-level diagnostic context extracted at the admission level dynamically modulates the importance of lower-level numerical features formed along the temporal axis, so that the same physiological signal pattern can be interpreted differently depending on the disease background. To this end, this study integrates TCN-based temporal feature extraction, diagnosis-aware gating, and an objective function that is sensitive to rare transition events into a single prediction procedure.
3.3.1. TCN-Based Temporal Feature Extraction
The numerical input for intervention prediction is $X_{\mathrm{num}}$, constructed from irregular ICU time-series observations, and includes the value, observation indicator, and elapsed time since the last observation for each variable. To extract clinically meaningful patterns from this input, this study uses a TCN. Because a TCN is based on causal convolutions, the representation at the current time point cannot reference information from future time points. This property is important for preventing information leakage in prediction tasks. In addition, with dilated convolutions the receptive field grows rapidly as the network depth increases, allowing the model to capture both abrupt changes over short time scales and gradual trends over longer time scales. The receptive field of a TCN is generally expressed as follows:
$$R = 1 + \sum_{l=1}^{L} (k - 1)\,d_l$$
where $L$ denotes the number of levels, $k$ denotes the kernel size, and $d_l$ denotes the dilation factor of the $l$-th block. If the dilation increases exponentially such that $d_l = 2^{l-1}$, the receptive field can be rewritten as
$$R = 1 + (k - 1)\left(2^{L} - 1\right)$$
This expression shows that even a relatively shallow network can cover a wide temporal range. This property is particularly important in ICU settings. Some interventions are determined by acute deterioration signals arising over minutes or hours, whereas in other cases the need for clinical intervention increases because gradual physiological decline accumulates over several days. TCN has the advantage of learning such multi-scale temporal patterns in a parallel and more stable manner than recurrent structures. After passing through the TCN, the numerical time-series branch is summarized into a batch-wise representation, denoted as
$$H_{\mathrm{num}} \in \mathbb{R}^{B \times K}$$
Here, $B$ denotes the batch size and $K$ denotes the hidden representation dimension. $H_{\mathrm{num}}$ is a summarized temporal feature representation extracted from the full input time-series, and it remains a purely physiological signal-based feature before diagnostic context is incorporated.
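The growth of the receptive field with depth can be checked numerically. The sketch below assumes one dilated causal convolution per level, so that the receptive field is $1 + (k - 1)\sum_l d_l$, and dilations that double from 1; TCN variants with two convolutions per residual block would double the summed term. The function names and the example kernel sizes are illustrative assumptions, not the settings used in this study.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field for one dilated causal convolution per level."""
    return 1 + (kernel_size - 1) * sum(dilations)

def receptive_field_exponential(kernel_size, num_levels):
    """Closed form when the dilations are 1, 2, 4, ..., 2**(num_levels - 1)."""
    return 1 + (kernel_size - 1) * (2 ** num_levels - 1)

def levels_needed(kernel_size, horizon):
    """Smallest number of levels whose receptive field covers `horizon` steps."""
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2")
    levels = 1
    while receptive_field_exponential(kernel_size, levels) < horizon:
        levels += 1
    return levels

# Four levels with kernel size 3: both forms give the same receptive field.
general_form = receptive_field(3, [1, 2, 4, 8])
closed_form = receptive_field_exponential(3, 4)

# Levels required to cover a 6-step input window with kernel size 3.
levels_for_six_steps = levels_needed(3, 6)
```

Under these assumptions, two levels with kernel size 3 already cover a six-step window, which illustrates why a shallow stack can be sufficient for short input windows while deeper stacks extend coverage exponentially.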
3.3.2. Context-Adaptive Gating Mechanism
The core fusion operation in CAGE is a context-adaptive gating mechanism in which diagnostic context dynamically modulates the channel-wise importance of time-series features. Unlike simple feature concatenation, in which time-series and diagnostic information are merely placed side by side and left for a downstream classifier to combine, the gating mechanism in CAGE allows higher-level context to directly control the activation strength of lower-level representations. In other words, the same time-series pattern can be emphasized or suppressed depending on the disease background. To implement this, the diagnostic embedding generated in the previous stage and the remaining categorical variables are maintained as a separate categorical representation $H_{\mathrm{cat}}$. Average pooling is then applied to summarize the global context over the temporal axis into a single representation:
$$\bar{h}_{\mathrm{cat}} = \frac{1}{T}\sum_{t=1}^{T} H_{\mathrm{cat}}^{(t)}$$
Here, $T$ is the number of time steps, and $\bar{h}_{\mathrm{cat}}$ can be regarded as an admission-level context summary shared over the full temporal axis. This representation contains the patient’s diagnostic background and more stable categorical information, and serves as the basis for gate generation. It is then transformed into a gate vector through a linear transformation followed by a sigmoid nonlinearity:
$$g = \sigma\!\left(W_g\,\bar{h}_{\mathrm{cat}} + b_g\right)$$
Here, $W_g$ and $b_g$ are learnable parameters, and $\sigma$ denotes the sigmoid function. As a result, $g$ has the same dimensionality as the numerical representation $H_{\mathrm{num}}$, and each dimension can be interpreted as a continuous weight indicating how strongly the corresponding time-series feature channel should be emphasized or suppressed. The gate is then applied to the numerical representation through element-wise multiplication (Hadamard product):
$$\tilde{H}_{\mathrm{num}} = g \odot H_{\mathrm{num}}$$
If $H_{\mathrm{num}}$ is a general temporal signature extracted from the time-series, then $\tilde{H}_{\mathrm{num}}$ is a condition-aware representation in which that signature has been reinterpreted through diagnostic context. Accordingly, the same physiological signal change may highlight certain channels more strongly in one disease context and suppress them in another. The reweighted numerical representation is then passed to the final classifier to generate the logit vector $o$ for the four-class intervention transition task, and the probability of each class is computed by softmax:
$$p_c = \frac{\exp(o_c)}{\sum_{c'} \exp(o_{c'})}$$
Here, $p_c$ denotes the class probability for $c \in \{\text{ONSET}, \text{WEAN}, \text{STAY ON}, \text{STAY OFF}\}$. The final prediction is therefore generated not from numerical time-series data alone, but from a representation adjusted by diagnostic context.
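The gating steps described above (temporal average pooling of the categorical context, a linear map followed by a sigmoid, and element-wise reweighting of the temporal features) can be sketched for a single sample as follows. Plain Python lists stand in for tensors, and the weight shapes, variable names, and toy values are assumptions made for illustration only.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def mean_pool(h_cat):
    """Average T context vectors (each of length D) over the temporal axis."""
    steps = len(h_cat)
    return [sum(step[d] for step in h_cat) / steps
            for d in range(len(h_cat[0]))]

def make_gate(context, weight, bias):
    """Gate of length K from a context of length D; weight has K rows of D."""
    return [sigmoid(sum(w * c for w, c in zip(row, context)) + b)
            for row, b in zip(weight, bias)]

def apply_gate(h_num, gate):
    """Hadamard product between the temporal features and the gate."""
    return [h * g for h, g in zip(h_num, gate)]

# Toy sample: T = 2 time steps, D = 2 context dims, K = 3 temporal channels.
h_cat = [[1.0, 3.0], [3.0, 5.0]]
h_num = [2.0, -4.0, 6.0]
weight = [[0.5, 0.0], [0.0, -0.5], [0.0, 0.0]]  # hypothetical gate weights
bias = [0.0, 0.0, 0.0]

context = mean_pool(h_cat)
gate = make_gate(context, weight, bias)
gated = apply_gate(h_num, gate)
```

Because every gate entry lies strictly between 0 and 1, gating can only attenuate a channel relative to its original magnitude; a channel whose gate weights are zero, such as the third one here, is scaled by exactly 0.5.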
3.3.3. Class-Balanced Focal Loss for Rare Event Detection
The final prediction problem consists of four classes, ONSET, WEAN, STAY ON, and STAY OFF, but their frequencies are highly imbalanced. In particular, ONSET and WEAN are clinically important transition events because they correspond to actual intervention changes, yet they are observed far less frequently than STAY ON or STAY OFF. Under such conditions, if a standard cross-entropy loss is used without modification, the model tends to converge toward improving discrimination for the more frequent classes, while showing low recall for the relatively rare transition events. As a result, the model may achieve high overall accuracy while still failing to detect the clinically most important events. To mitigate this issue, this study uses a class-balanced focal loss, defined as follows:
$$\mathcal{L}_{\mathrm{CBFL}} = -\,\alpha_c\,(1 - p_c)^{\gamma}\,\log p_c$$
Here, $c$ is the true class of the sample, $p_c$ is the predicted probability of that class, and $\alpha_c$ is the class weight used to correct for class-frequency imbalance in class $c$, such that rarer classes can be assigned larger weights. In addition, $\gamma$ is the focusing parameter, which reduces the loss contribution from already easy samples and allocates more learning signal to samples that remain difficult. The term $(1 - p_c)^{\gamma}$ decreases the influence of easy samples that the model predicts with high confidence, while relatively increasing the importance of difficult samples that receive low confidence, such as rare transition events. This loss design is directly aligned with the problem setting of this study. The proposed task is not a simple classification of the current state, but an explicit prediction of clinically important state transitions. As a result, ONSET and WEAN must be treated as separate classes, but are also intrinsically rare events. Without correcting this imbalance at the loss function level, the model is likely to underlearn the most important clinical events. The class-balanced focal loss addresses this issue by reducing the dominance of easy samples from majority classes and allocating more effective learning capacity to rare event detection.
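A per-sample version of this loss can be sketched as follows, using the standard focal form in which the cross-entropy of the true class is multiplied by a class weight and by one minus the true-class probability raised to the focusing parameter. The section does not state how the class weights are derived, so the inverse-frequency weighting below is an assumption, as are the function names, the default focusing parameter, and the toy class counts.

```python
import math

def inverse_frequency_weights(class_counts):
    """Assumed weighting: N / (C * n_c), so rarer classes get larger weights."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * n) for n in class_counts]

def focal_loss(probs, target, alpha, gamma=2.0):
    """Class-weighted focal loss for one sample with true class `target`."""
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)

# Class order assumed here: ONSET, WEAN, STAY ON, STAY OFF (toy counts).
toy_counts = [10, 10, 40, 40]
alpha = inverse_frequency_weights(toy_counts)

# An easy majority-class sample and a hard rare-class sample.
easy_loss = focal_loss([0.02, 0.03, 0.05, 0.90], target=3, alpha=alpha)
hard_loss = focal_loss([0.20, 0.05, 0.05, 0.70], target=0, alpha=alpha)
```

Setting `gamma=0.0` and all weights to 1 recovers ordinary cross-entropy, which makes it straightforward to compare the two objectives under an otherwise identical training setup.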
Taken together, the intervention prediction stage combines temporal feature extraction, diagnosis-aware gating, and imbalance-sensitive optimization into a single context-adaptive prediction process. In this way, the same physiological pattern can be interpreted differently depending on the underlying disease context, while rare but clinically important transition events receive greater emphasis during learning.
4. Experiment
This section presents a comprehensive set of experiments on automated ICD coding and intervention prediction in order to quantitatively evaluate the effectiveness of the proposed framework. These two tasks correspond to the automatic transformation of patient diagnostic information into a structured code space and the prediction of future therapeutic interventions, respectively, both of which represent core functionalities required in ICU clinical decision support systems. The goal of these experiments is not merely to report the single best-performing model. Instead, we aim to answer the following three research questions:
RQ1. Is stable prediction in a large-scale ICD code space possible using ICU-derived structured and temporal representations?
RQ2. In the automated ICD coding task, how do model architecture and loss function contribute to candidate-set ranking performance, probability concentration, and long-tail stability, respectively?
RQ3. In an imbalanced intervention-event setting, how strongly does the choice of loss function affect the detection performance of rare transition events?
To answer these questions, we conduct a stepwise analysis consisting of absolute performance comparisons for each task, controlled experiments that separate the effects of model architecture and loss function, and detailed metric analysis at the event and class levels. In particular, automated ICD coding simultaneously involves extreme multi-class classification, long-tail distributions, and partial-label supervision, making it difficult to interpret performance differences from a single best-score comparison alone. We therefore analyze the effects of architecture and loss separately for ICD coding, while focusing on the impact of loss functions on rare-event detection performance for intervention prediction. For clarity, the main tabular comparisons are additionally visualized in graphical form using plots derived from the same test results reported in Table 1, Table 2, Table 3 and Table 4. These controlled comparisons do not replace a full sequential module ablation of every component, but they do provide partial quantitative evidence regarding the relative importance of representational design and objective function choice under the present experimental setting.
4.1. Experimental Setup
All experiments were conducted under the same hardware environment. The CPU was an AMD Ryzen Threadripper PRO 3955WX, and the GPU was an NVIDIA A100 40 GB. The implementation was based on PyTorch 2.10.0 and the random seed was fixed to 42 for reproducibility. To ensure fair comparison, the data split, training protocol, and input representation were kept identical across all experiments so that performance differences could be attributed to model architecture or loss function choice. This study evaluates two ICU prediction tasks. Automated ICD coding predicts ICD codes from structured ICU representations and is characterized by a very large class space and a partial-label constraint in the form of candidate sets. Because these structural difficulties may cause tail-class learning collapse and bias in the predicted distribution, the analysis treats loss design as a core variable in addition to model comparison. Intervention prediction forecasts mechanical ventilation-related events from ICU input sequences, and reports both macro-level metrics and event-wise metrics to account for event imbalance and differences in per-event difficulty. The data were split at the ICU-stay level using a fixed partition with a training:validation:test ratio of 70:10:20. Early stopping was applied on the basis of validation set performance.
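A fixed 70:10:20 partition with a fixed seed can be reproduced with a few lines of code. The sketch below is an assumption about how such a split might be implemented, not the authors' pipeline: the identifiers stand for whatever unit the split is enforced on (for example ICU stays grouped so that no patient appears in more than one partition), and the function name and rounding rule are hypothetical.

```python
import random

def split_ids(ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Deterministically split unique identifiers into train/val/test lists."""
    unique = sorted(set(ids))  # sort first so the result ignores input order
    random.Random(seed).shuffle(unique)
    n_train = round(len(unique) * ratios[0])
    n_val = round(len(unique) * ratios[1])
    train = unique[:n_train]
    val = unique[n_train:n_train + n_val]
    test = unique[n_train + n_val:]  # remainder goes to the test partition
    return train, val, test

# Toy identifiers (hypothetical); real identifiers would come from the cohort.
train_ids, val_ids, test_ids = split_ids(range(100))
```

Splitting on identifiers rather than on individual time windows keeps every window of a given unit inside a single partition, which is what prevents identifier-based leakage between training and evaluation data.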
In addition to predictive accuracy, we also recorded the execution context of the full experimental workflow. The implementation environment consisted of Python 3.12 on Linux, with PyTorch and CUDA enabled. The pipeline was executed with a fixed random seed of 42 and a unified configuration in which the input window, gap, and prediction window were set to 6 h, 6 h, and 4 h, respectively. The complete end-to-end run required about 30,536 s in total, of which about 307 s were spent in ETL, about 25,120 s in automated ICD coding, and about 5073 s in intervention prediction. The execution logs further show that the processed cohort size was 30,932 ICU stays, the hourly outcome grid contained 2,632,893 rows, the ICD coding task involved 6130 classes, and the final tensorized input contained 98 three-channel features.
4.2. Task Definitions
4.2.1. Automated ICD Coding Task
The automated ICD coding task was defined as the prediction of ICD codes from structured ICU input data. The class space is very large, and code frequency follows an extreme long-tail distribution, which may lead to bias toward head classes and collapse in tail-class learning. This study also adopts a partial-label learning setting. Specifically, although each sample is assumed to have one correct ICD code, only a candidate set is provided during training, and the true label is contained within that candidate set. The model is therefore trained to concentrate probability mass on the correct code within the candidate set, while also being encouraged to provide negative evidence for classes outside the candidate set. Under this setting, the choice of loss function can directly affect learning stability and tail performance. Accordingly, we conduct a separate comparison along the loss function axis in addition to the comparison of model architectures.
4.2.2. Intervention Prediction Task
Intervention prediction was defined as the prediction of mechanical ventilation-related events from ICU time-series input. In this study, the target events are ONSET, WEAN, STAY ON, and STAY OFF. Because the class distribution is imbalanced across events, both macro-level metrics and event-wise metrics are reported. In particular, the detection performance for sparse events is clinically important, and AUC- and AUPRC-based measures are used as the main evaluation metrics. To analyze performance differences induced by the choice of loss function, we compare the results obtained with different loss functions under the same model architecture and training protocol.
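The four event classes can be read as the combinations of ventilation status before and after the forecast horizon. The sketch below illustrates one plausible labeling rule together with the 6 h input, 6 h gap, and 4 h prediction windows reported for the pipeline. How the "current" and "future" status are actually derived from the hourly grid is not specified in this section, so using the last hour of the input window and the last hour of the prediction window is an assumption, as are all function names.

```python
def transition_label(on_now, on_later):
    """Map current and future ventilation status to one of four classes."""
    if on_now:
        return "STAY ON" if on_later else "WEAN"
    return "ONSET" if on_later else "STAY OFF"

def window_bounds(start, input_h=6, gap_h=6, pred_h=4):
    """Half-open hourly index ranges for the input and prediction windows."""
    input_end = start + input_h
    pred_start = input_end + gap_h
    return (start, input_end), (pred_start, pred_start + pred_h)

def label_sample(status_by_hour, start):
    """Label one sample from an hourly on/off series (assumed rule)."""
    (_, input_end), (_, pred_end) = window_bounds(start)
    on_now = status_by_hour[input_end - 1]   # last hour of the input window
    on_later = status_by_hour[pred_end - 1]  # last hour of the prediction window
    return transition_label(on_now, on_later)

# Toy stay: off for 12 hours, then on for 8 hours (hypothetical values).
toy_status = [False] * 12 + [True] * 8
toy_label = label_sample(toy_status, start=0)
```

In this toy series the patient is off the ventilator at the end of the input window and on it at the end of the prediction window, so the sample falls in the ONSET class under the assumed rule.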
4.3. Comparative Models
For the comparative experiments on automated ICD coding, representative baseline families were selected to reflect different assumptions about representation learning under large ICD label spaces, structured ICU inputs, and partial-label supervision. Rather than following a fixed benchmark lineage from prior ICD coding studies, the comparison was designed to contrast recurrent, Transformer-based, continuous-time, and structured-feature interaction models within the present diagnostic-context setting. MedBERT [44] is a Transformer-based representation model pretrained on medical code data and emphasizes contextual code semantics. GRU-D [45] is a recurrent architecture specialized for irregular clinical time-series with missingness and time-gap information. TST [46] and iTransformer [47] represent Transformer-based temporal sequence modeling approaches for long-range dependency learning. Latent ODE [48] models continuous-time latent dynamics and is naturally suited to irregular observation intervals. TabNet [49] and DCNv2 [41] emphasize structured feature interaction learning and are included to test whether explicit feature-cross modeling is advantageous for diagnostic-context estimation from structured clinical variables. By comparing these families together, we examine which inductive bias is most suitable for stable ICD-context formation under the present task setting.
For intervention prediction, the baseline selection follows the broader methodological progression summarized in Section 2.4, from classical classifiers to temporal deep learning and graph-based extensions. Random Forest and Logistic Regression were included as classical baselines that do not explicitly model temporal order and instead operate on aggregated or flattened fixed-window features. CNN-based models capture local temporal patterns through convolution along the time axis and provide a strong sequence-modeling baseline with relatively stable optimization. LSTM is a standard recurrent model for ICU sequence prediction and is particularly relevant given prior intervention-prediction studies that compared logistic regression, CNN, and LSTM under onset- and weaning-oriented task settings [35]. To further represent models that explicitly incorporate inter-variable dependency structure, we also included graph-based extensions such as LSTM-GNN and MTS-GCNN [37]. These baselines allow comparison across classical statistical learning, sequential deep learning, and graph-aware temporal modeling under the same intervention prediction setting.
All comparative models shared the same input representation, data split, training protocol, and early stopping criterion. Hyperparameters were selected through validation-set-based tuning or followed the recommended settings in the original papers. Accordingly, the reported performance differences can be attributed primarily to model architecture rather than to differences in data processing or training procedure.
4.4. Evaluation Metrics
4.4.1. Metrics for Automated ICD Coding
Because the automated ICD coding task is defined as a partial-label classification problem in which only a candidate set containing the correct answer is provided, fully supervised classification metrics such as top-1 accuracy or macro-F1 do not directly reflect the problem setting. In particular, whether the prediction falls within the candidate set and how highly it is ranked within that set are closely related to practical clinical usability. Accordingly, this study evaluates the task primarily using metrics that directly account for the candidate-set constraint. As a basic metric, we use candidate accuracy, defined as the proportion of samples for which the model’s top prediction $\hat{y}_i = \arg\max_k p_{i,k}$ is included in the candidate set $Y_i$, which is computed as
$$\mathrm{CandAcc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i \in Y_i\right]$$
where $p_{i,k}$ is the predicted probability of class $k$ for sample $i$ and $N$ is the number of evaluated samples. This metric measures whether the predicted answer lies within the candidate set and is equivalent to hit@1 under a partial-label setting. In addition, to evaluate ranking performance within the candidate set in greater detail, we also report hit@k for $k \in \{1, 3, 5, 10\}$. Hit@k measures whether the correct code is included in the top-k predictions within the candidate set and reflects the probability that the model includes the true diagnosis among multiple recommended codes. It therefore evaluates practical detection ability under a condition less restrictive than single-label prediction.
In addition to ranking-based metrics, we also use probability-based auxiliary measures to quantify how strongly the predicted probability distribution is concentrated on the candidate set itself. Candidate Probability Mass (CPM) is defined as the total probability mass assigned to the full candidate set, $\mathrm{CPM}_i = \sum_{k \in Y_i} p_{i,k}$, and indicates how strongly the predicted distribution is concentrated in the candidate region. A high CPM value indicates that the model concentrates confidence on clinically meaningful code regions rather than dispersing probability over non-candidate classes. Candidate Margin (CMG) is defined as the difference between the maximum probability inside the candidate set and the maximum probability outside the candidate set:
$$\mathrm{CMG}_i = \max_{k \in Y_i} p_{i,k} - \max_{k \notin Y_i} p_{i,k}$$
This metric evaluates the sharpness of the decision boundary. A larger value indicates clearer separation between candidate and non-candidate codes, which is directly related to reduced misclassification risk during inference. Meanwhile, because the ICD code space follows an extreme long-tail distribution, the model may exhibit shortcut learning by concentrating excessively on a small number of frequent codes. To diagnose such collapse of the predicted distribution, we additionally measure the Normalized Dominance Index (NDI) and the Effective Prediction Ratio (EPR). NDI quantifies the normalized concentration of predictions on a particular class and increases as predictions become excessively dominated by a single code. EPR is defined on the basis of the entropy of the predicted probability distribution by computing the effective number of classes as $\exp(H)$, where $H$ is that entropy, and normalizing it by the total number of candidate classes. It quantifies the diversity of classes that the model effectively uses. These two measures are used to assess whether the model relies only on head classes or makes broader use of tail codes as well. By combining ranking-based metrics, probability concentration, and distributional diversity, we evaluate not only predictive accuracy, but also confidence concentration, decision-boundary stability, and long-tail generalization ability.
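The candidate-set metrics can be computed per sample from a predicted probability vector and a candidate set, as in the hedged sketch below. Here hit@k is interpreted as "at least one of the top-k predicted classes lies in the candidate set", which matches candidate accuracy at k = 1, and EPR is computed as the exponential of the entropy divided by the number of classes. Both interpretations, the function names, and the toy values are assumptions. NDI is omitted because its exact normalization is not given in this section.

```python
import math

def top_k(probs, k):
    """Indices of the k most probable classes, highest first."""
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return order[:k]

def hit_at_k(probs, candidates, k):
    """True if any of the top-k predictions is inside the candidate set."""
    return any(c in candidates for c in top_k(probs, k))

def candidate_probability_mass(probs, candidates):
    """Total probability assigned to the candidate set (CPM)."""
    return sum(probs[c] for c in candidates)

def candidate_margin(probs, candidates):
    """Best candidate probability minus best non-candidate probability (CMG).

    Assumes at least one class lies outside the candidate set.
    """
    inside = max(probs[c] for c in candidates)
    outside = max(p for c, p in enumerate(probs) if c not in candidates)
    return inside - outside

def effective_prediction_ratio(probs):
    """exp(entropy) normalized by the number of classes (EPR)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy) / len(probs)

# Toy prediction over four classes with candidate set {1, 2} (hypothetical).
toy_probs = [0.5, 0.3, 0.1, 0.1]
toy_candidates = {1, 2}
toy_cpm = candidate_probability_mass(toy_probs, toy_candidates)
toy_cmg = candidate_margin(toy_probs, toy_candidates)
toy_epr = effective_prediction_ratio(toy_probs)
```

In this toy case the top prediction falls outside the candidate set, so hit@1 is false and the margin is negative, while hit@2 is true; averaging such per-sample values over the test set yields the dataset-level figures.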
4.4.2. Metrics for Intervention Prediction
Intervention prediction is a multi-class classification problem that predicts future mechanical ventilation-related events, and it has an imbalanced structure in which the frequencies of individual events differ substantially. In such a setting, simple accuracy can be dominated by majority classes and may fail to properly reflect detection performance for sparse events. We therefore evaluate this task using metrics that remain reliable and interpretable under class imbalance. As a basic metric, we use the F1 score, the harmonic mean of precision and recall. Because it jointly considers false positives and false negatives, the F1 score helps mitigate the extreme precision- or recall-biased behavior that often arises in sparse-event detection. In clinical intervention prediction, false negatives may lead to particularly serious consequences, and the F1 score therefore reflects practical clinical usefulness more appropriately than simple accuracy. In addition, to assess global classification performance independently of a specific threshold choice, we adopt Macro AUC and Macro AUPRC as the main metrics. Macro AUC is the equally weighted average of class-wise ROC-AUC values and measures overall discriminative ability without being dominated by class imbalance. Macro AUPRC is the average area under the precision-recall curve and is especially sensitive to performance changes when the positive class ratio is low. Because ICU data contain highly sparse classes, AUPRC is used as a key metric for evaluating the model’s practical detection ability.
Finally, to analyze differences in the characteristics and difficulty of individual events in greater detail, we report event-wise AUC separately for ONSET, WEAN, STAY ON, and STAY OFF. This makes it possible to assess not only average performance over all classes, but also whether improvement is achieved for specific sparse events, and helps interpret in which clinical situations the model is most effective. By jointly using F1 score, Macro AUC, Macro AUPRC, and event-wise AUC, we evaluate detection performance under imbalance, threshold-independent discrimination, and event-specific generalization ability from multiple perspectives.
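Macro AUC as described here is the unweighted mean of one-vs-rest ROC-AUC values. The following sketch computes it with the rank-based (Mann-Whitney) formulation, in which the AUC equals the fraction of positive-negative pairs that are ordered correctly, with ties counted as one half. The quadratic pairwise loop is chosen for clarity rather than speed, and the function names and toy predictions are illustrative assumptions.

```python
def binary_auc(scores, labels):
    """ROC-AUC as the fraction of correctly ordered positive-negative pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def event_wise_auc(prob_rows, targets, num_classes):
    """One-vs-rest AUC for each class, in class-index order."""
    aucs = []
    for c in range(num_classes):
        scores = [row[c] for row in prob_rows]
        labels = [1 if t == c else 0 for t in targets]
        aucs.append(binary_auc(scores, labels))
    return aucs

def macro_auc(prob_rows, targets, num_classes):
    """Equally weighted mean of the class-wise AUC values."""
    aucs = event_wise_auc(prob_rows, targets, num_classes)
    return sum(aucs) / num_classes

# Toy three-class predictions (hypothetical values).
toy_rows = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
]
toy_targets = [0, 1, 2, 0]
toy_macro_auc = macro_auc(toy_rows, toy_targets, num_classes=3)
```

Because every class contributes equally to the mean regardless of how many samples it has, a rare class with poor separation lowers the macro value as much as a frequent one would, which is why the metric is informative under class imbalance.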
4.5. Results
For readability, the main comparative results are shown not only in tabular form but also in graphical form. The additional figures are direct visualizations of Table 1, Table 2, Table 3 and Table 4 and are included to make cross-model and cross-loss performance differences easier to interpret, especially for rare-event-sensitive metrics and candidate-set-based diagnostic metrics.
Table 1 and Figure 2 jointly summarize the comparative results for the automated ICD coding task and directly address RQ1, that is, whether stable prediction in a large-scale ICD code space is possible using ICU-derived structured and temporal representations. The proposed model achieved the best performance across all hit@k metrics, from 0.4863 at hit@1 to 0.8801 at hit@10. This indicates that clinically meaningful diagnostic information can be stably recovered within the candidate set using structured ICU representations derived from temporal observations and admission-level clinical variables. The second-best performance was achieved by iTransformer, while the remaining models fell short of this level. In particular, GRU-D and TST showed hit@1 values below 0.2, and although Latent ODE achieved a relatively high hit@10 (0.7740), its hit@1 remained low (0.2670), indicating limited ability to place the correct code among the highest-ranked candidates. The same pattern is consistently confirmed by the probability-distribution-based auxiliary metrics. The proposed model achieved the highest CPM (0.3698), indicating that its predicted probability was most stably concentrated within the candidate set, and it was also the only model to achieve a positive CMG (0.0425), showing the clearest decision boundary between candidate and non-candidate codes. In addition, it achieved the lowest NDI and the highest EPR, suggesting that its predictions did not collapse onto a small set of frequent head codes and instead made broader use of the code space. Taken together, the results in Table 1 provide a positive answer to RQ1. Stable automated ICD prediction is possible in a large-scale code space using structured ICU representations, and the proposed model achieved the best performance in both candidate-set ranking and distributional stability.
Table 1. Test results for the automated ICD coding models.
| | MedBERT | GRU-D | TST | Latent ODE | iTransformer | TabNet | Ours |
|---|---|---|---|---|---|---|---|
| hit@1 | 0.4125 | 0.1904 | 0.1781 | 0.2670 | 0.4459 | 0.4125 | 0.4863 |
| hit@3 | 0.4127 | 0.1905 | 0.3048 | 0.4434 | 0.6814 | 0.6414 | 0.7302 |
| hit@5 | 0.4128 | 0.1905 | 0.3051 | 0.5350 | 0.7506 | 0.6968 | 0.8063 |
| hit@10 | 0.4131 | 0.1915 | 0.3720 | 0.7740 | 0.8345 | 0.7884 | 0.8801 |
| CPM | 0.0020 | 0.0008 | 0.0011 | 0.0976 | 0.2983 | 0.0086 | 0.3698 |
| CMG | −0.0009 | −0.0007 | −0.0010 | −0.1085 | −0.0084 | −0.0001 | 0.0425 |
| NDI | 1.0000 | 0.9996 | 1.0000 | 1.0000 | 0.7970 | 1.0000 | 0.4112 |
| EPR | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0006 | 0.0002 | 0.0017 |
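To make the candidate-set ranking metric in Table 1 concrete, the following sketch computes hit@k for a partial-label setting. It assumes that a sample counts as a hit when at least one of the k highest-scoring codes belongs to that admission's candidate set; the formal definition is given where the metrics are introduced in the paper, so this reading should be treated as an assumption. The function names and toy scores are hypothetical.

```python
from typing import List, Sequence, Set


def hit_at_k(scores: Sequence[float], candidates: Set[int], k: int) -> bool:
    """True if any of the k highest-scoring codes lies in the candidate set."""
    # sorted() is stable, so tied scores keep their original code order.
    ranked = sorted(range(len(scores)), key=lambda code: scores[code], reverse=True)
    return any(code in candidates for code in ranked[:k])


def mean_hit_at_k(batch_scores: List[Sequence[float]],
                  batch_candidates: List[Set[int]],
                  k: int) -> float:
    """Fraction of samples with at least one candidate code in the top k."""
    hits = [hit_at_k(s, c, k) for s, c in zip(batch_scores, batch_candidates)]
    return sum(hits) / len(hits)


# Three toy admissions over a five-code space (hypothetical values).
toy_scores = [
    [0.10, 0.50, 0.20, 0.15, 0.05],
    [0.30, 0.10, 0.40, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.30, 0.10],
]
toy_candidates = [{1}, {0, 4}, {4}]

for k in (1, 3, 5):
    print(k, mean_hit_at_k(toy_scores, toy_candidates, k))
```

Under this reading, hit@k can only grow with k, which matches the monotone rows of Table 1. A model whose hit@1 and hit@10 are nearly equal, such as MedBERT or GRU-D there, gains almost nothing from the additional ranks.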
Table 2 and Figure 3 address RQ2 by separating the contributions of model architecture and loss function in the automated ICD coding task. To this end, each model was trained using CLPL, Adaptive Softmax, and Adaptive CLPL, and the average performance was compared. This averaged comparison reduces bias toward any single loss function and allows the structural effect of the model itself to be examined more directly. Even under this loss-averaged condition, the proposed model maintained the best overall performance in candidate accuracy, hit@k, CPM, CMG, NDI, and EPR, while iTransformer achieved the second-best performance. This suggests that the main basis of ICD coding performance lies in model architecture, and that ranking quality within the candidate set and the stability of the probability distribution are strongly influenced by representational capacity. At the same time, the loss function also made a non-negligible additional contribution. A detailed inspection of the loss-specific results shows that Adaptive CLPL improved hit@k and probability concentration metrics for most models, and that this effect was largest when combined with the proposed architecture. This indicates that model structure and loss function play different roles. Model architecture is the primary factor that determines how effectively high-order clinical feature interactions can be represented, whereas the loss function stabilizes learning under partial-label and long-tail conditions. Table 2 therefore provides the following answer to RQ2: both architecture and loss function are important in automated ICD coding, but their contributions are asymmetric. Architecture forms the primary basis of performance, and Adaptive CLPL provides an additional improvement that is well aligned with the task setting.
Table 2. Average comparison across loss functions for the automated ICD coding models.
| | MedBERT | GRU-D | TST | Latent ODE | iTransformer | TabNet | Ours |
|---|---|---|---|---|---|---|---|
| hit@1 | 0.3632 | 0.2398 | 0.3343 | 0.2397 | 0.4335 | 0.4125 | 0.4687 |
| hit@3 | 0.4322 | 0.2986 | 0.4491 | 0.3814 | 0.6463 | 0.6155 | 0.7108 |
| hit@5 | 0.4922 | 0.3005 | 0.4726 | 0.5956 | 0.7106 | 0.6973 | 0.7881 |
| hit@10 | 0.5240 | 0.3015 | 0.5247 | 0.7085 | 0.7873 | 0.7732 | 0.8653 |
| CPM | 0.0114 | 0.0018 | 0.0077 | 0.0655 | 0.2260 | 0.0065 | 0.3520 |
| CMG | −0.0045 | −0.0012 | −0.0032 | −0.0689 | −0.0218 | −0.0001 | 0.0195 |
| NDI | 1.0000 | 0.9996 | 1.0000 | 1.0000 | 0.8556 | 1.0000 | 0.4595 |
| EPR | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0005 | 0.0002 | 0.0015 |
Table 3 and Figure 4 present the model comparison results for the intervention prediction task and provide the performance context for interpreting RQ3. This task is a multi-class classification problem that simultaneously predicts four states (ONSET, WEAN, STAY ON, and STAY OFF) and exhibits a typical imbalanced structure in which the frequencies of the states differ substantially. Under this setting, the proposed model achieved the highest performance in Accuracy, Macro AUC, F1 score, and Macro AUPRC. The improvement was particularly pronounced for F1 score (79.4 versus 60.6 for the best baselines) and Macro AUPRC (77.4 versus 52.5), suggesting that the proposed model maintained basic discrimination for the majority classes while also achieving more balanced detection performance for minority transition events. In the event-wise AUC results, the proposed model achieved the best performance for ONSET (97.2) and STAY OFF (98.8) while remaining competitive for WEAN (97.7, against 99.4 for MTS-GCNN) and STAY ON (99.0, against 99.3 for MTS-GCNN). This shows that the proposed framework is particularly strong on ONSET, a clinically important transition event, without sacrificing overall predictive stability. Table 3 therefore highlights the importance of rare-event-sensitive metrics that cannot be fully captured by simple accuracy alone, and serves as the basis for interpreting the effect of loss function choice in RQ3.
Table 3. Test results for the intervention prediction models (all values in %).
| | RF | LR | CNN | CNN Notes | LSTM | LSTM Words | LSTM-GNN | MTS-GCNN | Ours |
|---|---|---|---|---|---|---|---|---|---|
| Onset AUC | 87.5 | 71.9 | 77.6 | 62.0 | 78.5 | 75.0 | 84.4 | 89.9 | 97.2 |
| Wean AUC | 98.9 | 93.2 | 98.2 | 91.0 | 98.7 | 90.0 | 98.7 | 99.4 | 97.7 |
| STAY ON AUC | 98.5 | 98.4 | 98.4 | 96.0 | 98.6 | 97.0 | 98.5 | 99.3 | 99.0 |
| STAY OFF AUC | 93.5 | 98.3 | 93.3 | 95.0 | 93.6 | 97.0 | 91.0 | 93.3 | 98.8 |
| Macro AUC | 81.6 | 78.5 | 70.9 | - | 77.8 | - | 85.2 | 91.9 | 98.2 |
| Accuracy | 94.6 | 90.4 | 91.9 | 86.0 | 92.4 | 90.0 | 93.2 | 95.5 | 99.2 |
| F1 Score | 52.4 | 47.7 | 46.5 | - | 47.7 | - | 60.6 | 60.6 | 79.4 |
| Macro AUPRC | 43.9 | 43.1 | 42.7 | - | 42.8 | - | 48.0 | 52.5 | 77.4 |
Table 4 and Figure 5 directly answer RQ3 by comparing the effect of loss function choice on intervention prediction performance under the same model architecture. Among the compared objectives, the class-balanced focal loss achieved the best performance in Accuracy, F1 score, Macro AUC, and Macro AUPRC. The gain was especially pronounced in F1 score and Macro AUPRC, showing that the detection of rare transition events under an imbalanced setting depends strongly on the choice of loss function. In other words, for intervention prediction it is more important to detect clinically meaningful minority events reliably than to achieve high overall accuracy alone, and the class-balanced focal loss was the most effective objective for this purpose.
In contrast, Cross Entropy Loss and Focal Loss used alone fell short on the rare-event metrics: Cross Entropy Loss reached an F1 score of 0.6936 against 0.7940 for the class-balanced focal loss, and Focal Loss left ONSET AUC near chance level (0.5002). Negative Log-Likelihood Loss showed abnormally low accuracy (0.0364), which made stable comparison difficult. This suggests that a simple optimization objective is insufficient to balance majority and rare classes simultaneously. By combining class-frequency correction with hard-sample focusing, the class-balanced focal loss allocated more learning capacity to clinically important minority transition events such as ONSET and WEAN and produced the most stable overall results. Table 4 therefore provides a clear answer to RQ3: in an imbalanced intervention-event setting, the choice of loss function has a decisive effect on clinically meaningful detection performance, and the class-balanced focal loss provides the most consistent improvement.
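The combination of class-frequency correction and hard-sample focusing can be illustrated with a short sketch. This is one common formulation, assuming effective-number class weights of the form (1 − β)/(1 − β^n_c) and a softmax-based focal term; the exact variant and hyperparameters used in the study are not stated in this section, and the values of `beta`, `gamma`, and the class counts below are purely hypothetical.

```python
import math
from typing import List, Sequence


def class_balanced_weights(class_counts: Sequence[int], beta: float = 0.999) -> List[float]:
    """Effective-number weights (1 - beta) / (1 - beta**n_c), rescaled to mean 1."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cb_focal_loss(logits: Sequence[float], target: int,
                  weights: Sequence[float], gamma: float = 2.0) -> float:
    """Per-sample loss: -w_y * (1 - p_y)**gamma * log(p_y)."""
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical class counts in the order ONSET, WEAN, STAY ON, STAY OFF.
counts = [500, 1500, 40000, 58000]
weights = class_balanced_weights(counts)
print([round(w, 3) for w in weights])

# An uncertain ONSET sample contributes more than a confidently correct one.
print(cb_focal_loss([0.0, 0.0, 0.0, 0.0], 0, weights))
print(cb_focal_loss([4.0, 0.0, 0.0, 0.0], 0, weights))
```

With `gamma = 0` and uniform weights the expression reduces to ordinary cross entropy, so the two factors can be read as separate corrections: the weight rebalances classes by frequency, and the focal term shifts gradient toward samples the model still misclassifies.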
Table 4. Average comparison across loss functions for the intervention prediction model.
| | Cross Entropy [50] | KL Divergence [51] | Negative Log-Likelihood [52] | Focal [53] | Ours |
|---|---|---|---|---|---|
| Onset AUC | 0.9166 | 0.7780 | 0.4711 | 0.5002 | 0.9721 |
| Wean AUC | 0.9338 | 0.9087 | 0.9101 | 0.8030 | 0.9769 |
| STAY ON AUC | 0.9820 | 0.9837 | 0.9812 | 0.9520 | 0.9904 |
| STAY OFF AUC | 0.9805 | 0.9915 | 0.9976 | 0.9790 | 0.9884 |
| Accuracy | 0.9858 | 0.9795 | 0.0364 | 0.8052 | 0.9917 |
| F1 Score | 0.6936 | 0.6535 | 0.3940 | 0.3741 | 0.7940 |
| Macro AUC | 0.9532 | 0.9155 | 0.8400 | 0.8086 | 0.9819 |
| Macro AUPRC | 0.6872 | 0.6328 | 0.6626 | 0.4676 | 0.7742 |
Taken together, the results across the two tasks provide consistent answers to the three research questions. For RQ1, stable prediction in a large-scale ICD code space was shown to be possible using ICU-derived structured and temporal representations. For RQ2, model architecture and loss function were found to contribute in different ways to automated ICD coding: architecture provided the main basis for ranking quality within the candidate set and for distributional stability, while Adaptive CLPL provided an additional improvement that was well aligned with partial-label and long-tail settings. Finally, for RQ3, the detection of rare events in intervention prediction was found to depend strongly on the choice of loss function, and the class-balanced focal loss produced the most consistent performance gain. These findings show that the dominant determinant of performance differs according to the statistical characteristics and learning objective of each task. In automated ICD coding, structural representational capacity for high-order interactions is central, whereas in intervention prediction, the loss design that directly addresses class imbalance plays the more decisive role. Although these controlled comparisons do not constitute a full module-wise ablation in which the ICD coding module, gating mechanism, and temporal backbone are removed one by one, they nevertheless provide partial quantitative evidence about which design factors dominate performance under the present experimental conditions.
4.6. Computational Cost and Practical Considerations
Because CAGE combines diagnostic-context estimation, context-adaptive gating, and temporal intervention prediction, its overall workflow is more complex than that of single-branch baseline models. To provide a practical sense of this complexity, we report the execution context of the full experimental run in addition to predictive performance. The implementation was executed on a Linux system with Python 3.12, PyTorch, and CUDA support, using an AMD Ryzen Threadripper PRO 3955WX CPU and an NVIDIA A100 40 GB GPU. The complete run required 30,536.12 s, corresponding to approximately 8.48 h wall-clock time. Stage-wise logs show that ETL required 307.60 s, automated ICD coding required 25,120.38 s, and intervention prediction required 5073.47 s. This indicates that the main computational burden of the present workflow arose from the ICD coding stage rather than from the final intervention prediction stage. This profile is consistent with the structure of the task. The ICD component operated over 6130 classes and, in the present experimental workflow, evaluated seven model families across three loss function settings. In addition, the configured Adaptive CLPL setting used frequency-based head–tail decomposition and sampled tail negatives, with head and tail sample sizes both set to 800, rather than exhaustively traversing the full negative class space at every step. This design choice was intended to keep large-label learning computationally tractable while preserving the partial-label objective in the tail region.
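The runtime figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the variable names are illustrative, and the count of 21 ICD training configurations assumes that every one of the seven model families was paired with each of the three loss settings.

```python
from typing import Dict

# Stage-wise wall-clock times in seconds, as reported in the text.
STAGE_SECONDS: Dict[str, float] = {
    "ETL": 307.60,
    "automated ICD coding": 25120.38,
    "intervention prediction": 5073.47,
}
TOTAL_SECONDS = 30536.12  # reported duration of the complete run


def stage_shares(stage_seconds: Dict[str, float], total_seconds: float) -> Dict[str, float]:
    """Fraction of the total run time spent in each stage."""
    return {name: sec / total_seconds for name, sec in stage_seconds.items()}


shares = stage_shares(STAGE_SECONDS, TOTAL_SECONDS)
total_hours = TOTAL_SECONDS / 3600.0
# Time not attributed to any of the three named stages.
unattributed = TOTAL_SECONDS - sum(STAGE_SECONDS.values())
# Assumption: full cross of 7 model families and 3 loss settings.
icd_configs = 7 * 3

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total_hours:.2f} h, unattributed: {unattributed:.2f} s")
print(f"ICD configurations (assumed full cross): {icd_configs}")
```

The ICD coding stage accounts for roughly 82% of the run and intervention prediction for about 17%, while ETL is about 1%. The three stages sum to 30,501.45 s, leaving about 35 s that the stage-wise logs do not attribute to a named stage.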
From the standpoint of deployment, the runtime values reported above should be interpreted as characterizing the full experimental workflow, including repeated model comparison and training, rather than an optimized online inference service. In practical use, only forward inference with a trained model would be required. Moreover, the task itself is defined on a fixed-window prediction setting with a 6 h input window, a 6 h gap, and a 4 h prediction horizon. Accordingly, the intended operational mode is periodic risk updating within ICU decision support rather than sub-second hard real-time control. Under this interpretation, CAGE is better viewed as a near-real-time clinical decision support framework that can refresh risk estimates as new patient data accumulate. At the same time, fine-grained profiling of per-sample inference latency, peak GPU memory usage, and production-level throughput was not separately instrumented in the present run and remains an important topic for future deployment-oriented evaluation.
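The fixed-window setting can be made concrete with a small sketch that slices one prediction sample from an hourly ventilation record. The 6 h input window, 6 h gap, and 4 h horizon come from the text; the labelling rule is an assumption made for illustration (STAY ON or STAY OFF when the status is constant over the prediction window, otherwise ONSET or WEAN according to the status at the start of that window), and `make_sample` and `label_window` are hypothetical names.

```python
from typing import Optional, Sequence, Tuple

INPUT_H, GAP_H, HORIZON_H = 6, 6, 4  # window lengths in hours, as stated in the text


def label_window(vent_window: Sequence[int]) -> str:
    """Assumed rule: constant status gives STAY ON/OFF, a change gives ONSET/WEAN."""
    if all(vent_window):
        return "STAY ON"
    if not any(vent_window):
        return "STAY OFF"
    return "ONSET" if vent_window[0] == 0 else "WEAN"


def make_sample(vent_on: Sequence[int], t: int) -> Optional[Tuple[range, str]]:
    """Build one sample at prediction time t (hours since admission).

    Returns the hours covered by the input window and the label of the
    prediction window, or None when either window falls outside the record.
    """
    start_in = t - INPUT_H
    start_pred = t + GAP_H
    end_pred = start_pred + HORIZON_H
    if start_in < 0 or end_pred > len(vent_on):
        return None
    return range(start_in, t), label_window(vent_on[start_pred:end_pred])


# Toy 24 h record: ventilated from hour 14 through hour 19 (hypothetical).
vent = [0] * 14 + [1] * 6 + [0] * 4

for t in (6, 10, 12, 14):
    hours, label = make_sample(vent, t)
    print(f"t={t}: input hours {hours.start}-{hours.stop - 1}, label {label}")
```

Calling such a function at each new hour corresponds to the periodic risk updating described above: every refresh looks back 6 h and targets a window that starts 6 h ahead, so a new estimate is only as fresh as the most recent completed input window.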
5. Applications and Discussions
This section interprets the design principles and performance patterns of CAGE on the basis of the preceding experimental results, and discusses its potential clinical applications and limitations. CAGE adopts a hierarchical structure in which automated ICD coding estimates admission-level diagnostic context as a probability distribution and injects it into the downstream intervention prediction stage as a conditioning variable, allowing the same physiological signal to be interpreted differently depending on the disease background. The purpose of this section is therefore not to repeat the numerical results, but to connect the observed findings for each research question with their underlying causes and with their implications for practical clinical decision support. First, with respect to RQ1, the automated ICD coding results showed that stable estimation of diagnostic context in a large-scale ICD code space is possible using structured ICU representations derived from temporal observations and admission-level clinical variables. This suggests that quantitative information collected in the ICU, such as vital signs, laboratory results, and treatment history, can serve not merely as auxiliary data but as sufficiently informative proxy indicators of a patient’s disease background. In other words, specific disease groups form characteristic combinations of physiological trajectories and laboratory patterns, and the model was able to concentrate probability mass on plausible diagnostic regions from these recurring patterns. This observation is particularly important in real hospital settings. If higher-level diagnostic context can be recovered to a meaningful extent from structured ICU representations even when clinical text or final diagnosis records are delayed or incomplete, then the practical value of early-prediction CDSSs increases substantially.
With respect to RQ2, the automated ICD coding task showed that model architecture and loss function contribute in different ways. Structural differences produced consistent performance gaps in ranking quality within the candidate set, probability concentration, and overall stability of the predicted distribution, while the loss function provided an additional corrective effect on top of this base. This indicates that the core of diagnostic context estimation depends less on the optimization technique itself than on how effectively the model can represent high-order interactions among structured clinical variables. The same decrease in blood pressure or oxygen saturation may have different diagnostic meanings depending on age, underlying disease, laboratory findings, and treatment history, and the capacity to capture such interactions at the model level forms the basis of ICD coding performance. By contrast, Adaptive CLPL reflects both the partial-label constraint and the long-tail structure, helping a model with sufficient structural expressiveness concentrate probability more stably on the candidate region. In this sense, for ICD coding, architecture determines the upper bound of performance, whereas loss design functions as a refinement mechanism that helps the model reach that level more reliably. With respect to RQ3, the intervention prediction results clearly showed that the choice of loss function has a decisive effect on the detection of rare transition events. This is directly related to the statistical nature of the intervention prediction task. ONSET and WEAN are clinically important but infrequent, so a general-purpose objective naturally allocates more learning signal to majority classes such as STAY ON and STAY OFF. As a result, a model may achieve high overall accuracy while still failing to detect clinically important transition events. 
The superiority of the class-balanced focal loss observed in the experiments can be interpreted as arising from its combination of class-frequency correction and hard-sample focusing, which effectively counteracted this structural imbalance. That is, in intervention prediction, it is more important to design an objective function that allocates sufficient learning capacity to clinically important rare events than simply to use a more complex architecture.
Taken together, these results show that the hierarchical design of CAGE does not simply place two tasks in parallel, but integrates them as complementary components. The upper-stage automated ICD coding module functions not as an independent auxiliary classifier, but as a diagnostic context extractor that summarizes admission-level disease background, while the lower-stage intervention predictor uses this context as a conditioning variable to interpret the same time-series pattern differently depending on the disease background. The main contribution of CAGE therefore lies not only in achieving strong performance on an individual task, but in linking diagnostic context estimation and intervention transition prediction through a single information flow, thereby more faithfully reflecting the hierarchical structure of real clinical reasoning. This architecture can be used in actual clinical settings as a near-real-time early warning system. Rather than only classifying the current state or detecting abnormalities retrospectively, CAGE integrates accumulated observations and diagnostic context to predict future state changes over a predefined time horizon, thereby signaling risk in advance. This is clinically meaningful because it supports a shift from a reactive management paradigm, in which action is taken only after deterioration occurs, to a preventive management paradigm, in which preparation occurs before risk materializes. In real intensive care practice, abrupt deterioration often occurs over minutes to hours, and delays in response can directly worsen prognosis. A system that provides a predictive time window therefore gives clinicians meaningful preparation time and offers operational value beyond statistical accuracy alone. 
More specifically, by identifying patients whose risk probabilities are increasing in advance, clinicians can proactively adjust treatment plans by preparing ventilators and related equipment beforehand, reallocating nursing staff to patients requiring closer observation, and initiating additional tests or interventions earlier; for example, if a patient is predicted to have a high ONSET risk, equipment and personnel can be prepared and additional evaluation can be performed before emergency intubation becomes necessary. If a patient is predicted to have a high probability of WEAN, extubation timing can be reviewed more actively in order to reduce complications associated with unnecessary continuation of mechanical ventilation. Such actions can mitigate the urgency of post-event decision-making and contribute to more systematic optimization of resource use. Beyond the level of individual patients, the predictions may also be useful for bed management and ICU workflow planning. If an increase in high-risk patients is expected during a certain time period, the ward or ICU can establish advance strategies for adjusting staffing and equipment availability. In this setting, the output of CAGE functions not merely as a reference indicator, but as an operational decision-support tool that advances intervention timing and resource allocation. In other words, the proposed framework shows the possibility that a CDSS can function not only as a means of improving patient-level predictive accuracy, but also as a mechanism that induces meaningful behavioral change in clinical practice.
Despite this potential, the present study has several limitations. First, the experimental results should be interpreted as evidence obtained from a retrospective evaluation on a single public dataset derived from one clinical environment. Although patient-level independence was enforced to prevent identifier-based leakage, this design alone does not establish transferability across institutions, care periods, or documentation systems. In practice, clinical protocols, patient case mix, measurement frequency, variable availability, and missingness patterns may differ substantially across hospitals and time periods, and such differences can alter both the structure of ICU time-series data and the statistical conditions under which the model operates. Because CAGE integrates diagnostic context extraction with intervention prediction, its performance may also be affected by changes in coding practice and diagnostic granularity, including transitions from ICD-9 to more recent coding systems such as ICD-10 or ICD-11. These considerations imply that the present results should be understood as promising single-center retrospective evidence rather than as definitive proof of broad cross-site generalizability. Future validation should therefore include multi-center external evaluation, temporally separated validation schemes that better reflect prospective deployment, and robustness analyses under changes in coding systems and data structure. Second, although the present evaluation includes controlled comparisons that separate architecture- and loss-level effects, it does not include a full sequential ablation in which the ICD coding module, the diagnosis-aware gating mechanism, and the temporal backbone are removed or simplified one by one. The current evidence should therefore be interpreted as a partial contribution analysis rather than as a complete module-wise attribution of all performance gains. 
In addition, although the present evaluation is mainly based on quantitative metrics, the success of a real CDSS is determined by its acceptance and usability in clinical settings. In particular, a framework that uses diagnostic context as a conditioning variable would have greater value if it could explain not only the prediction probability itself, but also which higher-level clinical context strengthened or weakened a given transition prediction. Future work should therefore include prospective user studies with clinicians to evaluate, in both quantitative and qualitative terms, how prediction outputs are integrated into actual decision-making processes, what forms of explanation and visualization produce the highest trust, and how the balance between alert frequency and clinical fatigue can be optimized. Finally, although this study hierarchically links diagnostic context extraction and intervention prediction, it does not explicitly model temporal changes in the context itself. The current diagnostic embedding is designed to function as a relatively stable higher-level context at the admission level but, in real clinical settings, the context may itself change over time as new disease information is confirmed and treatment responses accumulate. Future studies should therefore go beyond the current structure, in which admission-level context is treated as a static conditioning variable, and explore how it may be combined with dynamic diagnostic context that is updated along the temporal axis. Fine-grained deployment-oriented profiling, including per-sample inference latency, peak memory usage, and throughput under online updating, is also still needed before the framework can be fully characterized for production use. These issues define the main directions for extending the present framework toward a more generalizable and practically deployable clinical decision support system.
6. Conclusions
In this study, we proposed CAGE, a context-adaptive gated embedding framework that hierarchically integrates admission-level diagnostic context with ICU time-series data for mechanical ventilation prediction. By formulating automated ICD coding as a partial-label learning problem, the framework estimates higher-level diagnostic context from structured ICU observations and injects this information into downstream intervention prediction as a conditioning signal. Through this design, CAGE improved the detection of clinically critical rare transition events, particularly ONSET and WEAN, while maintaining stable overall predictive performance compared with pure time-series baselines. These findings indicate that the main contribution of CAGE lies not only in performance improvement, but also in providing a unified modeling framework that connects diagnostic context estimation and intervention transition prediction in a manner more consistent with the hierarchical structure of clinical reasoning. Nevertheless, several limitations remain. The present study was evaluated retrospectively on a single public dataset from one clinical environment, and its findings should therefore be interpreted with caution when considering transferability to other institutions or deployment settings. Although the evaluation design reduced identifier-based contamination through patient-level independence, its generalizability across hospitals, time periods, documentation practices, and coding systems remains unverified. In particular, differences in variable composition, observation density, missingness patterns, and diagnostic coding granularity may affect both diagnostic context formation and downstream intervention prediction. 
More explicit structural ablation studies are also needed to isolate the contributions of the ICD coding module and the gating mechanism, and practical deployment issues—including computational efficiency, latency, interpretability, and workflow integration—require further investigation.
Several directions for future research remain. First, broader validation under more rigorous generalization settings is needed, including multi-center external evaluation, temporally separated validation schemes, and robustness analyses under coding-system and data-structure shifts such as transitions from ICD-9 to ICD-10 or ICD-11. Second, extending the current framework from static admission-level context to dynamic diagnostic context that evolves over time would be an important next step. Third, further deployment-oriented evaluation is required, particularly with respect to computational requirements, explanation interfaces, and usability within real clinical decision support workflows. Through such developments, CAGE may provide a stronger foundation for practical and clinically meaningful predictive decision support in intensive care settings.