Article

CoTD-VAE: Interpretable Disentanglement of Static, Trend, and Event Components in Complex Time Series for Medical Applications

School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7975; https://doi.org/10.3390/app15147975
Submission received: 28 May 2025 / Revised: 8 July 2025 / Accepted: 9 July 2025 / Published: 17 July 2025

Abstract

Interpreting complex clinical time series is vital for patient safety and care, as it is both essential for supporting accurate clinical assessment and fundamental to building clinician trust and promoting effective clinical action. In complex time series analysis, decomposing a signal into meaningful underlying components is often a crucial means for achieving interpretability. This process is known as time series disentanglement. While deep learning models excel in predictive performance in this domain, their inherent complexity poses a major challenge to interpretability. Furthermore, existing time series disentanglement methods, including traditional trend or seasonality decomposition techniques, struggle to adequately separate clinically crucial specific components: static patient characteristics, condition trend, and acute events. Thus, a key technical challenge remains: developing an interpretable method capable of effectively disentangling these specific components in complex clinical time series. To address this challenge, we propose CoTD-VAE, a novel variational autoencoder framework for interpretable component disentanglement. CoTD-VAE incorporates temporal constraints tailored to the properties of static, trend, and event components, such as leveraging a Trend Smoothness Loss to capture gradual changes and an Event Sparsity Loss to identify potential acute events. These designs help the model effectively decompose time series into dedicated latent representations. We evaluate CoTD-VAE on critical care (MIMIC-IV) and human activity recognition (UCI HAR) datasets. Results demonstrate successful component disentanglement and promising performance enhancement in downstream tasks. Ablation studies further confirm the crucial role of our proposed temporal constraints. CoTD-VAE offers a promising interpretable framework for analyzing complex time series in critical applications like healthcare.

1. Introduction

In recent years, deep learning has made substantial progress in time series analysis, providing more powerful tools for modeling complex and dynamic medical time series data [1]. Architectures such as RNNs, CNNs, and Transformers have proven effective at modeling healthcare data over time, capturing patterns that are complex and difficult to predict [2,3,4]. Healthcare data, such as electronic health records (EHRs) and measurements from medical devices, carries a wealth of clinical information, yet deep learning models applied to these data remain hard to interpret [5,6,7]. Improving interpretability requires making models more transparent and analyzing how medical time series data are generated.
For time series data, researchers have proposed a variety of VAE-based disentanglement models that decompose complex signals into components such as trend, seasonality, and random noise [8,9,10]. Medical time series, however, typically arise from a mixture of factors, and separating trend and seasonality alone is not enough; better methods are needed to isolate components that are more meaningful to clinicians and patients. Disentangled representation learning has recently attracted considerable academic attention as a new way of modeling data [11]. The variational autoencoder (VAE) is a widely used representation learning model; it achieves disentanglement by learning a low-dimensional representation of the data and imposing specific prior distribution constraints on it [12].
To improve the separation of features in time series data, we propose the Constrained Temporal Disentangled Variational Autoencoder (CoTD-VAE). The model disentangles complex medical time series into three clinically relevant latent factors: a static factor, which captures baseline characteristics that are inherent to the patient and change slowly if at all; a trend factor, which represents the smooth evolution of disease state over time; and an event factor, which captures clinically important, transient changes in health status. This decomposition is designed to mirror the conceptual framework of clinical assessment, where a patient’s condition is understood through baseline characteristics, gradual physiological trends, and acute events [13,14].
Unlike existing approaches, which do not account for the multi-timescale nature of healthcare data, CoTD-VAE imposes explicit temporal constraints: a Trend Smoothness Loss and an Event Sparsity Loss. The Trend Smoothness Loss encourages the model to learn smooth, continuous trends by penalizing abrupt changes in the trend latent variables over time. The Event Sparsity Loss, in turn, guides the model to identify unexpected event points that are temporally sparse and deviate from the underlying trend by imposing sparsity constraints (e.g., the L1 norm) on the event latent variables.
From an architectural perspective, CoTD-VAE builds on a base time series disentangled VAE (Figure 1). The encoder maps the input medical time series into three separate latent spaces: static, trend, and event. The decoder reconstructs the original input sequence from the learned latent variables (Figure 2). In the model’s optimization objective (loss function), we add the trend smoothness and event sparsity regularization terms to the standard reconstruction loss and KL divergence. The learned disentangled representations (static, trend, event) can then be applied to downstream prediction tasks. Comparison and ablation experiments demonstrate the quality of these representations: CoTD-VAE outperforms benchmark models on downstream prediction tasks. This paper makes the following contributions:
  • We propose CoTD-VAE, a novel disentangled variational autoencoder that decomposes medical time series data into three components: static, trend, and event.
  • CoTD-VAE incorporates explicit temporal constraints, namely a trend smoothness loss and an event sparsity loss, which improve the model’s ability to capture and distinguish dynamic changes in medical data at different timescales.
  • We evaluate CoTD-VAE and its learned disentangled representations on a real healthcare risk prediction task. The experimental results show that they are effective and outperform competing models.
The rest of this paper is organized as follows. Section 2 reviews related work on time series analysis, disentangled representation learning, and VAE applications in healthcare. Section 3 describes the architecture, mathematical formulation, and implementation of the proposed CoTD-VAE. Section 4 presents the datasets, evaluation metrics, baseline models, and detailed experimental setup. Section 5 reports the experimental results, Section 6 discusses them, and Section 7 concludes the paper and outlines future research directions.

2. Related Work

In this section, we review research related to our proposed disentangled variational autoencoder for time series, covering healthcare time series analysis, disentangled representation learning, and the use of variational autoencoders in time series modeling. We focus on the challenges existing approaches face when dealing with complex healthcare data.

2.1. Medical Time Series Analysis

Medical time series analysis is an active area of clinical research and practice. Classical statistical methods, such as autoregressive integrated moving average (ARIMA) models [15] and Kalman filtering [16], have long been used to model and forecast medical data. However, such linear models often struggle with modern healthcare big data, which exhibits complex nonlinear patterns and long-term dependencies. In recent years, deep learning models have advanced medical time series analysis thanks to their ability to learn features automatically. Recurrent neural networks (RNNs) and their variants, such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs), are now standard methods for healthcare time series prediction, as they effectively model the temporal dynamics of sequence data [17]. Different architectures, including GRUs, LSTMs, their bidirectional and multilayer variants, feature-specific networks, and target replication strategies, offer specific advantages in different scenarios. Convolutional neural networks (CNNs) show considerable promise for recognizing and classifying electroencephalography (EEG) signals, which is expected to yield efficient solutions to practical problems in medical and brain–computer interface systems [18]. The Transformer-based ETHOS model uses zero-shot learning to predict future health trajectories from high-dimensional, heterogeneous, and intermittently sampled health data such as patient health timelines (PHTs) [19]. A Transformer architecture pretrained on large volumes of EHR data has been used to predict the risk of severe pulmonary complications in patients with SARS-CoV-2, outperforming traditional machine learning models [20]. These methods have substantially improved predictive performance; however, the interpretability of the learned representations has received far less attention, particularly with respect to disentangled latent factors, which is one focus of this study.

2.2. Disentangled Representation Learning

Disentangled representation learning (DRL) aims to enable a model to separate the potentially independent factors underlying the data, improving the interpretability, generalization, and controllability of the representation [11]. The variational autoencoder (VAE), which learns the data distribution through variational inference, is a popular backbone for DRL. Researchers have proposed various methods to improve the disentangling ability of VAEs on complex datasets. β-VAE encourages independence among latent variables by weighting the KL term with a penalty coefficient β, improving disentanglement, although a large β can degrade reconstruction quality [21]. FactorVAE enforces independence among latent variables by minimizing their total correlation, further improving disentanglement [22]. DIP-VAE matches the covariance matrix of the latent variables to that of the prior distribution, which is useful in settings that require strict mathematical guarantees [23]. JointVAE disentangles both continuous and discrete latent variables, which is useful when different types of latent factors must be handled [24]. RF-VAE improves disentangling capability by introducing relevance indicator variables to identify important latent factors [25]. By designing appropriate model structures and loss functions, a VAE can be encouraged to learn a disentangled latent representation [26]. In the medical field, disentangled representation learning has been applied to medical images and genetic data, helping to identify disease causes and subtypes [27,28,29,30].

2.3. Disentangled Representation Learning for Time Series Based on LSTM, Transformer and VAE

The application of disentangled representation learning to time series data has driven research on disentangled temporal variational autoencoders [31,32,33]. The goal of these models is to decompose a complex time series signal into several independent components with specific meanings, such as trend and seasonality [34,35]. Mainstream interpretability methods in this area generally fall into two categories. The first achieves disentanglement through multi-encoder architectures, as demonstrated by Kim and Cho [36]. The second employs compositional generation based on a function library, with the ITF-VAE model of Klopries and Schwung [37] as a representative example. These methods typically adopt a variational autoencoder (VAE) as the main framework and use recurrent neural networks (e.g., LSTMs), Transformers, or convolutional neural networks (CNNs) as the encoder and decoder to capture the dynamic properties of the time series [38,39]. In the medical field, disentangled VAEs have been applied to cardiac signals, learning to represent distinct parts of a heartbeat signal and to detect abnormalities [40]. However, these VAE variants struggle to separate the distinct properties of the underlying factors in data with multiple sources of complexity, and they lack temporal constraints tailored to the characteristics of medical data [41].

3. Methods

This section describes CoTD-VAE, which is designed to learn disentangled representations of complex time series data and apply them to classification [42]. We first describe the overall structure of the model and then examine its components: the encoder structure, decoder design, temporal constraints, and training strategy. CoTD-VAE comprises three parallel encoders, a decoder, and a classifier module. It disentangles time series features into three latent variables: static features ($z_{\text{static}}$), trend features ($z_{\text{trend}}$), and event features ($z_{\text{event}}$). Given a time series $x \in \mathbb{R}^{C \times L}$, where $C$ is the number of feature channels and $L$ is the sequence length, the three encoders of CoTD-VAE map it to three independent latent distributions:
$$q(z_{\text{static}} \mid x) = \mathcal{N}\left(\mu_{\text{static}}(x),\, \operatorname{diag}\,\sigma_{\text{static}}^{2}(x)\right),$$
$$q(z_{\text{trend}} \mid x) = \mathcal{N}\left(\mu_{\text{trend}}(x),\, \operatorname{diag}\,\sigma_{\text{trend}}^{2}(x)\right),$$
$$q(z_{\text{event}} \mid x) = \mathcal{N}\left(\mu_{\text{event}}(x),\, \operatorname{diag}\,\sigma_{\text{event}}^{2}(x)\right)$$
where $\mu$ and $\sigma$ denote the mean and standard deviation functions, respectively, and $\operatorname{diag}$ denotes the diagonal covariance matrix.
Encoder design. CoTD-VAE uses three distinct encoder designs, each focused on a different type of time series feature [43,44]. The static feature encoder uses a Temporal Encoder (TEC) architecture that supports either an RNN or a CNN implementation: the RNN mode uses a bidirectional LSTM to capture global sequence information, while the CNN mode uses convolutional layers and average pooling to extract the overall features of the sequence. This encoder outputs fixed-dimensional latent variables that represent the static characteristics of the whole sequence. The trend feature encoder uses a one-dimensional convolutional network with multiple convolutional layers, batch normalization, and Dropout; its output is a sequence of latent variables of the same length as the input, so the temporal dimension is preserved, making the structure suitable for capturing long-term trends. The event feature encoder is similar to the trend encoder but applies sparsity regularization to capture short-lived or emergent patterns. All encoders output a mean $\mu$ and a log-variance $\log \sigma^2$. Latent variables are sampled from the posterior distribution via the reparameterization trick: $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$ [45], where $\odot$ denotes element-wise multiplication; this trick allows gradients to be backpropagated through the stochastic sampling step for end-to-end training.
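To make the sampling step concrete, the following is a minimal PyTorch sketch of the reparameterization trick described above; the tensor shapes are illustrative and not taken from the authors’ implementation.

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Because the randomness lives in eps, gradients can flow through
    mu and logvar, enabling end-to-end training.
    """
    std = torch.exp(0.5 * logvar)  # sigma recovered from the log-variance
    eps = torch.randn_like(std)    # eps ~ N(0, I)
    return mu + std * eps          # element-wise multiplication

# Illustrative trend posterior: batch of 8, 16 latent dims, sequence length 48.
mu = torch.zeros(8, 16, 48)
logvar = torch.zeros(8, 16, 48)
z_trend = reparameterize(mu, logvar)
```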
Decoder design. The decoder receives a combined representation of the three latent variables and reconstructs the original time series. Specifically, the static latent variables are expanded to match the sequence length, concatenated with the trend and event latent variables along the feature dimension, and averaged along the sequence dimension; the original sequence is then reconstructed by a Temporal Decoder (TD), which likewise supports RNN or CNN implementations.
Temporal constraints. We introduce temporal constraints [46] to guide different latent variables to learn specific types of features.
Trend Smoothness Loss ($\mathcal{L}_{\text{smooth}}$). The purpose of the trend smoothness loss is to ensure that the latent variable $z_{\text{trend}}$ captures the low-frequency, slowly varying dynamics of the time series. This is achieved by penalizing large variations in the latent trend over time. Our loss combines penalties on both the first- and second-order differences of the latent time series:
$$\mathcal{L}_{\text{smooth}} = \mathcal{L}_{\text{smooth}}^{(1)} + \alpha \cdot \mathcal{L}_{\text{smooth}}^{(2)}$$
where $\alpha$ is a weighting factor.
The first-order loss $\mathcal{L}_{\text{smooth}}^{(1)}$ penalizes the velocity of the trend, encouraging gradual changes:
$$\mathcal{L}_{\text{smooth}}^{(1)} = \frac{1}{B} \sum_{i=1}^{B} \frac{1}{d_{\text{trend}} \times (L-1)} \sum_{j=1}^{d_{\text{trend}}} \sum_{t=1}^{L-1} \left\| z_{\text{trend}}(i,j,t) - z_{\text{trend}}(i,j,t-1) \right\|_2^2$$
The second-order loss $\mathcal{L}_{\text{smooth}}^{(2)}$ penalizes the acceleration of the trend, promoting a constant rate of change:
$$\mathcal{L}_{\text{smooth}}^{(2)} = \frac{1}{B} \sum_{i=1}^{B} \frac{1}{d_{\text{trend}} \times (L-2)} \sum_{j=1}^{d_{\text{trend}}} \sum_{t=2}^{L-1} \left\| z_{\text{trend}}(i,j,t) - 2\, z_{\text{trend}}(i,j,t-1) + z_{\text{trend}}(i,j,t-2) \right\|_2^2$$
In these equations, $z_{\text{trend}}(i,j,t)$ denotes the latent value for the $i$-th sample and $j$-th trend dimension at time step $t$; $B$, $d_{\text{trend}}$, and $L$ are the batch size, trend latent dimension, and sequence length, respectively. This combined penalty ensures that $z_{\text{trend}}$ represents a robust and smooth underlying signal.
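As a concrete illustration, the combined penalty can be rendered in PyTorch roughly as follows; this is a sketch that assumes $z_{\text{trend}}$ is a tensor of shape (B, d_trend, L), not the authors’ exact code.

```python
import torch

def trend_smoothness_loss(z_trend: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """First- plus second-order difference penalty on trend latents of shape (B, d_trend, L)."""
    # First-order differences z(t) - z(t-1): penalize the trend's "velocity".
    d1 = z_trend[:, :, 1:] - z_trend[:, :, :-1]
    loss1 = d1.pow(2).mean()  # mean over all entries matches the 1/(B * d_trend * (L-1)) normalization
    # Second-order differences z(t) - 2 z(t-1) + z(t-2): penalize "acceleration".
    d2 = z_trend[:, :, 2:] - 2 * z_trend[:, :, 1:-1] + z_trend[:, :, :-2]
    loss2 = d2.pow(2).mean()  # mean over all entries matches the 1/(B * d_trend * (L-2)) normalization
    return loss1 + alpha * loss2
```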
Event Sparsity Loss ($\mathcal{L}_{\text{sparse}}$). To guide the event latent variable $z_{\text{event}}$ to capture sparse, bursty, and significant short-term patterns, we designed a composite sparsity loss function, $\mathcal{L}_{\text{sparse}}$. While the specific combination of these losses is tailored for our disentanglement objective, its constituent parts are based on well-established principles of sparsity measurement from machine learning and signal processing. The loss function is composed of three distinct terms:
$$\mathcal{L}_{\text{sparse}} = \mathcal{L}_1 + \gamma_c \cdot \mathcal{L}_{\text{contrast}} + \gamma_p \cdot \mathcal{L}_{\text{peak}}$$
Here, $\mathcal{L}_1$ is the standard L1 regularization loss, $\mathcal{L}_{\text{contrast}}$ is a contrastive sparsity loss, and $\mathcal{L}_{\text{peak}}$ is a peak activation loss. The key weighting hyperparameters, $\gamma_c$ and $\gamma_p$ here and $\alpha$ in the smoothness loss, were determined empirically through multiple experiments and set to 0.2, 0.2, and 0.5, respectively, to optimize performance on the downstream task.
Let the event latent tensor be denoted $z_{\text{event}} \in \mathbb{R}^{B \times d_{\text{event}} \times L}$, where $B$ is the batch size, $d_{\text{event}}$ is the dimensionality of the event latent space, and $L$ is the sequence length. L1 Regularization Loss ($\mathcal{L}_1$): This loss, derived from classical LASSO regression [47], promotes overall sparsity by penalizing all non-zero activations in the latent space.
$$\mathcal{L}_1 = \frac{1}{B \cdot d_{\text{event}} \cdot L} \sum_{i=1}^{B} \sum_{j=1}^{d_{\text{event}}} \sum_{t=1}^{L} \left| z_{\text{event}}(i,j,t) \right|$$
Contrastive Sparsity Loss ($\mathcal{L}_{\text{contrast}}$): This loss enforces sparsity across the feature dimension, encouraging energy to concentrate in a few feature dimensions. This principle is explored in sparse coding [48]; our implementation achieves it by maximizing the standard deviation of the feature activation vector, so we define the loss as the negative of the mean standard deviation:
$$\mathcal{L}_{\text{contrast}} = -\frac{1}{B \cdot L} \sum_{i=1}^{B} \sum_{t=1}^{L} \operatorname{Std}\!\left( z_{\text{event}}(i,:,t) \right)$$
where $z_{\text{event}}(i,:,t) \in \mathbb{R}^{d_{\text{event}}}$ is the feature vector for the $i$-th sample at time step $t$, and $\operatorname{Std}(\cdot)$ is the standard deviation function.
Peak Activation Loss ($\mathcal{L}_{\text{peak}}$): This loss is designed to promote sparsity along the temporal dimension, encouraging sharp, transient peaks, a concept crucial in event detection [49,50]. We achieve this by maximizing the peak-to-average ratio of each feature channel; the loss is consequently defined as the negative of this ratio:
$$\mathcal{L}_{\text{peak}} = -\frac{1}{B \cdot d_{\text{event}}} \sum_{i=1}^{B} \sum_{j=1}^{d_{\text{event}}} \frac{\max_{t=1}^{L} \left| z_{\text{event}}(i,j,t) \right|}{\frac{1}{L} \sum_{t=1}^{L} \left| z_{\text{event}}(i,j,t) \right| + \epsilon}$$
where $\max_t(\cdot)$ finds the maximum value along the temporal dimension $t$. The constant $\epsilon$ in the denominator ensures numerical stability and prevents division by zero; in our implementation, $\epsilon = 1 \times 10^{-10}$.
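The three sparsity terms translate directly into tensor operations. Below is a hedged PyTorch sketch under the same shape assumption, (B, d_event, L); the default weights follow the values reported above.

```python
import torch

def event_sparsity_loss(z_event: torch.Tensor,
                        gamma_c: float = 0.2,
                        gamma_p: float = 0.2,
                        eps: float = 1e-10) -> torch.Tensor:
    """Composite sparsity penalty on event latents of shape (B, d_event, L)."""
    # L1 term: mean absolute activation over all entries.
    l1 = z_event.abs().mean()
    # Contrastive term: negative mean standard deviation across the feature
    # dimension, concentrating energy in a few feature dimensions.
    l_contrast = -z_event.std(dim=1).mean()
    # Peak term: negative peak-to-average ratio along time for each channel,
    # encouraging sharp, transient activations.
    abs_z = z_event.abs()
    peak = abs_z.max(dim=2).values  # shape (B, d_event)
    avg = abs_z.mean(dim=2) + eps   # shape (B, d_event); eps avoids division by zero
    l_peak = -(peak / avg).mean()
    return l1 + gamma_c * l_contrast + gamma_p * l_peak
```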
Training objective. CoTD-VAE is trained to minimize the following loss function:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{recon}} + \beta_{\text{static}} \cdot \mathcal{L}_{\text{KL}}^{\text{static}} + \beta_{\text{trend}} \cdot \mathcal{L}_{\text{KL}}^{\text{trend}} + \beta_{\text{event}} \cdot \mathcal{L}_{\text{KL}}^{\text{event}} + \lambda_{\text{smooth}} \cdot \mathcal{L}_{\text{smooth}} + \lambda_{\text{sparse}} \cdot \mathcal{L}_{\text{sparse}}$$
$\mathcal{L}_{\text{recon}}$ is the reconstruction loss, computed as the mean squared error (MSE). $\mathcal{L}_{\text{KL}}^{\text{static}}$, $\mathcal{L}_{\text{KL}}^{\text{trend}}$, and $\mathcal{L}_{\text{KL}}^{\text{event}}$ are the KL divergence losses for the three latent variables, measuring the difference between the posterior distributions and the standard normal prior. $\beta_{\text{static}}$, $\beta_{\text{trend}}$, and $\beta_{\text{event}}$ weight the KL divergence losses, and $\lambda_{\text{smooth}}$ and $\lambda_{\text{sparse}}$ weight the temporal constraints; in this implementation, all of these weights are designed as learnable parameters. During training, we apply numerical stabilization techniques, such as clipping gradient and loss values, to ensure stability and convergence.
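A sketch of how the terms can be assembled follows, reusing the loss sketches above; the KL helper is standard for diagonal Gaussians, but the exact reduction and weighting scheme of the original implementation is an assumption here.

```python
import torch

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), averaged over all elements."""
    return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp())).mean()

def total_loss(x, x_hat, posteriors, w):
    """posteriors maps 'static'/'trend'/'event' to (mu, logvar); w holds the weights."""
    loss = torch.nn.functional.mse_loss(x_hat, x)  # reconstruction term
    for name in ("static", "trend", "event"):
        mu, logvar = posteriors[name]
        loss = loss + w[f"beta_{name}"] * kl_to_standard_normal(mu, logvar)
    # Temporal constraints, applied here to the posterior means for simplicity
    # (the original implementation may use sampled latents instead).
    loss = loss + w["lambda_smooth"] * trend_smoothness_loss(posteriors["trend"][0])
    loss = loss + w["lambda_sparse"] * event_sparsity_loss(posteriors["event"][0])
    return loss
```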
Disentanglement and classification. CoTD-VAE is trained in two stages [51], separating representation learning from downstream task prediction. In the first stage, the variational autoencoder is trained to learn high-quality latent representations; the training objective minimizes the reconstruction loss, the KL divergence losses, and the temporal constraints, with no classification loss. After this stage, the encoder can map time series data to the three disentangled latent spaces. In the second stage, we freeze the trained encoder parameters and pass the time series data through the encoder to obtain the distribution parameters of the three latent variables: static, trend, and event. The statistical features of the three latent variables are then concatenated into a single feature vector, which is used as input to train and evaluate a Random Forest classifier.
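The second stage reduces to ordinary feature extraction plus a scikit-learn classifier. The sketch below is illustrative: the encoder methods (encode_static, encode_trend, encode_event) and the choice of summary statistics are hypothetical placeholders, while the Random Forest usage follows the standard scikit-learn API.

```python
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier

@torch.no_grad()
def latent_features(model, x: torch.Tensor) -> np.ndarray:
    """Concatenate summary statistics of the three frozen latent posteriors."""
    mu_s, logvar_s = model.encode_static(x)  # (B, d_static)
    mu_t, _ = model.encode_trend(x)          # (B, d_trend, L)
    mu_e, _ = model.encode_event(x)          # (B, d_event, L)
    feats = [mu_s, logvar_s,
             mu_t.mean(dim=2), mu_t.std(dim=2),               # trend summarized over time
             mu_e.mean(dim=2), mu_e.abs().max(dim=2).values]  # event peaks over time
    return torch.cat(feats, dim=1).cpu().numpy()

# Usage sketch (assumes a trained, frozen `model` and tensors x_train / y_train):
# clf = RandomForestClassifier(n_estimators=150, random_state=42)
# clf.fit(latent_features(model, x_train), y_train)
```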
Assessment methods. We use the mean squared error and mean absolute error to assess the model’s ability to reconstruct the original time series, assess the degree of disentanglement and expressiveness of the latent variables through visualization and statistical analysis, and compute prediction accuracy, precision, recall, and F1 score.

4. Experiments

In this section, we design experiments to understand the performance and potential limitations of CoTD-VAE. The experimental objectives are as follows: (1) verify the effectiveness of disentangled representation learning; (2) evaluate the importance of the temporal consistency constraints; (3) explore the generalization ability of CoTD-VAE on cross-domain data; and (4) discuss the interpretability and clinical value of the latent representation. All models were implemented in Python 3.12.8. Our proposed CoTD-VAE was built with the PyTorch framework (version 2.6.0), while the baseline models used the Scikit-learn (version 1.6.1), XGBoost (version 3.0.2), and LightGBM (version 4.6.0) libraries.

4.1. Datasets

We performed our comparison and ablation experiments on two datasets. The first is the UCI Human Activity Recognition (HAR) dataset [52], a widely used benchmark in human activity recognition. It comprises smartphone sensor recordings from 30 volunteers (ages 19 to 48) performing six everyday activities: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. UCI HAR contains 10,299 samples in total, divided into a training set (7352 samples, about 71%) and a test set (2947 samples, about 29%); we further hold out 20% of the training set as a validation set for hyperparameter optimization and early stopping. Each sample has nine signal channels: total acceleration (X, Y, and Z axes), body acceleration (X, Y, and Z axes), and body gyroscope signals (X, Y, and Z axes). The signals were noise-filtered and then segmented with a fixed-width sliding window (2.56 s, 128 samples) with 50% overlap between consecutive windows. We apply MinMaxScaler to normalize all feature values to the interval [0, 1], eliminating scale differences between features and stabilizing model training.
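For illustration, the windowing and normalization described above might look as follows; the array contents are stand-ins, and only the scikit-learn MinMaxScaler API is taken as given.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def sliding_windows(signal: np.ndarray, width: int = 128, overlap: float = 0.5) -> np.ndarray:
    """Split a (T, C) multichannel signal into overlapping (N, width, C) windows."""
    step = int(width * (1 - overlap))  # 64 samples for 50% overlap
    starts = range(0, signal.shape[0] - width + 1, step)
    return np.stack([signal[s:s + width] for s in starts])

raw = np.random.randn(1024, 9)      # stand-in for the 9-channel sensor recording
scaler = MinMaxScaler()             # maps each feature into [0, 1]
scaled = scaler.fit_transform(raw)  # in practice, fit on training data only
windows = sliding_windows(scaled)   # shape (N, 128, 9)
```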
The second dataset consists of multivariate time series from the MIMIC-IV clinical database [53], which documents the hospitalization of patients in the intensive care unit (ICU). The dataset is organized by patient and includes vital signs and treatment details from the ICU stay. The data were generated with a sliding window approach: a time step was created at 1 h intervals starting from the patient’s ICU admission, and each time step corresponds to a 3 h observation window summarizing the patient’s status during that period. Each time step contains the patient’s main physiological measurements (e.g., mean heart rate and mean systolic blood pressure), laboratory results (e.g., maximum platelet count and maximum D-dimer value), and treatment information (e.g., whether anticoagulant therapy was administered, length of stay, and whether a thrombosis-related diagnosis was recorded). The dataset extracted via SQL contains about 130,000 samples. We used StandardScaler from the scikit-learn library to standardize each feature to zero mean and unit standard deviation, making the features comparable and preparing the data for model training. For the thrombus prediction task, the MIMIC-IV dataset was divided into a training set (64%), a validation set (16%), and a held-out test set (20%), with all splits stratified.
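A minimal sketch of the standardization and stratified 64/16/20 split named above is given below; the feature matrix and labels are stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.randn(1000, 24)           # stand-in for per-window features
y = np.random.randint(0, 2, size=1000)  # stand-in thrombus labels

# Hold out 20% as the test set, stratified on the label.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
# Split the remainder 80/20, yielding 64%/16% of the full data.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.20, stratify=y_tmp, random_state=42)

# Standardize to zero mean and unit variance, fitting on training data only.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))
```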

4.2. Baselines

We chose the baselines from three perspectives: (1) comparing different sequence modeling architectures; (2) comparing deterministic and probabilistic generative models (AE vs. VAE); and (3) comparing different strategies for learning disentangled representations (β-VAE and CVAE vs. CoTD-VAE). Four baseline methods were selected for the comparison experiments. All of them encode the data and use the resulting representations as features for subsequent classification tasks.
Long Short-Term Memory Autoencoder (LSTM-AE) [54,55,56]: LSTM-AE is a classical approach for sequence data. It uses a bidirectional LSTM encoder and a unidirectional LSTM decoder: the encoder maps the input sequence to a fixed-dimensional latent vector, from which the decoder reconstructs the original sequence. The specific structure includes an input projection layer, a bidirectional LSTM encoder (hidden dimension 128, 2 layers), a bottleneck layer (LN + ReLU activation), and a decoder. The MSE loss function is used for reconstruction.
Transformer Autoencoder (Transformer-AE) [57,58]: The Transformer architecture, based on the self-attention mechanism, has been highly successful in sequence understanding tasks in recent years. Our Transformer-AE comprises a Transformer encoder and decoder with positional encoding to capture sequence order. The model dimension is 64, with four attention heads, two encoder and two decoder layers, and a feedforward dimension of 128. Once again, mean squared error (MSE) is used as the reconstruction loss.
Beta Variational Autoencoder (β-VAE) [21]: β-VAE is a VAE variant that adjusts a weight (β) to balance reconstruction quality against the degree of disentanglement. We use a CNN backbone: the encoder has several convolutional layers (channel configuration (16, 32), kernel size 5, stride 2) with batch normalization, mapping the inputs to the mean and variance parameters of the latent variables, and the decoder uses transposed convolutions to recover the original sequence. The loss function combines the reconstruction error with a weighted KL divergence.
Conditional Variational Autoencoder (CVAE) [59,60]: CVAE introduces category information as a condition in the generation process and learns the conditional distribution of each activity category. The structure is similar to β-VAE but adds a category embedding (dimension 8) in both the encoding and decoding stages, allowing the model to generate class-specific reconstructions and learn condition-sensitive representations.
We applied systematic hyperparameter optimization to all baseline models, using a grid search over key hyperparameters for each model: hidden dimension and number of layers for the LSTM, model dimension and number of attention heads for the Transformer, latent space dimension and regularization strength for the VAE variants, and so on. The configuration with the best validation performance was chosen for complete training.
The autoencoder/VAE model for each configuration was trained first, and the model state from the epoch with the lowest reconstruction loss on the validation set was saved for the next stage. A classifier was then trained on the frozen representations from the best-performing autoencoder to evaluate the discriminative performance of the latent representation. The classifier contains two hidden layers (dimensions 128 and 64) and uses BatchNorm and Dropout (0.3) to improve generalization. The final classifier was selected based on the highest classification accuracy on a separate validation set, using cross-entropy loss and an Adam optimizer (learning rate 5 × 10−4); an early stopping strategy was used to prevent overfitting.
The specific hyperparameter configurations used to generate the baseline results for the MIMIC-IV thrombus prediction task are detailed here. Logistic Regression was implemented with C=1.0 and class_weight=‘balanced’. Random Forest utilized n_estimators=150, max_depth=15, and also set class_weight=‘balanced’. The gradient boosting models, XGBoost and LightGBM, were both configured with n_estimators=150, max_depth=5, learning_rate=0.1, and a scale_pos_weight calculated as the ratio of negative to positive samples to specifically address class imbalance. As a direct deep learning benchmark, the LSTM Classifier operated on the original sequential data; it was constructed as a 2-layer bidirectional LSTM with a hidden dimension of 128 and trained with an Adam optimizer (lr = 5 × 10−4) and a weighted loss function. The random state was fixed to 42 for all experiments.
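For reproducibility, the baseline configurations listed above translate into roughly the following scikit-learn/XGBoost/LightGBM instantiations; the labels are stand-ins, and max_iter for Logistic Regression is an added assumption for convergence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

y_train = np.random.randint(0, 2, size=1000)  # stand-in training labels
# scale_pos_weight = (# negative samples) / (# positive samples)
spw = (y_train == 0).sum() / max((y_train == 1).sum(), 1)

baselines = {
    "logreg": LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=150, max_depth=15,
                                 class_weight="balanced", random_state=42),
    "xgb": XGBClassifier(n_estimators=150, max_depth=5, learning_rate=0.1,
                         scale_pos_weight=spw, random_state=42),
    "lgbm": LGBMClassifier(n_estimators=150, max_depth=5, learning_rate=0.1,
                           scale_pos_weight=spw, random_state=42),
}
```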

4.3. Ablation Experiments

We used a multivariate time series dataset based on the MIMIC-IV clinical database to evaluate the impact of the trend smoothness constraint and the event sparsity constraint within the model. We constructed two CoTD-VAE variants:
  • No Smoothness: removes the trend smoothness loss to test the impact of the trend smoothness constraint on model performance.
  • No Sparsity: removes the event sparsity loss to test the impact of the event sparsity constraint on model performance. By comparing the performance of these variants with the full model, we can quantify the contribution of the two components to overall performance and verify the validity of our proposed temporal constraints.

5. Results

5.1. Reconstruction Task

Mean Squared Error (MSE) and Mean Absolute Error (MAE) are employed to evaluate reconstruction performance, which reflects the model’s ability to capture the essential features of the data. MSE is defined as follows:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$$
MSE is the mean of the squared differences between the original and reconstructed data, where $x_i$ denotes the original data point, $\hat{x}_i$ the corresponding reconstructed data point, and $n$ the total number of samples. Because errors are squared, the metric is more sensitive to large errors, which helps identify significant reconstruction distortions. In activity recognition, MSE captures the model’s ability to preserve the details of an action and is more sensitive to large-magnitude action features. Mean Absolute Error (MAE) is defined as follows:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \hat{x}_i|$$
MAE is the mean absolute difference between the original and reconstructed data, with $x_i$ the original data points, $\hat{x}_i$ the reconstructed data points, and $n$ the total number of samples. MAE is less sensitive to outliers and gives a more balanced picture of overall reconstruction quality than MSE. Together, the two metrics provide a thorough assessment of reconstruction performance: MSE focuses on capturing important details, while MAE reflects overall accuracy.
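Both metrics follow directly from their definitions; a NumPy rendering is shown below for concreteness.

```python
import numpy as np

def mse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean squared error between original and reconstructed data."""
    return float(np.mean((x - x_hat) ** 2))

def mae(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean absolute error between original and reconstructed data."""
    return float(np.mean(np.abs(x - x_hat)))
```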
As shown in Table 1, CoTD-VAE outperforms LSTM-AE, β-VAE, and CVAE on both MSE and MAE. An interesting observation is that Transformer-AE exhibits significantly lower MSE and MAE values than the other models, which may suggest overfitting. Despite its notably low reconstruction error, its generalization ability was further assessed through the downstream prediction tasks.
To qualitatively assess the reconstruction fidelity of CoTD-VAE, we have designed Figure 3 to present a focused, narrative-driven analysis. Channel 4 serves as an example of high-fidelity reconstruction. Here, the reconstructed signal (red curve) closely tracks the original signal’s (blue curve) primary peaks and troughs, demonstrating the model’s capability to preserve core signal dynamics when the underlying pattern is clear. In contrast, Channel 5 highlights a more sophisticated capability: event capture. The model successfully identifies and reconstructs the significant, sharp dip around the t = 60 mark, proving that it can distinguish meaningful transient events from random noise. Finally, Channel 1 offers a transparent view of the model’s limitations and inductive biases. Faced with high-frequency oscillations, the model produces a significantly smoothed reconstruction. This behavior, a direct consequence of our Trend Smoothness Loss, shows the model’s tendency to function as a denoising filter, which is beneficial for trend extraction but comes at the cost of losing fine-grained texture. This balanced, qualitative evidence provides a much more nuanced and convincing assessment of CoTD-VAE’s capabilities than a simple visual inspection of all channels would allow.

5.2. Classification Task

Accuracy, macro-averaged precision, macro-averaged recall and macro-averaged F1 score are used to evaluate the classification performance of the model. Accuracy is defined as follows:
$$\text{Accuracy} = \frac{\sum_{i=1}^{k} TP_i}{\sum_{i=1}^{k} (TP_i + FP_i)}$$
Accuracy is the proportion of activities that were correctly identified, where $TP_i$ is the number of true positive instances in category $i$, $FP_i$ is the number of false positive instances in category $i$, and $k$ is the total number of categories. In activity recognition, high accuracy means the model can reliably distinguish between activity types. Macro-averaged precision, macro-averaged recall, and the macro-averaged F1 score are defined as follows:
$$\text{Precision}_{\text{macro}} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i}$$
$$\text{Recall}_{\text{macro}} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FN_i}$$
$$F1_{\text{macro}} = \frac{1}{k} \sum_{i=1}^{k} \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$
Precision measures how reliably the model assigns samples to a particular class, where $TP_i$ and $FP_i$ denote the numbers of true positive and false positive cases in class $i$, respectively. Recall measures the model’s ability to recognize all samples of a given category, where $FN_i$ denotes the number of false negative cases in category $i$. Macro-averaging weights all activity categories equally and is therefore unaffected by class imbalance. The F1 score is the harmonic mean of precision and recall, where $\text{Precision}_i$ and $\text{Recall}_i$ denote the precision and recall of category $i$, respectively.
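In practice, these macro-averaged quantities can be computed with scikit-learn, as in the short sketch below with illustrative labels.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]  # illustrative ground-truth activity labels
y_pred = [0, 1, 2, 1, 1, 0]  # illustrative predictions

acc = accuracy_score(y_true, y_pred)
prec_macro, rec_macro, f1_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```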
Table 2 presents a comprehensive comparison of the overall classification performance across all evaluated models. Notably, CoTD-VAE outperformed all baseline methods on all classification metrics reported in Table 2, demonstrating its robust capability on the classification task.
Table 3 lists the F1 Score, Precision, and Recall for different models across each activity category. Variations in performance between models and activity categories can be observed. CoTD-VAE consistently shows strong performance, ranking in the top two in most activity categories. CVAE performs exceptionally well in SITTING and STANDING, but shows significant weaknesses in other categories such as WALKING DOWNSTAIRS. LSTM-AE also performs prominently in LAYING and WALKING DOWNSTAIRS. LAYING appears to be easier for most models to predict, achieving very high or perfect metric scores.
To further validate our model’s effectiveness in a real-world clinical setting, we conducted a comprehensive evaluation on the MIMIC-IV dataset for the task of predicting thrombus events. Given the significant class imbalance inherent in clinical data, where thrombus events are rare, we focused on metrics that provide a more nuanced view than accuracy alone, such as F1-Score and the Area Under the Receiver Operating Characteristic Curve (AUC).
Table 4 presents the detailed classification results. While all baseline models performed strongly, CoTD-VAE achieved the best results, with the highest F1-Score (0.9939) and a perfect AUC (1.0000), showcasing its superior predictive power. Notably, CoTD-VAE surpasses the standard end-to-end LSTM classifier (F1-Score 0.9939 vs. 0.8968). This result suggests that, for this task, the structured disentanglement of static, trend, and event components in our model provides a more effective and powerful representation than the single, mixed hidden state learned by a standard recurrent neural network. Although tree-based models like Random Forest are highly efficient on this dataset, the unique advantage of CoTD-VAE lies in its generative and interpretable framework, which offers deeper insights beyond a simple classification output.
Figure 4 presents a detailed performance comparison on the MIMIC-IV thrombus prediction task. The bar charts display the F1-Score and AUC for each model, providing a clear quantitative comparison of their predictive performance. As illustrated, CoTD-VAE demonstrates the best overall performance in this task.

5.3. Ablation and Sensitivity Analysis

5.3.1. Ablation Study

As shown in Table 5, removing either the trend smoothness loss or the event sparsity loss leads to a decrease in the model’s performance on the classification task. Removing the trend smoothness loss resulted in a 3.62 percentage point decrease in accuracy, a 5.04 percentage point decrease in F1 score, and a 0.6 percentage point decrease in AUC. Removing the event sparsity loss resulted in a 4.59 percentage point decrease in accuracy, a 6.04 percentage point decrease in F1 score, and a 1.3 percentage point decrease in AUC.
These results indicate that both temporal consistency constraints we introduced contribute to improving the model’s performance, with the event sparsity constraint contributing more significantly.

5.3.2. Sensitivity Analysis

We conducted a sensitivity analysis of the key hyperparameters of CoTD-VAE. We established a strong baseline configuration (e.g., learning_rate = 5 × 10−4, α = 0.5, all β values set to 0.5) and then varied each parameter individually while keeping the others fixed. The analysis revealed several key insights into the model’s inner workings.
As shown in Figure 5, the model’s performance is notably sensitive to the KL divergence weights ( β ). A critical finding is that applying stronger regularization to the dynamic components ( β trend and β event ) while using weaker regularization for the static component consistently yields superior results. For example, increasing β trend from 0.1 to 1.0 improved the F1-score from 0.9827 to 0.9947. Conversely, increasing β static over the same range caused a significant drop in the F1-score, from 0.9921 to 0.9659. This outcome strongly supports our core hypothesis: an effective model should learn generalized, abstract representations for dynamic patterns (trend and event) while preserving the rich, specific information inherent in a patient’s static baseline.
Concurrently, the model exhibited considerable robustness to the temporal constraint weights. For instance, varying the trend smoothness weight α from 0.1 to 1.0, and the event sparsity weights γ c and γ p across their tested ranges, resulted in only minor fluctuations in performance, with the F1-score remaining consistently high (above 0.99). This suggests that our proposed temporal constraints are beneficial and effective without requiring meticulous fine-tuning, which enhances the practical applicability and reliability of our model. Full results of the sensitivity analysis are provided in Appendix A.

5.4. Interpretability Analysis

A central premise of our work is that CoTD-VAE enhances interpretability by disentangling time series into meaningful, human-understandable components. To substantiate this claim, we performed a series of qualitative analyses on both the MIMIC-IV and UCI HAR datasets, illustrating the practical value and internal logic of our disentanglement methodology.

5.4.1. Clinical Decision Insight from MIMIC-IV

Our evaluation on the MIMIC-IV thrombosis task aimed to reveal how the framework’s disentangled representations can illuminate the basis of clinical predictions.
We began by visualizing the latent spaces via t-SNE, which provides a window into the model’s internal reasoning. As presented in Figure 6, each point represents a patient sample, colored by its ground truth label (‘Thrombus’ or ‘No Thrombus’). While considerable overlap is expected given the complexity of clinical data, the visualization nonetheless reveals a critical insight. In the Event latent space, specifically designed to capture acute occurrences, ‘Thrombus’ cases form several dense clusters, distinct from the general ‘No Thrombus’ population.
This non-random clustering is the cornerstone of the model’s interpretability. It demonstrates that our model has autonomously learned a clinically relevant rule; specific patterns of acute events are key discriminators for thrombosis. In other words, the model does not merely provide a prediction, but also reveals that its judgment heavily relies on whether a patient exhibits certain “high-risk” event signatures. This data-driven reasoning aligns closely with clinical intuition—where physicians pay close attention to acute changes in a patient’s condition—thereby enhancing the trustworthiness of the model’s predictions.
Then, to quantify the contribution of each disentangled component, we analyzed the feature importance derived from the downstream classifier. The results, shown in Figure 7, confirm our qualitative findings. Features from the Event and Trend spaces dominate the importance ranking, possessing the most significant predictive power. The prominence of features like Event_3 and Trend_15 indicates that the model’s risk assessment is driven by a combination of acute, transient signals and sustained physiological trajectories. This synergy between event detection and trend analysis mirrors established clinical practice, making the model’s decision-making process transparent and verifiable.

5.4.2. Representation Validation on UCI HAR

An analysis on the UCI HAR dataset further validated the semantic integrity of the disentangled components.
As shown in Figure 8, each latent space captures distinct aspects of the data. In the Static Latent Space, dynamic activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS) are largely separated from static activities (SITTING, STANDING, LAYING), though some overlap exists. The Trend Latent Space shows clearer separation between activity groups, suggesting it effectively captures the overall motion patterns. The Event Latent Space also demonstrates good clustering, particularly in separating LAYING from all other activities. This visualization confirms that the model learns representations that reflect the inherent structure of the activities.
To provide quantitative evidence of the learned features’ utility, we analyzed their importance for the downstream classification task. Figure 9 illustrates the feature importance ranking from the Random Forest classifier. The results show that both ‘Event’ and ‘Trend’ features are highly influential. This indicates that the classifier leverages both the gradual, overall patterns (Trend) and the sparse, specific moments (Event) to distinguish between activities. This analysis confirms that the disentanglement is not merely a structural exercise; the separated components are semantically meaningful and directly contribute to the model’s predictive power.

6. Discussion

Our experiments show that the proposed CoTD-VAE outperforms the baselines. The model effectively separates static features, long-term trends, and sudden events in human activity data, and the added temporal constraints further improve its ability to capture dynamics over time.
CoTD-VAE outperforms other VAE variants (such as β-VAE and CVAE) at classification while maintaining strong reconstruction quality. This indicates that the disentangled latent representations and temporal consistency constraints are important for capturing the essential characteristics of activity data.
A deeper analysis of the class-specific performance presented in Table 3 reveals significant insights into the architectural biases and capabilities of the evaluated models. Firstly, the near-perfect scores for the LAYING activity across all models suggest that its feature representation is highly distinct and stable. The low signal variance and clear separation from dynamic activities make it an easy class to identify, regardless of the modeling approach. Secondly, the challenge lies in distinguishing between activities with high similarity, such as the two static states SITTING and STANDING, and the three dynamic states (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS). Here, the architectural differences become apparent. The Conditional VAE (CVAE) demonstrated exceptional performance in separating SITTING from STANDING. We attribute this to its conditional design, which explicitly uses class labels to learn highly specialized feature distributions for each activity. This makes it powerful for discriminating between classes based on subtle, static differences. However, this same model struggled significantly with WALKING_DOWNSTAIRS, suggesting that its label-conditioned generation may not be robust enough to capture the complex and transient signal patterns of dynamic movements.
In contrast, our proposed CoTD-VAE, designed for temporal disentanglement, showed strong and more balanced performance across dynamic activities. We posit that its ability to separate a signal into a smooth trend component and a sparse event component is crucial. The trend representation captures the overall rhythm of walking, while the event representation effectively isolates the abrupt, high-frequency signals corresponding to individual steps, which are particularly pronounced during stair descent.
Nevertheless, this detailed analysis also highlights a potential limitation of CoTD-VAE. Its slightly lower performance compared to CVAE on the SITTING vs. STANDING task suggests that while our model excels at understanding temporal dynamics, its unsupervised disentanglement might be less effective than supervised, label-conditioned methods for resolving fine-grained differences between very similar static states. This presents a clear direction for future work: exploring hybrid models that combine the benefits of temporal disentanglement with label-conditional learning to achieve both dynamic robustness and high-fidelity static-state separation.
A key limitation of our experimental framework is the two-stage model selection process. By selecting the optimal autoencoder based solely on reconstruction performance, we may have inadvertently favored models that excel at data compression rather than learning the most discriminative features for activity recognition. This likely explains the performance discrepancy of the Transformer-AE model, which achieved the lowest reconstruction error but did not yield the best classification results. Future work should employ an end-to-end hyperparameter optimization strategy guided directly by the downstream task performance to ensure a more holistic model evaluation.

7. Conclusions

CoTD-VAE effectively improves the model’s performance on reconstruction and classification tasks by disentangling time series data into latent factors such as static features, long-term trends, and abrupt events, and by introducing temporal consistency constraints such as trend smoothness and event sparsity. The CoTD-VAE’s disentanglement mechanism enables the model to extract different types of information from the time series. This meaningful decomposition enhances the model’s interpretability, allowing us to gain insight into the contribution of different latent factors to classification and providing more valuable features for downstream tasks. This is expected to significantly improve the accuracy and clinical utility of classification in medical time series data. Future work can explore the following directions:
  • Further optimization of model architecture: Exploring more advanced sequence modeling architectures (e.g., more complex attention mechanisms) or different disentanglement methods to further enhance the separability and expressiveness of latent representations. Also, investigating how to adaptively determine the dimensions of each latent space and the weights of the regularization terms (e.g., the β and λ parameters) instead of using fixed hyperparameter settings.
  • More fine-grained latent factor analysis and clinical association: Conducting more in-depth analysis of the disentangled static, trend, and event latent spaces, for example, by using clustering, visualization, or other statistical methods, to identify clinically meaningful subgroups or patterns. Crucially, future work will involve close collaboration with clinical experts to validate the specific interpretations of these learned latent factors and to assess their association strength with specific disease states or risk events.
  • Application to wider medical time series tasks: Applying the CoTD-VAE model to other types of medical time series data (e.g., physiological waveforms, continuous glucose monitoring data, etc.), as well as different clinical tasks, such as early disease diagnosis, disease progression prediction, treatment response assessment, or patient phenotyping.
  • Enhancing model generalization ability and transferability: Investigating how to improve the generalization ability of trained models across different hospitals or patient populations. Exploring federated learning or transfer learning techniques in order to utilize data from multiple sources while protecting data privacy, aiming to train more robust models.
  • Integration with causal inference: Exploring the combination of disentangled representations with causal inference methods to better understand the causal relationships between different latent factors and how they jointly influence patient risk outcomes. This will help reveal the underlying disease mechanisms and provide guidance for clinical interventions.
These future research directions will further advance the development of medical time series analysis and risk prediction techniques based on deep generative models and disentangled representation learning.

Author Contributions

Conceptualization, L.H.; methodology, L.H. and Q.C.; software, L.H.; validation, L.H.; formal analysis, L.H.; investigation, L.H.; resources, Q.C.; data curation, L.H.; writing—original draft preparation, L.H.; writing—review and editing, Q.C.; visualization, L.H.; supervision, Q.C.; project administration, Q.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Specific Research Project of Guangxi for Research Bases and Talents (GuiKe AD24010011) and the Key Research & Development Program Project of Guangxi (GuiKe AB25069095).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The MIMIC-IV dataset can be accessed via the PhysioNet platform, and the UCI HAR dataset is available from the UCI Machine Learning Repository. The source code for CoTD-VAE is openly available in GitHub at https://github.com/HL-DataMining/CoTD-VAE (accessed on 8 July 2025).

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

Appendix A. Sensitivity Analysis Results

We started with a baseline configuration and varied one parameter at a time, while keeping the others fixed. The baseline configuration was set as learning_rate = 5 × 10−4, α = 0.5 , β static = 0.5 , β trend = 0.5 , β event = 0.5 , γ c = 0.2 , and γ p = 0.2 . The performance was evaluated on the test set using five key metrics: Accuracy, Precision, Recall, F1-Score, and AUC. The detailed results are presented in Table A1.
Table A1. Sensitivity analysis of CoTD-VAE hyperparameters on the MIMIC-IV thrombus prediction task.
Parameter Tested | Value | Accuracy | Precision | Recall | F1-Score | AUC
Baseline Config | – | 0.9981 | 1.0000 | 0.9878 | 0.9939 | 1.0000
learning_rate | 1 × 10−3 | 0.9980 | 1.0000 | 0.9875 | 0.9937 | 1.0000
learning_rate | 1 × 10−4 | 0.9969 | 0.9997 | 0.9804 | 0.9899 | 1.0000
α | 0.1 | 0.9976 | 1.0000 | 0.9849 | 0.9924 | 1.0000
α | 1.0 | 0.9985 | 1.0000 | 0.9904 | 0.9952 | 1.0000
γ_c | 0.1 | 0.9975 | 1.0000 | 0.9842 | 0.9921 | 1.0000
γ_c | 0.5 | 0.9978 | 1.0000 | 0.9862 | 0.9930 | 1.0000
γ_p | 0.1 | 0.9980 | 1.0000 | 0.9871 | 0.9935 | 1.0000
γ_p | 0.5 | 0.9976 | 1.0000 | 0.9846 | 0.9922 | 1.0000
β_static | 0.1 | 0.9975 | 1.0000 | 0.9842 | 0.9921 | 1.0000
β_static | 1.0 | 0.9896 | 1.0000 | 0.9341 | 0.9659 | 1.0000
β_trend | 0.1 | 0.9946 | 1.0000 | 0.9659 | 0.9827 | 1.0000
β_trend | 1.0 | 0.9983 | 1.0000 | 0.9894 | 0.9947 | 1.0000
β_event | 0.1 | 0.9943 | 1.0000 | 0.9637 | 0.9815 | 0.9999
β_event | 1.0 | 0.9978 | 1.0000 | 0.9859 | 0.9929 | 1.0000

References

1. Sakib, M.; Mustajab, S.; Alam, M. Ensemble deep learning techniques for time series analysis: A comprehensive review, applications, open issues, challenges, and future directions. Clust. Comput. 2025, 28, 1–44.
2. Kumaragurubaran, T.; Senthil Pandi, S.; Vijay Raj, S.R.; Vigneshwaran, R. Real-time Patient Response Forecasting in ICU: A Robust Model Driven by LSTM and Advanced Data Processing Approaches. In Proceedings of the 2024 2nd International Conference on Networking and Communications (ICNWC), Chennai, India, 2–4 April 2024; pp. 1–6.
3. Aminorroaya, A.; Dhingra, L.; Zhou, X.; Camargos, A.P.; Khera, R. A Novel Sentence Transformer Natural Language Processing Approach for Pragmatic Evaluation of Medication Costs in Patients with Type 2 Diabetes in Electronic Health Records. J. Am. Coll. Cardiol. 2025, 85, 407.
4. Patil, S.A.; Paithane, A.N. Advanced stress detection with optimized feature selection and hybrid neural networks. Int. J. Electr. Comput. Eng. (IJECE) 2025, 15, 1647–1655.
5. Xie, F.; Yuan, H.; Ning, Y.; Ong, M.E.H.; Feng, M.; Hsu, W.; Chakraborty, B.; Liu, N. Deep learning for temporal data representation in electronic health records: A systematic review of challenges and methodologies. J. Biomed. Inform. 2022, 126, 103980.
6. Lan, W.; Liao, H.; Chen, Q.; Zhu, L.; Pan, Y.; Chen, Y.-P. DeepKEGG: A multi-omics data integration framework with biological insights for cancer recurrence prediction and biomarker discovery. Briefings Bioinform. 2024, 25, bbae185.
7. Li, S.; Chen, Q.; Liu, Z.; Pan, S.; Zhang, S. Bi-SGTAR: A simple yet efficient model for circRNA-disease association prediction based on known association pair only. Knowl.-Based Syst. 2024, 291, 111622.
8. Li, Y.; Lu, X.; Wang, Y.; Dou, D. Generative time series forecasting with diffusion, denoise, and disentanglement. Adv. Neural Inf. Process. Syst. 2022, 35, 23009–23022.
9. Neloy, A.A.; Turgeon, M. A comprehensive study of auto-encoders for anomaly detection: Efficiency and trade-offs. Mach. Learn. Appl. 2024, 17, 100572.
10. Asesh, A. Variational Autoencoder Frameworks in Generative AI Model. In Proceedings of the 2023 24th International Arab Conference on Information Technology (ACIT), Ajman, United Arab Emirates, 6–8 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
11. Wang, X.; Chen, H.; Tang, S.; Wu, Z.; Zhu, W. Disentangled representation learning. IEEE Trans. Pattern Anal. Mach. Intell. 2024; arXiv:2211.11695.
12. Liang, S.; Pan, Z.; Liu, W.; Yin, J.; De Rijke, M. A survey on variational autoencoders in recommender systems. ACM Comput. Surv. 2024, 56, 1–40.
13. Hyland, S.L.; Faltys, M.; Hüser, M.; Lyu, X.; Gumbsch, T.; Esteban, C.; Bock, C.; Horn, M.; Moor, M.; Rieck, B.; et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat. Med. 2020, 26, 364–373.
14. McGinn, T.G.; Guyatt, G.H.; Wyer, P.C.; Naylor, C.D.; Stiell, I.G.; Richardson, W.S.; Evidence-Based Medicine Working Group. Users’ guides to the medical literature: XXII: How to use articles about clinical decision rules. JAMA 2000, 284, 79–84.
15. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2015.
16. Shakandli, M.M. State Space Models in Medical Time Series. Ph.D. Thesis, University of Sheffield, Sheffield, UK, 2018.
17. Morid, M.A.; Sheng, O.R.L.; Dunbar, J. Time series prediction using deep learning methods in healthcare. ACM Trans. Manag. Inf. Syst. 2023, 14, 1–29.
18. Rajwal, S.; Aggarwal, S. Convolutional neural network-based EEG signal analysis: A systematic review. Arch. Comput. Methods Eng. 2023, 30, 3585–3615.
19. Renc, P.; Jia, Y.; Samir, A.E.; Was, J.; Li, Q.; Bates, D.W.; Sitek, A. Zero shot health trajectory prediction using transformer. NPJ Digit. Med. 2024, 7, 256.
20. Lentzen, M.; Linden, T.; Veeranki, S.; Madan, S.; Kramer, D.; Leodolter, W.; Fröhlich, H. A transformer-based model trained on large scale claims data for prediction of severe COVID-19 disease progression. IEEE J. Biomed. Health Inform. 2023, 27, 4548–4558.
21. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.P.; Glorot, X.; Botvinick, M.M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
22. Kim, H.; Mnih, A. Disentangling by Factorising. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
23. Kumar, A.; Sattigeri, P.; Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. arXiv 2017, arXiv:1711.00848.
24. Dupont, E. Learning disentangled joint continuous and discrete representations. Adv. Neural Inf. Process. Syst. 2018, 31.
25. Kim, M.; Wang, Y.; Sahu, P.; Pavlovic, V. Relevance factor VAE: Learning and identifying disentangled factors. arXiv 2019, arXiv:1902.01568.
26. Liu, Z.; Li, M.; Han, C.; Tang, S.; Guo, T. STDNet: Rethinking disentanglement learning with information theory. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 10407–10421.
27. Liu, X.; Sanchez, P.; Thermos, S.; O’Neil, A.Q.; Tsaftaris, S.A. Learning disentangled representations in the imaging domain. Med. Image Anal. 2022, 80, 102516.
28. Cheng, J.; Gao, M.; Liu, J.; Yue, H.; Kuang, H.; Liu, J.; Wang, J. Multimodal disentangled variational autoencoder with game theoretic interpretability for glioma grading. IEEE J. Biomed. Health Inform. 2021, 26, 673–684.
29. Yu, H.; Welch, J.D. MichiGAN: Sampling from disentangled representations of single-cell data using generative adversarial networks. Genome Biol. 2021, 22, 158.
30. Qiu, Y.L.; Zheng, H.; Gevaert, O. Genomic data imputation with variational auto-encoders. GigaScience 2020, 9, giaa082.
31. Lim, M.H.; Cho, Y.M.; Kim, S. Multi-task disentangled autoencoder for time-series data in glucose dynamics. IEEE J. Biomed. Health Inform. 2022, 26, 4702–4713.
32. Hahn, T.V.; Mechefske, C.K. Self-supervised learning for tool wear monitoring with a disentangled-variational-autoencoder. Int. J. Hydromechatronics 2021, 4, 69–98.
33. Wu, S.; Haque, K.I.; Yumak, Z. ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE. In Proceedings of the 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games, Arlington, VA, USA, 21–23 November 2024.
34. Wang, Z.; Xu, X.; Zhang, W.; Trajcevski, G.; Zhong, T.; Zhou, F. Learning latent seasonal-trend representations for time series forecasting. Adv. Neural Inf. Process. Syst. 2022, 35, 38775–38787.
35. Liu, X.; Zhang, Q. Combining Seasonal and Trend Decomposition Using LOESS with a Gated Recurrent Unit for Climate Time Series Forecasting. IEEE Access 2024, 12, 85275–85290.
36. Kim, J.-Y.; Cho, S.-B. Explainable prediction of electric energy demand using a deep autoencoder with interpretable latent space. Expert Syst. Appl. 2021, 186, 115842.
37. Klopries, H.; Schwung, A. ITF-VAE: Variational Auto-Encoder using interpretable continuous time series features. IEEE Trans. Artif. Intell. 2025.
38. Staffini, A.; Svensson, T.; Chung, U.; Svensson, A.K. A disentangled VAE-BiLSTM model for heart rate anomaly detection. Bioengineering 2023, 10, 683.
39. Buch, R.; Grimm, S.; Korn, R.; Richert, I. Estimating the value-at-risk by Temporal VAE. Risks 2023, 11, 79.
40. Kapsecker, M.; Möller, M.C.; Jonas, S.M. Disentangled representational learning for anomaly detection in single-lead electrocardiogram signals using variational autoencoder. Comput. Biol. Med. 2025, 184, 109422.
41. Li, Y.; Chen, Z.; Zha, D.; Du, M.; Ni, J.; Zhang, D.; Chen, H.; Hu, X. Towards learning disentangled representations for time series. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 3270–3278.
42. Pinheiro Cinelli, L.; Araújo Marins, M.; Barros da Silva, E.A.; Lima Netto, S. Variational Autoencoder. In Variational Methods for Machine Learning with Applications to Deep Networks; Springer International Publishing: Cham, Switzerland, 2021; pp. 111–149.
43. Fortuin, V.; Baranchuk, D.; Rätsch, G.; Mandt, S. GP-VAE: Deep probabilistic time series imputation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 26–28 August 2020; PMLR: Birmingham, UK, 2020; pp. 1651–1661.
44. Hsu, W.-N.; Zhang, Y.; Glass, J. Unsupervised learning of disentangled and interpretable representations from sequential data. Adv. Neural Inf. Process. Syst. 2017, 30.
45. Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning, PMLR, Beijing, China, 21–26 June 2014; Volume 32, pp. 1278–1286.
46. Zhao, Y.; Zhao, W.; Boney, R.; Kannala, J.; Pajarinen, J. Simplified temporal consistency reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 42227–42246.
47. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288.
48. Hoyer, P.O. Non-negative sparse coding. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, Martigny, Switzerland, 6 September 2002; IEEE: Piscataway, NJ, USA, 2002; pp. 557–565.
49. Hu, W.; Yang, Y.; Cheng, Z.; Yang, C.; Ren, X. Time-series event prediction with evolutionary state graph. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual, 8–12 March 2021; pp. 580–588.
50. Giannakopoulos, T.; Pikrakis, A. Introduction to Audio Analysis: A MATLAB® Approach; Academic Press: Cambridge, MA, USA, 2014.
51. Lan, W.; Li, C.; Chen, Q.; Yu, N.; Pan, Y.; Zheng, Y.; Chen, Y.-P. LGCDA: Predicting CircRNA-Disease Association Based on Fusion of Local and Global Features. IEEE/ACM Trans. Comput. Biol. Bioinform. 2024, 21, 1413–1422.
52. Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In Ambient Assisted Living and Home Care: 4th International Workshop, IWAAL 2012, Vitoria-Gasteiz, Spain, 3–5 December 2012; Proceedings 4; Springer: Berlin/Heidelberg, Germany, 2012; pp. 216–223.
53. Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Moody, B.; Gow, B.; Lehman, L.-W.H.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1.
54. Xie, F.; Xiao, F.; Tang, X.; Luo, Y.; Shen, H.; Shi, Z. Degradation State Assessment of IGBT Module Based on Interpretable LSTM-AE Modeling Under Changing Working Conditions. IEEE J. Emerg. Sel. Top. Power Electron. 2024, 12, 5544–5557.
55. Madhukar, S.R.; Singh, K.; Kanniyappan, S.P.; Krishnan, T.; Sarode, G.C.; Suganthi, D. Towards Efficient Energy Management of Smart Buildings: A LSTM-AE Based Model. In Proceedings of the 2024 International Conference on Electronics, Computing, Communication and Control Technology (ICECCC), Bengaluru, India, 2–3 May 2024; pp. 1–6.
56. Han, Z.; Tian, H.; Han, X.; Wu, J.; Zhang, W.; Li, C.; Qiu, L.; Duan, X.; Tian, W. A Respiratory Motion Prediction Method Based on LSTM-AE with Attention Mechanism for Spine Surgery. Cyborg Bionic Syst. 2023, 5, 0063.
57. Prabhakar, C.; Li, H.; Yang, J.; Shit, S.; Wiestler, B.; Menze, B.H. ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised Medical Image Representations. In Proceedings of the International Conference on Medical Imaging with Deep Learning, Nashville, TN, USA, 10–12 July 2023.
58. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
59. Sun, W.; Xiong, W.; Chen, H.; Chiplunkar, R.; Huang, B. A Novel CVAE-Based Sequential Monte Carlo Framework for Dynamic Soft Sensor Applications. IEEE Trans. Ind. Inform. 2024, 20, 3789–3800.
60. Kim, J.; Kong, J.; Son, J. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. arXiv 2021, arXiv:2106.06103.
Figure 1. Basic Variational Autoencoder model, consisting of an encoder and a decoder. The encoder maps the input data to means and variances, and reparameterization is used to obtain latent variables. These latent variables are fed into the decoder to reconstruct the original data, and training minimizes the sum of reconstruction error and KL divergence.
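As a concrete reference for the mechanics in Figure 1, the following is a minimal PyTorch sketch of the reparameterization trick and the standard VAE objective (reconstruction error plus KL divergence). It illustrates the general method only; the layer sizes and hyperparameters of the paper's architecture are not assumed here.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable.
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction error plus analytic KL(q(z|x) || N(0, I)).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```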
Figure 2. CoTD-VAE is a VAE model tailored for medical time series data. It uses three encoders to disentangle the input data into latent variables Z_static, Z_trend, and Z_event. During optimization, it additionally minimizes a trend smoothness loss and an event sparsity loss.
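To make the two temporal constraints in Figure 2 concrete, one plausible formulation (a sketch consistent with the caption's description, not necessarily the paper's exact implementation) penalizes large first differences of the trend latent sequence and applies an L1 penalty to the event latent sequence:

```python
import torch

def trend_smoothness_loss(z_trend):
    # z_trend: (batch, time, dim). Penalizing step-to-step differences
    # encourages the trend representation to change gradually.
    diffs = z_trend[:, 1:, :] - z_trend[:, :-1, :]
    return diffs.pow(2).mean()

def event_sparsity_loss(z_event):
    # z_event: (batch, time, dim). An L1 penalty pushes the event
    # representation toward zero except at sparse acute events.
    return z_event.abs().mean()
```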
Figure 3. Qualitative analysis of CoTD-VAE’s reconstruction capabilities on selected illustrative channels. Comparison of original signals (blue curves) and signals reconstructed by the CoTD-VAE model (red curves) for a selected sample from the UCI HAR test set. Three illustrative channels were selected to demonstrate the model’s distinct capabilities and behaviors.
Figure 4. Performance comparison of different models on the MIMIC-IV thrombus prediction task.
Figure 5. Sensitivity analysis of the KL divergence weights (β) on model performance.
Figure 6. Class-level latent space visualization. t-SNE projections of the three disentangled latent spaces (Static, Trend, and Event) for thrombosis prediction. Each subplot shows one latent space; blue points indicate ‘No Thrombus’ cases and red points indicate ‘Thrombus’ cases, visualizing the distribution of the two classes within each latent space.
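Projections like those in Figure 6 can be reproduced with a standard t-SNE embedding of each latent space. The sketch below assumes the latent codes and binary labels have already been extracted as NumPy arrays; it is illustrative rather than the exact plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_tsne(latents, labels, title):
    # latents: (n_samples, latent_dim); labels: 0 = No Thrombus, 1 = Thrombus.
    emb = TSNE(n_components=2, random_state=0).fit_transform(latents)
    plt.scatter(emb[labels == 0, 0], emb[labels == 0, 1],
                c="blue", s=5, label="No Thrombus")
    plt.scatter(emb[labels == 1, 0], emb[labels == 1, 1],
                c="red", s=5, label="Thrombus")
    plt.title(title)
    plt.legend()
    plt.show()
```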
Figure 7. Feature importance ranking for the thrombosis prediction task. Feature names (e.g., Event_3) correspond to indices of specific event or trend features extracted during preprocessing. The analysis shows that Event- and Trend-type features are the most influential, suggesting the model prioritizes acute events and dynamic physiological trends when assessing risk.
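Rankings like those in Figure 7 follow the common recipe of fitting a Random Forest on the disentangled latent features and reading off its impurity-based importances. In this sketch the feature-naming scheme (Static_i, Trend_i, Event_i) mirrors the labels in the figure but is otherwise an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_latent_features(z_static, z_trend, z_event, y, top_k=20):
    # Concatenate the three latent blocks into one feature matrix.
    X = np.concatenate([z_static, z_trend, z_event], axis=1)
    names = ([f"Static_{i}" for i in range(z_static.shape[1])]
             + [f"Trend_{i}" for i in range(z_trend.shape[1])]
             + [f"Event_{i}" for i in range(z_event.shape[1])])
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1][:top_k]
    return [(names[i], float(rf.feature_importances_[i])) for i in order]
```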
Figure 8. t-SNE visualization of the Static, Trend, and Event latent spaces on the UCI HAR test set. CoTD-VAE demonstrates a clear ability to cluster activities based on their dynamic or static nature, supporting the interpretability of the disentangled representations.
Figure 9. Feature importance ranking of the disentangled latent variables for the classification task on the UCI HAR dataset. The plot shows the top 20 features as determined by the Random Forest classifier. Both event and trend features are shown to be highly predictive.
Table 1. Performance on the UCI dataset reconstruction task.

Model            MSE (×10⁻³)  MAE (×10⁻²)
LSTM-AE          3.311        3.837
Transformer-AE   0.438        1.589
β-VAE            6.616        5.145
CVAE             5.776        4.852
CoTD-VAE         3.267        3.422
Table 2. Aggregate performance of models on the classification task.

Model            Accuracy  Macro-Averaged F1  Macro-Averaged Precision  Macro-Averaged Recall
LSTM-AE          0.8850    0.8851             0.8888                    0.8862
Transformer-AE   0.8599    0.8586             0.8582                    0.8591
β-VAE            0.8677    0.8661             0.8670                    0.8666
CVAE             0.8717    0.8605             0.8627                    0.8618
CoTD-VAE         0.9026    0.9027             0.9030                    0.9027
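For clarity, the macro-averaged scores in Table 2 weight all six activity classes equally regardless of class frequency; with scikit-learn they can be computed as in this sketch (y_true and y_pred are assumed to be given):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def aggregate_metrics(y_true, y_pred):
    # Macro averaging computes each metric per class, then takes the
    # unweighted mean, so every activity class counts equally.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "macro_precision": precision_score(y_true, y_pred, average="macro"),
        "macro_recall": recall_score(y_true, y_pred, average="macro"),
    }
```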
Table 3. F1 Score, Precision, and Recall for each activity category.

Metric     Model            Walk   Walk Up  Walk Down  Sit    Stand  Lay
F1 Score   LSTM-AE          0.893  0.881    0.947      0.782  0.807  1.000
           Transformer-AE   0.831  0.836    0.882      0.791  0.813  0.999
           β-VAE            0.909  0.845    0.849      0.794  0.826  0.973
           CVAE             0.759  0.811    0.684      0.984  0.942  0.983
           CoTD-VAE         0.901  0.865    0.962      0.832  0.857  1.000
Precision  LSTM-AE          0.860  0.987    0.899      0.790  0.797  1.000
           Transformer-AE   0.834  0.832    0.874      0.793  0.819  0.998
           β-VAE            0.896  0.861    0.820      0.824  0.802  0.998
           CVAE             0.770  0.801    0.765      0.978  0.895  0.968
           CoTD-VAE         0.888  0.869    0.962      0.853  0.846  1.000
Recall     LSTM-AE          0.929  0.796    1.000      0.774  0.818  1.000
           Transformer-AE   0.829  0.841    0.890      0.788  0.806  1.000
           β-VAE            0.921  0.830    0.881      0.766  0.852  0.950
           CVAE             0.748  0.822    0.619      0.990  0.994  0.998
           CoTD-VAE         0.913  0.860    0.962      0.813  0.868  1.000
Activity abbreviations: Walk (Walking), Walk Up (Walking Upstairs), Walk Down (Walking Downstairs), Sit (Sitting), Stand (Standing), Lay (Laying).
Table 4. Performance comparison of different models on the MIMIC-IV thrombus prediction task.

Model                Accuracy  Precision  Recall  F1-Score  AUC
CoTD-VAE             0.9981    1.0000     0.9878  0.9939    1.0000
Random Forest        0.9863    0.9234     0.9961  0.9584    0.9995
LightGBM             0.9819    0.9008     0.9949  0.9455    0.9993
XGBoost              0.9773    0.8804     0.9910  0.9324    0.9989
LSTM                 0.9641    0.8214     0.9875  0.8968    0.9956
Logistic Regression  0.6002    0.2013     0.5162  0.2896    0.5983
Table 5. Ablation study results of the CoTD-VAE model on the classification task.

Model          Accuracy  F1 Score  AUC
CoTD-VAE       0.8707    0.6092    0.918
No Smoothness  0.8345    0.5588    0.912
No Sparsity    0.8248    0.5488    0.905
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
