1. Introduction
The digital twin (DT) concept, initially developed as a virtual representation of industrial systems for performing analyses that cannot be carried out on the physical systems themselves, has been adapted to healthcare to model organs, diseases, and medication responses. In healthcare, DTs have mostly been used to optimize hospital workflows, simulate drug–cell interactions, and create virtual physiological models. However, the application of digital twins to generate personalized future medical images for each patient has not been fully explored yet.
With advances in next-frame prediction models based on artificial neural networks, it is now possible to generate video sequences conditioned on specific inputs. We foresee that these techniques could be adapted to the medical images used to track disease progression by fusing electronic health record (EHR) conditions (structured clinical parameters such as laboratory test results, demographic characteristics, and medication information) with the images.
In our research, we examined existing computer-aided diagnosis (CAD) systems in the healthcare domain. We observed that most CAD systems lack the ability to temporally interpret sequences of medical images. These systems are typically designed to analyze individual medical scans—for instance, by detecting a lesion—without predicting future outcomes. Furthermore, only a few existing CAD systems attempt to integrate both medical images and electronic health records (EHRs), often leaving such multimodal interpretation to the expertise of clinicians. It is notable that, in the literature, there is a gap in generating future medical images based on a patient’s historical image data and EHRs with generative models, which we want to address.
These observations motivated us to develop a novel framework, representing a new algorithmic pipeline that overcomes the limitations of traditional CAD systems. This framework is designed to achieve the following key objectives:
Multimodal Fusion: Consolidate medical image and EHR data with a unified model.
Temporal Analysis: Model disease progression over time by capturing temporal dependencies in sequential medical images and EHRs.
Future Prediction: Develop methods to generate future medical images and forecast future EHR metrics, enabling predictive insight into the evolution of disease.
Our contribution is the development of a new model that predicts the disease-progression trajectory of individual patients by jointly using medical images and EHR data. We define this approach as a form of digital twin for the disease, and the rationale for this terminology comes from the two core principles underlying digital twin systems. First, a digital twin must be able to perform what-if analysis, generating individualized predictions for a specific person or object. Second, it must rely on inputs that are uniquely associated with that same person or object. Although many digital twin applications operate with real-time or continuously updated data, this is not a strict requirement. As long as a model satisfies these two fundamental conditions, it can be considered a digital twin.
In our case, the proposed model produces patient-specific forecasts for the future course of a particular disease. Moreover, the inputs used by the model—longitudinal EHR information and sequential medical images—encode the patient’s own long-term clinical history. Because the model learns from patient-specific trajectories, it can also reflect how changes in key clinical variables might influence the predicted future images. For example, if a medication dosage recorded in the EHR were adjusted, if periodically administered cognitive test scores changed, or if laboratory parameters such as blood glucose shifted, the model would be capable of projecting how these altered conditions might affect the resulting future medical images. In addition, incorporating long-term medical imaging history as an input enables the model to determine both the magnitude and the speed with which EHR-driven changes influence the predicted future images; in this sense, the patient’s longitudinal imaging record also functions as a personalized conditioning factor, similar to EHR variables. Thus, the system not only predicts the natural progression of the disease but also allows scenario-based exploration of potential outcomes under varying clinical conditions, further reinforcing its characterization as a digital twin.
The remainder of this paper is organized as follows:
Section 2 discusses the current state-of-the-art studies from the literature.
Section 3 introduces the proposed framework for image processing with generative models. The patient datasets and the test implementation using the single-cell time-lapse microscopy study are described in
Section 4. This dataset was chosen because its temporal structure resembles patient data: the sequential microfluidic channel images parallel the temporal progression of medical imaging, while the alternating glucose–lactose environment reflects the time-varying nature of EHR parameters. Thus, it provides a suitable preliminary testbed before applying the model to real patient datasets.
Section 5 discusses the experimental findings, the expected role of image registration in real medical datasets, and the implications of modality-specific alignment quality for longitudinal prediction. Finally,
Section 6 concludes the paper by summarizing the framework’s contribution to patient-specific digital twin modeling and outlining future validation efforts on longitudinal real-world clinical datasets.
2. Related Work
Disease classification/detection using deep learning models on static medical images is the most widely used approach in computer-aided diagnosis (CAD). In mammography, convolutional neural networks (CNNs) have been used for lesion segmentation [
1], microcalcification detection [
2], and breast density classification [
3]. In neuroimaging, structural MRI and PET images are commonly used in Alzheimer’s disease (AD) diagnosis using 3D CNNs and attention models [
4,
5]. While effective, these static-image-based approaches fail to account for the disease’s temporal progression, and they only offer classifications of disease severity rather than predictions of its future course.
In a study where patient bone age was estimated using patient X-ray images and gender information [
6], a vision transformer (ViT) model was used to evaluate medical images and gender information together. In this model, the gender information was added to the ViT model as a global token. The model thus serves as an example of incorporating EHR information alongside medical images. However, it lacks temporal interpretation and future prediction.
In another study [
7] that interprets 3D brain MRI scans spatio-temporally using ViViT (video vision transformer), an attempt was made to detect Alzheimer’s disease. In this study, the central 32 MRI slices (2D slices) are provided to the model as input. While spatial attention is applied to each MRI slice, the sequential slices are treated as if they represent consecutive time steps, and temporal attention is applied across them. ViViT has seen other uses in medical image analysis as well. However, as this study shows, there is no true temporal interpretation across images from different actual time points, no generation of future medical images, and no integration of electronic health records (EHRs) to help the model detect the disease or generate future images.
Temporal modeling strategies attempt to overcome static limitations. Temporal subtraction, which highlights changes between prior and current images, has been applied in mammography for enhanced microcalcification detection [
8]. Loizidou et al. [
9] showed performance gains using subtraction imaging in longitudinal studies. However, these models only predict the current state of the disease.
We also reviewed models that predict disease progression using regression, classification, or image-generation approaches. As examples of models that generate future medical images, we can refer to the following studies. In brain imaging, DaniNet [
10] introduced a generative adversarial network (GAN) conditioned on patient age and diagnosis (normal, MCI, AD) to simulate future brain MRI slices in AD patients. This model makes future predictions according to the current EHR (age, diagnosis) and medical image scan (MRI). However, it does not take into account prior image scans and EHRs. In a longitudinal MRI-to-MRI prediction study [
11], multiple deep architectures—UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet—were evaluated to generate a subject’s future brain MRI from baseline imaging. However, the approach is imaging-only and does not incorporate longitudinal EHR signals; consequently, it cannot model how changes in medications, laboratory results, or other EHR variables would alter the predicted future scan. In a longitudinal MRI modeling study [
12], a generative autoencoder-based framework with disentangled structural and longitudinal state representations—supported by a personalized memory module—was used to generate missing and future MR images through latent interpolation and extrapolation. However, the method is imaging-only and does not incorporate longitudinal EHR histories, which means it cannot simulate how changes in clinical measurements or other EHR variables would influence the predicted future scan. In a spatio-temporal tumor growth prediction study [
13], a 4D deep learning framework based on spatio-temporal ConvLSTM (ST-ConvLSTM) was employed to generate future tumor volumes, shapes, and intensity patterns from sequential volumetric imaging. Although the study mentions that clinical information can be incorporated, the implemented model relies almost exclusively on imaging data and does not integrate rich longitudinal EHR variables; therefore, it cannot perform patient-specific what-if analysis driven by changes in clinical parameters or laboratory values. As we did for these studies, for models that predict the future of the disease, we conducted a literature review according to the following criteria. These studies are listed in
Table 1.
The criteria were as follows:
Does the model use image data as input?
Does the model use EHR data as input?
Does the model use history data as input?
Does the model output a predicted image for the future of the disease?
Does the model output predicted EHR (classification/regression) for the future of the disease?
According to
Table 1, these temporal models also have limitations. Some incorporate a patient’s previous medical images but not their electronic health records (EHRs). Some models include both EHRs and medical images as input but fail to consider prior medical images and historical EHR data, which are essential for achieving a more patient-specific understanding. Moreover, even image-only temporal models do not always generate future scans; approximately half of the studies instead formulate future prediction as a classification or regression task.
In the proposed framework, both longitudinal medical images and corresponding EHR records are jointly used as model inputs. By learning from historical digital records across patients, the model captures shared disease dynamics while preserving the ability to condition predictions on individual patient trajectories. Patients with similar imaging and clinical histories contribute to the learning of comparable progression patterns, allowing the model to transition from a generalized representation of disease evolution toward patient-specific forecasting. Under this formulation, the trained model can be instantiated as a disease-specific digital twin for an individual patient when applied to that patient’s historical data.
3. Proposed Architecture
An overview of the proposed architecture is given in
Figure 1. This architecture is a conceptual representation of how the digital data of patients are collected and presented to DT models to predict future outcomes. The diagram illustrates a disease-specific, patient-centered digital twin pipeline that jointly leverages longitudinal medical images and time-aligned EHR data. First, the inputs are prepared through modality-aware preprocessing, including medical image preprocessing, image registration to align sequential scans, and EHR preprocessing to standardize clinical variables. The processed data are then routed into separate digital twins for each disease a patient may have (e.g., Alzheimer’s, breast cancer), emphasizing that a single patient can be associated with multiple disease-specific DT instances. Within each DT, generative modeling components forecast future medical images, while regression-based components predict future EHR values. Together, these outputs support patient-specific prognosis and enable scenario-based what-if analysis by simulating how changes in clinical conditions may influence future imaging and EHR trajectories. The main components of the proposed architecture are discussed in the remainder of this section.
Sequential image processing requires preprocessing that gives the ordered images a consistent structure. In the proposed architecture, we apply an image registration technique to sequential images before feeding them to the neural network model. The first step is to register all medical images so that each frame is aligned in angle, size, and deformation. Sequential medical images generally need to be registered. However, as noted above, this framework is not limited to patient medical image datasets and can be applied to other sequential imaging scenarios as well. If the sequential images exhibit minimal changes in orientation, viewpoint, or local tissue/texture deformation, as in some single-cell time-lapse microscopy settings, this registration preprocessing step can be omitted.
After that, we can use ConvLSTM [
20] to interpret sequential images, because ConvLSTM is more successful than classic ’CNN + LSTM’ models for image generation tasks. In
Figure 2, the ConvLSTM pipeline is displayed. First, an embedding layer is used for the EHRs. After that, there are two options for fusing the EHR embedding output with the image channels. One option is to add the EHR embedding outputs to the medical images as extra channels; the second is to generate new image channels that hold both the EHR and image information by using the spatio-temporal FiLM (ST-FiLM) algorithm. In this way, the EHRs play a role in generating new images. These new inputs are then processed by the ConvLSTM layer, and from the ConvLSTM output a medical image can be generated directly, or a cVAE-GAN structure can be used to interpret this output, which increases the quality of the generated image. At the same time, the ConvLSTM output allows for the prediction of EHR values, making the model suitable for joint image and EHR prediction.
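The first fusion option (appending the EHR embedding outputs to the images as extra channels) can be sketched in NumPy; the shapes and the 4-dimensional embedding are illustrative assumptions, not values from the paper:

```python
import numpy as np

def fuse_ehr_as_channels(frame, ehr_embedding):
    """frame: (C, H, W) image; ehr_embedding: (E,) vector.

    Returns a (C + E, H, W) tensor where each embedding dimension is
    tiled across the spatial grid as a constant-valued channel.
    """
    c, h, w = frame.shape
    e = ehr_embedding.shape[0]
    # Tile each embedding value into an H x W plane.
    ehr_planes = np.broadcast_to(ehr_embedding[:, None, None], (e, h, w))
    return np.concatenate([frame, ehr_planes], axis=0)

frame = np.random.rand(1, 64, 64)       # e.g., one grayscale frame
ehr = np.array([0.2, -1.3, 0.7, 0.0])   # e.g., a 4-dim EHR embedding (assumed)
fused = fuse_ehr_as_channels(frame, ehr)
print(fused.shape)  # (5, 64, 64)
```

Applied per time step, this yields the condition-augmented frame sequence that the ConvLSTM layer then consumes.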
However, instead of ConvLSTM, a ViViT (Video ViT) [
21] pipeline can be used for spatio-temporal interpretation of sequential images as shown in
Figure 3. ViViT is an extended version of the vision transformer (ViT). While ViT uses an attention mechanism to interpret spatial information in a single image, ViViT performs spatio-temporal interpretation of sequential images. In ViT/ViViT models, conditions can be added as global tokens. Accordingly, EHRs can be added to the ViViT model as ’global tokens’ per frame at each time step, so that they are evaluated as conditions. The output of the ViViT is a latent representation (latent space). To generate a medical image from this latent space, a cGAN, a cVAE, an upsampling model, or a diffusion model can be used.
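The per-frame global-token idea can be sketched as follows; the patch counts, embedding size, and the random projection standing in for a learned layer are all illustrative assumptions:

```python
import numpy as np

# Prepend a per-frame EHR "global token" to each frame's patch sequence
# before spatio-temporal attention, analogous to a [CLS] token.
rng = np.random.default_rng(0)
T, N, D = 4, 16, 32            # frames, patches per frame, embedding dim
patches = rng.normal(size=(T, N, D))

ehr = rng.normal(size=(T, 6))  # one 6-dim EHR vector per time step (assumed)
W = rng.normal(size=(6, D))    # stand-in for a learned projection to token dim

ehr_tokens = ehr @ W                                  # (T, D): one token per frame
tokens = np.concatenate([ehr_tokens[:, None, :], patches], axis=1)
print(tokens.shape)  # (4, 17, 32): each frame now has N + 1 tokens
```

The attention layers of the ViViT encoder would then operate on this extended token sequence, letting every patch attend to its frame’s EHR condition.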
We will discuss the steps suggested in the ConvLSTM/ViViT pipelines in more depth in the following subsections.
3.1. Image Registration
Image registration is the process of aligning two or more images of the same scene, object, or body parts acquired at different times, from different viewpoints, or by different imaging devices. The goal is to bring the images into a common coordinate system so that corresponding features (e.g., anatomical structures, objects) overlap accurately. This process involves designating one image as the reference image, also called the fixed image, and applying global geometric transformations or local displacements to the other image so that it aligns with the reference image [
22].
Global geometric transformations include shifting, rotation, scaling, or shearing. An example of global shifting and rotation is displayed in
Figure 4 on CT and MR images.
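As a minimal illustration of global registration, the sketch below estimates a pure translation between two frames via phase correlation; real tools (e.g., MATLAB’s imregister) additionally handle rotation, scaling, and interpolation, so this only recovers an integer shift:

```python
import numpy as np

def estimate_shift(fixed, moving):
    """Estimate the integer (dy, dx) translation between two images
    using FFT-based phase correlation."""
    F = np.fft.fft2(fixed)
    M = np.fft.fft2(moving)
    cross_power = F * np.conj(M)
    cross_power /= np.abs(cross_power) + 1e-12   # normalize to pure phase
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peaks in the wrap-around half to negative shifts.
    if dy > fixed.shape[0] // 2:
        dy -= fixed.shape[0]
    if dx > fixed.shape[1] // 2:
        dx -= fixed.shape[1]
    return int(dy), int(dx)

fixed = np.zeros((64, 64))
fixed[20:30, 20:30] = 1.0
moving = np.roll(fixed, shift=(5, -3), axis=(0, 1))  # shifted copy
print(estimate_shift(fixed, moving))  # (-5, 3): the shift that undoes the roll
```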
Medical image registration is the process of aligning multiple medical images, volumes, and surfaces to a common coordinate system. In medical imaging, it is often necessary to compare multiple scans of the same patient acquired during different sessions and under varying conditions. At this point, medical image registration is typically used as a preprocessing step to align the medical images to a common coordinate system before analysis [
24].
The brain MRI example above demonstrates a typical case of global geometric misalignment that can be corrected using standard rigid or affine transformations. However, medical imaging encompasses a wide range of modalities, many of which involve acquisition-specific distortions that cannot be resolved by global registration alone. To illustrate this broader challenge, we next present a mammography example, where compression-based shape deformation leads to local pixel displacements, making deformable (non-rigid) registration essential.
Specifically, in mammographic imaging, the breast is compressed or positioned at specific angles during acquisition, which introduces characteristic shape changes and local texture distortions (pixel displacements). These modality-specific deformations cannot be corrected using only global transformations.
To address this issue, deformable (non-rigid) registration algorithms are applied after global registration. In
Figure 5, old and new MLO mammograms of the same patient are displayed.
With the MATLAB R2025a image registration tool, we applied registration algorithms to these mammograms. We selected the old mammogram as the reference image and registered the new mammogram so that it overlaps with the old one.
As seen in
Figure 5, the breast tissue in the new mammogram is smaller than in the old one. Global geometric transformations can handle this size difference, but the breast tissue boundaries still do not match exactly. During mammogram acquisition, the breast is compressed and the positioning angles differ between the two mammograms, which causes shape distortions. Deformable (non-rigid) registration methods must therefore be applied. As seen in
Figure 6, after global geometric transformations, to overlap the new mammogram over the old mammogram, the demons [
25,
26] and B-spline FFD [
27] algorithms are applied.
Both of these algorithms are based on a displacement field that stores, for each pixel, the direction and magnitude of its displacement. They are classical (non-learning) methods that estimate the displacement field iteratively, with the iteration count set by an expert; they are not learning-based artificial neural network models.
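A displacement field of this kind can be applied as follows; the constant one-pixel field and nearest-neighbor sampling are simplifications for illustration (demons and B-spline FFD estimate smooth, spatially varying fields and use interpolation):

```python
import numpy as np

def warp(moving, dy, dx):
    """Warp a 2D image with a dense displacement field: each output
    pixel (y, x) samples the moving image at (y + dy, x + dx)."""
    h, w = moving.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    return moving[src_y, src_x]

img = np.zeros((8, 8))
img[4, 4] = 1.0
# A constant field that pulls every pixel from one row below.
dy = np.ones((8, 8))
dx = np.zeros((8, 8))
warped = warp(img, dy, dx)
print(np.argwhere(warped == 1.0))  # [[3 4]]: the bright pixel moved up one row
```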
There are other static algorithms too, but development in the area of artificial neural network models has revealed neural network registration models like VoxelMorph [
28]. Compared with classical static methods, this model registers images quickly, but it requires training before use. It is therefore well suited to scenarios where large datasets must be registered.
3.2. EHR Embedding
Electronic health records (EHRs) belonging to a patient can be static, such as gender, or can vary with each measurement, such as blood pressure, lab tests, blood metrics, medications, and age.
Both types can affect the generation of the predicted medical image. In our model, we can use both static and varying (dynamic) EHRs. However, because the varying EHRs are temporally linked to the sequential medical images, we need to embed them into the model separately for each sequential frame.
In the ConvLSTM pipeline, we can opt to embed all EHRs with an embedding neural network layer. After that, the output of the embedding layer can be added to frames as channels.
Alternatively, in the ConvLSTM pipeline, instead of adding the embedding output to frames as channels, we can use a popular recent method named FiLM (feature-wise linear modulation) [29], which is used for spatial condition embedding in image generation models. The FiLM feature-map modulation is formulated as in Equation (1) below:

F̂_c(x, y) = γ_c(z) · F_c(x, y) + β_c(z),   (1)

where F_c(x, y) denotes the activation of channel c at spatial position (x, y), and z represents an external condition such as a language or metadata embedding. The network learns channel-wise affine modulation parameters γ_c and β_c from the conditioning vector z and applies them uniformly across all spatial locations of each feature map.

We can extend this model spatio-temporally by adding a time (t) dimension. The spatio-temporal conditioning via feature-wise linear modulation (ST-FiLM) feature map can then be formulated as in Equation (2) below:

F̂_c(x, y, t) = γ_c(x, y, t) · F_c(x, y, t) + β_c(x, y, t),   (2)

where F_c(x, y, t) denotes the activation of channel c at spatial position (x, y) and time t. Here, γ_c(x, y, t) and β_c(x, y, t) are spatially and temporally varying modulation maps predicted by the FiLM generator from the per-time condition embedding z_t. To put it more clearly: ST-FiLM does not create additional channels; instead, it modulates the existing feature maps (image channels) using condition-dependent γ and β parameters. After this modulation, each image channel becomes condition-aware, blending both the image features and the spatio-temporal condition information. A representative application of ST-FiLM can be seen in this video development article [30].
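The two modulation schemes of Equations (1) and (2) can be sketched in NumPy; the fixed gamma/beta arrays below stand in for the outputs of a learned FiLM generator:

```python
import numpy as np

def film(features, gamma, beta):
    """FiLM: features (C, H, W); gamma, beta (C,), applied uniformly
    across all spatial locations of each channel."""
    return gamma[:, None, None] * features + beta[:, None, None]

def st_film(features, gamma, beta):
    """ST-FiLM: features, gamma, beta all (T, C, H, W); modulation
    varies per channel, per position, and per time step."""
    return gamma * features + beta

feats = np.ones((2, 4, 4))                 # C = 2 feature maps
out = film(feats, gamma=np.array([2.0, 0.5]), beta=np.array([1.0, 0.0]))
print(out[0, 0, 0], out[1, 0, 0])          # 3.0 0.5

seq = np.ones((3, 2, 4, 4))                # T = 3 time steps
g = np.full_like(seq, 2.0)                 # would come from z_t in practice
b = np.zeros_like(seq)
print(st_film(seq, g, b).mean())           # 2.0
```

Note that neither function changes the channel count: the existing feature maps are rescaled and shifted in place, which is exactly the "no additional channels" property described above.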
Additionally, in the ViViT pipeline, we can add EHRs to the model as ’global tokens’ per frame for each time step.
In standard ViT, there is a [CLS] token added to the image patch sequence. It gathers information via attention and is used for final classification. The CLS token is a trainable vector for learning the image context. Similarly, in ViViT, global tokens are added to the spatio-temporal image patch sequences to perform a similar summarization role but extended over time as well as space.
3.3. Spatio-Temporal Modeling
Both ConvLSTM and vision transformer (ViViT-style) encoders can be used to learn spatio-temporal features between sequential frames. These allow modeling anatomical trends and disease dynamics.
Standard vision transformers [
31] are used only to interpret spatial information in one image with an attention mechanism. The attention mechanism divides the images into small pieces (small areas) and converts them into patches, which are vectors. After that, relations between patches are calculated. This increases the spatial interpretation in the image. However, video ViTs (ViViT) [
21] are used to interpret spatio-temporal information across sequential frames. In some ViViT variants, spatial and temporal attention are calculated separately: spatial attention is first computed within each frame, and temporal attention is then computed across the frame sequence. In other variants, spatial and temporal attention are calculated jointly, with attention computed across all frames as if they formed a single input. The joint variant generally achieves better performance but requires more training.
We also want to mention that if the time between sequential images is not equal, both ConvLSTM and ViViT can be configured to accommodate varying time-lapse situations.
3.4. Image Generation and EHR Prediction
According to the modeling of the ConvLSTM pipeline, the ConvLSTM output can be directly used to generate a predicted medical image, or, to increase the quality of the output, we can use this output as latent space and feed this output into a cVAE+GAN [
32] model as input. As a result, the output of the cVAE+GAN stage is of higher quality. In
Figure 7, the structure of the cVAE+GAN model is explained.
cVAE+GAN resolves the weaknesses of the cVAE and cGAN models, which can be briefly summarized as follows:
Compared with cVAE, the sharpness of the output improves, and the texture similarity between the original image and the generated image increases.
Compared with cGAN, it can mitigate common training instabilities, such as mode collapse and gradient vanishing.
It preserves output diversity similar to cVAE while improving conditional controllability.
To predict EHRs from the output of the ConvLSTM layer, the output can be flattened and used as a latent representation; a fully connected or MLP layer can then predict the EHR values correlated with the generated medical image.
In the ViViT pipeline, the output of the ViViT encoder is a latent representation. This can be used for both medical image generation and EHR prediction, which are correlated with the generated image. By using this latent representation as input for cGAN, cVAE, upsampling, or diffusion models, a medical image can be generated. As in the ConvLSTM pipeline, interpreting the latent representation with a fully connected layer or an MLP layer, the model can predict EHRs too.
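The dual-output design above (one head for the image, one for the EHR values) can be sketched minimally in NumPy; the random matrices stand in for learned decoder/MLP weights, and all shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=(128,))       # flattened encoder output (assumed size)

# EHR head: latent -> a few predicted clinical values (regression).
W_ehr = rng.normal(size=(128, 3))
ehr_pred = latent @ W_ehr              # e.g., 3 forecast EHR metrics

# Image head: latent -> a coarse image, which the full pipeline would
# refine with a cVAE+GAN, upsampling, or diffusion decoder.
W_img = rng.normal(size=(128, 16 * 16))
img_pred = (latent @ W_img).reshape(16, 16)

print(ehr_pred.shape, img_pred.shape)  # (3,) (16, 16)
```

Because both heads read the same latent representation, the predicted EHR values and the generated image stay mutually consistent, which is the point of joint prediction.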
4. Datasets and Implementation
Accurate future image prediction in a patient-specific digital twin requires sequential medical images that are temporally aligned with longitudinal EHR data. Since real patient datasets with long-term paired imaging–EHR sequences are restricted, we first performed a preliminary evaluation on a publicly available single-cell time-lapse microscopy dataset [
33]. Although this dataset is not medical, it exhibits the same type of temporal processes needed for digital twin modeling: periodically acquired images that evolve over time and external conditions (glucose–lactose) that change across time steps. These changing nutrient conditions serve as temporal conditioning variables analogous to how a patient’s clinical measurements (e.g., medications, lab tests, cognitive scores) vary across clinical visits. Modifying the nutrient sequence also enables what-if exploration, similar to scenario simulation in patient-specific digital twins. Thus, this single-cell dataset functions as a controlled testbed validating the core modeling principles—learning long-term trajectories, integrating time-varying conditions, and forecasting future states—before applying our model to real patient medical images and EHR data. The medical datasets we aim to use are outlined in the section “Patient Medical Image and EHR Data Research” and will be the focus of our future work.
4.1. Implementing the ConvLSTM Model on Single-Cell Time-Lapse Microscopy Study
In this dataset, images were acquired using phase-contrast and wide-field epifluorescence microscopy on a microfluidic platform designed for long-term single-cell observation. Each microfluidic channel containing E. coli cells was imaged periodically at fixed time intervals, capturing both morphological (phase-contrast) and gene expression (fluorescence) information.
All images were stored as multi-frame TIFF stacks. We processed 156 TIFF stacks, exporting phase-contrast and GFP frames as PNG images at 30 min intervals. In parallel, we recorded the nutrient condition (glucose vs. lactose) between consecutive frames as a time-aligned conditioning variable. These condition labels were used to model how environmental changes influence future frames.
We used the ConvLSTM model explained in
Figure 2. We used the ST-FiLM algorithm for spatio-temporal condition embedding instead of adding the conditions to the images as channels.
We use the phase-contrast (PHC) and GFP images, together with the correlated nutritional environment, at time steps Tn, Tn+1, and Tn+2 as model inputs, and aim to predict the phase-contrast and GFP images at time step Tn+3. In this setting, the nutritional environment acts as a conditioning variable. The proposed model operates as a long-term prediction model. For comparison, we also implemented two short-term models. Model 1 corresponds to the proposed long-term ConvLSTM model, while Models 2 and 3 represent short-term prediction variants. The input configurations and prediction targets of all three models are summarized in
Table 2.
The nutrition conditions in the models must be interpreted as follows: if the input for time step Tn is glucose, it means that from the starting time until time Tn, the cells are in a glucose medium.
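The sample construction implied by the Model 1 layout (three past frames plus their nutrient conditions as input, the next frame as target) can be sketched as follows; the tiny arrays and labels are illustrative:

```python
import numpy as np

def make_samples(frames, conditions, history=3):
    """frames: (T, H, W) array; conditions: length-T list of labels.

    Returns (input_frames, input_conditions, target_frame) tuples
    built with a sliding window over the sequence.
    """
    samples = []
    for i in range(len(frames) - history):
        samples.append((frames[i:i + history],
                        conditions[i:i + history],
                        frames[i + history]))
    return samples

frames = np.arange(6 * 4).reshape(6, 2, 2)   # 6 tiny dummy frames
conds = ["glucose", "glucose", "lactose", "lactose", "glucose", "lactose"]
samples = make_samples(frames, conds)
print(len(samples))         # 3 training samples from a 6-frame sequence
x, c, y = samples[0]
print(x.shape, y.shape)     # (3, 2, 2) (2, 2)
```

In the real dataset, the same windowing is applied to the exported PHC/GFP frame pairs and their time-aligned glucose/lactose labels.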
Using the time-related images and the nutrition conditions, we generated training, validation, and test datasets comprising 64%, 16%, and 20% of the data, respectively.
We used the same test dataset for all models. The success of the models is measured by MSE (mean squared error), SSIM (structural similarity index), and a test loss calculated from the MSE and SSIM values. According to these metrics, the performance of each model is displayed in
Table 3 below.
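Since the test loss is stated to combine MSE and SSIM, a minimal sketch of such a combined loss is shown below; the equal weighting and the simplified single-window SSIM are our assumptions (standard SSIM uses local Gaussian windows, and the paper’s exact weighting is not specified):

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """Simplified SSIM computed over the whole image as one window."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def combined_loss(pred, target):
    """Assumed combination: MSE plus the SSIM dissimilarity (1 - SSIM)."""
    mse = ((pred - target) ** 2).mean()
    return mse + (1.0 - global_ssim(pred, target))

img = np.random.rand(32, 32)
print(round(combined_loss(img, img), 6))   # 0.0 for a perfect prediction
print(combined_loss(img, 1.0 - img) > combined_loss(img, img))  # True
```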
When we look at the model metrics, the minimum test loss and the maximum SSIM value belong to Model 1. This indicates that our long-term model is the most successful one.
Model 1’s test loss is 0.5% better than Model 3’s, and Model 1’s SSIM value is 1.1% better than Model 3’s. These differences may seem small, but in this dataset most image pixel values are zero (black), which compresses the differences between metric values. This is a dataset-specific issue. However, when we look at the predicted images, we can see the real difference. In
Figure 8, displayed below, we can see the phase-contrast ground truth for Tn+3 and the predictions of all models for Tn+3.
Figure 8 illustrates the prediction performance comparison among the three models. The ground truth phase-contrast image for Tn+3 is shown in (a). The prediction of the long-term ConvLSTM model (b) most closely resembles the ground truth, particularly in reproducing the bright spot-like regions along the cell axis that correspond to optically dense structures observed in the phase-contrast image. In contrast, the short-term models (c) and (d) yield smoother predictions with reduced local intensity variation, indicating a loss of fine structural detail over time. These visual differences demonstrate that the long-term ConvLSTM model achieves more accurate spatio-temporal prediction of cellular morphology.
4.2. Patient Medical Image and EHR Data Research
For future implementation of this model on medical datasets, we identified several publicly available datasets suitable for our objectives. These include OASIS-3 [
34], ADNI [
35], and NACC [
36] for Alzheimer’s disease, and KIOS [
37] for mammography. The ADNI, NACC, and OASIS-3 datasets provide longitudinal MRI scans with matched EHR records, making them ideal for temporal modeling. In contrast, the KIOS mammogram dataset lacks EHR data; given its simple 2D structure, it can therefore be used only for preliminary image registration experiments.
We are also engaged in ongoing collaborations with Turkish hospitals to gain access to real-world sequential mammogram data paired with EHR metadata. Initially, we will focus on fusing 2D mammogram data with corresponding EHRs. Once the model achieves the expected accuracy, we aim to extend it to 3D MRI-based data from Alzheimer’s datasets. The computational cost of 3D data can be addressed by utilizing reduced regions of interest (ROIs) with downsampled slices of 3D MRI data.
5. Discussion
In this study, the proposed framework was preliminarily evaluated on a bacterial time-lapse dataset whose temporal structure resembles longitudinal patient data. Because this dataset does not require spatial alignment across acquisitions, image registration was not included in the current implementation. Nevertheless, when the framework is applied to real medical datasets such as MRI or mammography, image registration is expected to become a critical preprocessing step to ensure that sequential images are mapped to a common coordinate system and that subtle tissue-level changes can be tracked reliably over time.
We anticipate that registration quality will substantially influence downstream prediction performance. Mammogram images typically exhibit greater local deformation due to breast compression during acquisition, whereas MRI images are structurally more stable; therefore, we expect registration to be more reliable for Alzheimer’s MRI datasets than for mammograms. Based on our preliminary mammogram registration trials, SSIM values above 0.80 appeared visually acceptable, suggesting that SSIM—together with mutual information (MI)—can serve as practical metrics for assessing alignment quality. Importantly, because each imaging modality has distinct deformation characteristics, modality-specific SSIM and MI threshold values (e.g., separate thresholds for MRI, mammography, CT, and ultrasound) should be established. Defining such modality-dependent thresholds is expected to be a fundamental requirement for achieving robust performance when applying the framework to real patient imaging and longitudinal EHR data.
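Mutual information between a fixed and a registered image can be computed from a joint intensity histogram, as sketched below; the bin count and the synthetic images are illustrative choices:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Estimate MI (in nats) between two images from their joint
    intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                      # avoid log(0)
    return (p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum()

rng = np.random.default_rng(0)
fixed = rng.random((64, 64))
aligned = fixed + 0.01 * rng.random((64, 64))    # nearly identical scan
shuffled = rng.permutation(fixed.ravel()).reshape(64, 64)

# Well-aligned pairs share much more information than scrambled ones.
print(mutual_information(fixed, aligned) > mutual_information(fixed, shuffled))  # True
```

In practice, such an MI score would be reported alongside SSIM for each registered pair, with acceptance thresholds tuned per modality as discussed above.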
6. Conclusions
This work presents a patient-specific digital twin framework that jointly models temporal medical imaging and longitudinal EHR data to forecast disease evolution and enable scenario-based what-if analysis. The approach is intended to support clinicians by providing forward-looking projections of future imaging outcomes and associated clinical trajectories, potentially facilitating earlier intervention and treatment planning.
Although the methodology is discussed primarily in the context of brain MRI and mammography for Alzheimer’s disease and breast cancer, it is not disease-specific and can be generalized to other conditions with sequential medical images paired with longitudinal EHR records. The imaging inputs do not necessarily need to cover an entire organ; region-of-interest images can also be used, provided that reliable image registration can be achieved.
Future work will focus on implementing and validating the proposed ConvLSTM- and ViViT-based pipelines on real patient datasets, particularly longitudinal Alzheimer’s MRI datasets (e.g., OASIS-3/ADNI/NACC) and sequential mammography datasets paired with EHR metadata. We plan to evaluate registration quality using SSIM and mutual information (MI), assess image generation performance using SSIM and PSNR, and measure EHR prediction accuracy using mean squared error (MSE).