1. Introduction
The digital twin (DT) concept, initially developed as a virtual representation of industrial systems for performing analyses that cannot be carried out on the physical systems themselves, has been adapted to healthcare to model organs, diseases, and medication responses. In healthcare, DTs have mostly been used to optimize hospital workflows, simulate drug–cell interactions, and create virtual physiological models. However, the application of digital twins to generate personalized future medical images for each patient has not been fully explored yet.
With advances in next-frame prediction models based on artificial neural networks, it is now possible to generate video sequences conditioned on specific inputs. We foresee that these techniques could be adapted to the medical images used to track disease progression by fusing electronic health record (EHR) conditions (structured clinical parameters such as laboratory test results, demographic characteristics, and medication information) with the images.
In our research, we examined existing computer-aided diagnosis (CAD) systems in the healthcare domain. We observed that most CAD systems lack the ability to temporally interpret sequences of medical images. These systems are typically designed to analyze individual medical scans—for instance, by detecting a lesion—without predicting future outcomes. Furthermore, only a few existing CAD systems attempt to integrate both medical images and electronic health records (EHRs), often leaving such multimodal interpretation to the expertise of clinicians. It is notable that, in the literature, there is a gap in generating future medical images based on a patient’s historical image data and EHRs with generative models, which we want to address.
These observations motivated us to develop a novel framework, representing a new algorithmic pipeline that overcomes the limitations of traditional CAD systems. This framework is designed to achieve the following key objectives:
Multimodal Fusion: Consolidate medical image and EHR data with a unified model.
Temporal Analysis: Model disease progression over time by capturing temporal dependencies in sequential medical images and EHRs.
Future Prediction: Develop methods to generate future medical images and forecast future EHR metrics, enabling predictive insight into the evolution of disease.
Our contribution is the development of a new model that predicts the disease-progression trajectory of individual patients by jointly using medical images and EHR data. We define this approach as a form of digital twin for the disease, and the rationale for this terminology comes from the two core principles underlying digital twin systems. First, a digital twin must be able to perform what-if analysis, generating individualized predictions for a specific person or object. Second, it must rely on inputs that are uniquely associated with that same person or object. Although many digital twin applications operate with real-time or continuously updated data, this is not a strict requirement. As long as a model satisfies these two fundamental conditions, it can be considered a digital twin.
In our case, the proposed model produces patient-specific forecasts for the future course of a particular disease. Moreover, the inputs used by the model—longitudinal EHR information and sequential medical images—encode the patient’s own long-term clinical history. Because the model learns from patient-specific trajectories, it can also reflect how changes in key clinical variables might influence the predicted future images. For example, if a medication dosage recorded in the EHR were adjusted, if periodically administered cognitive test scores changed, or if laboratory parameters such as blood glucose shifted, the model would be capable of projecting how these altered conditions might affect the resulting future medical images. In addition, incorporating long-term medical imaging history as an input enables the model to determine both the magnitude and the speed with which EHR-driven changes influence the predicted future images; in this sense, the patient’s longitudinal imaging record also functions as a personalized conditioning factor, similar to EHR variables. Thus, the system not only predicts the natural progression of the disease but also allows scenario-based exploration of potential outcomes under varying clinical conditions, further reinforcing its characterization as a digital twin.
The remainder of this paper is organized as follows:
Section 2 discusses the current state-of-the-art studies from the literature.
Section 3 introduces the proposed framework for image processing with generative models. The patient datasets and the test implementation using the single-cell time-lapse microscopy study are described in
Section 4. This dataset was chosen because its temporal structure resembles patient data: the sequential microfluidic channel images parallel the temporal progression of medical imaging, while the alternating glucose–lactose environment reflects the time-varying nature of EHR parameters. Thus, it provides a suitable preliminary testbed before applying the model to real patient datasets.
Section 5 discusses the experimental findings, the expected role of image registration in real medical datasets, and the implications of modality-specific alignment quality for longitudinal prediction. Finally,
Section 6 concludes the paper by summarizing the framework’s contribution to patient-specific digital twin modeling and outlining future validation efforts on longitudinal real-world clinical datasets.
2. Related Work
Disease classification/detection using deep learning models on static medical images is the most widely used approach in computer-aided diagnosis (CAD). In mammography, convolutional neural networks (CNNs) have been used for lesion segmentation [
1], microcalcification detection [
2], and breast density classification [
3]. In neuroimaging, structural MRI and PET images are commonly used in Alzheimer’s disease (AD) diagnosis using 3D CNNs and attention models [
4,
5]. While effective, these static-image-based approaches fail to account for the disease’s temporal progression, and they only offer classifications of disease severity rather than predictions of its future course.
In a study where patient bone age was estimated using patient X-ray images and gender information [
6], a vision transformer (ViT) model was used to evaluate medical images and gender information together. In this model, the gender information was added to the ViT model as a global token. The model thus serves as an example of incorporating EHR information alongside medical images. However, it lacks temporal interpretation and future prediction.
In another study [
7] that interprets 3D brain MRI scans spatio-temporally using ViViT (video vision transformer), an attempt was made to detect Alzheimer’s disease. In this study, the central 32 MRI slices (2D slices) are provided to the model as input. While spatial attention is applied to each MRI slice, the sequential slices are treated as if they represent consecutive time steps, and temporal attention is applied across them. ViViT has seen other uses in medical image analysis as well. However, as this study shows, there is no true temporal interpretation across images from different actual time points, no generation of future medical images, and no integration of electronic health records (EHRs) to help the model detect the disease or generate future images.
Temporal modeling strategies attempt to overcome static limitations. Temporal subtraction, which highlights changes between prior and current images, has been applied in mammography for enhanced microcalcification detection [
8]. Loizidou et al. [
9] showed performance gains using subtraction imaging in longitudinal studies. However, these models only predict the current state of the disease.
We also reviewed models that predict disease progression using regression, classification, or image-generation approaches. As examples of models that generate future medical images, we can refer to the following studies. In brain imaging, DaniNet [
10] introduced a generative adversarial network (GAN) conditioned on patient age and diagnosis (normal, MCI, AD) to simulate future brain MRI slices in AD patients. This model makes future predictions according to the current EHR (age, diagnosis) and medical image scan (MRI). However, it does not take into account prior image scans and EHRs. In a longitudinal MRI-to-MRI prediction study [
11], multiple deep architectures—UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet—were evaluated to generate a subject’s future brain MRI from baseline imaging. However, the approach is imaging-only and does not incorporate longitudinal EHR signals; consequently, it cannot model how changes in medications, laboratory results, or other EHR variables would alter the predicted future scan. In a longitudinal MRI modeling study [
12], a generative autoencoder-based framework with disentangled structural and longitudinal state representations—supported by a personalized memory module—was used to generate missing and future MR images through latent interpolation and extrapolation. However, the method is imaging-only and does not incorporate longitudinal EHR histories, which means it cannot simulate how changes in clinical measurements or other EHR variables would influence the predicted future scan. In a spatio-temporal tumor growth prediction study [
13], a 4D deep learning framework based on spatio-temporal ConvLSTM (ST-ConvLSTM) was employed to generate future tumor volumes, shapes, and intensity patterns from sequential volumetric imaging. Although the study mentions that clinical information can be incorporated, the implemented model relies almost exclusively on imaging data and does not integrate rich longitudinal EHR variables; therefore, it cannot perform patient-specific what-if analysis driven by changes in clinical parameters or laboratory values. As we did for these studies, for models that predict the future of the disease, we conducted a literature review according to the following criteria. These studies are listed in
Table 1.
The criteria were as follows:
Does the model use image data as input?
Does the model use EHR data as input?
Does the model use history data as input?
Does the model output a predicted image for the future of the disease?
Does the model output predicted EHR (classification/regression) for the future of the disease?
According to
Table 1, these temporal models also have limitations. Some incorporate a patient’s previous medical images but not their electronic health records (EHRs). Some models include both EHRs and medical images as input but fail to consider prior medical images and historical EHR data, which are essential for achieving a more patient-specific understanding. Moreover, even image-only temporal models do not always generate future scans; approximately half of the studies instead formulate future prediction as a classification or regression task.
In the proposed framework, both longitudinal medical images and corresponding EHR records are jointly used as model inputs. By learning from historical digital records across patients, the model captures shared disease dynamics while preserving the ability to condition predictions on individual patient trajectories. Patients with similar imaging and clinical histories contribute to the learning of comparable progression patterns, allowing the model to transition from a generalized representation of disease evolution toward patient-specific forecasting. Under this formulation, the trained model can be instantiated as a disease-specific digital twin for an individual patient when applied to that patient’s historical data.
3. Proposed Architecture
An overview of the proposed architecture is given in
Figure 1. This architecture is a conceptual representation of how the digital data of patients are collected and presented to DT models to predict future outcomes. The diagram illustrates a disease-specific, patient-centered digital twin pipeline that jointly leverages longitudinal medical images and time-aligned EHR data. First, the inputs are prepared through modality-aware preprocessing, including medical image preprocessing, image registration to align sequential scans, and EHR preprocessing to standardize clinical variables. The processed data are then routed into separate digital twins for each disease a patient may have (e.g., Alzheimer’s, breast cancer), emphasizing that a single patient can be associated with multiple disease-specific DT instances. Within each DT, generative modeling components forecast future medical images, while regression-based components predict future EHR values. Together, these outputs support patient-specific prognosis and enable scenario-based what-if analysis by simulating how changes in clinical conditions may influence future imaging and EHR trajectories. The main components of the proposed architecture are discussed in the remainder of this section.
Sequential image processing requires preprocessing that gives the ordered images a consistent structure. In the proposed architecture, we apply an image registration technique to sequential images before feeding them to the neural network model. The first step is to register all medical images so that each frame is aligned in angle, size, and deformation. Sequential medical images generally need to be registered. However, as noted above, this framework is not limited to patient medical image datasets and can be applied to other sequential imaging scenarios as well. If the sequential images exhibit minimal changes in orientation, viewpoint, or local tissue/texture deformation, as in some single-cell time-lapse microscopy settings, this registration preprocessing step can be omitted.
After that, we can use ConvLSTM [
20] to interpret sequential images, because ConvLSTM is more successful than classic ’CNN + LSTM’ models for image generation tasks. In
Figure 2, the ConvLSTM pipeline is displayed. First, an embedding layer is used for the EHRs. After that, there are two options for fusing the EHR embedding output with the image channels. One option is to add the EHR embedding outputs to the medical images as extra channels; the second is to generate new image channels that hold both the EHR and image information by using the spatio-temporal FiLM (ST-FiLM) algorithm. In this way, the EHRs play a role in generating new images. These new inputs are then processed by the ConvLSTM layer, and from the ConvLSTM output a medical image can be generated directly, or a cVAE-GAN structure can be used to interpret this output, which increases the quality of the generated image. At the same time, the ConvLSTM output allows for the prediction of EHR values, making the model suitable for joint image and EHR prediction.
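The first fusion option (appending the EHR embedding outputs to the images as extra channels) can be sketched in NumPy; the shapes and the 4-dimensional embedding are illustrative assumptions, not values from the paper:

```python
import numpy as np

def fuse_ehr_as_channels(frame, ehr_embedding):
    """frame: (C, H, W) image; ehr_embedding: (E,) vector.

    Returns a (C + E, H, W) tensor where each embedding dimension is
    tiled across the spatial grid as a constant-valued channel.
    """
    c, h, w = frame.shape
    e = ehr_embedding.shape[0]
    # Tile each embedding value into an H x W plane.
    ehr_planes = np.broadcast_to(ehr_embedding[:, None, None], (e, h, w))
    return np.concatenate([frame, ehr_planes], axis=0)

frame = np.random.rand(1, 64, 64)       # e.g., one grayscale frame
ehr = np.array([0.2, -1.3, 0.7, 0.0])   # e.g., a 4-dim EHR embedding (assumed)
fused = fuse_ehr_as_channels(frame, ehr)
print(fused.shape)  # (5, 64, 64)
```

Applied per time step, this yields the condition-augmented frame sequence that the ConvLSTM layer then consumes.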
However, instead of ConvLSTM, a ViViT (Video ViT) [
21] pipeline can be used for spatio-temporal interpretation of sequential images as shown in
Figure 3. ViViT is an extended version of the vision transformer (ViT). While ViT uses an attention mechanism to interpret spatial information in a single image, ViViT performs spatio-temporal interpretation of sequential images. In ViT/ViViT models, conditions can be added as global tokens. Accordingly, EHRs can be added to the ViViT model as ’global tokens’ per frame at each time step, so that they are evaluated as conditions. The output of the ViViT is a latent representation (latent space). To generate a medical image from this latent space, a cGAN, a cVAE, an upsampling model, or a diffusion model can be used.
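The per-frame global-token idea can be sketched as follows; the patch counts, embedding size, and the random projection standing in for a learned layer are all illustrative assumptions:

```python
import numpy as np

# Prepend a per-frame EHR "global token" to each frame's patch sequence
# before spatio-temporal attention, analogous to a [CLS] token.
rng = np.random.default_rng(0)
T, N, D = 4, 16, 32            # frames, patches per frame, embedding dim
patches = rng.normal(size=(T, N, D))

ehr = rng.normal(size=(T, 6))  # one 6-dim EHR vector per time step (assumed)
W = rng.normal(size=(6, D))    # stand-in for a learned projection to token dim

ehr_tokens = ehr @ W                                  # (T, D): one token per frame
tokens = np.concatenate([ehr_tokens[:, None, :], patches], axis=1)
print(tokens.shape)  # (4, 17, 32): each frame now has N + 1 tokens
```

The attention layers of the ViViT encoder would then operate on this extended token sequence, letting every patch attend to its frame’s EHR condition.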
We will discuss the steps suggested in the ConvLSTM/ViViT pipelines in more depth in the following subsections.
3.1. Image Registration
Image registration is the process of aligning two or more images of the same scene, object, or body parts acquired at different times, from different viewpoints, or by different imaging devices. The goal is to bring the images into a common coordinate system so that corresponding features (e.g., anatomical structures, objects) overlap accurately. This process involves designating one image as the reference image, also called the fixed image, and applying global geometric transformations or local displacements to the other image so that it aligns with the reference image [
22].
Global geometric transformations include shifting, rotation, scaling, or shearing. An example of global shifting and rotation is displayed in
Figure 4 on CT and MR images.
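As a minimal illustration of global registration, the sketch below estimates a pure translation between two frames via phase correlation; real tools (e.g., MATLAB’s imregister) additionally handle rotation, scaling, and interpolation, so this only recovers an integer shift:

```python
import numpy as np

def estimate_shift(fixed, moving):
    """Estimate the integer (dy, dx) translation between two images
    using FFT-based phase correlation."""
    F = np.fft.fft2(fixed)
    M = np.fft.fft2(moving)
    cross_power = F * np.conj(M)
    cross_power /= np.abs(cross_power) + 1e-12   # normalize to pure phase
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peaks in the wrap-around half to negative shifts.
    if dy > fixed.shape[0] // 2:
        dy -= fixed.shape[0]
    if dx > fixed.shape[1] // 2:
        dx -= fixed.shape[1]
    return int(dy), int(dx)

fixed = np.zeros((64, 64))
fixed[20:30, 20:30] = 1.0
moving = np.roll(fixed, shift=(5, -3), axis=(0, 1))  # shifted copy
print(estimate_shift(fixed, moving))  # (-5, 3): the shift that undoes the roll
```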
Medical image registration is the process of aligning multiple medical images, volumes, and surfaces to a common coordinate system. In medical imaging, it is often necessary to compare multiple scans of the same patient acquired during different sessions and under varying conditions. At this point, medical image registration is typically used as a preprocessing step to align the medical images to a common coordinate system before analysis [
24].
The brain MRI example above demonstrates a typical case of global geometric misalignment that can be corrected using standard rigid or affine transformations. However, medical imaging encompasses a wide range of modalities, many of which involve acquisition-specific distortions that cannot be resolved by global registration alone. To illustrate this broader challenge, we next present a mammography example, where compression-based shape deformation leads to local pixel displacements, making deformable (non-rigid) registration essential.
Specifically, in mammographic imaging, the breast is compressed or positioned at specific angles during acquisition, which introduces characteristic shape changes and local texture distortions (pixel displacements). These modality-specific deformations cannot be corrected using only global transformations.
To address this issue, deformable (non-rigid) registration algorithms are applied after global registration. In
Figure 5, old and new MLO mammograms of the same patient are displayed.
With the MATLAB R2025a image registration tool, we applied registration algorithms to these mammograms. We selected the old mammogram as the reference image and registered the new mammogram so that it overlaps with the old one.
As seen in
Figure 5, the breast tissue in the new mammogram is smaller than in the old one. Global geometric transformations can handle this size difference, but the breast tissue boundaries still do not match exactly. During mammogram acquisition, the breast is compressed and the positioning angles differ between the two mammograms, which causes shape distortions. Deformable (non-rigid) registration methods must therefore be applied. As seen in
Figure 6, after global geometric transformations, to overlap the new mammogram over the old mammogram, the demons [
25,
26] and B-spline FFD [
27] algorithms are applied.
Both of these algorithms are based on a displacement field that stores, for each pixel, the direction and magnitude of its displacement. They are classical (non-learning) methods that estimate the displacement field iteratively, with the iteration count set by an expert; they are not learning-based artificial neural network models.
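A displacement field of this kind can be applied as follows; the constant one-pixel field and nearest-neighbor sampling are simplifications for illustration (demons and B-spline FFD estimate smooth, spatially varying fields and use interpolation):

```python
import numpy as np

def warp(moving, dy, dx):
    """Warp a 2D image with a dense displacement field: each output
    pixel (y, x) samples the moving image at (y + dy, x + dx)."""
    h, w = moving.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    return moving[src_y, src_x]

img = np.zeros((8, 8))
img[4, 4] = 1.0
# A constant field that pulls every pixel from one row below.
dy = np.ones((8, 8))
dx = np.zeros((8, 8))
warped = warp(img, dy, dx)
print(np.argwhere(warped == 1.0))  # [[3 4]]: the bright pixel moved up one row
```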
There are other static algorithms too, but development in the area of artificial neural network models has revealed neural network registration models like VoxelMorph [
28]. Compared with classical static methods, this model registers images quickly, but it requires training before use. It is therefore well suited to scenarios where large datasets must be registered.
3.2. EHR Embedding
Electronic health records (EHRs) belonging to a patient can be static, such as gender, or can vary with each measurement, such as blood pressure, lab tests, blood metrics, medications, and age.
Both types can affect the generation of the predicted medical image. In our model, we can use both static and varying (dynamic) EHRs. However, because the varying EHRs are temporally linked to the sequential medical images, we need to embed them into the model separately for each sequential frame.
In the ConvLSTM pipeline, we can opt to embed all EHRs with an embedding neural network layer. After that, the output of the embedding layer can be added to frames as channels.
Alternatively, in the ConvLSTM pipeline, instead of adding the embedding output to frames as channels, we can use a popular recent method named FiLM (feature-wise linear modulation) [29], which is used for spatial condition embedding in image generation models. The FiLM feature-map modulation is formulated as in Equation (1) below:

F̂_c(x, y) = γ_c(z) · F_c(x, y) + β_c(z),   (1)

where F_c(x, y) denotes the activation of channel c at spatial position (x, y), and z represents an external condition such as a language or metadata embedding. The network learns channel-wise affine modulation parameters γ_c and β_c from the conditioning vector z and applies them uniformly across all spatial locations of each feature map.

We can extend this model spatio-temporally by adding a time (t) dimension. The spatio-temporal conditioning via feature-wise linear modulation (ST-FiLM) feature map can then be formulated as in Equation (2) below:

F̂_c(x, y, t) = γ_c(x, y, t) · F_c(x, y, t) + β_c(x, y, t),   (2)

where F_c(x, y, t) denotes the activation of channel c at spatial position (x, y) and time t. Here, γ_c(x, y, t) and β_c(x, y, t) are spatially and temporally varying modulation maps predicted by the FiLM generator from the per-time condition embedding z_t. To put it more clearly: ST-FiLM does not create additional channels; instead, it modulates the existing feature maps (image channels) using condition-dependent γ and β parameters. After this modulation, each image channel becomes condition-aware, blending both the image features and the spatio-temporal condition information. A representative application of ST-FiLM can be seen in this video development article [30].
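The two modulation schemes of Equations (1) and (2) can be sketched in NumPy; the fixed gamma/beta arrays below stand in for the outputs of a learned FiLM generator:

```python
import numpy as np

def film(features, gamma, beta):
    """FiLM: features (C, H, W); gamma, beta (C,), applied uniformly
    across all spatial locations of each channel."""
    return gamma[:, None, None] * features + beta[:, None, None]

def st_film(features, gamma, beta):
    """ST-FiLM: features, gamma, beta all (T, C, H, W); modulation
    varies per channel, per position, and per time step."""
    return gamma * features + beta

feats = np.ones((2, 4, 4))                 # C = 2 feature maps
out = film(feats, gamma=np.array([2.0, 0.5]), beta=np.array([1.0, 0.0]))
print(out[0, 0, 0], out[1, 0, 0])          # 3.0 0.5

seq = np.ones((3, 2, 4, 4))                # T = 3 time steps
g = np.full_like(seq, 2.0)                 # would come from z_t in practice
b = np.zeros_like(seq)
print(st_film(seq, g, b).mean())           # 2.0
```

Note that neither function changes the channel count: the existing feature maps are rescaled and shifted in place, which is exactly the "no additional channels" property described above.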
Additionally, in the ViViT pipeline, we can add EHRs to the model as ’global tokens’ per frame for each time step.
In standard ViT, there is a [CLS] token added to the image patch sequence. It gathers information via attention and is used for final classification. The CLS token is a trainable vector for learning the image context. Similarly, in ViViT, global tokens are added to the spatio-temporal image patch sequences to perform a similar summarization role but extended over time as well as space.
3.3. Spatio-Temporal Modeling
Both ConvLSTM and vision transformer (ViViT-style) encoders can be used to learn spatio-temporal features between sequential frames. These allow modeling anatomical trends and disease dynamics.
Standard vision transformers [
31] are used only to interpret spatial information in one image with an attention mechanism. The attention mechanism divides the images into small pieces (small areas) and converts them into patches, which are vectors. After that, relations between patches are calculated. This increases the spatial interpretation in the image. However, video ViTs (ViViT) [
21] are used to interpret spatio-temporal information across sequential frames. In some ViViT variants, spatial and temporal attention are calculated separately: spatial attention is first computed within each frame, and temporal attention is then computed across the frame sequence. In other variants, spatial and temporal attention are calculated jointly, with attention computed across all frames as if they formed a single input. The joint variant generally achieves better performance but requires more training.
We also want to mention that if the time between sequential images is not equal, both ConvLSTM and ViViT can be configured to accommodate varying time-lapse situations.
3.4. Image Generation and EHR Prediction
According to the modeling of the ConvLSTM pipeline, the ConvLSTM output can be directly used to generate a predicted medical image, or, to increase the quality of the output, we can use this output as latent space and feed this output into a cVAE+GAN [
32] model as input. As a result, the output of the cVAE+GAN stage is of higher quality. In
Figure 7, the structure of the cVAE+GAN model is explained.
cVAE+GAN resolves the weaknesses of the cVAE and cGAN models, which can be briefly summarized as follows:
Compared with cVAE, the sharpness of the output improves, and the texture similarity between the original image and the generated image increases.
Compared with cGAN, it can mitigate common training instabilities, such as mode collapse and gradient vanishing.
It preserves output diversity similar to cVAE while improving conditional controllability.
To predict EHRs from the output of the ConvLSTM layer, the output can be flattened and used as a latent representation; a fully connected or MLP layer can then predict the EHR values correlated with the generated medical image.
In the ViViT pipeline, the output of the ViViT encoder is a latent representation. This can be used for both medical image generation and EHR prediction, which are correlated with the generated image. By using this latent representation as input for cGAN, cVAE, upsampling, or diffusion models, a medical image can be generated. As in the ConvLSTM pipeline, interpreting the latent representation with a fully connected layer or an MLP layer, the model can predict EHRs too.
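The dual-output design above (one head for the image, one for the EHR values) can be sketched minimally in NumPy; the random matrices stand in for learned decoder/MLP weights, and all shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=(128,))       # flattened encoder output (assumed size)

# EHR head: latent -> a few predicted clinical values (regression).
W_ehr = rng.normal(size=(128, 3))
ehr_pred = latent @ W_ehr              # e.g., 3 forecast EHR metrics

# Image head: latent -> a coarse image, which the full pipeline would
# refine with a cVAE+GAN, upsampling, or diffusion decoder.
W_img = rng.normal(size=(128, 16 * 16))
img_pred = (latent @ W_img).reshape(16, 16)

print(ehr_pred.shape, img_pred.shape)  # (3,) (16, 16)
```

Because both heads read the same latent representation, the predicted EHR values and the generated image stay mutually consistent, which is the point of joint prediction.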
4. Datasets and Implementation
Accurate future image prediction in a patient-specific digital twin requires sequential medical images that are temporally aligned with longitudinal EHR data. Since real patient datasets with long-term paired imaging–EHR sequences are restricted, we first performed a preliminary evaluation on a publicly available single-cell time-lapse microscopy dataset [
33]. Although this dataset is not medical, it exhibits the same type of temporal processes needed for digital twin modeling: periodically acquired images that evolve over time and external conditions (glucose–lactose) that change across time steps. These changing nutrient conditions serve as temporal conditioning variables analogous to how a patient’s clinical measurements (e.g., medications, lab tests, cognitive scores) vary across clinical visits. Modifying the nutrient sequence also enables what-if exploration, similar to scenario simulation in patient-specific digital twins. Thus, this single-cell dataset functions as a controlled testbed validating the core modeling principles—learning long-term trajectories, integrating time-varying conditions, and forecasting future states—before applying our model to real patient medical images and EHR data. The medical datasets we aim to use are outlined in the section “Patient Medical Image and EHR Data Research” and will be the focus of our future work.
4.1. Implementing the ConvLSTM Model on Single-Cell Time-Lapse Microscopy Study
In this dataset, images were acquired using phase-contrast and wide-field epifluorescence microscopy on a microfluidic platform designed for long-term single-cell observation. Each microfluidic channel containing E. coli cells was imaged periodically at fixed time intervals, capturing both morphological (phase-contrast) and gene expression (fluorescence) information.
All images were stored as multi-frame TIFF stacks. We processed 156 TIFF stacks, exporting phase-contrast and GFP frames as PNG images at 30 min intervals. In parallel, we recorded the nutrient condition (glucose vs. lactose) between consecutive frames as a time-aligned conditioning variable. These condition labels were used to model how environmental changes influence future frames.
We used the ConvLSTM model explained in
Figure 2. We used the ST-FiLM algorithm for spatio-temporal condition embedding instead of adding the conditions to the images as channels.
We use the phase-contrast (PHC) and GFP images, together with the correlated nutritional environment, at time steps Tn, Tn+1, and Tn+2 as model inputs, and aim to predict the phase-contrast and GFP images at time step Tn+3. In this setting, the nutritional environment acts as a conditioning variable. The proposed model operates as a long-term prediction model. For comparison, we also implemented two short-term models. Model 1 corresponds to the proposed long-term ConvLSTM model, while Models 2 and 3 represent short-term prediction variants. The input configurations and prediction targets of all three models are summarized in
Table 2.
The nutrition conditions in the models must be interpreted as follows: if the input for time step Tn is glucose, it means that from the starting time until time Tn, the cells are in a glucose medium.
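The sample construction implied by the Model 1 layout (three past frames plus their nutrient conditions as input, the next frame as target) can be sketched as follows; the tiny arrays and labels are illustrative:

```python
import numpy as np

def make_samples(frames, conditions, history=3):
    """frames: (T, H, W) array; conditions: length-T list of labels.

    Returns (input_frames, input_conditions, target_frame) tuples
    built with a sliding window over the sequence.
    """
    samples = []
    for i in range(len(frames) - history):
        samples.append((frames[i:i + history],
                        conditions[i:i + history],
                        frames[i + history]))
    return samples

frames = np.arange(6 * 4).reshape(6, 2, 2)   # 6 tiny dummy frames
conds = ["glucose", "glucose", "lactose", "lactose", "glucose", "lactose"]
samples = make_samples(frames, conds)
print(len(samples))         # 3 training samples from a 6-frame sequence
x, c, y = samples[0]
print(x.shape, y.shape)     # (3, 2, 2) (2, 2)
```

In the real dataset, the same windowing is applied to the exported PHC/GFP frame pairs and their time-aligned glucose/lactose labels.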
Using the time-related images and the nutrition conditions, we generated training, validation, and test datasets comprising 64%, 16%, and 20% of the data, respectively.
We used the same test dataset for all models. The success of the models is measured by MSE (mean squared error), SSIM (structural similarity index), and a test loss calculated from the MSE and SSIM values. According to these metrics, the performance of each model is displayed in
Table 3 below.
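Since the test loss is stated to combine MSE and SSIM, a minimal sketch of such a combined loss is shown below; the equal weighting and the simplified single-window SSIM are our assumptions (standard SSIM uses local Gaussian windows, and the paper’s exact weighting is not specified):

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """Simplified SSIM computed over the whole image as one window."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def combined_loss(pred, target):
    """Assumed combination: MSE plus the SSIM dissimilarity (1 - SSIM)."""
    mse = ((pred - target) ** 2).mean()
    return mse + (1.0 - global_ssim(pred, target))

img = np.random.rand(32, 32)
print(round(combined_loss(img, img), 6))   # 0.0 for a perfect prediction
print(combined_loss(img, 1.0 - img) > combined_loss(img, img))  # True
```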
When we look at the model metrics, the minimum test loss and the maximum SSIM value belong to Model 1. This indicates that our long-term model is the most successful one.
Model 1’s test loss is 0.5% better than Model 3’s, and Model 1’s SSIM value is 1.1% better than Model 3’s. These differences may seem small, but in this dataset most image pixel values are zero (black), which compresses the differences between metric values. This is a dataset-specific issue. However, when we look at the predicted images, we can see the real difference. In
Figure 8, displayed below, we can see the phase-contrast ground truth for Tn+3 and the predictions of all models for Tn+3.
Figure 8 illustrates the prediction performance comparison among the three models. The ground truth phase-contrast image for Tn+3 is shown in (a). The prediction of the long-term ConvLSTM model (b) most closely resembles the ground truth, particularly in reproducing the bright spot-like regions along the cell axis that correspond to optically dense structures observed in the phase-contrast image. In contrast, the short-term models (c) and (d) yield smoother predictions with reduced local intensity variation, indicating a loss of fine structural detail over time. These visual differences demonstrate that the long-term ConvLSTM model achieves more accurate spatio-temporal prediction of cellular morphology.
4.2. Patient Medical Image and EHR Data Research
For future implementation of this model on medical datasets, we identified several publicly available datasets suitable for our objectives. These include OASIS-3 [
34], ADNI [
35], and NACC [
36] for Alzheimer’s disease, and KIOS [
37] for mammography. The ADNI, NACC, and OASIS-3 datasets provide longitudinal MRI scans with matched EHR records, making them ideal for temporal modeling. In contrast, the KIOS mammogram dataset lacks EHR data; given its simple 2D structure, it can therefore be used only for preliminary image registration experiments.
We are also engaged in ongoing collaborations with Turkish hospitals to gain access to real-world sequential mammogram data paired with EHR metadata. Initially, we will focus on fusing 2D mammogram data with corresponding EHRs. Once the model achieves the expected accuracy, we aim to extend it to 3D MRI-based data from Alzheimer’s datasets. The computational cost of 3D data can be addressed by utilizing reduced regions of interest (ROIs) with downsampled slices of 3D MRI data.
5. Discussion
In this study, the proposed framework was preliminarily evaluated on a bacterial time-lapse dataset whose temporal structure resembles longitudinal patient data. Because this dataset does not require spatial alignment across acquisitions, image registration was not included in the current implementation. Nevertheless, when the framework is applied to real medical datasets such as MRI or mammography, image registration is expected to become a critical preprocessing step to ensure that sequential images are mapped to a common coordinate system and that subtle tissue-level changes can be tracked reliably over time.
We anticipate that registration quality will substantially influence downstream prediction performance. Mammogram images typically exhibit greater local deformation due to breast compression during acquisition, whereas MRI images are structurally more stable; therefore, we expect registration to be more reliable for Alzheimer’s MRI datasets than for mammograms. Based on our preliminary mammogram registration trials, SSIM values above 0.80 appeared visually acceptable, suggesting that SSIM—together with mutual information (MI)—can serve as practical metrics for assessing alignment quality. Importantly, because each imaging modality has distinct deformation characteristics, modality-specific SSIM and MI threshold values (e.g., separate thresholds for MRI, mammography, CT, and ultrasound) should be established. Defining such modality-dependent thresholds is expected to be a fundamental requirement for achieving robust performance when applying the framework to real patient imaging and longitudinal EHR data.
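Mutual information between a fixed and a registered image can be computed from a joint intensity histogram, as sketched below; the bin count and the synthetic images are illustrative choices:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Estimate MI (in nats) between two images from their joint
    intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                      # avoid log(0)
    return (p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum()

rng = np.random.default_rng(0)
fixed = rng.random((64, 64))
aligned = fixed + 0.01 * rng.random((64, 64))    # nearly identical scan
shuffled = rng.permutation(fixed.ravel()).reshape(64, 64)

# Well-aligned pairs share much more information than scrambled ones.
print(mutual_information(fixed, aligned) > mutual_information(fixed, shuffled))  # True
```

In practice, such an MI score would be reported alongside SSIM for each registered pair, with acceptance thresholds tuned per modality as discussed above.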
6. Conclusions
This work presents a patient-specific digital twin framework that jointly models temporal medical imaging and longitudinal EHR data to forecast disease evolution and enable scenario-based what-if analysis. The approach is intended to support clinicians by providing forward-looking projections of future imaging outcomes and associated clinical trajectories, potentially facilitating earlier intervention and treatment planning.
Although the methodology is discussed primarily in the context of brain MRI and mammography for Alzheimer’s disease and breast cancer, it is not disease-specific and can be generalized to other conditions with sequential medical images paired with longitudinal EHR records. The imaging inputs do not necessarily need to cover an entire organ; region-of-interest images can also be used, provided that reliable image registration can be achieved.
Future work will focus on implementing and validating the proposed ConvLSTM- and ViViT-based pipelines on real patient datasets, particularly longitudinal Alzheimer’s MRI datasets (e.g., OASIS-3/ADNI/NACC) and sequential mammography datasets paired with EHR metadata. We plan to evaluate registration quality using SSIM and mutual information (MI), assess image generation performance using SSIM and PSNR, and measure EHR prediction accuracy using mean squared error (MSE).