Article

MRID: Modeling Radiological Image Differences for Disease Progression Reasoning via Multi-Task Self-Supervision

1 School of Computer Science and Technology, Tongji University, Shanghai 201804, China
2 School of Economics and Management, Tongji University, Shanghai 200092, China
3 Service-Oriented Manufacturing Innovation and Research Center, Shanghai 200092, China
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(5), 997; https://doi.org/10.3390/electronics15050997
Submission received: 14 January 2026 / Revised: 6 February 2026 / Accepted: 25 February 2026 / Published: 27 February 2026
(This article belongs to the Special Issue AI-Driven Medical Image/Video Processing)

Abstract

Automated radiology report generation has become a prominent research topic in medical multimodal learning. However, most existing approaches primarily focus on single-image interpretation and rarely address the task of tracking disease progression across longitudinal chest X-rays. This task presents two major challenges: accurately localizing pathological changes between temporally paired images, and effectively translating visual difference representations into clinically meaningful textual descriptions. To address these challenges, we propose MRID (Modeling Radiological Image Differences for Disease Progression Reasoning), a multi-task self-supervised framework that follows a pretraining–finetuning paradigm. MRID leverages multiple complementary self-supervised objectives to jointly achieve (1) intra-modal spatial alignment of organs and pathological regions across image pairs, and (2) cross-modal semantic alignment between visual difference representations and radiology report embeddings. Furthermore, we introduce a simple yet effective data augmentation strategy to alleviate the imbalance of disease progression categories. Extensive experiments conducted on the Longitudinal-MIMIC and MS-CXR-T datasets demonstrate that MRID effectively captures fine-grained disease progression patterns. In addition, the proposed framework achieves competitive performance on single-image radiology report generation, further highlighting its strong capability in modeling chest X-ray semantics.

1. Introduction

The automated interpretation of medical images has emerged as a key research topic in medical artificial intelligence [1]. Among these efforts, radiology report generation aims to translate visual findings in chest X-rays (CXRs) into structured and clinically meaningful textual descriptions, thereby improving diagnostic efficiency and ensuring reporting consistency [2]. However, most existing studies [3,4,5,6,7] primarily formulate this task as a single-image captioning problem, which limits their ability to model temporal disease progression—an essential factor for evaluating patient status and treatment response.
In real-world clinical practice, radiologists routinely compare longitudinal CXRs of the same patient acquired at different time points to assess subtle changes in lesions, anatomical structures, or opacity distributions [8,9]. Consequently, radiology reports often include expressions describing disease progression or stability, such as “pneumonia has improved” or “pleural effusion remains unchanged”. Despite its clinical importance, automating this process remains highly challenging due to the intrinsic complexity of temporal variations in CXRs and the limited availability of annotated longitudinal datasets for supervised learning [10].
To address these challenges, several prior studies [11,12,13] adopt explicit spatial alignment pipelines, in which external segmentation modules are employed to localize pathological regions within predefined anatomical structures. The resulting localized features are subsequently integrated with expert-defined priors to construct structured visual representations for downstream reasoning tasks.
While such approaches can incorporate spatial priors that guide the model’s attention toward clinically relevant regions, they also suffer from several fundamental limitations. First, the performance of current medical image detection and classification models remains suboptimal, particularly in complex pathological scenarios [14]. Inaccurate localization may propagate errors to subsequent modeling stages, thereby undermining the reliability of disease progression reasoning. Second, bounding-box-driven localization schemes tend to restrict the receptive field to isolated regions, neglecting global contextual information and the interdependence among multiple pathological areas—both of which are critical for comprehensive disease understanding.
To overcome the limitations of explicit localization strategies, recent studies have increasingly explored self-supervised learning paradigms that leverage large-scale unlabeled image–text data to induce implicit semantic alignment [15,16,17]. These approaches have demonstrated strong potential in capturing high-level correspondences between medical images and radiology reports. Building upon these insights, we decompose disease progression reasoning into two sub-tasks:
(1) Inter-image spatial alignment. This sub-task focuses on implicitly aligning pathological regions between the current image and its historical reference, and extracting disease progression features through a dedicated dual-image encoder. A key challenge lies in suppressing pseudo-changes introduced by imaging noise, acquisition variability, or patient positioning differences [10].
(2) Image–report semantic alignment. This sub-task aims to jointly model visual difference representations and their corresponding textual descriptions within a unified framework. By leveraging multiple self-supervised objectives, the model is encouraged to align unified visual representations with complete radiology reports that jointly describe static findings and disease progression.
Based on the above analysis, we propose a novel framework termed MRID, which aims to model disease progression from paired chest X-rays and generate clinically coherent radiology reports. Instead of relying on explicit region annotations, MRID formulates disease progression reasoning as a unified representation alignment problem, where pathological changes are implicitly captured through joint modeling of inter-image visual differences and report semantics. In addition, we design a data augmentation strategy to improve robustness under limited and imbalanced longitudinal supervision.
Our main contributions are summarized as follows:
(1) We propose MRID, a novel self-supervised framework for longitudinal radiology report generation. The framework jointly aligns unified visual representations derived from paired chest X-rays with complete radiology reports through five carefully designed self-supervised objectives that enforce fine-grained semantic alignment.
(2) We design a dual-image encoder for disease progression modeling. By extending conventional single-image representations to encode inter-image temporal differences, the proposed encoder enables the model to capture subtle pathological changes while preserving global contextual information.
(3) We propose a direction-aware data augmentation and curriculum learning strategy for longitudinal modeling. By synthesizing temporally reversed image–text pairs and adopting stage-wise training, the proposed framework achieves more stable optimization under limited and imbalanced data. Extensive experiments demonstrate that MRID achieves competitive performance on both longitudinal disease progression reasoning and single-image radiology report generation tasks.

2. Related Works

2.1. Medical Image Report Generation

Deep learning techniques have achieved remarkable success in medical image analysis [18]. However, acquiring high-quality medical annotations is both time-consuming and costly, making it challenging to construct sufficiently large datasets for fully supervised training [1,2]. To alleviate this limitation, vision–language pretraining (VLP) has emerged as a promising paradigm. Rather than relying on manually curated annotations, VLP models leverage paired yet weakly aligned image–report data and are optimized with self-supervised objectives to learn transferable multimodal representations. Once pretrained, these models can serve as generic feature extractors or be fine-tuned for downstream tasks such as radiology report generation and disease classification.
Representative VLP methods include MMBERT [4], which integrates ResNet-extracted visual features with textual embeddings and incorporates image information as contextual guidance for masked medical text modeling. GLoRIA [5] enhances cross-modal understanding by learning context-aware local representations, enabling fine-grained alignment between visual subregions and textual tokens through attention mechanisms. BioViL [6] introduces a domain-specific vocabulary and combines masked language modeling with a report–study matching objective to improve semantic alignment between paired images and reports. KIA [7] unifies three complementary implicit alignment objectives—masked modeling, global contrastive alignment, and image-to-report generation—within a single framework.
Despite their effectiveness, these approaches primarily focus on single-image report generation, treating each chest X-ray independently. As a result, temporal variations across longitudinal examinations are not explicitly modeled, which limits their applicability to disease progression analysis and longitudinal clinical reasoning.

2.2. Image Change Captioning in the General Domain

In the general domain, image change captioning aims to generate natural language descriptions that capture genuine semantic differences between two visually similar images. A central challenge of this task lies in distinguishing meaningful semantic variations from pseudo changes induced by illumination shifts, viewpoint differences, or other non-semantic visual perturbations.
To address these challenges, a variety of methods have been proposed. ReGAT [19] introduces a relation-aware graph attention network that models object interactions through structured graph reasoning to better capture relational changes. NCT [20] presents a neighborhood contrastive transformer that aggregates contextual information from neighboring features to enhance contextual representation learning. IFDC [21] extracts fine-grained visual, semantic, and positional object representations and performs multi-round feature fusion to accurately localize and describe changed regions. MCCFormer [22] further advances this task by learning dense cross-image associations, enabling the model to identify multiple changed areas and dynamically focus on relevant regions during description generation.
Compared with general image pairs, medical image change captioning introduces additional challenges. Variations in imaging conditions can lead to artifacts and grayscale inconsistencies, while normal anatomical variability and patient positioning differences hinder the establishment of a stable anatomical reference across images [15]. These factors make it substantially more difficult to distinguish true pathological changes from pseudo variations in medical imaging scenarios.

2.3. Disease Progression Reasoning

Early approaches to disease progression reasoning primarily focused on single-modality visual change detection. Traditional computer vision techniques, as well as deep neural architectures such as convolutional neural networks (CNNs) [23] and recurrent models (e.g., LSTMs) [24], have been explored to identify visual differences in paired medical images. These studies laid the foundation for subsequent research on multimodal longitudinal modeling.
Unlike generic image captioning, disease progression captioning requires models to interpret subtle temporal variations in pathological findings, such as “slight improvement of consolidation.” This task demands accurate visual grounding, cross-temporal reasoning, and the ability to integrate multimodal cues into clinically coherent and context-aware textual descriptions.
Several recent methods have been proposed to address these challenges. CheXRelNet [11] jointly models local and global visual representations by leveraging a graph attention network to capture intra- and inter-image dependencies among regional features. RE-CAP [12] constructs a disease progression graph that explicitly encodes prior and current observations, pathological trends, and fine-grained attributes, and employs a dynamic reasoning mechanism to infer the evolution of individual clinical findings. BioViL-T [15] adopts a hybrid CNN–Transformer encoder for longitudinal image pairs together with a text encoder, and optimizes the framework using image-guided masked language modeling alongside both global and local contrastive objectives.
In contrast to these prior approaches, our model avoids explicit organ- or region-level extraction by introducing a dual-image encoder based on a cross-attention mechanism. The encoder jointly processes the current and historical images and fuses them into a unified, difference-aware visual representation that simultaneously captures static observations and temporal changes. In addition, we refine both the network architecture and training objectives to better model fine-grained visual differences, resulting in improved interpretability of chest X-ray changes and enhanced training efficiency.

3. Architecture of the MRID Framework

Given a current image $X_c$ and its historical reference image $X_p$, the objective of MRID is to learn a conditional generation model $P_\theta(R \mid X_c, X_p)$, where $R$ denotes the ground-truth radiology report corresponding to the current examination. During inference, the model produces a report $\hat{R}$ that accurately describes the clinical observations in the current image while incorporating longitudinal comparisons with the prior study.
As illustrated in Figure 1, the overall architecture of MRID consists of several core components, which can be grouped into three functional categories:
  • Uni-modal Encoders. This category includes a single-image encoder $\mathrm{Enc}_i$, a dual-image encoder $\mathrm{Enc}_d$, and a text encoder $\mathrm{Enc}_t$. These modules extract semantic representations from their respective modalities, where $\mathrm{Enc}_d$ further captures fine-grained spatial differences between the paired images.
  • Cross-Modal Encoders. Two cross-modal encoders $\mathrm{Cro}_t$ and $\mathrm{Cro}_i$ are built upon cross-attention mechanisms to enable bidirectional fusion between visual and textual modalities.
  • Text Decoder. The text decoder $\mathrm{Dec}_t$ consumes the aligned multimodal features and autoregressively generates a coherent report that jointly reflects current clinical findings and longitudinal disease progression.
We next detail the design of each component.

3.1. Single-Image Encoder and Text Encoder

Accurate disease progression reasoning relies on robust uni-modal representations of both images and texts. We employ a pretrained Vision Transformer (ViT) [25] from KIA [7] as the frozen single-image encoder $\mathrm{Enc}_i$ to extract stable and domain-specific visual semantics from individual chest X-rays. Given a current image $X_c$ and a historical reference image $X_p$, their visual representations are obtained as:
$$ I_c = \mathrm{Enc}_i(X_c) = [\mathrm{IMG}_c, i_c^1, \dots, i_c^N] \in \mathbb{R}^{(N+1)\times D}, \quad I_p = \mathrm{Enc}_i(X_p) = [\mathrm{IMG}_p, i_p^1, \dots, i_p^N] \in \mathbb{R}^{(N+1)\times D}, $$
where $\mathrm{IMG}_{c,p}$ denote global image embeddings, $i_{c,p}^n$ are patch-level visual tokens, $N$ is the number of image patches, and $D$ is the embedding dimension. During training, $\mathrm{Enc}_i$ is kept frozen to preserve reliable and stable single-image semantics learned from large-scale medical image–text data.
For the textual modality, MRID adopts BERT [26] as the text encoder $\mathrm{Enc}_t$, initialized with weights pretrained in KIA [7]. Given a radiology report $R$, its textual representation is:
$$ T = \mathrm{Enc}_t(R) = [\mathrm{TXT}, t^1, \dots, t^M] \in \mathbb{R}^{(M+1)\times D}, $$
where $\mathrm{TXT}$ denotes the global report embedding, $t^m$ are contextualized token representations, and $M$ is the report length. Unlike the image encoder, which is kept frozen during training, $\mathrm{Enc}_t$ is fine-tuned within the MRID framework to enhance the modeling of temporal change expressions in radiology reports.

3.2. Dual-Image Encoder

To model longitudinal changes without relying on explicit lesion localization or region-level supervision, we introduce a dual-image encoder $\mathrm{Enc}_d$ based on an asymmetric intra-visual cross-attention mechanism.
Following BioViL-T [15], given the visual features of the current image $I_c$ and the prior image $I_p$ extracted by $\mathrm{Enc}_i$, spatial and temporal priors are injected to form the initial representations:
$$ H_c^0 = I_c + S + \mathbf{1}_{N+1}\, t_{\mathrm{curr}}, \quad H_p = I_p + S + \mathbf{1}_{N+1}\, t_{\mathrm{prior}}, $$
where $S \in \mathbb{R}^{(N+1)\times D}$ denotes sinusoidal positional encodings, and $t_{\mathrm{curr}}, t_{\mathrm{prior}} \in \mathbb{R}^{1\times D}$ are learnable temporal embeddings indicating the chronological order of the image pair.
The dual-image encoder stacks $L$ cross-attention layers, where the current image features are updated using the prior image features as a fixed temporal reference. At the $l$-th layer, cross-attention is computed as:
$$ \mathrm{CrossAttn}\!\left(H_c^{l-1}, H_p\right) = \mathrm{softmax}\!\left( \frac{\left(H_c^{l-1} W_Q^l\right)\left(H_p W_K^l\right)^{\top}}{\sqrt{D}} \right) H_p W_V^l, $$
followed by a Pre-LN Transformer update with residual connections:
$$ H_c^l = H_c^{l-1} + \mathrm{CrossAttn}\!\left(\mathrm{LN}(H_c^{l-1}), H_p\right), \quad H_c^l = H_c^l + \mathrm{FFN}\!\left(\mathrm{LN}(H_c^l)\right). $$
This asymmetric design introduces a directional inductive bias that is consistent with the temporal semantics of disease progression, where historical observations provide contextual reference. After $L$ layers, the encoder outputs a unified, difference-aware visual representation:
$$ I_d = H_c^L = [\mathrm{IMG}_d, i_d^1, \dots, i_d^N] \in \mathbb{R}^{(N+1)\times D}. $$
The resulting representation $I_d$ simultaneously encodes static visual observations of the current image and temporal differences relative to the prior image. By modeling inter-image dependencies across the global feature space, the encoder discourages spurious differences caused by acquisition variability (e.g., pose, illumination) from dominating the learned representation.
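The cross-attention update above can be sketched in a few lines of NumPy. This is a minimal single-head illustration of one dual-image encoder layer, not the paper's implementation: the weight names, the parameter-free LayerNorm, and the two-layer ReLU FFN are simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token LayerNorm (no learned scale/shift, for brevity).
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def cross_attn(h_c, h_p, w_q, w_k, w_v):
    # softmax(Q K^T / sqrt(D)) V, with the prior image providing K and V.
    d = h_c.shape[-1]
    scores = (h_c @ w_q) @ (h_p @ w_k).T / np.sqrt(d)
    scores -= scores.max(-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(-1, keepdims=True)
    return attn @ (h_p @ w_v)

def dual_image_layer(h_c, h_p, w_q, w_k, w_v, w1, w2):
    # Asymmetric Pre-LN residual update: only the current-image stream
    # changes; the prior image h_p stays a fixed temporal reference.
    h_c = h_c + cross_attn(layer_norm(h_c), h_p, w_q, w_k, w_v)
    h_c = h_c + np.maximum(layer_norm(h_c) @ w1, 0) @ w2  # ReLU FFN
    return h_c
```

Stacking this layer $L$ times over the (N+1)-token sequences yields the difference-aware representation, with the prior stream reused unchanged at every depth.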

3.3. Cross-Modal Encoders

We employ two bidirectional cross-modal encoders $\mathrm{Cro}_t$ and $\mathrm{Cro}_i$ to enable fine-grained vision–language interaction. The text-side cross-modal encoder $\mathrm{Cro}_t$ takes the textual feature $T$ as the query (Q) and the visual representation $I_d$ as the key (K) and value (V). Through cross-attention, difference-aware visual cues are injected into the textual embedding space, allowing text tokens to attend to clinically relevant visual changes between the current and prior images. This design provides visual grounding for language modeling tasks related to disease progression. Conversely, the image-side cross-modal encoder $\mathrm{Cro}_i$ adopts the opposite information flow, conditioning visual representations on linguistic semantics to suppress irrelevant visual noise.
Both $\mathrm{Cro}_t$ and $\mathrm{Cro}_i$ are implemented by inserting cross-attention layers into the text encoder and the dual-image encoder, respectively. Through this design, cross-modal interaction is selectively enabled by different training objectives to support image–text alignment.

3.4. Text Decoder

The text decoder $\mathrm{Dec}_t$ is responsible for generating the final radiology report from the aligned multimodal representations. It shares a similar Transformer backbone with the text-side cross-modal encoder $\mathrm{Cro}_t$, but replaces bidirectional self-attention with causal (masked) attention to support autoregressive text generation. During decoding, $\mathrm{Dec}_t$ conditions on previously generated text tokens and attends to the difference-aware visual representation $I_d$ through cross-attention.

3.5. Parameter Sharing

Inspired by multimodal frameworks such as LLaVA [27], MRID adopts a cross-module parameter sharing strategy to improve parameter efficiency and promote synergistic optimization under multi-task self-supervised training. Specifically, parameter sharing is applied to the following components:
  • The bidirectional self-attention layers in $\mathrm{Enc}_d$ and $\mathrm{Cro}_i$;
  • The bidirectional self-attention layers in $\mathrm{Enc}_t$ and $\mathrm{Cro}_t$;
  • The cross-attention layers in $\mathrm{Cro}_t$ and $\mathrm{Dec}_t$.
Taking the textual stream as an example, the computation at the $l$-th layer can be uniformly expressed as:
$$ T^l = \begin{cases} T^{l-1} + \mathrm{BiSelfAttn}\!\left(\mathrm{LN}(T^{l-1})\right), & \text{in the encoding stage,} \\ T^{l-1} + \mathrm{CausalAttn}\!\left(\mathrm{LN}(T^{l-1})\right), & \text{in the decoding stage.} \end{cases} $$
When cross-modal interaction is required, visual features are incorporated through a shared cross-attention layer:
$$ T^l = T^l + \mathrm{CrossAttn}\!\left(\mathrm{LN}(T^l), I_d\right), $$
followed by a feed-forward network:
$$ T^l = T^l + \mathrm{FFN}\!\left(\mathrm{LN}(T^l)\right). $$
Through this parameter sharing mechanism, MRID encourages consistent attention patterns across encoding and generation, ultimately improving the model’s generalization ability under limited supervision.
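The sharing pattern amounts to one set of projection weights serving both the bidirectional (encoding) and causal (decoding) passes, with only the attention mask differing. A minimal NumPy sketch of this idea, with illustrative single-head weights:

```python
import numpy as np

def shared_self_attn(x, w_q, w_k, w_v, causal=False):
    # One shared set of Q/K/V projections; the encoder uses it with a
    # full (bidirectional) mask, the decoder with a causal mask.
    d = x.shape[-1]
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(d)
    if causal:
        # Block attention to future tokens.
        scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -1e9)
    scores -= scores.max(-1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(-1, keepdims=True)
    return p @ (x @ w_v), p
```

Because the same `w_q`, `w_k`, `w_v` are reused in both modes, gradients from the masked-modeling, matching, and generation objectives all flow into one parameter set, which is the source of the parameter efficiency described above.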

4. Multi-Task Self-Supervised Learning

To optimize MRID for fine-grained cross-modal reasoning, we adopt a multi-task self-supervised training strategy that jointly incorporates five complementary objectives: masked report modeling (MRM), masked image difference modeling (MIDM), image–report contrastive learning (IRC), image–report matching prediction (IRM), and report generation (RG). As illustrated in Figure 2, these objectives can be conceptually organized into three functional levels:
  • Representation modeling objectives (MRM and MIDM), which strengthen the model’s understanding of textual semantic structure and fine-grained visual difference representations through masked modeling;
  • Cross-modal alignment objectives (IRC and IRM), which enforce semantic consistency between visual and textual representations in a shared embedding space and enhance discriminative cross-modal reasoning;
  • Generation objective (RG), which guides the model to synthesize coherent and clinically meaningful radiology reports by integrating learned multimodal information.
The detailed formulation and implementation of each training objective are described in the following subsections.

4.1. Domain-Aware Masked Report Modeling

Following standard VLP practice [28,29,30], we adopt masked report modeling (MRM) as an objective, where the model predicts masked report tokens conditioned on the unmasked textual context and the paired images. Given a report $R$ with masked tokens $w_m$, the objective is defined as:
$$ \mathcal{L}_{\mathrm{MRM}} = -\,\mathbb{E}_{(X_c, X_p, R) \sim \mathcal{D}}\left[ \log P_\theta\!\left(w_m \mid R_{\backslash m}, X_c, X_p\right) \right]. $$
To strengthen the visual grounding effect of MRM, we adopt a domain-aware masking strategy with a prioritized two-stage design.
First, guided by the disease progression–related vocabulary listed in Table A1, tokens explicitly describing disease changes are masked with a high probability (80%). This step ensures that clinically critical progression cues are preferentially removed, forcing the model to rely on visual evidence for accurate recovery.
Second, for the remaining tokens, TF–IDF scores computed over the MIMIC-CXR corpus are used to estimate token importance. These tokens are ranked by their TF–IDF values and masked with different probabilities according to four percentile bins: 80% for the top 10%, 50% for the 10–30% range, 20% for the 30–60% range, and 10% for the remaining 60–100% tokens. This graded masking scheme balances the suppression of informative words with the preservation of report readability.
This design encourages the model to attend to difference-aware visual representations, promoting fine-grained semantic alignment between textual descriptions and longitudinal image changes.
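The two-stage masking scheme above maps each token to a masking probability. A minimal sketch of that mapping; the progression vocabulary set and the percentile-rank convention (0 = highest TF–IDF) are assumptions standing in for Table A1 and the MIMIC-CXR statistics:

```python
def mask_probability(token, tfidf_rank, vocab_progression):
    """Return the masking probability for one report token.

    tfidf_rank: percentile rank in [0, 1), 0.0 = highest TF-IDF score.
    vocab_progression: set of disease-progression words (cf. Table A1).
    """
    # Stage 1: progression vocabulary is masked with high probability.
    if token.lower() in vocab_progression:
        return 0.80
    # Stage 2: graded masking by TF-IDF percentile bin.
    if tfidf_rank < 0.10:
        return 0.80   # top 10% most informative tokens
    if tfidf_rank < 0.30:
        return 0.50   # 10-30%
    if tfidf_rank < 0.60:
        return 0.20   # 30-60%
    return 0.10       # remaining 60-100%
```

During training, each token would then be replaced by the mask symbol with its assigned probability before the MRM loss is computed.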

4.2. Masked Image Difference Modeling

Conventional masked image modeling (MIM) methods typically operate on single images and focus on reconstructing local structures or textures in the pixel space [31]. In contrast, MRID aims to model semantic differences between paired images, and the dual-image encoder $\mathrm{Enc}_d$ is based on cross-attention in a high-level feature space, which is not suitable for precise pixel-level reconstruction. Based on these considerations, we introduce masked image difference modeling (MIDM) by shifting the masking objective from the pixel space to the difference representation space.
For each training triplet $(X_c, X_p, R)$, masking is applied only to the current image $X_c$, with a masking ratio of 60%, following SimMIM [32]. Through the image-side cross-modal encoder $\mathrm{Cro}_i$, the masked current-image features serve as queries and interact with the reference image and textual features to predict the difference-aware representation $\hat{I}_d$.
During training, the unified visual representation $I_d$ obtained from the complete input $(X_c, X_p)$ is used as the reconstruction target. The reconstruction loss is computed only on the masked positions $\mathcal{M}$, while gradients are stopped on the target branch. The MIDM objective is defined as:
$$ \mathcal{L}_{\mathrm{MIDM}} = \mathbb{E}_{(X_c, X_p, R) \sim \mathcal{D}}\left[ \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{I}_d^{(i)} - \mathrm{stopgrad}\!\left(I_d^{(i)}\right) \right\|_2^2 \right]. $$
By operating in the difference-aware feature space, MIDM provides an additional self-supervised signal that supports the learning of semantically meaningful inter-image representations and reduces sensitivity to spurious variations.
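The masked reconstruction term reduces to a mean squared error restricted to the masked token positions. A minimal NumPy sketch (with plain arrays, the stop-gradient on the target branch is implicit; in a framework it would be an explicit `detach`):

```python
import numpy as np

def midm_loss(pred, target, mask_idx):
    """MSE between predicted and target difference features,
    averaged over masked positions only.

    pred, target: (num_tokens, D) arrays; mask_idx: masked row indices.
    """
    diff = pred[mask_idx] - target[mask_idx]
    # Squared L2 norm per masked token, then mean over |M| positions.
    return float((diff ** 2).sum(-1).mean())
```

Note that any discrepancy at unmasked positions contributes nothing to the loss, mirroring the sum over $i \in \mathcal{M}$ in the objective.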

4.3. Image–Report Contrastive Learning

Contrastive learning has proven effective for cross-modal representation alignment in multimodal frameworks [33]. Following this paradigm, we introduce a momentum-based image–report contrastive learning (IRC) objective to align the global semantics of difference-aware visual representations with their corresponding radiology reports.
Specifically, two projection heads are applied to the visual CLS token $\mathrm{IMG}_d$ from $I_d$ and the textual CLS token $\mathrm{TXT}$ from $T$, respectively, mapping them into a shared embedding space for contrastive learning. We adopt a symmetric InfoNCE loss with a momentum-updated queue of negative samples. For the image-to-report (I2R) direction, the probability of matching an image representation $I_d$ with the $k$-th textual embedding $T_k^m$ in the momentum queue is defined as:
$$ p_k^{\mathrm{I2R}}(I_d) = \frac{\exp\!\left(\mathrm{sim}(I_d, T_k^m)/\tau\right)}{\sum_{j=1}^{K} \exp\!\left(\mathrm{sim}(I_d, T_j^m)/\tau\right)}, $$
where $\mathrm{sim}(\cdot)$ denotes cosine similarity of CLS tokens, $\tau$ is the temperature hyperparameter, and $K$ is the queue size. Similarly, for the report-to-image (R2I) direction:
$$ p_k^{\mathrm{R2I}}(T) = \frac{\exp\!\left(\mathrm{sim}(I_{d,k}^m, T)/\tau\right)}{\sum_{j=1}^{K} \exp\!\left(\mathrm{sim}(I_{d,j}^m, T)/\tau\right)}. $$
The overall IRC objective is defined as the sum of cross-entropy losses in both directions:
$$ \mathcal{L}_{\mathrm{IRC}} = \mathbb{E}_{(X_c, X_p, R) \sim \mathcal{D}}\left[ \mathcal{L}_{\mathrm{CE}}\!\left(y^{\mathrm{I2R}}, p^{\mathrm{I2R}}\right) + \mathcal{L}_{\mathrm{CE}}\!\left(y^{\mathrm{R2I}}, p^{\mathrm{R2I}}\right) \right]. $$

4.4. Image–Report Matching Prediction

While IRC enforces global semantic alignment at the embedding level, IRM provides explicit instance-level supervision to refine visual grounding. IRM explicitly predicts whether a difference-aware image representation $I_d$ and a radiology report $T$ correspond to each other.
A special token $[\mathrm{MATCH}]$ is prepended to the text sequence, and the augmented text is jointly processed with image features by the encoder $\mathrm{Cro}_t$. After cross-modal interaction, the contextualized embedding of the $[\mathrm{MATCH}]$ token, denoted as $h_{\mathrm{match}}$, serves as a holistic representation of the global image–text alignment state. A lightweight binary classification head is applied to $h_{\mathrm{match}}$ to predict the matching probability:
$$ p^{\mathrm{IRM}}(I_d, T) = \sigma\!\left(w^{\top} h_{\mathrm{match}}\right), $$
where $\sigma(\cdot)$ denotes the sigmoid function and $w$ is the learnable classifier weight. The IRM objective is optimized using binary cross-entropy loss:
$$ \mathcal{L}_{\mathrm{IRM}} = \mathbb{E}_{(X_c, X_p, R) \sim \mathcal{D}}\left[ \mathcal{L}_{\mathrm{BCE}}\!\left(y^{\mathrm{IRM}}, p^{\mathrm{IRM}}(I_d, T)\right) \right], $$
where $y^{\mathrm{IRM}} \in \{0, 1\}$ indicates whether the image pair and report are matched.
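The matching head is a single linear layer followed by a sigmoid, trained with binary cross-entropy. A minimal stdlib sketch of the per-sample loss (the vectors and weight are illustrative stand-ins for $h_{\mathrm{match}}$ and $w$):

```python
import math

def irm_loss(h_match, w, y):
    """Binary cross-entropy for the [MATCH]-token classifier.

    h_match, w: equal-length lists of floats; y: 0 or 1 match label.
    """
    logit = sum(wi * hi for wi, hi in zip(w, h_match))
    p = 1.0 / (1.0 + math.exp(-logit))       # sigmoid(w . h_match)
    eps = 1e-12                               # numerical guard
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
```

A confidently positive logit is cheap when the pair is truly matched (y = 1) and expensive when it is mismatched (y = 0), which is the discriminative signal IRM adds on top of IRC.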

4.5. Report Generation

Report generation (RG) serves as the primary downstream objective of MRID. During training, a special start token $[\mathrm{START}]$ is prepended to the target report sequence, and the text decoder generates the report in an autoregressive manner, conditioning each token on all previously generated tokens and the difference-aware visual representation $I_d$. The generation objective is defined as the standard cross-entropy loss:
$$ \mathcal{L}_{\mathrm{RG}} = -\,\mathbb{E}_{(X_c, X_p, R) \sim \mathcal{D}}\left[ \sum_{i=1}^{|R|} \log P_\theta\!\left(t_i \mid t_{<i}, I_d\right) \right], $$
where $|R|$ denotes the report length and $t_{<i}$ includes the $[\mathrm{START}]$ token and all preceding tokens.
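At inference time, the autoregressive factorization above corresponds to a token-by-token decoding loop. A minimal greedy-decoding sketch; the `step_fn` interface (prefix in, next-token distribution out, with conditioning on $I_d$ hidden inside it) is an assumption for illustration, and real systems often use beam search instead:

```python
def greedy_decode(step_fn, start_token, end_token, max_len=100):
    """Autoregressively pick the most probable next token.

    step_fn(prefix) -> dict mapping candidate tokens to probabilities,
    conditioned internally on the visual representation I_d.
    """
    tokens = [start_token]
    for _ in range(max_len):
        probs = step_fn(tokens)
        nxt = max(probs, key=probs.get)  # greedy choice
        tokens.append(nxt)
        if nxt == end_token:
            break
    return tokens[1:]  # drop [START]
```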

4.6. Curriculum Learning

MRID jointly optimizes five training objectives—MRM, MIDM, IRC, IRM, and RG—using a weighted multi-task loss:
$$ \mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{MRM}} + \lambda_2 \mathcal{L}_{\mathrm{MIDM}} + \lambda_3 \mathcal{L}_{\mathrm{IRC}} + \lambda_4 \mathcal{L}_{\mathrm{IRM}} + \lambda_5 \mathcal{L}_{\mathrm{RG}}, $$
where $\lambda_{1}$–$\lambda_{5}$ denote task-specific loss weights.
Directly optimizing all objectives from scratch can lead to unstable training due to conflicting learning signals. Since these objectives differ in learning difficulty and influence on generation, MRID adopts a curriculum learning strategy that adjusts task weights across training stages. As shown in Table 1, the training process is divided into four stages:
  • Stage 1: Representation and alignment warm-up. MRM, MIDM, and IRC are activated to learn difference-aware visual representations and initial cross-modal alignment. This stage establishes a stable multimodal foundation and prevents early overreliance on linguistic priors.
  • Stage 2: Fine-grained discrimination enhancement. IRM is introduced to improve instance-level discrimination between matched and mismatched image–report pairs, while RG is incorporated with a small weight. This stage strengthens cross-modal reasoning and gradually adapts the model to generation without disrupting learned representations.
  • Stage 3: Generation-oriented joint fine-tuning. The weight of RG is increased, while other objectives serve as auxiliary regularization.
  • Stage 4: Task-specific fine-tuning for report generation. Only RG is retained for optimization using the pretrained encoders and difference modeling modules.
Overall, this stage-wise curriculum allows MRID to progressively transition from representation learning to cross-modal alignment and finally to generation modeling, ensuring stable optimization and robust longitudinal reasoning.
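The four-stage schedule can be expressed as a per-stage table of loss weights. The numeric values below are placeholders that only illustrate the schedule's shape (warm-up objectives on, RG ramping up, then RG alone); the actual weights are those reported in Table 1:

```python
# Illustrative stage-wise loss weights (placeholders, not Table 1's values).
CURRICULUM = {
    1: dict(mrm=1.0, midm=1.0, irc=1.0, irm=0.0, rg=0.0),  # warm-up
    2: dict(mrm=1.0, midm=1.0, irc=1.0, irm=1.0, rg=0.1),  # + IRM, small RG
    3: dict(mrm=0.5, midm=0.5, irc=0.5, irm=0.5, rg=1.0),  # RG-oriented
    4: dict(mrm=0.0, midm=0.0, irc=0.0, irm=0.0, rg=1.0),  # RG only
}

def total_loss(stage, losses):
    """Weighted multi-task loss for the given curriculum stage.

    losses: dict of per-objective loss values, e.g. {"mrm": ..., "rg": ...}.
    """
    w = CURRICULUM[stage]
    return sum(w[k] * losses[k] for k in losses)
```

Switching stages thus changes only the weight vector, not the model or the data pipeline, which keeps the transition between phases stable.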

5. Dataset and Data Preprocessing

We train and evaluate the proposed MRID framework on multiple publicly available chest X-ray datasets.

5.1. Training and Evaluation Dataset

MIMIC-CXR [34] is one of the largest publicly available chest X-ray image–text datasets, containing 473,057 chest radiographs and 227,835 radiology reports from 63,478 patients. The dataset indexes studies by acquisition date and time, enabling chronological ordering of multiple examinations for the same patient and naturally supporting longitudinal modeling. Approximately 67% of patients have at least two examinations, covering disease progression across different stages [15]. MIMIC-CXR is primarily used for MRID pretraining and main experimental evaluation, following the official data splits.
Longitudinal-MIMIC [35] is a longitudinal subset constructed from MIMIC-CXR to support patient-level follow-up modeling and evaluation. It includes 26,625 patients with at least two visits, and forms longitudinal samples using adjacent visit pairs, which helps maintain temporal consistency while reducing cross-interval noise. The dataset contains 95,169 longitudinal samples and is mainly used for quantitative evaluation of longitudinal report generation.
MS-CXR-T [15] is a multimodal benchmark dataset designed for longitudinal chest X-ray analysis, aiming to assess vision–language models’ ability to understand disease evolution over time. It comprises two tasks: (1) a temporal image classification task, consisting of 1326 image pairs annotated with progression labels (improved, stable, or worsened) for five findings—Consolidation, Edema, Pleural Effusion, Pneumonia, and Pneumothorax; (2) a temporal text similarity task, containing 361 pairs of medical sentences for evaluating semantic consistency or contradiction in disease progression descriptions. This dataset is used to provide complementary evaluation of MRID’s longitudinal disease modeling capability.

5.2. Data Preprocessing

For the image modality, only frontal-view chest radiographs (PA/AP) are retained to ensure view consistency. For computational feasibility and fair comparison with prior methods, all images are resized to a unified resolution of 256 × 256 during training, followed by random rotation within ±10° and random cropping to 224 × 224 as the final input.
For the text modality, radiology reports are structurally parsed using the rule-based tools provided in the official MIMIC codebase [34]. Only the Findings and Impression sections are retained. In addition, sentence shuffling is applied to increase linguistic diversity.
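The image pipeline above (resize to 256 × 256, random rotation within ±10°, random crop to 224 × 224) would normally be expressed with torchvision transforms; the dependency-free NumPy sketch below makes the geometry explicit. Function names are ours, not from the MRID codebase, and nearest-neighbour interpolation stands in for whatever interpolation the authors actually use.

```python
import numpy as np

def resize_nn(img, size):
    """Nearest-neighbour resize of an HxWxC array to (size, size)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def rotate_nn(img, angle_deg):
    """Rotate about the image centre (nearest neighbour, zero padding)."""
    h, w = img.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse mapping: each output pixel looks up its source pixel
    sy = np.round(cy + (ys - cy) * np.cos(theta) - (xs - cx) * np.sin(theta)).astype(int)
    sx = np.round(cx + (ys - cy) * np.sin(theta) + (xs - cx) * np.cos(theta)).astype(int)
    valid = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def augment(img, rng):
    """Resize to 256, rotate within +/-10 degrees, random-crop to 224."""
    img = resize_nn(img, 256)
    img = rotate_nn(img, rng.uniform(-10, 10))
    top, left = rng.integers(0, 256 - 224 + 1, size=2)
    return img[top:top + 224, left:left + 224]
```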

5.3. Temporal-Reversal Data Augmentation

Longitudinal chest X-ray datasets often exhibit a pronounced imbalance in temporal change patterns [10], where worsening-related findings are substantially more frequent than improvement-related changes. This skewed distribution limits the model's ability to learn bidirectional disease progression trajectories, especially under data-scarce conditions. To enrich the diversity of temporal change patterns observed during pretraining and alleviate data scarcity, we propose a temporal-reversal text augmentation strategy that reconstructs clinically consistent reports for reversed image orders.
As illustrated in Figure 3, given a paired study consisting of a current image X_c with report R_c and a prior image X_p with report R_p, the augmentation procedure comprises four steps:
(1) Static–Dynamic Report Decomposition. Using a curated vocabulary of disease progression cues from BioViL-T [15] (Table A1 in the Appendix A), each report is decomposed into static descriptions R^static, which characterize anatomical structures and lesion states, and dynamic descriptions R^dyn, which explicitly describe temporal disease changes. This decomposition isolates longitudinal semantics while preserving the fundamental image–text correspondence.
(2) Dynamic Description Inversion. Each sentence in R_c^dyn, corresponding to a single atomic description of temporal change, is rewritten using a medical-domain large language model (Baichuan-M2 [36]) to express the opposite directional progression (e.g., worsened → improved), while stability-related statements remain unchanged. The resulting inverted dynamic descriptions are denoted as R_inv^dyn.
(3) Consistency Verification. To mitigate potential semantic drift introduced by LLM generation, each inverted dynamic sentence is independently evaluated through an LLM-based self-consistency verification process. A report-level all-or-nothing strategy is adopted: if any inverted dynamic sentence within a report fails the consistency check, the entire reversed sample is discarded and no temporal reversal is applied to that image–report pair.
(4) Reversed Triplet Construction. For accepted samples, a new report R_inv is constructed by concatenating the static descriptions from the prior report with the inverted dynamic descriptions (Equation (19)). By reversing the image order, we obtain a temporally reversed triplet (X_p, X_c, R_inv), which remains visually and semantically consistent under the reversed timeline.
R_inv = R_p^static ⊕ R_inv^dyn,  (19)
where ⊕ denotes sentence-level concatenation.
As symmetric semantic perturbations, these temporally reversed samples are incorporated only during the first two stages of the curriculum learning process. They are excluded from later training stages and downstream report generation, ensuring that the final model is optimized solely on clinically valid temporal sequences. Additional details regarding prompt design, failure cases, and bias analysis are provided in Appendix B.
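The four steps above can be sketched end to end. The LLM inversion and consistency verification are reduced to injectable callables (invert_fn and verify_fn stand in for the Baichuan-M2 prompts in Appendix B), and sentence splitting on ". " is a deliberate simplification of the actual report parsing; all names are ours.

```python
def decompose(report, cue_words):
    """Step (1): split a report into static and dynamic sentences
    using a progression-cue vocabulary (Table A1)."""
    static, dynamic = [], []
    for sent in report.split(". "):
        bucket = dynamic if any(w in sent.lower() for w in cue_words) else static
        bucket.append(sent)
    return static, dynamic

def temporal_reverse(curr_report, prior_report, cue_words, invert_fn, verify_fn):
    """Steps (2)-(4): invert the current report's dynamic sentences,
    verify them, and rebuild the reversed report per Equation (19)."""
    prior_static, _ = decompose(prior_report, cue_words)
    _, curr_dyn = decompose(curr_report, cue_words)
    inverted = [invert_fn(s) for s in curr_dyn]              # Step (2): LLM inversion
    if not all(verify_fn(s, inv) for s, inv in zip(curr_dyn, inverted)):
        return None                                          # Step (3): all-or-nothing rejection
    # Step (4): prior static descriptions + inverted dynamic descriptions
    return ". ".join(prior_static + inverted)
```

With the image order also swapped, the returned string plays the role of R_inv in the reversed triplet.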

6. Experiments

6.1. Implementation Details

Both the single-image encoder Enc_i and the text encoder Enc_t are initialized from components pretrained in the static radiology report generation framework KIA [7]. Specifically, Enc_i adopts a 12-layer ViT-B/16 architecture and is kept frozen during all training stages, serving as a fixed visual feature extractor, whereas Enc_t is implemented as a BERT-based text encoder and is further fine-tuned end-to-end together with the proposed MRID modules.
All experiments are conducted using two NVIDIA RTX 4090 GPUs with a total batch size of 32. The model is optimized with the AdamW optimizer, using an initial learning rate of 1 × 10−4. A linear warm-up strategy followed by cosine annealing is applied to schedule the learning rate throughout training. To enhance the stability of multi-task optimization, gradient clipping based on the L2 norm is employed, with the maximum gradient norm set to 1.0, effectively mitigating potential gradient explosion during training.
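The learning-rate schedule and the clipping rule can be written out explicitly. In a PyTorch setup these would typically be a LambdaLR wrapping the schedule below plus a call to torch.nn.utils.clip_grad_norm_; the helper functions here are our own dependency-free sketch.

```python
import math

BASE_LR = 1e-4  # initial learning rate used in the paper

def lr_schedule(step, total_steps, warmup_steps, base_lr=BASE_LR):
    """Linear warm-up to base_lr, then cosine annealing to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a flat gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]
```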

6.2. Evaluation Metrics

To comprehensively evaluate model performance on report generation tasks, we adopt both natural language generation (NLG) metrics and clinical efficacy (CE) metrics.
  • Natural Language Generation Metrics
NLG metrics are used to measure the lexical and syntactic overlap between generated reports and reference reports. Specifically, we report BLEU [37], METEOR [38] and ROUGE-L [39], which are widely adopted in prior radiology report generation studies.
BLEU [37] evaluates local n-gram precision, reflecting word-level matching accuracy. METEOR [38] combines precision and recall with penalties for fragmentation and additionally accounts for linguistic variants such as stemming and synonymy, providing a more semantically informed assessment. ROUGE-L [39] measures sequence-level similarity based on the longest common subsequence, capturing global structural consistency.
All NLG metrics are computed over the entire test set and averaged to reduce variance caused by individual samples.
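Of these metrics, ROUGE-L is the simplest to make concrete: it is an LCS-based F-measure over candidate and reference tokens. The sketch below uses whitespace tokenization and a recall-weighted beta of 1.2, both implementation choices of ours rather than details taken from the paper.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ta == tb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(candidate, reference, beta=1.2):
    """ROUGE-L F-score between whitespace-tokenized sentences."""
    c, r = candidate.split(), reference.split()
    l = lcs_len(c, r)
    if l == 0:
        return 0.0
    prec, rec = l / len(c), l / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```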
  • Clinical Efficacy Metrics
To assess the medical validity of generated reports, we employ clinical efficacy (CE) metrics. Following prior work, we use CheXbert [40] as an automatic disease labeler to extract clinical findings from both generated and reference reports. The extracted labels are then compared at the disease level, and precision, recall, and F1-score are reported to quantify the consistency of clinical observations.
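Once CheXbert has reduced each report to a binary vector of findings, the CE scores reduce to standard precision/recall/F1 over those vectors. The sketch below micro-averages across all findings and reports; whether micro- or macro-averaging is used in the paper is not stated here, so treat the averaging mode as an assumption.

```python
def ce_metrics(pred_labels, ref_labels):
    """Micro-averaged precision / recall / F1 over binary finding labels.

    pred_labels, ref_labels: lists of equal-length 0/1 vectors, one per
    report, where each position marks presence of a CheXbert finding.
    """
    tp = fp = fn = 0
    for pred, ref in zip(pred_labels, ref_labels):
        for p, r in zip(pred, ref):
            tp += 1 if (p and r) else 0
            fp += 1 if (p and not r) else 0
            fn += 1 if (r and not p) else 0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```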

6.3. Report Generation Results

To evaluate the effectiveness of MRID in radiology report generation, we conduct experiments under two complementary settings that reflect both conventional and longitudinal clinical scenarios:
(1) Radiology Report Generation (RRG). In this setting, the input consists of a single chest X-ray from the MIMIC-CXR dataset, and only the current image representation I_c is provided as visual evidence.
(2) Longitudinal Radiology Report Generation (LRRG). The input of this setting includes both the current chest X-ray and the temporally closest prior examination of the same patient from the Longitudinal-MIMIC dataset. The model is required to generate a report that not only accurately describes the current imaging findings but also correctly captures disease progression or stability relative to the reference study.
We compare MRID against a diverse set of representative baselines spanning both static and longitudinal radiology report generation paradigms. Static single-image models—R2Gen [41], KiUT [42], and KIA [7]—generate reports from individual CXRs without modeling temporal change, whereas longitudinal models—BioViL-T [15], HERGen [17], AoANet [43], PromptMRG [44], TiBiX [16], and RECAP [12]—explicitly incorporate paired CXRs to capture disease progression dynamics.
Table 2 summarizes the results on both the MIMIC-CXR and Longitudinal-MIMIC datasets. MRID demonstrates consistent advantages across both RRG and LRRG tasks, with particularly strong performance on CE metrics, which are more closely aligned with clinical correctness. Compared with the second-best baseline, MRID achieves an average relative improvement of 7.3% on CE metrics in the RRG setting and 2.3% in the LRRG setting, indicating a more reliable preservation of disease-level semantics in generated reports.
Given that CE metrics evaluate consistency at the level of clinical findings rather than surface text overlap, these improvements suggest that leveraging longitudinal supervision signals enables more effective fine-grained alignment between visual representations and clinically meaningful textual descriptions.
In the static RRG task, MRID replaces the unified difference-aware representation I_d with the single-image representation I_c, which introduces a representation mismatch and partly accounts for the smaller gains on surface-level NLG metrics. In contrast, in the LRRG setting, where I_d is fully utilized, MRID achieves clear and consistent improvements on NLG metrics, reflecting more accurate modeling of longitudinal changes.
In addition, some competing methods, such as BioViL-T [15], adopt higher input image resolutions (e.g., 448 × 448), which provide an inherent advantage in capturing fine-grained lesion details. Despite operating on lower-resolution inputs (224 × 224), MRID achieves higher CE performance, while the differences in NLG metrics compared with the best-performing baseline remain within ±0.03 points. This suggests that MRID’s performance improvements are primarily driven by effective fine-grained modeling of longitudinal visual differences, rather than increased input resolution or model capacity.
Taken together, these results demonstrate that MRID not only maintains strong performance in conventional single-image report generation but also offers more substantial and clinically meaningful gains in longitudinal settings, validating the effectiveness of difference-aware representation learning and longitudinal supervision for radiology report generation.

6.4. Disease Progression Classification Results

The disease progression classification task aims to determine the evolution trend of specific clinical findings by comparing imaging differences between two examinations. Consistent with the evaluation protocol of BioViL-T [15], we formulate this task under an autoregressive text generation paradigm.
Table 3 reports the disease progression classification results of MRID and representative baseline methods on the MS-CXR-T dataset, in terms of macro-accuracy averaged across the three progression states for each disease category. Overall, MRID outperforms the baseline BioViL-T [15] on most disease categories. Under zero-shot and few-shot settings, MRID achieves moderate improvements of 2.3% and 4.1% in average accuracy, respectively. Although present, the advantage in these low-supervision scenarios is not fully pronounced. This can be attributed to the training objective of MRID, which aligns a unified, difference-aware visual representation with complete radiology reports, embedding both static findings and longitudinal change semantics. When progression classification is performed based on a single textual prompt in zero- or few-shot inference, the semantic granularity of the prompt may not fully match the rich multimodal representation learned during training, leading to performance fluctuations.
In contrast, under the fully supervised setting, MRID exhibits a substantially larger improvement of 9.3% in macro-accuracy compared with BioViL-T [15]. With access to labeled training samples, the model can rapidly adapt the cross-modal embedding space to the target classification task, allowing its longitudinal difference modeling capability to be more effectively exploited. These results indicate that MRID benefits more significantly from explicit supervision, where the alignment between visual differences and progression labels can be directly reinforced.

6.5. Ablation Study

  • Effect of Difference-Aware Representation and Dual-Image Encoder Design
To analyze the role of the dual-image encoder design in longitudinal difference modeling, we compare four representative configurations, which differ only in whether and how longitudinal image differences are modeled, under a unified experimental setting. Specifically, we evaluate:
(a) KIA [7], a static radiology report generation model without explicit longitudinal modeling. KIA can be regarded as the architectural predecessor of MRID, as the single-image encoder in MRID is directly inherited from the KIA pre-trained model.
(b) MRID (Asymmetric), which incorporates difference-aware representation modeling using the proposed asymmetric dual-image encoder.
(c) MRID (Bidirectional Cross-Attention), where current and prior image features attend to each other in both directions, with fully shared parameters.
(d) MRID (Symmetric Fusion), where current and prior image tokens are concatenated and jointly processed by self-attention layers.
For all MRID variants, only the updated current-image tokens are used as the final visual representation, ensuring identical output dimensionality across settings. Since KIA is designed for single-time-point report generation, it is evaluated only on the static RRG task, whereas the MRID variants (b–d) are evaluated on the LRRG task and disease progression classification task. Quantitative results are summarized in Table 4.
Comparing the static baseline KIA (a) with MRID (b) highlights the effect of introducing explicit longitudinal difference modeling. As shown in Table 4, on the RRG task, MRID achieves a substantial improvement over KIA on clinical efficacy (CE) metrics, with an average gain of 8.8%. This improvement indicates that incorporating temporal difference information during training enables the model to better capture disease progression patterns, leading to more accurate disease-level descriptions even in static report generation.
The comparison among MRID variants (b–d) further examines how longitudinal differences should be modeled within the dual-image encoder. As reported in Table 4, the asymmetric encoder (b) consistently outperforms both bidirectional cross-attention (c) and symmetric fusion (d) on progression-sensitive metrics, including progression classification accuracy and CE metrics, while maintaining comparable performance on static anatomical descriptions.
To further analyze model behavior under non-monotonic progression, we evaluate performance on a subset of mixed-change samples, where different pathological findings exhibit heterogeneous temporal trends (e.g., improvement in one disease accompanied by worsening findings in another). As shown in Table 4, performance on the mixed-change subset is consistently lower than that on the full evaluation set across all models, indicating the difficulty of modeling heterogeneous progression patterns. Within this challenging setting, the asymmetric design outperforms the symmetric variants (c/d) in classification performance.
A plausible explanation is that in symmetric designs, the prior image is continuously updated through bidirectional interaction, weakening its role as a stable reference for comparison. When both current and prior representations evolve together, it becomes harder to clearly attribute observed differences to temporal progression. This effect is more pronounced in mixed-change cases, where heterogeneous regional trends may be averaged out. In contrast, the asymmetric design keeps the prior image as a fixed reference and updates only the current representation, allowing localized change cues to be retained.
Overall, these results indicate that the performance gains of MRID are not solely attributable to the introduction of difference-aware modeling, but also depend on enforcing a directional information flow that aligns with the semantics of disease progression reasoning.
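The directional information flow discussed above can be made concrete with a single-head cross-attention sketch: queries come only from the current image, while the prior tokens serve purely as keys and values, so the reference is never updated. This is a minimal NumPy illustration under our own naming, omitting multi-head projections, layer normalization, and feed-forward sublayers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_difference_layer(curr, prior, w_q, w_k, w_v):
    """One asymmetric cross-attention step: current tokens query the
    prior image; the prior acts as a fixed reference and is not updated."""
    q = curr @ w_q                      # queries from the current image only
    k, v = prior @ w_k, prior @ w_v     # keys/values from the frozen reference
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return curr + attn @ v              # residual update of current tokens only
```

A bidirectional variant would apply the same operation in both directions with shared weights; the ablation suggests that keeping the prior side frozen is what preserves a stable comparison reference.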
  • Effect of Auxiliary Tasks and Curriculum Learning
To systematically investigate the contribution of individual auxiliary learning objectives and the curriculum learning strategy, we design a series of ablation experiments with different training configurations.
Specifically, Setting (a) corresponds to training the model solely on the RG task without any auxiliary supervision. Settings (b–h) introduce different auxiliary tasks or their combinations during the pretraining stage, followed by fine-tuning exclusively on the RG task. Among them, Setting (h) jointly incorporates all five learning objectives during pretraining without any curriculum scheduling. Setting (i) further decomposes pretraining into two stages, first jointly optimizing MRM, MIDM, and IRC, and then introducing all five objectives. The full MRID framework extends Setting (i) by adopting a more fine-grained curriculum learning strategy with four training stages (Section 4.6), enabling progressively structured optimization.
From the results of Table 5, it can be observed that all settings incorporating auxiliary tasks (b–g) consistently outperform the baseline Setting (a). Among them, introducing image–report contrastive learning (IRC) yields the most significant improvement, increasing NLG and CE metrics by 8.6% and 24.1%, respectively. This highlights the critical role of IRC in establishing robust global semantic alignment between visual representations and radiology reports.
In contrast, masked report modeling (MRM) mainly benefits surface-level language quality, leading to an average 9.5% improvement on NLG metrics, which underscores its effectiveness in enhancing textual modeling capability. Meanwhile, masked image difference modeling (MIDM) and image–report matching prediction (IRM) primarily improve clinical efficacy, with gains of 12.0% and 13.3%, respectively, suggesting that these objectives are particularly effective in strengthening fine-grained semantic understanding of disease progression.
A further analysis reveals that directly introducing all five objectives simultaneously during pretraining without curriculum learning (Setting h) leads to degraded performance, with decreases of 8.1% and 1.7% on NLG and CE, respectively. This phenomenon indicates that, in the absence of proper training scheduling, strong gradient competition among multiple objectives can hinder effective representation learning and may even prevent stable convergence.
Finally, comparing Setting (i) with the full MRID training strategy demonstrates that a more structured and fine-grained curriculum learning design yields additional performance gains. In particular, introducing a generation-oriented fine-tuning stage—where RG serves as the primary objective while auxiliary tasks are retained as weak regularizers—proves to be essential.
  • Effect of Temporal-Reversal Data Augmentation
To evaluate the effectiveness of the temporal-reversal data augmentation strategy proposed in Section 5.3, we conduct a set of ablation experiments under four different configurations. Specifically, setting (a) corresponds to training MRID without any data augmentation, while Settings (b–d) apply the temporal-reversal augmentation at different stages of the curriculum learning process.
The models are evaluated on both the longitudinal radiology report generation (LRRG) task and the disease progression classification task, with results summarized in Table 6. Compared with the baseline Setting (a), all augmented settings (b–d) consistently achieve improvements in CE metrics on the LRRG task as well as in classification accuracy for disease progression prediction. These gains indicate that the proposed augmentation strategy effectively encourages the model to attend to longitudinal changes and inter-image difference features, thereby enhancing its ability to capture disease evolution patterns.
However, as the augmented samples are introduced across a larger number of pretraining epochs, increased fluctuations are observed in surface-level NLG metrics. This phenomenon suggests that reports reconstructed by large language models may still exhibit distributional discrepancies compared with real reports. Excessive reliance on such reconstructed samples can therefore introduce noise into language modeling, particularly for metrics sensitive to lexical or syntactic variations.
Taking both performance gains and training stability into account, we adopt Setting (c) as the final configuration for MRID, as it provides the best trade-off between improved longitudinal supervision and stable language modeling.

6.6. Qualitative Result

(1) Case Study of Generated Reports
To qualitatively assess the effectiveness of MRID in modeling longitudinal disease progression, we present two representative case studies in Figure 4, comparing the reports generated by MRID with those produced by the baseline BioViL-T [15], alongside the corresponding ground-truth reports.
In case (a), BioViL-T captures worsening lung consolidations but incorrectly reports the absence of pleural effusion and describes cardiomegaly as mildly enlarged, thereby distorting key longitudinal semantics. In contrast, MRID correctly identifies unchanged cardiomegaly, acknowledges the presence of a right pleural effusion, and reflects the worsening trend of pulmonary edema, resulting in a more clinically coherent progression description.
Case (b) focuses on interval improvement and stability across multiple findings. While BioViL-T recognizes the presence of several medical devices and describes them as being in satisfactory position, MRID further captures their unchanged positioning across examinations. More importantly, MRID correctly identifies mild interstitial pulmonary edema as slightly improved, whereas BioViL-T characterizes it as stable, indicating that MRID exhibits stronger sensitivity to temporal directionality in disease progression assessment. Both models fail to explicitly identify the persistent pneumothorax in this case, revealing a shared limitation in detecting subtle or loculated abnormalities.
(2) Heatmap Visualization of Image Representations
Figure 4 visualizes the attention heatmaps of the single-image encoder and the dual-image encoder in MRID, revealing a clear progression in representation focus between the two modules. The single-image encoder primarily attends to anatomically salient regions, such as the lung fields and the cardiac silhouette, reflecting its ability to capture global structural and pathological cues from an individual chest X-ray. The resulting attention patterns remain largely structure-oriented and do not explicitly differentiate temporally evolving abnormalities from static findings.
In contrast, after incorporating the historical reference image, the dual-image encoder exhibits a distinct shift in attention behavior. The attention distribution becomes more localized and asymmetric, concentrating on difference-sensitive regions that correspond to meaningful inter-image variations, such as areas associated with mildly improved interstitial pulmonary edema. This change indicates that the dual-image encoder is capable of selectively highlighting regions relevant to disease progression rather than static anatomical structures.
Moreover, when the temporal order of the image pair is reversed, noticeable changes emerge in the attention patterns of the dual-image encoder. This observation suggests that the learned representations are sensitive to temporal conditioning, capturing not only the presence of inter-image differences but also their chronological direction. Such direction-aware attention enables MRID to reason about disease progression trajectories rather than treating longitudinal image pairs as unordered sets.
To further substantiate the clinical interpretability and robustness of the learned difference-aware representations, we provide a more controlled qualitative analysis in Appendix C, where cross-attention patterns are examined across representative cases under temporal reversal, artificial occlusion, and geometric/intensity perturbations.

7. Discussion

7.1. Discussion of Experimental Findings

This work investigates longitudinal radiology report generation from the perspective of difference-aware representation learning. By integrating asymmetric dual-image encoding, multi-task self-supervision, and temporal-reversal data augmentation, MRID provides a unified approach for progression-aware reasoning. Compared with prior methods such as KIA [7] and BioViL-T [15], MRID explicitly conditions visual representations on historical reference images, enabling finer alignment between paired images and radiology reports. Both clinical efficacy gains and attention heatmap analyses indicate a shift in model focus from global anatomical structures toward temporally discriminative regions. Although designed for longitudinal settings, MRID also improves single-image report generation, suggesting that longitudinal supervision acts as an effective regularizer.

7.2. Extending Beyond Pairwise Longitudinal Modeling

In its current formulation, MRID operates in a pairwise setting, comparing the current image with a single historical reference, which aligns with common clinical comparison practice. Accordingly, MRID focuses on modeling the temporal ordering between paired images, and does not explicitly encode or reason about the absolute time interval separating the two studies.
Extending MRID to multi-timepoint longitudinal scenarios introduces additional challenges beyond pairwise comparison. Recent work [46] on spatiotemporal representation learning for medical time series has highlighted the importance of incorporating temporal distance as an explicit conditioning signal when modeling long-range disease trajectories. Motivated by these findings, a principled future direction is to learn an adaptive historical representation that aggregates multiple prior examinations while incorporating temporal distance through dedicated time embeddings. We leave the explicit modeling of absolute temporal intervals as future work.

7.3. Toward More Faithful Progression Evaluation

MRID is evaluated using complementary metrics, including standard NLG metrics, clinical efficacy scores, and temporal classification accuracy. These metrics capture textual similarity, disease-level recognition consistency, and coarse-grained progression states, respectively. However, they do not explicitly assess the sentence-level correctness of disease progression descriptions. As a result, favorable scores may still coexist with partially inaccurate or clinically implausible progression statements. Developing evaluation protocols that directly measure fine-grained temporal correctness and longitudinal consistency remains an open challenge. Addressing this limitation is important for advancing reliable longitudinal report generation.

7.4. Considerations on Modality Generalization

In the current implementation, MRID adopts the image encoder from the static chest X-ray report generation model KIA as a frozen module and initializes the text encoder from the same pretrained framework, thereby inheriting CXR-specific priors. Beyond this initialization, the asymmetric dual-image encoding, difference-aware objectives, and curriculum-based training strategy are not inherently modality-specific. These components operate on latent representations and model relative inter-image changes, making them applicable to other 2D longitudinal medical imaging modalities. For volumetric modalities such as CT or MRI, architectural adaptations such as slice aggregation or 3D encoders would be required. Such adaptations primarily affect the visual encoding stage and do not alter the core difference-aware modeling principle of MRID.

8. Conclusions

In this work, we propose MRID, a multi-task self-supervised framework that explicitly models radiological image differences to support disease progression reasoning. The proposed approach is built upon three key contributions.
First, we design an asymmetric dual-image encoder that conditions the representation of the current image on a historical reference, enabling implicit spatial alignment and fine-grained modeling of inter-image differences without relying on explicit lesion localization.
Second, we develop a set of complementary self-supervised objectives and a structured joint training strategy that aligns unified difference-aware visual representations with complete radiology reports, effectively bridging static findings and temporal progression within a shared semantic space.
Third, we introduce a novel data augmentation strategy that synthesizes direction-aware training samples, improving robustness under data imbalance while being carefully integrated into early-stage curriculum learning.
Extensive experiments demonstrate that MRID consistently improves clinical efficacy across RRG and LRRG tasks. These results confirm that explicitly modeling radiological image differences provides a principled and effective foundation for longitudinal vision–language learning and clinically meaningful disease progression analysis.

Author Contributions

Conceptualization, P.W.; methodology, P.W.; validation, P.W. and Y.C.; formal analysis, P.W.; investigation, P.W.; resources, P.W.; data curation, P.W.; writing—original draft preparation, P.W.; writing—review and editing, P.W., Y.H. and H.Z.; visualization, P.W. and Y.C.; supervision, Y.H. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Social Science Foundation Project of China (NSSFC), grant number 25SGC197.

Data Availability Statement

The three datasets used in this paper can be found at the following publicly available websites: https://www.physionet.org/content/mimic-cxr/2.0.0 (accessed on 12 September 2024), https://github.com/CelestialShine/Longitudinal-Chest-X-Ray (accessed on 27 November 2024), https://www.physionet.org/content/ms-cxr-t/1.0.0 (accessed on 27 November 2024).

Acknowledgments

During the preparation of this manuscript/study, the authors used Baichuan-M2 [36] for data augmentation. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VLP: Vision-Language Pretraining
MRM: Masked Report Modeling
MIDM: Masked Image Difference Modeling
IRC: Image–Report Contrastive Learning
IRM: Image–Report Matching Prediction
RRG: Radiology Report Generation
LRRG: Longitudinal Radiology Report Generation

Appendix A

Building upon the disease progression–related vocabulary curated in BioViL-T [15], we further expanded this list to construct Table A1, which is used to precisely mask clinically informative tokens during masked report modeling and to decompose each report into static and dynamic descriptions for data augmentation.
Table A1. Common Terms Used in Medical Reports to Describe Progression Changes in Disease Condition.
Disease Progression Type | Tokens
Improving | better, cleared, decreased, decreasing, improved, improving, reduced, resolved, resolving, resolution, smaller, alleviated, alleviating, diminished, diminishing
Stable | constant, stable, unchanged, no change, maintained
Worsening | bigger, developing, developed, enlarged, enlarging, greater, growing, increased, increasing, larger, new, newly, progressing, progressive, worse, worsened, worsening, extended, extending, extension
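As noted above, this vocabulary also drives targeted masking during masked report modeling. A minimal sketch follows, with the token set abbreviated, multi-word cues such as "no change" ignored, punctuation handling simplified, and all names our own.

```python
# Abbreviated subset of Table A1; the full lists appear in the table above.
PROGRESSION_TOKENS = {
    "improved", "resolved", "decreased", "smaller",   # improving
    "stable", "unchanged", "constant",                # stable
    "worsened", "increased", "enlarged", "new",       # worsening
}

def mask_progression_tokens(report, mask="[MASK]"):
    """Replace progression cue words with a mask token so that masked
    report modeling must recover the longitudinal semantics."""
    return " ".join(
        mask if tok.lower().strip(".,;:") in PROGRESSION_TOKENS else tok
        for tok in report.split()
    )
```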

Appendix B

This appendix provides a detailed description of the proposed temporal-reversal data augmentation strategy, including prompt design, representative failure cases, and a discussion of potential risks and biases associated with LLM-generated text.

Appendix B.1. Prompt Design

Text augmentation is performed using a medical-domain large language model, Baichuan-M2 [36], and consists of two stages: (1) dynamic description inversion and (2) consistency verification.
(1) Prompts for dynamic description inversion.
[System Prompt]
You are a medical radiology language model. Your task is to perform temporal polarity inversion on radiology report sentences that explicitly describe disease progression between two time points.
Processing protocol:
Step 1: Temporal polarity identification
Classify the sentence into exactly ONE of the following categories:
- Worsening
- Improvement
- Stable
Step 2: Polarity inversion
- If Worsening, rewrite the sentence as a clinically reasonable Improvement.
- If Improvement, rewrite the sentence as a clinically reasonable Worsening.
- If Stable, preserve the original meaning without introducing improvement or worsening.
Constraints:
- Keep disease entities, anatomical locations, and findings unchanged.
- Do NOT introduce new abnormalities or remove existing ones.
- Preserve radiology reporting style and level of certainty.
- Output ONE sentence only.
[Examples]
Example 1:
Sentence: “There is increased right pleural effusion compared to the prior study.”
Output: “There is decreased right pleural effusion compared to the prior study.”
Example 2:
Sentence: “The left lower lobe opacity has improved.”
Output: “The left lower lobe opacity has worsened.”
Example 3:
Sentence: “The cardiomediastinal silhouette is stable.”
Output: “The cardiomediastinal silhouette is unchanged.”
[User Prompt]
Sentence:
“{INPUT_SENTENCE}”
Rewrite the sentence following the protocol.
(2) Prompts for consistency verification.
[System Prompt]
You are an expert radiology report auditor. Your task is to judge whether an inverted sentence is a valid and clinically plausible temporal polarity inversion of an original radiology sentence for longitudinal data augmentation. This is a CONSERVATIVE audit task. If there is any uncertainty in temporal direction, entity preservation, comparability, or clinical plausibility, you MUST choose REJECT.
Judgment protocol:
Definitions:
 - Direction labels: Worsening, Improvement, Stable, Not-Comparable
 - Not-Comparable includes: (a) Sentences without explicit temporal comparison. (b) Sentences with speculative, equivocal, or conditional temporal language. (c) Sentences whose temporal direction cannot be reliably determined.
Acceptance criteria:
(1) If the original direction is Worsening or Improvement:
 - The inverted direction must be the opposite.
 - Disease entities and anatomical locations must be preserved.
 - No new findings may be introduced and no original findings removed.
 - Clinical plausibility MUST be satisfied.
(2) If the original direction is Stable:
 - The inverted sentence must remain Stable.
 - No improvement or worsening may be introduced.
(3) If the original direction is Not-Comparable, choose REJECT.
(4) If any criterion is violated, the decision must be REJECT.
Output format:
Return a JSON object with exactly the following fields and no extra text:
{
 "original_direction": "…",
 "inverted_direction": "…",
 "entity_match": true|false,
 "hallucination_or_omission": true|false,
 "decision": "ACCEPT|REJECT"
}
[User Prompt]
Original sentence: “{ORIGINAL_SENTENCE}”
Inverted sentence: “{INVERTED_SENTENCE}”
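Operationally, the two prompts above form a generate-then-audit loop. The sketch below assumes a generic `llm_chat(system, user) -> str` client (a hypothetical stand-in for the actual Baichuan-M2 API wrapper) and mirrors the audit's REJECT-by-default stance, including treating unparseable audit output as a rejection:

```python
import json
from typing import Callable, Optional

def augment_sentence(sentence: str,
                     llm_chat: Callable[[str, str], str],
                     invert_system: str,
                     verify_system: str) -> Optional[str]:
    """Two-stage augmentation: invert the temporal polarity of a sentence,
    then keep the result only if the conservative audit returns ACCEPT."""
    user_invert = f'Sentence:\n"{sentence}"\nRewrite the sentence following the protocol.'
    inverted = llm_chat(invert_system, user_invert).strip()

    user_verify = f'Original sentence: "{sentence}"\nInverted sentence: "{inverted}"'
    try:
        verdict = json.loads(llm_chat(verify_system, user_verify))
    except json.JSONDecodeError:
        return None  # conservative: unparseable audit output counts as REJECT
    if verdict.get("decision") != "ACCEPT":
        return None
    return inverted
```

Returning `None` for rejected or malformed verdicts means downstream augmentation simply drops the sentence, matching the conservative filtering described in Appendix B.2.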

Appendix B.2. Filtering Statistics and Rejection Analysis

During consistency verification, each temporally inverted sentence is evaluated conservatively to ensure semantic coherence and clinical plausibility. On the Longitudinal-MIMIC training split, 26% of the temporally inverted sentences were rejected at this stage. Representative rejection cases are analyzed below.
(1) Confusion Between Worsening, Improvement, and Stability
In some cases, the temporal polarity of a sentence is misinterpreted, leading to incorrect inversion. Certain comparative terms (e.g., “smaller”) may reflect different clinical meanings depending on context.
  • Original sentence: “There is smaller left lung volume with associated basilar atelectasis.”
  • LLM-inverted sentence: “There is worsened left lung volume loss with associated basilar atelectasis.”
  • Reason for rejection: In this context, “smaller lung volume” may indicate worsening atelectasis rather than improvement. The inversion incorrectly assumes monotonic correspondence between size and disease severity, leading to semantic distortion.
(2) Lack of Explicit Temporal Comparability
Sentences containing speculative, equivocal, or conditional temporal language may lack a well-defined direction of change and therefore cannot be reliably inverted.
  • Original sentence: “There may be slight interval increase in bilateral interstitial markings.”
  • LLM-inverted sentence: “There may be slight interval decrease in bilateral interstitial markings.”
  • Reason for rejection: The speculative phrasing (“may be,” “slight”) prevents reliable determination of temporal polarity. Such sentences are classified as Not-Comparable and excluded from augmentation.
(3) Contradiction with Expected Effects of Clinical Interventions
Inverted descriptions that contradict the expected short-term effects of explicitly mentioned interventions are considered clinically implausible.
  • Original sentence: “Following chest tube placement, the left pneumothorax has decreased.”
  • LLM-inverted sentence: “Following chest tube placement, the left pneumothorax has increased.”
  • Reason for rejection: The inverted sentence contradicts the anticipated therapeutic effect of chest tube placement, rendering the reversal clinically unreasonable.

Appendix B.3. Bias and Scope Discussion

We acknowledge that LLM-generated text may exhibit stylistic or lexical regularities distinct from human-written reports. Empirically, as shown in the ablation study in Section 6.5, incorporating temporally reversed samples in later training stages can lead to fluctuations or degradation in NLG metrics.
A plausible explanation is that LLM-generated inversions may introduce stylistic artifacts, such as mechanically substituting key change-related terms while preserving surrounding phrasing, or excessively rephrasing sentence structures (e.g., rewriting “newly developed findings” as “previously noted findings have resolved”), which deviates from standard radiology reporting conventions.
The primary benefit of temporal-reversal augmentation lies in encouraging the model to learn direction-invariant longitudinal representations, rather than improving surface-level linguistic fluency. By explicitly constraining its usage to early training stages, aggressively filtering semantically invalid samples, and excluding all augmented text from generation supervision, we minimize semantic pollution while retaining its effectiveness for longitudinal representation learning.

Appendix C

To further examine the clinical interpretability of the learned difference-aware representations, we conduct a qualitative analysis on six longitudinal cases, each involving a single dominant pathological change, including disease worsening, improvement, or no change. For each case, cross-attention maps are visualized under four conditions: (i) the original temporal order, (ii) reversed temporal order, (iii) artificial occlusion to simulate peripheral artifacts, and (iv) mild geometric and intensity perturbations, including ±15° rotation and gamma/brightness adjustments. Note that all cross-attention maps are overlaid on the reference image under the current temporal setting. Anatomically localized regions of interest (ROIs) corresponding to the target pathology are obtained from the Chest ImaGenome dataset [46] and overlaid on all attention maps for post hoc analysis.
As shown in Figure A1, under the original temporal order, cross-attention consistently concentrates on the annotated lesion ROIs for cases involving disease worsening or improvement (cases 1–4), whereas cases with no reported change (cases 5–6) exhibit relatively more spatially diffuse attention patterns. After temporal reversal, although the reference image changes, the cross-attention remains aligned with anatomically corresponding lesion regions, indicating that the learned representations are conditioned on temporal relationships rather than fixed visual locations.
When artificial occlusions are applied to the reference images, the resulting attention maps exhibit clinically consistent spatial distributions that closely resemble those observed under the original setting. Likewise, under mild geometric rotation (±15°) and intensity perturbations, the attention patterns remain anatomically aligned with the lesion ROIs, preserving their clinical relevance and spatial correspondence.
Taken together, these qualitative results suggest that, even in the absence of explicit anatomical or lesion-level supervision, MRID is able to learn difference-aware representations that are clinically meaningful and robust to common sources of spurious variation, such as imaging conditions, peripheral artifacts, and patient positioning.
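For reproducibility, the perturbations in condition (iv) can be approximated as below. This is an illustrative sketch only: the rotation bound matches the stated ±15°, but the specific gamma and brightness values are assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.ndimage import rotate

def perturb(img: np.ndarray, angle: float = 15.0,
            gamma: float = 1.2, brightness: float = 1.1) -> np.ndarray:
    """Apply a mild geometric rotation plus gamma/brightness adjustment
    to an image with intensities in [0, 1]."""
    out = rotate(img, angle, reshape=False, mode="nearest")  # geometric perturbation (<= +/-15 deg)
    out = np.clip(out, 0.0, 1.0) ** gamma                    # gamma adjustment
    out = out * brightness                                   # brightness scaling
    return np.clip(out, 0.0, 1.0)                            # keep intensities valid
```

Clipping back to [0, 1] keeps the perturbed image a valid input for the same preprocessing pipeline as the unperturbed pair.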
Figure A1. Qualitative visualization of cross-attention maps. Each row corresponds to one case, while columns show cross-attention maps under: (i) the original temporal order (CrossAttn(Ori.)), (ii) reversed temporal order (CrossAttn(Inv.)), (iii) artificial occlusion (CrossAttn(Occ.)), and (iv) mild geometric and intensity perturbations, including ±15° rotation and gamma/brightness adjustments (CrossAttn(Per.)). Anatomical regions of interest (ROIs) corresponding to the target pathology are obtained from the Chest ImaGenome dataset and overlaid on all attention maps, highlighted using red bounding boxes.

References

  1. Jing, B.; Xie, P.; Xing, E. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2577–2586. [Google Scholar]
  2. Akhter, Y.; Singh, R.; Vatsa, M. AI-based radiodiagnosis using chest X-rays: A review. Front. Big Data 2023, 6, 1120989. [Google Scholar] [CrossRef] [PubMed]
  3. Lee, S.; Youn, J.; Kim, H.; Kim, M.; Yoon, S.H. CXR-LLAVA: A multimodal large language model for interpreting chest X-ray images. Eur. Radiol. 2025, 35, 4374–4386. [Google Scholar] [CrossRef] [PubMed]
  4. Khare, Y.; Bagal, V.; Mathew, M.; Devi, A.; Priyakumar, U.D.; Jawahar, C.V. Mmbert: Multimodal bert pretraining for improved medical vqa. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI); IEEE: New York, NY, USA, 2021; pp. 1033–1036. [Google Scholar]
  5. Huang, S.-C.; Shen, L.; Lungren, M.P.; Yeung, S. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 3942–3951. [Google Scholar]
  6. Boecking, B.; Usuyama, N.; Bannur, S.; Castro, D.C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al. Making the most of text semantics to improve biomedical vision–language processing. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  7. Yin, H.; Zhou, S.; Wang, P.; Wu, Z.; Hao, Y. KIA: Knowledge-guided implicit vision-language alignment for chest X-ray report generation. In Proceedings of the 31st International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 4096–4108. [Google Scholar]
  8. Aideyan, U.O.; Berbaum, K.; Smith, W.L. Influence of prior radiologic information on the interpretation of radiographic examinations. Acad. Radiol. 1995, 2, 205–208. [Google Scholar] [CrossRef] [PubMed]
  9. Rousan, L.A.; Elobeid, E.; Karrar, M.; Khader, Y. Chest x-ray findings and temporal lung changes in patients with COVID-19 pneumonia. BMC Pulm. Med. 2020, 20, 245. [Google Scholar] [CrossRef]
  10. Zhou, S.; Li, Y.; Liu, Y.; Liu, L.; Wang, L.; Zhou, L. A Review of Longitudinal Radiology Report Generation: Dataset Composition, Methods, and Performance Evaluation. arXiv 2025, arXiv:2510.12444. [Google Scholar] [CrossRef]
  11. Karwande, G.; Mbakwe, A.B.; Wu, J.T.; Celi, L.A.; Moradi, M.; Lourentzou, I. Chexrelnet: An anatomy-aware model for tracking longitudinal relationships between chest x-rays. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2022; pp. 581–591. [Google Scholar]
  12. Hou, W.; Cheng, Y.; Xu, K.; Li, W.; Liu, J. RECAP: Towards precise radiology report generation via dynamic disease progression reasoning. arXiv 2023, arXiv:2310.13864. [Google Scholar] [CrossRef]
  13. Hu, X.; Gu, L.; An, Q.; Zhang, M.; Liu, L.; Kobayashi, K.; Harada, T.; Summers, R.M.; Zhu, Y. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2023; pp. 4156–4165. [Google Scholar]
  14. Ramesh, K.K.D.; Kumar, G.K.; Swapna, K.; Datta, D.; Rajest, S.S. A review of medical image segmentation algorithms. EAI Endorsed Trans. Pervasive Health Technol. 2021, 7, e6. [Google Scholar] [CrossRef]
  15. Bannur, S.; Hyland, S.; Liu, Q.; Perez-Garcia, F.; Ilse, M.; Castro, D.C.; Boecking, B.; Sharma, H.; Bouzid, K.; Thieme, A.; et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 15016–15027. [Google Scholar]
  16. Sanjeev, S.; Maani, F.A.; Abzhanov, A.; Papineni, V.R.; Almakky, I.; Papież, B.W.; Yaqub, M. Tibix: Leveraging temporal information for bidirectional x-ray and report generation. In MICCAI Workshop on Deep Generative Models; Springer Nature: Cham, Switzerland, 2024; pp. 169–179. [Google Scholar]
  17. Wang, F.; Du, S.; Yu, L. Hergen: Elevating radiology report generation with longitudinal data. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 183–200. [Google Scholar]
  18. Wang, J.; Wang, S.; Zhang, Y. Deep learning on medical image analysis. CAAI Trans. Intell. Technol. 2025, 10, 1–35. [Google Scholar] [CrossRef]
  19. Li, L.; Gan, Z.; Cheng, Y.; Liu, J. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 10313–10322. [Google Scholar]
  20. Tu, Y.; Li, L.; Su, L.; Lu, K.; Huang, Q. Neighborhood contrastive transformer for change captioning. IEEE Trans. Multimed. 2023, 25, 9518–9529. [Google Scholar] [CrossRef]
  21. Huang, Q.; Liang, Y.; Wei, J.; Cai, Y.; Liang, H.; Leung, H.-F.; Li, Q. Image difference captioning with instance-level fine-grained feature representation. IEEE Trans. Multimed. 2021, 24, 2004–2017. [Google Scholar] [CrossRef]
  22. Qiu, Y.; Yamamoto, S.; Nakashima, K.; Suzuki, R.; Iwata, K.; Kataoka, H.; Satoh, Y. Describing and localizing multiple changes with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 1971–1980. [Google Scholar]
  23. Anwar, S.M.; Majid, M.; Qayyum, A.; Awais, M.; Alnowami, M.; Khan, M.K. Medical image analysis using convolutional neural networks: A review. J. Med. Syst. 2018, 42, 226. [Google Scholar] [CrossRef] [PubMed]
  24. Santeramo, R.; Withey, S.; Montana, G. Longitudinal detection of radiological abnormalities with time-modulated LSTM. In International Workshop on Deep Learning in Medical Image Analysis; Springer International Publishing: Cham, Switzerland, 2018; pp. 326–333. [Google Scholar]
  25. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  26. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
  27. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
  28. Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32, 8789–8798. [Google Scholar]
  29. Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
  30. Chen, F.-L.; Zhang, D.-Z.; Han, M.-L.; Chen, X.-Y.; Shi, J.; Xu, S.; Xu, B. Vlp: A survey on vision-language pre-training. Mach. Intell. Res. 2023, 20, 38–56. [Google Scholar] [CrossRef]
  31. Bao, H.; Dong, L.; Piao, S.; Wei, F. Beit: Bert pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
  32. Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 9653–9663. [Google Scholar]
  33. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning; PMLR: Brookline, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
  34. Johnson, A.E.W.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.-Y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef]
  35. Zhu, Q.; Mathai, T.S.; Mukherjee, P.; Peng, Y.; Summers, R.M.; Lu, Z. Utilizing longitudinal chest x-rays and reports to pre-fill radiology reports. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2023; pp. 189–198. [Google Scholar]
  36. Dou, C.; Liu, C.; Yang, F.; Li, F.; Jia, J.; Chen, M.; Ju, Q.; Wang, S.; Dang, S.; Li, T.; et al. Baichuan-m2: Scaling medical capability with large verifier system. arXiv 2025, arXiv:2509.02208. [Google Scholar] [CrossRef]
  37. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318. [Google Scholar]
  38. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72. [Google Scholar]
  39. Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  40. Smit, A.; Jain, S.; Rajpurkar, P.; Pareek, A.; Ng, A.Y.; Lungren, M.P. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv 2020, arXiv:2004.09167. [Google Scholar] [CrossRef]
  41. Chen, Z.; Song, Y.; Chang, T.-H.; Wan, X. Generating radiology reports via memory-driven transformer. arXiv 2020, arXiv:2010.16056. [Google Scholar]
  42. Huang, Z.; Zhang, X.; Zhang, S. Kiut: Knowledge-injected u-transformer for radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 19809–19818. [Google Scholar]
  43. Huang, L.; Wang, W.; Chen, J.; Wei, X.-Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 4634–4643. [Google Scholar]
  44. Jin, H.; Che, H.; Lin, Y.; Chen, H. Promptmrg: Diagnosis-driven prompts for medical report generation. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2024; Volume 38, pp. 2607–2615. [Google Scholar]
  45. Chen, Y.; Xu, S.; Sellergren, A.; Matias, Y.; Hassidim, A.; Shetty, S.; Golden, D.; Yuille, A.L.; Yang, L. CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2025; pp. 78–88. [Google Scholar]
  46. Shen, C.; Menten, M.J.; Bogunović, H.; Schmidt-Erfurth, U.; Scholl, H.P.N.; Sivaprasad, S.; Lotery, A.; Rueckert, D.; Hager, P.; Holland, R. Spatiotemporal representation learning for short and long medical image time series. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2024; pp. 656–666. [Google Scholar]
Figure 1. Overview of the MRID framework. The entire framework is jointly optimized with five self-supervised objectives: masked report modeling (MRM), masked image difference modeling (MIDM), image-report contrastive learning (IRC), image-report matching prediction (IRM), and report generation (RG).
Figure 2. Illustration of the five self-supervised training objectives and the curriculum learning strategy in MRID. Subfigures (1)–(5) visualize the modality interaction patterns for each objective, and subfigure (6) shows the stage-wise training scheme.
Figure 3. Temporal-reversal data augmentation strategy. Dynamic progression descriptions from the current report are inverted using an LLM and combined with static descriptions from the reference report, to form a new report corresponding to the reversed image pair.
Figure 4. Comparison of report generation results between MRID and BioViL-T, and attention heatmaps of the single- and dual-image encoders in MRID. CrossAttn denotes the dual-image attention heatmap over the reference image, while CrossAttn (Inv.) corresponds to reversed image order. In the visualization, the text is color-coded as follows: orange indicates ground-truth descriptions related to disease progression; blue indicates correctly predicted descriptions; and red indicates incorrectly predicted descriptions.
Table 1. Task weights for each stage in the MRID curriculum learning framework.

Training Stage | Epoch Num | λ1 (MRM) | λ2 (MIDM) | λ3 (IRC) | λ4 (IRM) | λ5 (RG)
1 | 15 | 1 | 1 | 1 | 0 | 0
2 | 15 | 1 | 0.5 | 1 | 1 | 0.5
3 | 10 | 0.5 | 0.5 | 0.5 | 0.5 | 1
4 | 10 | 0 | 0 | 0 | 0 | 1
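The per-stage weights translate directly into a weighted multi-task objective of the form L = Σ λ_i L_i. A minimal sketch (the loss-dictionary keys are illustrative, not the paper's code):

```python
# Stage-wise task weights copied from Table 1.
STAGE_WEIGHTS = {
    1: {"MRM": 1.0, "MIDM": 1.0, "IRC": 1.0, "IRM": 0.0, "RG": 0.0},
    2: {"MRM": 1.0, "MIDM": 0.5, "IRC": 1.0, "IRM": 1.0, "RG": 0.5},
    3: {"MRM": 0.5, "MIDM": 0.5, "IRC": 0.5, "IRM": 0.5, "RG": 1.0},
    4: {"MRM": 0.0, "MIDM": 0.0, "IRC": 0.0, "IRM": 0.0, "RG": 1.0},
}

def combined_loss(stage: int, task_losses: dict) -> float:
    """Weighted sum L = sum_i lambda_i * L_i for the given curriculum stage.
    Zero-weight tasks are skipped, so their heads need not be evaluated."""
    weights = STAGE_WEIGHTS[stage]
    return sum(w * task_losses[name] for name, w in weights.items() if w > 0.0)
```

Note that stage 4 reduces to report generation alone, consistent with the pretraining–finetuning paradigm: stages 1–3 pretrain with the self-supervised objectives, and the final stage finetunes for generation.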
Table 2. Comparison of MRID with baseline methods on both the MIMIC-CXR and Longitudinal-MIMIC datasets. Input Resolution indicates the spatial resolution of the input images. The best results are in boldface and the second-best results are underlined.

Dataset | Method | Input Resolution | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | Precision | Recall | F1
MIMIC-CXR (RRG Task) | R2Gen [41] | 224 × 224 | 0.353 | 0.103 | 0.142 | 0.277 | 0.333 | 0.273 | 0.276
 | KiUT [42] | 256 × 256 | 0.393 | 0.113 | 0.160 | 0.285 | 0.371 | 0.318 | 0.321
 | BioViL-T [15] | 448 × 448 | – | 0.092 | – | 0.296 | – | – | 0.318
 | HERGen [17] | 384 × 384 | 0.395 | 0.122 | 0.156 | 0.285 | 0.415 | 0.301 | 0.317
 | TiBiX [16] | 512 × 512 | 0.324 | 0.157 | 0.162 | 0.331 | 0.300 | 0.224 | 0.250
 | PromptMRG [44] | 224 × 224 | 0.398 | 0.112 | 0.157 | 0.268 | 0.454 | 0.370 | 0.389
 | RECAP [12] | 224 × 224 | 0.429 | 0.125 | 0.168 | 0.288 | 0.389 | 0.443 | 0.393
 | KIA [7] | 224 × 224 | 0.401 | 0.138 | 0.167 | 0.307 | 0.504 | 0.425 | 0.461
 | MRID (Ours) | 224 × 224 | 0.403 | 0.154 | 0.163 | 0.323 | 0.525 | 0.486 | 0.498
Longitudinal-MIMIC (LRRG Task) | AoANet [43] | 224 × 224 | 0.272 | 0.080 | 0.115 | 0.249 | 0.437 | 0.249 | 0.371
 | R2Gen [41] | 224 × 224 | 0.302 | 0.087 | 0.124 | 0.259 | 0.500 | 0.305 | 0.379
 | HERGen [17] | 384 × 384 | 0.389 | 0.117 | 0.155 | 0.282 | 0.421 | 0.289 | 0.295
 | PromptMRG [44] | 224 × 224 | 0.390 | 0.102 | 0.152 | 0.263 | 0.502 | 0.542 | 0.492
 | MRID (Ours) | 224 × 224 | 0.392 | 0.104 | 0.159 | 0.267 | 0.513 | 0.538 | 0.519
Table 3. Macro-accuracy comparison of MRID and baseline methods for disease progression classification on the MS-CXR-T dataset. Results are reported for zero-/few-shot (10%) and fully supervised learning settings. The best results are in boldface.

Setting | Methods | Consolidation | Pl. Effusion | Pneumonia | Pneumothorax | Edema | Average
Zero-shot | BioViL-T [15] | 53.6 | 59.7 | 58.0 | 34.9 | 64.2 | 54.1
 | MRID (Ours) | 56.9 | 59.6 | 61.9 | 32.6 | 65.6 | 55.3
Few-shot | BioViL-T [15] | 59.7 | 62.4 | 60.1 | 35.3 | 62.6 | 56.0
 | MRID (Ours) | 60.0 | 63.1 | 64.7 | 37.8 | 66.1 | 58.3
Supervised | CheXRelNet [11] | 47.0 | 47.0 | 47.0 | 36.0 | 49.0 | 45.2
 | BioViL-T [15] | 61.1 | 67.0 | 61.9 | 42.6 | 68.5 | 60.2
 | CoCa-CXR [45] | 69.6 | 68.1 | 56.4 | 59.3 | 70.8 | 64.8
 | MRID (Ours) | 67.3 | 67.8 | 63.9 | 62.3 | 67.7 | 65.8
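As a sanity check on the reporting convention, the Average column is the unweighted mean of the five per-finding accuracies (macro-averaging); the zero-shot rows reproduce it exactly:

```python
def macro_accuracy(per_class_acc):
    """Unweighted mean of per-class accuracies (macro-averaging)."""
    return sum(per_class_acc) / len(per_class_acc)

# Per-finding accuracies copied from the zero-shot rows of Table 3:
# Consolidation, Pl. Effusion, Pneumonia, Pneumothorax, Edema.
biovil_t_zeroshot = [53.6, 59.7, 58.0, 34.9, 64.2]
mrid_zeroshot = [56.9, 59.6, 61.9, 32.6, 65.6]
```

Rounding `macro_accuracy` to one decimal gives 54.1 for BioViL-T and 55.3 for MRID, matching the Average column.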
Table 4. Ablation results on difference-aware image modeling and dual-image encoder design. Classification average accuracy results are reported on the full evaluation set and the MC Subset, where MC Subset denotes the mixed-change subset containing samples with heterogeneous temporal progression states across disease categories. The best results are in boldface.

Task | Setting | BL-1 | BL-4 | MTR | RL | Pre. | Rec. | F1 | Full Set | MC Subset
RRG | (a) KIA | 0.401 | 0.138 | 0.167 | 0.307 | 0.504 | 0.425 | 0.461 | – | –
 | (b) MRID (Asym.) | 0.403 | 0.154 | 0.163 | 0.323 | 0.525 | 0.486 | 0.498 | – | –
LRRG | (b) MRID (Asym.) | 0.392 | 0.104 | 0.159 | 0.267 | 0.513 | 0.538 | 0.519 | 65.8 | 57.3
 | (c) MRID (Bi-CA) | 0.390 | 0.099 | 0.157 | 0.270 | 0.501 | 0.515 | 0.507 | 63.2 | 49.2
 | (d) MRID (Sym.) | 0.394 | 0.104 | 0.156 | 0.268 | 0.509 | 0.529 | 0.514 | 65.0 | 54.4
Table 5. Ablation results of multi-task learning and curriculum learning. AVG. Δ denotes the average percentage improvement of each metric relative to Setting (a). Num of Stages indicates the total number of training stages in the curriculum learning strategy. The best results are in boldface.

Setting | Num of Stages | MRM | MIDM | IRC | IRM | RG | BL-1 | BL-4 | MTR | RL | AVG.Δ | Pre. | Rec. | F1 | AVG.Δ
(a) | 1 | | | | | | 0.318 | 0.116 | 0.119 | 0.270 | – | 0.412 | 0.305 | 0.379 | –
(b) | 2 | | | | | | 0.356 | 0.127 | 0.132 | 0.284 | 9.5% | 0.442 | 0.315 | 0.399 | 5.3%
(c) | | | | | | | 0.329 | 0.128 | 0.127 | 0.272 | 5.4% | 0.485 | 0.330 | 0.418 | 12.0%
(d) | | | | | | | 0.359 | 0.124 | 0.141 | 0.281 | 10.6% | 0.493 | 0.347 | 0.417 | 14.5%
(e) | | | | | | | 0.371 | 0.121 | 0.140 | 0.260 | 8.6% | 0.487 | 0.425 | 0.434 | 24.1%
(f) | | | | | | | 0.347 | 0.134 | 0.148 | 0.278 | 12.8% | 0.464 | 0.354 | 0.420 | 13.3%
(g) | | | | | | | 0.386 | 0.143 | 0.155 | 0.289 | 20.4% | 0.482 | 0.436 | 0.439 | 25.3%
(h) | | | | | | | 0.306 | 0.103 | 0.108 | 0.248 | −8.1% | 0.416 | 0.282 | 0.385 | −1.7%
(i) | 3 | | | | | | 0.400 | 0.153 | 0.162 | 0.319 | 27.9% | 0.523 | 0.480 | 0.494 | 38.3%
MRID | 4 | | | | | | 0.403 | 0.154 | 0.163 | 0.323 | 29.0% | 0.525 | 0.486 | 0.498 | 39.4%
Table 6. Ablation study of the temporal-reversal data augmentation strategy. Used in Training Stage indicates the curriculum learning stages during which the augmentation is applied. AVG. Δ denotes the average percentage improvement of each metric relative to Setting (a). The best results are in boldface.

Setting | Stage 1 | Stage 2 | Stage 3 | BL-1 | BL-4 | MTR | RL | AVG.Δ | Pre. | Rec. | F1 | AVG.Δ | AVG. Acc.
(a) | | | | 0.393 | 0.106 | 0.151 | 0.265 | – | 0.501 | 0.506 | 0.503 | – | 64.1
(b) | | | | 0.389 | 0.106 | 0.156 | 0.270 | 1.0% | 0.507 | 0.523 | 0.511 | 2.0% | 64.5
(c) | | | | 0.392 | 0.104 | 0.159 | 0.267 | 0.9% | 0.513 | 0.538 | 0.519 | 3.9% | 65.8
(d) | | | | 0.387 | 0.102 | 0.152 | 0.265 | −1.1% | 0.511 | 0.535 | 0.520 | 3.7% | 65.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hao, Y.; Wang, P.; Chen, Y.; Zhao, H. MRID: Modeling Radiological Image Differences for Disease Progression Reasoning via Multi-Task Self-Supervision. Electronics 2026, 15, 997. https://doi.org/10.3390/electronics15050997

