MRID: Modeling Radiological Image Differences for Disease Progression Reasoning via Multi-Task Self-Supervision
Abstract
1. Introduction
2. Related Works
2.1. Medical Image Report Generation
2.2. Image Change Captioning in the General Domain
2.3. Disease Progression Reasoning
3. Architecture of the MRID Framework
- Uni-modal Encoders. This category includes a single-image encoder, a dual-image encoder, and a text encoder. These modules extract semantic representations from their respective modalities; the dual-image encoder further captures fine-grained spatial differences between the paired images.
- Cross-Modal Encoders. Two cross-modal encoders, built upon cross-attention mechanisms, enable bidirectional fusion between the visual and textual modalities.
- Text Decoder. The text decoder consumes the aligned multimodal features and autoregressively generates a coherent report that jointly reflects current clinical findings and longitudinal disease progression. A minimal structural sketch of these components follows this list.
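As a reading aid, the following PyTorch sketch shows one way these components could be wired together. The module sizes, the concatenation-based difference fusion, and all names here are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

D, HEADS = 768, 8  # illustrative dimensions only

class TinyViT(nn.Module):
    """Stand-in patch encoder for the single-image encoder (a full ViT in practice)."""
    def __init__(self, patch=16, dim=D):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, HEADS, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                                        # x: [B, 3, H, W]
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # [B, N, D]
        return self.encoder(tokens)

class DualImageEncoder(nn.Module):
    """Sketch: encodes a (prior, current) pair and projects the paired features,
    so fine-grained differences between the two images can be represented."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone              # reuses the single-image backbone
        self.diff_proj = nn.Linear(2 * D, D)  # fusion choice is an assumption

    def forward(self, img_prior, img_curr):
        f_prior = self.backbone(img_prior)
        f_curr = self.backbone(img_curr)
        return self.diff_proj(torch.cat([f_prior, f_curr], dim=-1))  # [B, N, D]

# Wiring: difference-aware visual features would feed the cross-attention fusion
# modules, whose output conditions the autoregressive text decoder (omitted here).
backbone = TinyViT()
dual_encoder = DualImageEncoder(backbone)
prior, curr = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
diff_feats = dual_encoder(prior, curr)        # [2, 196, 768]
```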
3.1. Single-Image Encoder and Text Encoder
3.2. Dual-Image Encoder
3.3. Cross-Modal Encoders
3.4. Text Decoder
3.5. Parameter Sharing
- The bidirectional self-attention layers shared between the single-image and dual-image encoders;
- The bidirectional self-attention layers shared between the text encoder and the cross-modal encoders;
- The cross-attention layers shared between the two cross-modal encoders (a weight-tying sketch follows this list).
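In PyTorch, this kind of sharing amounts to plain weight tying: paired modules hold references to the same submodule, so one set of weights receives gradients from both paths. A minimal illustration, with hypothetical module and attribute names:

```python
import torch.nn as nn

def tie_submodule(src: nn.Module, dst: nn.Module, name: str) -> None:
    """Point dst's submodule at src's, so a single set of weights serves both."""
    setattr(dst, name, getattr(src, name))

# Hypothetical usage, mirroring the sharing scheme listed above:
# for blk_a, blk_b in zip(encoder_a.blocks, encoder_b.blocks):
#     tie_submodule(blk_a, blk_b, "self_attn")   # shared bidirectional self-attention
# for blk_a, blk_b in zip(cross_enc_a.blocks, cross_enc_b.blocks):
#     tie_submodule(blk_a, blk_b, "cross_attn")  # shared cross-attention
```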
4. Multi-Task Self-Supervised Learning
- Representation modeling objectives (MRM and MIDM), which strengthen the model’s understanding of textual semantic structure and fine-grained visual difference representations through masked modeling;
- Cross-modal alignment objectives (IRC and IRM), which enforce semantic consistency between visual and textual representations in a shared embedding space and enhance discriminative cross-modal reasoning (a generic sketch of the contrastive component follows this list);
- Generation objective (RG), which guides the model to synthesize coherent and clinically meaningful radiology reports by integrating learned multimodal information.
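As one concrete instance, contrastive objectives such as IRC are conventionally realized as a symmetric InfoNCE loss over matched image–report pairs in a batch. The sketch below follows that convention; the temperature value and the assumption that embeddings are already projected to a shared space are ours, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F

def irc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric image-report contrastive loss; diagonal entries are matched pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                 # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```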
4.1. Domain-Aware Masked Report Modeling
4.2. Masked Image Difference Modeling
4.3. Image–Report Contrastive Learning
4.4. Image–Report Matching Prediction
4.5. Report Generation
4.6. Curriculum Learning
- Stage 1: Representation and alignment warm-up. MRM, MIDM, and IRC are activated to learn difference-aware visual representations and initial cross-modal alignment. This stage establishes a stable multimodal foundation and prevents early overreliance on linguistic priors.
- Stage 2: Fine-grained discrimination enhancement. IRM is introduced to improve instance-level discrimination between matched and mismatched image–report pairs, while RG is incorporated with a small weight. This stage strengthens cross-modal reasoning and gradually adapts the model to generation without disrupting learned representations.
- Stage 3: Generation-oriented joint fine-tuning. The weight of RG is increased, while other objectives serve as auxiliary regularization.
- Stage 4: Task-specific fine-tuning for report generation. Only RG is retained for optimization, using the pretrained encoders and difference-modeling modules (the stage-wise loss weights are sketched below).
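The schedule can be expressed directly as per-stage loss weights. The values below are transcribed from the stage-weight table in the appendix (epochs in the first field); the surrounding helper is our sketch:

```python
# (epochs, {objective: weight}) per curriculum stage, from the appendix weight table
CURRICULUM = [
    (15, {"MRM": 1.0, "MIDM": 1.0, "IRC": 1.0, "IRM": 0.0, "RG": 0.0}),  # Stage 1
    (15, {"MRM": 1.0, "MIDM": 0.5, "IRC": 1.0, "IRM": 1.0, "RG": 0.5}),  # Stage 2
    (10, {"MRM": 0.5, "MIDM": 0.5, "IRC": 0.5, "IRM": 0.5, "RG": 1.0}),  # Stage 3
    (10, {"MRM": 0.0, "MIDM": 0.0, "IRC": 0.0, "IRM": 0.0, "RG": 1.0}),  # Stage 4
]

def stage_loss(task_losses: dict, weights: dict):
    """Weighted sum over the objectives active in the current stage."""
    return sum(w * task_losses[k] for k, w in weights.items() if w > 0)
```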
5. Dataset and Data Preprocessing
5.1. Training and Evaluation Dataset
5.2. Data Preprocessing
5.3. Temporal-Reversal Data Augmentation
6. Experiments
6.1. Implementation Details
6.2. Evaluation Metrics
- Natural Language Generation Metrics
- Clinical Efficacy Metrics
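For reference, the listed NLG metrics can be computed with standard tooling; the sketch below uses nltk for BLEU and the rouge-score package for ROUGE-L. The whitespace tokenization and example sentences are deliberately simple, and METEOR is available analogously via nltk:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

ref = "the left pleural effusion has decreased in size".split()
hyp = "the left pleural effusion is decreased".split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([ref], hyp, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu([ref], hyp, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(" ".join(ref), " ".join(hyp))["rougeL"].fmeasure
print(f"BLEU-1={bleu1:.3f} BLEU-4={bleu4:.3f} ROUGE-L={rouge_l:.3f}")
```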
6.3. Report Generation Results
6.4. Disease Progression Classification Results
6.5. Ablation Study
- Effect of Difference-Aware Representation and Dual-Image Encoder Design
- Effect of Auxiliary Tasks and Curriculum Learning
- Effect of Temporal-Reversal Data Augmentation
6.6. Qualitative Results
- (1) Case Study of Generated Reports
- (2) Heatmap Visualization of Image Representations
7. Discussion
7.1. Discussion of Experimental Findings
7.2. Extending Beyond Pairwise Longitudinal Modeling
7.3. Toward More Faithful Progression Evaluation
7.4. Considerations on Modality Generalization
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Full Form |
|---|---|
| VLP | Vision-Language Pretraining |
| MRM | Masked Report Modeling |
| MIDM | Masked Image Difference Modeling |
| IRC | Image–Report Contrastive Learning |
| IRM | Image–Report Matching Prediction |
| RRG | Radiological Report Generation |
| LRRG | Longitudinal Radiological Report Generation |
Appendix A
| Disease Progression Type | Tokens |
|---|---|
| Improving | better, cleared, decreased, decreasing, improved, improving, reduced, resolved, resolving, resolution, smaller, alleviated, alleviating, diminished, diminishing |
| Stable | constant, stable, unchanged, no change, maintained |
| Worsening | bigger, developing, developed, enlarged, enlarging, greater, growing, increased, increasing, larger, new, newly, progressing, progressive, worse, worsened, worsening, extended, extending, extension |
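These token lists support a simple rule-based progression labeler, as in the sketch below. The word-boundary matching and the fallback to a Not-Comparable label are our assumptions about the matching policy:

```python
import re

PROGRESSION_TOKENS = {
    "Improving": ["better", "cleared", "decreased", "decreasing", "improved",
                  "improving", "reduced", "resolved", "resolving", "resolution",
                  "smaller", "alleviated", "alleviating", "diminished", "diminishing"],
    "Stable": ["constant", "stable", "unchanged", "no change", "maintained"],
    "Worsening": ["bigger", "developing", "developed", "enlarged", "enlarging",
                  "greater", "growing", "increased", "increasing", "larger", "new",
                  "newly", "progressing", "progressive", "worse", "worsened",
                  "worsening", "extended", "extending", "extension"],
}

def label_progression(sentence: str):
    """Return progression classes whose tokens appear in the sentence."""
    text = sentence.lower()
    hits = {label for label, tokens in PROGRESSION_TOKENS.items()
            if any(re.search(rf"\b{re.escape(t)}\b", text) for t in tokens)}
    return hits or {"Not-Comparable"}

print(label_progression("The left pleural effusion has decreased."))  # {'Improving'}
```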
Appendix B
Appendix B.1. Prompt Design
- (1) Prompts for dynamic description inversion.
- (2) Prompts for consistency verification.
Appendix B.2. Filtering Statistics and Rejection Analysis
- (1) Confusion Between Worsening, Improvement, and Stability
- Original sentence: “There is smaller left lung volume with associated basilar atelectasis.”
- LLM-inverted sentence: “There is worsened left lung volume loss with associated basilar atelectasis.”
- Reason for rejection: In this context, “smaller lung volume” may indicate worsening atelectasis rather than improvement. The inversion incorrectly assumes monotonic correspondence between size and disease severity, leading to semantic distortion.
- (2) Lack of Explicit Temporal Comparability
- Original sentence: “There may be slight interval increase in bilateral interstitial markings.”
- LLM-inverted sentence: “There may be slight interval decrease in bilateral interstitial markings.”
- Reason for rejection: The speculative phrasing (“may be,” “slight”) prevents reliable determination of temporal polarity. Such sentences are classified as Not-Comparable and excluded from augmentation.
- (3) Contradiction with Expected Effects of Clinical Interventions
- Original sentence: “Following chest tube placement, the left pneumothorax has decreased.”
- LLM-inverted sentence: “Following chest tube placement, the left pneumothorax has increased.”
- Reason for rejection: The inverted sentence contradicts the anticipated therapeutic effect of chest tube placement, rendering the reversal clinically unreasonable.
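Combining the three rejection patterns above, the consistency filter can be sketched as follows. Here `label_progression` is the Appendix A matcher sketched earlier, the hedge-phrase list extends the examples in case (2), and `llm_judge` is a hypothetical stand-in for the verification prompt of Appendix B.1:

```python
HEDGE_PHRASES = ("may be", "may represent", "possibly", "slight", "cannot be excluded")

def llm_judge(original: str, inverted: str) -> bool:
    """Placeholder for the Appendix B.1 consistency-verification prompt call."""
    raise NotImplementedError("wire this to your LLM verification prompt")

def verify_inversion(original: str, inverted: str) -> bool:
    """Accept an LLM-inverted sentence only if its temporal polarity is safe to flip."""
    # (2) Speculative phrasing -> Not-Comparable, excluded from augmentation.
    if any(h in original.lower() for h in HEDGE_PHRASES):
        return False
    # The polarity must actually flip between the original and the inversion.
    if label_progression(original) == label_progression(inverted):
        return False
    # (1) Size terms that do not track severity and (3) contradictions with expected
    # intervention effects require clinical context: defer to the LLM verification.
    return llm_judge(original, inverted)
```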
Appendix B.3. Bias and Scope Discussion
Appendix C

References
- Jing, B.; Xie, P.; Xing, E. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2577–2586. [Google Scholar]
- Akhter, Y.; Singh, R.; Vatsa, M. AI-based radiodiagnosis using chest X-rays: A review. Front. Big Data 2023, 6, 1120989. [Google Scholar] [CrossRef] [PubMed]
- Lee, S.; Youn, J.; Kim, H.; Kim, M.; Yoon, S.H. CXR-LLAVA: A multimodal large language model for interpreting chest X-ray images. Eur. Radiol. 2025, 35, 4374–4386. [Google Scholar] [CrossRef] [PubMed]
- Khare, Y.; Bagal, V.; Mathew, M.; Devi, A.; Priyakumar, U.D.; Jawahar, C.V. Mmbert: Multimodal bert pretraining for improved medical vqa. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI); IEEE: New York, NY, USA, 2021; pp. 1033–1036. [Google Scholar]
- Huang, S.-C.; Shen, L.; Lungren, M.P.; Yeung, S. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 3942–3951. [Google Scholar]
- Boecking, B.; Usuyama, N.; Bannur, S.; Castro, D.C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al. Making the most of text semantics to improve biomedical vision–language processing. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
- Yin, H.; Zhou, S.; Wang, P.; Wu, Z.; Hao, Y. KIA: Knowledge-guided implicit vision-language alignment for chest X-ray report generation. In Proceedings of the 31st International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 4096–4108. [Google Scholar]
- Aideyan, U.O.; Berbaum, K.; Smith, W.L. Influence of prior radiologic information on the interpretation of radiographic examinations. Acad. Radiol. 1995, 2, 205–208. [Google Scholar] [CrossRef] [PubMed]
- Rousan, L.A.; Elobeid, E.; Karrar, M.; Khader, Y. Chest x-ray findings and temporal lung changes in patients with COVID-19 pneumonia. BMC Pulm. Med. 2020, 20, 245. [Google Scholar] [CrossRef]
- Zhou, S.; Li, Y.; Liu, Y.; Liu, L.; Wang, L.; Zhou, L. A Review of Longitudinal Radiology Report Generation: Dataset Composition, Methods, and Performance Evaluation. arXiv 2025, arXiv:2510.12444. [Google Scholar] [CrossRef]
- Karwande, G.; Mbakwe, A.B.; Wu, J.T.; Celi, L.A.; Moradi, M.; Lourentzou, I. Chexrelnet: An anatomy-aware model for tracking longitudinal relationships between chest x-rays. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2022; pp. 581–591. [Google Scholar]
- Hou, W.; Cheng, Y.; Xu, K.; Li, W.; Liu, J. RECAP: Towards precise radiology report generation via dynamic disease progression reasoning. arXiv 2023, arXiv:2310.13864. [Google Scholar] [CrossRef]
- Hu, X.; Gu, L.; An, Q.; Zhang, M.; Liu, L.; Kobayashi, K.; Harada, T.; Summers, R.M.; Zhu, Y. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2023; pp. 4156–4165. [Google Scholar]
- Ramesh, K.K.D.; Kumar, G.K.; Swapna, K.; Datta, D.; Rajest, S.S. A review of medical image segmentation algorithms. EAI Endorsed Trans. Pervasive Health Technol. 2021, 7, e6. [Google Scholar] [CrossRef]
- Bannur, S.; Hyland, S.; Liu, Q.; Perez-Garcia, F.; Ilse, M.; Castro, D.C.; Boecking, B.; Sharma, H.; Bouzid, K.; Thieme, A.; et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 15016–15027. [Google Scholar]
- Sanjeev, S.; Maani, F.A.; Abzhanov, A.; Papineni, V.R.; Almakky, I.; Papież, B.W.; Yaqub, M. Tibix: Leveraging temporal information for bidirectional x-ray and report generation. In MICCAI Workshop on Deep Generative Models; Springer Nature: Cham, Switzerland, 2024; pp. 169–179. [Google Scholar]
- Wang, F.; Du, S.; Yu, L. Hergen: Elevating radiology report generation with longitudinal data. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 183–200. [Google Scholar]
- Wang, J.; Wang, S.; Zhang, Y. Deep learning on medical image analysis. CAAI Trans. Intell. Technol. 2025, 10, 1–35. [Google Scholar] [CrossRef]
- Li, L.; Gan, Z.; Cheng, Y.; Liu, J. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 10313–10322. [Google Scholar]
- Tu, Y.; Li, L.; Su, L.; Lu, K.; Huang, Q. Neighborhood contrastive transformer for change captioning. IEEE Trans. Multimed. 2023, 25, 9518–9529. [Google Scholar] [CrossRef]
- Huang, Q.; Liang, Y.; Wei, J.; Cai, Y.; Liang, H.; Leung, H.-F.; Li, Q. Image difference captioning with instance-level fine-grained feature representation. IEEE Trans. Multimed. 2021, 24, 2004–2017. [Google Scholar] [CrossRef]
- Qiu, Y.; Yamamoto, S.; Nakashima, K.; Suzuki, R.; Iwata, K.; Kataoka, H.; Satoh, Y. Describing and localizing multiple changes with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 1971–1980. [Google Scholar]
- Anwar, S.M.; Majid, M.; Qayyum, A.; Awais, M.; Alnowami, M.; Khan, M.K. Medical image analysis using convolutional neural networks: A review. J. Med. Syst. 2018, 42, 226. [Google Scholar] [CrossRef] [PubMed]
- Santeramo, R.; Withey, S.; Montana, G. Longitudinal detection of radiological abnormalities with time-modulated LSTM. In International Workshop on Deep Learning in Medical Image Analysis; Springer International Publishing: Cham, Switzerland, 2018; pp. 326–333. [Google Scholar]
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32, 8789–8798. [Google Scholar]
- Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
- Chen, F.-L.; Zhang, D.-Z.; Han, M.-L.; Chen, X.-Y.; Shi, J.; Xu, S.; Xu, B. Vlp: A survey on vision-language pre-training. Mach. Intell. Res. 2023, 20, 38–56. [Google Scholar] [CrossRef]
- Bao, H.; Dong, L.; Piao, S.; Wei, F. Beit: Bert pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
- Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 9653–9663. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning; PMLR: Brookline, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
- Johnson, A.E.W.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.-Y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef]
- Zhu, Q.; Mathai, T.S.; Mukherjee, P.; Peng, Y.; Summers, R.M.; Lu, Z. Utilizing longitudinal chest x-rays and reports to pre-fill radiology reports. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2023; pp. 189–198. [Google Scholar]
- Dou, C.; Liu, C.; Yang, F.; Li, F.; Jia, J.; Chen, M.; Ju, Q.; Wang, S.; Dang, S.; Li, T.; et al. Baichuan-m2: Scaling medical capability with large verifier system. arXiv 2025, arXiv:2509.02208. [Google Scholar] [CrossRef]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72. [Google Scholar]
- Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
- Smit, A.; Jain, S.; Rajpurkar, P.; Pareek, A.; Ng, A.Y.; Lungren, M.P. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv 2020, arXiv:2004.09167. [Google Scholar] [CrossRef]
- Chen, Z.; Song, Y.; Chang, T.-H.; Wan, X. Generating radiology reports via memory-driven transformer. arXiv 2020, arXiv:2010.16056. [Google Scholar]
- Huang, Z.; Zhang, X.; Zhang, S. Kiut: Knowledge-injected u-transformer for radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 19809–19818. [Google Scholar]
- Huang, L.; Wang, W.; Chen, J.; Wei, X.-Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 4634–4643. [Google Scholar]
- Jin, H.; Che, H.; Lin, Y.; Chen, H. Promptmrg: Diagnosis-driven prompts for medical report generation. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2024; Volume 38, pp. 2607–2615. [Google Scholar]
- Chen, Y.; Xu, S.; Sellergren, A.; Matias, Y.; Hassidim, A.; Shetty, S.; Golden, D.; Yuille, A.L.; Yang, L. CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2025; pp. 78–88. [Google Scholar]
- Shen, C.; Menten, M.J.; Bogunović, H.; Schmidt-Erfurth, U.; Scholl, H.P.N.; Sivaprasad, S.; Lotery, A.; Rueckert, D.; Hager, P.; Holland, R. Spatiotemporal representation learning for short and long medical image time series. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer Nature: Cham, Switzerland, 2024; pp. 656–666. [Google Scholar]
| Training Stage | Epochs | λ (MRM) | λ (MIDM) | λ (IRC) | λ (IRM) | λ (RG) |
|---|---|---|---|---|---|---|
| 1 | 15 | 1 | 1 | 1 | 0 | 0 |
| 2 | 15 | 1 | 0.5 | 1 | 1 | 0.5 |
| 3 | 10 | 0.5 | 0.5 | 0.5 | 0.5 | 1 |
| 4 | 10 | 0 | 0 | 0 | 0 | 1 |
| Dataset | Method | Input Resolution | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| MIMIC-CXR (RRG Task) | R2Gen [41] | 224 × 224 | 0.353 | 0.103 | 0.142 | 0.277 | 0.333 | 0.273 | 0.276 |
| | KiUT [42] | 256 × 256 | 0.393 | 0.113 | 0.160 | 0.285 | 0.371 | 0.318 | 0.321 |
| | BioViL-T [15] | 448 × 448 | - | 0.092 | - | 0.296 | - | - | 0.318 |
| | HERGen [17] | 384 × 384 | 0.395 | 0.122 | 0.156 | 0.285 | 0.415 | 0.301 | 0.317 |
| | TiBiX [16] | 512 × 512 | 0.324 | 0.157 | 0.162 | 0.331 | 0.300 | 0.224 | 0.250 |
| | PromptMRG [44] | 224 × 224 | 0.398 | 0.112 | 0.157 | 0.268 | 0.454 | 0.370 | 0.389 |
| | RECAP [12] | 224 × 224 | 0.429 | 0.125 | 0.168 | 0.288 | 0.389 | 0.443 | 0.393 |
| | KIA [7] | 224 × 224 | 0.401 | 0.138 | 0.167 | 0.307 | 0.504 | 0.425 | 0.461 |
| | MRID (Ours) | 224 × 224 | 0.403 | 0.154 | 0.163 | 0.323 | 0.525 | 0.486 | 0.498 |
| Longitudinal-MIMIC (LRRG Task) | AoANet [43] | 224 × 224 | 0.272 | 0.080 | 0.115 | 0.249 | 0.437 | 0.249 | 0.371 |
| | R2Gen [41] | 224 × 224 | 0.302 | 0.087 | 0.124 | 0.259 | 0.500 | 0.305 | 0.379 |
| | HERGen [17] | 384 × 384 | 0.389 | 0.117 | 0.155 | 0.282 | 0.421 | 0.289 | 0.295 |
| | PromptMRG [44] | 224 × 224 | 0.390 | 0.102 | 0.152 | 0.263 | 0.502 | 0.542 | 0.492 |
| | MRID (Ours) | 224 × 224 | 0.392 | 0.104 | 0.159 | 0.267 | 0.513 | 0.538 | 0.519 |
| Setting | Method | Consolidation | Pl. Effusion | Pneumonia | Pneumothorax | Edema | Average |
|---|---|---|---|---|---|---|---|
| Zero-shot | BioViL-T [15] | 53.6 | 59.7 | 58.0 | 34.9 | 64.2 | 54.1 |
| | MRID (Ours) | 56.9 | 59.6 | 61.9 | 32.6 | 65.6 | 55.3 |
| Few-shot | BioViL-T [15] | 59.7 | 62.4 | 60.1 | 35.3 | 62.6 | 56.0 |
| | MRID (Ours) | 60.0 | 63.1 | 64.7 | 37.8 | 66.1 | 58.3 |
| Supervised | CheXRelNet [11] | 47.0 | 47.0 | 47.0 | 36.0 | 49.0 | 45.2 |
| | BioViL-T [15] | 61.1 | 67.0 | 61.9 | 42.6 | 68.5 | 60.2 |
| | CoCa-CXR [45] | 69.6 | 68.1 | 56.4 | 59.3 | 70.8 | 64.8 |
| | MRID (Ours) | 67.3 | 67.8 | 63.9 | 62.3 | 67.7 | 65.8 |
| Task | Setting | BL-1 | BL-4 | MTR | RL | Pre. | Rec. | F1 | AVG. Acc. (Full Set) | AVG. Acc. (MC Subset) |
|---|---|---|---|---|---|---|---|---|---|---|
| RRG | (a) KIA | 0.401 | 0.138 | 0.167 | 0.307 | 0.504 | 0.425 | 0.461 | - | - |
| | (b) MRID (Asym.) | 0.403 | 0.154 | 0.163 | 0.323 | 0.525 | 0.486 | 0.498 | - | - |
| LRRG | (b) MRID (Asym.) | 0.392 | 0.104 | 0.159 | 0.267 | 0.513 | 0.538 | 0.519 | 65.8 | 57.3 |
| | (c) MRID (Bi-CA) | 0.390 | 0.099 | 0.157 | 0.270 | 0.501 | 0.515 | 0.507 | 63.2 | 49.2 |
| | (d) MRID (Sym.) | 0.394 | 0.104 | 0.156 | 0.268 | 0.509 | 0.529 | 0.514 | 65.0 | 54.4 |
| Setting | Stages | MRM | MIDM | IRC | IRM | RG | BL-1 | BL-4 | MTR | RL | NLG AVG. Δ | Pre. | Rec. | F1 | CE AVG. Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | 1 | | | | | √ | 0.318 | 0.116 | 0.119 | 0.270 | - | 0.412 | 0.305 | 0.379 | - |
| (b) | 2 | √ | | | | √ | 0.356 | 0.127 | 0.132 | 0.284 | 9.5% | 0.442 | 0.315 | 0.399 | 5.3% |
| (c) | 2 | | √ | | | √ | 0.329 | 0.128 | 0.127 | 0.272 | 5.4% | 0.485 | 0.330 | 0.418 | 12.0% |
| (d) | 2 | √ | √ | | | √ | 0.359 | 0.124 | 0.141 | 0.281 | 10.6% | 0.493 | 0.347 | 0.417 | 14.5% |
| (e) | 2 | | | √ | | √ | 0.371 | 0.121 | 0.140 | 0.260 | 8.6% | 0.487 | 0.425 | 0.434 | 24.1% |
| (f) | 2 | | | | √ | √ | 0.347 | 0.134 | 0.148 | 0.278 | 12.8% | 0.464 | 0.354 | 0.420 | 13.3% |
| (g) | 2 | | | √ | √ | √ | 0.386 | 0.143 | 0.155 | 0.289 | 20.4% | 0.482 | 0.436 | 0.439 | 25.3% |
| (h) | 2 | √ | √ | √ | √ | √ | 0.306 | 0.103 | 0.108 | 0.248 | −8.1% | 0.416 | 0.282 | 0.385 | −1.7% |
| (i) | 3 | √ | √ | √ | √ | √ | 0.400 | 0.153 | 0.162 | 0.319 | 27.9% | 0.523 | 0.480 | 0.494 | 38.3% |
| MRID | 4 | √ | √ | √ | √ | √ | 0.403 | 0.154 | 0.163 | 0.323 | 29.0% | 0.525 | 0.486 | 0.498 | 39.4% |
| Setting | Aug. in Stage 1 | Aug. in Stage 2 | Aug. in Stage 3 | BL-1 | BL-4 | MTR | RL | NLG AVG. Δ | Pre. | Rec. | F1 | CE AVG. Δ | Classification AVG. Acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | | | | 0.393 | 0.106 | 0.151 | 0.265 | - | 0.501 | 0.506 | 0.503 | - | 64.1 |
| (b) | √ | | | 0.389 | 0.106 | 0.156 | 0.270 | 1.0% | 0.507 | 0.523 | 0.511 | 2.0% | 64.5 |
| (c) | √ | √ | | 0.392 | 0.104 | 0.159 | 0.267 | 0.9% | 0.513 | 0.538 | 0.519 | 3.9% | 65.8 |
| (d) | √ | √ | √ | 0.387 | 0.102 | 0.152 | 0.265 | −1.1% | 0.511 | 0.535 | 0.520 | 3.7% | 65.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.