Graph-Driven Medical Report Generation with Adaptive Knowledge Distillation
Abstract
1. Introduction
2. Related Work
2.1. Image Captioning
2.2. Medical Report Generation
3. Methods
3.1. Knowledge Integration and Fusion (KIF)
3.1.1. Knowledge Encoding
3.1.2. Image Encoding
3.1.3. Cross-Modal Semantic Fusion
3.2. Diagnosis Prompts Generator
3.3. Initial Report Generation
3.4. Report Optimization via Information Self-Distillation (ISD) Module
4. Experiments
Evaluation Metrics
5. Results
Ablation Validation
6. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| MRG | Medical Report Generation |
| NLG | Natural Language Generation |
| CNNs | Convolutional Neural Networks |
| RNN | Recurrent Neural Network |
| LSTM | Long Short-Term Memory |
| CE | Clinical Efficacy |
| KIF | Knowledge Integration and Fusion |
| DA | Dynamic Aggregation |
| ISD | Information Self-Distillation |
| DPG | Diagnosis Prompts Generator |
| BLA | Blank |
| POS | Positive |
| NEG | Negative |
| UNC | Uncertain |
| GT | Ground Truth |
| TP | True Positive |
| FP | False Positive |
| FN | False Negative |
| LCS | Longest Common Subsequence |
Appendix A. Algorithm Pseudocode
Algorithm 1: Report generation
Data: Medical image I, word embeddings W, CLIP memory M, generation parameters
Result: Generated reports R, predicted classes C, class probabilities P
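To make the data flow of Algorithm 1 concrete, the following is a minimal PyTorch sketch of how the inputs (image I, CLIP memory M) relate to the outputs (report R, classes C, probabilities P). The module names, feature dimensions, and the linear projections standing in for the actual ResNet-101, knowledge-graph encoder, and retrieval module are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the Algorithm 1 forward pass (illustrative, not the authors' code).
import torch
import torch.nn as nn

class ReportGeneratorSketch(nn.Module):
    def __init__(self, d_model=512, num_findings=14, vocab_size=10000):
        super().__init__()
        self.image_proj = nn.Linear(2048, d_model)      # stand-in for ResNet-101 feature maps
        self.knowledge_proj = nn.Linear(768, d_model)   # stand-in for knowledge-graph embeddings
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.dpg = nn.Linear(d_model, num_findings * 4) # DPG head: BLA/POS/NEG/UNC per finding
        layer = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.num_findings = num_findings

    def forward(self, img_feats, concept_embs, retrieved_embs, tgt_embs):
        v = self.image_proj(img_feats)                          # visual tokens
        k = self.knowledge_proj(concept_embs)                   # knowledge tokens
        v_aug, _ = self.cross_attn(v, k, k)                     # knowledge-augmented visual features
        memory = torch.cat([v_aug, retrieved_embs], dim=1)      # aggregate retrieved report context
        logits = self.dpg(v_aug.mean(dim=1)).view(-1, self.num_findings, 4)
        probs = logits.softmax(dim=-1)                          # P: class probabilities
        classes = probs.argmax(dim=-1)                          # C: predicted classes
        hidden = self.decoder(tgt_embs, memory)                 # one teacher-forced decoding pass
        return self.lm_head(hidden), classes, probs             # token scores -> report R

# Toy call with random tensors standing in for real features.
model = ReportGeneratorSketch()
token_scores, classes, probs = model(
    torch.randn(2, 49, 2048), torch.randn(2, 20, 768),
    torch.randn(2, 5, 512), torch.randn(2, 30, 512))
```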
Algorithm 3: ISD training process
Data: Student model, teacher model (frozen), batch data, base class probabilities P
Result: Total loss
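For orientation, here is a hedged sketch of one ISD training step under stated assumptions: a frozen teacher copy, a temperature-scaled KL distillation term, and a distillation weight modulated by the base class probabilities. The exact loss composition and weighting scheme of the ISD module may differ from this.

```python
# Illustrative ISD-style training step (assumed loss composition, not the paper's exact one).
import torch
import torch.nn.functional as F

def isd_step(student, teacher, batch, base_probs, temperature=2.0, alpha=0.5):
    images, target_tokens = batch
    student_logits = student(images)                      # (B, T, vocab)
    with torch.no_grad():                                  # teacher is frozen
        teacher_logits = teacher(images)
    # Standard generation loss against the ground-truth report tokens.
    gen_loss = F.cross_entropy(student_logits.flatten(0, 1), target_tokens.flatten())
    # Soft-label distillation from teacher to student.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Adaptive weighting: rely on the teacher more when the base classifier is confident.
    confidence = base_probs.max(dim=-1).values.mean()
    return gen_loss + alpha * confidence * kd_loss

# Toy usage with stub models that emit random logits.
B, T, V = 2, 30, 100
stub = lambda imgs: torch.randn(B, T, V)
loss = isd_step(stub, stub,
                (torch.randn(B, 3, 224, 224), torch.randint(0, V, (B, T))),
                base_probs=torch.rand(B, 14, 4))
```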
Appendix B. Detailed Experimental Setup
Appendix C. Comprehensive Comparison of Model Implementations
Model | Innovation | Architecture | External Knowledge | Operational Workflow | Testing and Evaluation |
---|---|---|---|---|---|
R2GenCMN [18] | Memory-Driven Cross-Modal Attention: 1. Learnable Memory Matrix: A shared memory matrix stores frequently occurring clinical semantic units (e.g., “cardiomegaly,” “pulmonary opacity”) learned from the dataset. 2. Cross-Modal Interaction: At each decoding step, an attention mechanism dynamically retrieves relevant information from the memory matrix. This retrieved context is fused with the image features to guide the next word generation, ensuring terminological accuracy. | Encoder: CNN Decoder: LSTM | Implicit: Learned from the training data and stored in the memory matrix. | 1. Encode image into features using CNN. 2. Query Memory: Decoder queries the memory matrix via attention for relevant clinical context. 3. Fuse and Generate: Fuses memory context, image features, and previous hidden state to predict the next word. 4. Repeat until the report is complete. | Datasets: IU X-Ray [27], MIMIC-CXR [47]. Automated Metrics: Standard NLG scores (BLEU [9], ROUGE [10]). Clinical Metrics: CheXbert [46] F1 score. Ablation: Performance comparison against baselines (e.g., SAT, Transformer) and ablated model (w/o memory module) to prove its efficacy. |
GSKET [20] | Gated Fusion of General and Specific Knowledge: 1. Dual Knowledge Sources: Incorporates “specific knowledge” (detected concepts from the image) and “general knowledge” (descriptions of concepts retrieved from an external medical KB like UMLS). 2. Gating Mechanism: A learned gating network dynamically computes weights for the two knowledge sources, adaptively controlling their contribution to generation for refined knowledge utilization. | Encoder: CNN Decoder: Transformer | Explicit: 1. Specific: Pre-trained visual concept detector. 2. General: Medical knowledge base (UMLS). | 1. Extract concepts from the image using a detector. 2. Retrieve information for concepts from UMLS. 3. Encode and Gate: Encode both knowledge types; the gating network fuses them. 4. Generate: Transformer decoder uses fused knowledge and image features to generate the report. | Datasets: IU X-Ray [27], MIMIC-CXR [47]. Metrics: Standard automated NLG metrics (BLEU [9], ROUGE [10], METEOR [11]) and clinical efficacy (CheXbert-based [46] recall, F1). Ablation: Tests variants (e.g., specific-only, general-only, no gate) to validate the necessity of both knowledge sources and the gating mechanism. |
DCL [54] | Dual Contrastive Learning Strategy: 1. Image–Text Contrastive Learning: In a joint embedding space, pulls the features of matched image-report pairs closer and pushes unmatched pairs apart to enhance modality alignment. 2. Image–Image Contrastive Learning: Pulls “abnormal–abnormal” image pairs closer and pushes “abnormal–normal” pairs apart, forcing the model to focus on abnormal features and mitigating the dataset bias towards “normal” cases. | Encoder: CNN Decoder: LSTM | None. Learns better representations from the data itself via the contrastive strategy. | 1. Encode images and reports. 2. Construct Pairs: Build positive/negative pairs within a training batch. 3. Compute Losses: Calculate the dual contrastive losses and the standard generation loss. 4. Generate the report. | Datasets: IU X-Ray [27], MIMIC-CXR [47]. Metrics: Standard automated NLG metrics (BLEU [9], ROUGE [10], METEOR [11]) and clinical efficacy (CheXbert-based [46] recall, F1), with a strong focus on performance improvement on abnormal/rare disease cases compared to baselines. |
KiUT [55] | Unified Transformer for Knowledge Injection: Abandons complex fusion designs. Instead, it unifies visual features (converted to tokens) and detected text concepts (keywords) by concatenating them into a single multimodal sequence. This sequence is processed by a standard Transformer, which uses its self-attention mechanism to automatically learn the intrinsic relationships and dependencies between visual and knowledge tokens. | Encoder: CNN + Concept Encoder Decoder: Transformer | Explicit: Pre-trained visual concept detector (provides keywords). | 1. Detect concepts from the image. 2. Tokenize: Map image features to visual tokens; embed concepts into knowledge tokens. 3. Concatenate: Form a single sequence of visual + knowledge tokens. 4. Process and Generate: The unified Transformer encodes the sequence and its decoder autoregressively generates the report. | Datasets: IU X-Ray [27], MIMIC-CXR [47]. Metrics: Standard automated NLG metrics (BLEU [9], ROUGE [10], METEOR [11]) and clinical efficacy (CheXbert-based [46] recall, F1). Comparison: Contrasted with models using complex cross-modal attention, demonstrating that its simple unified architecture can achieve superior or comparable performance. |
METransformer [4] | Multi-Edge Sparse Attention: 1. Image as Graph: Treats image patches as nodes in a graph. 2. Sparse Attention: Does not compute attention between all nodes. Instead, it predefines connections (edges) between nodes (e.g., only between spatially adjacent or semantically similar patches). This drastically reduces computational complexity and enhances the model’s ability to model relationships between local key regions. | Encoder: ViT variant (Graph Attention Network) Decoder: Transformer | None. Focuses on internal image structure via sparse attention. | 1. Graph Construction: Image is split into patches; a sparse graph is built defining connections between relevant patches. 2. Sparse Encoding: Graph-based sparse attention is computed only between connected nodes. 3. Generation: The Transformer decoder generates the report from the encoded features. | Datasets: IU X-Ray [27], MIMIC-CXR [47]. Metrics: Standard automated NLG metrics (BLEU [9], ROUGE [10], METEOR [11]) and clinical efficacy (CheXbert-based [46] recall, F1). Comparison: Compared against standard ViT/Transformer to demonstrate superior computational efficiency and lower resource consumption while maintaining comparable generation quality. |
RGRG [43] | Retrieval-Augmented Generation (RAG) Framework: 1. Two-Stage Paradigm: First, retrieves the most similar historical cases and their reports (reference templates) from a large database via image retrieval. Then, a generative model conditionally generates a new report based on the current image features and the retrieved report context. 2. Advantage: Generated reports maintain high professionalism (by referencing ground-truth reports) and specificity (by integrating current image features), avoiding generic or vague descriptions. | Encoder: CNN (for retrieval and generation) Decoder: Transformer | Explicit: The entire training set database used as a retrieval corpus. | 1. Retrieve: The query image is used to find K most similar images and their reports from the DB. 2. Encode: The query image and the retrieved report text are encoded. 3. Generate: The Transformer decoder is conditioned on both the query image features and the retrieved report encoding to produce the final report. | Datasets: IU X-Ray [27], MIMIC-CXR [47]. Metrics: Standard automated NLG metrics (BLEU [9], ROUGE [10], METEOR [11]) and clinical efficacy (CheXbert-based [46] recall, F1). Ablation: Analysis of retrieval quality impact (e.g., varying K, using different retrievers). Comparison to pure generative models to show advantage in generating clinically accurate terms. |
PromptMRG [45] | Learnable Continuous Prompt Vectors: 1. Dynamic Prompt Generation: A dedicated module encodes detected medical concepts into a set of continuous prompt vectors (“soft prompts”), not natural language. 2. Prompt-Guided Generation: These learned vectors are prepended to the input of a standard Transformer decoder (the report generator). The decoder is specifically trained to attend to these prompts, which guide it to generate coherent and professional reports that include the specified medical findings. | Encoder: CNN (concept detector) Decoder: Transformer | Explicit: Pre-trained visual concept detector. | 1. Detect Concepts: A CNN-based concept detector analyzes the image to output keywords. 2. Generate Prompts: A prompt generator module maps these keywords to a set of continuous prompt vectors. 3. Generate Report: These prompt vectors are fed as a prefix to a Transformer decoder, which is trained to generate the full report conditioned on them. | Datasets: IU X-Ray [27], MIMIC-CXR [47]. Metrics: Standard automated NLG metrics (BLEU [9], ROUGE [10], METEOR [11]) and clinical efficacy (CheXbert-based [46] recall, F1). Human Evaluation: Radiologists rate generated reports for fluency, clinical accuracy, and usefulness. Ablation: Tests the impact of different prompt designs and the necessity of the prompt module. |
Ours | 1. Knowledge Integration and Fusion: A knowledge graph framework that explicitly models disease progression patterns while maintaining tight feature–text coupling. 2. Information Self-Distillation: An adaptive knowledge distillation mechanism that dynamically adjusts to clinical contexts. | Encoder: CNN (ResNet-101) Decoder: Transformer | Explicit: General knowledge graph and report database. | 1. Encode Knowledge and Vision: The knowledge encoder processes structured medical concepts into embeddings, while the image encoder extracts visual features from the input X-ray. 2. Inject Knowledge: The K-image encoder fuses these streams via cross-attention, producing knowledge-augmented visual features. 3. Retrieve and Aggregate: A CLIP-based retriever fetches relevant reports from a database, which are dynamically aggregated with visual context (see the retrieval sketch following this table). 4. Generate Report: The aggregated features guide a Transformer decoder to produce the final diagnostic report auto-regressively. | Datasets: IU X-Ray [27], MIMIC-CXR [47]. Metrics: Standard automated NLG metrics (BLEU [9], ROUGE [10], METEOR [11]) and clinical efficacy (CheXbert-based [46] recall, F1). |
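The retrieval step in the last row above (and in the retrieval-augmented baselines) amounts to nearest-neighbour search over precomputed report embeddings. The sketch below shows one plausible form of it; the embedding size, cosine similarity, and top-k value are assumptions for illustration, not the settings of any cited model.

```python
# Illustrative CLIP-memory retrieval: fetch the k most similar stored reports.
import torch
import torch.nn.functional as F

def retrieve_reports(query_image_emb, memory_embs, memory_reports, k=3):
    q = F.normalize(query_image_emb, dim=-1)   # (D,) query image embedding
    m = F.normalize(memory_embs, dim=-1)       # (N, D) stored report/image embeddings
    scores = m @ q                             # cosine similarity to every memory entry
    top = scores.topk(k).indices
    return [memory_reports[i] for i in top.tolist()], scores[top]

# Toy usage: 100 memory entries with 512-d embeddings.
memory = torch.randn(100, 512)
reports = [f"report_{i}" for i in range(100)]
hits, sims = retrieve_reports(torch.randn(512), memory, reports, k=3)
```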
References
- Li, M.; Cai, W.; Verspoor, K.; Pan, S.; Liang, X.; Chang, X. Cross-modal clinical graph transformer for ophthalmic report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20656–20665. [Google Scholar]
- Li, Y.; Yang, B.; Cheng, X.; Zhu, Z.; Li, H.; Zou, Y. Unify, align and refine: Multi-level semantic alignment for radiology report generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2863–2874. [Google Scholar]
- Wang, J.; Bhalerao, A.; He, Y. Cross-modal prototype driven network for radiology report generation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 563–579. [Google Scholar]
- Wang, Z.; Liu, L.; Wang, L.; Zhou, L. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11558–11567. [Google Scholar]
- Zhou, H.Y.; Chen, X.; Zhang, Y.; Luo, R.; Wang, L.; Yu, Y. Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nat. Mach. Intell. 2022, 4, 32–40. [Google Scholar] [CrossRef]
- Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3462–3471. [Google Scholar] [CrossRef]
- Demner-Fushman, D.; Kohli, M.; Rosenman, M.; Shooshan, S.; Rodriguez, L.; Antani, S.; Thoma, G.; McDonald, C. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2016, 23, 304–310. [Google Scholar] [CrossRef] [PubMed]
- Lebret, R.; Grangier, D.; Auli, M. Neural Text Generation from Structured Data with Application to the Biography Domain. arXiv 2016, arXiv:1603.07771. [Google Scholar] [CrossRef]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Denkowski, M.; Lavie, A. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 30 July 2011; pp. 85–91. [Google Scholar]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
- Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10578–10587. [Google Scholar]
- Hirota, Y.; Nakashima, Y.; Garcia, N. Quantifying societal bias amplification in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13450–13459. [Google Scholar]
- Hu, X.; Gan, Z.; Wang, J.; Yang, Z.; Liu, Z.; Lu, Y.; Wang, L. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17980–17989. [Google Scholar]
- Liu, F.; Ge, S.; Zou, Y.; Wu, X. Competence-based multimodal curriculum learning for medical report generation. arXiv 2022, arXiv:2206.14579. [Google Scholar]
- Chen, Z.; Song, Y.; Chang, T.H.; Wan, X. Generating radiology reports via memory-driven transformer. arXiv 2020, arXiv:2010.16056. [Google Scholar]
- Chen, Z.; Shen, Y.; Song, Y.; Wan, X. Cross-modal memory networks for radiology report generation. arXiv 2022, arXiv:2204.13258. [Google Scholar] [CrossRef]
- Li, M.; Liu, R.; Wang, F.; Chang, X.; Liang, X. Auxiliary signal-guided knowledge encoder-decoder for medical report generation. World Wide Web 2023, 26, 253–270. [Google Scholar] [CrossRef] [PubMed]
- Yang, S.; Wu, X.; Ge, S.; Zhou, S.K.; Xiao, L. Knowledge matters: Chest radiology report generation with general and specific knowledge. Med. Image Anal. 2022, 80, 102510. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.; Wang, X.; Xu, Z.; Yu, Q.; Yuille, A.; Xu, D. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12910–12917. [Google Scholar]
- Liu, F.; Wu, X.; Ge, S.; Fan, W.; Zou, Y. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13753–13762. [Google Scholar]
- Jing, B.; Xie, P.; Xing, E. On the automatic generation of medical imaging reports. arXiv 2017, arXiv:1711.08195. [Google Scholar]
- Wang, Z.; Han, H.; Wang, L.; Li, X.; Zhou, L. Automated radiographic report generation purely on transformer: A multicriteria supervised approach. IEEE Trans. Med. Imaging 2022, 41, 2803–2813. [Google Scholar] [CrossRef] [PubMed]
- Yan, B.; Pei, M. Clinical-bert: Vision-language pre-training for radiograph diagnosis and reports generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2982–2990. [Google Scholar]
- Johnson, A.; Pollard, T.; Mark, R.; Berkowitz, S.; Horng, S. MIMIC-CXR Database (version 2.1.0). PhysioNet 2024, 13026, C2JT1Q. [Google Scholar] [CrossRef]
- Liu, W.; Xue, Y.; Lin, C.; Boumaraf, S. Dataset: IU X-Ray Dataset. 2025. Available online: https://service.tib.eu/ldmservice/dataset/iu-x-ray-dataset (accessed on 9 October 2025).
- Gu, J.; Wang, G.; Cai, J.; Chen, T. An empirical study of language cnn for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1222–1231. [Google Scholar]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
- Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 4634–4643. [Google Scholar]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
- Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; Ji, R. RSTNet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15465–15474. [Google Scholar]
- Yang, X.; Tang, K.; Zhang, H.; Cai, J. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10685–10694. [Google Scholar]
- Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
- Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. Coca: Contrastive captioners are image-text foundation models. arXiv 2022, arXiv:2205.01917. [Google Scholar]
- Liu, G.; Hsu, T.M.H.; McDermott, M.; Boag, W.; Weng, W.H.; Szolovits, P.; Ghassemi, M. Clinically accurate chest x-ray report generation. In Proceedings of the Machine Learning for Healthcare Conference, Ann Arbor, MI, USA, 9–10 August 2019; pp. 249–269. [Google Scholar]
- Xue, Y.; Xu, T.; Rodney Long, L.; Xue, Z.; Antani, S.; Thoma, G.R.; Huang, X. Multimodal recurrent model with attention for automated radiology report generation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Granada, Spain, 16–20 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 457–466. [Google Scholar]
- Xue, Y.; Huang, X. Improved disease classification in chest x-rays with transferred features from report generation. In Proceedings of the International Conference on Information Processing in Medical Imaging, Hong Kong, China, 2–7 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 125–138. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
- Hou, B.; Kaissis, G.; Summers, R.M.; Kainz, B. Ratchet: Medical transformer for chest x-ray diagnosis and reporting. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 293–303. [Google Scholar]
- Wang, Z.; Tang, M.; Wang, L.; Li, X.; Zhou, L. A medical semantic-assisted transformer for radiographic report generation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 655–664. [Google Scholar]
- Tanida, T.; Müller, P.; Kaissis, G.; Rueckert, D. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7433–7442. [Google Scholar]
- Endo, M.; Krishnan, R.; Krishna, V.; Ng, A.Y.; Rajpurkar, P. Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model. In Proceedings of the Machine Learning for Health, Online, 4 December 2021; pp. 209–219. [Google Scholar]
- Jin, H.; Che, H.; Lin, Y.; Chen, H. Promptmrg: Diagnosis-driven prompts for medical report generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver Convention Center, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 2607–2615. [Google Scholar]
- Smit, A.; Jain, S.; Rajpurkar, P.; Pareek, A.; Ng, A.Y.; Lungren, M.P. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv 2020, arXiv:2004.09167. [Google Scholar] [CrossRef]
- Johnson, A.E.W.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef] [PubMed]
- Goldberger, A.L.; Amaral, L.A.N.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, e215–e220. [Google Scholar] [CrossRef] [PubMed]
- Nicolson, A.; Dowling, J.; Koopman, B. Improving chest X-ray report generation by leveraging warm starting. Artif. Intell. Med. 2023, 144, 102633. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers). pp. 4171–4186. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Yang, S.; Wu, X.; Ge, S.; Zheng, Z.; Zhou, S.K.; Xiao, L. Radiology report generation with a learned knowledge base and multi-modal alignment. Med. Image Anal. 2023, 86, 102798. [Google Scholar] [CrossRef] [PubMed]
- Li, M.; Lin, B.; Chen, Z.; Lin, H.; Liang, X.; Chang, X. Dynamic graph enhanced contrastive learning for chest x-ray report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3334–3343. [Google Scholar]
- Huang, Z.; Zhang, X.; Zhang, S. Kiut: Knowledge-injected u-transformer for radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19809–19818. [Google Scholar]
| Statistic | IU X-Ray Train | IU X-Ray Val | IU X-Ray Test | IU X-Ray Total | MIMIC-CXR Train | MIMIC-CXR Val | MIMIC-CXR Test | MIMIC-CXR Total |
|---|---|---|---|---|---|---|---|---|
| Image # | 5226 | 748 | 1496 | 7470 | 368,960 | 2991 | 5159 | 372,470 |
| Report # | 2770 | 395 | 790 | 3955 | 270,790 | 2130 | 3858 | 276,778 |
| Patient # | 2770 | 395 | 790 | 3955 | 64,586 | 500 | 293 | 65,379 |
| Labels # | 2770 | 395 | 790 | 3955 | 270,790 | 2130 | 3858 | 276,778 |
Library Name | Version | Library Name | Version |
---|---|---|---|
Python | 3.8.13 | scipy | 1.10.1 |
PyTorch | 2.4.1 | pandas | 2.0.3 |
transformers | 4.25.0 | scikit-learn | 1.3.2 |
torchvision | 0.19.1 | timm | 0.6.12 |
opencv-python | 4.10.0.84 | NumPy | 1.23.4 |
| Dataset | Model | CE Precision | CE Recall | CE F1 | BLEU-1 | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|---|
| MIMIC-CXR | R2GenCMN [18] | 0.333 | 0.273 | 0.276 | 0.353 | 0.103 | 0.142 | 0.277 |
| | GSKET [20] | 0.458 | 0.348 | 0.371 | 0.363 | 0.115 | – | 0.284 |
| | M2KT [53] | 0.420 | 0.339 | 0.352 | 0.386 | 0.111 | – | 0.274 |
| | DCL [54] | 0.471 | 0.352 | 0.373 | – | 0.109 | 0.150 | 0.284 |
| | KiUT [55] | 0.371 | 0.318 | 0.321 | 0.393 | 0.113 | 0.160 | 0.285 |
| | METransformer [4] | 0.364 | 0.309 | 0.311 | 0.386 | 0.124 | 0.152 | 0.291 |
| | RGRG *# [43] | 0.461 | 0.475 | 0.447 | 0.373 | 0.126 | 0.168 | 0.264 |
| | PromptMRG [45] | 0.501 | 0.509 | 0.476 | 0.398 | 0.112 | 0.157 | 0.268 |
| | Ours | 0.509 | 0.509 | 0.480 | 0.408 | 0.114 | 0.158 | 0.273 |
| IU X-Ray | R2GenCMN † [18] | 0.141 | 0.136 | 0.136 | 0.325 | 0.059 | 0.131 | 0.253 |
| | M2KT † [53] | 0.153 | 0.145 | 0.145 | 0.371 | 0.078 | 0.153 | 0.261 |
| | DCL † [54] | 0.168 | 0.167 | 0.162 | 0.354 | 0.074 | 0.152 | 0.267 |
| | RGRG *† [43] | 0.183 | 0.187 | 0.180 | 0.266 | 0.063 | 0.146 | 0.180 |
| | PromptMRG † [45] | 0.213 | 0.229 | 0.211 | 0.401 | 0.098 | 0.160 | 0.281 |
| | Ours | 0.206 | 0.209 | 0.199 | 0.421 | 0.106 | 0.163 | 0.314 |
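As context for the CE columns above: CheXbert labels each generated and reference report for a fixed set of findings, and precision, recall, and F1 compare the two label sets. The sketch below computes a micro-averaged version from binary finding labels; whether the reported numbers are micro- or example-averaged is not stated here, so treat this as illustrative.

```python
# Illustrative micro-averaged CE precision/recall/F1 over binary CheXbert finding labels.
import numpy as np

def ce_scores(pred_labels, gt_labels, eps=1e-8):
    """pred_labels, gt_labels: arrays of shape (num_reports, num_findings) with 0/1 entries."""
    tp = np.logical_and(pred_labels == 1, gt_labels == 1).sum()  # true positives
    fp = np.logical_and(pred_labels == 1, gt_labels == 0).sum()  # false positives
    fn = np.logical_and(pred_labels == 0, gt_labels == 1).sum()  # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Toy usage with random labels for 10 reports and 14 findings.
rng = np.random.default_rng(0)
p, r, f1 = ce_scores(rng.integers(0, 2, (10, 14)), rng.integers(0, 2, (10, 14)))
```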
| Dataset | ISD | KIF | CE Prec. | CE Rec. | CE F1 | B-1 | B-4 | METEOR | RG-L |
|---|---|---|---|---|---|---|---|---|---|
| MIMIC-CXR | ✗ | ✗ | 0.505 | 0.466 | 0.455 | 0.387 | 0.100 | 0.157 | 0.268 |
| | ✓ | ✗ | 0.503 | 0.519 | 0.480 | 0.403 | 0.111 | 0.157 | 0.270 |
| | ✗ | ✓ | 0.519 | 0.519 | 0.490 | 0.402 | 0.110 | 0.157 | 0.270 |
| | ✓ | ✓ | 0.509 | 0.509 | 0.480 | 0.408 | 0.114 | 0.158 | 0.273 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, J.; Huang, X.; Jiang, M.; Li, Y.; Zou, Z.; Qian, D. Graph-Driven Medical Report Generation with Adaptive Knowledge Distillation. Appl. Sci. 2025, 15, 10974. https://doi.org/10.3390/app152010974
Chen J, Huang X, Jiang M, Li Y, Zou Z, Qian D. Graph-Driven Medical Report Generation with Adaptive Knowledge Distillation. Applied Sciences. 2025; 15(20):10974. https://doi.org/10.3390/app152010974
Chicago/Turabian Style: Chen, Jingqian, Xin Huang, Mingfeng Jiang, Yang Li, Zimin Zou, and Diqing Qian. 2025. "Graph-Driven Medical Report Generation with Adaptive Knowledge Distillation." Applied Sciences 15, no. 20: 10974. https://doi.org/10.3390/app152010974
APA Style: Chen, J., Huang, X., Jiang, M., Li, Y., Zou, Z., & Qian, D. (2025). Graph-Driven Medical Report Generation with Adaptive Knowledge Distillation. Applied Sciences, 15(20), 10974. https://doi.org/10.3390/app152010974