M3AE-Distill: An Efficient Distilled Model for Medical Vision–Language Downstream Tasks
Abstract
1. Introduction
- KD is integrated into the pre-training pipeline by aligning both attention maps and hidden states, enabling the student to approximate the teacher’s intermediate representations (a minimal sketch of these alignment losses is given after this list).
- An attention-guided masking strategy is proposed for the MIM objective. It uses attention maps from the teacher model to identify and mask semantically salient regions in the image. By reconstructing these regions, the student is encouraged to combine complementary visual and textual features, thereby facilitating fine-grained cross-modal alignment.
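To make the first contribution concrete, the sketch below shows one way the hidden-state and attention-map alignment can be written as two auxiliary losses in PyTorch. It is a minimal illustration under assumed tensor shapes, an externally chosen student–teacher layer mapping, and a hypothetical projection layer `proj`; it is not the authors’ exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

def distillation_losses(student_hidden, teacher_hidden,
                        student_attn, teacher_attn, proj: nn.Linear):
    """Hidden-state and attention-map distillation for one mapped layer pair (sketch).

    student_hidden: (B, N, d_s) student hidden states
    teacher_hidden: (B, N, d_t) teacher hidden states (same token count assumed)
    student_attn:   (B, H, N, N) student attention probabilities
    teacher_attn:   (B, H, N, N) teacher attention probabilities (same head count assumed)
    proj:           linear layer projecting the student width d_s to the teacher width d_t
    """
    # Hidden-layer distillation: MSE between projected student states and teacher states.
    hidden_loss = F.mse_loss(proj(student_hidden), teacher_hidden)
    # Attention distillation: match the attention maps of the mapped layers.
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    return hidden_loss, attn_loss
```

In practice the two terms would be summed, possibly with weights, over the chosen student–teacher layer pairs and added to the pre-training objectives.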
2. Related Work
2.1. Vision–Language Model
2.2. Knowledge Distillation
3. Method
3.1. Pre-Training Tasks
3.1.1. Masked Language Modeling
3.1.2. Masked Image Modeling
3.1.3. Image–Text Matching
3.2. Knowledge Distillation
3.2.1. Hidden Layer Distillation
3.2.2. Attention Distillation
3.3. Attention-Guided Masked Image Modeling
3.3.1. Stage 1: Attention Score Matrix Computation
3.3.2. Stage 2: Attention-Guided Progressive Mask Generation
1. Sort the per-patch attention scores in ascending order to obtain the sorted attention matrix.
2. Progressively adjust the proportion r of the high-attention group according to training progress.
3. Divide the patches into high-attention and low-attention groups according to r. All patches in the high-attention group are treated as key regions and are masked. In the low-attention group, a proportion of the patches is randomly masked to introduce noise and enhance robustness.
4. Recombine the two masked groups and restore them to their original order, yielding the final binary mask matrix (a sketch of this procedure follows).
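The following is a minimal PyTorch sketch of the four steps above. The linear schedule for r and the hyperparameters `r_start`, `r_end`, and `low_ratio` are illustrative assumptions; the paper’s exact schedule, ratios, and attention-score computation may differ.

```python
import torch

def attention_guided_mask(attn_scores, step, total_steps,
                          r_start=0.25, r_end=0.75, low_ratio=0.1):
    """Attention-guided progressive mask generation (illustrative sketch).

    attn_scores:       (B, N) per-patch attention scores, e.g. an average of the
                       teacher's and student's [CLS]-to-patch attention.
    step, total_steps: training progress used to schedule the ratio r.
    r_start, r_end, low_ratio: assumed hyperparameters, not taken from the paper.
    Returns a (B, N) mask in which 1 marks a patch to be masked.
    """
    B, N = attn_scores.shape

    # Step 2: grow the high-attention proportion r linearly with training progress.
    progress = step / max(total_steps, 1)
    r = r_start + (r_end - r_start) * progress

    # Step 1: sort patches by attention score in ascending order.
    order = attn_scores.argsort(dim=1)      # column i holds the index of the i-th lowest patch
    n_high = int(r * N)                     # size of the high-attention group

    # Step 3: mask the whole high-attention group plus a random fraction of the rest.
    mask_sorted = torch.zeros(B, N, device=attn_scores.device)
    mask_sorted[:, N - n_high:] = 1.0       # high-attention group sits at the top of the order
    noise = (torch.rand(B, N, device=attn_scores.device) < low_ratio).float()
    noise[:, N - n_high:] = 0.0             # random noise only applies to the low-attention group
    mask_sorted = torch.maximum(mask_sorted, noise)

    # Step 4: restore the original patch order to obtain the final binary mask.
    mask = torch.zeros_like(mask_sorted)
    mask.scatter_(1, order, mask_sorted)
    return mask
```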
3.3.3. Case Study
4. Experiments
4.1. Pre-Training Datasets
4.2. Downstream Datasets
4.3. Implementation Details
4.3.1. Experiment Settings
4.3.2. Teacher–Student Model Comparison
4.4. Results and Discussion
4.4.1. Medical VQA Results
4.4.2. Medical Classification Results
4.4.3. Medical Retrieval Results
4.4.4. Efficiency Comparison Results
4.4.5. Ablation Studies
Module Ablation
Attention Score Source
4.4.6. Case Study
Feature Distribution Visualization via PCA
5. Conclusions
Limitations and Prospects
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
References
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
- Singh, S.; Karimi, S.; Ho-Shon, K.; Hamey, L. From chest x-rays to radiology reports: A multimodal machine learning approach. In Proceedings of the 2019 Digital Image Computing: Techniques and Applications (DICTA), Perth, Australia, 2–4 December 2019; pp. 1–8.
- Moon, J.H.; Lee, H.; Shin, W.; Kim, Y.H.; Choi, E. Multi-modal understanding and generation for medical images and text via vision-language pre-training. IEEE J. Biomed. Health Inform. 2022, 26, 6070–6080.
- Eslami, S.; Meinel, C.; De Melo, G. PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain? In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia, 2–6 May 2023; pp. 1151–1163.
- Chen, Z.; Du, Y.; Hu, J.; Liu, Y.; Li, G.; Wan, X.; Chang, T.H. Multi-modal masked autoencoders for medical vision-and-language pre-training. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 679–689.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
- Zhao, L.; Qian, X.; Guo, Y.; Song, J.; Hou, J.; Gong, J. MSKD: Structured knowledge distillation for efficient medical image segmentation. Comput. Biol. Med. 2023, 164, 107284.
- Zeng, X.; Ji, Z.; Zhang, H.; Chen, R.; Liao, Q.; Wang, J.; Lyu, T.; Zhao, L. DSP-KD: Dual-stage progressive knowledge distillation for skin disease classification. Bioengineering 2024, 11, 70.
- Wang, J.; Huang, S.; Du, H.; Qin, Y.; Wang, H.; Zhang, W. MHKD-MVQA: Multimodal hierarchical knowledge distillation for medical visual question answering. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; pp. 567–574.
- Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 5583–5594.
- Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705.
- Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. Tinybert: Distilling bert for natural language understanding. arXiv 2019, arXiv:1909.10351.
- Wang, T.; Zhou, W.; Zeng, Y.; Zhang, X. Efficientvlm: Fast and accurate vision-language models via knowledge distillation and modal-adaptive pruning. arXiv 2022, arXiv:2210.07795.
- Xie, X.; Pan, X.; Zhang, W.; An, J. A context hierarchical integrated network for medical image segmentation. Comput. Electr. Eng. 2022, 101, 108029.
- Han, Y.; Holste, G.; Ding, Y.; Tewfik, A.; Peng, Y.; Wang, Z. Radiomics-guided global-local transformer for weakly supervised pathology localization in chest X-rays. IEEE Trans. Med. Imaging 2022, 42, 750–761.
- Leem, S.; Seo, H. Attention guided CAM: Visual explanations of vision transformer guided by self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 2956–2964.
- Kakogeorgiou, I.; Gidaris, S.; Psomas, B.; Avrithis, Y.; Bursuc, A.; Karantzalos, K.; Komodakis, N. What to hide from your students: Attention-guided masked image modeling. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 300–318.
- Pelka, O.; Koitka, S.; Rückert, J.; Nensa, F.; Friedrich, C.M. Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) Workshop, Granada, Spain, 16–20 September 2018; pp. 180–189.
- Subramanian, S.; Wang, L.L.; Mehta, S.; Bogin, B.; van Zuylen, M.; Parasa, S.; Singh, S.; Gardner, M.; Hajishirzi, H. Medicat: A dataset of medical images, captions, and textual references. arXiv 2020, arXiv:2010.06000.
- Lau, J.J.; Gayen, S.; Ben Abacha, A.; Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 2018, 5, 180251.
- Liu, B.; Zhan, L.M.; Xu, L.; Ma, L.; Yang, Y.; Wu, X.M. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 1650–1654.
- Ben Abacha, A.; Hasan, S.A.; Datla, V.V.; Demner-Fushman, D.; Müller, H. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. In Proceedings of the CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes, Lugano, Switzerland, 9–12 September 2019.
- Wu, T.L.; Singh, S.; Paul, S.; Burns, G.; Peng, N. Melinda: A multimodal dataset for biomedical experiment method classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 14076–14084.
- Nguyen, B.D.; Do, T.T.; Nguyen, B.X.; Do, T.; Tjiputra, E.; Tran, Q.D. Overcoming data limitation in medical visual question answering. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; pp. 522–530.
- Liu, B.; Zhan, L.M.; Wu, X.M. Contrastive pre-training and representation distillation for medical visual question answering based on radiology images. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; pp. 210–220.
- Yang, Z.; He, X.; Gao, J.; Deng, L.; Smola, A. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 21–29.
| Model Name | Uni-Modal Image (Layers/Params) | Uni-Modal Text (Layers/Params) | Multi-Modal Image (Layers/Params) | Multi-Modal Text (Layers/Params) | Total Params |
|---|---|---|---|---|---|
| Teacher (M3AE) | 12 / 104 M | 12 / 124 M | 6 / 56.7 M | 6 / 56.7 M | 341.4 M |
| Student (Small) | 5 / 54.7 M | 4 / 67.9 M | 1 / 9.5 M | 1 / 9.5 M | 141.6 M |
| Student (Base) | 7 / 68.9 M | 6 / 82.1 M | 2 / 18.9 M | 2 / 18.9 M | 188.8 M |
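For reference, the layer counts in the table can be written as configuration stubs, together with a uniform layer-mapping helper of the kind commonly used when distilling into a shallower student. Only the layer counts come from the table; the dictionary layout and the TinyBERT-style mapping are illustrative assumptions, not the authors’ released configuration.

```python
# Layer counts taken from the teacher-student comparison table above.
CONFIGS = {
    "teacher_m3ae":  {"uni_image": 12, "uni_text": 12, "multi_image": 6, "multi_text": 6},
    "student_small": {"uni_image": 5,  "uni_text": 4,  "multi_image": 1, "multi_text": 1},
    "student_base":  {"uni_image": 7,  "uni_text": 6,  "multi_image": 2, "multi_text": 2},
}

def uniform_layer_map(n_student: int, n_teacher: int) -> list[int]:
    """Map each student layer to a teacher layer index using a uniform
    (TinyBERT-style) mapping -- an assumption, not necessarily the paper's choice."""
    return [(i + 1) * n_teacher // n_student - 1 for i in range(n_student)]

# e.g. the Small student's 5 uni-modal image layers against the teacher's 12:
# uniform_layer_map(5, 12) -> [1, 3, 6, 8, 11]
```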
| Methods | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Overall | SLAKE Open | SLAKE Closed | SLAKE Overall | VQA-2019 Overall |
|---|---|---|---|---|---|---|---|
| MEVF-SAN [27] | 49.20 | 73.90 | 64.10 | 75.30 | 78.40 | 76.50 | 68.90 |
| MEVF-BAN [27] | 49.20 | 77.20 | 66.10 | 77.80 | 79.80 | 78.60 | 77.86 |
| CPRD-BAN [28] | 52.50 | 77.90 | 67.80 | 79.50 | 83.40 | 81.10 | - |
| PubMedCLIP [6] | 60.10 | 80.00 | 72.10 | 78.40 | 82.50 | 80.10 | - |
| M3AE [7] | 67.23 | 83.46 | 77.01 | 80.31 | 87.82 | 83.25 | 79.87 |
| M3AE-98% [7] | 65.89 | 81.79 | 75.47 | 78.70 | 86.06 | 81.59 | 79.27 |
| M3AE-Distill-Small | 64.25 | 79.49 | 73.45 | 78.17 | 83.65 | 80.32 | 74.07 |
| M3AE-Distill-Base | 65.92 | 81.87 | 75.55 | 80.57 | 84.62 | 82.16 | 78.46 |
| Methods | Modality | Accuracy (%) |
|---|---|---|
| ResNet-101 [1] | Image | 63.84 |
| RoBERTa [2] | Text | 74.60 |
| NLF [26] | Image + Text | 76.60 |
| SAN [29] | Image + Text | 72.30 |
| M3AE [7] | Image + Text | 78.50 |
| M3AE-98% [7] | Image + Text | 76.93 |
| M3AE-Distill-Small | Image + Text | 74.31 |
| M3AE-Distill-Base | Image + Text | 77.37 |
| Methods | T2I R@1 | T2I R@5 | T2I R@10 | I2T R@1 | I2T R@5 | I2T R@10 |
|---|---|---|---|---|---|---|
| ViLT [12] | 9.75 | 28.95 | 41.40 | 11.90 | 31.90 | 43.20 |
| PubMedCLIP * [6] | 8.61 | 26.73 | 38.49 | 8.16 | 25.78 | 38.24 |
| M3AE * [7] | 16.96 | 46.47 | 61.33 | 17.65 | 46.40 | 60.95 |
| M3AE-Distill-Small | 4.05 | 14.96 | 24.36 | 5.40 | 19.95 | 29.20 |
| M3AE-Distill-Base | 14.36 | 37.17 | 51.23 | 14.80 | 36.90 | 50.20 |
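R@K denotes the percentage of queries whose correct counterpart appears among the top-K retrieved candidates. Below is a minimal sketch of the metric, assuming a square image–text similarity matrix whose matched pairs lie on the diagonal (an assumption about the evaluation setup, not a detail stated in the paper).

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Recall@K for cross-modal retrieval given an (N, N) query-candidate similarity
    matrix, assuming the correct candidate for query i has index i."""
    topk = similarity.topk(k, dim=1).indices                              # (N, k) retrieved indices
    targets = torch.arange(similarity.size(0),
                           device=similarity.device).unsqueeze(1)         # (N, 1) ground-truth indices
    hits = (topk == targets).any(dim=1).float()                           # 1 if the match is in the top k
    return 100.0 * hits.mean().item()
```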
| Model Variant | Training (pairs/s) | Inference (pairs/s) | CPU Inference (ms/pair) | Speedup (Train/Inference/CPU) |
|---|---|---|---|---|
| Teacher (M3AE) | 84.77 | 119.07 | 370 | 1.00/1.00/1.00 |
| Student (Small) | 409.81 | 417.29 | 100 | 4.83/3.51/3.70 |
| Student (Base) | 221.61 | 251.47 | 208 | 2.61/2.11/1.78 |
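The speedup column is the ratio of each student’s throughput to the teacher’s (and, for CPU, the ratio of per-pair latencies). For the Small student, for example:

$$
\text{Speedup}_{\text{train}} = \frac{409.81}{84.77} \approx 4.83,\qquad
\text{Speedup}_{\text{infer}} = \frac{417.29}{119.07} \approx 3.51,\qquad
\text{Speedup}_{\text{CPU}} = \frac{370~\text{ms}}{100~\text{ms}} = 3.70.
$$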
| ID | Strategy | Open | Closed | Overall |
|---|---|---|---|---|
| 0 | M3AE (Teacher) | 80.31 | 87.82 | 83.25 |
| 1 | Pre-training | 79.10 | 84.62 | 81.26 |
| 2 | Pre-training + KD | 79.26 | 85.37 | 81.64 |
| 3 | Pre-training + KD + Attention Mask | 80.57 | 84.62 | 82.16 |
| Attention Score Source | Open | Closed | Overall |
|---|---|---|---|
| Student-only | 80.03 | 83.41 | 81.36 |
| Teacher-only | 80.03 | 83.53 | 81.40 |
| Student + Teacher | 80.57 | 84.62 | 82.16 |