VT-MFLV: Vision–Text Multimodal Feature Learning V Network for Medical Image Segmentation
Abstract
1. Introduction
1.1. Unimodal Medical Image Segmentation Methods
- (1) Small-sample dataset methods
- (2) Large-sample single-dataset segmentation methods
- (3) Large-sample multi-dataset segmentation methods
1.2. Multimodal Medical Image Segmentation Methods
- VT-MFLV, a vision–text multimodal feature learning V-shaped network for medical image segmentation, introduces three core components: diagnostic image–text sequence multi-head residual semantic encoding (DIT-RMHSE), multimodal fusion local attention fine-grained feature encoding (FG-MFLA), and multimodal global feature adaptive compression focusing (AGCF). In the reported experiments, VT-MFLV achieves the highest Dice and mIoU among the compared methods on MosMedData+ and a Dice score comparable to the strongest baseline, LViT, on QaTa-COV19.
- The Diagnostic Image–Text Sequence Multi-Head Residual Semantic Encoding Module (DIT-RMHSE) transforms medical text into high-dimensional semantic representations and captures rich contextual information. It eliminates the need for token type embeddings while effectively modeling complex semantic relationships, thereby improving lesion localization and enhancing multimodal fusion flexibility.
- The multimodal fusion local attention fine-grained feature encoding module (FG-MFLA) combines a multi-head attention mechanism with local attention masks and introduces a cross-modal fusion unit (CMFU). This module optimizes cross-modal feature integration and enhances representation learning. Consequently, it effectively addresses the issue of inaccurate local detail recognition caused by insufficient multimodal fusion.
- The multimodal global feature adaptive compression focusing module (AGCF) employs a squeeze-and-excitation refinement strategy (SER) and pixel-level feature enhancement to adaptively adjust channel weights. By suppressing redundant background information and focusing on critical regions, AGCF alleviates challenges commonly encountered in medical image segmentation, such as blurred boundaries and small lesion volumes. This significantly improves segmentation accuracy and stability.
2. Methodology
2.1. Diagnostic Image–Text Residual Multi-Head Semantic Encoder
2.1.1. Standardized Text Characteristic Representation
2.1.2. Sequential Semantic Representation of Diagnostic Text
- 1. Token Embeddings
- 2. Position Embeddings
- 3. Embedding Fusion
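The three steps listed above (token embeddings, position embeddings, embedding fusion) can be illustrated with a minimal sketch. It assumes fixed sinusoidal position embeddings and fusion by element-wise addition, with no token type embeddings; the outline does not state whether the position embeddings are fixed or learned, so this is an illustration rather than the authors' implementation. The names `embed_tokens`, `token_table`, and the toy sizes are hypothetical; `V`, `d`, and `N` follow the symbol table in Appendix A.1.

```python
import math
import random

def sinusoidal_position_embeddings(num_tokens, dim):
    """Fixed sine/cosine position embeddings of shape (num_tokens, dim)."""
    table = []
    for pos in range(num_tokens):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            # Even indices use sine, odd indices use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def embed_tokens(token_ids, token_table):
    """Look up token embeddings and add position embeddings element-wise."""
    dim = len(token_table[0])
    positions = sinusoidal_position_embeddings(len(token_ids), dim)
    return [
        [tok + pos for tok, pos in zip(token_table[tid], positions[p])]
        for p, tid in enumerate(token_ids)
    ]

random.seed(0)
V, d = 10, 8  # toy vocabulary size and embedding dimension (assumed values)
token_table = [[random.gauss(0.0, 0.02) for _ in range(d)] for _ in range(V)]
token_ids = [3, 1, 4, 1, 5]  # N = 5 hypothetical token ids
fused = embed_tokens(token_ids, token_table)  # N x d fused representation
```

Because the position term differs at every index, the two occurrences of token id 1 in the toy sequence receive different fused vectors, which is what lets the later attention layers distinguish word order.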
2.1.3. Multi-Head Residual Context Fusion Encoding (MHR-CFE)
- 1. Multi-Head Attention
- 2. Residual Normalization
- 3. Feedforward Residual Fusion
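The sequence of multi-head attention, residual normalization, and feedforward residual fusion follows the general shape of a Transformer encoder block. The sketch below is a toy version under stated assumptions: each head uses Q = K = V (no learned projections), layer normalization is applied after each residual addition, and the feedforward layer uses ReLU. None of these choices is confirmed by the outline, and the function names are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no affine)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def multi_head_self_attention(x, num_heads):
    """x: N x d list. Assumption: Q = K = V = x within each head slice."""
    n, d = len(x), len(x[0])
    head_dim = d // num_heads
    out = [[0.0] * d for _ in range(n)]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        heads = [row[lo:hi] for row in x]
        for i in range(n):
            scores = [
                sum(a * b for a, b in zip(heads[i], heads[j])) / math.sqrt(head_dim)
                for j in range(n)
            ]
            weights = softmax(scores)
            for k in range(head_dim):
                out[i][lo + k] = sum(weights[j] * heads[j][k] for j in range(n))
    return out

def feed_forward(vec, w1, w2):
    """Two linear maps with a ReLU in between (biases omitted)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * hv for w, hv in zip(row, hidden)) for row in w2]

def encoder_block(x, num_heads, w1, w2):
    attn = multi_head_self_attention(x, num_heads)
    # Residual connection followed by normalization.
    x = [layer_norm([a + b for a, b in zip(r, s)]) for r, s in zip(x, attn)]
    ffn = [feed_forward(row, w1, w2) for row in x]
    # Second residual connection around the feedforward layer.
    return [layer_norm([a + b for a, b in zip(r, s)]) for r, s in zip(x, ffn)]

random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # N = 3, d = 4
w1 = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(8)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(4)]
z = encoder_block(x, num_heads=2, w1=w1, w2=w2)  # N x d output
```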
2.2. Fine-Grained Multimodal Fusion with Local Attention
2.2.1. Generation of Positional Encoding
2.2.2. Feature Integration and Normalization
2.2.3. Local Perception Multi-Head Attention Encoder
- 1. Feature Normalization
- 2. Local Perception Multi-Head Attention
  - (1) Local Perception Multi-Head Attention
  - (2) Local Attention Mask Normalization
  - (3) Weighted Summation
  - (4) Regularization
- 3. Information-Preserving Fusion
- 4. Nonlinear Feature Transformation
- 5. Context-Preserving Enhancement
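The local attention mask and its normalization step can be sketched as follows. The example assumes tokens laid out on a 2D grid and a square window centred on each query, with masked positions excluded from the softmax; the 7 × 7 window matches the setting marked "Ours" in the window-size comparison, while the 8 × 8 grid and the function names are hypothetical.

```python
import math

def local_attention_mask(height, width, window):
    """Return an (H*W) x (H*W) boolean mask; True means the key is visible."""
    radius = window // 2
    mask = []
    for qi in range(height):
        for qj in range(width):
            row = [
                abs(qi - ki) <= radius and abs(qj - kj) <= radius
                for ki in range(height)
                for kj in range(width)
            ]
            mask.append(row)
    return mask

def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; masked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept)  # each query always sees itself, so kept is never empty
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = local_attention_mask(8, 8, window=7)
center = 4 * 8 + 4  # flattened index of grid position (4, 4)
weights = masked_softmax([0.0] * 64, mask[center])
```

With uniform scores, the query at (4, 4) spreads its attention evenly over the 49 positions inside its window and assigns zero weight elsewhere; a corner query sees only the 16 positions that fall inside the grid.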
2.3. Adaptive Global Compression and Focusing
2.3.1. Global and Local Feature Extraction
- 1. Global Statistical Feature Extraction
- 2. Local Salient Feature Extraction
2.3.2. Multi-Scale Feature Fusion and Compression (MFFC)
- 1. Global–Local Feature Fusion
- 2. Redundancy Compression
2.3.3. Channel Squeeze and Excitation Refinement (SER)
- 1. Feature Compression
  - (1) Global Average Pooling
  - (2) Reshape
- 2. Channel Excitation
  - (1) Dimensionality Reduction
  - (2) Nonlinear Adaptive Activation
  - (3) Dimensionality Restoration
  - (4) Normalization
- 3. Feature Refinement
  - (1) Channel Weighting
  - (2) Dropout
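The SER steps listed above resemble a standard squeeze-and-excitation block, and a minimal sketch along those lines is given here. It assumes ReLU for the nonlinear activation, a sigmoid for the normalization step, and inverted dropout applied only during training; the outline does not name these functions, so they are illustrative choices. The reshape step is implicit because the pooled values are already a flat list. All names and the reduction ratio `r` are hypothetical.

```python
import math
import random

def global_average_pool(x):
    """x: C x H x W nested lists -> list of C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def linear(vec, weight):
    """Matrix-vector product; weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def se_refine(x, w_reduce, w_restore, drop_p=0.0, rng=None):
    squeezed = global_average_pool(x)                  # feature compression
    hidden = [max(0.0, h) for h in linear(squeezed, w_reduce)]  # reduce + ReLU
    logits = linear(hidden, w_restore)                 # restore to C channels
    gates = [1.0 / (1.0 + math.exp(-g)) for g in logits]  # sigmoid to (0, 1)
    # Channel weighting: scale every pixel of a channel by that channel's gate.
    out = [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]
    if drop_p > 0.0 and rng is not None:               # training-time dropout
        keep = 1.0 - drop_p
        out = [
            [[v / keep if rng.random() < keep else 0.0 for v in row] for row in ch]
            for ch in out
        ]
    return out, gates

rng = random.Random(0)
C, H, W, r = 4, 2, 2, 2  # toy sizes; r is the channel reduction ratio
x = [[[rng.uniform(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w_reduce = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w_restore = [[rng.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
refined, gates = se_refine(x, w_reduce, w_restore)  # inference mode, no dropout
```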
3. Experimental Results and Analysis
3.1. Datasets
3.2. Evaluation Metrics
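The results tables report Dice and mIoU. As a reference point, the sketch below computes the Dice coefficient and IoU for a single binary mask and averages them over a set of images. It assumes mIoU is the mean of per-image foreground IoU, which the outline does not confirm; the smoothing constant `eps` and the function names are also assumptions.

```python
def dice_and_iou(pred, target, eps=1e-7):
    """pred, target: flat sequences of 0/1 labels for one image."""
    inter = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    dice = (2 * inter + eps) / (pred_sum + target_sum + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou

def mean_scores(pairs):
    """Average Dice and IoU over an iterable of (pred, target) pairs."""
    scores = [dice_and_iou(p, t) for p, t in pairs]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n

# One predicted mask with two foreground pixels, one of which is correct.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0])
mean_dice, mean_iou = mean_scores([([1, 1, 0, 0], [1, 0, 0, 0]), ([1, 0], [1, 0])])
```

For the toy mask, the intersection is 1 pixel, the two masks contain 3 foreground pixels in total, and the union is 2 pixels, so Dice is about 2/3 and IoU is about 1/2.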
3.3. Experimental Details
3.4. Performance Comparison and Analysis
3.4.1. Comparison of Segmentation Visualization Results
- 1. Visualization Results on the MosMedData+ Dataset
- 2. Visualization Results on the QaTa-COV19 Dataset
3.4.2. Quantitative Results Comparison
3.4.3. Statistical Significance Evaluation
3.4.4. Computational Efficiency Evaluation
3.4.5. Window Size Selection in Locality-Aware Multi-Head Attention
3.4.6. Ablation Studies
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Appendix A.1
| Symbol | Meaning |
|---|---|
| N | Number of tokens after tokenization |
| L | Length of the original character sequence |
| d | Embedding dimension |
| pos | Position index |
| H | Number of attention heads |
| V | Vocabulary size |
| Z | Output of MHSA module |
| A | Attention map |
| ⊕ | Element-wise addition |
| ⊗ | Element-wise multiplication |
Appendix A.2
| Image ID | Text Description |
|---|---|
| Morozov_study_0266_24.png | Unilateral pulmonary infection, one infected area, left lung and middle right lung. |
| Jun_coronacases_case4_143.png | Unilateral pulmonary infection, one infected area, left lung and middle right lung. |
| Morozov_study_0263_2.png | Unilateral pulmonary infection, one infected area, left lung and middle right lung. |
| Jun_radiopaedia_14_85914_0_case13_19.png | Bilateral pulmonary infection, two infected areas, upper left lung and upper right lung. |
| Morozov_study_0275_12.png | Unilateral pulmonary infection, one infected area, middle left lung. |
| Jun_coronacases_case6_93.png | Unilateral pulmonary infection, one infected area, middle left lung. |
| Morozov_study_0276_22.png | Unilateral pulmonary infection, one infected area, middle left lung. |
| Jun_coronacases_case8_177.png | Unilateral pulmonary infection, two infected areas, upper left lung. |
| Jun_coronacases_case8_269.png | Unilateral pulmonary infection, one infected area, middle left lung. |
| Jun_coronacases_case8_226.png | Bilateral pulmonary infection, three infected areas, upper left lung and middle right lung. |
| Morozov_study_0296_12.png | Bilateral pulmonary infection, two infected areas, upper left lung and middle right lung. |
| Morozov_study_0296_13.png | Unilateral pulmonary infection, four infected areas, upper left lung. |
| Jun_coronacases_case8_70.png | Bilateral pulmonary infection, two infected areas, middle left lung and middle right lung. |
| Morozov_study_0296_11.png | Unilateral pulmonary infection, one infected area, upper left lung. |
| Jun_coronacases_case8_192.png | Unilateral pulmonary infection, three infected areas, upper left lung. |
References
- Boodi, D.; Sudheer, N.; Bidargaddi, A.P.; Shatagar, S.; Telkar, M. Semantic Segmentation of Computed Tomography Scan of Lungs. In Proceedings of the 5th IEEE International Conference for Emerging Technology, INCET 2024, Belgaum, India, 24–26 May 2024; IEEE (Institute of Electrical and Electronics Engineers Inc.): New York, NY, USA, 2024.
- Wang, Y.; Mastura Mustaza, S.; Syuhaimi Ab-Rahman, M. Pulmonary Nodule Segmentation Using Deep Learning: A Review. IEEE Access 2024, 12, 119039–119055.
- Jiang, J.; Rangnekar, A.; Veeraraghavan, H. Self-supervised learning improves robustness of deep learning lung tumor segmentation to CT imaging differences. arXiv 2024, arXiv:2405.08657.
- Sharma, S.; Guleria, K. A systematic literature review on deep learning approaches for pneumonia detection using chest X-ray images. Multimed. Tools Appl. 2024, 83, 24101–24151.
- Tang, Y.; Zhan, S.; Guo, L.; Pu, H.; Feng, W.; Liao, J. Pulmonary embolism image segmentation based on an U-net method with CBAM attention mechanism. In Proceedings of the 3rd International Conference on Electronics, Communications and Information Technology, CECIT 2022, Sanya, China, 23–25 December 2022; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2022; pp. 334–339.
- Liu, Y.; Wang, J.; Chen, J.; Pan, D.; Chang, J.; Bi, Y. Advanced UNet++ Architecture for Precise Segmentation of COVID-19 Pulmonary Infections. In Proceedings of the 2023 5th International Conference on Artificial Intelligence and Computer Applications, ICAICA 2023, Dalian, China, 28–30 November 2023; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2023; pp. 155–159.
- Auvy, A.A.M.; Zannah, R.; Mahbub-E-Elahi; Sharif, S.; Al Mahmud, W.; Noor, J. Semantic Segmentation with Attention Dense U-Net for Lung Extraction from X-ray Images. In Proceedings of the 6th International Conference on Electrical Engineering and Information and Communication Technology, ICEEICT 2024, Dhaka, Bangladesh, 2–4 May 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024; pp. 658–663.
- Agnes, S.A.; Anitha, J. Efficient multiscale fully convolutional UNet model for segmentation of 3D lung nodule from CT image. J. Med. Imaging 2022, 9, 052402.
- Liu, W.; Qi, Y.; Li, J.; Ren, Z. Lung Nodule Segmentation Based on Complementary Context-Aware Networks. In Proceedings of the 42nd Chinese Control Conference, CCC 2023, Tianjin, China, 24–26 July 2023; IEEE Computer Society: New York, NY, USA, 2023; pp. 7705–7710.
- Pal, O.K.; Roy, S.; Modok, A.K.; Teethi, T.I.; Sarker, S.K. ULung: A Novel Approach for Lung Image Segmentation. In Proceedings of the 6th International Conference on Computing and Informatics, ICCI 2024, Cairo, Egypt, 6–7 March 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024; pp. 522–527.
- Delfan, N.; Moghaddam, H.A.; Modaresi, M.; Afshari, K.; Nezamabadi, K.; Pak, N.; Ghaemi, O.; Forouzanfar, M. CT-LungNet: A Deep Learning Framework for Precise Lung Tissue Segmentation in 3D Thoracic CT Scans. arXiv 2022, arXiv:2212.13971.
- Li, J.; Chen, Y.; Wu, C.; Zhang, Q.; Sun, L.; Patel, M.; Xu, H.; Lee, J.; Kumar, S.; Brown, T.; et al. Pulmonary CT Nodules Segmentation Using An Enhanced Square U-Net with Depthwise Separable Convolution. In Proceedings of the Medical Imaging 2023: Image Processing, San Diego, CA, USA, 19–23 February 2023; The Society of Photo-Optical Instrumentation Engineers (SPIE): Bellingham, WA, USA, 2023.
- Zhang, J.; Tang, J.; Huo, Y. Semantic segmentation of pulmonary nodules based on attention mechanism and improved 3D U-Net. In Proceedings of the 4th International Conference on Advanced Information Science and System, AISS 2022, Sanya, China, 25–27 November 2022; Association for Computing Machinery: New York, NY, USA, 2022.
- Liu, F.; Chen, Z.; Sun, P. Detection and segmentation of pulmonary nodules based on improved 3D VNet algorithm. In Proceedings of the 2022 International Conference on Algorithms, Microchips and Network Applications, Zhuhai, China, 18–20 February 2022; Academic Exchange Information Center (AEIC) (SPIE): Bellingham, WA, USA, 2022.
- Tan, S.; Li, J.; Zhang, X.; Yan, X.; Zhang, T.; Wu, X.; Liu, Z.; Li, L.; Feng, J.; Han, H.; et al. A design of interactive review for computer aided diagnosis of pulmonary nodules based on active learning. Shengwu Yixue Gongchengxue Zazhi/J. Biomed. Eng. 2024, 41, 503–510.
- Youssef, B.; Alksas, A.; Shalaby, A.; Mahmoud, A.; Van Bogaert, E.; Contractor, S.; Ghazal, M.; Elmaghraby, A.; El-Baz, A. A Novel Technique of Pulmonary Nodules Auto Segmentation Using Modified Convolutional Neural Networks. In Proceedings of the 20th IEEE International Symposium on Biomedical Imaging (ISBI 2023), Cartagena de Indias, Colombia, 18–21 April 2023; IEEE Computer Society: New York, NY, USA, 2023.
- Jalali, Y.; Fateh, M.; Rezvani, M.; Abolghasemi, V.; Anisi, M.H. ResBCDU-net: A deep learning framework for lung CT image segmentation. Sensors 2021, 21, 268.
- Li, D.; Yuan, S.; Yao, G. Pulmonary nodule segmentation based on REMU-Net. Phys. Eng. Sci. Med. 2022, 45, 995–1004.
- Xue, X.; Wang, G.; Ma, L.; Jia, Q.; Wang, Y. Adjacent Slice Feature Guided 2.5d Network for Pulmonary Nodule Segmentation. arXiv 2022, arXiv:2211.10597.
- Ramkumar, M.O.; Jayakumar, D.; Yogesh, R. Multi Res U-Net Based Image Segmentation of Pulmonary Tuberculosis Using CT Images. In Proceedings of the 7th International Conference on Smart Structures and Systems, ICSSS 2020, Chennai, India, 23–24 July 2020; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2020.
- Luo, D.; He, Q.; Ma, M.; Yan, K.; Liu, D.; Wang, P. ECANodule: Accurate Pulmonary Nodule Detection and Segmentation with Efficient Channel Attention. In Proceedings of the 2023 International Joint Conference on Neural Networks, IJCNN 2023, Business Events Australia; Destination Goldcoast, Queensland, Australia, 18–23 June 2023; IEEE: New York, NY, USA, 2023.
- Qiu, J.; Li, B.; Liao, R.; Mo, H.; Tian, L. A Contour-Constraint Neural Network with Hierarchical Feature Learning for Lung Nodule Segmentation in 3D CT Images. In Proceedings of the 4th International Conference on Intelligent Computing and Human-Computer Interaction, ICHCI 2023, Guangzhou, China, 4–6 August 2023; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2023; pp. 242–248.
- Sabitha, P.; Canessane, R.A.; Minu, M.S.P.; Gowri, V.; Vigil, M.S.A. An Improved Deep Network Model to Isolate Lung Nodules from Histopathological Images Using an Orchestrated and Shifted Window Vision Transformer. Trait. Du Signal 2024, 41, 2081–2091.
- Misra, A.; Rani, G.; Dhaka, V.S. LSEG: Lung Segmentation for Pulmonary Disease Affected Chest Radiographs. In Proceedings of the Joint 9th International Conference on Digital Arts, Media and Technology with 7th ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering, ECTI DAMT and NCON 2024, Chiang Mai, Thailand, 31 January–3 February 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024; pp. 116–121.
- Bhattacharjee, A.; Murugan, R.; Goel, T.; Mirjalili, S. Pulmonary Nodule Segmentation Framework Based on Fine-Tuned and Pretrained Deep Neural Network Using CT Images. IEEE Trans. Radiat. Plasma Med. Sci. 2023, 7, 394–409.
- Wei, R.; Shao, J.; Pu, R.; Zhang, X.; Hu, C. Lesion segmentation method based on deep learning CT image of pulmonary tuberculosis. In Proceedings of the 4th Annual International Conference on Data Science and Business Analytics, ICDSBA 2020, Changsha, China, 6–8 November 2020; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2020; pp. 320–323.
- Yang, Q.; Chen, J.U.N. An Intelligent Model for Benign and Malignant Pulmonary Nodule Analysis Using U-Net Networks And Multilevel Attention Mechanisms. J. Mech. Med. Biol. 2024, 24, 2440032.
- Talib, L.F.; Amin, J.; Sharif, M.; Raza, M. Transformer-based semantic segmentation and CNN network for detection of histopathological lung cancer. Biomed. Signal Process. Control. 2024, 92, 106106.
- Xiao, F.; Shen, C.; Chen, Y.; Yang, T.; Chen, S.; Liao, Z.; Tang, J. RCGA-Net: An Improved Multi-hybrid Attention Mechanism Network in Biomedical Image Segmentation. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2021, Houston, TX, USA, 9–12 December 2021; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2021; pp. 1112–1118.
- Xu, Y.; Souza, L.F.; Silva, I.C.; Marques, A.G.; Silva, F.H.; Nunes, V.X.; Han, T.; Jia, C.; de Albuquerque, V.H.C.; Filho, P.P.R. A soft computing automatic based in deep learning with use of fine-tuning for pulmonary segmentation in computed tomography images. Appl. Soft Comput. 2021, 112, 107810.
- Jian, M.; Jin, H.; Zhang, L.; Wei, B.; Yu, H. DBPNDNet: Dual-branch networks using 3DCNN toward pulmonary nodule detection. Med. Biol. Eng. Comput. 2023, 62, 563–573.
- Sui, G.; Liu, X.; Chen, S.; Liu, S.; Zhang, Z. Pulmonary nodules segmentation based on domain adaptation. Phys. Med. Biol. 2023, 68, 155015.
- Qiu, J.; Li, B.; Liao, R.; Mo, H.; Tian, L. A dual-task region-boundary aware neural network for accurate pulmonary nodule segmentation. J. Vis. Commun. Image Represent. 2023, 96, 103909.
- Cai, L.; Long, T.; Dai, Y.; Huang, Y. Mask R-CNN-Based Detection and Segmentation for Pulmonary Nodule 3D Visualization Diagnosis. IEEE Access 2020, 8, 44400–44409.
- Liu, Y.; Zhu, Y.; Xin, Y.; Zhang, Y.; Yang, D.; Xu, T. MESTrans: Multi-scale embedding spatial transformer for medical image segmentation. Comput. Methods Programs Biomed. 2023, 233, 107493.
- Lu, D.; Chu, J.; Zhao, R.; Zhang, Y.; Tian, G. A Novel Deep Learning Network and Its Application for Pulmonary Nodule Segmentation. Comput. Intell. Neurosci. 2022, 2022, 7124902.
- Kim, Y.-G.; Kim, K.; Wu, D.; Ren, H.; Tak, W.Y.; Park, S.Y.; Lee, Y.R.; Kang, M.K.; Gil Park, J.; Kim, B.S.; et al. Deep Learning-Based Four-Region Lung Segmentation in Chest Radiography for COVID-19 Diagnosis (Research Square, 2021). Diagnostics 2022, 12, 101.
- Imran, A.-A.-Z.; Terzopoulos, D. Progressive adversarial semantic segmentation. In Proceedings of the 25th International Conference on Pattern Recognition, ICPR 2020, Milan, Italy, 10–15 January 2021; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2021; pp. 4910–4917.
- Xing, W.; Zhu, Z.; Hou, D.; Yue, Y.; Dai, F.; Li, Y.; Tong, L.; Song, Y.; Ta, D. CM-SegNet: A deep learning-based automatic segmentation approach for medical images by combining convolution and multilayer perceptron. Comput. Biol. Med. 2022, 147, 105797.
- Jia, J.; Zhai, Z.; Bakker, M.E.; Hernández Girón, I.; Staring, M.; Stoel, B.C. Multi-Task Semi-Supervised Learning for Pulmonary Lobe Segmentation. In Proceedings of the 18th IEEE International Symposium on Biomedical Imaging, ISBI 2021, Nice, France, 13–16 April 2021; IEEE Computer Society: New York, NY, USA, 2021; pp. 1329–1332.
- Li, Z.; Li, Y.; Li, Q.; Wang, P.; Guo, D.; Lu, L.; Jin, D.; Zhang, Y.; Hong, Q. LViT: Language Meets Vision Transformer in Medical Image Segmentation. IEEE Trans. Med. Imaging 2024, 43, 96–107.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
- Morozov, S.P.; Andreychenko, A.E.; Pavlov, N.A.; Vladzymyrskyy, A.V.; Ledikhova, N.V.; Gombolevskiy, V.A.; Blokhin, I.A.; Gelezhe, P.B.; Gonchar, A.V.; Chernina, V.Y. MosMedData: Chest CT scans with COVID-19 related findings dataset. arXiv 2020, arXiv:2005.06465.
- Degerli, A.; Kiranyaz, S.; Chowdhury, M.E.; Gabbouj, M. Osegnet: Operational segmentation network for covid-19 detection using chest x-ray images. arXiv 2022, arXiv:2202.10185.
- Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211.
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
- Tomar, N.K.; Jha, D.; Bagci, U.; Ali, S. TGANet: Text-Guided Attention for Improved Polyp Segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2022 (Lecture Notes in Computer Science, Vol. 13433); Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; pp. 151–160.
- Huang, S.-C.; Shen, L.; Lungren, M.P.; Yeung, S. GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Image Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 3942–3951.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer–Assisted Intervention–MICCAI 2015 (Lecture Notes in Computer Science, Vol. 9351); Navab, N., Hornegger, J., Wells, W., Frangi, A., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241.
- Oktay, O.; Schlemper, J.; Le Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. In Proceedings of the 1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands, 4–6 July 2018; pp. 197–207.
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. arXiv 2021, arXiv:2105.05537.
- Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-Wise Perspective with Transformer. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2441–2449.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021)/PMLR, Vienna, Austria, 18–24 July 2021; Volume 139, pp. 8748–8763.
- Wahyudi, M.I.; Fauzi, I.; Atmojo, D. Robust Image Watermarking Based on Hybrid IWT-DCT-SVD. Int. J. Adv. Comput. Inform. (IJACI) 2025, 1, 89–98.
- Kusuma, M.R.; Panggabean, S. Robust Digital Image Watermarking Using DWT, Hessenberg, and SVD for Copyright Protection. IJACI Int. J. Adv. Comput. Inform. 2026, 2, 41–52.
- Amrullah, A.; Aminuddin, A. Tamper Localization and Content Restoration in Fragile Image Watermarking: A Review. IJACI Int. J. Adv. Comput. Inform. 2025, 2, 62–74.






| Text | Method | MosMedData+ Dice (%) | MosMedData+ mIoU (%) | QaTa-COV19 Dice (%) | QaTa-COV19 mIoU (%) |
|---|---|---|---|---|---|
| × | U-Net [49] | 64.58 ± 0.37 | 50.41 ± 0.31 | 78.45 ± 0.40 | 68.76 ± 0.33 |
| × | AttUNet [50] | 66.34 | 52.82 | 79.31 | 70.04 |
| × | nnUNet [44] | 72.59 | 60.36 | 80.42 | 70.81 |
| × | TransUNet [45] | 71.24 | 58.44 | 78.63 | 69.13 |
| × | Swin-UNet [51] | 63.29 | 50.19 | 78.07 | 68.34 |
| × | UCTransNet [52] | 65.90 | 52.69 | 79.15 | 69.60 |
| √ | TGANet [46] | 71.81 | 59.28 | 79.87 | 70.75 |
| √ | CLIP [53] | 71.97 | 59.64 | 79.81 | 69.66 |
| √ | GLoRIA [47] | 72.42 | 60.18 | 79.94 | 70.68 |
| √ | LViT [41] | 74.57 ± 0.39 | 61.33 ± 0.33 | 83.66 ± 0.38 | 75.11 ± 0.39 |
| √ | VT-MFLV (ours) | 75.61 ± 0.32 | 63.98 ± 0.29 | 83.34 ± 0.36 | 72.09 ± 0.30 |

| Dataset | Model | Dice (%) | mIoU (%) | p-Value (Dice) | p-Value (mIoU) |
|---|---|---|---|---|---|
| MosMedData+ | U-Net | 64.58 ± 0.37 | 50.41 ± 0.31 | - | - |
| MosMedData+ | LViT | 74.57 ± 0.39 | 61.33 ± 0.33 | - | - |
| MosMedData+ | VT-MFLV | 75.61 ± 0.32 | 63.98 ± 0.29 | 0.0008 | 0.0006 |
| QaTa-COV19 | U-Net | 78.45 ± 0.40 | 68.76 ± 0.33 | - | - |
| QaTa-COV19 | LViT | 83.66 ± 0.38 | 75.11 ± 0.39 | - | - |
| QaTa-COV19 | VT-MFLV | 83.34 ± 0.36 | 72.09 ± 0.30 | 0.0011 | 0.0010 |

| Model | Text | Parameters (M) | Inference Time (ms/Image) |
|---|---|---|---|
| U-Net | × | 14.8 | 25.5 |
| LViT | √ | 29.7 | 37.8 |
| VT-MFLV (ours) | √ | 28.3 | 30.7 |

| Window Size | MosMedData+ Dice (%) | MosMedData+ mIoU (%) | QaTa-COV19 Dice (%) | QaTa-COV19 mIoU (%) |
|---|---|---|---|---|
| 3 × 3 | 73.21 ± 0.22 | 62.76 ± 0.28 | 80.53 ± 0.31 | 70.41 ± 0.27 |
| 5 × 5 | 75.28 ± 0.19 | 63.84 ± 0.24 | 82.35 ± 0.26 | 71.36 ± 0.29 |
| 7 × 7 (Ours) | 75.61 ± 0.17 | 64.03 ± 0.20 | 83.29 ± 0.22 | 72.10 ± 0.25 |
| 9 × 9 | 75.11 ± 0.25 | 63.42 ± 0.21 | 82.07 ± 0.30 | 71.08 ± 0.26 |

| Method | MosMedData+ Dice (%) | MosMedData+ mIoU (%) | QaTa-COV19 Dice (%) | QaTa-COV19 mIoU (%) |
|---|---|---|---|---|
| FG-MFLA + AGCF | 72.73 | 61.42 | 80.92 | 69.34 |
| DIT-RMHSE + AGCF | 73.24 | 61.75 | 81.12 | 69.71 |
| DIT-RMHSE + FG-MFLA | 73.68 | 62.09 | 81.46 | 70.05 |
| VT-MFLV (Ours) | 75.61 | 63.98 | 83.34 | 72.09 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, W.; Li, J.; Ye, Z.; Cai, Y.; Wang, Z.; Zhang, R. VT-MFLV: Vision–Text Multimodal Feature Learning V Network for Medical Image Segmentation. J. Imaging 2025, 11, 425. https://doi.org/10.3390/jimaging11120425

