Abstract
Existing multimodal segmentation methods face limitations in effectively leveraging medical text to guide visual feature learning; they often suffer from insufficient multimodal fusion and inadequate accuracy in fine-grained lesion segmentation. To address these challenges, the Vision–Text Multimodal Feature Learning V Network (VT-MFLV) is proposed. The model exploits the complementarity between medical images and text to enhance multimodal fusion and thereby improve the recognition accuracy of critical lesions. VT-MFLV introduces three key modules: a Diagnostic Image–Text Residual Multi-Head Semantic Encoding (DIT-RMHSE) module that preserves critical semantic cues while reducing preprocessing complexity; a Fine-Grained Multimodal Fusion Local Attention Encoding (FG-MFLA) module that strengthens local cross-modal interaction; and an Adaptive Global Feature Compression and Focusing (AGCF) module that emphasizes clinically relevant lesion regions. Experiments are conducted on two publicly available pulmonary infection datasets. On the MosMedData dataset, VT-MFLV achieves Dice and mIoU scores of 75.61 ± 0.32% and 63.98 ± 0.29%, respectively; on the QaTa-COV19 dataset, it achieves Dice and mIoU scores of 83.34 ± 0.36% and 72.09 ± 0.30%, respectively, reaching state-of-the-art performance on both benchmarks.