1. Introduction
Medical image segmentation constitutes a core problem in medical image analysis and computer vision, with the objective of delineating anatomical structures or pathological regions in a precise and automated manner [1,2,3]. Reliable segmentation is indispensable for a wide range of clinical applications, including disease diagnosis [4,5,6], treatment planning [7,8,9], and longitudinal assessment of disease progression [10,11]. Despite substantial advances brought by deep learning, most existing approaches predominantly depend on visual features and densely annotated masks, while largely neglecting the complementary semantic information contained in medical textual descriptions such as radiology reports and diagnostic notes [12,13,14].
These textual descriptions encode clinically meaningful attributes, including morphological characteristics, macro-level anatomical bilateral symmetry (e.g., the symmetric nature of lungs and brain hemispheres), boundary regularity, internal texture, and spatial distribution. Such information is particularly valuable for distinguishing subtle pathological patterns and for constraining segmentation toward anatomically plausible structures. However, conventional segmentation models rarely establish explicit correspondence between these semantic attributes and visual representations. Recent developments in vision–language models (VLMs) provide a potential pathway for multimodal integration through contrastive learning [15,16]. Nevertheless, most VLM-based approaches rely on global alignment objectives that capture coarse semantic consistency, while lacking the ability to associate fine-grained textual attributes with localized image regions. This limitation is critical in medical imaging, where clinically relevant descriptions often emphasize subtle morphological variations, irregular boundaries, or asymmetric structural patterns. In addition, global alignment strategies do not sufficiently support segmentation tasks that require spatially precise and structurally coherent predictions, particularly when anatomical symmetry and morphology serve as key diagnostic cues. Consequently, existing vision–language frameworks remain insufficient in jointly modeling semantic fidelity and structural variability in clinical data.
Another fundamental challenge lies in the inherent ambiguity of medical text descriptions. In realistic clinical settings, a single description may correspond to multiple plausible visual realizations, as illustrated in Figure 1. For example, a report describing “unilateral pulmonary infection, a single infected area, middle left lung” may correspond to lesions with substantially different sizes, shapes, boundary definitions, and degrees of bilateral asymmetry across patients. This phenomenon induces a one-to-many mapping between textual descriptions and segmentation outcomes. Most existing vision–language segmentation methods implicitly assume a deterministic correspondence between image–text pairs and segmentation masks, which does not reflect the variability of pathological morphology observed in practice. As a result, such methods tend to produce rigid predictions that fail to capture the diversity and uncertainty inherent in clinical data.
To overcome these limitations, we propose the Symmetry- and Variability-Perceiving Conditional Variational Autoencoder (SVP-CVAE). While the probabilistic prior-to-posterior inference is a standard mathematical formulation of CVAEs [17,18], our core innovation lies in leveraging this framework to explicitly formulate and resolve the domain-specific “one-to-many” ambiguity inherent in clinical medical text descriptions. To ensure that the diverse outcomes generated by the CVAE are clinically valid, we constrain the stochastic latent space by integrating explicit morphological symmetry perception and fine-grained semantic alignment. Structured representations of medical text, attribute-level multimodal alignment, and latent-variable modeling are integrated within a unified probabilistic formulation. By introducing latent variables conditioned on both visual features and textual attributes, the framework captures diverse yet semantically consistent segmentation outcomes corresponding to the same description.
Specifically, a low-rank adapted [19] language model is employed to encode medical text into multiscale representations, which serve as a semantic interface for interaction with visual features. Based on these representations, an attribute-level vision–language alignment mechanism is designed to associate image features with individual textual attributes, enabling precise grounding of morphology-related cues. This design facilitates segmentation guided not only by global descriptions but also by localized structural and symmetric patterns. To account for intrinsic variability in lesion morphology, a CVAE formulation is incorporated. The latent variable z models the distribution of plausible segmentation outcomes conditioned on both image and text. Sampling from this latent space produces diverse predictions that remain consistent with specified semantic attributes, thereby addressing the one-to-many mapping between text and segmentation masks. Furthermore, an attribute-latent contrastive objective is introduced to enforce semantic consistency between latent representations and textual attributes, encouraging z to encode morphology-aware and discriminative information. This leads to a more interpretable latent space that aligns with clinically meaningful structural descriptions. While horizontal flipping serves as a straightforward macro-level proxy for symmetry perception, it is inherently most effective for naturally bilateral structures such as the lungs and brain. However, evaluating its behavior on inherently asymmetric organs (e.g., the liver) is equally critical. In this work, we demonstrate that by coupling this macro-level structural proxy with a probabilistic latent distribution, our framework can robustly process both highly symmetric anatomies and geometrically asymmetric organs by leveraging global contextual symmetry.
Extensive experiments demonstrate that the proposed framework improves the alignment between textual semantics and segmentation outputs, resulting in more accurate and structurally consistent predictions. In addition, the latent-variable formulation enables the generation of diverse yet plausible segmentation results, better reflecting the variability observed in real-world clinical scenarios.
The main contributions of this work can be summarized as follows:
- (1) A Symmetry- and Variability-Perceiving Conditional Variational Autoencoder (SVP-CVAE) is proposed. Our key methodological novelty is adapting the standard CVAE framework to specifically address the “one-to-many” ambiguity in medical vision–language segmentation by injecting symmetry-aware and morphology-guided priors into the latent representation.
- (2) A morphology-aware enhancement module is introduced, incorporating a spatial–semantic grounding module (SSGM) and a cross-bilateral symmetry module (CBSM) to capture localized structural cues and symmetry-related variations.
- (3) A clinical attribute text encoder (CATE) with low-rank adaptation-based fine-tuning is developed to extract hierarchical representations, enabling fine-grained grounding through a morphology–text fusion (MTF) module.
- (4) To prevent the CVAE from generating anatomically implausible variations during the one-to-many mapping, we introduce a Semantic Alignment Loss (SAL) as an attribute-latent contrastive objective. This explicitly regularizes the standard latent space, ensuring that sampled variables preserve discriminative and morphology-aware semantic consistency.
3. Method
To address the coarse feature alignment and suboptimal interpretability of existing vision–language models in medical image segmentation, we propose the SVP-CVAE framework. As shown in Figure 2, built upon an encoder–decoder backbone, our model incorporates a parameter-efficient language adapter module (LAM) within the pre-trained text encoder to capture multi-level morphological attributes, such as boundary sharpness and bilateral asymmetry. Notably, by integrating a text-conditioned latent distribution at the high-level semantic bottleneck, we reformulate the segmentation task as a prior-to-posterior inference process. This stochastic modeling approach effectively mitigates the “one-to-many” mapping challenge inherent in matching clinical descriptions with diverse pathological geometries.
3.1. Morphology-Aware Representation Learning
The accuracy of medical image segmentation is closely related to the ability to characterize fine-grained morphological properties of pathological regions, including boundary clarity, structural irregularity, and bilateral symmetry deviation. Conventional vision–language segmentation frameworks typically adopt generic convolutional or attention-based feature refinement strategies, which do not explicitly encode clinically meaningful structural priors. As a result, morphology-related information, particularly symmetry-aware cues, is insufficiently preserved in deep feature representations. To address this issue, a morphology-aware feature enhancement module is introduced to explicitly encode structural patterns and symmetry variations at the semantic bottleneck stage.
Given the visual token sequence $F_v \in \mathbb{R}^{N \times C}$ obtained from the vision encoder, where $N$ denotes the number of spatial tokens and $C$ represents the channel dimension, the sequence is reshaped into a two-dimensional feature map $X \in \mathbb{R}^{C \times H \times W}$ (with $N = H \times W$) to facilitate spatially structured processing.
To capture localized structural variations that are critical for accurate boundary delineation, a spatial-aware structural response mechanism is first employed. A depthwise convolution operator is applied to extract channel-wise structural responses:
$$F_{sp} = \mathrm{DWConv}_{k \times k}(X), \tag{1}$$
where $\mathrm{DWConv}_{k \times k}(\cdot)$ denotes a depthwise convolution with a kernel size of $k \times k$. This operation emphasizes high-frequency patterns associated with lesion contours and boundary transitions, while maintaining channel-specific semantic consistency.
In addition to local structure modeling, explicit encoding of bilateral symmetry is introduced to capture morphology-related deviations. Many pathological regions exhibit asymmetric patterns with respect to anatomical axes, which serve as important diagnostic indicators. This design is primarily motivated by the global anatomical bilateral symmetry inherently present in organs such as the lungs and the human brain. To model such characteristics, a bilateral structural inconsistency representation is constructed by computing the absolute difference between the feature map and its horizontally reflected counterpart:
$$D = \left| X - \mathcal{R}(X) \right|, \tag{2}$$
where $X$ is the two-dimensional feature map and $\mathcal{R}(\cdot)$ denotes horizontal (left–right) reflection, i.e., mirroring about the vertical anatomical axis. Although this simple geometric transformation is predominantly tailored for bilateral structures, it also provides valuable global contextual anchoring for inherently asymmetric organs (e.g., the liver) when processed in axial cross-sections. This operation highlights regions that deviate from expected symmetric structures. To further aggregate long-range asymmetric patterns, an anisotropic depthwise convolution is applied:
$$F_{sym} = \mathrm{DWConv}_{\mathrm{aniso}}(D), \tag{3}$$
which captures lateral structural discrepancies and enhances sensitivity to asymmetric morphology.
Formally, Equations (2) and (3) provide a mathematical quantification of bilateral symmetry. Let $X(i, j)$ denote the spatial feature activation at coordinate $(i, j)$. The operation in Equation (2) calculates the absolute residual between the original feature map and its counterpart mirrored about the vertical anatomical axis. Consequently, $D$ explicitly quantifies the degree and spatial location of structural asymmetry. The subsequent anisotropic convolution in Equation (3) further aggregates these lateral discrepancies into $F_{sym}$, establishing a formal structural prior that represents the quantified symmetry deviations of the pathological regions.
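The symmetry-deviation computation described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 1 x 5 averaging kernel, and the zero padding are all hypothetical choices, since the text does not give the shape of the anisotropic kernel.

```python
import numpy as np

def symmetry_deviation(x: np.ndarray) -> np.ndarray:
    """Absolute difference between a (C, H, W) map and its left-right mirror."""
    return np.abs(x - x[:, :, ::-1])

def lateral_depthwise_conv(d: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Depthwise 1 x k cross-correlation along the width axis with zero padding.

    d: (C, H, W) deviation map; kernels: (C, k) with odd k, one filter per channel.
    The 1 x k shape is an assumption standing in for the anisotropic kernel.
    """
    c, h, w = d.shape
    k = kernels.shape[1]
    pad = k // 2
    padded = np.pad(d, ((0, 0), (0, 0), (pad, pad)))
    out = np.zeros_like(d)
    for offset in range(k):
        # Each channel uses its own weight; channels are never mixed.
        out += kernels[:, offset, None, None] * padded[:, :, offset:offset + w]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
x_sym = x + x[:, :, ::-1]                 # a perfectly mirror-symmetric map
dev_sym = symmetry_deviation(x_sym)       # zero everywhere
dev = symmetry_deviation(x)               # non-zero where the map is asymmetric
kernels = np.full((4, 5), 1.0 / 5)        # hypothetical averaging filters
f_sym = lateral_depthwise_conv(dev, kernels)
```

A mirror-symmetric input yields a zero deviation map, so any non-zero response marks a location where the left and right halves disagree. The deviation map itself is always mirror-symmetric, which means it flags both members of a mismatched pair rather than identifying which side is abnormal.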
Finally, the original semantic representation, the spatial structural response, and the symmetry-aware representation are integrated through a morphology-aware fusion mechanism. These feature maps are concatenated along the channel dimension and projected using a pointwise transformation:
$$F_m = \mathrm{Conv}_{1 \times 1}\big([X;\, F_{sp};\, F_{sym}]\big), \tag{4}$$
where $[\cdot\,;\cdot]$ denotes channel-wise concatenation of the original feature map $X$, the spatial structural response $F_{sp}$, and the symmetry-aware representation $F_{sym}$, and $\mathrm{Conv}_{1 \times 1}$ represents a $1 \times 1$ convolution. This fusion process enables adaptive integration of global semantic context with morphology-sensitive and symmetry-aware cues, producing a structurally informed representation.
The resulting feature map is reshaped back into a token sequence $\tilde{F}_v \in \mathbb{R}^{N \times C}$, which provides enhanced morphological and symmetry-aware information for subsequent vision–language alignment and probabilistic segmentation modeling.
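The fusion step can be illustrated with a short NumPy sketch. It assumes three feature maps of equal shape and treats the pointwise convolution as a per-pixel linear map; the function names, the output width, and the zero bias are hypothetical.

```python
import numpy as np

def fuse_morphology(x, f_sp, f_sym, weight, bias):
    """Concatenate three (C, H, W) maps on the channel axis and apply a 1x1 convolution.

    weight: (C_out, 3C) pointwise kernel; bias: (C_out,).
    """
    stacked = np.concatenate([x, f_sp, f_sym], axis=0)      # (3C, H, W)
    fused = np.einsum("oc,chw->ohw", weight, stacked)        # same linear map at every pixel
    return fused + bias[:, None, None]

def to_tokens(feature_map):
    """Reshape a (C, H, W) map into a token sequence of shape (H*W, C)."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T

rng = np.random.default_rng(0)
C, H, W = 4, 6, 6
x, f_sp, f_sym = (rng.normal(size=(C, H, W)) for _ in range(3))
weight = rng.normal(size=(C, 3 * C))
bias = np.zeros(C)
fused = fuse_morphology(x, f_sp, f_sym, weight, bias)
tokens = to_tokens(fused)
```

Because a 1 x 1 convolution never looks at neighbouring pixels, all spatial context in the fused map comes from the two depthwise branches; the projection only learns how to weight them against the original semantics.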
3.2. Attribute-Conditioned Cross-Modal Token Alignment
Although morphology-aware representation learning provides structurally enriched visual features, accurate vision–language segmentation further requires precise and structured semantic guidance derived from clinical text. In clinical practice, diagnostic reports describe lesions through attribute-specific cues, including morphology, boundary properties, spatial distribution, and symmetry-related deviations. These cues are inherently multi-granular and often correspond to localized visual patterns rather than global semantic concepts. Therefore, effective alignment demands explicit modeling of the hierarchical relationship between language representations and morphology-aware visual features.
To capture the multi-level semantics embedded in clinical descriptions, the CATE is introduced. A pre-trained Transformer backbone (CXR-BERT [39]) is adopted and adapted to the medical domain using low-rank adaptation (LoRA). The backbone parameters are fixed, while trainable low-rank matrices are injected into the query, key, value, and feed-forward projections of the last
T layers (with
), enabling efficient domain adaptation without disrupting the original language structure.
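As a minimal sketch of how a low-rank update leaves the frozen backbone intact, the class below wraps one projection matrix. The rank, scaling factor, initialization, and the 768-dimensional weight are illustrative assumptions, not values reported in this work.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                                  # frozen, never updated
        self.a = rng.normal(scale=0.01, size=(rank, d_in))    # trainable down-projection
        self.b = np.zeros((d_out, rank))                      # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (n_tokens, d_in) -> (n_tokens, d_out)
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def num_trainable(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(0)
w_q = rng.normal(size=(768, 768))        # stands in for one frozen query projection
layer = LoRALinear(w_q, rank=8, alpha=16.0, rng=rng)
x = rng.normal(size=(5, 768))
y = layer(x)
```

With the up-projection initialized to zero, the adapted layer reproduces the pre-trained output exactly at the start of training, and in this example only 12,288 of the 589,824 weights would be updated.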
To establish hierarchical semantic representations, hidden states from different depths of the Transformer are extracted, denoted as
. These layers correspond to distinct semantic levels, including low-level morphological descriptions, intermediate regional and spatial context, and high-level diagnostic reasoning. Notably, the low-level representations capture fine-grained attributes such as shape irregularity, boundary sharpness, and symmetry-related patterns, which are closely aligned with morphology-aware visual features. These hidden states are projected into a shared embedding space:
where
denotes layer-specific linear projections.
To reduce the semantic discrepancy between visual representations and textual descriptions, a morphology–text fusion (MTF) module is introduced at the semantic bottleneck. This module aligns the morphology-enhanced visual tokens $\tilde{F}_v$ with the low-level textual embeddings $E_{low}$, which encode morphology- and symmetry-related attributes.
The MTF module adopts a cross-attention mechanism in which visual tokens act as queries to retrieve relevant structural priors from textual representations. Let $W_Q$, $W_K$, $W_V$, and $W_O$ denote learnable projection matrices. The interaction is formulated as
$$Q = \tilde{F}_v W_Q, \qquad K = E_{low} W_K, \qquad V = E_{low} W_V.$$
The attention weights, which quantify the relevance between each spatial location in the image and each morphology-aware textual token, are computed as
$$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right),$$
where $d$ is the dimension of the queries and keys.
The grounded visual representation is then obtained through residual aggregation followed by linear projection:
$$F_g = \big(\tilde{F}_v + A V\big)\, W_O.$$
Through this asymmetric cross-modal alignment, the model explicitly associates morphology-aware visual features with corresponding textual attributes, particularly those related to structural variation and symmetry deviation. The low-level language representations guide the localization of fine-grained patterns, while higher-level representations provide contextual and diagnostic consistency. As a result, the learned representation encodes both detailed morphological priors and hierarchical semantic information, forming a text-conditioned and symmetry-aware feature space that supports robust and interpretable probabilistic segmentation.
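The asymmetric alignment described above, with visual tokens as queries and text tokens as keys and values, can be sketched as single-head cross-attention. The scaled dot-product form, the placement of the residual connection, and all dimensions below are assumptions made for illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def morphology_text_fusion(visual, text, w_q, w_k, w_v, w_o):
    """visual: (N, C) tokens used as queries; text: (L, C_t) embeddings used as keys/values."""
    q = visual @ w_q                                      # (N, d)
    k = text @ w_k                                        # (L, d)
    v = text @ w_v                                        # (L, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (N, L): one text distribution per location
    grounded = (visual + attn @ v) @ w_o                  # residual aggregation, then projection
    return grounded, attn

rng = np.random.default_rng(0)
N, C, L, C_t, d = 16, 8, 5, 12, 8
visual = rng.normal(size=(N, C))
text = rng.normal(size=(L, C_t))
w_q = rng.normal(size=(C, d))
w_k = rng.normal(size=(C_t, d))
w_v = rng.normal(size=(C_t, C))
w_o = rng.normal(size=(C, C))
grounded, attn = morphology_text_fusion(visual, text, w_q, w_k, w_v, w_o)
```

Each row of `attn` sums to one, so it can be read as how strongly a given image location attends to each textual token, which is the quantity one would visualize to inspect attribute grounding.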
3.3. Conditional Latent Morphology Distribution Modeling
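Medical image segmentation conditioned on textual descriptions inherently involves uncertainty, as similar diagnostic descriptions may correspond to substantially different lesion geometries. To model this one-to-many mapping, we introduce a conditional latent variable framework (shown in Figure 3) that learns a distribution over morphology-consistent segmentation outcomes. We first summarize global visual and textual semantics by token-wise average pooling:
$$g_v = \frac{1}{N}\sum_{i=1}^{N} f_i, \qquad g_t = \frac{1}{L}\sum_{j=1}^{L} e_j,$$
where $f_i$ and $e_j$ denote the bottleneck visual tokens and the textual token embeddings, respectively.
The conditional prior distribution of the latent morphology variable $z$ is defined as
$$p_\theta(z \mid g_v, g_t) = \mathcal{N}\big(z;\, \mu_p, \operatorname{diag}(\sigma_p^2)\big),$$
with parameters predicted by a prior network, $(\mu_p, \log \sigma_p^2) = f_{prior}([g_v; g_t])$. During training, geometric information from the ground-truth mask $Y$ is incorporated to construct a posterior distribution:
$$q_\phi(z \mid g_v, g_t, Y) = \mathcal{N}\big(z;\, \mu_q, \operatorname{diag}(\sigma_q^2)\big).$$
The mask is downsampled and flattened into $m$, and the posterior parameters are obtained by
$$(\mu_q, \log \sigma_q^2) = f_{post}([g_v; g_t; m]).$$
Latent samples are then generated using the reparameterization trick:
$$z = \mu_q + \sigma_q \odot \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, I)$. Rather than injecting latent variables at the decoder stage, we incorporate $z$ directly into the semantic bottleneck. The sampled latent vector is projected to the visual embedding space as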
and applied through residual modulation:
To further integrate text and semantic features, we propose a hierarchical progressive decoder that bridges the gap between deep abstract features and shallow spatial cues by injecting multi-granular textual semantics at each resolution scale.
The decoding process is organized into three stages, each corresponding to a specific tier of the clinical hierarchy. At the first upsampling stage, the latent-modulated features
are fused with the
-scale backbone features
. To ground the decoder in anatomical space, we inject the regional text embeddings
. These embeddings provide context regarding the specific anatomical zones described in the report, helping the decoder disambiguate the lesion’s location:
As the decoder approaches the final resolution, the focus shifts from anatomical localization to precise lesion delineation consistent with the clinical text. We utilize the high-level diagnostic embeddings
to supervise the final two stages. This ensures that the generated boundaries are driven by the text prompts:
Finally, the output features are upsampled to the original image resolution to generate the predictive segmentation masks.
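The sampling step at the heart of this subsection can be sketched as follows. The sketch assumes diagonal Gaussian prior and posterior distributions and the usual CVAE convention of sampling from the posterior during training and from the prior at inference; the latent size and the parameter values are hypothetical.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    terms = logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return 0.5 * np.sum(terms, axis=-1)

rng = np.random.default_rng(0)
mu_p, logvar_p = np.zeros(16), np.zeros(16)              # prior parameters (image + text)
mu_q, logvar_q = np.full(16, 0.5), np.full(16, -1.0)     # posterior parameters (image + text + mask)

z_train = reparameterize(mu_q, logvar_q, rng)            # training: sample from the posterior
z_a = reparameterize(mu_p, logvar_p, rng)                # inference: two draws from the prior
z_b = reparameterize(mu_p, logvar_p, rng)
kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

Two draws from the same prior give different latent vectors, and decoding each would give a different mask for the same image and report, which is how the one-to-many mapping is realized in practice. The KL term measures how far the mask-informed posterior has drifted from what the prior can predict without the mask.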
3.4. Loss Function
To further regularize the latent space and ensure that the visual features are semantically grounded in the linguistic priors, we introduce the Semantic Alignment Loss $\mathcal{L}_{SAL}$, based on instance-level contrastive learning. This objective encourages the model to maximize the feature similarity between paired image–text samples while minimizing it for non-paired instances within a mini-batch.
Given the global visual descriptor $g_v$ and the morphological textual descriptor $g_t$, we first project them onto a unit hypersphere by $\ell_2$-normalization:
$$\hat{g}_v = \frac{g_v}{\lVert g_v \rVert_2}, \qquad \hat{g}_t = \frac{g_t}{\lVert g_t \rVert_2}.$$
We then construct a cosine similarity matrix $S \in \mathbb{R}^{B \times B}$ for a mini-batch of $B$ pairs, where each element $S_{ij} = \hat{g}_{v,i}^{\top} \hat{g}_{t,j}$ represents the alignment score between the $i$-th visual sample and the $j$-th textual description. The alignment task is formulated as a multi-class classification problem where the model identifies the corresponding textual partner for each image instance. The loss is defined using the cross-entropy over the similarity logits:
$$\mathcal{L}_{SAL} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^{B} \exp(S_{ij} / \tau)}, \tag{20}$$
where $\tau$ is a temperature hyper-parameter (fixed to 1.0 in our implementation).
By optimizing $\mathcal{L}_{SAL}$, the semantic bottleneck is forced to aggregate morphology-relevant visual cues that are explicitly described in the clinical reports. This alignment acts as a semantic anchor, preventing the CVAE from generating visually plausible but clinically irrelevant variations in the latent space of $z$.
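The alignment objective can be sketched as an image-to-text cross-entropy over cosine-similarity logits. The function name, batch size, and feature width are hypothetical, and the sketch assumes both descriptors have already been projected to a common dimension.

```python
import numpy as np

def semantic_alignment_loss(visual, text, tau=1.0):
    """Image-to-text cross-entropy over cosine-similarity logits for B paired samples."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = (v @ t.T) / tau                                    # (B, B), entry (i, j) = image i vs text j
    logits = logits - logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                          # correct partner lies on the diagonal

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 32))
visual_paired = text + 0.01 * rng.normal(size=(8, 32))          # descriptors that agree with their text
visual_mismatched = np.roll(visual_paired, 1, axis=0)           # every image paired with the wrong text
loss_paired = semantic_alignment_loss(visual_paired, text)
loss_mismatched = semantic_alignment_loss(visual_mismatched, text)
```

With the temperature fixed to 1.0 the logits are bounded in [-1, 1], so the loss stays well above zero even for perfectly aligned pairs; it still ranks the paired batch clearly below the mismatched one.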
During the training phase, symmetry perception is enforced through the Semantic Alignment Loss ($\mathcal{L}_{SAL}$). Rather than relying on a standalone heuristic symmetry loss, our framework enforces symmetry awareness via cross-modal contrastive learning. Specifically, the global visual descriptor $g_v$ intrinsically aggregates the quantified symmetry representations ($F_{sym}$), while the morphological textual descriptor $g_t$ encodes symmetry-related clinical descriptions (e.g., “unilateral”, “bilateral asymmetry”, “symmetric infection”). By minimizing the $\mathcal{L}_{SAL}$ defined in Equation (20), the network is penalized when the quantified visual asymmetry fails to match the linguistic symmetry attributes. This alignment encourages the model’s semantic bottleneck to consistently recognize and preserve symmetry-related variations during the optimization process.
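The total objective function $\mathcal{L}_{total}$ is a weighted combination of the segmentation loss $\mathcal{L}_{seg}$, the KL divergence $\mathcal{L}_{KL}$, and the cross-modal alignment loss $\mathcal{L}_{SAL}$:
$$\mathcal{L}_{total} = \mathcal{L}_{seg} + \beta(t)\, \mathcal{L}_{KL} + \lambda\, \mathcal{L}_{SAL},$$
where $\lambda$ denotes the weight for alignment, and $\beta(t)$ is a time-dependent scheduling weight designed to prevent posterior collapse in the CVAE. Specifically, we employ a two-stage curriculum learning strategy for $\beta(t)$ based on the current epoch $t$: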
This warm-up schedule allows the network to prioritize deterministic segmentation in early epochs before smoothly regularizing the latent morphology distribution.
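One common way to realize such a warm-up is a linear ramp followed by a constant phase, sketched below. The ramp shape, the warm-up length of 50 epochs, the final KL weight of 1.0, and the alignment weight of 0.1 are all hypothetical placeholders; the schedule actually used in this work is not reproduced here.

```python
def kl_weight(epoch: int, warmup_epochs: int = 50, beta_max: float = 1.0) -> float:
    """Stage 1: linear ramp from 0 to beta_max; stage 2: hold at beta_max."""
    if warmup_epochs <= 0:
        return beta_max
    return beta_max * min(1.0, epoch / warmup_epochs)

def total_loss(seg: float, kl: float, sal: float, epoch: int, lam: float = 0.1) -> float:
    """Weighted sum of segmentation, KL, and alignment terms (weights are assumptions)."""
    return seg + kl_weight(epoch) * kl + lam * sal

early = total_loss(seg=1.0, kl=2.0, sal=3.0, epoch=0)     # KL term switched off
late = total_loss(seg=1.0, kl=2.0, sal=3.0, epoch=120)    # KL term at full weight
```

At epoch 0 the KL term contributes nothing, so the encoder is free to pack mask information into the latent code; by the end of the ramp the same KL value adds its full weight to the objective.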
4. Experiments and Results
To comprehensively evaluate the performance of the proposed method on the COVID-19 lung segmentation task, the dataset and experimental setup are first described, followed by quantitative and qualitative evaluations. Finally, ablation studies are conducted to analyze the contribution of each component of the model.
4.1. Datasets and Clinical Relevance
4.1.1. QaTa-COV19 Dataset
QaTa-COV19 [40] is a large-scale chest X-ray (CXR) benchmark for pixel-level COVID-19 pneumonia segmentation. Dataset composition: The dataset comprises 9258 COVID-19 CXR images with pixel-wise infection annotations, collected primarily from the BIMCV-COVID19+ repository and complemented with previously curated cohorts. Annotation protocol: Ground-truth masks were generated via a human-in-the-loop pipeline. Initially, manually annotated CXRs were used to train segmentation networks (e.g., U-Net variants). These models then generated candidate masks for new images. Expert radiologists reviewed all candidates, manually correcting or re-annotating inaccurate predictions. Task definition and clinical relevance: This task delineates COVID-19 manifestations (e.g., low-contrast ground-glass opacities and consolidations) in CXRs. Accurate localization improves diagnostic interpretability, minimizes model distraction by irrelevant structures, and enables downstream disease assessment. Crucially, while healthy lungs exhibit macroscopic bilateral symmetry, COVID-19 typically causes asymmetric opacities. The proposed method leverages this structural symmetry and pathology-induced asymmetry as a core inductive bias to localize abnormalities. The dataset encompasses diverse disease presentations (e.g., unilateral, bilateral, multifocal). To facilitate vision–language alignment, each image is paired with structured, radiology-style clinical texts. Following standard protocols [34,38], the data is split into 5716 training, 1429 validation, and 2113 test samples to enable fair standardized comparisons.
4.1.2. MosMedData+ Dataset
MosMedData+ [41,42] is a chest CT-based COVID-19 infection segmentation dataset comprising 2729 axial CT slices with corresponding binary lesion masks. Dataset composition: Derived from the clinical MosMedData cohort, the dataset contains chest CT scans of suspected or confirmed COVID-19 cases. The selected slices capture diverse disease severities, lesion densities, and spatial distributions. Annotation protocol: Pixel-level masks are generated through a clinician-in-the-loop workflow, where initial delineations are refined and validated by expert radiologists. This consistently captures subtle opacities and consolidations, ensuring anatomical correctness for reliable volumetric analysis. Task definition and clinical relevance: This task segments COVID-19 CT abnormalities (ground-glass opacities, consolidations) to quantify infection burden, which is vital for diagnosis, severity stratification, and treatment monitoring. Furthermore, similar to CXRs, axial CTs reveal bilateral lung symmetry; leveraging these symmetric features enables the model to effectively separate asymmetric lesions from normal structures. To facilitate cross-modal learning, structured textual annotations encode clinical semantics (e.g., laterality, distribution, and localization) aligned with radiological conventions [34]. The dataset is split into 2183 training, 273 validation, and 273 test slices for balanced evaluation.
4.1.3. BraTS 2021 Dataset
BraTS 2021 [43] is a multi-parametric MRI-based brain tumor segmentation dataset. For this study, a curated subset comprising 1251 multi-modal brain volumes with corresponding multi-class lesion masks is utilized. Dataset composition: Derived from a multi-institutional clinical cohort, the dataset contains pre-operative baseline mpMRI scans (T1, T1Gd, T2, T2-FLAIR) of pathologically confirmed glioma patients. The selected volumes capture highly heterogeneous image qualities, diverse tumor grades, and intrinsic variations in tumor appearance. Annotation protocol: Voxel-level masks are generated through an AI-assisted, clinician-in-the-loop workflow, where initial automated delineations are iteratively refined and approved by board-certified neuroradiologists. This consistently captures complex glioma sub-regions including the enhancing tumor, necrotic core, and peritumoral edema. Task definition and clinical relevance: We formulate whole-tumor segmentation on paired axial T2-weighted slices and their corresponding masks. Accurately quantifying this tumor burden is vital for surgical treatment planning, radiotherapy mapping, and disease monitoring. Furthermore, healthy brain MRIs exhibit distinct bilateral hemisphere symmetry. Leveraging these symmetric features enables the model to effectively distinguish asymmetric space-occupying lesions from normal brain anatomy. To facilitate vision–language segmentation, Qwen3-VL [44] is employed to generate detailed lesion-related descriptions, encoding relevant clinical semantics aligned with radiological conventions. The 1251 samples are split into training, validation, and test sets at a 7:1:2 ratio (875, 125, and 251 samples, respectively) for balanced evaluation.
4.1.4. MSD Liver Dataset
MSD Task03 Liver [45] is an abdominal CT-based liver and tumor segmentation dataset. For this study, a cohort of 131 patient volumes with corresponding multi-class masks is utilized. Dataset composition: Derived from the Medical Segmentation Decathlon, the dataset contains portal venous phase contrast-enhanced CT scans of patients with various liver tumors. The selected scans capture diverse liver shapes. Annotation protocol: Voxel-level masks are generated through a rigorous clinical workflow, where initial delineations of the liver are manually refined and validated by expert radiologists. This consistently captures subtle lesion boundaries, ensuring anatomical correctness for reliable volumetric analysis. Task definition and clinical relevance: This task segments the whole liver to quantify disease burden and organ volume, which is vital for surgical resection planning, oncology monitoring, and treatment evaluation. Furthermore, while the liver is inherently asymmetric, axial abdominal CTs exhibit global bilateral symmetry. Leveraging these contextual symmetric features helps the model distinguish the liver from surrounding structures. To facilitate vision–language segmentation, Qwen3-VL [44] is employed to generate detailed lesion-related descriptions, encoding relevant clinical semantics aligned with radiological conventions. The 131 3D patient volumes are split into training, validation, and test sets at a 7:1:2 ratio (92, 13, and 26 samples, respectively). To adapt the data for training, 5 representative axial slices are extracted from each volume, ensuring a balanced and diverse evaluation protocol.
4.2. Experiment Setup
All experiments are implemented in PyTorch 2.0.1 with torchvision 0.15.2 and Python 3.10, and are conducted on a Giga computing server MS03-CE0-000 equipped with an NVIDIA RTX 4090 GPU and an Intel(R) Xeon(R) Platinum 8476C CPU. The operating system is Ubuntu 22.04, and the development software is PyCharm (version 2025.2.4). The AdamW optimizer is adopted, with momentum coefficients $\beta_1$ and $\beta_2$. The learning rate is scheduled using a cosine annealing strategy, with a minimum value of $1 \times 10^{-6}$. Early stopping is applied based on the validation mean intersection-over-union (mIoU), where training is terminated if no performance improvement is observed for 30 consecutive epochs. Data preprocessing and augmentation are implemented using the MONAI pipeline. During training, images and masks are first loaded and converted into a unified format, followed by random scaling and rotation for geometric augmentation. The inputs are then resized to a fixed resolution and normalized on a per-channel basis. For the QaTa-COV19 dataset, the total training time of the proposed SVP-CVAE framework is approximately 1.66 h for 150 epochs. For validation and testing, stochastic augmentations are removed, and only deterministic operations, including loading, resizing, normalization, and tensor conversion, are retained to ensure fair and reproducible evaluation.
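The learning-rate schedule and stopping rule described above can be sketched in plain Python. The minimum learning rate, the 150-epoch horizon, and the patience of 30 come from the text; the base learning rate of 3e-4 and the helper names are hypothetical.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, base_lr, min_lr=1e-6):
    """Cosine decay from base_lr at epoch 0 down to min_lr at total_epochs."""
    progress = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

class EarlyStopping:
    """Stop when the validation mIoU has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=30):
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, val_miou):
        """Record one epoch's validation score; return True when training should stop."""
        if val_miou > self.best:
            self.best = val_miou
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience

lr_start = cosine_annealing_lr(0, 150, base_lr=3e-4)
lr_mid = cosine_annealing_lr(75, 150, base_lr=3e-4)
lr_end = cosine_annealing_lr(150, 150, base_lr=3e-4)

stopper = EarlyStopping(patience=3)      # short patience only to keep the demo small
decisions = [stopper.step(m) for m in [0.50, 0.60, 0.60, 0.59, 0.58]]
```

A tie with the best score counts as no improvement in this sketch, so a plateau exhausts the patience budget just as a decline does.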
4.3. Evaluation Metrics
To comprehensively evaluate segmentation performance, four widely used metrics are adopted, including the Dice similarity coefficient (DSC), mIoU, Hausdorff distance (HD), and average surface distance (ASD). These metrics jointly assess both the overlap accuracy and the boundary delineation quality of the predicted segmentation results.
The DSC measures the similarity between the predicted region $P$ and the ground-truth region $G$, and is defined as
$$\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|},$$
where $|\cdot|$ denotes the number of pixels or voxels in the corresponding region. A higher DSC value indicates better overlap consistency between prediction and ground truth.
The mIoU evaluates the intersection-over-union ratio between the predicted region $P$ and the reference region $G$, which is formulated as
$$\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|},$$
with the reported mIoU being the mean of this ratio. Compared with DSC, mIoU imposes a stricter penalty on false positives and false negatives, thereby providing a robust assessment of segmentation accuracy.
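Both overlap metrics can be computed from the same pixel counts, as the NumPy sketch below shows for a pair of toy binary masks. Returning 1.0 when both masks are empty is a convention chosen here for illustration, not one stated in the text.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice similarity coefficient and IoU for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    dice = 2.0 * intersection / total if total > 0 else 1.0   # empty-vs-empty treated as perfect
    iou = intersection / union if union > 0 else 1.0
    return float(dice), float(iou)

pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0]])
target = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0]])
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two scores are tied by IoU = DSC / (2 - DSC), so IoU is never larger than DSC and the gap is widest for mediocre overlaps. In the toy case, half-overlapping masks score 0.5 on DSC but only one third on IoU.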
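To further evaluate boundary precision, the Hausdorff distance (HD) is employed to measure the maximum distance between the boundary points of the predicted segmentation and the ground truth:
$$\mathrm{HD}(P, G) = \max\left\{ \max_{p \in S_P} \min_{g \in S_G} d(p, g),\; \max_{g \in S_G} \min_{p \in S_P} d(p, g) \right\},$$
where $S_P$ and $S_G$ are the boundary point sets of the prediction and the ground truth, and $d(\cdot, \cdot)$ denotes the Euclidean distance. Lower HD values indicate better boundary alignment and fewer extreme segmentation errors.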
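In addition, the average surface distance (ASD) is adopted to evaluate the average boundary discrepancy between the predicted and reference contours:
$$\mathrm{ASD}(P, G) = \frac{1}{|S_P| + |S_G|} \left( \sum_{p \in S_P} \min_{g \in S_G} d(p, g) + \sum_{g \in S_G} \min_{p \in S_P} d(p, g) \right),$$
where $S_P$ and $S_G$ represent the surface point sets of the predicted and ground-truth regions, respectively, and $d(\cdot, \cdot)$ is the Euclidean distance. ASD reflects the overall contour consistency and is less sensitive to outlier boundary points compared with HD.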
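The two boundary metrics can be sketched with a brute-force pairwise distance matrix. The sketch assumes 2D binary masks that are both non-empty, a 4-connected boundary definition, distances in pixel units, and the symmetric form of ASD; evaluation toolkits may differ on each of these choices.

```python
import numpy as np

def boundary_points(mask):
    """Coordinates of foreground pixels with at least one 4-connected background neighbour."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1)                                  # pads with background
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(mask & ~interior)

def hd_and_asd(pred, target):
    """Hausdorff distance and symmetric average surface distance (both masks non-empty)."""
    sp, sg = boundary_points(pred), boundary_points(target)
    dists = np.linalg.norm(sp[:, None, :] - sg[None, :, :], axis=-1)   # (|S_P|, |S_G|)
    p_to_g = dists.min(axis=1)            # nearest reference point for each predicted point
    g_to_p = dists.min(axis=0)            # nearest predicted point for each reference point
    hd = max(p_to_g.max(), g_to_p.max())
    asd = (p_to_g.sum() + g_to_p.sum()) / (len(sp) + len(sg))
    return float(hd), float(asd)

pred = np.zeros((8, 8), dtype=bool)
pred[1:4, 1:4] = True                     # 3 x 3 square
target = np.zeros((8, 8), dtype=bool)
target[1:4, 3:6] = True                   # same square shifted two pixels to the right
hd, asd = hd_and_asd(pred, target)
```

For the shifted squares the worst boundary point is two pixels away from the other contour while the average point is one pixel away, which illustrates why HD reacts to the single largest deviation and ASD to the typical one.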
Overall, DSC and mIoU mainly evaluate region overlap accuracy, while HD and ASD focus on boundary-level segmentation quality. The combination of these metrics provides a comprehensive and reliable evaluation of segmentation performance.
4.4. Comparison with State-of-the-Art Models
Extensive experiments are conducted to evaluate the proposed vision–language segmentation framework for medical segmentation across multiple imaging modalities. Due to the extreme scarcity of paired image–text medical data, QaTa-COV19 and MosMedData+ have become standard benchmarks for vision–language segmentation (e.g., LViT [34], RecLMIS [38], FMISeg [36]). We adopt them primarily to ensure fair comparisons with these existing vision–language baselines. To further demonstrate our framework’s generalizability beyond these specific domains, evaluations on BraTS 2021 and MSD Liver are detailed in Section 4.5. We benchmark our method against 15 representative state-of-the-art (SOTA) medical image segmentation models. These baselines are systematically categorized into three groups: (1) convolutional and transformer-based architectures (U-Net [22], U-Net++ [23], Swin-UNet [26], nnUNet [24], UNETR [46], TransUNet [25], CFFormer [47]), (2) foundational vision models adapted for medical tasks (SAM Adapter [48]), and (3) recent domain-specific vision–language multimodal approaches tailored for medical segmentation (CLIPSeg [49], LViT [34], LGMS [35], RecLMIS [38], FMISeg [36], ViTexNet [50] and HiMix [51]). For a fair comparison, all SOTA methods are trained and evaluated using identical data splits and a unified preprocessing pipeline. Moreover, all experiments are conducted under consistent hardware conditions and comparable training budgets to avoid performance bias. This controlled setup enables an objective evaluation of segmentation accuracy and generalization capability.
4.4.1. Experiments on QaTa-COV19 Dataset
Table 1 presents the quantitative segmentation results on the QaTa-COV19 dataset. The proposed method achieves the best overall performance, with a DSC of 91.36% and an mIoU of 84.10%. These results slightly surpass those of FMISeg, which attains a DSC of 91.04% and an mIoU of 83.56%, as well as LGMS, which achieves a DSC of 89.85% and an mIoU of 81.78%. This improvement demonstrates consistent gains in region overlap accuracy. In terms of boundary quality, the method achieves the lowest HD value, reaching 20.71, which is lower than the values of 21.49 obtained by FMISeg and 21.66 obtained by HiMix. This result indicates more precise contour localization and fewer extreme boundary deviations. Although the ASD value is 3.55, which is higher than the value of 2.76 achieved by LViT, it remains competitive and is substantially lower than that of most other approaches. This observation suggests that boundary smoothness is preserved while optimizing overlap metrics.
Furthermore, a clear performance gap between text-guided and non-text methods is observed. Conventional architectures such as nnUNet achieve a DSC of 79.05%, while Swin-UNet achieves a DSC of 77.97%, both showing notably inferior performance. Even strong non-text baselines such as SAM Adapter, which achieves a DSC of 89.12% and an mIoU of 80.57%, are still outperformed by several text-based methods. This trend highlights the effectiveness of incorporating semantic priors from textual information to improve both region-level accuracy and boundary delineation. Overall, the results demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics while maintaining a balanced optimization between overlap fidelity and boundary refinement.
The qualitative comparisons shown in Figure 4 demonstrate that the proposed method produces more accurate and consistent lesion segmentation across diverse cases, with fewer false positives and false negatives than competing approaches. Specifically, nnUNet tends to produce over-segmentation, introducing large false positive regions in multiple cases, whereas LANG and LViT partially mitigate this issue but still exhibit fragmented predictions and missing lesion regions. FMISeg demonstrates improved localization capability; however, it occasionally generates spurious activations and fails to fully capture lesion boundaries in challenging scenarios.
In contrast, the proposed method produces predictions that are more closely aligned with the ground truth, effectively suppressing irrelevant regions while preserving complete lesion structures (as shown in the third and fourth rows). This advantage is particularly evident in small or ambiguous regions, where other methods either fail to detect subtle lesions or introduce noisy responses, whereas the proposed method maintains coherent and compact segmentation. Furthermore, the predicted boundaries are smoother and more precise, with fewer discontinuities and reduced leakage into surrounding normal tissue, indicating stronger spatial consistency.
Notably, compared with other models, the proposed method demonstrates more effective utilization of textual information, where the predicted regions are better aligned with the semantic descriptions provided by the text, reflecting a stronger correspondence between visual features and textual cues. These visual observations are consistent with the quantitative results, confirming that the proposed method improves segmentation reliability by simultaneously reducing false positives and false negatives while enhancing structural completeness.
4.4.2. Experiments on MosMedData+ Dataset
Table 2 presents the quantitative segmentation results on the MosMedData+ dataset. The proposed method achieves the highest region overlap performance, with a DSC of 80.17% and an mIoU of 66.91%. These results exceed those of SAM Adapter, which attains a DSC of 79.12% and an mIoU of 65.45%, as well as FMISeg, which achieves a DSC of 79.49% and an mIoU of 65.94%. This comparison indicates clear advantages in capturing lesion regions across both vision-only and vision–language methods.
In addition to overlap accuracy, the proposed method maintains competitive boundary precision. The ASD is 2.96, which is close to the value of 2.94 achieved by FMISeg, while the HD is 22.93, remaining comparable to the values of 22.35 obtained by SAM Adapter and 22.32 obtained by LGMS. These results indicate that the improvement in overlap performance does not lead to degradation in boundary quality. A broader comparison reveals that text-guided methods generally outperform traditional architectures. For example, U-Net achieves a DSC of 54.12%, and U-Net++ achieves a DSC of 56.84%, both significantly lower than those of text-based approaches.
However, the effectiveness of text-guided methods varies depending on the quality of vision–language alignment, as evidenced by the strong performance of SAM Adapter without textual input. In addition, some methods exhibit inconsistencies between overlap and boundary metrics. For instance, LGMS achieves a relatively high DSC of 78.01% but produces an ASD of 9.70, indicating suboptimal boundary consistency.
In contrast, the proposed method achieves a more balanced trade-off by simultaneously improving region accuracy and maintaining boundary smoothness. These results demonstrate that effective integration of textual priors with visual representations can lead to both quantitative gains and more stable segmentation behavior.
The qualitative comparison on the MosMedData+ dataset shown in Figure 5 demonstrates that the proposed method produces more accurate and reliable segmentation, particularly in challenging cases involving subtle and small-scale lesions. Compared with nnUNet, LViT, and FMISeg, which exhibit noticeable false positive and false negative regions, the proposed method generates predictions that are more consistent with the ground truth in both lesion extent and structural details.
Specifically, competing methods tend to either miss low-contrast infection regions or introduce spurious activations in normal areas, indicating limitations in distinguishing ambiguous boundaries. In contrast, the proposed method effectively suppresses false positives while recovering difficult-to-detect lesion regions, leading to more complete and cleaner segmentation masks.
This advantage is particularly evident in regions with weak intensity contrast or fragmented lesion patterns, where the model maintains stronger spatial continuity and more accurate boundary delineation. These observations indicate that the proposed method improves both sensitivity to subtle pathological patterns and specificity, resulting in fewer misclassified regions and more clinically reliable segmentation outcomes.
4.5. Generalization Across Diverse Anatomical Symmetry Profiles
We further validate the proposed SVP-CVAE framework across anatomies with different symmetry characteristics. Specifically, we evaluate the framework on two representative datasets with distinct symmetry properties: the BraTS dataset, featuring highly bilateral brain symmetry, and the MSD Liver dataset, characterized by inherently asymmetric organ geometry. Quantitative comparisons with state-of-the-art methods are reported in Table 3.
Performance on Highly Symmetric Anatomy (BraTS dataset): The human brain provides an ideal scenario for evaluating the explicit symmetry perception capability of SVP-CVAE. As shown in Table 3, our method achieves the best segmentation performance, with a DSC of 93.47% and an mIoU of 87.74%. Benefiting from the cross-bilateral symmetry mechanism, the morphology-aware enhancement module effectively compares bilateral hemispheric structures, enabling accurate identification of unilateral tumor regions as symmetry-disrupting anomalies. In addition, the attribute-latent contrastive objective aligns these asymmetric features with fine-grained clinical semantic descriptions, resulting in more precise boundary delineation.
Performance on Asymmetric Anatomy (MSD Liver dataset): Unlike the lungs and brain, the liver exhibits substantial anatomical asymmetry and inter-patient morphological variability. Despite the absence of intrinsic organ symmetry, SVP-CVAE still achieves state-of-the-art performance on the MSD Liver dataset, obtaining a DSC of 97.45% and an mIoU of 95.02%, outperforming recent vision–language and transformer-based methods such as CDFormer and LViT.
The strong performance on asymmetric targets stems from two factors. First, although the liver is anatomically asymmetric, surrounding structures in axial CT slices preserve relatively stable global contextual symmetry, which provides reliable spatial cues for localization through the proposed cross-bilateral symmetry mechanism. Second, to address inter-patient structural variability, our CVAE-based vision–language framework integrates probabilistic prior-to-posterior inference with text-guided alignment. This combination enables the model to capture diverse, symmetry-related variations and maintain robust feature representations.
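The idea of treating a unilateral lesion as a symmetry-disrupting anomaly can be illustrated with a deliberately simplified sketch. It mirrors a feature map about the vertical midline and measures the absolute difference. This is an assumption-laden toy, not the authors' cross-bilateral symmetry mechanism: the function name, the (C, H, W) layout, and the premise that the anatomy is centred on the image midline are all hypothetical.

```python
import numpy as np


def bilateral_asymmetry_map(features) -> np.ndarray:
    """Channel-averaged absolute difference between a (C, H, W) feature map
    and its left-right mirror image. Assumes the midline is the image centre."""
    f = np.asarray(features, dtype=float)
    mirrored = f[..., ::-1]  # flip along the width axis
    return np.abs(f - mirrored).mean(axis=0)


# Toy example: a perfectly symmetric background with one unilateral "lesion".
symmetric = np.zeros((1, 4, 6))
with_lesion = symmetric.copy()
with_lesion[0, 1, 1] = 1.0  # response only on the left side

clean_map = bilateral_asymmetry_map(symmetric)
lesion_map = bilateral_asymmetry_map(with_lesion)
```

The asymmetry map responds at both the lesion location and its mirrored position, so a real module would need additional cues (for example, the original features or text guidance) to decide which side is abnormal.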
4.6. Ablation Study
4.6.1. Overall Component Analysis
Table 4 validates each component of the proposed framework from the perspectives of latent modeling, morphology awareness, and semantic alignment. Starting from the baseline with Dice of 88.75% and mIoU of 79.78%, introducing conditional latent morphology distribution modeling (CLMD) improves performance to 89.23% and 80.56%, while reducing ASD from 4.63 to 4.43 and HD from 26.16 to 23.49. This indicates that conditional latent modeling enhances global stability and reduces extreme prediction errors. Adding MARL further boosts Dice to 91.05% and mIoU to 83.57%, with ASD and HD reduced to 3.72 and 21.47. The gains of 1.82 points in Dice and 3.01 points in mIoU over the CLMD-only variant demonstrate that morphology-aware modeling significantly improves fine-grained structure and boundary accuracy. Replacing MARL with ACTA achieves 90.88% Dice and 83.29% mIoU, with ASD of 3.74 and HD of 22.48, showing that attribute-level alignment mainly enhances semantic consistency with moderate boundary refinement. Combining MARL and ACTA further improves performance to 91.22% Dice and 83.85% mIoU, with ASD reduced to 3.61. This confirms that morphology modeling and semantic alignment are complementary, although partially overlapping. Finally, introducing the attribute-latent consistency loss achieves the best results with 91.36% Dice and 84.10% mIoU, and further reduces ASD to 3.55 and HD to 20.71, indicating that attribute-latent consistency regularizes the latent space and improves both structural and semantic quality. Overall, CLMD improves uncertainty-aware representation, MARL enhances morphological precision, ACTA strengthens semantic alignment, and the additional supervision further regularizes the latent distribution, leading to consistent gains across all metrics.
4.6.2. Analysis of CLMD Framework Contributions
To evaluate the effectiveness of the proposed CLMD, ablation studies are conducted by selectively removing its key components, including the prior, posterior, and KL regularization, as summarized in Table 5.
Removing the entire CVAE module results in a Dice score of 90.54% and an mIoU of 82.72%, indicating a clear performance degradation compared with the full model. This result suggests that purely deterministic modeling is insufficient to capture the inherent ambiguity in medical image segmentation. Introducing only the prior increases the Dice score to 90.81% and the mIoU to 83.16%, demonstrating limited improvement. This observation indicates that stochasticity without posterior guidance is not sufficient to learn task-relevant latent representations.
When both the prior and posterior are retained but the KL regularization is removed, the Dice score further increases to 91.06% and the mIoU to 83.60%. This improvement reflects the strong supervisory role of the posterior conditioned on ground-truth masks. However, the absence of KL regularization prevents explicit alignment between the prior and posterior distributions, which may lead to distribution mismatch during inference. The full model achieves the best performance, with a Dice score of 91.36% and an mIoU of 84.10%. These results demonstrate that jointly modeling the prior and posterior with KL regularization produces a more structured and generalizable latent space. The consistent improvements across different configurations confirm that each component of CLMD contributes in a complementary manner, enabling more robust and uncertainty-aware segmentation.
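The role of the KL term in aligning the prior with the posterior can be sketched with the closed-form divergence between two diagonal Gaussians. This is a minimal example assuming both distributions are diagonal Gaussians parameterized by a mean and a log-variance, which is a common CVAE choice; this section does not state the exact parameterization used in CLMD, and the function and argument names are hypothetical.

```python
import numpy as np


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p) -> float:
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ),
    summed over latent dimensions. Here q plays the role of the posterior
    and p the role of the prior."""
    mu_q, logvar_q, mu_p, logvar_p = (
        np.asarray(x, dtype=float) for x in (mu_q, logvar_q, mu_p, logvar_p)
    )
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return float(kl.sum())


# Identical distributions give zero divergence; a unit mean shift gives 0.5.
kl_same = kl_diag_gaussians([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
kl_shift = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

Minimizing this term during training pulls the prior toward the mask-conditioned posterior, which is what makes sampling from the prior alone meaningful at inference time.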
4.6.3. Analysis of MARL Module Contributions
To further validate the effectiveness of the MARL module, we conduct a set of module comparison experiments. The ablation results of MARL in Table 6 demonstrate that both SSGM and CBSM contribute to performance improvement from distinct structural perspectives, and their combination yields consistent gains across all metrics. Compared to the variant without MARL, introducing only SSGM improves Dice and mIoU to 79.57% and 66.08%, respectively, while substantially reducing ASD from 4.61 to 3.50, indicating enhanced sensitivity to local boundary variations. Similarly, the use of only CBSM achieves further gains and reduces ASD to 3.19, suggesting that modeling bilateral structural inconsistency is particularly effective in capturing global morphological irregularities. When both components are jointly integrated, the full MARL module achieves the best performance, with consistent improvements across all evaluation metrics. The progressive reduction in ASD and HD highlights that SSGM and CBSM provide complementary benefits in refining boundary precision and reducing structural deviations, while their joint modeling enables a more comprehensive characterization of lesion morphology. Overall, these results validate that integrating local morphology awareness with global asymmetry modeling is critical for constructing a robust and morphology-sensitive representation.
4.6.4. Analysis of Vision–Language Fusion Mechanism
To investigate the effectiveness of the proposed text fusion strategy, we conduct a series of ablation studies by varying the usage of textual information and the fusion design, as summarized in Table 7. Removing textual guidance leads to inferior performance, indicating that purely visual features are insufficient to fully capture semantic context in medical image segmentation. Introducing text features at a single scale yields consistent improvements. Specifically, using only low-level text features achieves a moderate gain, while high-level text features provide a larger improvement, suggesting that semantically richer textual representations are more beneficial for guiding segmentation. Further incorporating multi-scale text features leads to additional performance gains, demonstrating the advantage of leveraging complementary information across different semantic levels. However, this variant employs a simple MLP-based fusion, which lacks explicit modeling of cross-modal interactions. The full model achieves the best performance, showing that the proposed learned fusion mechanism further enhances feature integration by explicitly modeling structured interactions between visual and textual features. The numerical improvement over simple multi-scale fusion remains consistent, indicating that the proposed fusion module provides a more effective and stable way to exploit textual guidance. These results indicate that textual guidance improves segmentation, where high-level and multi-scale representations are more effective, and structured fusion further facilitates cross-modal interaction.
4.6.5. Analysis of Alignment Loss Weight
The sensitivity analysis of the alignment loss weight (Figure 6) demonstrates that moderate values are critical for optimal performance. Increasing the weight from 0.01 to 0.1 raises the mIoU from 0.8399 to 0.841, indicating that tighter attribute-latent semantic consistency promotes more robust cross-modal alignment. However, performance degrades beyond 0.1: the mIoU decreases to 0.8405 and then to 0.8375 at the two larger weights evaluated. This degradation suggests that over-emphasizing semantic alignment can interfere with the primary segmentation task. The HD metric follows the same trend. Based on these findings, we set the weight to 0.1 by default, as it provides the most favorable trade-off between segmentation precision and structural consistency.
4.7. Morph Feature Visualization
The visualization results shown in Figure 7 demonstrate that CBSM and SSGM exhibit distinct yet complementary spatial attention patterns, which are effectively unified in the proposed method. Specifically, CBSM emphasizes bilaterally symmetric structures, consistently activating along both lung fields and particularly around boundary regions, indicating its ability to capture global structural priors and reinforce contour-aware representations. In contrast, SSGM focuses more on the central regions of the lung fields, where the responses are more concentrated and less affected by peripheral noise, suggesting enhanced sensitivity to intra-pulmonary semantic consistency and intensity variations. By integrating these complementary characteristics, the proposed method produces more spatially coherent and balanced activation maps, achieving a more uniform attention distribution across both peripheral and central regions. Furthermore, the fused features exhibit stronger and more precise responses in lesion-related areas, with activation patterns that more closely align with the ground truth annotations, especially in cases with irregular shapes or diffuse boundaries. This improvement indicates that the integration strategy effectively aggregates multi-scale contextual information while suppressing irrelevant background interference.
As a result, the final segmentation outputs demonstrate improved structural completeness and regional consistency, with reduced false activations outside the lung fields and better coverage of lesion regions. These observations confirm that the proposed feature integration not only enhances local discriminability but also preserves global anatomical structure, which is essential for achieving reliable and accurate medical image segmentation.
4.8. Model Efficiency Analysis
Table 8 analyzes the performance gain relative to computational cost. For a more comprehensive evaluation, we report not only model size and FLOPs but also runtime-related metrics such as latency, FPS, and GOPS, which better reflect practical deployment performance. Although our model is not the lightest among all compared methods, it achieves the best overall segmentation performance while maintaining competitive computational complexity. Compared with FMISeg, our method delivers higher Dice and mIoU scores with substantially fewer FLOPs, lower latency, and higher FPS, indicating better deployment efficiency. Overall, these results demonstrate that the proposed framework achieves a favorable balance between effectiveness and efficiency, enabling high-precision segmentation without excessive computational burden.
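For readers who wish to reproduce runtime figures of this kind, the sketch below shows one way latency and FPS might be measured for an arbitrary inference callable, together with a simple gain-per-cost ratio. It is a generic, CPU-side illustration using only the standard library. The helper names, warm-up count, and stand-in workload are assumptions, and the paper's actual measurement protocol is not described in this section.

```python
import time
from typing import Callable


def measure_latency(fn: Callable[[], object], n_warmup: int = 3, n_runs: int = 20) -> tuple[float, float]:
    """Return (mean latency in seconds, frames per second) for repeated calls to fn."""
    for _ in range(n_warmup):  # warm-up calls are excluded from timing
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    elapsed = time.perf_counter() - start
    latency = elapsed / n_runs
    fps = 1.0 / latency if latency > 0 else float("inf")
    return latency, fps


def gain_per_gflop(metric_gain: float, gflops: float) -> float:
    """Metric improvement (e.g., Dice points over a baseline) per GFLOP of compute."""
    if gflops <= 0:
        raise ValueError("gflops must be positive")
    return metric_gain / gflops


# Stand-in workload; a real benchmark would call the model's forward pass here.
latency, fps = measure_latency(lambda: sum(i * i for i in range(10_000)))
```

When timing GPU inference, the device should be synchronized before each clock read, since asynchronous kernel launches otherwise make the measured latency appear lower than it is.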
4.9. Limitations
Despite its strong performance, the proposed SVP-CVAE framework has several limitations. First, as a vision–language model, it is sensitive to text quality; vague or incomplete clinical prompts can lead to suboptimal semantic alignment and degrade segmentation accuracy. Second, despite using parameter-efficient fine-tuning (Adapter), integrating a text encoder and CVAE sampling introduces higher computational overhead compared to lightweight pure-vision models, potentially limiting deployment on resource-constrained edge devices. Finally, our current implementation processes volumetric CT data slice-by-slice. Extending the framework to native 3D architectures could better exploit continuous volumetric symmetry to further enhance segmentation consistency.
5. Conclusions
In this work, we presented SVP-CVAE, a novel vision–language framework for medical image segmentation that synergistically integrates textual semantics with visual features while explicitly perceiving morphological symmetry and structural variability. By leveraging a conditional variational autoencoder coupled with attribute-level multimodal alignment, the proposed method effectively resolves the inherent one-to-many mapping between clinical descriptions and visual realizations. Extensive evaluations on chest X-ray and CT datasets demonstrate that SVP-CVAE achieves state-of-the-art performance in both regional overlap and boundary fidelity. Our results underscore that the integration of hierarchical textual features, symmetry-aware representation modules, and probabilistic latent modeling consistently enhances segmentation reliability, especially in cases of subtle or anatomically ambiguous lesions. The framework facilitates the generation of semantically consistent predictions, offering a more robust representation of the variability inherent in realistic clinical scenarios. Importantly, the framework demonstrates strong generalization capabilities across diverse anatomical symmetry profiles, successfully delineating both highly symmetric structures (e.g., brain tumors) and inherently asymmetric organs (e.g., liver tumors) by leveraging global contextual cues.
Moving forward, several promising directions remain for future investigation. First, to mitigate the reliance on high-quality manual text prompts, future work will explore integrating large language models (LLMs) for automated prompt generation and semantic refinement. Second, developing lightweight or distilled variants of the probabilistic framework will be a valuable step toward facilitating real-time deployment on resource-constrained clinical edge devices. Third, while our current cross-bilateral mechanism successfully captures macro-level features, extending the symmetry perception to encompass local symmetry, rotational symmetry, and topological modeling will further enhance the delineation of complex irregular lesions in naturally non-symmetrical organs. Finally, transitioning from slice-by-slice processing to native 3D architectures represents a critical next step, which would allow the model to fully exploit continuous volumetric symmetry and spatial context.