Perceiving Symmetry and Variability: A Probabilistic Vision–Language Framework for Medical Image Segmentation

Jiang, Jiu; Zhou, Qi; He, Chu

doi:10.3390/sym18050859

by Jiu Jiang, Qi Zhou and Chu He *

Electronic Information School, Wuhan University, Wuhan 430072, China

* Author to whom correspondence should be addressed.

Symmetry 2026, 18(5), 859; https://doi.org/10.3390/sym18050859 (registering DOI)

Submission received: 11 April 2026 / Revised: 10 May 2026 / Accepted: 16 May 2026 / Published: 19 May 2026

(This article belongs to the Special Issue Hybrid Deep Learning and Explainable AI for Symmetry-Aware and Multiscale Medical Image Analysis)

Abstract

Medical image segmentation is challenging due to subtle pathological patterns and the inherent ambiguity of clinical descriptions. Although vision–language models have shown promise, they frequently lack fine-grained perception of structural variability. To address these limitations, we propose the Symmetry- and Variability-Perceiving Conditional Variational Autoencoder (SVP-CVAE). The proposed method integrates a clinical attribute encoder with a morphology-aware enhancement module that incorporates a cross-bilateral symmetry mechanism to explicitly capture symmetry-related variations. By reformulating the segmentation task as a probabilistic prior-to-posterior inference process, SVP-CVAE models the one-to-many mapping between textual attributes and visual realizations. Furthermore, we introduce an attribute-latent contrastive objective to ensure that the latent space encodes discriminative morphological information. Extensive experiments demonstrate that the proposed framework achieves superior segmentation accuracy compared to state-of-the-art methods. Results indicate that SVP-CVAE effectively captures diverse yet anatomically plausible structural variations while maintaining high sensitivity to bilateral symmetry. Comprehensive ablation studies confirm that the performance gains are synergistically driven by the proposed symmetry-perceiving module and the contrastive semantic alignment objective, rather than relying solely on the probabilistic formulation. In conclusion, integrating explicit symmetry perception with probabilistic modeling significantly enhances the reliability and interpretability of multimodal medical image segmentation in complex clinical scenarios.

Keywords: vision–language models; medical image segmentation; language-guided segmentation; feature fusion

1. Introduction

Medical image segmentation constitutes a core problem in medical image analysis and computer vision, with the objective of delineating anatomical structures or pathological regions in a precise and automated manner [1,2,3]. Reliable segmentation is indispensable for a wide range of clinical applications, including disease diagnosis [4,5,6], treatment planning [7,8,9], and longitudinal assessment of disease progression [10,11]. Despite substantial advances brought by deep learning, most existing approaches predominantly depend on visual features and densely annotated masks, while largely neglecting the complementary semantic information contained in medical textual descriptions such as radiology reports and diagnostic notes [12,13,14].

These textual descriptions encode clinically meaningful attributes, including morphological characteristics, macro-level anatomical bilateral symmetry (e.g., the symmetric nature of lungs and brain hemispheres), boundary regularity, internal texture, and spatial distribution. Such information is particularly valuable for distinguishing subtle pathological patterns and for constraining segmentation toward anatomically plausible structures. However, conventional segmentation models rarely establish explicit correspondence between these semantic attributes and visual representations. Recent developments in vision–language models (VLMs) provide a potential pathway for multimodal integration through contrastive learning [15,16]. Nevertheless, most VLM-based approaches rely on global alignment objectives that capture coarse semantic consistency, while lacking the ability to associate fine-grained textual attributes with localized image regions. This limitation is critical in medical imaging, where clinically relevant descriptions often emphasize subtle morphological variations, irregular boundaries, or asymmetric structural patterns. In addition, global alignment strategies do not sufficiently support segmentation tasks that require spatially precise and structurally coherent predictions, particularly when anatomical symmetry and morphology serve as key diagnostic cues. Consequently, existing vision–language frameworks remain insufficient in jointly modeling semantic fidelity and structural variability in clinical data.

Another fundamental challenge lies in the inherent ambiguity of medical text descriptions. In realistic clinical settings, a single description may correspond to multiple plausible visual realizations, as illustrated in Figure 1. For example, a report describing “unilateral pulmonary infection, a single infected area, middle left lung” may correspond to lesions with substantially different sizes, shapes, boundary definitions, and degrees of bilateral asymmetry across patients. This phenomenon induces a one-to-many mapping between textual descriptions and segmentation outcomes. Most existing vision–language segmentation methods implicitly assume a deterministic correspondence between image–text pairs and segmentation masks, which does not reflect the variability of pathological morphology observed in practice. As a result, such methods tend to produce rigid predictions that fail to capture the diversity and uncertainty inherent in clinical data.

To overcome these limitations, we propose the Symmetry- and Variability-Perceiving Conditional Variational Autoencoder (SVP-CVAE). While the probabilistic prior-to-posterior inference is a standard mathematical formulation of CVAEs [17,18], our core innovation lies in leveraging this framework to explicitly formulate and resolve the domain-specific “one-to-many” ambiguity inherent in clinical medical text descriptions. To ensure that the diverse outcomes generated by the CVAE are clinically valid, we uniquely constrain the stochastic latent space by integrating explicit morphological symmetry perception and fine-grained semantic alignment. Structured representations of medical text, attribute-level multimodal alignment, and latent-variable modeling are integrated within a unified probabilistic formulation. By introducing latent variables conditioned on both visual features and textual attributes, the framework captures diverse yet semantically consistent segmentation outcomes corresponding to the same description.

Specifically, a low-rank adapted [19] language model is employed to encode medical text into multiscale representations, which serve as a semantic interface for interaction with visual features. Based on these representations, an attribute-level vision–language alignment mechanism is designed to associate image features with individual textual attributes, enabling precise grounding of morphology-related cues. This design facilitates segmentation guided not only by global descriptions but also by localized structural and symmetric patterns. To account for intrinsic variability in lesion morphology, a CVAE formulation is incorporated. The latent variable z models the distribution of plausible segmentation outcomes conditioned on both image and text. Sampling from this latent space produces diverse predictions that remain consistent with specified semantic attributes, thereby addressing the one-to-many mapping between text and segmentation masks. Furthermore, an attribute-latent contrastive objective is introduced to enforce semantic consistency between latent representations and textual attributes, encouraging z to encode morphology-aware and discriminative information. This leads to a more interpretable latent space that aligns with clinically meaningful structural descriptions. While horizontal flipping serves as a straightforward macro-level proxy for symmetry perception, it is inherently most effective for naturally bilateral structures such as the lungs and brain. However, evaluating its behavior on inherently asymmetric organs (e.g., the liver) is equally critical. In this work, we demonstrate that by coupling this macro-level structural proxy with a probabilistic latent distribution, our framework can robustly process both highly symmetric anatomies and geometrically asymmetric organs by leveraging global contextual symmetry.

Extensive experiments demonstrate that the proposed framework improves the alignment between textual semantics and segmentation outputs, resulting in more accurate and structurally consistent predictions. In addition, the latent-variable formulation enables the generation of diverse yet plausible segmentation results, better reflecting the variability observed in real-world clinical scenarios.

The main contributions of this work can be summarized as follows:

(1): A Symmetry- and Variability-Perceiving Conditional Variational Autoencoder (SVP-CVAE) is proposed. Our key methodological novelty is adapting the standard CVAE framework to specifically address the “one-to-many” ambiguity in medical vision–language segmentation by injecting symmetry-aware and morphology-guided priors into the latent representation.
(2): A morphology-aware enhancement module is introduced, incorporating a spatial–semantic grounding module (SSGM) and a cross-bilateral symmetry module (CBSM) to capture localized structural cues and symmetry-related variations.
(3): A clinical attribute text encoder (CATE) with low-rank adaptation-based fine-tuning is developed to extract hierarchical representations, enabling fine-grained grounding through a morphology–text fusion (MTF) module.
(4): To prevent the CVAE from generating anatomically implausible variations during the one-to-many mapping, we introduce a Semantic Alignment Loss (SAL) as an attribute-latent contrastive objective. This explicitly regularizes the standard latent space, ensuring that sampled variables preserve discriminative and morphology-aware semantic consistency.

2. Related Work

2.1. Deep Learning Segmentation of Medical Images

Early research primarily focused on fully convolutional networks [20] and their variants [21]. Among these, U-Net [22] has become a foundational architecture owing to its symmetric encoder–decoder design and skip connections, which facilitate the preservation of spatial details. UNet++ [23] further improved feature fusion through the introduction of nested and densely connected skip pathways, thereby enhancing the representation capability across multiple semantic scales. nnU-Net [24] proposes a self-configuring segmentation framework that automatically adapts preprocessing, network architecture, training, and post-processing to new datasets, establishing a strong and widely adopted baseline across diverse biomedical segmentation tasks. With the emergence of Transformer architectures, TransUNet [25] integrates the local feature extraction strengths of convolutional neural networks with the ability of Transformers to capture long-range contextual dependencies, which improves segmentation accuracy for organs with complex morphology. Swin-Unet [26] proposes a pure Transformer-based U-shaped encoder–decoder architecture that leverages a hierarchical Swin Transformer with shifted windows to capture both local and long-range dependencies, achieving superior performance over CNN-based and hybrid methods in medical image segmentation. Despite the robust performance of these models in general segmentation tasks, they rely heavily on precise pixel-level annotations and largely underutilize the semantic information contained in accompanying clinical text.

2.2. Vision–Language Models for Medical Images

To enhance the robustness of models, recent studies have explored the application of vision–language pretraining in medical image analysis [27,28]. The central objective is to leverage textual descriptions associated with medical images as high-level semantic guidance. The CLIP model establishes a link between visual features and textual semantics through contrastive learning [29]. Inspired by this paradigm, Anatomy-VLM [30] introduces a fine-grained vision–language model that incorporates anatomical-aware and multi-scale feature alignment to enhance clinically interpretable disease understanding and support downstream segmentation tasks. The segment anything model (SAM) [31] and the medical adaptations of this model, including MedCLIP-SAMv2 [32], which integrates vision–language models with SAM to enable text-driven medical image segmentation, support zero-shot and weakly supervised settings for improved data efficiency and generalization. SAM2 [33] systematically evaluates SAM for 2D and 3D medical image segmentation and provides practical insights into effective prompt strategies for improving performance in volumetric settings. By introducing prompts, these models utilize spatial relations and tissue characteristics encoded in language to guide the segmentation process. However, medical image analysis based on vision–language models primarily focuses on global image–text correspondence in tasks such as classification and retrieval, rendering it difficult to establish fine-grained alignment between visual regions and textual semantics.

2.3. Vision–Language Models for COVID-19 Segmentation

To mitigate the scarcity of pixel-level annotations for COVID-19 infection segmentation, recent studies have introduced textual features into the segmentation pipeline. LViT [34] incorporates medical textual descriptions as auxiliary signals and utilizes a language–vision loss in conjunction with an iterative pseudo-labeling mechanism, which effectively compensates for the absence of visual information. The LGMS [35] framework demonstrates the flexibility of textual prompts in improving the accuracy of infection region segmentation, especially under conditions of constrained training data. To further enhance the efficiency of multimodal feature fusion, FMISeg [36] proposes a frequency-domain interaction module that suppresses irrelevant noise under textual guidance, whereas Dual-LVT [37] utilizes clinical report summaries generated by LLMs to enable language features to produce coarse segmentation maps that guide subsequent refinement. Although these approaches achieve notable performance gains, the internal mechanisms of these methods still present evident limitations. As discussed in RecLMIS [38], most existing vision–language segmentation frameworks adopt implicit and ambiguous alignment strategies, which potentially lead to inconsistencies between the segmentation outputs and textual semantics. More importantly, current vision–language methods for COVID-19 segmentation largely remain at a shallow stage of feature-level fusion. Although textual embeddings are utilized to modulate visual features as adaptive weights or masks, the absence of detailed modeling regarding medical knowledge renders the interaction process a predominantly black-box mapping. Consequently, while models can improve quantitative metrics through textual assistance, they fail to provide interpretable evidence regarding how clinical terminology is precisely grounded in fine-grained lesion regions. This deficiency in deep semantic reasoning limits both robustness and clinical reliability when handling infection areas with complex morphological variations.

3. Method

To address the coarse feature alignment and suboptimal interpretability of existing vision–language models in medical image segmentation, we propose the SVP-CVAE framework. As shown in Figure 2, built upon an encoder–decoder backbone, our model incorporates a parameter-efficient language adapter module (LAM) within the pre-trained text encoder to capture multi-level morphological attributes, such as boundary sharpness and bilateral asymmetry. Notably, by integrating a text-conditioned latent distribution at the high-level semantic bottleneck, we reformulate the segmentation task as a prior-to-posterior inference process. This stochastic modeling approach effectively mitigates the “one-to-many” mapping challenge inherent in matching clinical descriptions with diverse pathological geometries.

3.1. Morphology-Aware Representation Learning

The accuracy of medical image segmentation is closely related to the ability to characterize fine-grained morphological properties of pathological regions, including boundary clarity, structural irregularity, and bilateral symmetry deviation. Conventional vision–language segmentation frameworks typically adopt generic convolutional or attention-based feature refinement strategies, which do not explicitly encode clinically meaningful structural priors. As a result, morphology-related information, particularly symmetry-aware cues, is insufficiently preserved in deep feature representations. To address this issue, a morphology-aware feature enhancement module is introduced to explicitly encode structural patterns and symmetry variations at the semantic bottleneck stage.

Given the visual token sequence $X \in \mathbb{R}^{N \times C}$ obtained from the vision encoder, where $N = H \times W$ denotes the number of spatial tokens and $C$ represents the channel dimension, the sequence is reshaped into a two-dimensional feature map $F \in \mathbb{R}^{C \times H \times W}$ to facilitate spatially structured processing.

To capture localized structural variations that are critical for accurate boundary delineation, a spatial-aware structural response mechanism is first employed. A depthwise convolution operator is applied to extract channel-wise structural responses:

$$F_{spatial} = D_{3 \times 3}(F), \tag{1}$$

where $D_{3 \times 3}(\cdot)$ denotes a depthwise convolution with a kernel size of $3 \times 3$. This operation emphasizes high-frequency patterns associated with lesion contours and boundary transitions, while maintaining channel-specific semantic consistency.

In addition to local structure modeling, explicit encoding of bilateral symmetry is introduced to capture morphology-related deviations. Many pathological regions exhibit asymmetric patterns with respect to anatomical axes, which serve as important diagnostic indicators. This design is primarily motivated by the global anatomical bilateral symmetry inherently present in organs such as the lungs and the human brain. To model such characteristics, a bilateral structural inconsistency representation is constructed by computing the absolute difference between the feature map and its horizontally flipped counterpart:

$$F_{sym} = |F - \mathrm{Flip}(F)|, \tag{2}$$

where $\mathrm{Flip}(\cdot)$ denotes a horizontal flip, i.e., a left-right reflection of the feature map about its vertical midline. This operation highlights regions that deviate from expected symmetric structures. Although this simple geometric transformation is predominantly tailored for bilateral structures, it also provides valuable global contextual anchoring for inherently asymmetric organs (e.g., the liver) when processed in axial cross-sections. To further aggregate long-range asymmetric patterns, an anisotropic depthwise convolution is applied:

$$F_{asym} = D_{1 \times 7}(F_{sym}), \tag{3}$$

where $D_{1 \times 7}(\cdot)$ denotes a depthwise convolution with a $1 \times 7$ kernel, which captures lateral structural discrepancies and enhances sensitivity to asymmetric morphology.

Formally, Equations (2) and (3) provide a mathematical quantification of bilateral symmetry. Let $F(x, y)$ denote the spatial feature activation at coordinate $(x, y)$. The operation in Equation (2) calculates the absolute residual between the original feature map and its mirrored counterpart about the vertical anatomical axis; taking $x \in \{1, \ldots, W\}$ as the horizontal index, this reads $F_{sym}(x, y) = |F(x, y) - F(W + 1 - x, y)|$. Consequently, $F_{sym}$ explicitly quantifies the degree and spatial location of structural asymmetry. The subsequent anisotropic convolution in Equation (3) further aggregates these lateral discrepancies into $F_{asym}$, establishing a formal structural prior that represents the quantified symmetry deviations of the pathological regions.

Finally, the original semantic representation, the spatial structural response, and the symmetry-aware representation are integrated through a morphology-aware fusion mechanism. These feature maps are concatenated along the channel dimension and projected using a pointwise transformation:

$$F_{morph} = P([F, F_{spatial}, F_{asym}]), \tag{4}$$

where $[\cdot]$ denotes channel-wise concatenation and $P(\cdot)$ represents a $1 \times 1$ convolution. This fusion process enables adaptive integration of global semantic context with morphology-sensitive and symmetry-aware cues, producing a structurally informed representation.

The resulting feature map is reshaped back into a token sequence $X_{morph} \in \mathbb{R}^{N \times C}$, which provides enhanced morphological and symmetry-aware information for subsequent vision–language alignment and probabilistic segmentation modeling.
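
As a rough illustration of Equations (2) and (3), the following pure-Python sketch computes the mirror residual for a single channel and then aggregates it along the width. It assumes that Flip reverses the width dimension and substitutes a fixed averaging kernel for the learned $1 \times 7$ depthwise filter; the function names and the toy input are hypothetical and are not taken from the authors' implementation.

```python
def flip_lr(fmap):
    """Mirror a 2-D map left-to-right by reversing every row."""
    return [row[::-1] for row in fmap]


def symmetry_residual(fmap):
    """Eq. (2): element-wise |F - Flip(F)| for one channel."""
    mirrored = flip_lr(fmap)
    return [[abs(a - b) for a, b in zip(row, mrow)]
            for row, mrow in zip(fmap, mirrored)]


def lateral_aggregate(fmap, kernel):
    """Stand-in for Eq. (3): a 1 x k filter slid along the width, zero padded.

    `kernel` is a fixed list here (assumption); in the paper it is a
    learned depthwise filter applied per channel.
    """
    pad = len(kernel) // 2
    out = []
    for row in fmap:
        padded = [0.0] * pad + list(row) + [0.0] * pad
        out.append([sum(w * padded[x + i] for i, w in enumerate(kernel))
                    for x in range(len(row))])
    return out


# One channel of size 2 x 6: a mirror-symmetric row and an off-center row.
channel = [
    [1.0, 2.0, 3.0, 3.0, 2.0, 1.0],
    [0.0, 0.0, 5.0, 0.0, 0.0, 0.0],
]
f_sym = symmetry_residual(channel)
f_asym = lateral_aggregate(f_sym, [1.0 / 7] * 7)
```

In this toy input, the symmetric first row yields a zero residual, while the off-center activation in the second row produces a response both at the activation and at its mirror position, which the wide horizontal kernel then spreads laterally.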

3.2. Attribute-Conditioned Cross-Modal Token Alignment

Although morphology-aware representation learning provides structurally enriched visual features, accurate vision–language segmentation further requires precise and structured semantic guidance derived from clinical text. In clinical practice, diagnostic reports describe lesions through attribute-specific cues, including morphology, boundary properties, spatial distribution, and symmetry-related deviations. These cues are inherently multi-granular and often correspond to localized visual patterns rather than global semantic concepts. Therefore, effective alignment demands explicit modeling of the hierarchical relationship between language representations and morphology-aware visual features.

To capture the multi-level semantics embedded in clinical descriptions, the CATE is introduced. A pre-trained Transformer backbone (CXR-BERT [39]) is adopted and adapted to the medical domain using Low-Rank Adaptation. The backbone parameters are fixed, while trainable low-rank matrices are injected into the query, key, value, and feed-forward projections of the last $T$ layers (with $T = 4$), enabling efficient domain adaptation without disrupting the original language structure.
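
For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the usual form of a LoRA-augmented linear map, $y = W x + (\alpha / r) B A x$, with the pre-trained weight $W$ kept frozen. The rank $r$, the scaling factor `alpha`, and all shapes and values are illustrative assumptions, since this section does not report them.

```python
def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def lora_linear(x, w_frozen, lora_a, lora_b, alpha):
    """Frozen projection plus a trainable low-rank update.

    w_frozen: d_out x d_in (kept fixed), lora_a: r x d_in, lora_b: d_out x r.
    """
    rank = len(lora_a)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # stands in for a pre-trained weight
a = [[1.0, 1.0, 1.0]]          # rank r = 1
b_zero = [[0.0], [0.0]]        # zero-initialised B: output equals the frozen path
b_trained = [[1.0], [2.0]]     # hypothetical values after fine-tuning

y_init = lora_linear(x, w, a, b_zero, alpha=2.0)
y_adapted = lora_linear(x, w, a, b_trained, alpha=2.0)
```

With $B$ initialised to zero the adapted layer reproduces the frozen backbone exactly, which is one reason this form of adaptation does not disturb the pre-trained language representation at the start of training.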

To establish hierarchical semantic representations, hidden states from different depths of the Transformer are extracted, denoted as $\{h_{L-6}, h_{L-3}, h_{L-1}\}$. These layers correspond to distinct semantic levels, including low-level morphological descriptions, intermediate regional and spatial context, and high-level diagnostic reasoning. Notably, the low-level representations capture fine-grained attributes such as shape irregularity, boundary sharpness, and symmetry-related patterns, which are closely aligned with morphology-aware visual features. These hidden states are projected into a shared embedding space:

$$T_{local}, T_{region}, T_{global} = \phi_{l}(h_{L-6}), \phi_{m}(h_{L-3}), \phi_{h}(h_{L-1}), \tag{5}$$

where $\phi_{l}(\cdot)$, $\phi_{m}(\cdot)$, and $\phi_{h}(\cdot)$ denote layer-specific linear projections.

To reduce the semantic discrepancy between visual representations and textual descriptions, a morphology–text fusion (MTF) module is introduced at the semantic bottleneck. This module aligns the morphology-enhanced visual tokens $F_{morph} \in \mathbb{R}^{N \times C}$ (the token-sequence form denoted $X_{morph}$ in Section 3.1) with the low-level textual embeddings $T_{local} \in \mathbb{R}^{M \times C}$, where $M$ is the number of text tokens; these embeddings encode morphology- and symmetry-related attributes.

The MTF module adopts a cross-attention mechanism in which visual tokens act as queries to retrieve relevant structural priors from textual representations. Let $W_{q}, W_{k}, W_{v} \in \mathbb{R}^{C \times C}$ denote learnable projection matrices. The interaction is formulated as

$$Q = F_{morph} W_{q}, \quad K = T_{local} W_{k}, \quad V = T_{local} W_{v}. \tag{6}$$

The attention weights, which quantify the relevance between each spatial location in the image and each morphology-aware textual token, are computed as

$$\mathrm{Attn}(Q, K) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right), \tag{7}$$

where $d$ is the dimension of the query and key vectors. The grounded visual representation is then obtained through residual aggregation followed by linear projection:

$$\hat{F}_{morph} = \mathrm{Linear}(\mathrm{Attn}(Q, K) V + F_{morph}). \tag{8}$$

Through this asymmetric cross-modal alignment, the model explicitly associates morphology-aware visual features with corresponding textual attributes, particularly those related to structural variation and symmetry deviation. The low-level language representations guide the localization of fine-grained patterns, while higher-level representations provide contextual and diagnostic consistency. As a result, the learned representation encodes both detailed morphological priors and hierarchical semantic information, forming a text-conditioned and symmetry-aware feature space that supports robust and interpretable probabilistic segmentation.
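
The cross-attention step in Equations (6) to (8) can be sketched in a few lines of pure Python. The sketch below is a single-head version that omits the final Linear layer and uses identity projection matrices purely for illustration; the helper names and toy tokens are assumptions rather than the authors' code.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, text, w_q, w_k, w_v):
    """Visual tokens (N x C) query text tokens (M x C); returns weights and fused tokens."""
    q = matmul(visual, w_q)                      # N x C
    k = matmul(text, w_k)                        # M x C
    v = matmul(text, w_v)                        # M x C
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]         # C x M
    scores = matmul(q, k_t)                      # N x M
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(attn, v)                    # N x C
    # Residual aggregation of Eq. (8), before the Linear projection.
    fused = [[c + f for c, f in zip(crow, frow)]
             for crow, frow in zip(context, visual)]
    return attn, fused


identity = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[10.0, 0.0], [0.0, 10.0], [1.0, 1.0]]   # N = 3, C = 2
text_tokens = [[1.0, 0.0], [0.0, 1.0]]                   # M = 2, C = 2
attn, fused = cross_attention(visual_tokens, text_tokens,
                              identity, identity, identity)
```

Each row of `attn` is a distribution over the $M$ text tokens, so a visual token that resembles one textual attribute places most of its weight on that attribute, while an ambiguous token (the third one here) splits its weight evenly.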

3.3. Conditional Latent Morphology Distribution Modeling

Medical image segmentation conditioned on textual descriptions inherently involves uncertainty, as similar diagnostic descriptions may correspond to substantially different lesion geometries. To model this one-to-many mapping, we introduce a conditional latent variable framework (shown in Figure 3) that learns a distribution over morphology-consistent segmentation outcomes. We first summarize global visual and textual semantics by token-wise average pooling:

$$f_{v} = \frac{1}{N} \sum_{i=1}^{N} \hat{F}_{morph, i}, \tag{9}$$

$$f_{t} = \frac{1}{M} \sum_{j=1}^{M} T_{local, j}, \tag{10}$$

where the sums run over the $N$ visual tokens and the $M$ text tokens, respectively.

The conditional prior distribution of the latent morphology variable $z \in \mathbb{R}^{d}$ is defined as

$$p_{\theta}(z \mid I, T) = \mathcal{N}(\mu_{p}, \sigma_{p}^{2}), \tag{11}$$

with parameters predicted by $\phi_{p}([f_{v}, f_{t}])$. During training, geometric information from the ground-truth mask $Y$ is incorporated to construct a posterior distribution:

$$q_{\psi}(z \mid I, T, Y) = \mathcal{N}(\mu_{q}, \sigma_{q}^{2}). \tag{12}$$

The mask is downsampled and flattened into $f_{y}$, and the posterior parameters are obtained by

$$\phi_{q}([f_{v}, f_{t}, f_{y}]). \tag{13}$$

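Latent samples are then generated using the reparameterization trick:

$$z = \mu + \sigma \odot \epsilon, \tag{14}$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $(\mu, \sigma)$ are the parameters of the distribution being sampled (the posterior of Equation (12) during training and, as is standard for CVAEs, the prior of Equation (11) at inference). Rather than injecting latent variables at the decoder stage, we incorporate $z$ directly into the semantic bottleneck. The sampled latent vector is projected to the visual embedding space as $f_{z} = W_{z}(z)$ and applied through residual modulation:

$$X_{latent} = \hat{F}_{morph} + f_{z}. \tag{15}$$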
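
A minimal sketch of Equations (14) and (15) follows. It assumes that the network predicts a log-variance rather than $\sigma$ directly (a common choice that the text does not confirm) and that the projected latent $f_{z}$ is broadcast over all $N$ tokens; names and numbers are illustrative only.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Eq. (14): z = mu + sigma * eps, with eps drawn from a standard normal."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def modulate(tokens, z, w_z):
    """Eq. (15): add the projected latent f_z = W_z z to every token."""
    f_z = [sum(w * zi for w, zi in zip(row, z)) for row in w_z]   # length C
    return [[t + f for t, f in zip(token, f_z)] for token in tokens]


rng = random.Random(0)
mu, log_var = [0.0, 0.0], [0.0, 0.0]

# Repeated sampling under the same condition gives different latents,
# which is what yields several plausible masks for one description.
samples = [reparameterize(mu, log_var, rng) for _ in range(3)]

tokens = [[0.0, 0.0], [1.0, 1.0]]          # N = 2 tokens, C = 2 channels
w_z = [[1.0, 0.0], [0.0, 1.0]]             # hypothetical projection (identity)
x_latent = modulate(tokens, [1.0, 2.0], w_z)
```

Because the noise enters only through $\epsilon$, gradients can flow to $\mu$ and $\sigma$, and because $f_{z}$ is added as a residual, every token is shifted by the same latent offset.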

To further integrate text and semantic features, we propose a hierarchical progressive decoder that bridges the gap between deep abstract features and shallow spatial cues by injecting multi-granular textual semantics at each resolution scale.

The decoding process is organized into three stages, each corresponding to a specific tier of the clinical hierarchy. At the first upsampling stage, the latent-modulated features $X_{latent}$ are fused with the $1/16$-scale backbone features $V_{16}$. To ground the decoder in anatomical space, we inject the regional text embeddings $T_{region}$. These embeddings provide context regarding the specific anatomical zones described in the report, helping the decoder disambiguate the lesion’s location:

$$F_{16} = \mathrm{Decoder}_{16}(X_{latent}, V_{16}, T_{region}). \tag{16}$$

As the decoder approaches the final resolution, the focus shifts from anatomical localization to precise lesion delineation consistent with the clinical text. We utilize the high-level diagnostic embeddings $T_{global}$ to condition the final two stages, where $V_{8}$ and $V_{4}$ denote the $1/8$- and $1/4$-scale backbone features. This ensures that the generated boundaries are driven by the text prompts:

$$F_{8} = \mathrm{Decoder}_{8}(F_{16}, V_{8}, T_{global}), \tag{17}$$

$$F_{4} = \mathrm{Decoder}_{4}(F_{8}, V_{4}, T_{global}). \tag{18}$$

Finally, the output features are upsampled to the original image resolution to generate the predictive segmentation masks.

3.4. Loss Function

To further regularize the latent space and ensure that the visual features are semantically grounded in the linguistic priors, we introduce the Semantic Alignment Loss $\mathcal{L}_{SAL}$ based on instance-level contrastive learning. This objective encourages the model to maximize the feature similarity between paired image–text samples while minimizing it for non-paired instances within a mini-batch.

Given the global visual descriptor $f_{v} \in \mathbb{R}^{B \times C}$ and the morphological textual descriptor $f_{t} \in \mathbb{R}^{B \times C}$, where $B$ is the mini-batch size, we first project them onto a unit hypersphere by $\ell_{2}$-normalization:

$$\hat{f}_{v, i} = \frac{f_{v, i}}{\| f_{v, i} \|_{2}}, \quad \hat{f}_{t, j} = \frac{f_{t, j}}{\| f_{t, j} \|_{2}}. \tag{19}$$

We then construct a cosine similarity matrix $S \in \mathbb{R}^{B \times B}$, where each element $s_{i, j} = \hat{f}_{v, i} \cdot \hat{f}_{t, j}^{\top}$ represents the alignment score between the $i$-th visual sample and the $j$-th textual description. The alignment task is formulated as a multi-class classification problem where the model identifies the corresponding textual partner for each image instance. The loss is defined using the cross-entropy over the similarity logits:

$$\mathcal{L}_{SAL} = - \frac{1}{B} \sum_{i = 1}^{B} \log \frac{\exp(s_{i, i} / \tau)}{\sum_{j = 1}^{B} \exp(s_{i, j} / \tau)}, \tag{20}$$

where $\tau$ is a temperature hyper-parameter (fixed to 1.0 in our implementation).

By optimizing $\mathcal{L}_{SAL}$, the semantic bottleneck is forced to aggregate morphology-relevant visual cues that are explicitly described in the clinical reports. This alignment acts as a semantic anchor, preventing the CVAE from generating visually plausible but clinically irrelevant variations in the latent space $Z$.
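
Equations (19) and (20) correspond to an image-to-text contrastive loss over a mini-batch, which the following pure-Python sketch reproduces for a toy batch of two pairs. The function names and the example descriptors are hypothetical; only the normalization, the cosine-similarity logits, and the cross-entropy over each row follow the text.

```python
import math


def l2_normalize(vec):
    """Eq. (19): scale a descriptor to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def semantic_alignment_loss(f_v, f_t, tau=1.0):
    """Eq. (20): cross-entropy over cosine-similarity logits, diagonal as target."""
    v_hat = [l2_normalize(v) for v in f_v]
    t_hat = [l2_normalize(t) for t in f_t]
    total = 0.0
    for i, v in enumerate(v_hat):
        logits = [sum(a * b for a, b in zip(v, t)) / tau for t in t_hat]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(v_hat)


# Toy batch (B = 2, C = 2): matched pairs versus deliberately swapped texts.
visual = [[2.0, 0.0], [0.0, 3.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = semantic_alignment_loss(visual, text_matched)
loss_swapped = semantic_alignment_loss(visual, text_swapped)
```

The matched batch gives the lower loss, which is the behavior the objective rewards. Note that with $\tau = 1.0$ the logits are bounded by the cosine range $[-1, 1]$, so the loss cannot approach zero even for perfectly aligned pairs.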

During the training phase, the perception of symmetry is strictly enforced through the Semantic Alignment Loss ($\mathcal{L}_{SAL}$). Rather than relying on a standalone heuristic symmetry loss, our framework enforces symmetry awareness via cross-modal contrastive learning. Specifically, the global visual descriptor $f_{v}$ intrinsically aggregates the quantified symmetry representations ($\hat{F}_{morph}$), while the morphological textual descriptor $f_{t}$ encodes symmetry-related clinical descriptions (e.g., “unilateral”, “bilateral asymmetry”, “symmetric infection”). By minimizing the $\mathcal{L}_{SAL}$ defined in Equation (20), the network is explicitly penalized if the mathematically quantified visual asymmetry fails to match the linguistic symmetry attributes. This alignment forces the model’s semantic bottleneck to consistently recognize and preserve symmetry-related variations during the optimization process.

The total objective function $\mathcal{L}_{total}$ is a weighted combination of the segmentation loss $\mathcal{L}_{seg}$, the KL divergence $\mathcal{L}_{kl}$, and the cross-modal alignment loss $\mathcal{L}_{SAL}$:

$$\mathcal{L}_{total} = \mathcal{L}_{seg} + \beta(t) \mathcal{L}_{kl} + \lambda \mathcal{L}_{SAL}, \tag{21}$$

where $\lambda$ denotes the weight for alignment, and $\beta(t)$ is a time-dependent scheduling weight designed to prevent posterior collapse in the CVAE. Specifically, we employ a two-stage curriculum learning strategy for $\beta(t)$ based on the current epoch $t$:

$$\beta(t) = \begin{cases} 0.01, & \text{if } t < 50, \\ \min(0.1, (t - 50) \times 0.01), & \text{if } t \geq 50. \end{cases} \tag{22}$$

This warm-up schedule allows the network to prioritize deterministic segmentation in early epochs before smoothly regularizing the latent morphology distribution.
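
The schedule in Equation (22) and the weighting in Equation (21) are simple enough to evaluate directly, as in the sketch below. The closed-form KL term between two diagonal Gaussians is the standard CVAE expression and is an assumption here, since the text does not spell out the form of the KL loss; `lam` is a placeholder because the value of $\lambda$ is not given in this section.

```python
import math


def beta_schedule(epoch):
    """Eq. (22), transcribed literally."""
    if epoch < 50:
        return 0.01
    return min(0.1, (epoch - 50) * 0.01)


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians (assumed form of the KL term)."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def total_loss(l_seg, l_kl, l_sal, epoch, lam):
    """Eq. (21): segmentation + beta(t) * KL + lambda * alignment."""
    return l_seg + beta_schedule(epoch) * l_kl + lam * l_sal


schedule = {t: beta_schedule(t) for t in (0, 49, 50, 55, 60, 100)}
kl_example = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

Evaluated as written, the weight is 0.01 for epochs 0 to 49, equals 0 at epoch 50, and then rises by 0.01 per epoch until it saturates at 0.1 from epoch 60 onward, so the literal formula briefly dips before ramping up.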

4. Experiments and Results

To comprehensively evaluate the performance of the proposed method on COVID-19 infection segmentation and brain tumor segmentation, the datasets and experimental setup are first described, followed by quantitative and qualitative evaluations. Finally, ablation studies are conducted to analyze the contribution of each component of the model.

4.1. Datasets and Clinical Relevance

4.1.1. QaTa-COV19 Dataset

QaTa-COV19 [40] is a large-scale chest X-ray (CXR) benchmark for pixel-level COVID-19 pneumonia segmentation. Dataset composition: The dataset comprises 9258 COVID-19 CXR images with pixel-wise infection annotations, collected primarily from the BIMCV-COVID19+ repository and complemented with previously curated cohorts. Annotation protocol: Ground-truth masks were generated via a human-in-the-loop pipeline. Initially, manually annotated CXRs were used to train segmentation networks (e.g., U-Net variants). These models then generated candidate masks for new images. Expert radiologists reviewed all candidates, manually correcting or re-annotating inaccurate predictions. Task definition and clinical relevance: This task delineates COVID-19 manifestations (e.g., low-contrast ground-glass opacities and consolidations) in CXRs. Accurate localization improves diagnostic interpretability, minimizes model distraction by irrelevant structures, and enables downstream disease assessment. Crucially, while healthy lungs exhibit macroscopic bilateral symmetry, COVID-19 typically causes asymmetric opacities. The proposed method leverages this structural symmetry and pathology-induced asymmetry as a core inductive bias to localize abnormalities. The dataset encompasses diverse disease presentations (e.g., unilateral, bilateral, multifocal). To facilitate vision–language alignment, each image is paired with structured, radiology-style clinical texts. Following standard protocols [34,38], the data are split into 5716 training, 1429 validation, and 2113 test samples to enable fair standardized comparisons.

4.1.2. MosMedData+ Dataset

MosMedData+ [41,42] is a chest CT-based COVID-19 infection segmentation dataset comprising 2729 axial CT slices with corresponding binary lesion masks. Dataset composition: Derived from the clinical MosMedData cohort, the dataset contains chest CT scans of suspected or confirmed COVID-19 cases. The selected slices capture diverse disease severities, lesion densities, and spatial distributions. Annotation protocol: Pixel-level masks are generated through a clinician-in-the-loop workflow, where initial delineations are refined and validated by expert radiologists. This consistently captures subtle opacities and consolidations, ensuring anatomical correctness for reliable volumetric analysis. Task definition and clinical relevance: This task segments COVID-19 CT abnormalities (ground-glass opacities, consolidations) to quantify infection burden, which is vital for diagnosis, severity stratification, and treatment monitoring. Furthermore, similar to CXRs, axial CTs reveal bilateral lung symmetry; leveraging these symmetric features enables the model to effectively separate asymmetric lesions from normal structures. To facilitate cross-modal learning, structured textual annotations encode clinical semantics (e.g., laterality, distribution, and localization) aligned with radiological conventions [34]. The dataset is split into 2183 training, 273 validation, and 273 test slices for balanced evaluation.

4.1.3. BraTS 2021 Dataset

BraTS 2021 [43] is a multi-parametric MRI-based brain tumor segmentation dataset. For this study, a curated subset comprising 1251 multi-modal brain volumes with corresponding multi-class lesion masks is utilized. Dataset composition: Derived from a multi-institutional clinical cohort, the dataset contains pre-operative baseline mpMRI scans (T1, T1Gd, T2, T2-FLAIR) of pathologically confirmed glioma patients. The selected volumes capture highly heterogeneous image qualities, diverse tumor grades, and intrinsic variations in tumor appearance. Annotation protocol: Voxel-level masks are generated through an AI-assisted, clinician-in-the-loop workflow, where initial automated delineations are iteratively refined and approved by board-certified neuroradiologists. This consistently captures complex glioma sub-regions including the enhancing tumor, necrotic core, and peritumoral edema. Task definition and clinical relevance: Specifically, we formulated the segmentation of the whole tumor by extracting paired axial T2-weighted slices and their corresponding masks. Accurately quantifying this tumor burden is vital for surgical treatment planning, radiotherapy mapping, and disease monitoring. Furthermore, healthy brain MRIs exhibit distinct bilateral hemisphere symmetry. Leveraging these symmetric features enables the model to effectively distinguish asymmetric space-occupying lesions from normal brain anatomy. To facilitate visual–language model segmentation, Qwen3-VL [44] is employed to generate detailed lesion-related descriptions, encoding relevant clinical semantics aligned with radiological conventions. The 1251 samples are split into training, validation, and test sets at a 7:1:2 ratio (875, 125, and 251 samples, respectively) for balanced evaluation.

4.1.4. MSD Liver Dataset

MSD Task03 Liver [45] is an abdominal CT-based liver and tumor segmentation dataset. For this study, a cohort of 131 patient volumes with corresponding multi-class masks is utilized. Dataset composition: Derived from the Medical Segmentation Decathlon, the dataset contains portal venous phase contrast-enhanced CT scans of patients with various liver tumors. The selected scans capture diverse liver shapes. Annotation protocol: Voxel-level masks are generated through a rigorous clinical workflow, where initial delineations of the liver are manually refined and validated by expert radiologists. This consistently captures subtle lesion boundaries, ensuring anatomical correctness for reliable volumetric analysis. Task definition and clinical relevance: This task segments the whole liver to quantify disease burden and organ volume, which is vital for surgical resection planning, oncology monitoring, and treatment evaluation. Furthermore, while the liver is inherently asymmetric, axial abdominal CTs exhibit global bilateral symmetry. Leveraging these contextual symmetric features enables the model to accurately localize the liver from normal surrounding structures. To facilitate visual–language model segmentation, Qwen3-VL [44] is employed to generate detailed lesion-related descriptions, encoding relevant clinical semantics aligned with radiological conventions. The 131 3D patient volumes are split into training, validation, and test sets at a 7:1:2 ratio (92, 13, and 26 samples, respectively). To adapt the data for training, 5 representative axial slices are extracted from each volume, ensuring a balanced and diverse evaluation protocol.

4.2. Experiment Setup

All experiments are implemented in PyTorch 2.0.1 with torchvision 0.15.2 and Python 3.10, and are conducted on a Giga computing server MS03-CE0-000 equipped with an NVIDIA RTX 4090 GPU and an Intel(R) Xeon(R) Platinum 8476C CPU. The operating system is Ubuntu 22.04, and the development software is PyCharm (version 2025.2.4). The AdamW optimizer is adopted, with momentum coefficients $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$. The learning rate is scheduled using a cosine annealing strategy, with a minimum value of 1 × 10⁻⁶. Early stopping is applied based on the validation mean intersection-over-union (mIoU), where training is terminated if no performance improvement is observed for 30 consecutive epochs. Data preprocessing and augmentation are implemented using the MONAI pipeline. During training, images and masks are first loaded and converted into a unified format, followed by random scaling and rotation for geometric augmentation. The inputs are then resized to a fixed resolution and normalized on a per-channel basis. For the QaTa-COV19 dataset, the total training time of the proposed SVP-CVAE framework is approximately 1.66 h for 150 epochs. For validation and testing, stochastic augmentations are removed, and only deterministic operations, including loading, resizing, normalization, and tensor conversion, are retained to ensure fair and reproducible evaluation.
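The schedule and stopping rule described above can be illustrated with a minimal, framework-free sketch. It assumes the standard closed-form cosine annealing curve without restarts. The initial learning rate (`BASE_LR`) is a hypothetical value, since it is not stated in the text, and the validation curve is a toy example rather than a measured one. The names `cosine_annealing_lr` and `EarlyStopping` are illustrative and do not come from the authors' code.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, base_lr, min_lr=1e-6):
    """Closed-form cosine annealing from base_lr (epoch 0) down to min_lr."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class EarlyStopping:
    """Signal a stop once the score has not improved for `patience` epochs."""

    def __init__(self, patience=30):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one validation score; return True if training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


BASE_LR = 1e-4      # hypothetical: the initial learning rate is not reported
TOTAL_EPOCHS = 150  # epoch budget reported for QaTa-COV19

lr_start = cosine_annealing_lr(0, TOTAL_EPOCHS, BASE_LR)
lr_mid = cosine_annealing_lr(TOTAL_EPOCHS // 2, TOTAL_EPOCHS, BASE_LR)
lr_end = cosine_annealing_lr(TOTAL_EPOCHS, TOTAL_EPOCHS, BASE_LR)

# Toy validation curve: mIoU rises from 0.80 to 0.84 over epochs 0-4,
# then stays flat, so every later epoch counts as "no improvement".
stopper = EarlyStopping(patience=30)
stopped_at = None
for epoch in range(TOTAL_EPOCHS):
    val_miou = min(80 + epoch, 84) / 100
    if stopper.step(val_miou):
        stopped_at = epoch
        break
```

With this toy curve, the last improvement occurs at epoch index 4, so the stopper fires at epoch index 34, after 30 consecutive non-improving epochs.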

4.3. Evaluation Metrics

To comprehensively evaluate segmentation performance, four widely used metrics are adopted, including the Dice similarity coefficient (DSC), mIoU, Hausdorff distance (HD), and average surface distance (ASD). These metrics jointly assess both the overlap accuracy and the boundary delineation quality of the predicted segmentation results.

The DSC measures the similarity between the predicted region P and the ground-truth region G, and is defined as

DSC = \frac{2 | P \cap G |}{| P | + | G |},

(23)

where $| \cdot |$ denotes the number of pixels or voxels in the corresponding region. A higher DSC value indicates better overlap consistency between prediction and ground truth.

The mIoU evaluates the intersection-over-union ratio between the predicted and reference regions, which is formulated as

mIoU = \frac{| P \cap G |}{| P \cup G |} .

(24)

Compared with DSC, mIoU imposes a stricter penalty on false positives and false negatives, thereby providing a robust assessment of segmentation accuracy.
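As a small worked example of Equations (23) and (24), the sketch below computes both overlap scores on two toy masks represented as sets of pixel coordinates. The function names and the convention of returning 1.0 for two empty masks are assumptions made for illustration. For a single prediction, the two scores are linked by IoU = DSC / (2 − DSC), which is why the IoU is never larger than the DSC.

```python
def dice(pred, gt):
    """DSC = 2|P ∩ G| / (|P| + |G|) for two sets of pixel coordinates."""
    if not pred and not gt:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & gt) / (len(pred) + len(gt))


def iou(pred, gt):
    """IoU = |P ∩ G| / |P ∪ G| for two sets of pixel coordinates."""
    if not pred and not gt:
        return 1.0  # same assumed convention as in dice()
    return len(pred & gt) / len(pred | gt)


# Two 4x4 squares shifted by one pixel: 16 px each, 9 px overlap, 23 px union.
pred = {(r, c) for r in range(0, 4) for c in range(0, 4)}
gt = {(r, c) for r in range(1, 5) for c in range(1, 5)}

dsc_value = dice(pred, gt)  # 18 / 32
iou_value = iou(pred, gt)   # 9 / 23

# Per-sample identity linking the two overlap metrics.
iou_from_dsc = dsc_value / (2.0 - dsc_value)
```

Here the DSC is 0.5625 while the IoU is about 0.391, which illustrates the stricter penalty that the IoU places on the same errors.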

To further evaluate boundary precision, the Hausdorff distance (HD) is employed to measure the maximum distance between the boundary points of the predicted segmentation and the ground truth:

HD(P, G) = \max \left\{ \sup_{p \in P} \inf_{g \in G} d(p, g), \; \sup_{g \in G} \inf_{p \in P} d(g, p) \right\},

(25)

where $d(\cdot, \cdot)$ denotes the Euclidean distance. Lower HD values indicate better boundary alignment and fewer extreme segmentation errors.
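Equation (25) can be checked on a tiny example. The brute-force sketch below assumes finite, non-empty point sets, so the supremum and infimum reduce to `max` and `min`. It is meant to illustrate the definition, not to serve as an efficient implementation for full-resolution masks, and the helper names are illustrative.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite, non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy boundary points on a line; the far point (4, 0) in `gt_pts` is an outlier.
pred_pts = [(0, 0), (1, 0)]
gt_pts = [(0, 0), (4, 0)]

hd_pred_to_gt = directed_hausdorff(pred_pts, gt_pts)  # nearest-point gaps: 0 and 1
hd_gt_to_pred = directed_hausdorff(gt_pts, pred_pts)  # nearest-point gaps: 0 and 3
hd_value = hausdorff(pred_pts, gt_pts)
```

The two directed distances differ (1.0 versus 3.0), and the symmetric HD takes the larger one, so a single stray boundary point determines the score.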

In addition, the average surface distance (ASD) is adopted to evaluate the average boundary discrepancy between the predicted and reference contours:

ASD(P, G) = \frac{1}{| S_{P} | + | S_{G} |} \left( \sum_{p \in S_{P}} \min_{g \in S_{G}} d(p, g) + \sum_{g \in S_{G}} \min_{p \in S_{P}} d(g, p) \right),

(26)

where $S_{P}$ and $S_{G}$ represent the surface point sets of the predicted and ground-truth regions, respectively. ASD reflects the overall contour consistency and is less sensitive to outlier boundary points compared with HD.
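A brute-force sketch of Equation (26) follows, again assuming finite, non-empty surface point sets. It uses a toy configuration with one outlying boundary point to show how averaging dampens the effect of that point. The function name is illustrative.

```python
import math


def average_surface_distance(surf_p, surf_g):
    """Symmetric mean of nearest-neighbour distances between two point sets."""
    p_to_g = sum(min(math.dist(p, g) for g in surf_g) for p in surf_p)
    g_to_p = sum(min(math.dist(g, p) for p in surf_p) for g in surf_g)
    return (p_to_g + g_to_p) / (len(surf_p) + len(surf_g))


# Toy surfaces: nearest-neighbour distances are 0 and 1 (prediction to
# reference) and 0 and 3 (reference to prediction).
surf_pred = [(0, 0), (1, 0)]
surf_gt = [(0, 0), (4, 0)]

asd_value = average_surface_distance(surf_pred, surf_gt)  # (0 + 1 + 0 + 3) / 4
```

For these points the largest nearest-neighbour distance is 3.0, which is what the HD would report, whereas the ASD is 1.0 because the single outlier is averaged with the well-aligned points.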

Overall, DSC and mIoU mainly evaluate region overlap accuracy, while HD and ASD focus on boundary-level segmentation quality. The combination of these metrics provides a comprehensive and reliable evaluation of segmentation performance.

4.4. Comparison with State-of-the-Art Models

Extensive experiments are conducted to evaluate the proposed vision–language segmentation framework across multiple medical imaging modalities. Due to the extreme scarcity of paired image–text medical data, QaTa-COV19 and MosMedData+ have become standard benchmarks for vision–language segmentation (e.g., LViT [34], RecLMIS [38], FMISeg [36]). We adopt them primarily to ensure fair comparisons with these existing vision–language baselines. To further demonstrate the framework's generalizability beyond these specific domains, evaluations on BraTS 2021 and MSD Liver are detailed in Section 4.5. We benchmark our method against 15 representative state-of-the-art (SOTA) medical image segmentation models. These baselines are categorized into three groups: (1) convolutional and transformer-based architectures (U-Net [22], U-Net++ [23], Swin-UNet [26], nnUNet [24], UNetr [46], TransUnet [25], CFFormer [47]), (2) foundation vision models adapted for medical tasks (SAM Adapter [48]), and (3) recent vision–language multimodal approaches tailored for medical segmentation (CLIPSeg [49], LViT [34], LGMS [35], RecLMIS [38], FMISeg [36], ViTexNet [50], and HiMix [51]). For a fair comparison, all SOTA methods are trained and evaluated using identical data splits and a unified preprocessing pipeline. Moreover, all experiments are conducted under consistent hardware conditions and comparable training budgets to avoid performance bias. This controlled setup enables an objective evaluation of segmentation accuracy and generalization capability.

4.4.1. Experiments on QaTa-COV19 Dataset

Table 1 presents the quantitative segmentation results on the QaTa-COV19 dataset. The proposed method achieves the best overall performance, with a DSC of 91.36% and an mIoU of 84.10%. These results slightly surpass those of FMISeg, which attains a DSC of 91.04% and an mIoU of 83.56%, as well as LGMS, which achieves a DSC of 89.85% and an mIoU of 81.78%. This improvement demonstrates consistent gains in region overlap accuracy. In terms of boundary quality, the method achieves the lowest HD value, reaching 20.71, which is lower than the values of 21.49 obtained by FMISeg and 21.66 obtained by HiMix. This result indicates more precise contour localization and fewer extreme boundary deviations. Although the ASD value is 3.55, which is higher than the value of 2.76 achieved by LViT, it remains competitive and is substantially lower than that of most other approaches. This observation suggests that boundary smoothness is preserved while optimizing overlap metrics.

Furthermore, a clear performance gap between text-guided and non-text methods is observed. Conventional architectures such as nnUNet achieve a DSC of 79.05%, while Swin-UNet achieves a DSC of 77.97%, both showing notably inferior performance. Even strong non-text baselines such as SAM Adapter, which achieves a DSC of 89.12% and an mIoU of 80.57%, are still outperformed by several text-based methods. This trend highlights the effectiveness of incorporating semantic priors from textual information to improve both region-level accuracy and boundary delineation. Overall, the results demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics while maintaining a balanced optimization between overlap fidelity and boundary refinement.

The qualitative comparisons shown in Figure 4 demonstrate that the proposed method produces more accurate and consistent lesion segmentation across diverse cases, with fewer false positives and false negatives than competing approaches. Specifically, nnUNet tends to produce over-segmentation, introducing large false positive regions in multiple cases, whereas LANG and LViT partially mitigate this issue but still exhibit fragmented predictions and missing lesion regions. FMISeg demonstrates improved localization capability; however, it occasionally generates spurious activations and fails to fully capture lesion boundaries in challenging scenarios.

In contrast, the proposed method produces predictions that are more closely aligned with the ground truth, effectively suppressing irrelevant regions while preserving complete lesion structures (as shown in the third and fourth rows). This advantage is particularly evident in small or ambiguous regions, where other methods either fail to detect subtle lesions or introduce noisy responses, whereas the proposed method maintains coherent and compact segmentation. Furthermore, the predicted boundaries are smoother and more precise, with fewer discontinuities and reduced leakage into surrounding normal tissue, indicating stronger spatial consistency.

Notably, compared with other models, the proposed method demonstrates more effective utilization of textual information, where the predicted regions are better aligned with the semantic descriptions provided by the text, reflecting a stronger correspondence between visual features and textual cues. These visual observations are consistent with the quantitative results, confirming that the proposed method improves segmentation reliability by simultaneously reducing false positives and false negatives while enhancing structural completeness.

4.4.2. Experiments on MosMedData+ Dataset

Table 2 presents the quantitative segmentation results on the MosMedData+ dataset. The proposed method achieves the highest region overlap performance, with a DSC of 80.17% and an mIoU of 66.91%. These results exceed those of SAM Adapter, which attains a DSC of 79.12% and an mIoU of 65.45%, as well as FMISeg, which achieves a DSC of 79.49% and an mIoU of 65.94%. This comparison indicates clear advantages in capturing lesion regions across both vision-only and vision–language methods.

In addition to overlap accuracy, the proposed method maintains competitive boundary precision. The ASD is 2.96, which is close to the value of 2.94 achieved by FMISeg, while the HD is 22.93, remaining comparable to the values of 22.35 obtained by SAM Adapter and 22.32 obtained by LGMS. These results indicate that the improvement in overlap performance does not lead to degradation in boundary quality. A broader comparison reveals that text-guided methods generally outperform traditional architectures. For example, U-Net achieves a DSC of 54.12%, and U-Net++ achieves a DSC of 56.84%, both significantly lower than those of text-based approaches.

However, the effectiveness of text-guided methods varies depending on the quality of vision–language alignment, as evidenced by the strong performance of SAM Adapter without textual input. In addition, some methods exhibit inconsistencies between overlap and boundary metrics. For instance, LGMS achieves a relatively high DSC of 78.01% but produces an ASD of 9.70, indicating suboptimal boundary consistency.

In contrast, the proposed method achieves a more balanced trade-off by simultaneously improving region accuracy and maintaining boundary smoothness. These results demonstrate that effective integration of textual priors with visual representations can lead to both quantitative gains and more stable segmentation behavior.

The qualitative comparison on the MosMedData+ dataset shown in Figure 5 demonstrates that the proposed method produces more accurate and reliable segmentation, particularly in challenging cases involving subtle and small-scale lesions. Compared with nnUNet, LViT, and FMISeg, which exhibit noticeable false positive and false negative regions, the proposed method generates predictions that are more consistent with the ground truth in both lesion extent and structural details.

Specifically, competing methods tend to either miss low-contrast infection regions or introduce spurious activations in normal areas, indicating limitations in distinguishing ambiguous boundaries. In contrast, the proposed method effectively suppresses false positives while recovering difficult-to-detect lesion regions, leading to more complete and cleaner segmentation masks.

This advantage is particularly evident in regions with weak intensity contrast or fragmented lesion patterns, where the model maintains stronger spatial continuity and more accurate boundary delineation. These observations indicate that the proposed method improves both sensitivity to subtle pathological patterns and specificity, resulting in fewer misclassified regions and more clinically reliable segmentation outcomes.

4.5. Generalization Across Diverse Anatomical Symmetry Profiles

We further validate the proposed SVP-CVAE framework across anatomies with different symmetry characteristics. Specifically, we evaluate the framework on two representative datasets: the BraTS 2021 dataset, in which the brain exhibits pronounced bilateral symmetry, and the MSD Liver dataset, in which the target organ has inherently asymmetric geometry. Quantitative comparisons with state-of-the-art methods are reported in Table 3.

Performance on Highly Symmetric Anatomy (BraTS dataset): The human brain provides an ideal scenario for evaluating the explicit symmetry perception capability of SVP-CVAE. As shown in Table 3, our method achieves the best segmentation performance, with a DSC of 93.47% and an mIoU of 87.74%. Benefiting from the cross-bilateral symmetry mechanism, the morphology-aware enhancement module effectively compares bilateral hemispheric structures, enabling accurate identification of unilateral tumor regions as symmetry-disrupting anomalies. In addition, the attribute-latent contrastive objective aligns these asymmetric features with fine-grained clinical semantic descriptions, resulting in more precise boundary delineation.
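The intuition of treating a unilateral lesion as a symmetry-disrupting anomaly can be illustrated with a deliberately simplified sketch. It mirrors a toy intensity grid about its vertical midline and takes the absolute difference. This is only an assumption-laden illustration of the general idea: it is not the authors' cross-bilateral symmetry mechanism, which operates on learned features, and it presumes the midline coincides with the image center. The function name is illustrative.

```python
def asymmetry_map(image):
    """Absolute difference between a 2D grid and its left-right mirror."""
    return [
        [abs(value - row[-1 - col]) for col, value in enumerate(row)]
        for row in image
    ]


# Toy slice: a left-right symmetric background with one brighter pixel
# (value 5) on the left side of the middle row, standing in for a lesion.
toy_slice = [
    [1, 2, 2, 1],
    [1, 5, 2, 1],
    [1, 2, 2, 1],
]

asym = asymmetry_map(toy_slice)
```

Only the middle row produces non-zero responses, at the lesion position and at its mirrored location. This shows why a raw flip difference alone cannot tell which side is abnormal and needs additional cues to resolve the side.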

Performance on Asymmetric Anatomy (MSD Liver dataset): Unlike the lungs and brain, the liver exhibits substantial anatomical asymmetry and inter-patient morphological variability. Despite the absence of intrinsic organ symmetry, SVP-CVAE still achieves state-of-the-art performance on the MSD Liver dataset, obtaining a DSC of 97.45% and an mIoU of 95.02%, outperforming recent vision–language and transformer-based methods such as CFFormer and LViT.

The strong performance on asymmetric targets stems from two factors. First, although the liver is anatomically asymmetric, surrounding structures in axial CT slices preserve relatively stable global contextual symmetry, which provides reliable spatial cues for localization through the proposed cross-bilateral symmetry mechanism. Second, to address inter-patient structural variability, our CVAE-based vision–language framework integrates probabilistic prior-to-posterior inference with text-guided alignment. This combination enables the model to capture diverse, symmetry-related variations and maintain robust feature representations.

4.6. Ablation Study

4.6.1. Overall Component Analysis

Table 4 validates each component of the proposed framework from the perspectives of latent modeling, morphology awareness, and semantic alignment. Starting from the baseline with Dice of 88.75% and mIoU of 79.78%, introducing conditional latent morphology distribution modeling (CLMD) improves performance to 89.23% and 80.56%, while reducing ASD from 4.63 to 4.43 and HD from 26.16 to 23.49. This indicates that conditional latent modeling enhances global stability and reduces extreme prediction errors. Adding MARL further boosts Dice to 91.05% and mIoU to 83.57%, with ASD and HD reduced to 3.72 and 21.47. The gains of 1.82 Dice and 3.01 mIoU over CLMD demonstrate that morphology-aware modeling significantly improves fine-grained structure and boundary accuracy. Replacing MARL with ACTA achieves 90.88% Dice and 83.29% mIoU, with ASD of 3.74 and HD of 22.48, showing that attribute-level alignment mainly enhances semantic consistency with moderate boundary refinement. Combining MARL and ACTA further improves performance to 91.22% Dice and 83.85% mIoU, with ASD reduced to 3.61. This confirms that morphology modeling and semantic alignment are complementary, although partially overlapping. Finally, introducing $L_{SAL}$ achieves the best results with 91.36% Dice and 84.10% mIoU, and further reduces ASD to 3.55 and HD to 20.71, indicating that attribute-latent consistency regularizes the latent space and improves both structural and semantic quality. Overall, CLMD improves uncertainty-aware representation, MARL enhances morphological precision, ACTA strengthens semantic alignment, and the additional supervision further regularizes the latent distribution, leading to consistent gains across all metrics.
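The incremental gains quoted in this paragraph can be recomputed directly from the Dice and mIoU values stated above. The short sketch below does only that arithmetic. The configuration labels are shorthand introduced here, and the ACTA-only variant is omitted because it is not part of the cumulative chain.

```python
# (configuration, Dice %, mIoU %) as stated in the text for the cumulative chain.
steps = [
    ("baseline", 88.75, 79.78),
    ("+CLMD", 89.23, 80.56),
    ("+MARL", 91.05, 83.57),
    ("+MARL+ACTA", 91.22, 83.85),
    ("+L_SAL (full)", 91.36, 84.10),
]

# Gain of each configuration over the previous one, in percentage points.
gains = [
    (cur[0], round(cur[1] - prev[1], 2), round(cur[2] - prev[2], 2))
    for prev, cur in zip(steps, steps[1:])
]

# Total improvement of the full model over the baseline.
total_dice_gain = round(steps[-1][1] - steps[0][1], 2)
total_miou_gain = round(steps[-1][2] - steps[0][2], 2)

for name, dice_gain, miou_gain in gains:
    print(f"{name}: +{dice_gain:.2f} Dice, +{miou_gain:.2f} mIoU")
```

The largest single step is the addition of MARL (+1.82 Dice, +3.01 mIoU), and the full model improves on the baseline by 2.61 Dice and 4.32 mIoU points in total.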

4.6.2. Analysis of CLMD Framework Contributions

To evaluate the effectiveness of the proposed CLMD, ablation studies are conducted by selectively removing its key components, including the prior, posterior, and KL regularization, as summarized in Table 5.

Removing the entire CVAE module results in a Dice score of 90.54% and an mIoU of 82.72%, indicating a clear performance degradation compared with the full model. This result suggests that purely deterministic modeling is insufficient to capture the inherent ambiguity in medical image segmentation. Introducing only the prior increases the Dice score to 90.81% and the mIoU to 83.16%, demonstrating limited improvement. This observation indicates that stochasticity without posterior guidance is not sufficient to learn task-relevant latent representations.

When both the prior and posterior are retained but the KL regularization is removed, the Dice score further increases to 91.06% and the mIoU to 83.60%. This improvement reflects the strong supervisory role of the posterior conditioned on ground-truth masks. However, the absence of KL regularization prevents explicit alignment between the prior and posterior distributions, which may lead to distribution mismatch during inference. The full model achieves the best performance, with a Dice score of 91.36% and an mIoU of 84.10%. These results demonstrate that jointly modeling the prior and posterior with KL regularization produces a more structured and generalizable latent space. The consistent improvements across different configurations confirm that each component of CLMD contributes in a complementary manner, enabling more robust and uncertainty-aware segmentation.
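To make the role of the KL term concrete, the sketch below evaluates the closed-form KL divergence between two diagonal Gaussian distributions. Diagonal Gaussians and the posterior-to-prior direction KL(q‖p) are the usual conditional VAE choices and are assumed here, since this passage does not restate the parameterization. The function and variable names are illustrative.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians given per-dimension means and variances."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total


# Posterior q (conditioned on the mask) versus prior p (image and text only).
kl_matched = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

The divergence is zero when the prior matches the posterior and grows as the two drift apart (0.5 for a unit mean shift at unit variance). This is the mismatch that the regularizer penalizes so that sampling from the prior at inference time resembles sampling from the posterior seen during training.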

4.6.3. Analysis of MARL Module Contributions

To further validate the effectiveness of the MARL module, we design a set of module comparison experiments. The ablation results of MARL in Table 6 demonstrate that both SSGM and CBSM contribute to performance improvement from distinct structural perspectives, and their combination yields consistent gains across all metrics. Compared to the variant without MARL, introducing only SSGM improves Dice and mIoU to 79.57% and 66.08%, respectively, while substantially reducing ASD from 4.61 to 3.50, indicating enhanced sensitivity to local boundary variations. Similarly, the use of only CBSM achieves further gains and reduces ASD to 3.19, suggesting that modeling bilateral structural inconsistency is particularly effective in capturing global morphological irregularities. When both components are jointly integrated, the full MARL module achieves the best performance, with consistent improvements across all evaluation metrics. The progressive reduction in ASD and HD highlights that SSGM and CBSM provide complementary benefits in refining boundary precision and reducing structural deviations, while their joint modeling enables a more comprehensive characterization of lesion morphology. Overall, these results validate that integrating local morphology awareness with global asymmetry modeling is critical for constructing a robust and morphology-sensitive representation.

4.6.4. Analysis of Vision–Language Fusion Mechanism

To investigate the effectiveness of the proposed text fusion strategy, we conduct a series of ablation studies by varying the usage of textual information and the fusion design, as summarized in Table 7. Removing textual guidance leads to inferior performance, indicating that purely visual features are insufficient to fully capture semantic context in medical image segmentation. Introducing text features at a single scale yields consistent improvements. Specifically, using only low-level text features achieves a moderate gain, while high-level text features provide a larger improvement, suggesting that semantically richer textual representations are more beneficial for guiding segmentation. Further incorporating multi-scale text features leads to additional performance gains, demonstrating the advantage of leveraging complementary information across different semantic levels. However, this variant employs a simple MLP-based fusion, which lacks explicit modeling of cross-modal interactions. The full model achieves the best performance, showing that the proposed learned fusion mechanism further enhances feature integration by explicitly modeling structured interactions between visual and textual features. The numerical improvement over simple multi-scale fusion remains consistent, indicating that the proposed fusion module provides a more effective and stable way to exploit textual guidance. These results indicate that textual guidance improves segmentation, where high-level and multi-scale representations are more effective, and structured fusion further facilitates cross-modal interaction.

4.6.5. Analysis of Alignment Loss Weight $λ$

The sensitivity analysis of the alignment loss weight $λ$ (Figure 6) demonstrates that moderate values are critical for optimal performance. Raising $λ$ from 0.01 to 0.1 increases the mIoU from 0.8399 to 0.8410, indicating that tighter attribute-latent semantic consistency promotes more robust cross-modal alignment. However, performance declines beyond $λ = 0.1$: the mIoU decreases to 0.8405 at $λ = 0.2$ and further to 0.8375 at $λ = 0.5$. This degradation suggests that over-emphasizing semantic alignment can interfere with the primary segmentation task. The HD metric follows the same trend. Based on these findings, we set $λ = 0.1$ as the default, as it provides the most favorable trade-off between segmentation precision and structural consistency.
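The selection of the default weight can be reproduced from the four mIoU values reported in this subsection. The sketch below simply picks the best-scoring weight and lists how far each alternative falls below it. The variable names are illustrative.

```python
# mIoU reported for each alignment loss weight in the sensitivity analysis.
miou_by_lambda = {0.01: 0.8399, 0.1: 0.8410, 0.2: 0.8405, 0.5: 0.8375}

# Weight with the highest mIoU.
best_lambda = max(miou_by_lambda, key=miou_by_lambda.get)

# Drop in mIoU of every setting relative to the best one.
drop_vs_best = {
    lam: round(miou_by_lambda[best_lambda] - miou, 4)
    for lam, miou in miou_by_lambda.items()
}
```

The spread across the tested range is small (at most 0.0035 mIoU, at λ = 0.5), which suggests the model is only mildly sensitive to this weight within the evaluated interval.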

4.7. Morph Feature Visualization

The visualization results shown in Figure 7 demonstrate that CBSM and SSGM exhibit distinct yet complementary spatial attention patterns, which are effectively unified in the proposed method. Specifically, CBSM emphasizes bilaterally symmetric structures, consistently activating along both lung fields and particularly around boundary regions, indicating its ability to capture global structural priors and reinforce contour-aware representations. In contrast, SSGM focuses more on the central regions of the lung fields, where the responses are more concentrated and less affected by peripheral noise, suggesting enhanced sensitivity to intra-pulmonary semantic consistency and intensity variations. By integrating these complementary characteristics, the proposed method produces more spatially coherent and balanced activation maps, achieving a more uniform attention distribution across both peripheral and central regions. Furthermore, the fused features exhibit stronger and more precise responses in lesion-related areas, with activation patterns that more closely align with the ground truth annotations, especially in cases with irregular shapes or diffuse boundaries. This improvement indicates that the integration strategy effectively aggregates multi-scale contextual information while suppressing irrelevant background interference.

As a result, the final segmentation outputs demonstrate improved structural completeness and regional consistency, with reduced false activations outside the lung fields and better coverage of lesion regions. These observations confirm that the proposed feature integration not only enhances local discriminability but also preserves global anatomical structure, which is essential for achieving reliable and accurate medical image segmentation.

4.8. Model Efficiency Analysis

Table 8 analyzes the performance gain relative to computational cost. In addition to model size and FLOPs, we report runtime-related metrics such as latency, FPS, and GOPS, which better reflect practical deployment performance. Although our model is not the lightest among all compared methods, it achieves the best overall segmentation performance while maintaining competitive computational complexity. Compared with FMISeg, our method delivers higher Dice and mIoU scores with substantially fewer FLOPs, lower latency, and higher FPS, indicating better deployment efficiency. Overall, these results demonstrate that the proposed framework achieves a favorable balance between effectiveness and efficiency, enabling high-precision segmentation without excessive computational burden.
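Latency and FPS are related by FPS = 1000 / latency (in ms) when one image is processed per forward pass. The sketch below shows one common way to measure both, using warm-up iterations followed by an averaged timed loop. It is a generic harness under stated assumptions, not the authors' benchmarking protocol, and `dummy_forward` is a hypothetical stand-in for a model call.

```python
import time


def measure_latency_ms(fn, warmup=3, runs=20):
    """Average wall-clock time of fn() in milliseconds after warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


def fps_from_latency(latency_ms):
    """Frames per second, assuming one image per forward pass."""
    return 1000.0 / latency_ms


def dummy_forward():
    # Hypothetical stand-in for a single-image inference call.
    return sum(i * i for i in range(10_000))


latency_ms = measure_latency_ms(dummy_forward)
fps = fps_from_latency(latency_ms)
```

When timing a GPU model, the device would additionally need to be synchronized before each clock reading, because kernel launches are asynchronous. That step is omitted here to keep the sketch dependency-free.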

4.9. Limitations

Although demonstrating superior performance, the proposed SVP-CVAE framework has a few limitations. First, as a vision–language model, it is sensitive to text quality; vague or incomplete clinical prompts can lead to suboptimal semantic alignment and degrade segmentation accuracy. Second, despite using parameter-efficient fine-tuning (Adapter), integrating a text encoder and CVAE sampling introduces higher computational overhead compared to lightweight pure-vision models, potentially limiting deployment on resource-constrained edge devices. Finally, our current implementation processes volumetric CT data slice-by-slice. Extending the framework to native 3D architectures could better exploit continuous volumetric symmetry to further enhance segmentation consistency.

5. Conclusions

In this work, we presented SVP-CVAE, a novel vision–language framework for medical image segmentation that integrates textual semantics with visual features while explicitly perceiving morphological symmetry and structural variability. By leveraging a conditional variational autoencoder coupled with attribute-level multimodal alignment, the proposed method models the inherent one-to-many mapping between clinical descriptions and visual realizations. Extensive evaluations on chest X-ray, CT, and MRI datasets demonstrate that SVP-CVAE achieves state-of-the-art performance in regional overlap together with competitive boundary fidelity. Our results underscore that the integration of hierarchical textual features, symmetry-aware representation modules, and probabilistic latent modeling consistently enhances segmentation reliability, especially in cases of subtle or anatomically ambiguous lesions. The framework facilitates the generation of semantically consistent predictions, offering a more robust representation of the variability inherent in realistic clinical scenarios. Importantly, the framework generalizes across diverse anatomical symmetry profiles, successfully delineating targets in both highly symmetric anatomy (e.g., brain tumors in MRI) and inherently asymmetric organs (e.g., the liver) by leveraging global contextual cues.

Moving forward, several promising directions remain for future investigation. First, to mitigate the reliance on high-quality manual text prompts, future work will explore integrating large language models (LLMs) for automated prompt generation and semantic refinement. Second, developing lightweight or distilled variants of the probabilistic framework will be a valuable step toward facilitating real-time deployment on resource-constrained clinical edge devices. Third, while our current cross-bilateral mechanism successfully captures macro-level features, extending the symmetry perception to encompass local symmetry, rotational symmetry, and topological modeling will further enhance the delineation of complex irregular lesions in naturally non-symmetrical organs. Finally, transitioning from slice-by-slice processing to native 3D architectures represents a critical next step, which would allow the model to fully exploit continuous volumetric symmetry and spatial context.

Author Contributions

J.J.: Writing—original draft, methodology, validation, software, data curation; Q.Z.: writing—review and editing, data curation, software; C.H.: writing—review and editing, supervision, project administration, investigation, funding acquisition, conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Key Research and Development Program of China under Grants 2021YFC2500102 and 2016YFC0803000, and in part by the National Natural Science Foundation of China under Grants 41371342 and 82571371.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Illustration of description-driven ambiguity in medical image segmentation. Given the same textual description, different images can yield substantially different segmentation targets and quantitative outcomes. This variability is reflected in multiple morphological and spatial metrics, including centroid percentage coordinates (CC), relative area (RA), perimeter (PM), compactness (CP), and solidity (SD). The figure highlights that identical semantic guidance does not guarantee consistent segmentation behavior across heterogeneous images, underscoring the challenge of aligning vision–language descriptions with precise, image-specific anatomical structures.
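
The descriptors named in Figure 1 can be computed directly from a binary mask. The sketch below is illustrative only: the caption does not give the authors' exact definitions, so the formulas used here are common conventions assumed for the example (centroid as a percentage of image width and height, area as a fraction of the image, perimeter as the number of exposed pixel edges, compactness as 4πA/P², and solidity as area divided by convex-hull area). The function name `morphology_descriptors` and its helpers are hypothetical.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order around the hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def morphology_descriptors(mask):
    """mask: nested lists of 0/1. Returns CC, RA, PM, CP, SD (assumed definitions)."""
    h, w = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    area = len(pixels)
    if area == 0:
        raise ValueError("mask is empty")

    # Centroid of pixel centres, as a percentage of image width and height.
    cx = sum(c + 0.5 for _, c in pixels) / area / w * 100.0
    cy = sum(r + 0.5 for r, _ in pixels) / area / h * 100.0

    # Perimeter: pixel edges that face background or the image border.
    perimeter = 0
    for r, c in pixels:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                perimeter += 1

    # Convex hull over the corners of all foreground pixels.
    corners = [(c + dx, r + dy) for r, c in pixels for dx in (0, 1) for dy in (0, 1)]
    hull_area = polygon_area(convex_hull(corners))

    return {
        "CC": (cx, cy),
        "RA": area / (h * w),
        "PM": perimeter,
        "CP": 4.0 * math.pi * area / perimeter ** 2,
        "SD": area / hull_area,
    }


# A 2x2 square centred in a 4x4 image.
square = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
descriptors = morphology_descriptors(square)
print(descriptors)
```

Two masks that match the same textual description (for example, the same laterality and lobe) can still differ in every one of these values, which is the ambiguity the figure illustrates.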

Figure 2. Illustration of the proposed framework for morphology-aware vision–language medical image segmentation.

Figure 3. Overview of the conditional latent morphology distribution modeling process.

Figure 4. Qualitative comparison on the test set of the QaTa-COV19 dataset. Green, red, and blue indicate true-positive, false-negative, and false-positive pixels, respectively.

Figure 5. Qualitative comparison on the test set of the MosMedData+ dataset. Green, red, and blue indicate true-positive, false-negative, and false-positive pixels, respectively.

Figure 6. Trends of mIoU and HD under different settings of the alignment loss weight $λ$ on the QaTa-COV19 dataset.

Figure 7. Class activation heatmaps from different layers. In the heatmaps, red and blue correspond to high and low activation levels, respectively. In the final two columns, red overlays represent the predicted masks, and green overlays represent the ground truth.

Table 1. Performance comparison between our method and state-of-the-art segmentation methods on the QaTa-COV19 dataset. The best value is shown in bold font. × denotes methods without text input and ✓ denotes methods with text input. † Results are directly reported from the original papers.

Methods	Text	DSC (%)	mIoU (%)	HD	ASD
U-Net	×	70.81	55.56	55.47	17.78
U-Net++	×	70.05	56.72	51.52	12.14
Swin-UNet	×	77.97	63.89	34.86	6.07
nnU-Net	×	79.05	69.65	37.05	7.84
UNETR	×	78.63	64.78	45.55	11.08
TransUNet	×	76.15	61.49	34.12	11.85
SAM Adapter	×	89.12	80.57	29.15	5.03
CFFormer	×	79.03	69.53	35.91	5.48
CLIPSeg	✓	82.24	73.05	21.69	6.79
LViT	✓	81.59	72.54	25.25	2.76
LGMS	✓	89.85	81.78	28.38	4.47
RecLMIS	✓	84.56	76.20	22.97	3.61
FMISeg	✓	91.04	83.56	21.49	3.49
ViTexNet †	✓	90.76	83.25	-	-
HiMix	✓	90.94	83.44	21.66	3.75
Ours	✓	91.36	84.10	20.71	3.55
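
For reference, the two overlap metrics reported above can be computed from a predicted and a ground-truth binary mask as in the following sketch. It is a minimal illustration, not the authors' evaluation code: the function name `dice_and_iou` is hypothetical, masks are assumed to be nested lists of 0/1, and how the paper averages per-image scores into DSC and mIoU is not specified in this section.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of identical shape."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # predicted and present
            elif p:
                fp += 1      # predicted but absent
            elif t:
                fn += 1      # present but missed
    if tp + fp + fn == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2.0 * tp / (2.0 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)  # 2 true positives, 1 false positive, 1 false negative
```

For a single image the two scores are linked by IoU = Dice / (2 − Dice); averages taken over a dataset need not satisfy this identity exactly.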

Table 2. Performance comparison between our method and state-of-the-art segmentation methods on the MosMedData+ dataset. The best value is shown in bold font. × denotes methods without text input and ✓ denotes methods with text input. † Results are directly reported from the original papers.

Methods	Text	DSC (%)	mIoU (%)	HD	ASD
U-Net	×	54.12	41.67	81.35	18.25
U-Net++	×	56.84	44.20	72.10	14.50
Swin-UNet	×	75.14	60.18	35.68	4.78
nnU-Net	×	73.21	60.99	49.49	6.55
UNETR	×	58.50	46.12	75.40	15.60
SAM Adapter	×	79.12	65.45	22.35	4.33
CFFormer	×	66.12	53.28	26.45	3.74
CLIPSeg	✓	59.52	45.29	81.51	9.51
LViT	✓	73.56	61.56	25.86	3.97
LGMS	✓	78.01	63.96	22.32	9.70
RecLMIS	✓	73.24	59.91	34.54	9.85
FMISeg	✓	79.49	65.94	23.74	2.94
ViTexNet †	✓	78.19	64.04	-	-
HiMix	✓	78.73	65.71	23.37	10.62
Ours	✓	80.17	66.91	22.93	2.96
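
The boundary metrics HD and ASD compare the contours of the prediction and the ground truth rather than their overlap. The sketch below shows one common formulation, assuming the maximum (not 95th-percentile) symmetric Hausdorff distance and a symmetric average surface distance measured in pixels; this section does not state which variant or unit the paper uses. All function names are hypothetical.

```python
import math


def boundary_points(mask):
    """Foreground pixels with at least one 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    points = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points


def _nearest_distances(a, b):
    """For every point in a, the Euclidean distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def hausdorff_distance(a, b):
    """Symmetric (maximum) Hausdorff distance between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(max(_nearest_distances(a, b)), max(_nearest_distances(b, a)))


def average_surface_distance(a, b):
    """Mean of all nearest-neighbour distances, taken in both directions."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    distances = _nearest_distances(a, b) + _nearest_distances(b, a)
    return sum(distances) / len(distances)


contour_a = [(0, 0), (0, 1)]
contour_b = [(0, 0), (0, 4)]
hd = hausdorff_distance(contour_a, contour_b)
asd = average_surface_distance(contour_a, contour_b)
print(hd, asd)
```

HD is driven by the single worst boundary error, whereas ASD averages over the whole contour, which is why a method can rank differently on the two columns.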

Table 3. Quantitative comparison of the proposed method against state-of-the-art models on the BraTS21 and MSD Liver datasets. The best value is shown in bold font.

Datasets	Methods	DSC (%)	mIoU (%)	HD	ASD
BraTS21	CFFormer	91.70	86.54	12.13	1.51
BraTS21	LViT	90.17	83.68	13.85	2.26
BraTS21	LGMS	91.47	85.23	12.64	1.82
BraTS21	RecLMIS	91.65	85.64	13.99	2.14
BraTS21	FMISeg	92.82	86.61	13.12	1.83
BraTS21	HiMix	91.73	85.51	12.38	1.94
BraTS21	Proposed	93.47	87.74	11.27	1.61
MSD Liver	CFFormer	97.35	94.88	13.72	1.82
MSD Liver	LViT	96.93	94.21	16.64	1.43
MSD Liver	LGMS	96.79	93.85	13.34	1.37
MSD Liver	RecLMIS	97.13	94.52	11.55	1.24
MSD Liver	FMISeg	97.22	94.61	16.95	1.51
MSD Liver	HiMix	96.99	94.19	10.13	1.42
MSD Liver	Proposed	97.45	95.02	9.39	1.21

Table 4. Comprehensive ablation study on the QaTa-COV19 dataset demonstrating the contribution of each component across multiple evaluation metrics. The best value is shown in bold font.

Method	Dice (%)	mIoU (%)	ASD	HD
Baseline (ConvNeXt + UNETR decoder)	88.75	79.78	4.63	26.16
Baseline + CLMD	89.23	80.56	4.43	23.49
Baseline + CLMD + MARL	91.05	83.57	3.72	21.47
Baseline + CLMD + ACTA	90.88	83.29	3.74	22.48
Baseline + CLMD + MARL + ACTA	91.22	83.85	3.61	21.68
Baseline + CLMD + MARL + ACTA + $L_{SAL}$	91.36	84.10	3.55	20.71

Table 5. Comprehensive ablation study of CLMD on the QaTa-COV19 dataset. The best value is shown in bold font. × denotes the absence of the module, and ✓ denotes the presence of the module.

Method	Prior	Posterior	KL	DSC (%)	mIoU (%)
w/o CVAE	×	×	×	90.54	82.72
Prior only	✓	×	✓	90.81	83.16
w/o KL	✓	✓	×	91.06	83.60
Full Model	✓	✓	✓	91.36	84.10
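
The Prior, Posterior, and KL columns of Table 5 correspond to the three ingredients of a conditional VAE objective. As a minimal sketch, assuming the common parameterization in which both the prior and the posterior are diagonal Gaussians described by a mean and a log-variance (this section does not confirm that SVP-CVAE uses exactly this form), the KL term has a closed form and latent codes are drawn with the reparameterization trick. The function names are hypothetical.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ), summed over dimensions."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl


def sample_latent(mu, logvar, rng=random):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


# Posterior shifted by one standard deviation from a unit-variance prior.
kl_shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
# Identical distributions give zero divergence.
kl_same = kl_diag_gaussians([0.3, -0.2], [0.1, 0.0], [0.3, -0.2], [0.1, 0.0])
z = sample_latent([0.0, 0.0], [0.0, 0.0], rng=random.Random(0))
print(kl_shifted, kl_same, z)
```

Under this reading, the "Prior only" row samples from the text-conditioned prior without a posterior network, and the "w/o KL" row trains both networks without the term that pulls them together.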

Table 6. Comprehensive ablation study of MARL on the MosMedData+ dataset. The best value is shown in bold font.

Method	Dice (%)	mIoU (%)	ASD	HD
w/o MARL	79.39	65.83	4.61	24.06
only SSGM	79.57	66.08	3.50	23.77
only CBSM	79.76	66.34	3.19	23.51
Ours	80.17	66.91	2.96	22.93

Table 7. Comprehensive ablation study of vision–language fusion mechanism on the QaTa-COV19 dataset. The best value is shown in bold font. × denotes methods without text input and ✓ denotes methods with text input.

Method	Text	Fusion Type	Scale	DSC (%)	mIoU (%)
w/o Text	×	-	-	90.53	82.71
+ Text (low only)	✓	-	low	90.86	83.26
+ Text (high only)	✓	-	high	91.10	83.66
+ Text (multi-scale)	✓	MLP	multi	91.25	83.92
Full Model	✓	learned fusion	multi	91.36	84.10

Table 8. Model efficiency comparison between our method and other methods on the QaTa-COV19 dataset.

Method	Params (M)	FLOPs (G)	Latency (ms)	FPS	GOPS	Dice (%)	mIoU (%)
LViT	29.7	54.2	7.44	134.39	7278.1	81.59	72.54
LGMS	146.9	22.36	8.19	122.16	2731.3	89.85	81.78
RecLMIS	74.2	36.62	7.43	134.62	6498.7	84.56	76.20
FMISeg	213.5	42.62	13.51	74.01	3154.3	91.04	83.56
HiMix	146.9	22.53	14.53	68.8	1550.3	90.94	83.44
Ours	175.7	23.76	10.24	97.65	2320.7	91.36	84.10
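
The FPS and GOPS columns of Table 8 appear to be derived from the latency and FLOPs columns. The sketch below assumes FPS = 1000 / latency (ms) and GOPS = FLOPs (G) × FPS; this relationship is inferred from the reported numbers (it reproduces the "Ours" row to within rounding) rather than taken from a definition given by the authors, and the function name `throughput_metrics` is hypothetical.

```python
def throughput_metrics(flops_g, latency_ms):
    """Return (frames per second, giga-operations per second) for one model.

    flops_g    -- FLOPs per forward pass, in units of 1e9
    latency_ms -- time for one forward pass, in milliseconds
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    fps = 1000.0 / latency_ms   # forward passes per second
    gops = flops_g * fps        # operations per second, in units of 1e9
    return fps, gops


# Values from the "Ours" row of Table 8.
fps, gops = throughput_metrics(flops_g=23.76, latency_ms=10.24)
print(round(fps, 2), round(gops, 1))
```

Read this way, GOPS measures how well a model uses the hardware rather than how fast it is: a model with few FLOPs and moderate latency scores low on GOPS even when its FPS is competitive.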
