1. Introduction
Thangka, as a traditional Tibetan scroll painting, holds profound cultural and artistic significance. Distinguished by its intricate iconography, rich color schemes, and detailed narrative compositions, Thangka is not only a sacred art form but also a valuable heritage asset requiring careful preservation. With the advent of digital technologies, efficiently analyzing, segmenting, and preserving Thangka images has become a pressing research topic in cultural heritage computing. Among these tasks, image segmentation serves as the foundation for downstream applications such as recognition, restoration, and style transfer.
However, the unique characteristics of Thangka images—dense components, fine-grained textures, and a lack of annotated datasets—pose significant challenges to conventional segmentation methods. Existing annotated resources are generally limited in both scale and stylistic diversity, which restricts their ability to cover the wide range of variations in color, layout, and symbolic representation found in Thangka. Furthermore, some datasets suffer from imbalanced class distributions and incomplete annotations, particularly for rare or degraded elements, making it difficult to train models that generalize robustly across different artistic styles. Similar issues are encountered in the segmentation of ancient murals and art paintings, where intricate patterns, rich color variation, and ambiguous boundaries often defeat standard approaches [1,2,3]. As a result, researchers have developed a series of specialized algorithms for Thangka and related artistic images. These include spatial-prior-enhanced models [4], line drawing augmentation, multi-scale attention mechanisms [3], and end-to-end deep networks tailored for capturing both regular and blurred structures [5,6]. The integration of edge information, region proposals, and spatial priors has further improved segmentation accuracy for critical semantic elements, such as headdresses and figures, in Thangka and mural art [1,7,8].
Early image segmentation approaches predominantly relied on convolutional neural networks (CNNs). The seminal work of U-Net [9] established a widely adopted encoder–decoder architecture, achieving remarkable success in biomedical imaging. Subsequent advances, such as Context Prior [10] and CondNet [11], enhanced contextual modeling and feature discrimination, but their effectiveness was often limited by the local receptive fields of convolutional operations. To address these shortcomings in the context of Thangka and mural images, researchers introduced hybrid models that combine multi-scale convolutions, atrous convolutions, and attention modules, boosting feature extraction and edge detail preservation in complex scenes [3,5,6].
The advent of Vision Transformers (ViTs) revolutionized the field, as architectures like Swin Transformer [12] introduced hierarchical attention mechanisms that balance computational efficiency with global context modeling. Further developments, such as SegFormer [13], Segmenter [14], K-Net [15], MaskFormer [16], and Mask2Former [17], unified various segmentation tasks within scalable frameworks. In the specific case of Thangka images, the use of multi-scale attention- and coordinate-aware modules has been shown to effectively capture both global semantic context and intricate details, helping to overcome over-segmentation and edge confusion [3].
Meanwhile, open-vocabulary and vision–language models have broadened the scope of segmentation. GroupViT [18], FC-CLIP [19], and ClipSeg [20] have enabled zero-shot and prompt-based segmentation by leveraging text supervision and large-scale pretrained backbones. The Segment Anything Model (SAM) [21] and GenSAM [22] further expanded segmentation capabilities through promptable and cross-modal frameworks. In the domain of art, domain adaptation and training-free methods have also gained traction, addressing the scarcity of annotated data for paintings by using style transfer and generative techniques to synthesize training samples or perform segmentation directly [2,23].
Efforts to enhance universality and efficiency have led to transformer-based frameworks such as Mask DINO [24] and OneFormer [25], and methods like ReMaX [26], SegNeXt [27], and PlainSeg [28]. Recent works have also explored dense decoding and decoder design improvements, offering practical benefits for high-resolution and complex images [29,30]. In heritage imaging, multi-attribute feature-fusion- and knowledge-driven detection have enabled more nuanced retrieval and recognition of objects in Thangka, leveraging both appearance and domain-specific cues [8,31,32].
Beyond single-modal methods, multi-modal and multi-task approaches have become increasingly prominent. DPLNet [33] introduced dual-prompt learning for efficient RGB-D and RGB-T segmentation, while GeminiFusion [34] and ODIN [35] unified 2D and 3D segmentation tasks. Other frameworks such as Delivering Arbitrary-Modal Segmentation [36] and SwinMTL [37] scaled segmentation to handle multi-modal and multi-task scenarios. In Thangka and mural segmentation, integrating spatial, chromatic, and contextual cues within multi-branch or fusion networks has further improved semantic understanding and detail recovery [1,3,38]. Additionally, superpixel-based and clustering approaches have been explored for high-precision segmentation of murals with complex color appearances [38].
Robustness and data efficiency remain active research frontiers. Self-ensemble strategies [39] and representation separation techniques [40] have addressed issues around over-smoothing and model robustness. At the same time, weakly supervised and box-level annotation methods have been proposed to reduce the high cost of pixel-level labeling, showing strong performance in element segmentation for Thangka and portrait images [7,41]. In parallel, recent works, such as Li et al. [42], explored attention-based augmentation and multi-view learning for few-shot surface defect detection, providing insights into how tailored architectures can enhance feature representation under limited-data scenarios.
More recently, generative and diffusion models have emerged as promising alternatives to discriminative approaches. Bui et al. [43] introduced a diffusion-based framework for RGB-D semantic segmentation, demonstrating superior performance in handling noisy data and capturing intricate spatial structures. The iterative denoising nature of diffusion models enables enhanced robustness and generalization, which aligns well with the challenges posed by Thangka and related artistic images. In addition, approaches such as PaintSeg [23] offer training-free, adversarial segmentation pipelines that can generalize across diverse artistic styles.
Thangka images present unique challenges for segmentation due to their symbolic complexity, densely packed visual elements, and strong symmetrical layouts. Many regions, such as lotus petals and backlight halos, share similar textures and colors, making it difficult for traditional models to distinguish them. In addition, the main figures are often embedded within symmetrical compositions, which require structural awareness to segment accurately. Meanwhile, the textual descriptions associated with Thangka images contain rich symbolic meaning that is often missing from the visual channel, limiting the model’s ability to fully understand semantic context.
To address these issues, we propose SPIRIT, a structure-aware and prompt-guided diffusion segmentation framework. Our model integrates information from image, text, and support samples to guide the generation of accurate masks. It uses a diffusion-based architecture with support-query encoding, semantic-guided attention for multi-modal fusion, and a dedicated symmetry-aware module for structural consistency. This design enhances the model’s capacity to handle fine-grained categories, artistic variations, and visually ambiguous regions in Thangka paintings.
Contributions: The main contributions of this work are as follows:
We construct a high-quality Thangka segmentation dataset with pixel-level expert annotations, covering diverse artistic schools and materials.
We propose SPIRIT, a diffusion-based segmentation framework that incorporates cultural symmetry priors to better capture structural regularities in Thangka compositions.
We design novel modules including the Symbolic-Guided Attention Field (SGAF), Symmetry-aware Affinity Module (SyAM), and Text-guided Augmentation and Refinement (TAR), each of which is validated using ablation studies.
Extensive experiments on the Thangka and ArtBench datasets demonstrate that our method consistently outperforms strong baselines, with significant gains in challenging categories such as halo and lotus.
Organization: The remainder of this paper is organized as follows: Section 2 reviews related work on segmentation, diffusion models, and cultural heritage imaging; Section 3 introduces the proposed SPIRIT framework and its key components; Section 4 presents the experimental setup, results, and ablation studies; finally, Section 5 concludes the paper and discusses future directions.
3. Methods
We introduce SPIRIT, a support-guided and symmetry-aware diffusion framework for Thangka image segmentation. The overall architecture is shown in Figure 1. The core idea of SPIRIT is to combine cross-modal support guidance with symmetry priors that are inherent in Thangka art.
The support image and its mask are first encoded into latent features. These features are combined with the support text features and sent into a diffusion U-Net to generate enriched support context features. SPIRIT then uses these features to guide the query branch. The query image is processed by another diffusion U-Net, where a fusion module injects the support context to help generate more accurate mask features. A symmetry affinity module further improves the query features by enforcing structural consistency, while the query text features provide semantic guidance. In this way, the predicted masks match the target regions and also follow the symmetrical design of Thangka paintings. The refined mask features are then compared with the original query mask features for training.
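For concreteness, the following PyTorch-style sketch traces the data flow just described. The module interfaces (the injected vae, text_encoder, diffusion U-Nets, sgaf, and syam components and their call signatures) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SPIRITSketch(nn.Module):
    """Schematic forward pass of SPIRIT; component internals are omitted and injected as modules."""
    def __init__(self, vae, text_encoder, support_unet, query_unet, sgaf, syam):
        super().__init__()
        self.vae = vae                    # shared latent encoder (Section 3.1)
        self.text_encoder = text_encoder  # CLIP text encoder for support/query captions
        self.support_unet = support_unet  # diffusion U-Net, support branch
        self.query_unet = query_unet      # diffusion U-Net, query branch (weights shared in practice)
        self.sgaf = sgaf                  # semantic-guided attention fusion (Section 3.2)
        self.syam = syam                  # symmetry affinity module (Section 3.3)

    def forward(self, sup_img, sup_mask, sup_text, qry_img, qry_text, t):
        # 1. Encode the support image and mask, then run the support diffusion branch
        #    conditioned on the support text to obtain enriched support context features.
        sup_ctx = self.support_unet(
            torch.cat([self.vae(sup_img), self.vae(sup_mask)], dim=1),
            t, cond=self.text_encoder(sup_text))
        # 2. Run the query diffusion branch and inject the support context via SGAF,
        #    guided by the query text features.
        qry_text_feat = self.text_encoder(qry_text)
        qry_feat = self.query_unet(self.vae(qry_img), t, cond=qry_text_feat)
        fused = self.sgaf(qry_feat, sup_ctx, qry_text_feat)
        # 3. Refine with the symmetry affinity module to enforce structural consistency.
        return self.syam(fused, qry_text_feat)
```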
Through these designs, SPIRIT effectively unifies support-based cross-modal guidance and symmetry-aware refinement, resulting in more precise and reliable segmentation of complex Thangka artworks.
3.1. Preprocessing with Text-Guided Attribute Augmentation and Latent Visual Encoding
To enable semantically guided Thangka segmentation, we introduce a dual-stream preprocessing pipeline that processes both visual and textual inputs into structured embeddings. This step ensures that the downstream diffusion model can access the aligned representations enriched with symbolic guidance.
On the visual side, we adopt a shared latent encoder $\mathcal{E}$ based on the Variational Autoencoder (VAE) architecture from Stable Diffusion. It is used not only to compress the input Thangka image $I$, but also the corresponding segmentation mask $M$, into their respective latent embeddings:

$z_I = \mathcal{E}(I), \qquad z_M = \mathcal{E}(M),$

where $z_I$ and $z_M$ denote the latent visual and mask features, respectively. By operating in the same latent space, the model benefits from a consistent representation that supports learning in the denoising process while reducing computational complexity.
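As an illustration of this shared latent encoding step, the snippet below uses the Stable Diffusion VAE from the diffusers library. The specific checkpoint name, the 512-pixel input resolution, and the replication of the single-channel mask to three channels are assumptions made for this example.

```python
import torch
from diffusers import AutoencoderKL

# Load a Stable Diffusion VAE as the shared latent encoder E (assumed checkpoint).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def encode_latent(x: torch.Tensor) -> torch.Tensor:
    """Encode an image or mask (3-channel tensor in [-1, 1]) into the shared VAE latent space."""
    posterior = vae.encode(x).latent_dist
    return posterior.sample() * vae.config.scaling_factor

image = torch.randn(1, 3, 512, 512)   # placeholder Thangka image tensor
mask = torch.randn(1, 3, 512, 512)    # mask replicated to 3 channels so the same encoder can be reused
z_I, z_M = encode_latent(image), encode_latent(mask)   # both latents live in the same space
```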
On the textual side, raw Thangka captions are often verbose and narrative, making them difficult to directly align with visual features. To extract structured symbolic knowledge and improve semantic clarity, we introduce a large language model (LLM)-based augmentation pipeline, as illustrated in Figure 2. Given a raw caption $c$, the LLM performs two parallel operations to produce an enriched textual representation.
One branch focuses on attribute extraction and template generation. The LLM identifies a set of symbolic attributes from the caption, such as “single person,” “multi-limbed,” or “colorful base,” with each attribute associated with categorical labels (Yes, No, or Unknown). These attributes are then converted into natural language phrases using predefined templates, resulting in structured symbolic statements that highlight the key semantic aspects of the image.
The other branch involves direct rewriting of the original caption. The LLM refines $c$ into a concise and semantically enhanced version $c_r$ guided by the extracted attribute context $\mathcal{A}$:

$c_r = \mathrm{LLM}(c \mid \mathcal{A}).$

The final input $c^{*}$ is formed by concatenating the rewritten caption $c_r$ with the template-generated symbolic phrases $p_1, \ldots, p_K$:

$c^{*} = c_r \,\|\, p_1 \,\|\, \cdots \,\|\, p_K,$

where $\|$ denotes sentence-level concatenation.

This enriched caption $c^{*}$ is then encoded using the CLIP text encoder $E_{\mathrm{text}}$ to obtain the final text embedding:

$t = E_{\mathrm{text}}(c^{*}).$
By combining narrative clarity and symbolic abstraction, this attribute-guided textual augmentation enhances cross-modal alignment, which is especially important for Thangka segmentation tasks characterized by intricate iconographic structures and layered semantic cues.
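A minimal sketch of this textual branch is given below, assuming the Hugging Face CLIP text encoder. The attribute templates, the example caption, and the helper build_enriched_caption are hypothetical stand-ins for the LLM-generated outputs described above.

```python
from transformers import CLIPTokenizer, CLIPTextModel

# Assumed CLIP checkpoint; the paper only states that a CLIP text encoder is used.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

# Hypothetical attribute templates mirroring the attribute-extraction branch described above.
TEMPLATES = {
    "single person": "The painting depicts a single central figure.",
    "multi-limbed": "The deity is depicted with multiple arms.",
    "colorful base": "The figure sits on a colorful lotus base.",
}

def build_enriched_caption(rewritten, attributes):
    """Concatenate the rewritten caption with template phrases for attributes labeled 'Yes'."""
    phrases = [TEMPLATES[a] for a, label in attributes.items() if label == "Yes" and a in TEMPLATES]
    return " ".join([rewritten] + phrases)

caption = build_enriched_caption(
    "A seated bodhisattva surrounded by a halo.",
    {"single person": "Yes", "multi-limbed": "No", "colorful base": "Yes"},
)
tokens = tokenizer(caption, padding="max_length", truncation=True, return_tensors="pt")
text_embedding = text_encoder(**tokens).last_hidden_state  # token-level features used downstream
```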
3.2. Cross-Modal Semantic Fusion in Diffusion Models
Diffusion models have recently emerged as powerful tools for dense prediction tasks, offering a progressive denoising mechanism to recover structured outputs from noisy latent variables. Formally, given a noisy representation $x_t$ at time step $t$, the model learns to estimate the added noise $\epsilon_\theta(x_t, t)$ and predict the clean latent variable $\hat{x}_0$:

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \hat{x}_0 = \dfrac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}},$

where $\bar{\alpha}_t$ denotes the noise schedule. Through iterative refinement over multiple steps, this formulation enables the model to capture both local structure and global semantics in complex visual data.
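The sketch below shows how this noise-prediction objective translates into a single training step; the U-Net call signature unet(z_t, t, cond) is an assumed interface rather than the exact one used in our implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, z0, alphas_cumprod, cond):
    """One standard DDPM noise-prediction step on a clean latent z0."""
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * noise   # forward noising
    pred = unet(z_t, t, cond)                               # predict the added noise
    return F.mse_loss(pred, noise)
```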
To fully exploit this potential in our Thangka segmentation setting, we propose a dual-pathway denoising architecture in which both the query image and its associated segmentation mask are encoded into latent spaces and processed in parallel diffusion branches. Although both branches follow the same denoising pipeline, their inputs serve complementary purposes: the image path captures visual semantics, while the mask path focuses on structural supervision.
Rather than treating these branches independently, we enforce shared weights between their diffusion modules. This design allows the model to jointly learn from both modalities while maintaining consistent feature representations across image and mask domains. The shared denoising U-Net not only reduces parameter overhead, but also encourages semantic alignment throughout the denoising trajectory.
In Thangka segmentation, relying solely on visual or structural signals may be insufficient, as symbolic attributes (e.g., “lotus posture,” “multiple arms”) often govern region delineation. These high-level semantics are crucial for disambiguating visually similar patterns or culturally encoded structures. To inject both visual context and symbolic knowledge into the denoising process, we design a two-stage attention mechanism within the Semantic-Guided Attention Fusion (SGAF) module, as illustrated in Figure 3. This module first performs visual self-attention enhanced by support image features, and then applies cross-attention guided by symbolic textual embeddings.
In the first stage, we perform a self-attention operation over the query image features, but enrich the key and value branches by incorporating support image information. Specifically, given a query image, we extract its query, key, and value embeddings $Q_q$, $K_q$, and $V_q$, and similarly extract $K_s$ and $V_s$ from the corresponding support image. We concatenate these representations along the key and value branches:

$K = K_q \,\|\, K_s, \qquad V = V_q \,\|\, V_s,$

where $\|$ denotes feature concatenation followed by a linear projection. The enhanced visual representation is obtained by

$X = \mathrm{Softmax}\!\left(\dfrac{Q_q K^{\top}}{\sqrt{d_k}}\right) V,$

which captures both intra-image context and inter-image priors from the support sample.
In the second stage, we aim to inject symbolic semantics derived from textual features. To preserve spatial sensitivity in the semantic representation, we first apply a depth-wise convolution to the textual query feature $Q_t$:

$\tilde{Q}_t = \mathrm{DWConv}(Q_t),$

producing a locally aware symbolic embedding. We then perform a cross-attention operation where the updated textual queries $\tilde{Q}_t$ attend over the fused visual representation $X$, obtained from the previous stage, which serves as both the key and value:

$F = \mathrm{Softmax}\!\left(\dfrac{\tilde{Q}_t X^{\top}}{\sqrt{d_k}}\right) X.$

This second-stage attention allows symbolic text-derived queries to directly modulate visual features, aligning visual patterns with high-level semantics. The resulting representation $F$ reflects stronger cross-modal consistency and enhanced awareness of symbolic cues, which is particularly beneficial for segmenting the intricate iconographic structures of Thangka paintings.
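A compact PyTorch sketch of this two-stage attention is given below. The head count, the shared linear projection after concatenation, and the token shapes are illustrative choices rather than the exact SGAF configuration.

```python
import torch
import torch.nn as nn

class SGAFSketch(nn.Module):
    """Two-stage attention: support-enriched self-attention, then text-guided cross-attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dwconv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depth-wise conv on text queries
        self.proj = nn.Linear(dim, dim)  # projection applied after key/value concatenation

    def forward(self, query_tokens, support_tokens, text_tokens):
        # Stage 1: keys/values are the concatenation of query- and support-image tokens.
        kv = self.proj(torch.cat([query_tokens, support_tokens], dim=1))
        x, _ = self.self_attn(query_tokens, kv, kv)
        # Stage 2: depth-wise conv makes the text queries locally aware, then they attend over X.
        q_t = self.dwconv(text_tokens.transpose(1, 2)).transpose(1, 2)
        fused, _ = self.cross_attn(q_t, x, x)
        return fused

# Example shapes: batch 2, 1024 visual tokens, 77 text tokens, feature dim 256.
sgaf = SGAFSketch(dim=256)
out = sgaf(torch.randn(2, 1024, 256), torch.randn(2, 1024, 256), torch.randn(2, 77, 256))
```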
3.3. Symmetry-Aware Mask Refinement
To better exploit the inherent bilateral or rotational symmetry of Thangka images and strengthen cross-modal alignment, we design a Symmetry Affinity Module (SyAM). The motivation behind this module is that visual features often exhibit structured symmetries, which, if explicitly modeled, can provide strong regularization and enhance the semantic consistency between vision and text. The overall framework is illustrated in Figure 4, and we detail the mathematical formulation step by step below.
First, let the visual feature map be denoted as $F \in \mathbb{R}^{C \times H \times W}$, where $C$ is the channel dimension and $H \times W$ represents the spatial resolution. A flipped counterpart of the feature map is obtained through a symmetry transformation $\mathcal{T}(\cdot)$, i.e., $F' = \mathcal{T}(F)$. Both the original and flipped features are processed by Global Average Pooling (GAP) followed by normalization:

$f = \mathrm{Norm}(\mathrm{GAP}(F)), \qquad f' = \mathrm{Norm}(\mathrm{GAP}(F')).$

Here, $f$ and $f'$ are compact global descriptors of the original and flipped features.

Next, a similarity function $\phi(\cdot,\cdot)$ is applied to $f$ and $f'$ to produce the symmetry affinity map:

$S = \phi(f, f').$

The output $S$ measures the structural correlation between the original and flipped feature representations.
Meanwhile, the text embedding is denoted as $t \in \mathbb{R}^{d}$, where $d$ is the textual feature dimension. It is first normalized by Layer Normalization (LN), then projected into a latent space by a Multi-Layer Perceptron (MLP). The result is passed through a sigmoid activation $\sigma(\cdot)$ to generate the modulation vector:

$g = \sigma\big(\mathrm{MLP}(\mathrm{LN}(t))\big).$

Here, $g$ acts as a text-induced gating vector that modulates the affinity map. This modulation is achieved by element-wise multiplication (Hadamard product) between $S$ and $g$, resulting in the modulated symmetry map:

$S_m = S \odot g,$

where $\odot$ denotes the element-wise product.
To enhance stability, a residual refinement process is introduced. The modulated map $S_m$ is compressed via GAP and then passed through an MLP with sigmoid activation to generate a correction signal. This signal is combined with $g$ by residual addition ($\oplus$):

$\hat{g} = g \oplus \sigma\big(\mathrm{MLP}(\mathrm{GAP}(S_m))\big).$

Subsequently, the corrected modulation is applied to the original symmetry map:

$\hat{S} = S \odot \hat{g}.$

Here, $\hat{S}$ represents the refined symmetry response. Finally, the refined response $\hat{S}$ is fused back with the original visual feature map $F$ through a residual connection:

$M = F \oplus \hat{S} \odot F.$

The resulting representation $M$ serves as the symmetry-enhanced feature, which incorporates both visual structural symmetry and text-guided modulation for improved cross-modal alignment.
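The following sketch instantiates SyAM in PyTorch under simplifying assumptions: GAP is implemented as a spatial mean, the similarity function $\phi$ is taken to be a channel-wise product of the normalized descriptors, and the final residual fusion multiplies the refined response back onto the feature map. The real module may differ in these choices.

```python
import torch
import torch.nn as nn

class SyAMSketch(nn.Module):
    """Symmetry Affinity Module sketch: flipped-feature affinity gated by a text-derived vector."""
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.LayerNorm(text_dim), nn.Linear(text_dim, channels), nn.Sigmoid())
        self.refine = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; text: (B, D) sentence-level text embedding.
        flipped = torch.flip(feat, dims=[-1])                                        # symmetry transform (horizontal flip)
        f = torch.nn.functional.normalize(feat.mean(dim=(2, 3)), dim=-1)             # GAP + norm, original features
        f_flip = torch.nn.functional.normalize(flipped.mean(dim=(2, 3)), dim=-1)     # GAP + norm, flipped features
        S = f * f_flip                                # channel-wise affinity (one possible choice of phi)
        g = self.gate(text)                           # text-induced gating vector
        S_m = S * g                                   # modulated symmetry map
        g_hat = g + self.refine(S_m)                  # residual correction of the gate
        S_hat = S * g_hat                             # refined symmetry response
        return feat + S_hat.unsqueeze(-1).unsqueeze(-1) * feat   # residual fusion back into the features

syam = SyAMSketch(channels=256, text_dim=512)
out = syam(torch.randn(2, 256, 64, 64), torch.randn(2, 512))
```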
3.4. Loss Function
To effectively train our semantic-aware Thangka segmentation framework, we design a composite loss function that integrates multiple objectives from different modules in the architecture. Each loss term supervises a specific component or interaction, ensuring that both the visual structure and symbolic semantics are properly captured and aligned throughout the denoising and refinement stages.
For the dual-branch diffusion model, we supervise both the image and mask diffusion branches using a standard noise prediction loss. Given a clean latent $z_0$ and its noisy counterpart $z_t$ at timestep $t$, the model predicts the added noise $\epsilon_\theta(z_t, t)$ and is optimized via a simple $\ell_2$ reconstruction loss:

$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\big[\, \lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2 \,\big].$
This loss is applied to both the image and mask diffusion branches with shared denoising parameters, promoting consistent noise estimation and generation across modalities.
For the cross-modal semantic fusion module, we supervise the output of the denoised latent mask $\hat{z}_M$ by comparing it with the ground-truth latent segmentation $z_M$ using a latent-level mean squared error:

$\mathcal{L}_{\mathrm{mask}} = \lVert \hat{z}_M - z_M \rVert_2^2.$
This term ensures that the denoised result remains semantically and structurally aligned with the annotated segmentation layout.
We incorporate a text–image contrastive loss $\mathcal{L}_{\mathrm{con}}$ to encourage better alignment between the symbolic text embedding $t$ and the visual latent features $z_I$. Inspired by CLIP-style contrastive learning, we minimize the cosine distance between matched image–text pairs while maximizing it for unmatched pairs:

$\mathcal{L}_{\mathrm{con}} = -\log \dfrac{\exp\!\big(\mathrm{sim}(z_I, t)/\tau\big)}{\sum_{j} \exp\!\big(\mathrm{sim}(z_I, t_j)/\tau\big)},$

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is a learnable temperature parameter.
For the symmetry-aware mask refinement module, we enforce structural consistency via a symmetry consistency loss $\mathcal{L}_{\mathrm{sym}}$, which penalizes the discrepancy between the predicted mask and its symmetric counterpart:

$\mathcal{L}_{\mathrm{sym}} = \big\lVert m - \mathcal{T}(m) \big\rVert_1,$

where $m$ is the generated mask and $\mathcal{T}(\cdot)$ denotes a symmetry transformation (e.g., horizontal flip). This loss encourages the model to respect underlying symmetrical patterns in Thangka layouts.
Finally, the total loss is a weighted combination of all the above components:

$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{diff}} + \lambda_2 \mathcal{L}_{\mathrm{mask}} + \lambda_3 \mathcal{L}_{\mathrm{con}} + \lambda_4 \mathcal{L}_{\mathrm{sym}},$

where $\lambda_1$–$\lambda_4$ are scalar weights that balance the contributions of each loss term. In practice, these values are selected via validation performance to ensure stable convergence and optimal segmentation accuracy.
This multi-level loss design enables our model to jointly optimize for denoising fidelity, semantic–symbolic alignment, contrastive guidance, and structural regularity, which is crucial for tackling the rich, intricate compositions found in Thangka art.
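A sketch of how these terms might be combined in code is shown below. The helper name, the InfoNCE-style implementation of the contrastive term, the L1 form of the symmetry term, and the placeholder lambda values are assumptions for illustration, not the tuned configuration from the paper.

```python
import torch
import torch.nn.functional as F

def spirit_total_loss(eps_pred, eps_true, z_mask_pred, z_mask_gt,
                      img_emb, txt_emb, mask_pred, temperature,
                      lambdas=(1.0, 1.0, 0.1, 0.1)):
    """Composite objective: diffusion, latent mask, contrastive, and symmetry terms (placeholder weights)."""
    l_diff = F.mse_loss(eps_pred, eps_true)            # noise-prediction loss (both branches)
    l_mask = F.mse_loss(z_mask_pred, z_mask_gt)        # latent-level mask reconstruction
    # CLIP-style symmetric InfoNCE over the batch.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    l_con = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    # Symmetry consistency between the predicted mask and its horizontal flip.
    l_sym = F.l1_loss(mask_pred, torch.flip(mask_pred, dims=[-1]))
    l1, l2, l3, l4 = lambdas
    return l1 * l_diff + l2 * l_mask + l3 * l_con + l4 * l_sym
```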
4. Experiments
4.1. Evaluation Metrics
We evaluate segmentation performance using four standard metrics: mean Intersection over Union (mIoU), Dice coefficient, mean Accuracy (mAcc), and Pixel Accuracy (PixAcc). These metrics capture complementary aspects of segmentation quality.
The mIoU measures the overlap between predictions and ground truth. For class $c$,

$\mathrm{IoU}_c = \dfrac{TP_c}{TP_c + FP_c + FN_c}, \qquad \mathrm{mIoU} = \dfrac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c.$

The Dice coefficient, equivalent to the F1 score, emphasizes boundary accuracy:

$\mathrm{Dice}_c = \dfrac{2\,TP_c}{2\,TP_c + FP_c + FN_c}.$

The mAcc evaluates per-class accuracy:

$\mathrm{mAcc} = \dfrac{1}{C}\sum_{c=1}^{C} \dfrac{TP_c}{TP_c + FN_c}.$

Finally, PixAcc measures the overall correctness:

$\mathrm{PixAcc} = \dfrac{\sum_{c} TP_c}{\sum_{c} (TP_c + FN_c)},$

where $TP_c$, $FP_c$, and $FN_c$ denote the true positive, false positive, and false negative pixel counts for class $c$, and $C$ is the number of classes.
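For reference, the snippet below computes all four metrics from a confusion matrix accumulated over integer label maps; it is a straightforward NumPy implementation of the definitions above.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Compute mIoU, mean Dice, mAcc, and PixAcc from integer label maps of the same shape."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)        # rows: ground truth, columns: prediction
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    dice = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    acc = tp / np.maximum(tp + fn, 1)
    return {"mIoU": iou.mean(), "Dice": dice.mean(), "mAcc": acc.mean(), "PixAcc": tp.sum() / cm.sum()}

# Example with random 5-class label maps.
rng = np.random.default_rng(0)
print(segmentation_metrics(rng.integers(0, 5, (256, 256)), rng.integers(0, 5, (256, 256)), 5))
```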
4.2. Experimental Setup
We implement all models using the PyTorch (version 1.12.1) framework and train them on a single NVIDIA RTX 3080 Ti GPU. Input Thangka images are resized to a fixed resolution and normalized before training. The model is optimized using the RAdam optimizer, which combines the stability of Adam with a rectified adaptive learning rate for better convergence in the early stages. We set the initial learning rate to 0.001 and adopt a cosine annealing schedule with linear warm-up over the first ten epochs. The total number of training epochs is 300, with a batch size of 32. Weight decay is applied to prevent overfitting.
Each loss component is weighted to balance the training objectives: separate weights are assigned to the segmentation objective, the auxiliary classification loss, the cross-modal alignment loss, and the symmetry consistency loss. All values are selected via grid search on the validation set to ensure stable convergence and optimal segmentation accuracy.
4.3. Dataset and Augmentation
We construct a Thangka image segmentation dataset consisting of 2210 high-resolution digital images collected from public repositories and archival resources. The dataset covers a wide range of artistic schools, including Karma Gadri, Menri, Rebgong, Mensar, Chintse, Gyiwugang, and Nipali, as well as multiple materials such as silk, canvas, and paper-based Thangka paintings. Each image is manually annotated at the pixel level by domain experts through a two-stage verification process to ensure accuracy, with five semantic categories defined: figure, halo, backlight, background, and lotus. These categories capture the typical structural and symbolic elements commonly found in Thangka compositions. We also deliberately include rare styles and historically degraded samples (e.g., pigment loss, cracks, fading) to better reflect real-world cultural heritage scenarios and evaluate robustness. In addition, most images are accompanied by corresponding textual descriptions, which offer rich semantic context and support potential multi-modal modeling. The segmentation task is formulated as a standard supervised learning problem. With its scale and diversity, the dataset provides a strong foundation for training and evaluating models under complex, real-world conditions in cultural heritage image analysis. To further assess the generalization ability of our method in broader artistic domains, we also incorporate the ArtBench dataset as a supplementary benchmark. ArtBench is a publicly available dataset for semantic segmentation in artistic paintings, featuring a variety of art styles and compositions. It serves as a valuable reference for evaluating segmentation performance beyond the Thangka domain.
To improve robustness and generalization, we design a data augmentation pipeline that combines general-purpose and domain-specific strategies. The general augmentations include random grayscale conversion, color jittering, Gaussian blur, random occlusion, horizontal flipping, scaling, and rotation. These operations simulate common variations such as lighting changes, sensor noise, and viewpoint diversity. In addition, we introduce augmentations tailored to the unique characteristics of Thangka images. Since many historical works suffer from degradation (e.g., pigment loss, stains, and cracks), we simulate such patterns during training. Moreover, to account for domain shifts in background tones across different artistic schools, we employ CycleGAN-based style transfer to convert complex colorful backgrounds into simplified variants such as pure red, black, or blue. Finally, we apply a multi-scale cropping and pasting strategy, where resized patches are reinserted into the original image to enrich spatial hierarchies and enhance foreground–background separation. These tailored augmentations significantly improve the model’s ability to handle the visual complexity and degradation present in Thangka art.
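As an illustration of the general-purpose branch, a torchvision-based pipeline might look like the following. The parameter values and the 512-pixel crop size are placeholders, the degradation simulation and CycleGAN style transfer are separate domain-specific steps, and in practice the geometric transforms must be applied jointly to the image and its mask (omitted here for brevity).

```python
import torchvision.transforms as T

# General-purpose augmentation branch (photometric and geometric operations on the image side only).
general_augment = T.Compose([
    T.RandomGrayscale(p=0.1),
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.RandomResizedCrop(size=512, scale=(0.7, 1.0)),
    T.ToTensor(),
    T.RandomErasing(p=0.25),   # random occlusion; applied after ToTensor since it operates on tensors
])
```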
4.4. Comparative Experiments
To comprehensively validate the effectiveness and generalization capability of our method, we compare it with several representative segmentation approaches, including diffusion-based baselines (SegDiff [56] and MedSegDiffv2 [46]), as well as support-guided or prompt-based paradigms such as Painter [57], SegGPT [58], Matcher [59], and VLP-SAM [60]. The evaluation is conducted on two representative datasets: the domain-specific Thangka dataset constructed by us and the public art-oriented benchmark ArtBench.
We adopt standard segmentation evaluation metrics, mIoU, Dice coefficient, and mAcc, to provide a holistic assessment of each model’s performance. As shown in Table 1, our method consistently achieves the best results across all metrics and datasets. On the Thangka dataset, our model achieves 88.3% mIoU, 94.2% Dice, and 90.0% mAcc, outperforming the strongest baseline (VLP-SAM) by 6.1%, 4.9%, and 4.7%, respectively. On the ArtBench dataset, our model achieves 86.1% mIoU, 92.4% Dice, and 87.4% mAcc, again surpassing all comparison methods by a clear margin. These results validate the robustness and generalizability of our segmentation framework, especially in the challenging context of complex artistic images.
To further investigate model behavior, we report per-class IoU and macro-AUC scores on the Thangka dataset in Table 2. This finer-grained evaluation provides two key insights. First, our method achieves the most pronounced gains in challenging categories such as halo and lotus (improvements of +8.1 and +8.9 IoU over the strongest baseline, respectively), confirming that SPIRIT is particularly effective in handling fine-grained and structurally complex regions. Background segmentation remains comparable to the baselines, which demonstrates that the improvements mainly arise from addressing difficult symbolic components rather than easier classes. Second, the additional AUC metric, computed from pixel-wise probability maps, further validates the effectiveness of our approach. Our method achieves the highest macro-AUC (0.962), indicating that it not only produces accurate masks after thresholding, but also maintains strong discriminative capability across varying decision thresholds. Together, these results reinforce the robustness of SPIRIT both in overall segmentation performance and in fine-grained, classification-oriented evaluation.
In addition to quantitative results, we provide qualitative comparisons of segmentation performance across different models, as shown in Figure 5. The examples are selected from the Thangka dataset, representing typical compositional and stylistic characteristics of Tibetan Buddhist art. Painter and SegGPT tend to produce coarse masks that fail to tightly fit object boundaries, often including background artifacts or missing fine details, especially in regions with color similarities or intricate ornaments. Matcher offers relatively sharper outlines, but still struggles with semantic confusion in complex areas. VLP-SAM improves visual alignment, but still exhibits occasional over-segmentation, particularly in regions with overlapping symbolic elements. In contrast, our method generates precise and smooth masks that are better aligned with the underlying visual structures. It preserves the integrity of the figure contours and background separation, handles ambiguous boundaries more accurately, and shows consistent semantic parsing across different compositions. Notably, our method maintains mask coherence in challenging zones such as halos, hands, and base platforms, which are often overlooked by other models.
These visual results complement the quantitative findings, demonstrating the advantage of our framework in capturing fine-grained details and structure-aware semantics in art-domain segmentation.
4.5. Ablation Study
To assess the effectiveness of our proposed modules, we conduct a series of ablation experiments, focusing on three key components: the Symbolic-Guided Attention Field (SGAF), the Symmetry-aware Affinity Module (SyAM), and the Text-guided Augmentation and Refinement module (TAR). The quantitative results are summarized in Table 3.
Starting from the baseline, which yields 77.8% mIoU and 89.3% Dice, we first analyze the impact of each module individually. Incorporating SGAF leads to a noticeable improvement of +3.2 mIoU and +2.3 Dice, confirming that integrating symbolic priors into the denoising process helps to recover fine-grained object boundaries and improves spatial awareness. When applying SyAM alone, the model gains +2.4 mIoU, showing that enforcing symmetry consistency—especially for inherently symmetric Thangka structures—enhances the overall mask coherence and reduces deformation artifacts. Adding TAR independently contributes a +3.1 mIoU gain, suggesting that leveraging external textual semantics for augmentation and refinement can enrich the representation of rare or complex patterns that are otherwise difficult to learn from visual features alone.
We then explore the joint effects of combining these modules. The integration of SGAF and SyAM achieves 82.8% mIoU, reflecting that symbolic attention and symmetry constraints are complementary in guiding the generation process. The combination of SGAF and TAR further improves performance to 83.2%, indicating that symbolic priors can be effectively reinforced by textual cues during denoising. SyAM and TAR together result in 82.5% mIoU, suggesting that structural and semantic regularities jointly provide a strong inductive bias when appearance cues are insufficient. Finally, integrating all three modules yields the best performance across all metrics: 88.3% mIoU, 94.2% Dice, 90.0% mean accuracy, and 92.6% pixel accuracy. Compared to the baseline, this configuration brings an overall improvement of +10.5 mIoU and +4.9 Dice, validating that our proposed components are mutually reinforcing and contribute to enhanced segmentation precision from semantic, structural, and generative perspectives.
To evaluate the contribution of our LLM-based caption rewriting and attribute extraction strategy, we conducted a dedicated ablation on the Text-guided Augmentation and Refinement (TAR) module. As reported in Table 4, both sub-components provide measurable gains over the baseline model: caption rewriting improves mIoU from 82.8% to 84.5%, while attribute extraction further increases performance to 85.1%. When both are jointly applied, the model achieves the best performance (88.3% mIoU and 94.2% Dice), confirming that rewriting enhances textual diversity whereas attribute extraction provides explicit structural cues, and their combination yields complementary benefits. This analysis verifies the individual and synergistic contributions of TAR to segmentation accuracy.
We also analyzed the robustness of the Symmetry-aware Affinity Module (SyAM) under different prior strengths, controlled by the prior weight in Equation (21). As shown in Table 4, a weak prior and a strong prior both yield suboptimal results (86.0% and 86.5% mIoU, respectively) compared to the medium setting (88.3% mIoU). This trend indicates that insufficient prior weighting fails to enforce structural regularity, whereas overly strong constraints reduce flexibility in non-symmetric regions. The medium prior provides the best balance between structural guidance and adaptive fitting, confirming the robustness of SyAM in handling symmetrical yet diverse Thangka compositions.
4.6. Visual Error Analysis
To further evaluate the performance of our segmentation model, we visualize the confusion matrix across the five semantic categories. This provides a more detailed understanding of how different classes are distinguished and where errors predominantly occur. The confusion matrix is shown in Figure 6.
From the visualization, it can be observed that the model achieves relatively high accuracy on dominant categories such as Background and Figure, as indicated by strong diagonal responses. However, confusions occur between visually similar classes, for example, Lotus and Backlight, which often share overlapping color distributions and fine-grained textures. Another common confusion appears between Halo and Backlight due to their co-occurrence in similar spatial regions around the central figure.
These misclassifications highlight the challenges in segmenting fine-grained symbolic elements in Thangka paintings. While the backbone model provides robust general recognition, improvements may be achieved by incorporating additional structural priors or relation-aware modeling to better distinguish semantically close categories.
In addition to the confusion matrix analysis, we further examine specific failure cases where the model struggles with unusual or challenging Thangka styles, as illustrated in Figure 7. The upper example depicts a modern-style Thangka with highly saturated flat colors and minimal texture. Although the visual boundaries appear clear to the human eye, the lack of shading and texture cues significantly hinders the model’s ability to differentiate between adjacent semantic regions. In particular, large homogeneous areas are often misclassified as the “figure” class, especially when the color similarity between the figure and surrounding elements such as the seat or background is high. Moreover, the uniform color palette causes the model to confuse visually adjacent elements like the halo and the backlight, leading to inaccurate segmentation around the figure’s silhouette.
In contrast, the lower case shows an ancient, severely degraded Thangka image with notable pigment loss, faded details, and low contrast. In this scenario, the model faces two major challenges: first, the conventional low-level features—such as color, edges, and contrast—that typical segmentation models rely on are severely diminished; second, the artistic inconsistencies and deterioration introduce irregular deformations, occlusions, and noise. These conditions make it difficult for the model to locate precise boundaries, often resulting in misclassification between semantically close regions such as “backlight” and “figure” or between “halo” and “background”.
These two examples represent edge cases of over-simplified and extremely degraded inputs, respectively, and they highlight the limitations of the current model’s generalization ability. The first case reveals the model’s dependence on texture cues for disambiguation, while the second underscores its vulnerability to quality degradation and style variance. Addressing these issues may require incorporating style-aware augmentation strategies, degradation modeling tailored to historical art, or symmetry-based priors that can better support segmentation under such challenging conditions.
5. Conclusions
Our proposed diffusion-based segmentation framework demonstrates strong performance on both Thangka and ArtBench datasets, effectively identifying symbolic regions such as figures, halos, lotus bases, and backlights. This superiority arises from the iterative refinement process inherent in diffusion models, which allows for the better preservation of structure and detail compared to traditional CNN- or Transformer-based methods. By integrating minimal priors such as prompt guidance and symmetry refinement, the model achieves high accuracy without reliance on large-scale annotations. Despite these strengths, certain limitations remain.
As shown in the visual error analysis, the model performs less reliably on flat-colored modern Thangka due to a lack of internal texture cues, and on degraded historical Thangka where pigment loss and noise obscure semantic boundaries. These issues expose sensitivity to style shifts and visual degradation. Although trained with limited supervision, the model still depends on well-selected examples, which may restrict its performance in diverse real-world scenarios. Potential solutions include domain adaptation or multi-style augmentation. Furthermore, while evaluation metrics such as mAcc and mIoU provide useful indicators, they may not fully reflect perceptual correctness or cultural relevance. Minor boundary shifts can reduce scores, even when the segmentation remains visually and semantically acceptable.
Future work may consider incorporating expert-informed metrics to better capture the nuanced requirements of heritage segmentation tasks. In addition, we plan to investigate texture-invariant feature extractors and degradation-robust priors, which can reduce the dependence on fine-grained surface textures and improve segmentation under style variations or deterioration. Incorporating multi-scale frequency-domain cues such as wavelet or Fourier representations may further strengthen robustness against pigment loss and noise. Overall, our results confirm the potential of diffusion models in this domain while pointing toward directions for enhancing robustness, generalization, and interpretability.