Dunhuang Mural Style Transfer Using Vision Mamba: In-Context Prompting and Physically Motivated HSV Modulation

Qin, Peijun; Liu, Long; Wang, Hongjuan; Ma, Siyuan; Chen, Cui; Han, Zixuan; Cheng, Mingzhi

doi:10.3390/electronics15081578

Open AccessArticle

Dunhuang Mural Style Transfer Using Vision Mamba: In-Context Prompting and Physically Motivated HSV Modulation

by

Peijun Qin

,

Long Liu

,

Hongjuan Wang

,

Siyuan Ma

,

Cui Chen

,

Zixuan Han

and

Mingzhi Cheng

^*

School of New Media, Beijing Institute of Graphic Communication, Beijing 102600, China

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(8), 1578; https://doi.org/10.3390/electronics15081578

Submission received: 18 February 2026 / Revised: 28 March 2026 / Accepted: 4 April 2026 / Published: 9 April 2026

(This article belongs to the Special Issue AI-Driven Image Generation: Algorithms, Architectures, and Applications)

Download

Browse Figures

Versions Notes

Abstract

Digital stylization of Dunhuang murals can support cultural heritage revitalization by transferring their distinctive aesthetics to modern images, but existing methods face practical limitations. Transformer-based models can yield high visual quality, but often at a prohibitive computational cost. In contrast, standard state space models (SSMs) are more efficient but tend to incur issues such as semantic loss, inconsistent stylization, and an undesired coupling between color and structure when processing the complex textures of historical murals. To address these issues, we propose Dh-Mamba, a hierarchical visual Mamba framework tailored for high-fidelity Dunhuang mural style transfer. Dh-Mamba introduces a CrossMamba in-context style injection mechanism. This mechanism prefixes the style token sequence to the content sequence, which enables globally consistent style propagation as a persistent memory and retains linear-time efficiency. We also designed two additional components: a Modulated Style Perception Module (Δt) and an Orthogonal Decoupled HSV Modulator. The former adaptively regulates texture injection based on style complexity. The latter models mineral pigment palettes and mitigates oxidation-related artifacts by disentangling hue, saturation, and value. Experiments on a custom Dunhuang dataset show that Dh-Mamba improves content preservation and produces more natural mural textures than recent state-of-the-art methods; multiple quantitative metrics corroborate these gains. With 20.04 million parameters, Dh-Mamba provides a resource-efficient solution suitable for deployment in resource-constrained terminal applications for cultural heritage preservation

Keywords:

style transfer; Dunhuang murals; visual mamba; State Space Models (SSM); cultural heritage preservation

1. Introduction

Image style transfer (IST) aims to render a content image in the artistic appearance of a reference image while preserving the underlying semantic structure. This technique has considerable potential for artistic creation, augmented reality, and cultural heritage preservation [1,2]. As a cornerstone of Silk Road civilization, the Dunhuang murals are globally renowned for their distinctive artistic aesthetics. As illustrated in Figure 1, these murals emphasize the depiction of essential elements and predominantly employ vibrant cinnabar and viridian hues to enhance decorative color expression. This approach stands in sharp contrast to the pursuit of color realism that characterizes much of Western art [3,4]. Related studies in Chinese-art computing have also explored traditional Chinese painting classification, ink-and-wash style transfer, and decorative pattern analysis, highlighting the value of domain-specific priors for modeling Chinese visual forms and textures [5,6,7].

Nevertheless, due to the complexity of the painting techniques (such as fluid iron-wire line drawing and heavy mineral pigment shading) and the distinctive artistic characteristics, current arbitrary style transfer methods face substantial challenges on the Dunhuang mural dataset. Early convolutional neural network (CNN)-based approaches, such as AdaIN [8] and SANet [9], achieve real-time transfer by aligning feature statistics. However, these methods are constrained by the local receptive fields of their convolution kernels. This makes it difficult for them to capture the long-range narrative structures found in murals, often resulting in ambiguous semantics or structural degradation [10]. Moreover, these CNN-based methods typically perform patch-wise color transfer, which can produce location-inconsistent outputs for pixels with identical values, leading to severe color distortion and difficulty balancing red–green hue distributions [11,12].

To mitigate locality, Transformer-based methods have been introduced for style transfer [13,14]. Methods such as StyTr2 [13] and QuantArt [15] leverage self-attention to enable global interactions between content and style features, thereby improving structural consistency. However, Transformers exhibit a clear computational bottleneck: the complexity scales quadratically with image resolution, i.e.,

O (N^{2})

[16], and patch-based tokenization can introduce high-frequency grid artifacts [17]. Consequently, memory consumption and inference latency become prohibitive for high-resolution digital restoration of Dunhuang murals [18].

In recent years, structured state space models (SSMs), particularly Mamba, have offered a promising alternative for addressing the efficiency–quality trade-off because of their linear-time complexity O(N) and strong long-sequence modeling capability [19,20,21]. Vision Mamba [22] and VMamba [23] adapt SSMs to visual tasks through bidirectional or two-dimensional selective scanning. However, directly applying standard Mamba to artistic style transfer remains limited in three aspects. First, existing visual Mamba variants are primarily designed for discriminative tasks, and their state-transition parameters (e.g., Δt) are conditioned only on the input image, lacking the ability to perceive target-style complexity; this leads to poor control over the granularity of style injection [24]. Second, the recursive update in standard SSMs can exhibit a “forgetting” effect on long sequences, making it difficult to explicitly retrieve style features as attention-based mechanisms do [25]. Third, for the characteristic “form persisting with color fading” deterioration observed in Dunhuang murals (e.g., lead-red oxidation and binder aging), conventional feature fusion may spuriously couple color and structure, causing unrealistic greening or tonal contamination artifacts [26,27].

In summary, the main contributions of this work are organized hierarchically as follows:

Core Framework: We propose Dh-Mamba, a novel hierarchical visual Mamba framework tailored for high-fidelity cultural heritage style transfer. By replacing quadratic attention with linear-time in-context prompting (CrossMamba), it achieves globally consistent style propagation while maintaining high computational efficiency ( $O (N)$ ).
Key Mechanisms: Within this framework, we design a style-aware dynamic $Δ t$ modulation strategy. By dynamically regulating the memory horizon of the state-space model based on target texture complexity, the network achieves adaptive control over the granularity and continuity of the generated strokes.
Physically Motivated Design: Motivated by the multidimensional physical degradation of Dunhuang murals (e.g., pigment oxidation), we introduce an Orthogonal Decoupled HSV Modulator. By enforcing orthogonal regularization in the latent space, it facilitates the disentanglement of color and structure attributes, mitigating oxidation-related artifacts and accurately reproducing authentic mineral pigment palettes without structural contamination.

2. Related Work

2.1. CNN-Based and GAN-Based Style Transfer

Early research on style transfer primarily focused on convolutional neural networks (CNNs) and generative adversarial networks (GANs). CycleGAN [28] introduced cycle consistency loss, enabling translation between image domains in the absence of paired training data, thereby establishing itself as a classical baseline for unsupervised style transfer. However, because CycleGAN lacks explicit modeling of style features, it often struggles to generate fine texture details. To address this limitation, F-LSeSim [29] proposed a contrastive learning framework based on Functional Linear Self-Similarity, significantly enhancing the quality of texture detail synthesis by imposing local and global self-similarity constraints within the feature space. Although the aforementioned CNN-based methods have achieved progress in texture representation, the inherent local receptive field of convolution kernels limits their capacity to perceive the global structure of images. When processing images such as Dunhuang murals, which exhibit complex spatial layouts and long-range semantic dependencies, these methods often struggle to preserve overall structural coherence, resulting in the distortion or loss of content semantics.

2.2. Transformer-Based Style Transfer

To overcome the locality limitations of CNNs, Transformer-based methods employ the self-attention mechanism to introduce global context modeling capabilities. AdaAttN [30] redesigns the attention mechanism by computing shallow and deep correlations between content and style features, thereby enabling more precise alignment of style feature distributions. AesPA-Net [31], an aesthetic-aware attention module, is further introduced, extracting aesthetic features via multi-scale discrete wavelet transform, aiming to preserve the aesthetic attributes of artworks during style transfer. StyTr2 [13] A pure Transformer-based dual-stream encoder architecture was devised for the first time, separately performing long-range modeling of content and style sequences, and generating high-quality stylized images through a multi-layer decoder. Although Transformer demonstrates excellent performance in maintaining structural consistency, its core limitation lies in computational efficiency: the computational complexity of the self-attention mechanism increases quadratically with image resolution (

O (N^{2})

). This causes a significant rise in memory footprint and inference latency when processing high-resolution Dunhuang murals restoration tasks, restricting its applicability on resource-constrained devices.

2.3. State Space Models for Style Transfer

Recently, Structured State Space Models (SSMs), especially the Mamba architecture, have emerged as a new paradigm in visual tasks due to their linear computational complexity (O(N)) and global receptive field. Following this trend, several recent efforts have begun exploring Mamba’s potential in diverse stylization tasks. For example, text-conditioned style transfer methods such as ClipStyler [32] and StyleMamba [33] demonstrate that conditional guidance can effectively steer stylization; StyleMamba further utilizes a conditional SSM for highly efficient text-driven image style transfer. Prior to our submission, SaMam [20], Mamba-ST [21], and the recently proposed StyMam [34] represented the primary state-of-the-art implementations of SSMs specifically designed for arbitrary image style transfer. Mamba-ST pioneered this by demonstrating the effectiveness of SSMs through simple feature fusion. SaMam further advanced the field by proposing a style-aware Mamba decoder and introducing a Zigzag scanning mechanism to enhance spatial neighborhood correlations, achieving highly efficient texture generation. Similarly, StyMam attempts to solve the artifacts in stylized images by introducing a residual dual-path strip scanning mechanism and spatial attention to jointly capture local and global dependencies.

However, there are fundamental mechanistic differences between our Dh-Mamba and these existing works, particularly the highly efficient SaMam and StyMam. First, regarding style injection, most aforementioned methods (including Mamba-ST, SaMam, and StyMam) rely on continuous spatial feature fusion. This still faces the “style forgetting” challenge, as the standard SSM’s recursive update naturally discards information from earlier sequence steps. Dh-Mamba fundamentally reformulates stylization as an in-context learning task via the CrossMamba module, prepending style tokens as a “visual prompt” to establish a persistent global memory without decay. Second, regarding domain adaptation, existing methods lack physical priors. While SaMam and StyMam improve geometric texture correlations and spatial dependencies, they systematically neglect the unique physical degradation factors of historical artworks, such as pigment oxidation and color-structure coupling. Dh-Mamba explicitly addresses this limitation through our Orthogonal HSV Modulator, physically disentangling color attributes to achieve authentic mineral tone restoration.

3. Methodology

3.1. Prerequisite Knowledge: Discretized State-Space Model

The Structured State Space Model (SSM) employs a hidden state

h (t) \in ℝ^{N}

to map a one-dimensional input sequence

x (t) \in ℝ

to the output

y (t) \in ℝ

. This system is governed by a linear ordinary differential equation (ODE):

h^{'} (t) = A h (t) + B x (t), y (t) = C h (t)

(1)

where A, B, and C are evolution parameters. To process digital image signals, the continuous system is discretized using the zero-order hold (ZOH) method and a time-scale parameter.

Δ

The discretized state transition matrix is:

\bar{A} = e x p (Δ A), \bar{B} = {(Δ A)}^{- 1} (e x p (Δ A) - I) \cdot Δ B

(2)

Through this discretization, the system can be computed efficiently in a recursive manner.

h_{t} = \bar{A} h_{t - 1} + \bar{B} x_{t}

Parameters

Δ

play a crucial role in this process: larger

Δ

values signify that the system places greater emphasis on the current input (high-frequency information), whereas smaller

Δ

The value permits the hidden state

h_{t}

to retain longer-term historical information (low-frequency/global information). Within our framework,

Δ

it is redefined as a key variable controlling the texture generation granularity.

3.2. Overview

As illustrated in the macro view of Figure 2, the proposed Dh-Mamba framework employs a hierarchical U-Net architecture specifically designed for high-fidelity artistic style transfer. This process handles the content image

I_{c}

and style image

I_{s}

through separate encoding paths and integrates their features within a multi-scale decoder to generate the final stylized image

I_{c s}

.

3.3. CrossMamba: In-Context Style Injection

Traditional Transformer methods rely on Cross-Attention to compute similarity matrices (Attention(Q,K,V)), whose complexity is

O (N^{2})

. To leverage the linear complexity of SSM,

O (N)

and overcome the “forgetting” issue in long sequences, we propose the CrossMamba mechanism as illustrated in Figure 3a. Inspired by In-Context Learning in large language models (LLM) [35], we model style transfer as a sequence continuation task.

First, to enable SSM to explicitly distinguish heterogeneous feature sources, we introduce learnable modal embeddings (Token Type Embeddings).

E_{t y p e} \in ℝ^{2 \times C}

, we flatten the content feature maps

F_{c} \in ℝ^{C \times H \times W}

and style feature maps

F_{s} \in ℝ^{C \times H \times W}

into visual token sequences.

X_{c}, X_{s}

. Next, we perform prefix concatenation along the sequence dimension:

Z_{i n} = C o n c a t ([X_{s} + E_{t y p e}^{s}, X_{c} + E_{t y p e}^{c}], d i m = 1) \in ℝ^{2 L \times C}

(3)

Subsequently, we input the concatenated sequence

Z_{i n}

into the Mamba block. Denote

x_{t}

For the sequence

Z_{i n}

The feature vector at time t, based on the discrete recursive property of SSM, and the hidden state

h_{t}

evolves as follows:

h_{t} = {\begin{matrix} \bar{A} h_{t - 1} + \bar{B} x_{t}, 0 < t \leq L (S t y l e P h a s e) \\ \bar{A} h_{t - 1} + \bar{B} x_{t}, 0 < t \leq 2 L (C o n t e n t P h a s e) \end{matrix}

(4)

To further enhance information exchange between both ends of the long sequence and mitigate potential information bias caused by unidirectional recursion, we employ a bidirectional SSM (Bi-directional SSM) strategy. To reduce the directional bias of unidirectional recursion over flattened visual tokens, we employ a bidirectional SSM strategy. The forward branch performs prefix-based style conditioning, while the reverse-order branch provides complementary context aggregation. The two branches are fused only at the output stage, and only the outputs corresponding to the content segment are retained for decoding. Finally, we extract the latter half of the output sequence and add it to the original content features via a residual connection to obtain the final fused feature.

F_{f u s e d}

:

F_{fused} = F_{c} + α \cdot Split (Bi-SSM (Z_{in}))

(5)

Among them,

α

is a learnable scaling factor initialized to a small value to ensure stability during the early stages of training.

3.4. Style-Aware $Δ t$ Dynamics Modulation

In standard Mamba, the step size

Δ

is determined solely by the current input (

Δ = Softplus (Linear (x_{t}))

). We consider that, in style transfer,

Δ

actually controls the system’s ‘Memory Horizon’: a smaller

Δ

allows the system to retain longer contextual dependencies (low-frequency structures), whereas a larger

Δ

enables the system to rapidly update its state to capture instantaneous changes (high-frequency textures).

To this end, we design a style-aware dynamic modulation strategy, detailed in Figure 3b. Specifically, the style encoder used for dynamic

Δ t

modulation is not a pre-trained VGG network. In our implementation, it is an independent StyleMamba encoder trained jointly with the full Dh-Mamba model. Under the default configuration, this encoder adopts the same hierarchical Mamba backbone design as the content encoder, but with separate parameters. Given a style image

I_{s}

, it produces multi-scale style features

\{F_{s}^{1}, \dots, F_{s}^{L}\}

, and the deepest style feature is further compressed into a global style vector

v_{s}

through a lightweight style embedding head:

\{F_{s}^{1}, \dots, F_{s}^{L}\} = E_{s} (I_{s}), v_{s} = G (F_{s}^{L})

(6)

It should be noted that

v_{s}

is not designed as an explicit handcrafted measure of texture complexity. Instead, it serves as a compact global style descriptor learned end-to-end from the stylization objective. This descriptor is used to modulate the recurrent dynamics of SS2D through lightweight linear projections:

Δ t = Δ t_{x} + α W_{d t} (v_{s})

(7)

D = D_{0} + α W_{D} (v_{s})

(8)

where

W_{d t}

and

W_{D}

are linear projections, and

α

is a learnable scaling factor. In implementation, the style-to-

Δ t

and style-to-

D

projections are zero-initialized, so the model starts from a near-unmodulated baseline and gradually learns style-dependent dynamic adjustments during training. This design improves numerical stability and allows

v_{s}

to provide global guidance for adapting the effective memory horizon of the SSM to different mural styles.

3.5. Orthogonal HSV Modulator

Dunhuang murals often exhibit a characteristic degradation pattern in which structural forms remain relatively stable while chromatic attributes are altered by pigment oxidation, binder aging, and chroma attenuation. Therefore, directly mixing style and content features in a single latent space may undesirably couple color transfer with structure modulation, leading to tonal contamination or over-smoothed textures. To address this issue, we introduce a pixel-guided orthogonal HSV modulator (Figure 3c), which decomposes style-conditioned feature modulation into three parallel branches designed to be guided by pixel-space hue (H), saturation (S), and value (V) attributes. Unlike a purely latent-space factorization, our design is explicitly anchored to pixel-space color statistics. Given a style image

I_{s}

, we first convert it into HSV space and extract a compact statistics vector:

q = ϕ_{H S V} (I_{s})

(9)

where

q \in R^{10}

encompasses hue orientation/concentration, saturation mean/variance, value mean/variance, and chroma statistics. These explicit descriptors provide physically meaningful color cues to complement the global style code

z

. To construct branch-specific style conditions, we project

q

into the style latent space. By applying branch masks (

M_{H}, M_{S}, M_{V}

) and a shared projection

P (\cdot)

, the conditions are formulated as:

z_{b a s e} = z + α_{0} P (q)

(10)

z_{b} = z_{b a s e} + α_{b} W_{b} (P (M_{b} ⊙ q)), b \in \{H, S, V\}

(11)

where

W_{b}

are branch-specific linear mappings. This encourages each branch to be primarily guided by its corresponding H/S/V statistics. Given an input feature

F

, each branch predicts a channel-wise gate

g_{b}

and affine parameters

(γ_{b}, β_{b})

. The final modulated feature

F^{'}

is obtained by residual fusion of the branch increments:

Δ F_{b} = g_{b} ⊙ [F ⊙ (γ_{b} - 1) + β_{b}]

(12)

F^{'} = F + \sum_{b \in \{H, S, V\}} λ_{b} Δ F_{b}

(13)

where

λ_{b}

are branch scaling factors that can be selectively activated during branch-isolation analysis. To reduce overlap and encourage complementary specialization among the branches, we introduce an orthogonality regularizer on the gates:

L_{o r t h o} = E [g_{H} ⊙ g_{S}] + E [g_{S} ⊙ g_{V}] + E [g_{H} ⊙ g_{V}] + η L_{b a l a n c e}

(14)

where the expectation terms penalize element-wise overlap, and

L_{b a l a n c e}

maintains comparable activation magnitudes to help avoid branch collapse.

3.6. Loss Functions

To achieve high-fidelity structural preservation and texture restoration, we adopt a multi-scale joint optimization strategy. The total loss function is defined as follows:

L_{t o t a l} = λ_{c} L_{c o n t e n t} + λ_{s} L_{s t y l e} + λ_{w} L_{w a v e l e t} + λ_{o} L_{o r t h o} + λ_{t v} L_{t v}

(15)

The individual components are defined as:

Perceptual Content Loss ( $L_{c o n t e n t}$ ): Computing the Euclidean distance between the feature maps of the generated image $I_{c s}$ and the content image $I_{c}$ at the relu4_1 layer of the VGG-19 network:

$L_{c o n t e n t} = ‖ ϕ r e l u 4_1 (I_{c s}) - ϕ r e l u 4_1 (I_{c}) ‖_{2}^{2}$

(16)
Style Loss ( $L_{s t y l e}$ ): Matching the Gram matrix statistics of multi-layer VGG features:

$L_{s t y l e} = \sum_{l \in L} ‖ G (ϕ_{l} (I_{c s})) - G (ϕ_{l} (I_{c})) ‖_{F}^{2}$

(17)

Among them, $G (\cdot)$ Denotes the Gram matrix computation, $G (F) = \frac{1}{C H W} F F^{T}$ .
Wavelet Texture Loss ( $L_{w a v e l e t}$ ): To eliminate checkerboard artifacts and enhance high-frequency details, we introduce the Haar wavelet transform (DWT). The image is decomposed into the low-frequency LL and high-frequency sub-bands {LH, HL, HH}, with constraints applied only to the high-frequency components:

$L_{w a v e l e t} = \sum_{k \in L H, H L, H H} ‖ G (D W T (I_{c s})_{k}) - G (D W T (I_{s})_{k}) ‖_{F}^{2}$

(18)

4. Experiments

4.1. Implementation Details

To evaluate the proposed Dh-Mamba framework, we constructed a custom dataset specifically for Dunhuang mural style transfer. The data sources primarily include publicly available digital image archives and exclusive high-resolution mural data provided by the Dunhuang Academy. Considering the importance of landscape and natural scenery elements in Dunhuang art research and digital restoration, we implemented a targeted semantic filtering strategy during data collection. Specifically, we selected 3266 real-world photographs containing natural textures (e.g., mountains and rocks) as content images. Due to the limited availability of high-resolution Dunhuang mural data, we carefully cropped a smaller set of original mural images into local patches, resulting in 1417 style reference images. This yields a total dataset of 4683 individual images rather than fixed pairs. At the implementation level, the content and style images are stored separately in the trainA and trainB directories. All input images are uniformly resized and cropped to 256 × 256. During the training phase, content images are sequentially sampled from the dataset, while style images are randomly sampled to maximize style diversity. For the dataset split, we randomly allocated 90% of the images for training and 10% for testing and evaluation. During the evaluation phase, natural ordering is employed for one-to-one pairing to ensure the reproducibility of experimental results.

In the encoding stage, input images are uniformly resized to 256 × 256, with a Patch Size of 4, an embedding dimension C = 96 for the hierarchical Mamba encoder, and an SSM state dimension of

d_{s t a t e} = 16

. The number of encoder layers is configured to effectively capture multi-scale features, followed by reconstruction through a symmetric multi-scale decoder. For the style-aware dynamic

Δ t

module, the style encoder

E_{s}

is implemented as a hierarchical StyleMamba encoder trained from scratch together with the full model, rather than using a pre-trained VGG backbone. In the default configuration,

E_{s}

uses a patch size of 4 and stage depths [2, 2, 2], producing stage dimensions [96, 192, 384]. The deepest style feature is converted into a 64-dimensional global style vector by a lightweight global style encoder consisting of global average pooling and a 384 → 96 → 64 projection head. To optimize the Dh-Mamba framework, we employed the AdamW optimizer with weight decay set to

1 \times 10^{- 2}

, and training was conducted for a total of 300 epochs. The learning rate scheduling strategy comprises a 5-epoch linear warmup, followed by a cosine annealing schedule, with the learning rate decaying from the initial value of

1 \times 10^{- 4}

to

1 \times 10^{- 6}

. The batch size was set to 8, and training was completed on a single NVIDIA RTX 5090 GPU.

4.2. Evaluation Details

We compared Dh-Mamba qualitatively and quantitatively against six representative state-of-the-art models, covering different architectural paradigms: CycleGAN [28] and F-LSeSim [29] (GAN-based), AdaAttn [30] and AesPA-Net [31] (CNN/Attention-based), StyTr2 [13] (Transformer-based), and Mamba-ST [21] (SSM-based). We specifically introduced Mamba-ST (i.e., directly applying the basic Vision Mamba as a baseline for style transfer) to clearly demonstrate the superiority of our proposed CrossMamba and HSV modulation design over the naive SSM implementation.

To quantitatively evaluate the model performance, we adopted four primary metrics: ArtFID, FID, LPIPS, and CFSD. We did not use the content loss

L_{c o n t e n t}

and style loss

L_{s t y l e}

as evaluation metrics; employing the training objectives as evaluation criteria would introduce cyclic evaluation bias and fail to objectively reflect model performance. Specifically, FID measures the distribution similarity between generated images and the real Dunhuang murals dataset, serving as the primary metric for evaluating style authenticity. LPIPS assesses the perceptual distance between the stylized image and the source content image. ArtFID is a comprehensive metric that is highly correlated with human aesthetic judgment, integrating style fidelity and content preservation; its calculation formula is

FID \times (1 + LPIPS)

. The introduction of CFSD primarily aims to address the limitations of LPIPS regarding texture deviations. In fact, LPIPS depends on the feature space extracted by pretrained networks, which are highly sensitive to local texture variations. Since the essence of style transfer is to modify texture, LPIPS may erroneously penalize reasonable stylization as content loss. CFSD calculates the KL divergence between spatial autocorrelation matrices of feature maps (extracted from the VGG-19 ReLU3_1 layer), concentrating on evaluating the geometric structural layout of images rather than texture appearance, thus effectively avoiding this issue.

4.3. Quantitative Results

As shown in the comparative experiments in Table 1, our method exhibits exceptional performance across multiple key evaluation metrics, underscoring its comprehensive advantages in the Dunhuang mural style transfer task. Style Consistency and Color Fidelity: Firstly, in terms of the ArtFID and FID metrics used to assess style consistency, our method achieved the best results (163.15 and 106.05, respectively), significantly surpassing the Transformer-based StyTr2 and the baseline Mamba-ST models. To further verify the physical basis of this advantage, we visualized the pixel distributions of the generated images in the hue (H), saturation (S), and value (V) channels in Figure 4. Consistent with the outstanding quantitative metrics, the color distribution curve produced by our method (red solid line) demonstrates the highest degree of correspondence with the target Dunhuang murals (black dashed line). Especially in the critical hue channel, Dh-Mamba precisely captures and aligns the mural-specific mineral color gamut peaks (such as the characteristic bands of cinnabar and malachite), whereas the comparative models exhibit noticeable distribution shifts. This result compellingly demonstrates the effectiveness of the orthogonal HSV modulator in our architecture: by enforcing the disentanglement of physical attributes in feature space, the model avoids the color contamination commonly observed in traditional methods, thereby ensuring a high-quality restoration of the unique oxidation tones of Dunhuang murals.

Content awareness and structural preservation: Regarding content-aware metrics such as LPIPS and CFSD, our method achieves highly competitive low scores (0.53 and 0.51, respectively), indicating that the generated images are perceptually closer to the target content while preserving intact geometric structures. This advantage originates from the precise operation of the CrossMamba module during the feature extraction and style injection processes. Distinct from methods that enforce geometric distortion through global statistical information, CrossMamba employs an In-Context Learning mechanism to optimize the alignment of high-level perceptual features, thereby strictly preserving the semantic layout and geometric edges of the content image when applying dense mural textures. This produces generated images with more natural and structurally consistent stylized features.

Computational Efficiency Analysis: A core advantage of Dh-Mamba is its overall linear time complexity

O (N)

with respect to the input resolution (where

N = H \times W

), overcoming the

O (N^{2})

quadratic computational bottleneck typical of Transformer-based methods. To make this efficiency claim concrete, we compared the inference latency of Dh-Mamba against state-of-the-art CNN/Attention methods (AdaAttN, AesPA-Net), a Transformer method (StyTr2), and SSM-based methods (SaMam, Mamba-ST) across multiple image resolutions ranging from

256 \times 256 to 2048 \times 2048

.

As summarized in Table 2, Mamba-based architectures demonstrate a massive speed advantage over traditional methods. While the Transformer-based StyTr2 suffers from explosive computational growth and encounters an Out-of-Memory (OOM) error at

2048 \times 2048

, Dh-Mamba strictly maintains linear scaling—its inference time predictably scales by approximately a factor of four each time the resolution doubles (i.e., when pixel count

N

quadruples). Remarkably, Dh-Mamba achieves the fastest inference time across all models (e.g., 37.21 ms at

256 \times 256

), even slightly outperforming the baseline Mamba-ST. Considering the substantial improvements in image quality, style fidelity, and color distribution alignment (Figure 4), this performance demonstrates that Dh-Mamba effectively exploits the linear complexity advantage, providing a highly efficient and scalable approach for high-resolution digital stylization of Dunhuang art.

4.4. Qualitative Analysis

Qualitative comparison results are presented in Figure 5. It can be concluded that the proposed Dh-Mamba framework achieves visual effects comparable to or surpassing current state-of-the-art models across multiple challenging Dunhuang mural style transfer tasks. Upon examining the experimental results, CycleGAN and F-LSeSim frequently fail to preserve the semantic integrity of the content image. As shown in the last two columns of Figure 5, F-LSeSim demonstrates significant structural distortions when processing water surfaces and architectural contours, and produces unrealistic purple and dark blue artifacts in the first and second rows. Due to the absence of explicit modeling of style features, such models often experience distortion or loss of content semantics when handling the complex spatial layouts of Dunhuang murals. Compared to Transformer and CNN-based methods, StyTR-2, AdaAttN, and AcsPA-Net occasionally face challenges in maintaining color consistency during style application. For instance, in the narrow street scene in the third row, AdaAttN’s generated results show evident tonal degradation and perceptual inconsistencies, failing to restore the purity of cinnabar and malachite present in the mural.

Ultimately, our method represents an optimal balance among current style transfer models. We are capable not only of accurately transferring complex artistic styles but also of maintaining the correct color consistency of the content image, without excessively altering the saturation or contrast of the generated samples. Despite achieving excellent results, the model may still exhibit limitations under certain extreme conditions. As shown in Figure 5, non-uniform blotches or seams occasionally appear within the generated images. One possible reason is that Dh-Mamba inherits certain characteristics of recurrent neural networks (RNNs): due to the state memory mechanism, ensuring contextual continuity between different blotches remains challenging when processing extremely long sequences or complex narrative scenes. Examples of these occasional artifacts are provided in the Supplementary Material for a more comprehensive analysis.

4.5. Ablation Study

4.5.1. Additive Ablation on Core Modules

To verify the effectiveness of each core component in the Dh-Mamba framework, we conducted a series of ablation experiments. The results are shown in Table 3. We adopt the architecture with all innovative modules removed as the baseline model and progressively incorporate Style-Aware

Δ

modulation, HSV modulator, and the CrossMamba mechanism for quantitative evaluation.

Comparing the baseline model to Model A, it is evident that solely introducing Style-Aware

Δ

modulation reduces the ArtFID from 195.38 to 191.07 and the FID from 125.01 to 123.68. This demonstrates that, based on the discretization theory of the state space model (SSM), dynamically adjusting the time step size

Δ

to perceive style complexity can effectively enhance the model’s ability to capture artistic texture granularity. Meanwhile, the improvement in LPIPS (from 0.57 to 0.54) indicates that

Δ

adaptive prediction aids in better maintaining content-aware feature layout during the style injection process.

Model B investigates the independent effect of the HSV modulator. Although the FID metric shows minor fluctuations under a single variable,

Δ

when combined with modulation to form Model (dt + HSV), performance significantly improves, with FID optimized to 119.89. This validates the necessity of enforcing the decoupling of hue, saturation, and luminance spaces through orthogonal constraints. This module addresses the semantic loss commonly caused by color-structure coupling in traditional methods by introducing physical color priors, playing a key role particularly in suppressing artifacts caused by pigment oxidation.

After integrating the CrossMamba context injection mechanism into the full model, all metrics showed significant improvement: ArtFID sharply decreased to 163.15, and FID was reduced by approximately 11.5%, reaching 106.05. This result strongly demonstrates that CrossMamba, by employing a visual prompt-prefixed concatenation strategy, can transmit style features as prior memory losslessly throughout the entire state space. This completely addresses the “style forgetting” problem that standard SSMs typically encounter when processing long sequences, enabling the generated images to maintain exceptionally high style fidelity across the global receptive field.

4.5.2. Fine-Grained Ablation on CrossMamba and Key Sub-Designs

Beyond the additive ablation in Table 3, we further conduct fine-grained ablations on the directional design of CrossMamba, adaptive Δt modulation, orthogonal loss, and modal embedding. The results are summarized in Table 4.

We first examine the directional design of CrossMamba. In our implementation, the forward and backward SSM passes are computed as two separate branches and fused only after the content-feature outputs are obtained. The forward prefix path remains the main route for style conditioning, whereas the backward branch provides complementary reverse-order context for content modeling. In addition, only the content portion of the fused sequence is retained for decoding, rather than writing the updated content state back into the style prefix. As shown in Table 4, Bi-CrossMamba performs better than Uni-CrossMamba, suggesting that bidirectional scanning improves long-range context aggregation and structural consistency. The variant without style-prefix tokens in the backward branch also performs worse than the full model, which suggests that the backward branch mainly provides useful complementary context rather than harming the forward style prior.

We further ablate three related sub-designs. Replacing adaptive Δt modulation with fixed Δt degrades performance, indicating that style-aware dynamic discretization helps the recurrent dynamics better match style complexity. Removing the orthogonal loss also reduces performance, suggesting that the HSV module benefits not only from additional branch capacity, but also from reduced overlap among the H/S/V branches. Finally, removing modal embedding leads to a further performance drop, indicating that explicitly distinguishing content and style tokens is beneficial when the two streams are serialized into a shared SSM sequence.

Overall, the results in Table 4 suggest that the full design is consistently favorable, and that bidirectional scanning, adaptive Δt modulation, orthogonal regularization, and modal embedding each make a positive contribution to the final model.

4.5.3. Validation of Branch-to-Pixel HSV Correspondence

To further examine whether the proposed loss improves branch specialization, we perform a branch-to-pixel HSV sensitivity analysis. Specifically, after training, we activate one branch at a time (H, S, or V) and measure the change in the decoded pixel-space H, S, and V channels relative to a no-branch baseline. As shown in Figure 6, both the raw and row-normalized sensitivity matrices are clearly diagonal-dominant. The normalized dominant responses are 0.942 for Branch H on Pixel H, 0.980 for Branch S on Pixel S, and 0.969 for Branch V on Pixel V, while the off-diagonal responses remain small. These results indicate that the additional physical-mapping loss improves branch-to-pixel HSV correspondence and reduces cross-channel coupling after decoding.

5. Conclusions

This paper proposes Dh-Mamba, a hierarchical visual state-space model framework specifically designed for the digital restoration and high-fidelity style transfer of Dunhuang murals. Existing Transformer-based methods have high computational demands, while standard SSMs suffer from style forgetting and color coupling during long-sequence modeling. To address these challenges, we present an efficient and physically interpretable solution. By introducing the CrossMamba context injection mechanism, we reformulate style transfer as a sequence continuation task, utilizing the linear recursive property of SSM to achieve lossless transmission of global style information in long-range dependencies, thereby fully overcoming the computational bottleneck of traditional methods. Furthermore, our proposed style-aware

Δ t

dynamic modulation strategy and orthogonal HSV modulator, respectively, address granularity control in temporal sampling and physical attribute decoupling in feature space, successfully enabling adaptive capture of complex mural textures and effective suppression of morphological color fading oxidation damage. Crucially, this digital restoration capability directly supports real-world conservation efforts. By generating high-fidelity and physically decoupled visual previews, Dh-Mamba provides art conservators with a non-destructive reference tool to safely evaluate color recovery and structural completion strategies prior to actual physical interventions. Moreover, beyond objective algorithmic metrics, our framework fundamentally aims to revive the human visual appreciation of Dunhuang art, translating severely degraded historical aesthetics into visually accessible and aesthetically resonant forms for modern audiences.

Future work will focus on resolving local texture discontinuities arising from state memory limitations and exploring the extension of Dh-Mamba to higher-resolution (e.g., 4 K/8 K) mural restoration and video style transfer tasks, thereby further unlocking the potential of the Mamba architecture in the domain of computational art.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics15081578/s1, Figure S1: Representative examples of non-uniform blotch artifacts under different content-style pairs; Figure S2: Representative examples of seam artifacts near complex semantic boundaries.

Author Contributions

Conceptualization, P.Q. and L.L.; methodology, P.Q. and L.L.; software, P.Q.; validation, H.W., S.M. and C.C.; formal analysis, P.Q., Z.H. and H.W.; investigation, P.Q., S.M. and C.C.; resources, L.L. and M.C.; data curation, P.Q., S.M. and Z.H.; writing—original draft preparation, P.Q.; writing—review and editing, L.L., M.C. and H.W.; visualization, P.Q.; supervision, L.L. and M.C.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Beijing Institute of Graphic Communication (Grant Nos. Eb202306, HXDK2024052, and HXDK2024133) and the Innovation Project for Digital Preservation of Beijing’s Historical and Cultural Heritage and Art Communication (Grant No. 20190123116).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study cannot be made publicly available due to privacy, copyright, and/or cooperation agreement restrictions. The data are available from the corresponding author upon reasonable request. The source code is also available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IST	Image Style Transfer
CNN	Convolutional Neural Network
GAN	Generative Adversarial Network
SSM	Structured State Space Model
HSV	Hue–Saturation–Value
DWT	Discrete Wavelet Transform
FID	Fréchet Inception Distance
ArtFID	Art Fréchet Inception Distance
LPIPS	Learned Perceptual Image Patch Similarity
VGG	Visual Geometry Group network
LLM	Large Language Model
U-Net	U-shaped Network
GPU	Graphics Processing Unit

References

Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423. [Google Scholar]
Jing, Y.; Yang, Y.; Feng, Z.; Ye, J.; Yu, Y.; Song, M. Neural style transfer: A review. IEEE Trans. Vis. Comput. Graph. 2020, 26, 3365–3385. [Google Scholar] [CrossRef] [PubMed]
Chen, D.; Yuan, L.; Liao, J.; Yu, N.; Hua, G. Stereoscopic neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6654–6663. [Google Scholar]
Cao, Y.; Zhang, Y.; Lin, Y.; Wu, K. Dunhuang art style transfer via hierarchical vision transformer and color consistency constraints. IEEE Trans. Consum. Electron. 2025, 71, 3240–3251. [Google Scholar] [CrossRef]
Ding, Y.; Wang, H.; Liu, N.; Li, T. TCP-RBA: Semi-supervised learning for traditional Chinese painting classification with random brushwork augment. J. Intell. Fuzzy Syst. 2024, 46, 10653–10663. [Google Scholar] [CrossRef]
Liu, N.; Wang, H.; Lu, L.; Ding, Y.; Tian, M. Research on the Influence of Multi-Scene Feature Classification on Ink and Wash Style Transfer Effect of ChipGAN. IEEE Access 2024, 12, 129733–129752. [Google Scholar] [CrossRef]
Chen, C.; Wang, H. Traditional cloud pattern classification algorithm based on semi-supervision with random line augment. Sci. Rep. 2025, 15, 29225. [Google Scholar] [CrossRef] [PubMed]
Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1501–1510. [Google Scholar]
Park, D.Y.; Lee, K.H. Arbitrary style transfer with style-attentional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5880–5888. [Google Scholar]
Zhang, Y.; Li, M.; Li, R.; Jia, K.; Zhang, L. Exact feature distribution matching for arbitrary style transfer and domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8035–8045. [Google Scholar]
Risser, E.; Wilmot, P.; Barnes, C. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv 2017, arXiv:1701.08893. [Google Scholar] [CrossRef]
Zhang, H.; Dana, K. Multi-style generative network for real-time transfer. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
Deng, Y.; Tang, F.; Dong, W.; Ma, C.; Pan, X.; Wang, L.; Xu, C. StyTr2: Image style transfer with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11326–11336. [Google Scholar]
Wu, X.; Hu, Z.; Sheng, L.; Xu, D. StyleFormer: Real-time arbitrary style transfer via parametric style composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14618–14627. [Google Scholar]
Huang, S.; An, J.; Wei, D.; Luo, J.; Pfister, H. QuantArt: Quantizing image style transfer towards high visual fidelity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5947–5956. [Google Scholar]
Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Yoo, J.; Uh, Y.; Chun, S.; Kang, B.; Ha, J.W. Photorealistic style transfer via wavelet transforms. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9036–9045. [Google Scholar]
Chen, Z.; Wang, W.; Xie, E.; Lu, T.; Luo, P. Towards ultra-resolution neural style transfer via thumbnail instance normalization. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 393–401. [Google Scholar]
Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Liu, H.; Wang, L.; Zhang, Y.; Yu, Z.; Guo, Y. SaMam: Style-aware state space model for arbitrary image style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 28468–28478. [Google Scholar]
Botti, F.; Ergasti, A.; Rossi, L.; Fontanini, T.; Ferrari, C.; Bertozzi, M.; Prati, A. Mamba-ST: State space model for efficient style transfer. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; pp. 7797–7806. [Google Scholar]
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Sun, J.; Liu, Y. VMamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
Jeon, M. Adaptive normalization mamba with multi scale trend decomposition and patch moe encoding. arXiv 2025, arXiv:2512.06929. [Google Scholar] [CrossRef]
Chung, J.; Hyun, S.; Heo, J.P. Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 8795–8805. [Google Scholar]
Zhang, Y.; Huang, N.; Tang, F.; Huang, H.; Ma, C.; Dong, W.; Xu, C. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10146–10156. [Google Scholar]
Li, X.; Cao, H.; Zhang, Z.; Hu, J.; Jin, Y.; Zhao, Z. Artistic neural style transfer algorithms with activation smoothing. In Proceedings of the 2025 2nd International Conference on Informatics Education and Computer Technology Applications (IECCT), Kuala Lumpur, Malaysia, 17–19 January 2025; pp. 1–6. [Google Scholar]
Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
Zheng, C.; Cham, T.J.; Cai, J. The spatially-correlative loss for various image translation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16407–16417. [Google Scholar]
Liu, S.; Lin, T.; He, D.; Li, F.; Wang, M.; Li, X.; Sun, Z.; Li, Q.; Ding, E. AdaAttN: Revisit attention mechanism in arbitrary neural style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6649–6658. [Google Scholar]
Hong, K.; Jean, S.; Lee, J.; Ahn, N.; Kim, K.; Lee, P.; Kim, D.; Uh, Y.; Byun, H. AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 22758–22767. [Google Scholar]
Kwon, G.; Ye, J.C. ClipStyler: Image style transfer with a single text condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18062–18071. [Google Scholar]
Wang, Z.; Liu, Z.-S. StyleMamba: State Space Model for Efficient Text-driven Image Style Transfer. arXiv 2024, arXiv:2405.05027. [Google Scholar]
Hong, Z.; Dong, N.; Di, Y.; Xu, X.; Hu, R.; Shao, Y.; Ling, R.; Wang, Y.; Wang, J.; Zhang, Z.; et al. StyMam: A Mamba-Based Generator for Artistic Style Transfer. arXiv 2026, arXiv:2601.12954. [Google Scholar]
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]

Figure 1. Representative Dunhuang murals. Panels (a–d) predominantly use cinnabar and viridian tones, emphasizing decorative color expression and intricate geometric textures, contrasting with Western color realism.

Figure 2. Overall architecture of Dh-Mamba. The framework follows a hierarchical U-Net design with separate content and style encoders. Multi-scale content features are delivered to the decoder through skip connections. The style image provides three forms of guidance: a deep style feature for CrossMamba at the deepest stage, a global style code z for style-aware modulation, and HSV statistics q extracted from the style image for the HSV modulator. The HSV modulator is deployed only in decoder-side DunhuangS7Blocks, including the bottleneck decoder blocks, the post-fusion decoder blocks after Upsampling, and the final texture block.

Figure 3. (a) CrossMamba module. The input content and style features are first embedded with modality information and then concatenated along the sequence dimension, where the style tokens are prepended before the content tokens. The forward SSM branch performs prefix-based style conditioning, enabling the style prefix to serve as prior memory for subsequent content tokens. An additional reverse-order branch is introduced to alleviate the directional bias of linear recurrent scanning and to provide complementary context aggregation. The two branches are fused only at the output stage, and only the outputs corresponding to the content tokens are retained for decoding. (b) Style-aware

Δ t

Dynamic modulation. Improved SSM core mechanism. Unlike the standard Mamba, the step size

Δ t

and parameters

D

are jointly determined by the input and the global style vector

v_{s}

through adaptive discretization to regulate the system’s “Memory horizon” to match the texture granularity of the target style. (c) Orthogonal HSV modulator. Physically perceptual color decoupling module. It employs three parallel style-gated branches to predict the H/S/V channel masks, respectively, ensuring independent modulation of physical attributes via orthogonal constraints, and achieves precise color restoration through affine transformation.

Figure 3. (a) CrossMamba module. The input content and style features are first embedded with modality information and then concatenated along the sequence dimension, where the style tokens are prepended before the content tokens. The forward SSM branch performs prefix-based style conditioning, enabling the style prefix to serve as prior memory for subsequent content tokens. An additional reverse-order branch is introduced to alleviate the directional bias of linear recurrent scanning and to provide complementary context aggregation. The two branches are fused only at the output stage, and only the outputs corresponding to the content tokens are retained for decoding. (b) Style-aware

Δ t

Dynamic modulation. Improved SSM core mechanism. Unlike the standard Mamba, the step size

Δ t

and parameters

D

are jointly determined by the input and the global style vector

v_{s}

through adaptive discretization to regulate the system’s “Memory horizon” to match the texture granularity of the target style. (c) Orthogonal HSV modulator. Physically perceptual color decoupling module. It employs three parallel style-gated branches to predict the H/S/V channel masks, respectively, ensuring independent modulation of physical attributes via orthogonal constraints, and achieves precise color restoration through affine transformation.

Figure 4. Comparison of the chromatic attribute distributions of the stylization results. We calculated pixel histogram distributions for the Hue, Saturation, and Value channels, respectively. The black dashed line represents the distribution of the target style image (Ground Truth). It can be observed that the distribution curve of the method proposed in this paper (red solid line) most closely aligns with the target style, particularly in the Hue channel, accurately reproducing the unique mineral color gamut peaks of the Dunhuang murals. In contrast, comparison models (such as StyTr2 and Mamba-ST) exhibit notable distribution shifts.

Figure 5. The figure presents a visual comparison of Dh-Mamba with six representative state-of-the-art (SOTA) models—including GAN, Transformer, and the foundational Mamba architecture—on the Dunhuang murals style transfer task.

Figure 6. Branch-to-pixel HSV sensitivity analysis after introducing the physical-mapping loss. (a) row-normalized sensitivity matrix. (b) raw sensitivity matrix. Each row corresponds to one isolated feature branch, and each column denotes the response in the decoded pixel-space H, S, or V channel. The clear diagonal dominance indicates that the three branches mainly affect the corresponding pixel-space HSV attributes.

Table 1. Performance comparison between the proposed method and existing state-of-the-art models on the Dunhuang mural style transfer task. Quantitative metrics include ArtFID and FID to measure stylization quality, as well as LPIPS and CFSD to evaluate content structure preservation. Model complexity (GPU memory usage) is also reported. The results demonstrate that Dh-Mamba achieves an optimal balance between style fidelity and structural consistency while maintaining low computational resource consumption. Best results are highlighted in bold, and “↓” indicates that lower values denote better performance.

Model		ArtFID ↓	FID ↓	LPIPS ↓	CFSD ↓	Memory Usage (MiB) ↓
F-LSeSim	GAN	209.01	117.83	0.76	0.57	10.52
CycleGAN	GAN	285.56	172.01	0.66	0.50	11.43
AdaAttN	Transformer	234.12	144.52	0.62	0.53	26.57
AesPA-NET	Transformer	295.74	186.32	0.59	0.52	24.20
StyTr2	Transformer	245.44	155.98	0.58	0.49	35.39
SaMam	Mamba	242.68	156.21	0.56	0.48	18.50
Mamba-ST	Mamba	214.06	142.11	0.51	0.55	14.14
Ours	Mamba	163.15	106.05	0.53	0.51	20.14

Table 2. Quantitative comparison of inference time (ms) across escalating input resolutions. Mamba-based architectures (including ours) maintain strict linear scaling, whereas the Transformer-based StyTr2 suffers from quadratic computational growth, leading to an Out-of-Memory (OOM) error at

2048 \times 2048

. All tests were conducted on a single NVIDIA RTX 5090 GPU. Best results are highlighted in bold.

Table 2. Quantitative comparison of inference time (ms) across escalating input resolutions. Mamba-based architectures (including ours) maintain strict linear scaling, whereas the Transformer-based StyTr2 suffers from quadratic computational growth, leading to an Out-of-Memory (OOM) error at

2048 \times 2048

. All tests were conducted on a single NVIDIA RTX 5090 GPU. Best results are highlighted in bold.

Model	256 × 256 (ms)	512 × 512 (ms)	1024 × 1024 (ms)	2048 × 2048 (ms)
AdaAttN	3193.03	3676.55	4083.25	5360.98
AesPA-Net	3168.75	3576.83	4083.25	5360.98
StyTr2	3243.95	3495.89	5716.82	—
SaMam	42.50	155.15	668.10	3126.91
Mamba-ST	39.74	166.45	659.73	3029.42
Ours	37.21	144.89	599.23	3042.88

Table 3. Ablation analysis of the core innovative components. We progressively introduced style awareness into the baseline model.

Δ t

modulation, HSV modulator, and the CrossMamba mechanism. The results demonstrate that each module contributes positively to the final performance, with CrossMamba producing the most significant improvements in ArtFID and FID metrics. Best results are highlighted in bold, and “↓” indicates that lower values denote better performance.

Table 3. Ablation analysis of the core innovative components. We progressively introduced style awareness into the baseline model.

Δ t

modulation, HSV modulator, and the CrossMamba mechanism. The results demonstrate that each module contributes positively to the final performance, with CrossMamba producing the most significant improvements in ArtFID and FID metrics. Best results are highlighted in bold, and “↓” indicates that lower values denote better performance.

Model	Style-Aware Δ	HSV Modulator	CrossMamba	ArtFID ↓	FID ↓	LPIPS ↓	CFSD ↓	MiB ↓
Baseline	False	False	False	195.38	125.01	0.57	0.52	18.17
Model A (dt)	True	False	False	191.07	123.68	0.54	0.52	18.47
Model B (HSV)	False	True	False	204.10	131.66	0.55	0.51	18.37
Model (dt + HSV)	True	True	False	186.15	119.89	0.53	0.51	19.57
Full Model	True	True	True	163.15	106.05	0.52	0.50	20.14

Table 4. Fine-grained ablation on CrossMamba and related sub-designs. We compare the full Bi-CrossMamba model with No-CrossMamba, Uni-CrossMamba, a variant without style-prefix tokens in the backward branch, and variants without dynamic Δt, orthogonal loss, or modal embedding. Best results are highlighted in bold, and “↓” indicates that lower values denote better performance.

	Configuration	ArtFID ↓	FID ↓	LPIPS ↓	CFSD ↓
A	Bi-CrossMamba	163.15	106.05	0.5269	0.5089
B	No-CrossMamba	172.76	111.42	0.5493	0.5154
C	Uni-CrossMamba	167.96	108.75	0.5437	0.5127
D	w/o Style Prefix in Backward Pass	171.61	112.59	0.5357	0.5166
E	w/o Dynamic Δt	164.89	107.97	0.5411	0.5135
F	w/o Orthogonal Loss	184.29	118.42	0.5545	0.5318
G	w/o Modal Embedding	184.78	121.02	0.5283	0.5236

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Qin, P.; Liu, L.; Wang, H.; Ma, S.; Chen, C.; Han, Z.; Cheng, M. Dunhuang Mural Style Transfer Using Vision Mamba: In-Context Prompting and Physically Motivated HSV Modulation. Electronics 2026, 15, 1578. https://doi.org/10.3390/electronics15081578

AMA Style

Qin P, Liu L, Wang H, Ma S, Chen C, Han Z, Cheng M. Dunhuang Mural Style Transfer Using Vision Mamba: In-Context Prompting and Physically Motivated HSV Modulation. Electronics. 2026; 15(8):1578. https://doi.org/10.3390/electronics15081578

Chicago/Turabian Style

Qin, Peijun, Long Liu, Hongjuan Wang, Siyuan Ma, Cui Chen, Zixuan Han, and Mingzhi Cheng. 2026. "Dunhuang Mural Style Transfer Using Vision Mamba: In-Context Prompting and Physically Motivated HSV Modulation" Electronics 15, no. 8: 1578. https://doi.org/10.3390/electronics15081578

APA Style

Qin, P., Liu, L., Wang, H., Ma, S., Chen, C., Han, Z., & Cheng, M. (2026). Dunhuang Mural Style Transfer Using Vision Mamba: In-Context Prompting and Physically Motivated HSV Modulation. Electronics, 15(8), 1578. https://doi.org/10.3390/electronics15081578

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Dunhuang Mural Style Transfer Using Vision Mamba: In-Context Prompting and Physically Motivated HSV Modulation

Abstract

1. Introduction

2. Related Work

2.1. CNN-Based and GAN-Based Style Transfer

2.2. Transformer-Based Style Transfer

2.3. State Space Models for Style Transfer

3. Methodology

3.1. Prerequisite Knowledge: Discretized State-Space Model

3.2. Overview

3.3. CrossMamba: In-Context Style Injection

3.4. Style-Aware Δ t Dynamics Modulation

3.5. Orthogonal HSV Modulator

3.6. Loss Functions

4. Experiments

4.1. Implementation Details

4.2. Evaluation Details

4.3. Quantitative Results

4.4. Qualitative Analysis

4.5. Ablation Study

4.5.1. Additive Ablation on Core Modules

4.5.2. Fine-Grained Ablation on CrossMamba and Key Sub-Designs

4.5.3. Validation of Branch-to-Pixel HSV Correspondence

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.4. Style-Aware $Δ t$ Dynamics Modulation