Robust Image Watermarking via Clustered Visual State-Space Modeling

Liu, Bo; Ren, Jianhua

doi:10.3390/app16094166


by Bo Liu and Jianhua Ren *

School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4166; https://doi.org/10.3390/app16094166

Submission received: 23 March 2026 / Revised: 18 April 2026 / Accepted: 20 April 2026 / Published: 24 April 2026

(This article belongs to the Special Issue Advanced Pattern Recognition & Computer Vision, 2nd Edition)


Abstract

Most existing DNN-based image watermarking methods adopt an “encoder–noise–decoder” paradigm, where the watermark is typically replicated and expanded in a straightforward manner and then directly fused with image features, which limits robustness under complex distortions. Although Transformers improve fusion via attention mechanisms, their quadratic computational complexity makes high-resolution processing prohibitively expensive. To address these issues, we propose CCViM, a robust watermarking framework built on Vision Mamba, which leverages the linear-complexity property of state-space models (SSMs) to enable efficient global interactions. We design a Watermark Representation Learning Module (WRLM) that performs hierarchical feature extraction and structured expansion of the watermark through cascaded VSS blocks, yielding semantically rich and perturbation-resistant watermark representations. In addition, we introduce an Interwoven Fusion Enhancement Module (IFEM), which employs a CCS6 structure to treat the watermark as a dynamic guidance signal. By combining contextual clustering with the Mamba mechanism, IFEM deeply interweaves the watermark into host features at both local and global levels. Experiments on COCO, DIV2K, and ImageNet demonstrate that CCViM consistently improves imperceptibility, robustness, and efficiency to varying degrees, and remains stable and high quality under attacks such as JPEG compression, cropping, and Gaussian blur.

Keywords:

image watermarking; Vision Mamba; state-space models; contextual clustering

1. Introduction

With the explosive growth of AI-generated content (AIGC), digital visual content has become an indispensable component of today’s Internet ecosystem [1,2]. As a cornerstone of digital asset protection, robust image watermarking embeds an imperceptible digital signature into a host image, playing a critical role in copyright authentication, content provenance, and deepfake deterrence [3,4]. Consequently, deploying robust watermarking in real-world industrial pipelines, such as automated deepfake provenance tracking and copyright protection across complex social media networks, has become a pressing necessity [5,6]. These practical applications demand models capable of surviving highly destructive, platform-specific channel noise and severe social network processing operations. Furthermore, emerging industrial requirements, such as tracing digital assets across complex multi-source composite environments [7] and ensuring robust copyright preservation in commercial media distribution [8], impose even stricter demands on the real-world survivability of watermarking frameworks.

In recent years, with the continued advances in deep neural networks, researchers have increasingly adopted deep learning-based models to investigate imperceptible watermark embedding. This trend is motivated by the strong performance of encoder–decoder (END) architectures in pixel-level information hiding, and by the fact that digital images inherently exhibit rich textures and semantic structures [9,10]. Although convolutional neural networks (CNNs) have achieved notable progress in image watermarking, they often rely on predefined and relatively simple convolutional layers to fuse the watermark with the host image—primarily to model high- and low-frequency components. Such designs, however, tend to struggle against complex geometric distortions and various signal-processing operations. Because watermark embedding is inherently semantic-dependent and involves intricate feature interactions, existing approaches still face the following issues and challenges:

(1) During watermark preprocessing and representation learning, the resulting features often exhibit high redundancy and limited semantic content. Conventional “simple replication” schemes merely expand watermark features along the feature/channel dimension, without introducing structured depth, and thus overlook intrinsic characteristics of the watermark to be embedded. Such unstructured representations make it difficult to learn a compact yet inherently resilient feature space. In addition, host image features frequently contain high-frequency noise that is not useful for embedding. Consequently, the model cannot adapt its embedding behavior to local texture complexity and becomes particularly fragile under sophisticated signal-processing attacks.

(2) At the feature-fusion stage between the watermark and the host image, existing methods are constrained by insufficient deep interaction and limited computational efficiency. Channel-wise concatenation is essentially a rigid “stitching” operation; without contextual semantic coupling, the watermark cannot be organically integrated into image textures, and is therefore prone to failure after geometric distortions. Although Transformer-based approaches strengthen long-range dependencies via self-attention, their quadratic complexity severely hinders real-time deployment on high-resolution images. Overall, current models do not adequately balance efficiency with interaction depth, resulting in an unfavorable trade-off between robustness and imperceptibility.

In 2024 and 2025, watermarking techniques for AI-generated content (AIGC) have surged [11,12]. These state-of-the-art methods achieve remarkable robustness by embedding fingerprints directly into the initial noise or latent space during the diffusion generation process. However, they are strictly generation-bound and cannot be applied to protect pre-existing, arbitrary digital assets. For the post-generation copyright protection of general images, the END-based blind watermarking framework remains the most versatile and indispensable paradigm. The remaining challenge is how to achieve the global interaction capabilities of Transformers in this framework without incurring their prohibitive quadratic computational costs.

To address the above issues, we propose CCViM, a robust watermarking framework built upon a full state-space model. CCViM introduces a Watermark Representation Learning Module (WRLM) to comprehensively model the hierarchical characteristics of watermark information, thereby dynamically producing a structured representation with rich semantics [13,14]. Building on this design, we develop an Interwoven Fusion Enhancement Module (IFEM) centered on Vision Mamba [15], which jointly leverages context clustering [16] and selective scanning to enable deep cross-feature interactions while reducing computational cost. Finally, within each semantic group, the selective scanning mechanism fuses watermark and image features in an adaptive manner, capturing both fine-grained local textures and global semantic consistency, which substantially improves watermark survivability under harsh channel conditions. This design also reflects a clear methodological distinction from existing architectures. Compared with Transformer-based watermarking models, CCViM avoids the quadratic computational cost while still enabling global feature modeling. In addition, unlike conventional visual state-space models that follow fixed and content-independent scanning paths, the proposed framework introduces context clustering to guide state-space modeling toward more stable texture regions, making it better suited to the robust image watermarking task. The main contributions of this work are summarized as follows:

(1) To overcome the limitations of existing watermarking models in balancing computational efficiency and interaction depth, we develop CCViM, a full state-space-based watermarking framework. By exploiting linear-time complexity, CCViM effectively alleviates the performance bottleneck encountered when processing high-resolution images.
(2) We propose a Watermark Representation Learning Module (WRLM), which replaces naive dimension replication with cascaded visual state-space blocks, transforming watermark signals into structured features with intrinsic resilience and substantially improving robustness against signal-processing attacks.
(3) We design an Interwoven Fusion Enhancement Module (IFEM) together with a context-clustering-based feature grouping strategy, enabling deep watermark–image fusion while effectively balancing local adaptivity and global consistency.
(4) Experiments on three benchmarks (COCO [17], DIV2K [18], and ImageNet [19]) demonstrate that CCViM achieves good performance in both imperceptibility and robustness.

2. Related Work

2.1. Deep Learning-Based Watermarking Architectures

Deep learning-based image watermarking methods can be broadly categorized into autoencoder-driven end-to-end architectures and architectures based on invertible neural networks. End-to-end models (END), such as HiDDen [20] and ReDMark [21], are typically trained jointly with an encoder, a noise layer, and a decoder to achieve robustness against various signal distortions. Although these methods perform well in simulating image-processing attacks via adversarial training and specialized noise layers such as MBRS [22], the weak coupling between the encoder and decoder often makes it difficult to deeply fuse host image features with watermark information. In contrast, invertible neural networks (INNs) mitigate this coupling issue by sharing parameters between forward and inverse transformations. Representative models include De-END [23] and approaches based on invertible noise layers (INL) [24]. Overall, existing studies mainly emphasize training strategies or high-level architectural refinements; their performance ceiling is still constrained by the representational capacity of the underlying backbone for complex visual features.

Furthermore, recent advancements in 2024 have extensively explored AIGC provenance. Methods such as Tree-Ring watermarks [11] and EditGuard [12] embed robust signals seamlessly within the generative latent space. While representing the latest competitive trends, these latent-space methods target fundamentally different application scenarios compared to post-generation arbitrary image watermarking. For the latter, improving the spatial-frequency representation capacity of the END backbone, similar to the 2024 SOTA WFormer [13], remains the most critical research trajectory.

2.2. Visual Backbones

The evolution of visual backbones is pivotal to the feature-extraction capacity of watermarking systems, reflecting a paradigm shift from convolutional networks [25] to Transformers. Convolutional neural networks (CNNs), exemplified by ResNet [26], laid the foundation for early vision tasks due to their strong ability to capture local spatial patterns. In contrast, the Vision Transformer (ViT) [27,28] leverages self-attention to overcome the locality of convolutions, enabling effective modeling of global dependencies in images. However, ViT’s quadratic computational complexity leads to severe resource bottlenecks when processing high-resolution images. To address this issue, state-space models (SSMs) such as S4 and Mamba [29,30] introduce selective scanning (S6), enabling content-aware long-range dependency modeling while preserving linear-time complexity. While SSMs open a promising route toward efficient visual representation, directly extending them to 2D imagery must still confront a key challenge: spatial non-causality.

2.3. Visual State-Space Models

Visual state-space models (visual SSMs) have increasingly become a focal point of research in computer vision, owing to their ability to balance long-range dependency modeling with computational efficiency. As a pioneering effort to adapt the Mamba architecture to the vision domain, ViM employs a bidirectional scanning strategy to aggregate contextual information across image sequences. Vmamba [31] further introduced the 2D Cross-Scan Module (SS2D) [32,33], which significantly enhances the modeling of complex visual patterns while preserving horizontal and vertical spatial structures. This advancement has inspired a variety of subsequent applications, such as augmenting the capacity for long-range feature capture within U-Net architectures [34]. Nevertheless, the inherent limitations of fixed scanning strategies have prompted subsequent models, such as LocalMamba [35], to incorporate local scanning mechanisms that capture local features and complex spatial relationships more effectively. To address these limitations, we propose a clustering-based visual state-space approach. Specifically, we treat the “visual states” generated by Mamba as a robust feature field and employ adaptive clustering to identify stable state clusters. By further integrating this with the local scanning mechanism depicted in Figure 1, we achieve effective watermark embedding while maintaining high robustness. Beyond high-level semantic tasks, recent 2024 work such as MambaIR [36] has demonstrated the efficacy of SSMs in low-level, pixel-dense vision tasks like image restoration. This evidence motivates our adoption of a clustering-based visual state-space to capture stable, perturbation-resistant feature manifolds for dense watermark embedding.

3. Methodology

Let the original watermark be $W_{0} \in \{0, 1\}^{L}$ with length $L$, and the host image be $I_{c} \in \mathbb{R}^{H \times W \times 3}$, where $H \times W$ denotes the spatial resolution. As illustrated in Figure 2, the proposed CCViM adopts a classic Encoder–Noise–Decoder (END) adversarial generation framework. To better understand the architecture shown in Figure 2, the overall information flow can be divided into three stages. First, the watermark and the host image are processed separately to extract their initial features. Second, the two feature streams are fed into the IFEM, where context clustering and Mamba-based selective scanning are used to fuse the watermark information with robust texture regions of the image. Finally, the fused features are reconstructed into the encoded image, and the decoder correspondingly recovers the embedded watermark. The encoder $E$, parameterized by $\theta_{E}$, fuses the watermark $W_{0}$ into the host image $I_{c}$ to produce the encoded watermarked image $I_{e}$. During training, the noise layer $N$ stochastically applies different differentiable approximations of distortions to $I_{e}$, yielding the perturbed image $I_{n}$ for robustness-oriented training. The decoder $D$, parameterized by $\theta_{D}$, attempts to recover the watermark $W_{e}$ from $I_{n}$. Meanwhile, the discriminator $A$, parameterized by $\theta_{A}$, is trained to distinguish whether the input is the original host image $I_{c}$ or the generated watermarked image $I_{e}$. The functionality of each component is described as follows.

3.1. CCViM Framework

(1) The encoder $E$ aims to embed the watermark $W_{0}$ into the host image $I_{c}$ in an imperceptible yet robust manner. To this end, we design a homogeneous end-to-end architecture built entirely upon state-space models, whose core consists of a Watermark Representation Learning Module (WRLM) and our proposed Interwoven Fusion Enhancement Module (IFEM). Specifically, the original watermark $W_{0}$ is first fed into WRLM, which is composed of $K_{1}$ cascaded VSS blocks and an upsampling layer, to extract the watermark feature map $F_{w} \in \mathbb{R}^{H' \times W' \times C}$. Meanwhile, the host image $I_{c}$ is passed through a 3 × 3 convolution to obtain the initial image feature map $F_{c} \in \mathbb{R}^{H' \times W' \times C}$. Then, $F_{c}$ and $F_{w}$ are jointly fed into the core IFEM. Through the Context-Clustered Selective Scan (CCS6)-based “guide–backbone” mechanism, the watermark features $F_{w}$ dynamically guide the image features $F_{c}$ to perform interwoven enhancement at both the local level (context clustering) and the global level (state-space evolution), producing a highly cooperative feature representation $F_{s} \in \mathbb{R}^{H' \times W' \times C}$ that aligns image content with watermark information. Finally, $F_{s}$ is refined by a deep VSS-Block encoder; via a residual connection, it is added to the original image $I_{c}$ and followed by a 3 × 3 convolution to generate the final encoded image $I_{e} \in \mathbb{R}^{H \times W \times 3}$. The encoder is trained by updating $\theta_{E}$ to minimize the perceptual discrepancy between $I_{e}$ and $I_{c}$, with the loss function $L_{E}$ defined as follows:

$$L_{E} = \| I_{e} - I_{c} \|_{1} = \| E(\theta_{E}, I_{c}, W_{0}) - I_{c} \|_{1} \tag{1}$$

(2) During training, the noise layer $N$ applies a series of differentiable, parameterized image distortions to the encoded image $I_{e}$ to emulate real-world attacks, thereby producing the perturbed image $I_{n}$. Feeding these perturbed images to the decoder encourages the overall framework to learn noise-insensitive and highly robust feature representations.

(3) The decoder $D$ aims to accurately recover the original watermark from the potentially severely distorted image $I_{n}$. To preserve the symmetry of the encoding–decoding process and the homogeneity of information flow, $D$ adopts an architecture that strictly mirrors the WRLM in the encoder. Specifically, we first apply a 3 × 3 convolution to $I_{n}$ for feature extraction. The features are then hierarchically restored and refined via $K_{2}$ cascaded VSS blocks and a downsampling layer (with $K_{1} = K_{2}$), yielding the decoded feature map $F_{d} \in \mathbb{R}^{h \times w \times C}$. Finally, a 3 × 3 convolution projects $F_{d}$ to a single-channel signal, which is reshaped to obtain the recovered watermark bitstream $W_{e} \in \{0, 1\}^{L}$. The decoder is trained by updating $\theta_{D}$ to minimize the prediction error between the original watermark $W_{0}$ and the recovered watermark $W_{e}$; the decoding loss $L_{D}$ is defined as follows:

$$L_{D} = \mathrm{BCE}(W_{0}, W_{e}) = \mathrm{BCE}(W_{0}, D(\theta_{D}, I_{n})) \tag{2}$$

where $\mathrm{BCE}(\cdot)$ denotes the binary cross-entropy loss function.

(4) The discriminator $A$ serves as a visual quality assessor. By distinguishing the encoded image $I_{e}$ from the original host image $I_{c}$, it drives the encoder $E$ to produce high-fidelity images whose statistical characteristics are indistinguishable from real images. The discriminator $A$ consists of four convolutional layers followed by an average pooling layer. Acting as the adversary of the encoder $E$, $A$ is trained by updating $\theta_{A}$ to detect that $I_{e}$ is generated, with the discriminator loss $L_{Dis}$ defined as follows:

$$L_{Dis} = \mathbb{E}_{I_{c}} [-\log(A(\theta_{A}, I_{c}))] + \mathbb{E}_{I_{e}} [-\log(1 - A(\theta_{A}, I_{e}))] \tag{3}$$

On the other hand, the encoder $E$ updates its parameters $\theta_{E}$ using the adversarial loss $L_{Adv}$, so as to generate more realistic images:

$$L_{Adv} = \mathbb{E}_{I_{e}} [-\log(A(\theta_{A}, E(\theta_{E}, I_{c}, W_{0})))] \tag{4}$$

(5) Joint Training: CCViM is optimized via collaborative training of the encoder–decoder–discriminator. The overall objective consists of an image distortion loss $L_{E}$, an adversarial loss $L_{Adv}$, and a decoding loss $L_{D}$, formulated as

$$L_{total} = \lambda_{E} L_{E} + \lambda_{Adv} L_{Adv} + \lambda_{D} L_{D} \tag{5}$$

where $\lambda_{E}, \lambda_{Adv}, \lambda_{D}$ are the weighting coefficients that balance image quality, imperceptibility, and watermark extraction accuracy during training.
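The loss terms in Equations (1)–(5) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: images and watermarks are assumed to be flattened into lists, the L1 term is taken as the plain norm (a sum, not a mean), BCE and the adversarial terms are averaged over their inputs, and the function names are hypothetical. The default weights are the values reported in Section 4.4.

```python
import math

def l1_loss(encoded, cover):
    """Eq. (1): L1 norm of the difference between I_e and I_c (flattened)."""
    return sum(abs(e - c) for e, c in zip(encoded, cover))

def bce_loss(bits, probs, eps=1e-12):
    """Eq. (2): mean binary cross-entropy between W_0 and decoder outputs."""
    total = 0.0
    for b, p in zip(bits, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(b * math.log(p) + (1 - b) * math.log(1.0 - p))
    return total / len(bits)

def discriminator_loss(scores_cover, scores_encoded, eps=1e-12):
    """Eq. (3): -log A(I_c) and -log(1 - A(I_e)), each averaged over a batch."""
    real = sum(-math.log(max(s, eps)) for s in scores_cover) / len(scores_cover)
    fake = sum(-math.log(max(1.0 - s, eps)) for s in scores_encoded) / len(scores_encoded)
    return real + fake

def adversarial_loss(scores_encoded, eps=1e-12):
    """Eq. (4): mean of -log A(I_e) over discriminator scores for encoded images."""
    return sum(-math.log(max(s, eps)) for s in scores_encoded) / len(scores_encoded)

def total_loss(l_e, l_adv, l_d, lam_e=3.0, lam_adv=1e-4, lam_d=10.0):
    """Eq. (5): weighted sum of the three encoder-side loss terms."""
    return lam_e * l_e + lam_adv * l_adv + lam_d * l_d
```

With these weights the decoding term dominates the objective, while the adversarial term contributes only a small regularizing signal.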

3.2. Watermark Representation Learning Module

The WRLM serves as the cornerstone of the overall watermark generation network. Its primary goal is to lift the original one-dimensional binary watermark $W_{0}$ into a two-dimensional feature map $F_{w}$ with rich spatial structure and strong semantic representation. To achieve this, WRLM adopts a hierarchically stacked design of visual state-space blocks (VSS blocks). The overall architecture comprises $K_{1}$ cascaded stages, each consisting of one VSS block and a pixel-shuffle upsampling layer. This construction satisfies the following size constraint:

$$L = h \times w = \frac{H}{2^{K_{1}}} \times \frac{W}{2^{K_{1}}} \tag{6}$$

where $L$ denotes the watermark length, while $(h, w)$ and $(H, W)$ denote the initial and final feature map resolutions, respectively. To ensure symmetry and information preservation in the encoding–decoding pipeline, the number of downsampling steps in the decoder $D$, denoted by $K_{2}$, must strictly match the upsampling steps $K_{1}$.
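As a quick check of the size constraint in Equation (6), the small helper below (a hypothetical function, not part of the paper) searches for the number of ×2 stages that maps the final resolution back to a grid holding exactly $L$ bits. It assumes each stage halves both spatial dimensions.

```python
def upsampling_stages(height, width, payload_bits):
    """Return (K1, h, w) satisfying L = (H / 2**K1) * (W / 2**K1), per Eq. (6)."""
    k = 0
    h, w = height, width
    while h * w > payload_bits:
        if h % 2 or w % 2:
            raise ValueError("resolution is not divisible by a further factor of 2")
        h, w = h // 2, w // 2
        k += 1
    if h * w != payload_bits:
        raise ValueError("no integer K1 satisfies Eq. (6) for these sizes")
    return k, h, w

print(upsampling_stages(128, 128, 64))  # (4, 8, 8)
```

For the experimental setting of Section 4.2 (128 × 128 images, 64-bit payload) this gives $K_{1} = 4$ with an 8 × 8 initial grid, which is consistent with the value $N_{sym} = 4$ selected in Section 4.4.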

At each stage, the watermark feature map is deeply processed by the VSS block. The core of the VSS block is the selective state-space model (S6), which performs structured global modeling of watermark features via an efficient 2D selective scanning mechanism (SS2D). In our setting, the VSS block focuses on self-learning within a single modality. The dynamic parameters $\Delta, B, C$ in the state-space formulation are entirely derived from the input watermark features, and the resulting recursion can be interpreted as an efficient, content-aware self-attention mechanism:

$$\begin{aligned} h_{t} &= \bar{A}(\Delta)\, h_{t-1} + \bar{B}(\Delta, B)\, x_{t} \\ y_{t} &= C\, h_{t} \end{aligned} \tag{7}$$
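The recurrence in Equation (7) can be illustrated with a minimal one-dimensional scan. The sketch below rests on assumptions the text does not spell out: a diagonal state matrix $A$ with negative entries, the discretization $\bar{A} = \exp(\Delta A)$ and $\bar{B} = \Delta B$ commonly used in Mamba-style models, and scalar inputs. The function name and argument layout are hypothetical.

```python
import math

def selective_scan(xs, deltas, Bs, Cs, A):
    """Run h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C_t . h_t over a sequence.

    xs:     list of T scalar inputs
    deltas: list of T positive step sizes (input-dependent)
    Bs, Cs: lists of T vectors of length N (input-dependent)
    A:      list of N negative reals (diagonal state matrix; an assumption)
    """
    n = len(A)
    h = [0.0] * n
    ys = []
    for x, d, B, C in zip(xs, deltas, Bs, Cs):
        a_bar = [math.exp(d * a) for a in A]   # discretized state transition
        b_bar = [d * b for b in B]             # first-order input discretization
        h = [a_bar[i] * h[i] + b_bar[i] * x for i in range(n)]
        ys.append(sum(C[i] * h[i] for i in range(n)))
    return ys
```

The loop visits each position once, so the cost grows linearly with sequence length. Because `deltas`, `Bs`, and `Cs` vary per step, the same routine also covers the case where these parameters are produced from a separate guiding signal rather than from the input itself.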

3.3. Interwoven Fusion Enhancement Module

The IFEM is responsible for the core task of deep fusion and collaborative enhancement between host image features and watermark features, and its detailed workflow is illustrated in Figure 3. The key to this module lies in its interleaved processing paradigm: IFEM first refines the watermark features $F_{w}$ using a lightweight convolutional stack to produce a dynamic guiding signal $X_{w}$, while treating the host image features $F_{c}$ as the trunk sequence. Both are then fed into the CCS6 module, where feature enhancement and watermark–image fusion are jointly realized in a coordinated manner at both local and global scales. This process can be formulated as

$$X_{w} = \mathrm{DWConv}_{3 \times 3}(\mathrm{Conv}_{1 \times 1}(\mathrm{LayerNorm}(F_{w}))) \tag{8}$$

$$F_{s} = \mathrm{CCS6}(F_{c}, X_{w}) \tag{9}$$

Internally, CCS6 comprises two parallel processing branches that are deeply modulated by the guiding signal $X_{w}$, responsible for feature interleaving and evolution at the local and global scales, respectively.
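To make the composition in Equation (8) concrete, the following sketch chains the three operations on a small H × W × C feature map stored as nested lists. It is a simplified illustration under stated assumptions: LayerNorm is applied over the channel axis without learnable scale or shift, the convolutions have no bias, and the depthwise convolution uses zero padding. All function names and the weight layouts are hypothetical.

```python
import math

def layer_norm(feat, eps=1e-5):
    """Normalize each pixel's channel vector to zero mean and unit variance."""
    out = []
    for row in feat:
        new_row = []
        for px in row:
            mean = sum(px) / len(px)
            var = sum((v - mean) ** 2 for v in px) / len(px)
            new_row.append([(v - mean) / math.sqrt(var + eps) for v in px])
        out.append(new_row)
    return out

def conv1x1(feat, weight):
    """Per-pixel linear map; weight[o][i] mixes input channel i into output o."""
    return [[[sum(w_o[i] * px[i] for i in range(len(px))) for w_o in weight]
             for px in row] for row in feat]

def dwconv3x3(feat, kernels):
    """Depthwise 3x3 convolution with zero padding; kernels[c] is a 3x3 list."""
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * C for _ in range(W)] for _ in range(H)]
    for y in range(H):
        for x in range(W):
            for c in range(C):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < H and 0 <= xx < W:
                            acc += kernels[c][dy + 1][dx + 1] * feat[yy][xx][c]
                out[y][x][c] = acc
    return out

def guiding_signal(f_w, weight, kernels):
    """Eq. (8): X_w = DWConv3x3(Conv1x1(LayerNorm(F_w)))."""
    return dwconv3x3(conv1x1(layer_norm(f_w), weight), kernels)
```

The 1 × 1 convolution mixes channels at each position, while the depthwise 3 × 3 convolution mixes neighbouring positions within each channel, so the stack stays lightweight compared with a full 3 × 3 convolution.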

(1) Guided Weaving of Local Context: At the local scale, we adopt a context clustering (CC) branch for feature processing. In this branch, a local window of the trunk image features $F_{c}$ is treated as an unordered set of feature points $P$. Based on semantic similarity, these points are dynamically assigned to a set of learnable cluster centers $C(X_{w})$ derived from the guiding signal $X_{w}$. Feature aggregation within each cluster is performed via a similarity function $\mathrm{sim}(\cdot)$, which is likewise modulated by $X_{w}$. The aggregated feature $g_{j}$ is computed as:

$$g_{j} = \frac{1}{|C_{j}|} \sum_{p_{i} \in C_{j}} \sigma(\alpha \cdot \mathrm{sim}(p_{i}, c_{j}, X_{w}) + \beta) \cdot f(p_{i}) \tag{10}$$

where $C_{j}$ denotes the $j$-th cluster, $c_{j}$ is its center, $p_{i}$ is a feature point in the cluster, $\alpha, \beta$ are learnable parameters, $\sigma$ is the Sigmoid function, and $f(\cdot)$ denotes a feature transformation function.

Intuitively, the contextual clustering mechanism serves as a content-aware guidance strategy for watermark embedding. Instead of uniformly injecting watermark information over the full spatial domain, it adaptively groups pixels with similar semantic textures. This enables the model to reduce embedding perturbations in sensitive flat regions, where visual artifacts are more easily introduced, while integrating the watermark into complex and relatively robust texture regions. Consequently, the proposed mechanism contributes to both improved visual imperceptibility and stronger robustness against local structural attacks.

(2) Dynamic Trajectory Modulation of Global Dependencies: At the global scale, the module adopts a Vmamba-style state-space model (SSM) branch. The trunk image features $F_{c}$ are unfolded along multiple directions into a sequence $x_{t}$, which is then processed recurrently by the S6 state-space model. The key novelty is that the state-space parameters, namely the discretization step $\Delta$, the selective input matrix $B$, and the output matrix $C$, are generated from the guiding signal $X_{w}$:

$$(\Delta, B, C) = \phi(\mathrm{LayerNorm}(X_{w})) \tag{11}$$

Here, $\phi$ denotes a lightweight feed-forward network. The watermark-conditioned parameters are further used to construct the selective state-space matrices $\bar{A}(\Delta)$ and $\bar{B}(\Delta, B)$, thereby precisely steering the processing of the image sequence $x_{t}$:

$$\begin{aligned} h_{t} &= \bar{A}(\Delta)\, h_{t-1} + \bar{B}(\Delta, B)\, x_{t} \\ y_{t} &= C\, h_{t} \end{aligned} \tag{12}$$
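The cluster aggregation of the local branch, Equation (10), can be sketched as below. Several details are assumptions made only for illustration, since the text does not fix them: the similarity is taken as cosine similarity between a point and a center, each point is assigned to its single most similar center, $f(\cdot)$ is the identity, and the centers are passed in directly (in the paper they are derived from the guiding signal $X_{w}$). Function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def aggregate_clusters(points, centers, alpha=1.0, beta=0.0):
    """Return one aggregated feature g_j per center, following Eq. (10).

    Each point joins the center it is most similar to (hard assignment);
    empty clusters yield a zero vector.
    """
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        sims = [cosine(p, c) for c in centers]
        j = max(range(len(centers)), key=lambda k: sims[k])
        weight = sigmoid(alpha * sims[j] + beta)
        for d in range(dim):
            sums[j][d] += weight * p[d]   # f(p_i) taken as the identity
        counts[j] += 1
    return [[s / counts[j] for s in sums[j]] if counts[j] else [0.0] * dim
            for j in range(len(centers))]
```

Points that closely match their center receive a weight near the upper end of the sigmoid range, so each aggregated feature is dominated by the most representative members of its cluster.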

4. Experiments

4.1. Datasets

Our model is trained and evaluated on three large-scale public datasets: COCO [17], DIV2K [18], and ImageNet [19]. Training is primarily conducted on COCO, and the trained model is then tested on all three datasets. Specifically, for COCO, we randomly selected 10,000 images for training and an additional 5000 images for testing. For DIV2K, we used 100 images sampled from the validation set as the test set. For ImageNet, we randomly selected 5000 images for testing.

4.2. Experimental Settings and Metrics

The proposed CCViM is implemented in PyTorch 2.0.1 with Python 3.9. All training and testing experiments are conducted on a single NVIDIA GeForce RTX 4070 GPU. All cover images are resized to 128 × 128 × 3, and the embedded watermark payload is fixed at 64 bits. During end-to-end training, the Adam optimizer is used with a batch size of 16 and a weight decay of 1 × 10⁻⁴. The initial learning rate is set to 1 × 10⁻⁴ and is adjusted using a cosine annealing schedule over 100 training epochs. Network weights are initialized with the Kaiming initialization strategy. For reproducibility, the primary baseline results are generated with a fixed random seed of 42, while a predefined sequence of distinct seeds is used for the multiple independent runs in the statistical significance analysis.

To enhance robustness under complex real-world conditions, we adopt a mixed-attack training strategy: for each mini-batch, we randomly sample one distortion from a candidate set comprising JPEG compression (quality factor Q = 50), cropping (Crop, r = 0.035), cropout (Cropout, p = 0.3), and Gaussian blurring, and apply it to the watermarked images. This strategy encourages the model to learn more generalizable, robust features.

To validate the effectiveness of CCViM and ensure fair comparisons with other methods, we report two widely used watermarking metrics: peak signal-to-noise ratio (PSNR) and bit accuracy (BA). To assess the statistical stability of the proposed model, independent runs with five different random seeds are conducted for key evaluation scenarios, and the corresponding results are reported as mean ± standard deviation (SD). To confirm the significance of the reported improvements, a two-tailed independent-samples t-test is performed between CCViM and the most competitive baseline (WFormer). A p-value below 0.05 is considered to indicate a statistically significant difference.
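For reference, the two reported metrics can be computed as in the sketch below. It assumes 8-bit pixel values (peak 255) given as flattened lists and a 0.5 decision threshold on the decoder output, neither of which is stated explicitly in the text. The function names are hypothetical.

```python
import math

def psnr(cover, encoded, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((c - e) ** 2 for c, e in zip(cover, encoded)) / len(cover)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)

def bit_accuracy(original_bits, decoded_probs, threshold=0.5):
    """Percentage of payload bits recovered correctly after thresholding."""
    decoded = [1 if p >= threshold else 0 for p in decoded_probs]
    correct = sum(int(o == d) for o, d in zip(original_bits, decoded))
    return 100.0 * correct / len(original_bits)
```

With a 64-bit payload, each wrongly decoded bit lowers the per-image BA by 100/64 = 1.5625 percentage points, which helps put sub-percent differences in average BA into perspective.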

4.3. Baselines

To validate the effectiveness of CCViM, we compare it against three categories of baselines: classical methods, architecture-enhanced methods, and network structure-based methods. The evaluated baselines are summarized as follows:

HiDDen [20]: A pioneering END framework that performs adversarial training by simulating attacks with differentiable noise layers.
TSDL [37]: A two-stage decoupled training framework that tackles real, non-differentiable attacks by freezing the encoder and training the decoder separately.
MBRS [22]: A mini-batch training strategy that mixes real and simulated JPEG compression to specifically improve robustness against JPEG attacks.
Fang [38]: Extends TSDL to a three-stage training pipeline and introduces mask-guided frequency enhancement to withstand stronger real-world distortions.
De-END [23]: A novel “decoder-driven” design that reduces redundant feature embedding by tightening encoder–decoder coupling.
WFormer [13]: A Transformer-based soft-fusion model that exploits self-attention and cross-attention to capture long-range correlations between images and watermarks, thereby improving the fusion process.

4.4. Discussion and Analysis of Parameters

The final performance of the proposed CCViM model relies on the trade-off between imperceptibility and robustness, governed by the loss weights $\lambda_{E}, \lambda_{Adv}, \lambda_{D}$. To compel the model to prioritize learning robust features during the early stages of training, we fix the decoding-loss weight $\lambda_{D}$ and the adversarial-loss weight $\lambda_{Adv}$ (the adversarial term serves as a regularizer) at 10 and 0.0001, respectively. We then analyze the impact of the image-distortion-loss weight $\lambda_{E}$. As shown in Table 1, as $\lambda_{E}$ is decreased from 5 to 3, the PSNR drops by merely 0.12 dB, whereas the average bit accuracy (BA) improves by approximately 0.72%. This indicates that moderately relaxing fidelity constraints facilitates the embedding of watermarks into more robust features. However, if $\lambda_{E}$ is further reduced to 1, the PSNR falls by over 3.5 dB, while the BA increases by only 0.29%. This illustrates that excessively sacrificing image quality yields diminishing returns in robustness. We therefore set $(\lambda_{E}, \lambda_{Adv}, \lambda_{D}) = (3, 0.0001, 10)$.

The number of core VSS blocks within the framework is critical to overall model performance. Guided by the principle of asymmetric design, we investigated the optimal combination of the number of encoder modules, denoted as $N_{enc}$, and the number of decoder and WRLM modules, denoted as $N_{sym}$. The experimental results are shown in Table 2. With $N_{sym} = 4$ fixed, increasing $N_{enc}$ from 4 to 8 yields a significant improvement in both average BA and PSNR. Under this configuration, the model maintains a BA exceeding 97% even under severe JPEG compression (QF = 40), indicating that a deeper encoder is capable of learning complex textures to enhance watermark robustness. However, further increasing $N_{enc}$ to 10 introduces additional computational overhead without any pronounced performance gain. Subsequently, with $N_{enc} = 8$ fixed, empirical results verified that $N_{sym} = 4$ constitutes the optimal configuration. Our proposed model therefore adopts an asymmetric architecture with $N_{enc} = 8$ and $N_{sym} = 4$.

Furthermore, we analyzed the sensitivity of the model to the number of training epochs. As shown in Table 3, using varying levels of JPEG compression as a representative benchmark, both imperceptibility and robustness improve steadily with additional epochs. The model reaches its optimal balance at 100 epochs, achieving an average BA of 98.94% and a PSNR of 44.52 dB. Extending the training to 120 epochs causes a slight decline in the average BA, indicating the onset of overfitting. Thus, 100 epochs is set as the optimal training duration.

4.5. Comparison Under Distortion-Specific Training

The detailed experimental results are reported in Table 4. Furthermore, Figure 4 provides visualizations of the watermarked images under different noise attacks, and Figure 5 presents a robustness comparison. Beyond the overall visual quality in Figure 4, the residual comparison in Figure 6 further illustrates the imperceptibility of different methods. To better visualize the embedded watermark traces, the absolute residual maps between the cover and encoded images are computed and magnified. As shown in Figure 6a, HiDDen produces dense grid-like artifacts. Figure 6b shows that MBRS exhibits noticeable block-like residuals, while Figure 6c indicates that De-END leaves scattered noise-like patterns, especially in high-frequency regions. By contrast, Figure 6d demonstrates that CCViM presents a sparser and weaker residual distribution with better spatial adaptivity. This result indicates that the proposed fusion strategy can more effectively reduce unnecessary embedding perturbations in visually sensitive regions, thereby suppressing visible artifacts and preserving image fidelity. This spatial adaptivity is a key indicator of real-world applicability. By confining watermark energy to stable texture clusters, CCViM ensures high ‘re-compression resilience,’ preventing the visual artifacts and watermark leakage often triggered by secondary compression in industrial distribution pipelines.
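The residual visualization described above can be sketched in a few lines. This is an illustrative version for 2D grayscale images stored as nested lists; the magnification factor `gain` is a hypothetical display parameter, since the factor used for Figure 6 is not given.

```python
def residual_map(cover, encoded, gain=10.0, peak=255.0):
    """Magnified absolute residual |cover - encoded| for 2D grayscale images.

    `gain` is a hypothetical magnification factor chosen for display only.
    """
    return [
        [min(peak, abs(c - e) * gain) for c, e in zip(row_c, row_e)]
        for row_c, row_e in zip(cover, encoded)
    ]
```

Clipping at `peak` keeps the magnified residual in the displayable range, so strong local perturbations saturate instead of wrapping around.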

Overall, the proposed CCViM performs better than all baselines on most metrics across the six attack settings. Although the improvement in average bit accuracy (BA) under these attacks may seem modest, the statistical significance tests detailed in Section 4.7 show that it is statistically significant (p < 0.05). This statistical support indicates that the improvement reflects a consistent advantage of the architecture rather than random variation.

Notably, under the most challenging setting with 50% cropout, CCViM outperforms the strong baseline WFormer by 0.55 percentage points in BA (98.35% vs. 97.80%). Although the absolute gain is modest in magnitude, it corresponds to a meaningful reduction in bit errors under severe content removal, indicating stable robustness in this challenging scenario. By comparison, traditional end-to-end methods as well as stage-wise training approaches, such as TSDL, lag behind modern architectural designs. This is largely because conventional CNN-based backbones tend to overemphasize local pixel-level correlations while overlooking complex global structural cues. Although distortion-specific methods such as MBRS and Fang can improve performance via attack simulation or frequency-domain enhancement, their capacity to model non-linear and dynamic spatiotemporal relations remains limited, leading to pronounced performance fluctuations when facing non-targeted attacks such as geometric transformations.

In contrast, decoupled architectures and Transformer-based models are better at modeling complex spatial dependencies. De-END and WFormer generally outperform TSDL and Fang, and show particularly strong results under JPEG compression and dropout-based attacks. However, such methods often rely heavily on attention mechanisms, making it difficult to jointly preserve local texture invariance and global interactions; consequently, they may fail to capture critical watermark cues under extreme pixel removal or heavy noise corruption. By comparison, CCViM integrates a Vmamba state-space backbone with IFEM, and introduces context clustering to facilitate deep local–global interactions. Coupled with SSM-based selective scanning to suppress random noise and strengthen the locking of structural features, CCViM delivers more stable and consistently better performance under a wide range of high-severity distortions.

4.6. Performance Comparison Under Combined Attacks

The experimental results are summarized in Table 5. Under combined attacks and multiple distortion severity levels, the proposed CCViM achieves the best overall performance among all compared methods, with the highest average bit accuracy (BA) of 99.18%, compared with 98.98% for the strongest competitor, WFormer. Notably, it attains error-free extraction (100% BA) under cropout and dropout settings; under the most challenging Crop (r = 0.035) and JPEG (QF = 50) attacks, it still reaches 98.55% and 97.35% BA, respectively, significantly surpassing MBRS and TSDL, although WFormer remains slightly higher under JPEG (97.73%). In contrast, traditional HiDDen and TSDL perform poorly under composite distortions, largely because they struggle to jointly capture the distributional characteristics of diverse noise sources. Even De-END, while competitive under single attacks, exhibits a clear degradation under complex mixtures due to insufficient redundancy in feature extraction. By leveraging a context-clustered visual Mamba architecture, CCViM enhances the semantic robustness of watermark embedding locations; moreover, Vmamba’s dynamic modeling capability enables strong generalization and stable decoding across a wide spectrum of severe attacks, ranging from geometric transformations to frequency-domain compression.

These composite attacks, especially the combination of severe JPEG compression and cropping, approximate the complex non-linear degradations commonly encountered in real-world social media transmission pipelines, such as those of Twitter and WeChat. In this sense, the composite-attack configuration in Table 5, which applies multiple distortions during training, serves as a Standardized Production-Environment Proxy (SPEP). The average BA of 99.18% maintained under these mixed distortions indicates good generalization and suggests that the model can withstand stress conditions beyond standard single-distortion benchmarks, showing its potential for practical copyright protection scenarios.
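Two of the distortions in Table 5 can be illustrated with a minimal sketch. It follows the common definitions from the HiDDen line of work, in which dropout replaces individual encoded pixels with cover pixels and cropout keeps only a rectangular region of the encoded image. How the paper parameterizes p, and whether distortions are stacked or sampled per batch, is not stated in this section, so `keep_ratio`, the rectangle arguments, the function names, and the stacking order below are assumptions.

```python
import random


def dropout(encoded, cover, keep_ratio, rng):
    """Keep each encoded pixel with probability keep_ratio, else use the cover pixel."""
    return [
        [e if rng.random() < keep_ratio else c for e, c in zip(row_e, row_c)]
        for row_e, row_c in zip(encoded, cover)
    ]


def cropout(encoded, cover, top, left, height, width):
    """Keep a rectangle of the encoded image and fill the rest from the cover image."""
    return [
        [
            e if top <= i < top + height and left <= j < left + width else c
            for j, (e, c) in enumerate(zip(row_e, row_c))
        ]
        for i, (row_e, row_c) in enumerate(zip(encoded, cover))
    ]


# Toy 4 x 4 example: encoded pixels are 1, cover pixels are 0.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
encoded = [[1] * 4 for _ in range(4)]
cover = [[0] * 4 for _ in range(4)]
# Assumed composition: dropout first, then cropout keeping the top two rows.
attacked = cropout(dropout(encoded, cover, 0.7, rng), cover,
                   top=0, left=0, height=2, width=4)
```

In the toy example, every pixel outside the kept rectangle reverts to the cover image, so any watermark energy placed there is lost to the decoder.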

4.7. Statistical Significance Analysis

To rigorously assess whether the performance gains of CCViM are statistically meaningful, we conducted a formal significance analysis over five independent runs. As reported in Table 6, we compared the bit accuracy (BA) of CCViM with that of WFormer across representative attack scenarios. The results are presented as mean ± SD. The calculated p-values for all tested scenarios are below the significance threshold of 0.05. Attacks such as cropout and dropout were excluded from this t-test because the proposed model achieved 100.00% accuracy on them in every run, and the resulting zero variance makes the test uninformative. Overall, this validation shows that the performance advantage of CCViM is reliable and distinguishable from run-to-run variation.
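The test statistics in Table 6 can be checked from the reported summary values alone. The sketch below assumes an independent two-sample Student t-test with equal group sizes (n = 5 runs per method, hence 8 degrees of freedom); the exact test variant is not specified in the text, and the function names are hypothetical. The p-value is obtained by numerically integrating the Student t density so that only the standard library is needed.

```python
import math


def t_from_summary(mean_a, sd_a, mean_b, sd_b, n):
    """Two-sample t statistic for two groups of equal size n."""
    standard_error = math.sqrt((sd_a ** 2 + sd_b ** 2) / n)
    return (mean_b - mean_a) / standard_error


def two_sided_p(t, dof, steps=2000):
    """Two-sided p-value from Simpson integration of the Student t density."""
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))

    def pdf(x):
        return coef * (1.0 + x * x / dof) ** (-(dof + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * pdf(k * h)
    central_mass = 2.0 * area * h / 3.0  # probability that |T| <= |t|
    return 1.0 - central_mass


# JPEG (QF = 50) row of Table 6, assuming n = 5 runs per method.
t_jpeg = t_from_summary(98.01, 0.06, 98.15, 0.11, n=5)
p_jpeg = two_sided_p(t_jpeg, dof=2 * 5 - 2)
print(f"t = {t_jpeg:.2f}, p = {p_jpeg:.3f}")
```

Under these assumptions, the statistic for the JPEG row should land close to the t-value of 2.50 reported in Table 6, with a p-value below 0.05.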

4.8. Ablation Study

To evaluate the contribution of each component in CCViM, we conducted ablation studies on the combined-attack benchmark. We designed seven variants for comparative analysis:

H_CCViM/C_CCViM/T_CCViM: Replace WRLM with simple tiling, convolutional layers, or a Transformer, respectively.
D_CCViM: Remove IFEM and perform feature fusion via direct concatenation.
NI_CCViM: Remove the interleaved design and execute the fusion process sequentially.
T_IFEM_CCViM: Replace the Vmamba module within IFEM with a Transformer module.
NC_CCViM: Remove the context clustering module from IFEM.

As reported in Table 7, CCViM achieves the highest average BA (99.18%) among all variants while keeping a comparable PSNR, validating the effectiveness of each component. Specifically, the poor performance of H_CCViM (average BA 85.32%) highlights the crucial role of WRLM in producing attack-resilient redundant features, whereas naive tiling introduces excessive yet ineffective redundancy. The comparison between C_CCViM and T_CCViM indicates that, while Transformers alleviate the locality limitation of convolutions, Vmamba’s causal scanning is more effective at capturing long-range dependencies and maintaining global coherence. D_CCViM further suggests that deep spatiotemporal interactions are essential for improving imperceptibility since simple concatenation cannot achieve adaptive energy hiding. NI_CCViM demonstrates the advantage of the interleaved scheme in progressively refining features. Moreover, NC_CCViM confirms that context clustering provides a useful prior over texture distributions, guiding watermark embedding toward stable regions and thereby improving performance. Finally, the comparison between T_IFEM_CCViM and the proposed CCViM reflects the role of the Vmamba backbone. Although T_IFEM_CCViM shows competitive performance under blur attacks, the proposed model attains a higher average BA of 99.18%, whereas T_IFEM_CCViM reaches 98.87%. This observation indicates that Vmamba’s causal scanning mechanism and context clustering identify robust embedding regions more effectively than Transformer-based self-attention.
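The average BA values quoted above can be recomputed directly from the per-attack columns of Table 7. The short sketch below does only this arithmetic; the dictionary and function names are hypothetical, and the numbers are copied from the table.

```python
# BA (%) per attack from Table 7: Crop, Cropout, Dropout, Gaussian Blur, JPEG.
TABLE_7_BA = {
    "H_CCViM": [75.10, 80.25, 82.50, 90.75, 98.00],
    "C_CCViM": [94.20, 98.15, 98.50, 100.0, 92.50],
    "T_CCViM": [97.80, 100.0, 100.0, 100.0, 96.95],
    "D_CCViM": [85.20, 90.50, 91.10, 96.00, 97.95],
    "NI_CCViM": [95.80, 98.50, 99.00, 100.0, 92.10],
    "T_IFEM_CCViM": [98.10, 100.0, 100.0, 100.0, 96.25],
    "NC_CCViM": [97.20, 99.30, 99.70, 100.0, 95.00],
    "Proposed": [98.55, 100.0, 100.0, 100.0, 97.35],
}


def average_ba(rows):
    """Mean BA over the five attacks for every variant, rounded to two decimals."""
    return {name: round(sum(values) / len(values), 2) for name, values in rows.items()}


averages = average_ba(TABLE_7_BA)
best = max(averages, key=averages.get)
for name, value in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{name:>14}: {value:.2f}")
```

Sorting by the average makes the size of each component's contribution easy to read off, from naive tiling (H_CCViM) at the bottom to the full model at the top.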

4.9. Computational Cost Analysis

To evaluate the practical deployment potential of the CCViM model, we systematically compared its parameter count, FLOPs, and throughput with those of representative methods (Table 8). The results show that the parameter count of CCViM (1.46 M) is larger than that of lightweight models such as HiDDen (0.40 M) and De-END (0.41 M), but smaller than that of WFormer (1.72 M) and MBRS (5.80 M), which rely heavily on attention mechanisms. This reduction is largely attributed to the inherent parameter efficiency of Vmamba. Furthermore, the linear complexity of the Vmamba backbone circumvents the quadratic computational growth associated with Transformer self-attention. Consequently, our proposed model outperforms both MBRS and WFormer in terms of FLOPs (8.30 G vs. 13.36 G and 13.83 G) and inference speed (23.82 im/s vs. 17.67 im/s and 14.29 im/s).

Beyond parameter counts and FLOPs, memory usage and scalability provide additional insight into the characteristics of the CCViM architecture. In high-resolution image watermarking, scalability is often constrained by GPU memory (VRAM). Transformer-based methods such as WFormer rely heavily on self-attention, whose computational and memory complexity scales quadratically as O(N²), where N denotes the number of image tokens, which grows with the spatial resolution of the input. As the resolution increases, this complexity leads to a rapid growth in memory consumption, which limits the practicality of such methods for 2K or 4K image processing in real-world settings. By comparison, CCViM is built on a Vision Mamba backbone and adopts a selective state-space sequence modeling strategy. This design gives the framework linear computational and memory complexity, expressed as O(N), with respect to the input resolution. As a result, the memory usage of CCViM increases more gradually as the host image resolution grows. This property enables the model to preserve effective local–global feature interaction while remaining more suitable for large-scale and high-fidelity digital asset protection.

Overall, although the state-space module incurs a higher computational cost than simple convolutions (8.30 GFLOPs vs. 3.52 GFLOPs for HiDDen), this trade-off yields better watermark imperceptibility and robustness, supporting the feasibility of deploying the proposed model in real-world application scenarios. The O(N) linear complexity also makes CCViM, unlike Transformer-based models, better suited to scalable industrial deployment, for example in deepfake detection pipelines or large-scale digital asset protection.
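The difference between quadratic and linear scaling can be made concrete with a small counting sketch. It is not a measurement of either model: the patch size of 16 and the four scan directions are assumptions chosen for illustration, and the functions count abstract operations rather than FLOPs or bytes.

```python
def token_count(height, width, patch=16):
    """Number of tokens for a patchified image (the patch size is an assumption)."""
    return (height // patch) * (width // patch)


def attention_pairs(tokens):
    """Pairwise interactions in full self-attention: grows as O(N^2)."""
    return tokens * tokens


def scan_steps(tokens, directions=4):
    """Sequential state updates in a multi-directional scan: grows as O(N)."""
    return tokens * directions


for side in (256, 512, 1024, 2048):
    n = token_count(side, side)
    print(f"{side}x{side}: N={n}, attention pairs={attention_pairs(n)}, "
          f"scan steps={scan_steps(n)}")
```

Doubling the image side quadruples N, so the attention count grows by a factor of 16 while the scan count grows by a factor of 4, which is the gap the paragraph above describes for 2K and 4K inputs.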

4.10. Performance Under Untrained Attacks and Model Limitations

To further evaluate the robustness of the proposed method under unseen and non-standard distortions, four types of untrained attacks with varying intensities were introduced, and the corresponding results are reported in Table 9. Compared with HiDDen, MBRS, De-END, and WFormer, CCViM achieves competitive performance across most attack settings. Despite supporting a higher embedding capacity, CCViM still maintains leading or comparable BA under geometric distortions, particularly in the case of large-scale grid cropping. Although a slight performance decline is observed under certain severe local smoothing attacks, this trade-off is acceptable given its overall robustness under diverse unseen perturbations. Moreover, the BA values of CCViM remain consistently above the 80% usability threshold for all tested untrained attacks, indicating that the embedded watermark can still be recovered reliably under challenging conditions.

Overall, CCViM maintains stable decoding performance in the presence of unseen complex distortions, indicating good robustness and generalization. This can be mainly attributed to the proposed Interwoven Fusion Enhancement Module (IFEM). Specifically, the Vision Mamba (Vmamba) unit employs a selective scanning mechanism that provides a global receptive field with linear computational complexity, thereby facilitating the modeling of long-range dependencies and the identification of image regions that are more suitable for robust watermark embedding. In addition, the context clustering strategy promotes the aggregation of semantically similar regions, allowing watermark information to be embedded into relatively stable feature patterns that are less sensitive to local appearance variations. As a result, even under severe color shifts or geometric deformations, the model is still able to recover the watermark by exploiting stable structural information in the host image.
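The grouping idea behind context clustering can be illustrated with a deliberately simplified sketch: each feature vector is assigned to its most similar center by cosine similarity, and the members of each cluster are then averaged. This is not the authors' IFEM; the function names, the plain mean aggregation, and the toy inputs are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_to_centers(points, centers):
    """Index of the most similar center for every point."""
    return [
        max(range(len(centers)), key=lambda k: cosine(p, centers[k]))
        for p in points
    ]


def aggregate(points, labels, n_centers):
    """Mean feature of each cluster (empty clusters yield None)."""
    out = []
    for k in range(n_centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        out.append([sum(col) / len(members) for col in zip(*members)] if members else None)
    return out


# Toy features: two points resemble the first center, one resembles the second.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]
labels = assign_to_centers(points, centers)
cluster_means = aggregate(points, labels, len(centers))
```

Because the assignment depends on feature similarity rather than pixel position, regions with similar texture end up in the same group even when they are far apart in the image.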

Although CCViM shows good generalization under the above non-standard distortions, it may still encounter performance degradation under extreme conditions. This limitation mainly arises from two aspects. First, the Vision Mamba backbone relies on the 2D Selective Scan mechanism, which is based on an underlying assumption of spatial continuity. When the input image is subjected to severe non-rigid geometric distortions, such as strong elastic deformation or irregular local warping, this continuity may be disrupted, reducing the effectiveness of long-range dependency modeling and thus affecting watermark extraction. Second, the context clustering strategy in IFEM requires sufficiently informative local features for stable grouping. Under severe signal degradation, such as extremely low-quality JPEG compression or strong low-pass filtering, local textures and structural cues may be heavily weakened, making meaningful clustering more difficult and increasing the likelihood of decoding errors. Future work will, therefore, investigate more deformation-robust scanning strategies and more resilient clustering mechanisms to further enhance robustness under extreme distortions.
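The spatial-continuity assumption mentioned above follows from how a 2D feature map is turned into 1D sequences before scanning. The sketch below shows the four traversal orders commonly described for SS2D (row-major, column-major, and their reverses); it only produces the orderings, not the state-space recurrence, and the function name is hypothetical.

```python
def cross_scan(grid):
    """Flatten an H x W grid into four 1D scan orders."""
    row_major = [x for row in grid for x in row]
    col_major = [x for col in zip(*grid) for x in col]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


# Toy 2 x 3 grid whose entries are their own row-major indices.
orders = cross_scan([[0, 1, 2], [3, 4, 5]])
for order in orders:
    print(order)
```

In every order, positions that are adjacent in the sequence are assumed to be related in the image; an irregular local warp moves content between rows and columns and therefore breaks that assumption in all four sequences at once.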

5. Conclusions

We propose CCViM, an efficient and robust image watermarking framework built on Vision Mamba, aiming to address the challenging trade-off between fusion efficiency and robustness in existing deep learning-based watermarking methods. To alleviate the redundancy and limited feature interaction caused by naive watermark replication, we develop a Watermark Representation Learning Module (WRLM) that performs structured watermark expansion via cascaded visual state-space blocks, replacing simple repetition. Meanwhile, we introduce an Interwoven Fusion Enhancement Module (IFEM) that tightly couples context clustering with Mamba’s selective scanning, allowing the watermark signal to act as a dynamic guide and to interweave deeply with host image features at both local and global scales. Quantitative evaluations, supported by independent t-tests (p < 0.05), show that CCViM achieves improvement over prior methods in both imperceptibility and robustness. Instead of relying on subjective interpretations of small numerical differences, this statistical validation confirms a consistent performance advantage, particularly under geometric cropping and complex composite attacks. Despite these improvements, we acknowledge certain limitations in the current framework. While CCViM successfully resolves the quadratic complexity bottleneck of Transformers, the integration of state-space modeling and context clustering inevitably introduces a slight computational overhead during inference compared to purely lightweight CNN baselines. Furthermore, achieving error-free watermark extraction under extreme, simultaneous geometric and frequency-domain distortions remains a challenging open problem. Future work will address these limitations by exploring multi-scale state-space fusion and lightweight adversarial training to further improve adaptability and efficiency under extreme real-world conditions.

Author Contributions

Conceptualization, B.L. and J.R.; methodology, B.L. and J.R.; validation, B.L. and J.R.; formal analysis, B.L.; investigation, J.R.; resources, J.R.; data curation, B.L.; writing—original draft, B.L.; writing—review and editing, B.L.; visualization, J.R.; supervision, B.L. and J.R.; project administration, B.L. and J.R.; funding acquisition, B.L. and J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The main materials supporting this study are included in the article. Further technical details may be available from the corresponding author upon reasonable request.

Acknowledgments

The authors want to thank the editor and anonymous reviewers for their valuable suggestions for improving this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cao, Y.; Li, S.; Liu, Y.; Yan, Z.; Dai, Y.; Yu, P.; Sun, L. A survey of AI-generated content (AIGC). ACM Comput. Surv. 2025, 57, 125.
2. Trigka, M.; Dritsas, E. The evolution of generative AI: Trends and applications. IEEE Access 2025, 13, 98504–98529.
3. Mirsky, Y.; Lee, W. The creation and detection of deepfakes: A survey. ACM Comput. Surv. 2021, 54, 7.
4. Verdoliva, L. Media forensics and deepfakes: An overview. IEEE J. Sel. Top. Signal Process. 2020, 14, 910–932.
5. Wu, X.; Liao, X.; Ou, B. SepMark: Deep separable watermarking for unified source tracing and deepfake detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1190–1201.
6. Zhang, F.; Wang, H.; He, M.; Xia, J. Robust blind symmetry-based watermarking in the frequency domain against social network processing and desynchronization attacks. IEEE Trans. Circuits Syst. Video Technol. 2024.
7. Wang, G.; Ma, Z.; Liu, C.; Yang, X.; Fang, H.; Zhang, W.; Yu, N. MuST: Robust image watermarking for multi-source tracing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 21–27 February 2024; Volume 36, pp. 5364–5371.
8. Tang, Y.; Wang, C.; Xiang, S.; Cheung, Y.-M. A Robust reversible watermarking scheme using attack-simulation-based adaptive normalization and embedding. IEEE Trans. Inf. Forensics Secur. 2024, 19, 4114–4129.
9. Wan, W.; Wang, J.; Zhang, Y.; Li, J.; Yu, H.; Sun, J. A comprehensive survey on robust image watermarking. Neurocomputing 2022, 488, 226–247.
10. Luo, X.; Zhan, R.; Chang, H.; Yang, F.; Milanfar, P. Distortion agnostic deep watermarking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 13548–13557.
11. Geiping, J.; Goldstein, T.; Kirchenbauer, J.; Wen, Y. Tree-rings watermarks: Invisible fingerprints for diffusion images. Adv. Neural Inf. Process. Syst. 2023, 36, 58047–58063.
12. Zhang, X.; Li, R.; Yu, J.; Xu, Y.; Li, W.; Zhang, J. EditGuard: Versatile image watermarking for tamper localization and copyright protection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19–21 June 2024; pp. 11964–11974.
13. Luo, T.; Wu, J.; He, Z.; Xu, H.; Jiang, G.; Chang, C.-C. WFormer: A transformer-based soft fusion model for robust image watermarking. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 4179–4196.
14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the 2021 International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021.
15. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the International Conference on Machine Learning, PMLR, Vienna, Austria, 21–27 July 2024; pp. 62429–62442.
16. Ma, X.; Zhou, Y.; Wang, H.; Qin, C.; Sun, B.; Liu, C.; Fu, Y. Image as set of points. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
17. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
18. Agustsson, E.; Timofte, R. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1122–1131.
19. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
20. Zhu, J.; Kaplan, R.; Johnson, J.; Li, F.-F. Hidden: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 682–697.
21. Ahmadi, M.; Norouzi, A.; Karimi, N.; Samavi, S.; Emami, A. ReDMark: Framework for residual diffusion watermarking based on deep networks. Expert Syst. Appl. 2020, 146, 113157.
22. Jia, Z.; Fang, H.; Zhang, W. MBRS: Enhancing robustness of dnn-based watermarking by mini-batch of real and simulated jpeg compression. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 41–49.
23. Fang, H.; Jia, Z.; Qiu, Y.; Zhang, J.; Zhang, W.; Chang, E.-C. De-END: Decoder-driven watermarking network. IEEE Trans. Multimed. 2022, 25, 7571–7581.
24. Fang, H.; Qiu, Y.; Chen, K.; Zhang, J.; Zhang, W.; Chang, E.-C. Flow-based robust watermarking with invertible noise layer for black-box distortions. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 5054–5061.
25. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1106–1114.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
27. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
28. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 200.
29. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024.
30. Somvanshi, S.; Islam, M.M.; Mimi, M.S.; Polock, S.B.B.; Chhetri, G.; Das, S. From s4 to mamba: A comprehensive survey on structured state space models. arXiv 2025, arXiv:2503.18970.
31. Jiao, J.; Liu, Y.; Liu, Y.; Tian, Y.; Wang, Y.; Xie, L.; Ye, Q.; Yu, H.; Zhao, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063.
32. Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. Vmambair: Visual state space model for image restoration. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5560–5574.
33. Liu, L.; Zhang, M.; Yin, J.; Liu, T.; Ji, W.; Piao, Y.; Lu, H. Defmamba: Deformable visual state space model. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 8838–8847.
34. Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722.
35. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. Localmamba: Visual state space model with windowed selective scan. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 12–22.
36. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.-T. MambaIR: A simple baseline for image restoration with state-space model. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 222–241.
37. Liu, Y.; Guo, M.; Zhang, J.; Zhu, Y.; Xie, X. A novel two-stage separable deep learning framework for practical blind watermarking. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1509–1517.
38. Fang, H.; Jia, Z.; Zhou, H.; Ma, Z.; Zhang, W. Encoded feature enhancement in watermarking network for distortion in real scenes. IEEE Trans. Multimed. 2022, 25, 2648–2660.

Figure 1. Illustration of 2D Selective Scan (SS2D).

Figure 2. Overall structure of CCViM. (Note: The animal image in the figure was sourced from Pixabay and is available at https://pixabay.com/zh/photos/squirrel-rodent-foraging-wildlife-498139/, accessed on 20 March 2026).

Figure 3. Interwoven Fusion Enhancement Module.

Figure 4. Visualization of watermarked images under different noise attacks. (Note: The images in panels (a–f) were obtained from Pixabay; all URLs were accessed on 20 March 2026. Panel (a): https://pixabay.com/zh/photos/sunflowers-flowers-yellow-flowers-4298808/; panel (b): https://pixabay.com/zh/photos/weaver-bird-yellow-bird-wildlife-5189346/; panel (c): https://pixabay.com/zh/photos/automobile-road-direction-vehicle-8078415/; panel (d): https://pixabay.com/zh/photos/butterfly-dolls-dolls-butterflies-363337/; panel (e): https://pixabay.com/zh/photos/mandarin-oranges-mandarins-oranges-6929463/; panel (f): https://pixabay.com/zh/photos/church-building-mood-black-simple-3460420/).

Figure 5. Robustness comparison under different noise attacks.

Figure 6. Visualization of residual maps for each method. (a) HiDDen [20]; (b) MBRS [22]; (c) De-END [23]; (d) Proposed.

Table 1. Model performance under different loss function weights.

$λ_{E}, λ_{D}, λ_{Adv}$	PSNR (dB)	SSIM	BA [%]
10, 10, 0.0001	46.72	0.9868	97.39
5, 10, 0.0001	44.58	0.9851	98.75
3, 10, 0.0001	44.46	0.9805	99.47
1, 10, 0.0001	40.91	0.9612	99.76
3, 10, 0.0005	45.11	0.9836	98.48
3, 10, 0.00005	43.78	0.9759	99.51

Table 2. Model performance under different numbers of VSS blocks.

$(N_{enc}, N_{sym})$	PSNR (dB)	BA [%] (QF = 80)	BA [%] (QF = 70)	BA [%] (QF = 60)	BA [%] (QF = 50)	BA [%] (QF = 40)	BA [%] (Average)
(4, 4)	43.98	99.98	99.86	99.65	99.11	96.55	99.03
(6, 4)	44.25	99.99	99.91	99.78	99.28	97.02	99.20
(8, 4)	44.46	100	99.95	99.82	99.45	97.89	99.42
(10, 4)	44.41	100	99.94	99.80	99.41	97.81	99.39
(8, 2)	43.82	99.95	99.80	99.52	98.91	96.13	98.86
(8, 6)	44.15	100	99.92	99.75	99.33	97.54	99.31

Table 3. Model performance under different training epochs.

Epoch	PSNR (dB)	SSIM	BA [%] (QF = 60)	BA [%] (QF = 50)	BA [%] (QF = 40)	BA [%] (Average)
20	42.45	0.9733	98.10	96.70	94.60	96.47
40	43.28	0.9731	98.74	98.50	96.20	97.81
60	43.39	0.9745	99.40	98.70	96.50	98.20
80	44.31	0.9776	99.70	99.10	97.10	98.63
100	44.52	0.9811	99.85	99.70	97.28	98.94
120	44.50	0.9803	99.70	98.98	97.01	98.56

Table 4. Comparison of BA under distortion-specific training.

Method	Gaussian Noise (%)						Salt-and-Pepper Noise (%)
Method	$σ$ = 0.01	0.02	0.03	0.04	0.05	Average	$r$ = 0.01	0.02	0.03	0.04	0.05	Average
HiDDen	89.55	86.48	83.95	83.09	79.15	84.44	95.10	93.75	93.41	92.88	90.38	93.10
TSDL	92.10	91.25	88.31	87.05	82.94	88.33	97.25	95.61	93.52	92.68	91.41	94.09
MBRS	99.90	99.38	98.05	96.05	94.10	97.50	98.08	98.71	98.30	97.55	96.65	97.86
Fang	90.48	-	-	-	-	90.48	97.01	97.28	97.69	97.10	96.98	97.21
De-END	99.98	99.69	98.30	96.55	95.88	98.08	99.38	99.48	99.20	99.08	98.69	99.17
WFormer	100	99.85	98.90	98.25	98.00	99.00	99.85	99.65	99.35	99.15	98.50	99.30
Proposed	100	99.92	99.45	98.73	98.50	99.32	99.95	99.88	99.75	99.55	99.12	99.65
Method	Cropout (%)						Dropout (%)
Method	$p$ = 90%	80%	70%	60%	50%	Average	$p$ = 80%	70%	60%	50%	40%	Average
HiDDen	95.58	94.70	88.72	76.85	61.65	83.50	90.26	89.51	87.08	86.78	82.74	87.27
TSDL	98.68	98.45	96.88	93.71	93.20	96.18	97.59	95.29	93.54	92.33	90.47	93.84
MBRS	99.70	99.21	97.18	90.41	83.50	94.00	96.31	96.12	94.18	92.64	90.66	93.98
Fang	98.28	97.88	97.10	95.30	-	97.14	97.38	-	-	-	-	97.38
De-END	100	99.98	99.48	97.25	91.20	97.58	100	100	100	99.50	94.65	98.83
WFormer	99.95	99.90	99.80	98.75	97.80	99.24	99.51	99.16	98.68	97.62	95.60	98.11
Proposed	100	100	99.95	99.92	98.35	99.64	99.58	99.19	98.74	97.71	95.92	98.23
Method	Gaussian Blur (%)					JPEG Compression (%)
Method	$σ$ = 0.0001	0.5	1	2	Average	QF = 40	50	60	70	80	90	Average
HiDDen	95.38	95.15	94.28	84.40	92.30	86.70	91.31	92.91	93.35	93.51	94.31	92.02
TSDL	99.89	99.72	98.40	93.21	97.81	91.07	91.41	93.90	94.23	94.33	94.71	93.28
MBRS	98.58	98.20	97.61	87.75	95.54	94.82	94.98	96.65	97.73	97.64	98.80	96.77
Fang	-	90.35	92.05	91.98	91.46	-	91.46	92.47	93.65	94.35	95.08	93.40
De-END	99.98	99.95	99.45	94.32	98.43	98.17	99.05	100	100	100	100	99.54
WFormer	98.85	99.10	98.55	98.10	98.65	95.61	98.01	98.72	98.94	99.82	100.00	98.52
Proposed	98.92	98.95	98.99	98.81	98.92	95.88	98.15	98.79	99.07	99.98	100.00	98.65

Table 5. PSNR and BA under composite-distortion training.

Method	Crop (r = 0.035)	Cropout (p = 0.3)	Dropout (p = 0.3)	Gaussian Blur ( $σ$ = 0.01)	JPEG (QF = 50)	Average
HiDDen	88.00	94.00	93.00	96.00	63.00	86.80
TSDL	89.00	97.30	97.40	98.60	76.20	91.70
MBRS	81.15	78.57	77.13	92.80	82.83	82.50
Fang	95.85	100	99.99	99.99	95.52	98.27
De-END	64.17	99.21	99.95	88.93	81.89	86.83
WFormer	97.17	100	100	100	97.73	98.98
Proposed	98.55 ± 0.06	100 ± 0.00	100 ± 0.00	100 ± 0.00	97.35 ± 0.11	99.18 ± 0.04

Table 6. Statistical significance of BA (%) differences between WFormer and the proposed CCViM under representative attacks.

Attack Scenario	WFormer (Mean ± SD)	Proposed (Mean ± SD)	t-Value	p-Value	Significance (p < 0.05)
Gaussian Noise ( $σ$ = 0.04)	98.25 ± 0.15	98.73 ± 0.08	6.31	<0.001	Yes
Gaussian Blur ( $σ$ = 2)	98.10 ± 0.18	98.81 ± 0.09	7.89	<0.001	Yes
JPEG (QF = 50)	98.01 ± 0.06	98.15 ± 0.11	2.50	0.037	Yes
Composite Average	98.98 ± 0.05	99.18 ± 0.04	6.99	<0.001	Yes

Table 7. Comparison of ablation study results.

Method	PSNR (dB)	Crop (r = 0.035)	Cropout (p = 0.3)	Dropout (p = 0.3)	Gaussian Blur ( $σ$ = 0.01)	JPEG (QF = 50)	Average
H_CCViM	37.25	75.10	80.25	82.50	90.75	98.00	85.32
C_CCViM	36.75	94.20	98.15	98.50	100	92.50	96.67
T_CCViM	36.80	97.80	100	100	100	96.95	98.95
D_CCViM	36.95	85.20	90.50	91.10	96.00	97.95	92.15
NI_CCViM	36.70	95.80	98.50	99.00	100	92.10	97.08
T_IFEM_CCViM	36.81	98.10	100	100	100	96.25	98.87
NC_CCViM	36.79	97.20	99.30	99.70	100	95.00	98.24
Proposed	36.83	98.55	100	100	100	97.35	99.18
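
One way to read Table 7 is to rank the ablated variants by how much average BA they lose relative to the full model. The following sketch recomputes each row average from the five attack columns and reports the gap in percentage points. The variant names are taken from the table as-is; the helper `mean_ba` and the dictionary layout are illustrative assumptions.

```python
# BA (%) per attack, copied from Table 7 in column order:
# Crop, Cropout, Dropout, Gaussian Blur, JPEG.
ABLATION = {
    "H_CCViM": (75.10, 80.25, 82.50, 90.75, 98.00),
    "C_CCViM": (94.20, 98.15, 98.50, 100.00, 92.50),
    "T_CCViM": (97.80, 100.00, 100.00, 100.00, 96.95),
    "D_CCViM": (85.20, 90.50, 91.10, 96.00, 97.95),
    "NI_CCViM": (95.80, 98.50, 99.00, 100.00, 92.10),
    "T_IFEM_CCViM": (98.10, 100.00, 100.00, 100.00, 96.25),
    "NC_CCViM": (97.20, 99.30, 99.70, 100.00, 95.00),
    "Proposed": (98.55, 100.00, 100.00, 100.00, 97.35),
}


def mean_ba(row):
    """Unweighted mean BA over the five attacks of one table row."""
    return sum(row) / len(row)


full_model = mean_ba(ABLATION["Proposed"])

# Average BA lost by each variant relative to the full model.
drops = {
    name: full_model - mean_ba(row)
    for name, row in ABLATION.items()
    if name != "Proposed"
}

# Largest loss first.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)

for name, drop in ranked:
    print(f"{name}: -{drop:.2f} points")
```

The ranking uses only the Average column's definition, so it ignores the PSNR differences that Table 7 also reports.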

Table 8. Analysis of computational cost.

Method	Params [M]	FLOPs [G]	Speed [im/s]
HiDDen	0.40	3.52	62.93
MBRS	5.80	13.36	17.67
De-END	0.41	3.91	18.38
WFormer	1.72	13.83	14.29
Proposed	1.46	8.30	23.82
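
The cost figures in Table 8 are easier to compare as ratios against a reference method. The sketch below expresses one method's parameters and FLOPs as percentage changes and its throughput as a speed-up factor relative to another row. The function name `relative_cost` and the returned keys are illustrative assumptions, not part of the original work.

```python
# Rows copied from Table 8: (params in M, FLOPs in G, speed in images/s).
TABLE8 = {
    "HiDDen": (0.40, 3.52, 62.93),
    "MBRS": (5.80, 13.36, 17.67),
    "De-END": (0.41, 3.91, 18.38),
    "WFormer": (1.72, 13.83, 14.29),
    "Proposed": (1.46, 8.30, 23.82),
}


def relative_cost(method, reference, table=TABLE8):
    """Compare `method` against `reference` using the Table 8 columns."""
    params, flops, speed = table[method]
    ref_params, ref_flops, ref_speed = table[reference]
    return {
        # Negative values mean the method is smaller or cheaper.
        "params_change_pct": 100.0 * (params - ref_params) / ref_params,
        "flops_change_pct": 100.0 * (flops - ref_flops) / ref_flops,
        # Values above 1 mean the method processes more images per second.
        "speedup": speed / ref_speed,
    }


vs_wformer = relative_cost("Proposed", "WFormer")
for key, value in vs_wformer.items():
    print(f"{key}: {value:.2f}")
```

Under these figures the proposed model uses roughly 15% fewer parameters and 40% fewer FLOPs than WFormer while running about 1.67 times faster; the same call with `reference="HiDDen"` shows that the lighter CNN baseline remains cheaper on all three measures.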

Table 9. Performance under untrained non-standard attacks.

Attack	HiDDen	MBRS	De-END	WFormer	Proposed
Median Filter (w = 5 × 5)	72.58	99.96	91.06	98.06	97.88
Median Filter (w = 7 × 7)	64.28	88.01	73.45	95.68	95.13
Median Filter (w = 9 × 9)	59.76	58.84	54.83	91.12	90.26
Grid Crop (r = 0.7)	73.02	99.93	95.49	100	100
Grid Crop (r = 0.8)	65.33	99.76	89.09	99.95	99.95
Grid Crop (r = 0.9)	59.40	97.99	76.63	98.79	98.79
Adjust Hue (f = 0.44)	68.34	95.44	91.53	95.68	97.23
Adjust Hue (f = 0.46)	64.10	86.46	80.32	88.27	91.68
Adjust Hue (f = 0.48)	58.20	67.33	62.60	74.59	80.33
Adjust Saturation (f = 5.0)	79.09	99.86	97.01	99.92	99.96
Adjust Saturation (f = 10.0)	77.59	99.77	96.43	99.89	99.89
Adjust Saturation (f = 15.0)	76.36	99.63	96.19	99.69	99.76
Average	68.17	91.08	83.72	95.14	95.91
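
The overall averages in Table 9 hide where the proposed method and WFormer actually differ. As a rough sketch, the rows can be grouped by attack family and the mean BA gap (Proposed minus WFormer) computed per family. The grouping labels and the helper `family_gaps` are assumptions introduced for illustration.

```python
from collections import defaultdict

# (attack family, WFormer BA %, Proposed BA %) copied from Table 9.
ROWS = [
    ("Median Filter", 98.06, 97.88),
    ("Median Filter", 95.68, 95.13),
    ("Median Filter", 91.12, 90.26),
    ("Grid Crop", 100.00, 100.00),
    ("Grid Crop", 99.95, 99.95),
    ("Grid Crop", 98.79, 98.79),
    ("Adjust Hue", 95.68, 97.23),
    ("Adjust Hue", 88.27, 91.68),
    ("Adjust Hue", 74.59, 80.33),
    ("Adjust Saturation", 99.92, 99.96),
    ("Adjust Saturation", 99.89, 99.89),
    ("Adjust Saturation", 99.69, 99.76),
]


def family_gaps(rows):
    """Mean (Proposed - WFormer) BA gap per family, in percentage points."""
    gaps = defaultdict(list)
    for family, wformer, proposed in rows:
        gaps[family].append(proposed - wformer)
    return {family: sum(values) / len(values) for family, values in gaps.items()}


gaps = family_gaps(ROWS)
for family, gap in gaps.items():
    print(f"{family}: {gap:+.2f} points")
```

On these numbers the proposed method trails WFormer slightly under median filtering, ties under grid cropping, and gains most under hue adjustment, which accounts for most of the difference in the Average row.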


Share and Cite

MDPI and ACS Style

Liu, B.; Ren, J. Robust Image Watermarking via Clustered Visual State-Space Modeling. Appl. Sci. 2026, 16, 4166. https://doi.org/10.3390/app16094166

