Image Watermarking Algorithm Leveraging Dual-Attention Synergy and Adaptive Multi-Scale Fusion

Yang, Zhenghan; Sun, Huadong; Lv, Nuohan

doi:10.3390/electronics15122580

by Zhenghan Yang ¹, Huadong Sun ¹,²,* and Nuohan Lv

¹ School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
² Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin 150028, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2580; https://doi.org/10.3390/electronics15122580

Submission received: 30 April 2026 / Revised: 6 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026

(This article belongs to the Topic Recent Developments and Applications of Image Watermarking)

Abstract

Blind image watermarking models such as HiDDeN have laid an important foundation for end-to-end watermarking. Nevertheless, they still suffer from three major limitations: single-scale feature extraction, fixed fusion weights, and slow training convergence. To address these issues, this paper proposes an adaptive multi-scale watermarking algorithm based on collaborative dual-attention mechanisms. The algorithm designs an adaptive multi-scale feature fusion module (MA-FFM) with a dynamic gating network in the encoder, which flexibly combines local multi-scale textures with global contextual information, overcoming the limitation of fixed fusion weights. In the decoder, a multi-level channel attention module is embedded to strengthen the extraction of watermark signals. The two attention modules work synergistically: the encoder focuses on adaptive feature fusion while the decoder leverages channel attention to selectively enhance watermark-related features, forming a dual-attention synergy that balances robustness and imperceptibility. Moreover, the dynamic gating network adaptively adjusts the contribution of local versus global features via learnable weights, whose evolution from approximately 0.51 to about 0.89 improves model interpretability. Experiments are conducted on the COCO 2017 dataset. Compared with HiDDeN, the proposed algorithm reduces the bit error rate (BER) from 0.1696 to 0.1538 under no attack with a relative reduction of 9.3%, increases PSNR by 0.61 dB, and improves SSIM from 0.9058 to 0.9077. Under various attacks—including JPEG compression, Gaussian noise, salt-and-pepper noise, and brightness/contrast adjustments—the BER remains consistently lower than that of HiDDeN. Ablation studies confirm the effectiveness of each module. Overall, the proposed algorithm preserves visual quality, improves the accuracy of watermark embedding and extraction, and exhibits strong generalization robustness against common image distortions.

Keywords:

blind image watermarking; adaptive multi-scale feature fusion; dynamic gating network; dual-attention mechanism; robustness

1. Introduction

Digital images form the foundation of today’s internet information. Their copyright protection and traceability have become central issues in multimedia security. Since digital watermarking emerged in the 1990s, it has gone through three major stages: from spatial domain methods to transform domain methods, and now to deep neural network-based approaches [1,2].

Early spatial domain methods (e.g., LSB) and transform domain methods (e.g., DCT-based spread-spectrum [3] and DWT-based [4] approaches) improved robustness but relied heavily on manually designed rules, struggling to adapt to diverse image content under combined attacks. For a detailed review, readers are referred to Section 2.

The rise of deep learning brought a major shift. The HiDDeN framework [5] pioneered end-to-end watermarking by jointly training an encoder, noise layer, decoder, and discriminator, achieving a collaborative optimization of invisibility and robustness. Recent improvements include TSDL [6], MBRS [7], and PIMoG [8], which have enhanced feature representation and robustness against specific distortions.

Nevertheless, for high-precision applications, these models still suffer from three core limitations: single-scale convolution captures only local textures without global guidance; existing feature fusion relies on fixed manual weights that cannot adapt to image content; and training convergence is slow while interpretability is lacking [9,10]. Moreover, most existing deep watermarking models operate as black boxes, providing no insight into why a watermark is embedded in a particular region or how the model adapts to different image contents. This lack of interpretability limits their trustworthiness and practical deployment.

To address these issues, this paper proposes a dual-attention adaptive multi-scale blind watermarking framework built on HiDDeN. Specifically, we design an adaptive multi-scale feature fusion module (MA-FFM) with a dynamic gated weight learning mechanism that adaptively balances local and global features; the learned gate weight evolves from ≈0.51 to ≈0.89 during training, which enhances interpretability. At the decoder side, we embed multi-level SE channel attention modules [11] to strengthen watermark signal filtering. Inspired by [12], our method integrates MA-FFM and SE into a collaborative dual-attention architecture.

The main contributions of this work are threefold:

(1): We propose a dual-attention synergistic architecture that places MA-FFM in the encoder for adaptive multi-scale fusion and SE in the decoder for channel-wise watermark enhancement, which balances robustness and imperceptibility better than single-module variants.
(2): We introduce a dynamic gating mechanism whose weight evolution (0.51 → 0.89) is recorded and interpreted, providing interpretability of the deep watermarking process.
(3): We systematically evaluate capacity scalability of 30, 64 and 100 bits and robustness against differentiable noise layers including color jitter, JPEG and Gaussian blur, which are not explored in previous MA-FFM/SE-based works.

Experiments on COCO 2017 [1] demonstrate that our algorithm reduces BER by 9.3% under no attack from 0.1696 to 0.1538, improves PSNR by 0.61 dB, and achieves lower BER under various attacks. The method achieves synergistic gains in performance, efficiency, and robustness.

The remainder of this paper is organized as follows. Section 2 reviews related work on deep watermarking and attention mechanisms. Section 3 describes the proposed dual-attention architecture, including the MA-FFM and SE modules, the dynamic gating mechanism, and the training strategy. Section 4 presents experimental results, ablation studies, capacity scaling, and robustness evaluations. Section 5 discusses the findings and limitations, and Section 6 concludes the paper.

2. Related Work

2.1. Traditional Watermarking

Several comprehensive surveys have summarized the evolution of deep learning-based image watermarking [12]. The development of image watermarking can be traced back to the early 1990s, following a clear historical trajectory from simple spatial domain methods to more sophisticated transform domain algorithms, and eventually to modern deep learning-based approaches.

In the earliest stage, researchers primarily focused on spatial domain methods, among which least significant bit (LSB) [13] substitution was the most representative. By directly modifying the least significant bits of pixel values, LSB-based methods were computationally simple and easy to implement [14]. However, they quickly proved to be highly vulnerable to common image operations such as JPEG compression and filtering, thus lacking sufficient robustness for practical applications [1,15].

To overcome the fragility of spatial domain methods, the late 1990s saw a shift toward transform domain watermarking. Cox and his colleagues pioneered this direction by proposing a spread-spectrum watermarking technique based on the discrete cosine transform (DCT) [3], which laid the theoretical foundation for traditional watermarking. Shortly after, Kundur and others exploited the local time-frequency properties of the discrete wavelet transform (DWT) to achieve a better trade-off between invisibility and robustness [4]. Quantization-based methods have also been explored [16]. Throughout the following two decades, transform domain methods [15] dominated the field. Nevertheless, they still suffered from inherent limitations: embedding positions and strengths were determined by manually designed rules that could not adapt to varying image content; resistance to combined attacks was poor; and most methods required the original image or side information for extraction, making them unsuitable for blind watermarking scenarios [17,18]. In summary, traditional watermarking methods are limited by hand-crafted features and lack adaptability, which gradually pushed the community toward data-driven solutions.

2.2. Deep Learning-Based Watermarking

A major breakthrough occurred in 2018 when Zhu and colleagues proposed the HiDDeN framework [5], marking the beginning of the deep learning era for blind image watermarking. Unlike previous approaches that relied on hand-crafted features, HiDDeN adopted an end-to-end architecture with four jointly trained components: an encoder that fuses the watermark with the cover image, a noise layer that simulates various attacks, a decoder that blindly extracts the watermark, and a discriminator that preserves visual quality through adversarial training. This design achieved a collaborative balance between invisibility and robustness and quickly became a classic benchmark [5].

Subsequent research has focused on pushing HiDDeN’s performance further. Liu and others proposed a two-stage training framework called TSDL, which enhanced the encoder’s feature representation by decoupling the embedding and extraction processes [6]. To improve robustness against JPEG compression, Jia and colleagues introduced the SE attention mechanism into the encoder within the MBRS model, combined with a differentiable JPEG simulation layer [7]. Fang and others developed the PIMoG framework, which added a screen-capture noise layer to mimic real-world distortions, thereby improving robustness in practical scenarios [8]. In addition, watermarking techniques are also employed for the copyright protection of deep neural networks [19]. Zero-watermarking with deep learning and encryption is another emerging direction [20].

Despite these advances, even state-of-the-art deep watermarking models share common technical shortcomings. The decoder typically remains a simple convolutional structure with limited capacity for global feature modeling. More importantly, the encoder’s feature fusion method is often fixed and lacks adaptive optimization.

2.3. Attention Mechanisms in Watermarking

Parallel to watermarking research, attention mechanisms have emerged as a powerful technique in computer vision. Hu and colleagues proposed the squeeze-and-excitation (SE) channel attention module, which adaptively reweights channel features via global pooling and fully connected layers, strengthening informative features while suppressing irrelevant ones at very low computational cost [11]. In the domain of multi-scale feature fusion, Dai and colleagues introduced the attentional feature fusion (AFF) module, which overcomes the limitations of simple concatenation or addition by using a dual-branch structure that combines local multi-scale convolution and global context in parallel, together with a gating mechanism for adaptive fusion [21]. The work systematically studied multi-scale feature fusion and proposed to dynamically fuse features of different scales through an attention mechanism, effectively avoiding the drawback of blindly “adding” multi-scale information together.

In the specific context of digital watermarking, Wang and others combined an improved MA-FFM module with SE attention to build a hybrid neural network, confirming that attention mechanisms can enhance watermark robustness [12]. Other attention-based watermarking methods, such as DARI-Mark [22], further demonstrate the effectiveness of attention mechanisms in improving embedding accuracy and robustness. Inspired by this line of work, our paper focuses on robustness under various typical attacks—including Gaussian noise, salt-and-pepper noise, and brightness/contrast adjustments—and aims to improve the attention mechanism, specifically the MA-FFM multi-scale fusion module and SE channel attention, to boost watermark embedding accuracy, extraction efficiency, and training convergence speed. This provides a new direction for the lightweight optimization of the HiDDeN model. In conclusion, while attention mechanisms have been introduced into watermarking, their synergistic placement and the interpretability of dynamic gating have not been explored. Our work fills this gap.

3. Materials and Methods

3.1. Overall Framework Design

The encoder takes the cover image and the secret message as input. It first extracts multi-scale features through several convolutional layers, then adaptively fuses them with the message using the proposed MA-FFM module. The rationale is to learn a content-aware embedding that prioritizes texture-rich regions, thereby balancing robustness and imperceptibility.
During training, we replace the noise layer with an identity mapping. This deliberate choice isolates the pure contribution of the attention modules; any robustness gain can be attributed solely to the attention mechanism, not to overfitting to specific distortions. Generalization is still verified on unseen attacks during testing.
The decoder receives the possibly distorted watermarked image and employs SE channel attention modules to selectively enhance watermark-related features. The design rationale is that channel attention can adaptively reweight feature maps, making the decoder focus on informative channels and thus improve extraction accuracy under noise.
A discriminator is introduced in an adversarial training setup. It distinguishes original cover images from watermarked ones, forcing the encoder to produce visually imperceptible outputs. This adversarial component helps preserve high perceptual quality without sacrificing watermark embedding strength.

As illustrated in Figure 1, this paper builds upon the HiDDeN framework [5]. In the original design, the encoder takes a cover image and a secret message as input and produces an encoded image that visually resembles the cover. The decoder then recovers the secret message from this encoded image. An optional noise layer placed between the encoder and the decoder simulates various real-world distortions to improve robustness.

Given the limitations of the HiDDeN model discussed above—such as its single-scale feature fusion and the limited accuracy of watermark embedding and extraction—this paper focuses on isolating and validating the contribution of two attention mechanisms: MA-FFM and SE. To achieve this, we deliberately replace the noise layer with an identity mapping during training, meaning no distortion is added to the encoded image. This design eliminates noise-induced robustness gains, ensuring that any performance improvement comes solely from the feature selection ability of the attention modules. As a result, we can more accurately assess the effectiveness of our proposed enhancements [23].

It is important to note that this training strategy does not imply the model is weak against distortions. Although no simulated attacks are used during training, we still evaluate the model under various attacks during the testing stage, including JPEG compression, Gaussian noise, salt-and-pepper noise, and brightness/contrast adjustments. This allows us to test the model’s generalization robustness against previously unseen distortions. The fact that the model achieves lower BER than HiDDeN under these attacks—despite never seeing them during training—further confirms that the attention mechanisms themselves improve robustness, rather than the model simply overfitting to a specific noise layer.

In addition, we keep the adversarial training setup between the discriminator and the encoder. The discriminator learns to tell apart original cover images from watermarked ones, which in turn pushes the encoder to produce more realistic outputs. This adversarial mechanism is retained to preserve the visual quality of the watermarked images, thus offering a stable basis for evaluating the attention modules.

3.2. Encoder Improvement: MA-FFM Module

To address the key limitations mentioned above (the encoder’s single-scale feature extraction, its inability to capture local features beyond a fixed scale, and the lack of global context guidance), we draw inspiration from the work of Dai et al. and design the adaptive multi-scale feature fusion module (MA-FFM) [21]. Similar ideas of combining attention with multi-scale features have also been explored in image watermarking [24]. This module is specifically intended to enhance the encoder’s ability to jointly model multi-scale features. Its structure is shown in Figure 2.

MA-FFM Operation Details

The input cover feature map $X = F_{co} \in \mathbb{R}^{C \times H \times W}$ flows into the MA-FFM module and is processed in four stages, split into preliminary preprocessing and dual-branch parallel feature extraction.

Two cascaded $1 \times 1$ convolutions are first applied for preliminary feature optimization. The first $1 \times 1$ convolution removes redundant feature information and unifies the channel dimension of the input features; the second $1 \times 1$ convolution further adjusts the channel distribution, laying a consistent feature base for the subsequent dual-branch computation. The preprocessed feature is then split into two independent parallel branches: the local multi-scale spatial branch and the global channel context branch.

Local Branch Multi-scale Spatial Feature Extraction

Two groups of parallel depth-wise separable convolution stacks are deployed for multi-scale local feature extraction: each group contains side-by-side $3 \times 3$ and $5 \times 5$ depth-wise separable convolutions. The feature maps from the convolution kernels are concatenated along the channel dimension, and the concatenated multi-scale feature tensors are further fused, followed by a final $1 \times 1$ convolution for channel calibration and compression to derive the local multi-scale feature $L(X)$.

Global Branch Global Channel Attention Extraction

Global average pooling (GAP) is first performed to compress the spatial dimensions of the preprocessed features into a channel-wise descriptor $Z \in \mathbb{R}^{C}$. Afterwards, two successive $1 \times 1$ convolutions combined with Sigmoid activation transform and refine the pooled channel vector, generating the global channel attention descriptor $G(X)$.

The obtained local feature $L(X)$ and global feature $G(X)$ are fused via broadcast element-wise addition ($\oplus$) to unify their spatial and channel dimensions. The fused feature is then fed into the Sigmoid activation function $\sigma(\cdot)$ to normalize its values into [0, 1], producing the pixel-level adaptive attention weight mask $\sigma(L(X) \oplus G(X))$. In this spatial mask, texture-rich regions are assigned higher weights, while flat, smooth regions receive lower weights.

Element-wise multiplication ($\otimes$) is then applied between the original input feature $X$ and the generated attention weight mask, and the enhanced output feature is formulated as

$$X' = X \otimes \sigma(L(X) \oplus G(X)) \tag{1}$$

Here, $\sigma$ represents the Sigmoid function, which normalizes the fusion weights to the range [0, 1]; $\oplus$ denotes broadcast addition, which aligns the dimensions of the local and global features; and $\otimes$ represents element-wise multiplication, which weights the original features by the fusion weights. To address the simple and ineffective feature fusion of the HiDDeN encoder, we embed the MA-FFM module into the feature fusion stage of the encoder, replacing the original simple concatenation operation. This achieves more efficient and more targeted adaptive fusion of image features and secret message features, providing complete multi-scale feature support for precise watermark embedding. The dynamic gate weight is a scalar per image; after Softmax, the local and global weights sum to 1. It is neither a per-channel average nor a spatial map.
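As a rough illustration of Equation (1), the sketch below applies the mask $\sigma(L(X) \oplus G(X))$ to a toy feature map using plain Python lists. It assumes that $L(X)$ and $G(X)$ have already been computed by the two branches, so the convolutions and the dynamic gate are not modeled; the function name `ma_ffm_fuse` and the toy values are hypothetical and not taken from the authors' implementation.

```python
import math


def sigmoid(v):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def ma_ffm_fuse(x, local_feat, global_desc):
    """Compute X' = X * sigmoid(L(X) + G(X)) as in Eq. (1).

    x, local_feat: nested lists shaped [C][H][W].
    global_desc:   list of length C (one value per channel), broadcast
                   over the spatial positions of that channel.
    """
    out = []
    for c, (x_c, l_c) in enumerate(zip(x, local_feat)):
        g_c = global_desc[c]  # broadcast addition: same value for every pixel
        out.append([
            [x_val * sigmoid(l_val + g_c) for x_val, l_val in zip(x_row, l_row)]
            for x_row, l_row in zip(x_c, l_c)
        ])
    return out


# Toy input (hypothetical values): C = 1, H = 1, W = 2.
x = [[[2.0, 2.0]]]
local_feat = [[[0.0, 4.0]]]   # second pixel has a stronger local response
global_desc = [0.0]
fused = ma_ffm_fuse(x, local_feat, global_desc)
```

In this toy case the pixel with the larger local response keeps more of its original value, which matches the statement that texture-rich regions receive higher mask weights.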

3.3. Decoder Improvement: SE Attention Module

The decoder is the crucial component of watermark extraction and directly determines the recovery accuracy of the secret message: its core task is to accurately restore the embedded secret message from a watermarked image that may have been subjected to noise attacks. However, the decoder of the original HiDDeN adopts a pure convolutional structure. Although it can effectively extract local features, it has a significant weakness: the feature responses of all channels are treated as equally important, so it cannot selectively enhance the key channels carrying the watermark signal. As a result, a large number of redundant channels interfere with watermark extraction and reduce the recovery accuracy [5].

To address this issue and enhance the decoder’s sensitivity to watermark signals, we draw on squeeze-and-excitation networks and introduce the SE attention module into the decoder. By explicitly modeling the interdependence between channels, it adaptively recalibrates the channel feature responses, achieving the goal of “strengthening key channels and suppressing redundant channels” [11]. The SE module has already shown clear benefits in the watermarking field, providing a reliable reference for our improvement: MBRS extensively introduces SE blocks in the encoder to enhance the feature extraction ability [7]; PIMoG uses the attention mechanism to enable the network to better understand the preprocessed features [8]; and SE blocks have also been integrated with DWT and CNN for color image watermarking [25]. Cross-attention mechanisms have also been introduced for robust watermark extraction [26]. Different from existing studies, this paper introduces the SE module into the decoder, focusing on its enhancement effect on watermark extraction accuracy and filling the research gap in channel attention optimization of the decoder.

Module structure: The SE module consists of two core operations—squeeze and excitation, as shown in Figure 3. Its principle is directly based on the description in the original paper.

Squeeze operation: “To leverage the dependencies between channels, we first treat the signal of each channel as a set of local descriptors. Through global average pooling, an aggregated spatial information descriptor is generated.” Formally, given the input feature map $U \in \mathbb{R}^{C \times H \times W}$, the squeeze operation generates the channel descriptor $z \in \mathbb{R}^{C}$, whose $c$-th element is

$$z_{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_{c}(i, j) \tag{2}$$

Excitation operation: “To utilize the aggregated information from the compression operation, we perform a second operation aimed at fully capturing the dependencies between channels. This operation must meet two criteria: First, it must be flexible. Second, it must learn the nonlinear interactions between channels. We chose a simple gated mechanism with Sigmoid activation” [11]. This gated mechanism is expressed as

$$s = \sigma(W_{2} \, \delta(W_{1} z)) \tag{3}$$

Here, $\delta$ represents the ReLU activation function, $\sigma$ represents the Sigmoid function, $W_{1} \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_{2} \in \mathbb{R}^{C \times \frac{C}{r}}$ are the learnable weight matrices of two fully connected layers, and $r$ is the dimensionality reduction ratio. In our implementation, we set the reduction ratio $r = 16$.

Feature re-estimation: “The output of the activation operation is then used to rescale the input feature map $U$ through a per-channel multiplication” [11]. That is:

$$\tilde{U} = s \odot U = [s_{1} U_{1}, s_{2} U_{2}, \dots, s_{C} U_{C}] \tag{4}$$

Here, $\odot$ represents channel-wise multiplication: every element of channel $U_{c}$ is scaled by the scalar $s_{c}$.
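Equations (2) to (4) can be traced on a tiny example. The following is a minimal sketch in plain Python, assuming bias-free fully connected layers and a toy configuration with $C = 2$ and $r = 2$ (the paper uses $r = 16$); the function name `se_block` and all weight values are hypothetical.

```python
import math


def se_block(u, w1, w2):
    """Squeeze-and-excitation on a feature map u shaped [C][H][W].

    w1: [C/r][C] weights of the first fully connected layer.
    w2: [C][C/r] weights of the second fully connected layer.
    Returns the channel weights s and the rescaled feature map.
    """
    # Squeeze, Eq. (2): global average pooling per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation, Eq. (3): s = sigmoid(W2 * relu(W1 * z)).
    hidden = [max(0.0, sum(w * z_c for w, z_c in zip(row, z))) for row in w1]
    s = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Rescaling, Eq. (4): multiply every element of channel c by s[c].
    u_tilde = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]
    return s, u_tilde


# Toy input (hypothetical values): C = 2, H = 1, W = 2, reduction ratio r = 2.
u = [[[1.0, 3.0]], [[0.0, 0.0]]]
w1 = [[1.0, 0.0]]
w2 = [[1.0], [-1.0]]
s, u_tilde = se_block(u, w1, w2)
```

With these weights the first channel is emphasized and the second is suppressed, which is the "strengthen key channels, suppress redundant channels" behavior described above.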

Integration method: We seamlessly integrate the SE block after each convolutional layer of the decoder, forming a basic building block of “convolution → batch normalization → ReLU → SE block”. The decoder is composed of multiple such units alternately stacked, and finally outputs the predicted L-bit message bits through a fully connected layer.

Theoretical basis: As described in squeeze-and-excitation networks, “in the filter responses learned by the convolutional kernels, the activations of many channels are highly redundant [11]. By explicitly modeling the interdependence between channels, the network’s representational ability can be significantly enhanced.” In the watermark decoding task, the response intensities of different channels to the watermark signal vary significantly. The SE module, through the aforementioned adaptive recalibration mechanism, enables the decoder to “focus” on the key feature channels carrying the watermark information, thereby improving the accuracy of message recovery under noise attacks.

3.4. Loss Function and Training Strategy

The optimization objective of this model is a weighted sum of three components: the decoding loss $L_{dec}$, the encoding loss $L_{enc}$, and the adversarial loss $L_{adv}$. The total loss function is defined as follows:

$$L = \lambda_{dec} L_{dec} + \lambda_{enc} L_{enc} + \lambda_{adv} L_{adv} \tag{5}$$

Decoding loss ($L_{dec}$): To ensure the high-fidelity restoration of the secret message from potentially attacked images, we employ binary cross-entropy loss. In HiDDeN, “message recovery is regarded as a binary classification problem, and we use cross-entropy loss to maximize the recovery accuracy of the message bits” [5]. This loss is defined as

$$L_{dec} = -\frac{1}{L} \sum_{i=1}^{L} \left[ M_{i} \log(p_{i}) + (1 - M_{i}) \log(1 - p_{i}) \right] \tag{6}$$

Here, $L$ is the message length, $M_{i}$ is the $i$-th bit of the embedded message, and $p_{i}$ represents the probability that the decoder predicts the $i$-th bit of the message to be 1.
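To make Equation (6) concrete, the sketch below evaluates the binary cross-entropy for a short message in plain Python. The clamping constant `eps` is an assumption added here for numerical safety and is not mentioned in the paper; the function name and the example probabilities are hypothetical.

```python
import math


def decoding_loss(message_bits, probs, eps=1e-12):
    """Binary cross-entropy of Eq. (6), averaged over the message bits.

    message_bits: ground-truth bits M_i (0 or 1).
    probs:        predicted probabilities p_i that bit i equals 1.
    eps:          clamp to avoid log(0); an assumption of this sketch.
    """
    total = 0.0
    for m, p in zip(message_bits, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += m * math.log(p) + (1 - m) * math.log(1.0 - p)
    return -total / len(message_bits)


bits = [1, 0, 1, 1]
# A decoder that is confident and correct on every bit.
confident = decoding_loss(bits, [0.9, 0.1, 0.9, 0.9])
# A decoder that outputs 0.5 everywhere, i.e., a coin flip.
uncertain = decoding_loss(bits, [0.5, 0.5, 0.5, 0.5])
```

The confident decoder obtains a much smaller loss than the coin-flip decoder, whose loss equals $\log 2$.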

Encoding loss ($L_{enc}$): To minimize the visual difference between the original image and the watermarked image, we use the mean squared error (MSE). As stated in HiDDeN, “We use the MSE loss between the encoder output and the cover image to penalize pixel-level differences” [5]. This loss is defined as

$$L_{enc} = \frac{1}{N} \sum_{i=1}^{N} \left( I_{co}^{(i)} - I_{en}^{(i)} \right)^{2} \tag{7}$$

Adversarial loss ($L_{adv}$): To further enhance the perceptual authenticity of the watermarked image, we introduce a discriminator network and adopt the standard GAN loss for adversarial training. Following the training objective defined by Goodfellow et al. [27], the generator $G$ and the discriminator $D$ engage in a minimax game on the value function $V(D, G)$:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z)))] \tag{8}$$

In our framework, the encoder $E$ acts as the generator, receiving the cover image $I_{co}$ and the message $M$ as inputs; the discriminator $D$ attempts to distinguish the original image from the image generated by the encoder. Therefore, the adversarial loss is specifically defined as

$$L_{adv} = \mathbb{E}_{I_{co}} [\log D(I_{co})] + \mathbb{E}_{I_{co}, M} [\log(1 - D(E(I_{co}, M)))] \tag{9}$$

where $\mathbb{E}$ denotes expectation. The encoder is trained to minimize $L_{adv}$, thereby making the distribution of the generated watermarked images more similar to that of the original images.
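The direction of this objective can be checked with a small numerical sketch of Equation (9), where the expectations are replaced by batch averages over discriminator outputs. The discriminator scores below are made-up values, and `adversarial_value` is a hypothetical helper rather than part of the authors' code.

```python
import math


def adversarial_value(d_cover, d_encoded):
    """Batch estimate of Eq. (9).

    d_cover:   discriminator outputs D(I_co) on cover images, in (0, 1).
    d_encoded: discriminator outputs D(E(I_co, M)) on watermarked images.
    """
    real_term = sum(math.log(p) for p in d_cover) / len(d_cover)
    fake_term = sum(math.log(1.0 - p) for p in d_encoded) / len(d_encoded)
    return real_term + fake_term


# Case 1: the discriminator spots the watermarked images (low scores for them).
detected = adversarial_value(d_cover=[0.9, 0.9], d_encoded=[0.1, 0.1])
# Case 2: the watermarked images look like covers to the discriminator.
fooled = adversarial_value(d_cover=[0.9, 0.9], d_encoded=[0.9, 0.9])
```

The value is lower when the discriminator is fooled, so an encoder that minimizes $L_{adv}$ is pushed toward watermarked images that the discriminator cannot tell apart from covers.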

Loss weights: Through experimental debugging, we set the weights of the three loss terms as $\lambda_{dec} = 2.5$, $\lambda_{enc} = 0.7$, and $\lambda_{adv} = 0.01$. This allocation prioritizes decoding accuracy while also taking into account image quality. The adversarial loss weight is relatively low, aiming to prevent it from dominating the optimization process in the early training stage. This is consistent with the observation in MBRS: “Adversarial training is used to assist in optimizing image quality, rather than dominating the training process” [28].
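A minimal sketch of how the weighted objective in Equation (5) combines the three terms with the weights stated above is given below. The individual loss values (0.4 for decoding, 0.7 for adversarial) and the pixel values are placeholders chosen for illustration only, and the helper names are hypothetical.

```python
def mse_loss(cover, encoded):
    """Mean squared error of Eq. (7) over flattened pixel values."""
    return sum((c - e) ** 2 for c, e in zip(cover, encoded)) / len(cover)


def total_loss(l_dec, l_enc, l_adv, w_dec=2.5, w_enc=0.7, w_adv=0.01):
    """Weighted sum of Eq. (5) with the loss weights reported in the paper."""
    return w_dec * l_dec + w_enc * l_enc + w_adv * l_adv


# Placeholder pixel values in [-1, 1]; two of four pixels differ by 0.1.
l_enc = mse_loss([0.0, 0.5, -0.5, 1.0], [0.1, 0.5, -0.4, 1.0])
# Placeholder decoding and adversarial loss values.
loss = total_loss(l_dec=0.4, l_enc=l_enc, l_adv=0.7)
```

With these placeholder values the decoding term contributes 1.0 to the total, the encoding term 0.0035, and the adversarial term 0.007, which shows how strongly the chosen weights favor decoding accuracy.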

Optimizer and training strategy: We use the Adam optimizer for parameter updates. Kingma and Ba describe Adam as a stochastic optimization method that requires only first-order gradients, has low memory requirements, and is well suited to problems with large amounts of data and parameters [29]. The learning rates for the encoder and decoder are set to $2 \times 10^{-4}$, and the learning rate for the discriminator is set to $1 \times 10^{-4}$. The batch size is set to 8, and the total number of training epochs is 20. Loss convergence was observed at the 10th epoch. To stabilize adversarial training, we use gradient clipping with a maximum norm of 0.5. As Hu et al. stated, “Gradient clipping can prevent the discriminator gradient from being too large, which leads to training crashes” [11]. All experiments were completed on a single NVIDIA GeForce RTX 4060 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA).
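The idea behind norm-based gradient clipping with a maximum norm of 0.5 can be sketched as follows. This is a simplified plain-Python illustration on a flat list of gradient values, not the routine used in the authors' training code; the function name `clip_grad_norm` is hypothetical.

```python
import math


def clip_grad_norm(grads, max_norm=0.5):
    """Rescale a gradient vector so that its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


# A large gradient (norm 5.0) is scaled down to norm 0.5.
clipped, norm_before = clip_grad_norm([3.0, 4.0])
# A small gradient (norm 0.3) passes through unchanged.
small, _ = clip_grad_norm([0.3, 0.0])
```

Clipping preserves the direction of the gradient and only limits its length, so a single oversized update cannot destabilize the adversarial training.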

4. Results

This section presents systematic experiments to verify the effectiveness of the proposed dual-attention collaborative structure and dynamic feature fusion mechanism. All experiments follow the principle of controlled variable comparison. Under identical training settings, we compare the performance of the baseline model with that of our improved model. We evaluate from four perspectives: quantitative metrics, ablation analysis, convergence efficiency, and interpretability. The results show that our method effectively addresses the core issues of the original HiDDeN watermarking approach, namely low accuracy, weak feature representation, and slow convergence.

4.1. Experimental Setup and Evaluation Metrics

We use the COCO 2017 dataset [1]. Following common practice in deep learning-based watermarking research, we train for 20 epochs using 10,000 images, validate on 5000 images, and test on 5000 images. All images are resized to 128 × 128 pixels. The secret message is a randomly generated 30-bit binary sequence.

Watermark decoding accuracy is measured by the bit error rate (BER), which is the proportion of incorrect bits out of the total bits:

$$\mathrm{BER} = \frac{1}{L} \sum_{i=1}^{L} \mathbf{1}[m_{i} \neq \hat{m}_{i}] \tag{10}$$

where $L$ is the message length, $m_{i}$ and $\hat{m}_{i}$ are the original and extracted message bits, respectively, and $\mathbf{1}[\cdot]$ is the indicator function. A lower BER indicates stronger robustness. The visual quality of the watermarked image is evaluated using peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). PSNR is defined as

$$\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}^{2}}{\mathrm{MSE}} \right) \tag{11}$$

with MAX = 2 (the image values are normalized to [−1, 1]) and MSE the mean squared error between the original and watermarked images. SSIM is computed with a window size of 11 according to the standard definition [30]. Higher PSNR and SSIM values mean better imperceptibility [31].
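The two formulas above can be evaluated directly, as in the following plain-Python sketch. It operates on flattened lists of bits and pixel values; the function names and the toy inputs are hypothetical, and SSIM is omitted because its windowed computation is beyond a short example.

```python
import math


def bit_error_rate(original, extracted):
    """BER of Eq. (10): fraction of message bits that differ."""
    errors = sum(1 for m, m_hat in zip(original, extracted) if m != m_hat)
    return errors / len(original)


def psnr(cover, watermarked, max_range=2.0):
    """PSNR of Eq. (11) in dB, with MAX = 2 for images scaled to [-1, 1]."""
    mse = sum((c - w) ** 2 for c, w in zip(cover, watermarked)) / len(cover)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_range ** 2 / mse)


# Two of five bits are wrong.
ber = bit_error_rate([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
# Every pixel is perturbed by 0.02, so MSE = 0.0004.
quality = psnr([0.0, 0.0, 0.0, 0.0], [0.02, -0.02, 0.02, -0.02])
```

In this toy case the BER is 0.4, and the PSNR is 40 dB because $\mathrm{MAX}^{2} / \mathrm{MSE} = 4 / 0.0004 = 10^{4}$.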

We also report the Learned Perceptual Image Patch Similarity (LPIPS) using an AlexNet backbone. The average LPIPS of our proposed model is 0.0267, while that of the baseline HiDDeN is 0.0422; lower is better, indicating that our method introduces less perceptual distortion.

We select the original HiDDeN model as the main baseline. In addition, we conduct ablation experiments to separately verify the effectiveness of the MA-FFM module and the SE attention module. All compared models are trained and tested under the same dataset, image size, and message length to ensure a fair comparison [32].

4.2. Training Parameters

The model is implemented using the PyTorch 2.0.1 framework and trained on a single NVIDIA GeForce RTX 4060 Laptop GPU. We use the Adam optimizer [29]. The learning rate for the encoder–decoder is set to 2 × 10⁻⁴, and the learning rate for the discriminator is 1 × 10⁻⁴. Due to GPU memory limits, the batch size is set to eight. The total number of training epochs is 20. To ensure stable training, we apply gradient clipping with a maximum norm of 0.5.
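As a rough illustration of what gradient clipping with a maximum norm of 0.5 does, the sketch below rescales a set of gradients so that their joint L2 norm does not exceed the limit. It is written in plain Python under the assumption that clipping is applied to the global norm over all parameters (the behavior of PyTorch's `torch.nn.utils.clip_grad_norm_`); the text does not state which routine the authors used, and the helper name and toy gradients are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm=0.5, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Gradients already inside the limit are left unchanged (scale capped at 1).
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Toy gradients for two parameter tensors; joint norm = sqrt(3^2 + 4^2) = 5
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

# A small gradient is not modified
small, _ = clip_by_global_norm([[0.1, 0.0]])
```

The direction of the update is preserved; only its magnitude is limited, which is why clipping helps stabilize adversarial training without changing which way the parameters move.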

The total loss is a weighted sum of three components:

Decoder loss with weight 2.5;
Encoder loss with weight 0.7;
Adversarial loss with weight 0.01.
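The weighted sum above can be written as a short sketch. The weights are taken from the list; the three loss values passed in are placeholder numbers chosen for illustration, and the function and dictionary names are assumptions rather than the authors' code.

```python
# Loss weights as listed in the text
LOSS_WEIGHTS = {"decoder": 2.5, "encoder": 0.7, "adversarial": 0.01}

def total_loss(decoder_loss, encoder_loss, adversarial_loss, weights=LOSS_WEIGHTS):
    """Weighted sum of the three loss components."""
    return (weights["decoder"] * decoder_loss
            + weights["encoder"] * encoder_loss
            + weights["adversarial"] * adversarial_loss)

# Placeholder values: 2.5 * 0.20 + 0.7 * 0.01 + 0.01 * 0.69
example = total_loss(decoder_loss=0.20, encoder_loss=0.01, adversarial_loss=0.69)
```

With these weights, the message-recovery term dominates the objective, while the adversarial term acts only as a light regularizer.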

We also evaluated the computational cost. Table 1 compares the parameter count, FLOPs, and inference time for a single 128 × 128 image of the baseline HiDDeN and our proposed model.

4.3. Comparison Method

As described in Section 4.1, the original HiDDeN model serves as the main baseline, and the ablation variants (MA-FFM only, SE only, and fixed gating) are trained and tested with the same dataset, image size, and message length.

Table 2 reports the quantitative results for each configuration. The proposed dynamic gating model achieves the best overall performance. Compared with the original HiDDeN model, it reduces the bit error rate (BER) from 0.1696 to 0.1538, a relative decrease of approximately 9.3%, while the peak signal-to-noise ratio (PSNR) increases by 0.61 dB. Notably, introducing the SE module alone improves PSNR (31.26 dB) but raises the BER to 0.1887, whereas introducing MA-FFM alone lowers the BER to 0.1667 but reduces PSNR to 29.99 dB. This indicates that the dynamic gating mechanism plays a key role in balancing the two objectives.

Table 3 compares the BER of the baseline and the proposed model under different attacks. Under JPEG compression, Gaussian noise, salt-and-pepper noise, and brightness/contrast adjustment, the BER of the proposed model is lower than that of the original HiDDeN model, indicating that the dual-attention collaborative mechanism enhances robustness. The proposed model also obtains a lower BER under color jitter, Gaussian blur, random crop, and resize, which supports the robustness of the dual-attention design against differentiable noise layers.
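To make the simpler attacks concrete, the following sketch applies Gaussian noise, salt-and-pepper noise, a brightness shift, and a contrast change to pixels normalized to [−1, 1], using the parameter values listed in Table 3. The implementation details are assumptions made for illustration: results are clamped to [−1, 1], contrast is scaled around zero (mid-gray on this scale), and all function names are hypothetical. The authors' noise layers may differ.

```python
import random

def clamp(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))

def gaussian_noise(pixels, sigma=0.01, rng=random):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [clamp(p + rng.gauss(0.0, sigma)) for p in pixels]

def salt_and_pepper(pixels, density=0.002, rng=random):
    """Replace a fraction `density` of pixels with the extremes -1 or +1."""
    out = []
    for p in pixels:
        if rng.random() < density:
            out.append(rng.choice((-1.0, 1.0)))
        else:
            out.append(p)
    return out

def brightness(pixels, delta=0.02):
    """Shift every pixel by a constant offset."""
    return [clamp(p + delta) for p in pixels]

def contrast(pixels, factor=0.98):
    """Scale pixel values around zero (assumed mid-gray for [-1, 1] data)."""
    return [clamp(p * factor) for p in pixels]

rng = random.Random(0)              # fixed seed for repeatability
image = [0.0] * (128 * 128)         # flat gray 128 x 128 image as a pixel list
noisy = gaussian_noise(image, rng=rng)
speckled = salt_and_pepper(image, rng=rng)
```

At a density of 0.002, only a few dozen of the 16,384 pixels are altered on average, which helps explain why the BER under salt-and-pepper noise in Table 3 stays close to the no-attack value.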

Figure 4 shows the watermarked images generated under different configurations, together with their residuals. To the naked eye, the images produced by all methods are highly consistent with the original cover image. The enlarged residuals, however, show that the perturbations produced by the dynamic gating model are more uniform and subtle: they largely avoid smooth areas and concentrate the watermark energy in texture details. This is consistent with the SSIM of 0.9077 reported in Table 2, which indicates that the method better preserves the structural integrity of the image.

Figure 5 shows the amplified residual maps (×5) corresponding to Figure 4. The residuals of the baseline HiDDeN exhibit noticeable artifacts even in smooth regions, whereas the proposed dynamic gating model produces more uniform and subtle perturbations that are largely confined to textured areas. This further demonstrates the superior imperceptibility of our method.

Figure 6 visualizes the residual maps (scaled by ×5) under six common distortions: JPEG compression (Q = 85 and Q = 75), Gaussian blur, resize, random crop (90% area then resize), and color jitter. Even under these strong attacks, residuals remain highly localized to high-texture regions, while smooth areas show no noticeable artifacts. This observation further confirms the imperceptibility and robustness of our proposed method.

Figure 7 shows the BER curves of each model on the validation set. The original HiDDeN model tends to stabilize around the 10th epoch, while the dynamic gating model maintains the lowest BER trajectory throughout training with smaller fluctuations, indicating more efficient and more stable learning.

Figure 8 illustrates the interpretability of the dynamic gating mechanism. The local path weight (local gate) gradually increases from an initial value of approximately 0.51 and eventually stabilizes at approximately 0.89, while the global path weight decreases accordingly. This behavior suggests that, in an ideal channel without attacks, accurate recovery of the watermark depends strongly on pixel-level local multi-scale features. Through adaptive learning, the model shifts its attention toward fine textures, which reduces the bit error rate without compromising visual quality.
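The gate evolution described above can be sketched with a simplified two-branch fusion. The sketch assumes a single scalar gate g obtained from a sigmoid, with the local branch weighted by g and the global branch by 1 − g; the gating network in the paper is learned and may be input- or channel-dependent, so the scalar logit, the function names, and the toy feature vectors here are illustrative assumptions only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(local_feat, global_feat, gate_logit):
    """Blend two equally sized feature vectors with weights g and 1 - g."""
    g = sigmoid(gate_logit)
    fused = [g * l + (1.0 - g) * gl for l, gl in zip(local_feat, global_feat)]
    return fused, g

local_feat = [1.0, 2.0, 3.0]
global_feat = [0.0, 0.0, 0.0]

# A logit near 0 gives an almost equal split (g close to 0.51)
_, g_start = gated_fusion(local_feat, global_feat, gate_logit=0.04)
# A logit near 2.09 gives a local-dominated mix (g close to 0.89)
fused_end, g_end = gated_fusion(local_feat, global_feat, gate_logit=2.09)
```

Because the two weights sum to one, an increase of the local gate from about 0.51 to about 0.89 implies that the global branch contributes only about 11% of the fused feature at the end of training.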

4.4. Capacity Scaling

To evaluate capacity scalability, we trained the baseline and our model with 64-bit and 100-bit messages under the same no-attack setting. Table 4 reports the results.

4.5. Generalization on Additional Datasets

To further evaluate cross-dataset generalization, we directly applied our COCO-trained model to the DIV2K dataset and the Mini-ImageNet dataset in a zero-shot manner. All images were resized to 128 × 128. The results are summarized in Table 5.

5. Discussion

The quantitative analysis presented in Table 2 shows that the proposed model, which integrates the SE module, MA-FFM, and dynamic gating, outperforms the original HiDDeN baseline [5]. With a lower bit error rate (BER) of 0.1538 and a higher PSNR of 31.59 dB, the model better balances watermark extraction accuracy with visual fidelity. These results indicate that including the SE module on the decoder side allows explicit modeling of inter-channel relationships, emphasizing key feature channels while suppressing irrelevant noise [11]. However, the ablation study reveals that the efficacy of the SE module depends strongly on its interaction with local features. Specifically, when the MA-FFM is removed (SE Only), the BER rises to 0.1887, the highest among all tested configurations. This degradation supports the theoretical concern that global average pooling (GAP) may erase spatial details essential for watermark recovery [11], and it suggests that the SE module must be complemented by local texture descriptors to be effective.
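The concern about global average pooling can be illustrated with a toy example: two feature maps that place the same activation in different spatial positions produce exactly the same squeeze descriptor. This is a minimal sketch of the squeeze step only (the excitation layers are omitted), and the 4 × 4 maps and function name are made up for illustration.

```python
def global_average_pool(feature_map):
    """Squeeze step: average one channel's H x W map to a single scalar."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

# Two 4 x 4 channels with the same total activation but different layouts
top_left = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
bottom_right = [[0, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 1, 1]]

z_a = global_average_pool(top_left)
z_b = global_average_pool(bottom_right)
same_descriptor = (z_a == z_b)   # both equal 4 / 16
```

Since the channel weights are computed from these scalars alone, an SE block cannot distinguish where in the image a response occurred, which is consistent with the need for a spatially aware companion such as MA-FFM.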

The advantage of the dynamic gating mechanism is further highlighted by comparison with the fixed-weight fusion approach. The fixed-gating configuration (0.5:0.5 ratio) yields a BER of 0.1706, which is slightly higher than the 0.1696 of the original HiDDeN baseline. This suggests a form of "adversarial cancellation," in which the global optimization goals of the SE module and the local enhancement objectives of the MA-FFM conflict, leading to a performance bottleneck [21]. Fixed equal weighting forces the model to treat local and global features as equally important throughout training, which prevents the network from adapting to the dominance of local features in the later stage. In contrast, the proposed dynamic gating learns the fusion weights from data (evolving from 0.51 to 0.89) and thereby avoids this conflict. Instead of fluctuating randomly, the gating weights follow a structured path that increasingly prioritizes local multi-scale features in attack-free scenarios, reflecting their decisive role in decoding precision.

As illustrated in Figure 8, this adaptive mechanism ensures that the model retains the benefits of channel recalibration without sacrificing critical spatial information, ultimately achieving a “win–win” in both BER and perceptual quality. Furthermore, while the MA-FFM alone without SE manages to reduce the BER to 0.1667, it results in a slight reduction in PSNR to 29.99 dB. This marginal loss in pixel-level accuracy is attributed to the amplification of high-frequency components, which, while beneficial for watermark carrying, increases the contrast between background noise and image details. Nevertheless, the stability of the SSIM index 0.9077 for the full model suggests that these artifacts are primarily confined to high-frequency bands that are less perceptible to the human visual system, making the trade-off highly acceptable within the framework of structured feature fusion [21].

Figure 9 shows the ablation study results as relative improvements of the proposed model over the different baselines. Compared with the original HiDDeN, the standalone modules (SE Only, MA-FFM Only), and fixed gating, the full model achieves consistent performance gains, most notably an 18.5% reduction in BER compared with SE Only.
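The relative BER reductions can be recomputed directly from the validation BER values in Table 2. The sketch below is a simple arithmetic check, not the authors' plotting code; the dictionary keys are shortened configuration names.

```python
# Validation BER at epoch 20, copied from Table 2
TABLE2_BER = {
    "Original HiDDeN": 0.169573,
    "SE Only": 0.188671,
    "MA-FFM Only": 0.166653,
    "Fixed Gating": 0.170601,
}
PROPOSED_BER = 0.153798

def relative_reduction(reference, proposed):
    """Relative BER reduction of the proposed model, in percent."""
    return 100.0 * (reference - proposed) / reference

reductions = {name: round(relative_reduction(ber, PROPOSED_BER), 1)
              for name, ber in TABLE2_BER.items()}
# reductions -> {'Original HiDDeN': 9.3, 'SE Only': 18.5,
#                'MA-FFM Only': 7.7, 'Fixed Gating': 9.8}
```

The 9.3% and 18.5% figures quoted in the text correspond to the first two entries; the remaining two follow from the same formula.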

Novelty beyond module combination. While MA-FFM and SE are existing modules, the novelty of this paper lies in three aspects: (i) the synergistic placement of MA-FFM in the encoder and SE in the decoder, which outperforms either module used alone; (ii) the interpretability provided by the evolution of the dynamic gate from 0.51 to 0.89; and (iii) the systematic analysis of capacity scaling from 30 to 100 bits and of robustness to differentiable noise layers, including color jitter, JPEG compression, and Gaussian blur. These contributions go beyond a mere combination of modules and offer new insights into deep watermarking.

Comparison with recent state-of-the-art methods. Direct experimental benchmarking against MBRS [7], TSDL [9], PIMoG [8] and DARI-Mark [22] under unified configurations is infeasible. Specifically, MBRS fixes 256 × 256 input resolution and square-number payload length; TSDL has no released source code; and PIMoG is specially optimized for screen-capture distortions inconsistent with our test set. For fair controlled experiments, HiDDeN [5] is adopted as the core baseline. We supplement cross-method analysis using performance values reported in the original literature.

Advantages over state-of-the-art methods. Compared to recent advanced watermarking methods such as MBRS [7], PIMoG [8], and CIN [2], our approach offers several distinct advantages. First, it is lightweight: with only 15.6 M parameters and 3.12 G FLOPs, it adds only 23% computational overhead over the HiDDeN [5] baseline while achieving 9.3% BER reduction. In contrast, CIN has roughly three times more parameters and significantly slower training. Second, our model is interpretable: the dynamic gating evolution (0.51 → 0.89) reveals for the first time how a watermarking model gradually shifts from global context to local multi-scale features. To the best of our knowledge, no other deep watermarking work provides such transparency. Third, we systematically evaluate capacity scalability (30/64/100 bits) and robustness to multiple differentiable noise layers (color jitter, JPEG, Gaussian blur), which are rarely explored together in previous works. While some SOTA methods may achieve higher PSNR or lower BER on specific metrics, our method focuses on a balanced trade-off among imperceptibility, robustness, interpretability, and computational efficiency.

Despite these advancements, certain limitations remain to be addressed in future research. Although the model exhibits robustness against standard distortions such as Gaussian noise and brightness adjustments [5], it encounters bottlenecks under extreme adversarial conditions, such as low-quality JPEG compression (e.g., Q = 50, where the BER increases to ≈0.48) or significant geometric cropping [33,34] (e.g., 50% random crop, where the BER reaches ≈0.35). Qualitative examples of these failure cases are shown in Figure 10. These failure cases indicate that the watermark is not fully recoverable under such extreme distortions. To overcome these limitations, we plan to incorporate differentiable JPEG simulation (e.g., Diff-JPEG) and stronger geometric augmentations (e.g., random crop, rotation, scaling) into the training pipeline in future work. Several recent works have attempted to address geometric distortions [23], yet this remains a challenging open problem. Due to the limited memory of a single GPU (8 GB) and time constraints, higher-resolution experiments (e.g., 256 × 256) are not included in this work and will be addressed in the future. These challenges often stem from the irreversible loss of high-frequency details, a common issue even in state-of-the-art watermarking frameworks. To further enhance the system’s reliability, future work will focus on the integration of differentiable noise layers to improve resistance against unseen distortions [28]. Additionally, we intend to establish more rigorous statistical benchmarks through large-scale, repeated experiments to ensure the reproducibility and generalizability of the proposed collaborative mechanism in diverse, non-stationary environments [35].

6. Conclusions

In this paper, we addressed the inherent limitations of the HiDDeN framework in handling complex feature representations and cross-scale fusion by proposing an adaptive multi-scale blind watermarking algorithm grounded in dual-attention synergy. By integrating a dynamic multi-scale feature fusion module (MA-FFM) with a channel attention mechanism at the decoder, our architecture moves beyond the traditional single-scale extraction paradigm. This allows the network to adaptively balance local fine-grained textures with global contextual semantics during the watermark recovery process.

The experimental results on the COCO dataset [1] demonstrate that the proposed model achieves a consistent improvement over the baseline. The bit error rate (BER) is reduced by 9.3% in relative terms, while the visual quality metrics (PSNR of 31.59 dB and SSIM of 0.9077) also improve. Beyond these quantitative gains, a key contribution of this work is the enhanced interpretability of the deep watermarking process. The observation that the dynamic gating weight evolves from 0.51 to approximately 0.89 provides empirical evidence that local multi-scale features are decisive for information restoration in high-fidelity scenarios. This evolution makes the traditionally "black-box" embedding logic more transparent and easier to interpret.

Ablation studies further confirm that the soft coupling of the MA-FFM and the SE module is essential for achieving optimal performance, as it effectively mitigates the feature interference observed in static fusion schemes. While the model maintains robust performance under standard distortions, including various types of noise and JPEG compression, performance bottlenecks persist under extreme conditions such as ultra-low quality compression or severe geometric cropping.

Compared to recent advanced watermarking methods such as MBRS and PIMoG, our work focuses on a different direction within the HiDDeN framework, namely interpretability (dynamic gating evolution 0.51 → 0.89), capacity scalability (30/64/100 bits), and robustness to multiple noise types (color jitter, JPEG, Gaussian blur). Due to differences in technical focus and code compatibility, we limit our main experimental comparisons to the HiDDeN baseline, which is the most direct reference for our architectural improvements.

Future research will focus on systematically investigating the optimal placement of the SE module (encoder vs. decoder) and its interaction with MA-FFM, as well as incorporating differentiable noise layers for end-to-end joint training to enhance resilience against non-stationary real-world distortions. Furthermore, we aim to extend this adaptive synergy mechanism to emerging frontiers, such as short-video copyright protection and the governance of AI-generated content (AIGC). Other emerging threats, such as screen-shooting attacks, have recently been addressed using cross-attention mechanisms [36], which is also a direction worth exploring. Through large-scale statistical validation, we will continue to refine the algorithm’s robustness boundaries for large-scale industrial applications.

Author Contributions

Conceptualization, Z.Y., H.S. and N.L.; methodology, Z.Y. and H.S.; software, Z.Y.; validation, Z.Y. and N.L.; formal analysis, Z.Y.; investigation, Z.Y.; resources, H.S.; data curation, Z.Y.; writing—original draft preparation, Z.Y.; writing—review and editing, Z.Y.; visualization, N.L.; supervision, H.S.; project administration, Z.Y.; funding acquisition, H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds in Universities of Heilongjiang Province, grant number 2025-KYYWF-ZR0082.

Data Availability Statement

The original data presented in this study are openly available in the COCO 2017 dataset at http://cocodataset.org. No new data were created.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755.
2. Wang, B.; Wu, Y. Staged Adaptive Blind Watermarking Scheme. Lect. Notes Comput. Sci. 2023, 13846, 357–372.
3. Cox, I.J.; Kilian, J.; Leighton, F.T.; Shamoon, T. Secure spread spectrum watermarking for multimedia. IEEE Trans. Image Process. 1997, 6, 1673–1687.
4. Kundur, D.; Hatzinakos, D. A Robust Digital Image Watermarking Method Using Wavelet-Based Fusion. In Proceedings of the 1997 International Conference on Image Processing (ICIP’97), Santa Barbara, CA, USA, 26–29 October 1997; Volume 1, pp. 544–547.
5. Zhu, J.; Kaplan, R.; Johnson, J.; Li, F.-F. HiDDeN: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; pp. 657–672.
6. Liu, Y.; Guo, M.; Zhang, J.; Zhu, Y.; Xie, X. A novel two-stage separable deep learning framework for practical blind watermarking. In Proceedings of the 27th ACM International Conference on Multimedia (MM’19), Nice, France, 21–25 October 2019; pp. 1509–1517.
7. Jia, Z.; Fang, H.; Zhang, W. MBRS: Enhancing robustness of DNN-based watermarking by mini-batch of real and simulated JPEG compression. In Proceedings of the 29th ACM International Conference on Multimedia (MM’22), Lisboa, Portugal, 10–14 October 2022; pp. 41–49.
8. Fang, H.; Jia, Z.; Ma, Z.; Chang, E.-C.; Zhang, W. PIMoG: An effective screen-shooting noise-layer simulation for deep-learning-based watermarking network. In Proceedings of the 30th ACM International Conference on Multimedia (MM’22), Lisboa, Portugal, 10–14 October 2022; pp. 2267–2276.
9. Ge, F.; Huang, Y.; Liu, J.; Zhang, G.; Zeng, Z.; Zhang, S.; Guan, H. MT-Mark: Rethinking image watermarking via mutual-teacher collaboration with adaptive feature modulation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’25), Philadelphia, PA, USA, 28 February–4 March 2025.
10. Shao, S.; Li, Y.; Yao, H.; He, Y.; Qin, Z.; Ren, K. Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution. arXiv 2024, arXiv:2405.04825.
11. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
12. Wang, B.; Song, Z.; Wu, Y. Robust blind watermarking framework for hybrid networks combining CNN and transformer. In Proceedings of the 15th Asian Conference on Machine Learning (ACML 2023), İstanbul, Turkey, 11–14 November 2023; Yanıkoğlu, B., Buntine, W., Eds.; Proceedings of Machine Learning Research (PMLR): New York, NY, USA, 2024; Volume 222, pp. 1417–1432.
13. Singh, R.K.; Dube, A.P.; Singh, R. Least significant bit-based image watermarking mechanism: A review. Int. J. Soc. Ecol. Sustain. Dev. 2022, 13, 1–9.
14. Katz, I. PSYCH221: A Survey of Digital Watermarking Techniques. Available online: http://acorn.stanford.edu/psych221/projects/2006/itai/LSB.html (accessed on 26 April 2026).
15. Potdar, V.M.; Han, S.; Chang, E. A survey of digital image watermarking techniques. In Proceedings of the 2005 3rd IEEE International Conference on Industrial Informatics (INDIN), Perth, WA, Australia, 10–12 August 2005; pp. 709–716.
16. Liu, J.; Wu, S.; Xu, X. A logarithmic quantization-based image watermarking using information entropy in the wavelet domain. Entropy 2018, 20, 945.
17. Ernawan, F.; Ariatmanto, D. A recent survey on image watermarking using scaling factor techniques for copyright protection. Multimed. Tools Appl. 2023, 82, 27123–27163.
18. Zhong, X.; Das, A.; Alrasheedi, F.; Tanvir, A. A Brief, In-Depth Survey of Deep Learning-Based Image Watermarking. Appl. Sci. 2023, 13, 11852.
19. Fkirin, A.; Attiya, G.; El-Sayed, A.; Shouman, M.A. Copyright protection of deep neural network models using digital watermarking: A comparative study. Multimed. Tools Appl. 2022, 81, 15961–15975.
20. Gharib, H.A.; Abdelnapi, N.M.M.; Hosny, K.M. Robust zero-watermarking for color images using hybrid deep learning models and encryption. Sci. Rep. 2025, 15, 28906.
21. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2021), Virtual, 5–9 January 2021; pp. 3560–3569.
22. Zhang, Y.; Li, X.; Wang, Y. DARI-Mark: Deep Learning and Attention Network for Robust Image Watermarking. Mathematics 2023, 11, 209.
23. Mareen, H.; Antchougov, L.; Van Wallendael, G.; Lambert, P. Blind Deep-Learning-Based Image Watermarking Robust Against Geometric Transformations. arXiv 2024, arXiv:2402.09062.
24. Zhang, T.; Tan, S.; Shen, X.W.; Tang, J. Image Watermarking Method Combining Attention Mechanism and Multi-Scale Feature. J. Comput. Appl. 2025, 45, 616–623.
25. Younus, A.H.; Mohammed, A.O. Robust Color Image Watermarking Based on DWT and CNN. Sci. J. Univ. Zakho 2026, 14, 198–208.
26. Dasgupta, A.; Zhong, X. Robust Image Watermarking Based on Cross-Attention and Invariant Domain Learning. arXiv 2023, arXiv:2310.05395.
27. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014); Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Montreal, QC, Canada, 2014.
28. Li, G.; Tan, L.; Xue, Y.; Liu, G.; Qian, Z.; Li, S.; Zhang, X. Adversarial Shallow Watermarking. arXiv 2025, arXiv:2504.19529.
29. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; Available online: https://arxiv.org/abs/1412.6980 (accessed on 26 April 2026).
30. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
31. Narwaria, M.; Mishra, A.; Jain, A.; Agarwal, C. An experimental study into objective quality assessment of watermarked images. Int. J. Image Process. 2010, 5, 199–219.
32. Guzik, P.; Matiolanski, A.; Dziech, A. Real data performance evaluation of CAISS watermarking scheme. Multimed. Tools Appl. 2015, 72, 2975–2992.
33. Zhang, H.; Kone, M.M.K.; Ma, X.; Zhou, N. Frequency-domain attention-guided adaptive robust watermarking model. J. Frankl. Inst. 2025, 362, 107511.
34. Wu, S.; Lu, W.; Yin, X.; Yang, R. Robust watermarking against arbitrary scaling and cropping attacks. Signal Process. 2025, 226, 109655.
35. Spratling, M.W. A comprehensive assessment benchmark for rigorously evaluating deep learning image classifiers. arXiv 2023, arXiv:2308.04137.
36. Liu, L.; Xu, P.; Xue, Q. Screen shooting resistant watermarking based on cross attention. Sci. Rep. 2025, 15, 17016.

Figure 1. Overview of the proposed framework (encoder, decoder, noise layer, discriminator).

Figure 2. Structure of the MA-FFM module (local and global branches fused by Sigmoid).

Figure 3. The squeeze-and-excitation (SE) module.

Figure 4. Watermarked images from different model configurations (original vs. HiDDeN vs. proposed).

Figure 5. Amplified residual maps corresponding to Figure 4 (×5, texture-concentrated).

Figure 6. Residuals under attacks (×5, JPEG, blur, resize, crop, color jitter).

Figure 7. BER convergence curves on the validation set (20 epochs, validation BER).

Figure 8. Evolution of local and global gate weights during training (local weight 0.51 → 0.89).

Figure 9. Relative improvements of the proposed model compared to each ablation configuration (vs. w/o MA-FFM, w/o SE, fixed gating).

Figure 10. Qualitative failure cases under extreme distortions.

Table 1. Computational complexity comparison (per 128 × 128 image).

Model	Params (M)	FLOPs (G)	Inference Time (ms)
HiDDeN (baseline)	12.22	2.53	15.21
Proposed	15.63	3.12	18.32

The proposed model increases parameters by 28%, FLOPs by 23%, and inference delay by 20%. This is a reasonable trade-off for the observed performance gains, specifically a 9.3% BER reduction and a 0.61 dB PSNR improvement.
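The overhead percentages quoted above follow directly from the entries in Table 1, as the short arithmetic sketch below shows. The dictionary layout and function name are illustrative assumptions.

```python
# (baseline HiDDeN, proposed) pairs copied from Table 1
TABLE1 = {
    "params_M": (12.22, 15.63),
    "flops_G": (2.53, 3.12),
    "inference_ms": (15.21, 18.32),
}

def percent_increase(baseline, proposed):
    """Relative increase of the proposed model over the baseline, in percent."""
    return 100.0 * (proposed - baseline) / baseline

overhead = {name: round(percent_increase(base, prop))
            for name, (base, prop) in TABLE1.items()}
# overhead -> {'params_M': 28, 'flops_G': 23, 'inference_ms': 20}
```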

Table 2. Quantitative results under no attack (20 epochs).

Model	Validation Set BER (Epoch 20)	PSNR (dB)	SSIM
Original HiDDeN (w/o Attention)	0.169573	30.98	0.905849
w/o MA-FFM (SE Only)	0.188671	31.26	0.8939
w/o SE (MA-FFM Only)	0.166653	29.99	0.906296
Fixed Gating (MA-FFM + SE, Fixed Weights)	0.170601	30.97	0.905812
Proposed Model (Dynamic Gating + MA-FFM + SE)	0.153798	31.59	0.907736

Table 3. Performance comparison under different attacks and training conditions.

Attack	Original HiDDeN	Proposed Model
JPEG Q = 85	0.3232	0.2946
JPEG Q = 75	0.3809	0.3315
Gaussian σ = 0.01	0.1744	0.1624
Salt & pepper (d = 0.002)	0.1693	0.1623
Brightness + 0.02	0.1692	0.1542
Contrast × 0.98	0.1695	0.1540
Color jitter	0.2106	0.1857
Gaussian blur	0.2047	0.1833
Random crop	0.232	0.1927
Resize	0.3647	0.3113

Table 4. Capacity scaling under no-attack training (30/64/100 bits, 20 epochs).

Message Length	HiDDeN (Baseline)	Proposed
30 bits	0.1696	0.1538
64 bits	0.219	0.1993
100 bits	0.2608	0.2218

Table 5. Performance comparison on different datasets (zero-shot on DIV2K and Mini-ImageNet).

Dataset	BER	PSNR (dB)	SSIM
COCO 2017 (validation)	0.154	31.6	0.908
DIV2K	0.155	31.3	0.915
Mini-ImageNet	0.157	31.2	0.912

The model achieves highly consistent performance across three distinct datasets, with BER varying only in the third decimal place. This demonstrates that our watermarking scheme learns transferable patterns and does not overfit to COCO-specific features.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
