1. Introduction
Digital images form the foundation of today’s internet information. Their copyright protection and traceability have become central issues in multimedia security. Since digital watermarking emerged in the 1990s, it has gone through three major stages: from spatial domain methods to transform domain methods, and now to deep neural network-based approaches [
1,
2].
Early spatial domain methods (e.g., LSB) and transform domain methods (e.g., DCT-based spread-spectrum [
3] and DWT-based [
4] approaches) improved robustness but relied heavily on manually designed rules, struggling to adapt to diverse image content under combined attacks. For a detailed review, readers are referred to
Section 2.
The rise of deep learning brought a major shift. The HiDDeN framework [
5] pioneered end-to-end watermarking by jointly training an encoder, noise layer, decoder, and discriminator, achieving a collaborative optimization of invisibility and robustness. Recent improvements include TSDL [
6], MBRS [
7], and PIMoG [
8], which have enhanced feature representation and robustness against specific distortions.
Nevertheless, for high-precision applications, these models still suffer from three core limitations: single-scale convolution captures only local textures without global guidance; existing feature fusion relies on fixed manual weights that cannot adapt to image content; and training convergence is slow while interpretability is lacking [
9,
10]. Moreover, most existing deep watermarking models operate as black boxes, providing no insight into why a watermark is embedded in a particular region or how the model adapts to different image contents. This lack of interpretability limits their trustworthiness and practical deployment.
To address these issues, this paper proposes a dual-attention adaptive multi-scale blind watermarking framework built on HiDDeN. Specifically, we design an adaptive multi-scale feature fusion module (MA-FFM) with a dynamic gated weight learning mechanism that adaptively balances local and global features. The learned local-branch weight evolves from ≈0.51 to ≈0.89 during training, which makes the fusion behavior interpretable. At the decoder side, we embed multi-level SE channel attention modules [
11] to strengthen watermark signal filtering. Inspired by [
12], our method integrates MA-FFM and SE into a collaborative dual-attention architecture.
The main contributions of this work are threefold:
- (1)
We propose a dual-attention synergistic architecture that places MA-FFM in the encoder for adaptive multi-scale fusion and SE in the decoder for channel-wise watermark enhancement, which balances robustness and imperceptibility better than single-module variants.
- (2)
We introduce a dynamic gating mechanism whose weight evolution (0.51 → 0.89) is recorded and interpreted, providing interpretability of the deep watermarking process.
- (3)
We systematically evaluate capacity scalability of 30, 64 and 100 bits and robustness against differentiable noise layers including color jitter, JPEG and Gaussian blur, which are not explored in previous MA-FFM/SE-based works.
Experiments on COCO 2017 [
1] demonstrate that our algorithm reduces the BER under no attack from 0.1696 to 0.1538 (a relative reduction of 9.3%), improves PSNR by 0.61 dB, and achieves lower BER under various attacks. The method achieves synergistic gains in performance, efficiency, and robustness.
The remainder of this paper is organized as follows.
Section 2 reviews related work on deep watermarking and attention mechanisms.
Section 3 describes the proposed dual-attention architecture, including the MA-FFM and SE modules, the dynamic gating mechanism, and the training strategy.
Section 4 presents experimental results, ablation studies, capacity scaling, and robustness evaluations.
Section 5 discusses the findings and limitations, and
Section 6 concludes the paper.
3. Materials and Methods
3.1. Overall Framework Design
The encoder takes the cover image and the secret message as input. It first extracts multi-scale features through several convolutional layers, then adaptively fuses them with the message using the proposed MA-FFM module. The rationale is to learn a content-aware embedding that prioritizes texture-rich regions, thereby balancing robustness and imperceptibility.
During training, we replace the noise layer with an identity mapping. This deliberate choice isolates the pure contribution of the attention modules; any robustness gain can be attributed solely to the attention mechanism, not to overfitting to specific distortions. Generalization is still verified on unseen attacks during testing.
The decoder receives the possibly distorted watermarked image and employs SE channel attention modules to selectively enhance watermark-related features. The design rationale is that channel attention can adaptively reweight feature maps, making the decoder focus on informative channels and thus improve extraction accuracy under noise.
A discriminator is introduced in an adversarial training setup. It distinguishes original cover images from watermarked ones, forcing the encoder to produce visually imperceptible outputs. This adversarial component helps preserve high perceptual quality without sacrificing watermark embedding strength.
As illustrated in
Figure 1, this paper builds upon the HiDDeN framework [
5]. In the original design, the encoder takes a cover image and a secret message as input and produces an encoded image that visually resembles the cover. The decoder then recovers the secret message from this encoded image. An optional noise layer placed between the encoder and the decoder simulates various real-world distortions to improve robustness.
Given the limitations of the HiDDeN model discussed above—such as its single-scale feature fusion and the limited accuracy of watermark embedding and extraction—this paper focuses on isolating and validating the contribution of two attention mechanisms: MA-FFM and SE. To achieve this, we deliberately replace the noise layer with an identity mapping during training, meaning no distortion is added to the encoded image. This design eliminates noise-induced robustness gains, ensuring that any performance improvement comes solely from the feature selection ability of the attention modules. As a result, we can more accurately assess the effectiveness of our proposed enhancements [
23].
It is important to note that this training strategy does not imply the model is weak against distortions. Although no simulated attacks are used during training, we still evaluate the model under various attacks during the testing stage, including JPEG compression, Gaussian noise, salt-and-pepper noise, and brightness/contrast adjustments. This allows us to test the model’s generalization robustness against previously unseen distortions. The fact that the model achieves lower BER than HiDDeN under these attacks—despite never seeing them during training—further confirms that the attention mechanisms themselves improve robustness, rather than the model simply overfitting to a specific noise layer.
In addition, we keep the adversarial training setup between the discriminator and the encoder. The discriminator learns to tell apart original cover images from watermarked ones, which in turn pushes the encoder to produce more realistic outputs. This adversarial mechanism is retained to preserve the visual quality of the watermarked images, thus offering a stable basis for evaluating the attention modules.
3.2. Encoder Improvement: MA-FFM Module
To address the key limitations mentioned above—namely, the encoder’s single-scale feature extraction, its inability to capture local features beyond a fixed scale, and the lack of global context guidance—we draw inspiration from the work of Dai et al. and design a Multi-scale Attention Fusion Module (MA-FFM) [
21]. Similar ideas of combining attention with multi-scale features have also been explored in image watermarking [
24]. This module is specifically intended to enhance the encoder’s ability to jointly model multi-scale features. Its structure is shown in
Figure 2.
MA-FFM Operation Details
The input cover feature map flows into the MA-FFM module and is processed sequentially in four stages, split into preliminary preprocessing and dual-branch parallel feature extraction:
Two cascaded convolutions are first implemented for preliminary feature optimization. The first convolution removes redundant feature information and unifies the channel dimension of input features; the second successive convolution further adjusts channel distribution to realize channel unity, laying a consistent feature base for subsequent dual-branch computation. The preprocessed feature is then diverged into two independent parallel branches: the local multi-scale spatial branch and the global channel context branch.
Local Branch Multi-scale Spatial Feature Extraction
Two groups of parallel depth-wise separable convolution stacks are deployed for multi-scale local feature extraction, each group containing depth-wise separable convolutions placed side by side. The feature maps from each convolution kernel are concatenated along the channel dimension, and all concatenated multi-scale feature tensors are further fused, followed by a final convolution for channel calibration and compression to derive the local multi-scale feature.
Global Branch Global Channel Attention Extraction
Global average pooling (GAP) is first performed to compress the spatial dimension of the preprocessed features into a channel-wise descriptor. Afterwards, two successive convolutions combined with Sigmoid activation transform and refine the pooled channel vector, generating the global channel attention descriptor.
The local feature $L(X)$ and the global feature $G(X)$ are fused via broadcast element-wise addition ($\oplus$) to unify their spatial and channel dimensions. The fused feature is then passed through the Sigmoid activation function, which normalizes its values into [0, 1] and produces the pixel-level adaptive attention weight mask $M(X)$. In this spatial mask, texture-rich regions are assigned higher weights, while flat, smooth regions receive lower weights.
Element-wise multiplication ($\otimes$) is then applied between the original input feature $X$ and the generated attention weight mask, and the enhanced output feature is formulated as
$$X' = X \otimes M(X) = X \otimes \sigma\big(L(X) \oplus G(X)\big)$$
Here, $\sigma$ represents the Sigmoid function, which is used to normalize the fusion weights to the range of [0, 1]; $\oplus$ denotes broadcast addition, which achieves the dimensional alignment of local features and global features; and $\otimes$ represents element-wise multiplication, which completes the weighted combination of the original features and the fusion weights. To address the overly simple feature fusion of the HiDDeN encoder, we embed the MA-FFM module into the feature fusion stage of the encoder, replacing the original concatenation operation. This yields a more efficient and more targeted adaptive fusion of image features and secret message features and provides multi-scale feature support for precise watermark embedding. The dynamic gate weight is a scalar per image; after Softmax, the local and global weights sum to 1. It is neither a per-channel average nor a spatial map.
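The masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `attention_fuse` and the tensor shapes are assumptions made for illustration, and random arrays stand in for the outputs of the local and global branches.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def attention_fuse(x, local_feat, global_feat):
    """Compute X' = X * sigmoid(L(X) + G(X)).

    x, local_feat: arrays of shape (C, H, W).
    global_feat:   array of shape (C, 1, 1), broadcast over H and W.
    Returns the reweighted feature and the attention mask.
    """
    mask = sigmoid(local_feat + global_feat)  # broadcast addition, then Sigmoid
    return x * mask, mask                     # element-wise multiplication

# Hypothetical stand-ins for the branch outputs (not real network features).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
local_feat = rng.standard_normal((4, 8, 8))
global_feat = rng.standard_normal((4, 1, 1))

out, mask = attention_fuse(x, local_feat, global_feat)
print(out.shape, float(mask.min()), float(mask.max()))
```

Because the global descriptor has shape (C, 1, 1), the addition shifts every spatial position of a channel by the same amount, while the local feature supplies the per-pixel variation of the mask.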
3.3. Decoder Improvement: SE Attention Module
The decoder directly determines the recovery accuracy of the secret message: its core task is to restore the embedded message from a watermarked image that may have been subjected to noise attacks. The decoder of the original HiDDeN adopts a purely convolutional structure. Although it extracts local features effectively, it treats the feature responses of all channels as equally important and cannot selectively enhance the channels that carry the watermark signal. As a result, redundant channels interfere with watermark extraction and reduce recovery accuracy [5].
To precisely address this issue and enhance the decoder’s sensitivity to watermark signals, we were inspired by the squeeze-and-excitation networks and innovatively introduced the SE attention module into the decoder. By explicitly modeling the interdependence between channels, it adaptively recalibrates the channel feature responses, achieving the goal of “strengthening key channels and suppressing redundant channels” [
11]. It is worth noting that the SE module has shown significant effects in the watermark field, providing a reliable reference for our improvement: MBRS extensively introduces SE blocks in the encoder to enhance the feature extraction ability [
7]; PIMoG uses the attention mechanism to enable the network to better understand the preprocessed features [
8]; and SE blocks have also been integrated with DWT and CNN for color image watermarking [
25]. Cross-attention mechanisms have also been introduced for robust watermark extraction [26]. Different from existing studies, this paper introduces the SE module into the decoder, focusing on its effect on watermark extraction accuracy and addressing the research gap in decoder-side channel attention optimization.
Module structure: The SE module consists of two core operations—squeeze and excitation, as shown in
Figure 3. Its principle is directly based on the description in the original paper.
Squeeze operation: “To leverage the dependencies between channels, we first treat the signal of each channel as a set of local descriptors. Through global average pooling, an aggregated spatial information descriptor is generated.” Formally, given the input feature map $U \in \mathbb{R}^{H \times W \times C}$, the squeeze operation generates the channel descriptor $z \in \mathbb{R}^{C}$, and its $c$-th element is
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$
Excitation operation: “To utilize the aggregated information from the compression operation, we perform a second operation aimed at fully capturing the dependencies between channels. This operation must meet two criteria: First, it must be flexible. Second, it must learn the nonlinear interactions between channels. We chose a simple gated mechanism with Sigmoid activation” [
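11]. This gated mechanism is expressed as
$$s = \sigma\big(W_2\, \delta(W_1 z)\big)$$
Here, $\delta$ represents the ReLU activation function, $\sigma$ represents the Sigmoid function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the learnable weight matrices of two fully connected layers, and $r$ is the dimensionality reduction ratio. In our implementation, we set the reduction ratio r = 16.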
Feature recalibration: “The output of the activation operation is then used to rescale the input feature map $U$ through a per-channel multiplication” [11]. That is:
$$\tilde{x}_c = s_c \cdot u_c$$
Here, $\cdot$ represents channel-wise multiplication between the scalar $s_c$ and the feature map $u_c$.
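The three SE steps (squeeze, excitation, rescaling) can be traced in a small NumPy sketch. This is an illustrative forward pass only, assuming a channel-first layout and bias-free fully connected layers; the function name `se_block` and the randomly initialized weights are hypothetical and are not taken from the trained decoder.

```python
import numpy as np

def se_block(u, w1, w2):
    """Squeeze-and-excitation forward pass on one feature map.

    u:  feature map of shape (C, H, W).
    w1: weights of the reduction layer, shape (C // r, C).
    w2: weights of the expansion layer, shape (C, C // r).
    Returns the recalibrated feature map and the channel weights s.
    """
    z = u.mean(axis=(1, 2))                    # squeeze: global average pooling -> (C,)
    hidden = np.maximum(w1 @ z, 0.0)           # first FC layer followed by ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second FC layer followed by Sigmoid
    return u * s[:, None, None], s             # per-channel rescaling

C, r = 32, 16                                  # r = 16 as stated in the text
rng = np.random.default_rng(0)
u = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1    # random placeholder weights
w2 = rng.standard_normal((C, C // r)) * 0.1

out, s = se_block(u, w1, w2)
print(out.shape, s.shape)
```

With C = 32 and r = 16 the bottleneck has only two units, which shows how aggressively the reduction ratio compresses the channel descriptor before the weights are expanded back to C channels.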
Integration method: We integrate the SE block after each convolutional layer of the decoder, forming a basic building block of “convolution → batch normalization → ReLU → SE block”. The decoder is composed of multiple such stacked units and finally outputs the predicted L-bit message through a fully connected layer.
Theoretical basis: As described in squeeze-and-excitation networks, “in the filter responses learned by the convolutional kernels, the activations of many channels are highly redundant [
11]. By explicitly modeling the interdependence between channels, the network’s representational ability can be significantly enhanced.” In the watermark decoding task, the response intensities of different channels to the watermark signal vary significantly. The SE module, through the aforementioned adaptive recalibration mechanism, enables the decoder to “focus” on the key feature channels carrying the watermark information, thereby improving the accuracy of message recovery under noise attacks.
3.4. Loss Function and Training Strategy
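The optimization objective of this model is a weighted sum of three components: decoding loss $\mathcal{L}_{dec}$, encoding loss $\mathcal{L}_{enc}$, and adversarial loss $\mathcal{L}_{adv}$. The total loss function is defined as follows:
$$\mathcal{L}_{total} = \lambda_{dec}\,\mathcal{L}_{dec} + \lambda_{enc}\,\mathcal{L}_{enc} + \lambda_{adv}\,\mathcal{L}_{adv}$$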
Decoding loss ($\mathcal{L}_{dec}$): To ensure the high-fidelity restoration of the secret message from potentially attacked images, we employ binary cross-entropy loss. In HiDDeN, “message recovery is regarded as a binary classification problem, and we use cross-entropy loss to maximize the recovery accuracy of the message bits” [5]. This loss is defined as
$$\mathcal{L}_{dec} = -\frac{1}{L} \sum_{i=1}^{L} \big[ m_i \log p_i + (1 - m_i) \log (1 - p_i) \big]$$
Here, $L$ is the message length, $m_i$ is the i-th bit of the original message, and $p_i$ represents the probability that the decoder predicts the i-th bit of the message to be 1.
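Encoding loss ($\mathcal{L}_{enc}$): To minimize the visual difference between the original image and the image with watermark, we use the mean squared error (MSE). As stated in HiDDeN, “We use the MSE loss between the encoder output and the cover image to penalize pixel-level differences” [5]. This loss is defined as
$$\mathcal{L}_{enc} = \frac{1}{C H W} \left\| I_{co} - I_{en} \right\|_2^2$$
where $I_{co}$ is the cover image, $I_{en}$ is the encoder output, and $C$, $H$, and $W$ are the number of channels, the height, and the width of the image.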
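Adversarial loss ($\mathcal{L}_{adv}$): To further enhance the perceptual authenticity of the watermarked image, we introduced a discriminator network and adopted the standard GAN loss for adversarial training. Goodfellow et al. [27] defined this training objective as follows: the generator $G$ and the discriminator $D$ engage in a minimax game on the value function
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$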
In our framework, the encoder $E$ acts as the generator, receiving the cover image $I_{co}$ and the message $m$ as inputs; the discriminator $D$ attempts to distinguish the original image from the image generated by the encoder. Therefore, the adversarial loss is specifically defined as
$$\mathcal{L}_{adv} = \mathbb{E}\big[\log\big(1 - D(E(I_{co}, m))\big)\big]$$
The encoder is trained to minimize $\mathcal{L}_{adv}$, thereby enabling the generated watermarked image to be more similar in distribution to the original image.
Loss weights: Through experimental debugging, we set the weights of the three loss terms as $\lambda_{dec} = 2.5$, $\lambda_{enc} = 0.7$, and $\lambda_{adv} = 0.01$. This allocation prioritizes decoding accuracy while also taking into account image quality. The adversarial loss weight is relatively low, aiming to prevent it from dominating the optimization process in the early training stage. This is consistent with the observation in MBRS: “Adversarial training is used to assist in optimizing image quality, rather than dominating the training process” [28].
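As a rough illustration of how the three terms combine, the following plain-Python sketch evaluates the binary cross-entropy, the MSE, and the weighted total with the weights reported in this paper (2.5, 0.7, 0.01). The function names and the toy inputs are assumptions for illustration; an actual training loop would compute these terms on tensors with automatic differentiation.

```python
import math

# Loss weights reported in the paper.
LAMBDA_DEC, LAMBDA_ENC, LAMBDA_ADV = 2.5, 0.7, 0.01

def bce(bits, probs, eps=1e-12):
    """Binary cross-entropy averaged over the message bits."""
    total = 0.0
    for m, p in zip(bits, probs):
        total += m * math.log(p + eps) + (1 - m) * math.log(1 - p + eps)
    return -total / len(bits)

def mse(cover, encoded):
    """Mean squared error over flattened pixel values."""
    return sum((a - b) ** 2 for a, b in zip(cover, encoded)) / len(cover)

def total_loss(l_dec, l_enc, l_adv):
    """Weighted sum of decoding, encoding, and adversarial losses."""
    return LAMBDA_DEC * l_dec + LAMBDA_ENC * l_enc + LAMBDA_ADV * l_adv

# Toy values (hypothetical, not experimental data).
l_dec = bce([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.6])
l_enc = mse([0.0, 0.5, -0.5], [0.1, 0.4, -0.5])
l_adv = math.log(1 - 0.3)  # log(1 - D(E(I_co, m))) for a discriminator output of 0.3
print(total_loss(l_dec, l_enc, l_adv))
```

With these weights, a unit change in the decoding loss moves the total 250 times more than a unit change in the adversarial loss, which reflects the stated priority on decoding accuracy.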
Optimizer and training strategy: We use the Adam optimizer for parameter updates. Kingma and Ba described Adam as a stochastic optimization method that requires only first-order gradients, has low memory requirements, and is well suited to problems that are large in terms of data and parameters [29]. The learning rates for the encoder and decoder are set to $2 \times 10^{-4}$, and the learning rate for the discriminator is set to $1 \times 10^{-4}$. The batch size is set to 8, and the total number of training epochs is 20. Loss convergence was observed at the 10th epoch. To stabilize adversarial training, we use gradient clipping with a maximum norm of 0.5. As Hu et al. stated, “Gradient clipping can prevent the discriminator gradient from being too large, which leads to training crashes” [11]. All experiments were completed on a single NVIDIA GeForce RTX 4060 Laptop GPU, NVIDIA Corporation, Santa Clara, CA, USA.
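Gradient clipping by norm can be sketched in a few lines of plain Python. This is a simplified illustration of the general technique with the maximum norm of 0.5 used here, assuming the gradients are flattened into one list; the function name `clip_by_global_norm` is hypothetical, and deep learning frameworks provide their own utilities for this step.

```python
import math

def clip_by_global_norm(grads, max_norm=0.5):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm      # already within the limit
    scale = max_norm / total_norm           # shrink all components uniformly
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0])
print(norm_before, clipped)
```

The direction of the gradient is preserved; only its length is reduced, so a single oversized discriminator update cannot dominate a training step.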
4. Results
This section presents systematic experiments to verify the effectiveness of the proposed dual-attention collaborative structure and dynamic feature fusion mechanism. All experiments follow the principle of controlled variable comparison. Under identical training settings, we compare the performance of the baseline model with that of our improved model. We evaluate from four perspectives: quantitative metrics, ablation analysis, convergence efficiency, and interpretability. The results show that our method effectively addresses the core issues of the original HiDDeN watermarking approach, namely low accuracy, weak feature representation, and slow convergence.
4.1. Experimental Setup and Evaluation Metrics
We use the COCO 2017 dataset [
1]. Following common practice in deep learning-based watermarking research, we train for 20 epochs using 10,000 images, validate on 5000 images, and test on 5000 images. All images are resized to 128 × 128 pixels. The secret message is a randomly generated 30-bit binary sequence.
Watermark decoding accuracy is measured by the bit error rate (BER), which is the proportion of incorrect bits out of the total bits:
$$\mathrm{BER} = \frac{1}{L} \sum_{i=1}^{L} \mathbb{1}\left(m_i \neq \hat{m}_i\right)$$
where $L$ is the message length, $m_i$ and $\hat{m}_i$ are the original and extracted message bits, respectively, and $\mathbb{1}(\cdot)$ is the indicator function. A lower BER indicates stronger robustness. The visual quality of the watermarked image is evaluated using peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). PSNR is defined as
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$$
with MAX = 2 (the image values are normalized to [−1, 1]) and MSE the mean squared error between the original and watermarked images. SSIM is computed with a window size of 11 according to the standard definition [30]. Higher PSNR and SSIM values mean better imperceptibility [31].
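The two metric definitions above translate directly into code. The following plain-Python sketch assumes bit sequences and flattened pixel lists as inputs; the function names are illustrative, and MAX defaults to 2 to match images normalized to [−1, 1].

```python
import math

def bit_error_rate(original, extracted):
    """Fraction of message bits that differ between original and extracted."""
    if len(original) != len(extracted):
        raise ValueError("messages must have the same length")
    return sum(a != b for a, b in zip(original, extracted)) / len(original)

def psnr(cover, watermarked, max_val=2.0):
    """PSNR in dB for flattened pixel lists; max_val is the dynamic range."""
    mse = sum((a - b) ** 2 for a, b in zip(cover, watermarked)) / len(cover)
    if mse == 0:
        return math.inf                     # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Toy inputs (hypothetical): two of four bits flipped, uniform pixel error of 0.2.
print(bit_error_rate([1, 0, 1, 1], [1, 1, 1, 0]))
print(psnr([0.0, 0.0], [0.2, 0.2]))
```

Note that with MAX = 2 the same pixel error yields a PSNR about 6.02 dB higher than it would with MAX = 1, so the dynamic range must be stated when PSNR values are compared across papers.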
We also report the Learned Perceptual Image Patch Similarity (LPIPS) using an AlexNet backbone. The average LPIPS of our proposed model is 0.0267, while that of the baseline HiDDeN is 0.0422; lower is better, indicating that our method introduces less perceptual distortion.
We select the original HiDDeN model as the main baseline. In addition, we conduct ablation experiments to separately verify the effectiveness of the MA-FFM module and the SE attention module. All compared models are trained and tested under the same dataset, image size, and message length to ensure a fair comparison [
32].
4.2. Training Parameters
The model is implemented using the PyTorch 2.0.1 framework and trained on a single NVIDIA GeForce RTX 4060 Laptop GPU. We use the Adam optimizer [
29]. The learning rate for the encoder–decoder is set to $2 \times 10^{-4}$, and the learning rate for the discriminator is $1 \times 10^{-4}$. Due to GPU memory limits, the batch size is set to eight. The total number of training epochs is 20. To ensure stable training, we apply gradient clipping with a maximum norm of 0.5.
The total loss is a weighted sum of three components:
Decoder loss with weight 2.5;
Encoder loss with weight 0.7;
Adversarial loss with weight 0.01.
We also evaluated the computational cost.
Table 1 compares the parameter count, FLOPs, and inference time for a single 128 × 128 image of the baseline HiDDeN and our proposed model.
4.3. Comparison Method
Table 2 quantifies the experimental results under each configuration. The data indicate that the proposed dynamic gating model achieves the best overall performance. Specifically, compared to the original HiDDeN model, this method reduces the bit error rate (BER) from 0.1696 to 0.1538, a relative decrease of approximately 9.3%, while the peak signal-to-noise ratio (PSNR) increases by 0.61 dB. Notably, introducing the SE module alone mainly improves PSNR but has limited effect on BER, whereas introducing MA-FFM alone reduces BER but lowers PSNR. This indicates the important role of the dynamic gating mechanism in balancing the two aspects.
Table 3 compares the BER of each model under different attacks and training conditions. Under attacks such as JPEG compression, Gaussian noise, salt-and-pepper noise, and brightness/contrast adjustment, the BER of the proposed model is lower than that of the original HiDDeN model, indicating that the dual-attention collaborative mechanism effectively enhances robustness. Additionally, our model obtains lower BER than the baseline across all listed distortion types including color jitter, Gaussian blur, JPEG compression and geometric transformations, which verifies the robustness of the dual-attention module against realistic differentiable noise layers.
Figure 4 shows the watermarked images generated under different configurations and their corresponding residual maps. Visually, the images generated by all methods remain highly consistent with the original cover image. However, the enlarged residuals show that the perturbations produced by the dynamic gating model are more uniform and subtle: they largely avoid smooth areas and concentrate the watermark energy in texture details. This is consistent with the SSIM value of 0.9077 reported in
Table 2; that is, this method better maintains the structural integrity of the image.
Figure 5 shows the amplified residual maps (×5) corresponding to
Figure 4. The residuals of the baseline HiDDeN exhibit noticeable artifacts even in smooth regions, whereas the proposed dynamic gating model produces more uniform and subtle perturbations that are largely confined to textured areas. This further demonstrates the superior imperceptibility of our method.
Figure 6 visualizes the residual maps (scaled by ×5) under six common distortions: JPEG compression (Q = 85 and Q = 75), Gaussian blur, resize, random crop (90% area then resize), and color jitter. Even under these strong attacks, residuals remain highly localized to high-texture regions, while smooth areas show no noticeable artifacts. This observation further confirms the imperceptibility and robustness of our proposed method.
Figure 6 shows the BER evolution curves of each model on the validation set. The original HiDDeN model tended to stabilize around the 10th epoch, while the dynamic gating model maintained the lowest BER trajectory throughout training with smaller fluctuations, demonstrating stronger learning efficiency and robustness.
Figure 7 illustrates the interpretability of the dynamic gating mechanism. The local path weight (local gate) gradually increased from its initial value of about 0.5 and eventually stabilized at approximately 0.89, while the global path weight decreased accordingly. This behavior suggests that, in an ideal channel without attacks, precise watermark recovery depends heavily on pixel-level local multi-scale features. Through adaptive learning, the model focuses its attention on capturing fine textures, which reduces the bit error rate without compromising visual quality.
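A two-way Softmax gate of the kind described (a scalar local weight and a scalar global weight that sum to 1) can be sketched as follows. The paper does not specify how the gate logits are produced, so the inputs `local_logit` and `global_logit` and the function name `gate_weights` are assumptions made purely for illustration.

```python
import math

def gate_weights(local_logit, global_logit):
    """Softmax over two scalar logits; returns (local_weight, global_weight)."""
    m = max(local_logit, global_logit)      # subtract the max for numerical stability
    e_local = math.exp(local_logit - m)
    e_global = math.exp(global_logit - m)
    total = e_local + e_global
    return e_local / total, e_global / total

# Equal logits give an even split, close to the reported starting point.
print(gate_weights(0.0, 0.0))

# A logit gap of ln(0.89 / 0.11) corresponds to a local weight of 0.89.
gap = math.log(0.89 / 0.11)
print(gate_weights(gap, 0.0))
```

Under this formulation, the reported shift from about 0.5 to about 0.89 corresponds to the local logit growing until it exceeds the global logit by roughly 2.09.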
4.4. Capacity Scaling
To evaluate capacity scalability, we trained the baseline and our model with 64-bit and 100-bit messages under the same no-attack setting.
Table 4 reports the results.
4.5. Generalization on Additional Datasets
To further evaluate cross-dataset generalization, we directly applied our COCO-trained model to the DIV2K dataset and the Mini-ImageNet dataset in a zero-shot manner. All images were resized to 128 × 128. The results are summarized in
Table 5.
5. Discussion
The quantitative analysis presented in
Table 2 confirms that the proposed model, which synergistically integrates the SE module, MA-FFM, and dynamic gating, significantly outperforms the original HiDDeN baseline [
5]. By achieving a lower bit error rate (BER) of 0.1538 and a higher PSNR of 31.59 dB, the model demonstrates a superior ability to balance watermark extraction accuracy with visual fidelity. These results indicate that the inclusion of the SE module at the decoder side allows for explicit modeling of inter-channel relationships, effectively isolating key feature channels while suppressing irrelevant noise [
11]. However, the ablation study reveals that the SE module’s efficacy is highly dependent on its interaction with local features. Specifically, when the MA-FFM is removed (SE Only), the BER escalates to 0.1887, the highest among all tested configurations. This performance degradation supports the theoretical concern that global average pooling (GAP) may inadvertently erase spatial details essential for watermark recovery [
11], emphasizing that the SE module must be complemented by local texture descriptors to be effective.
The superiority of the dynamic gating mechanism is further highlighted when compared to the fixed-weight fusion approach. The experimental data show that the fixed-gating configuration (0.5:0.5 ratio) yields a BER of 0.1706, which is slightly higher than the original HiDDeN baseline of 0.1696. This phenomenon suggests a state of “adversarial cancellation,” where the global optimization goals of the SE module and the local enhancement objectives of the MA-FFM conflict, leading to a performance bottleneck [21]. The fixed equal weighting forces the model to treat local and global features with equal importance throughout training, which prevents the network from adapting to the natural dominance of local features in the later stage. In contrast, our dynamic gating learns the fusion weights from data (evolving from 0.51 to 0.89), which avoids this conflict and yields better performance. Instead of fluctuating randomly, the gating weights follow a structured path that increasingly prioritizes local multi-scale features in attack-free scenarios, reflecting their decisive role in decoding precision.
As illustrated in
Figure 8, this adaptive mechanism ensures that the model retains the benefits of channel recalibration without sacrificing critical spatial information, ultimately achieving a “win–win” in both BER and perceptual quality. Furthermore, while the MA-FFM alone without SE manages to reduce the BER to 0.1667, it results in a slight reduction in PSNR to 29.99 dB. This marginal loss in pixel-level accuracy is attributed to the amplification of high-frequency components, which, while beneficial for watermark carrying, increases the contrast between background noise and image details. Nevertheless, the stability of the SSIM index 0.9077 for the full model suggests that these artifacts are primarily confined to high-frequency bands that are less perceptible to the human visual system, making the trade-off highly acceptable within the framework of structured feature fusion [
21].
Figure 9 shows the ablation study results as relative improvements of the proposed model over different baselines. Compared with the original HiDDeN, standalone modules (SE Only, MA-FFM Only), and fixed gating, our full model achieves consistent performance gains, especially an 18.5% reduction in BER when compared to SE Only.
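The relative BER reductions quoted in this paper follow from a simple ratio, which the short sketch below reproduces from the BER values reported in the text. The function name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return (baseline - proposed) / baseline * 100.0

FULL_MODEL_BER = 0.1538  # BER of the proposed model, as reported in the text

# BER values of the compared configurations, as reported in the text.
reported = {
    "HiDDeN": 0.1696,
    "SE Only": 0.1887,
    "MA-FFM Only": 0.1667,
    "Fixed gating": 0.1706,
}

for name, ber in reported.items():
    print(f"{name}: {relative_reduction(ber, FULL_MODEL_BER):.1f}%")
```

The first two lines of output correspond to the 9.3% and 18.5% figures cited in the text; the absolute BER differences behind them are 0.0158 and 0.0349, respectively.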
Novelty beyond module combination. While MA-FFM and SE are existing modules, the novelty of this paper lies in the synergistic placement of MA-FFM in the encoder and SE in the decoder which outperforms individual placements, the interpretability provided by the dynamic gating evolution from 0.51 to 0.89, and the systematic analysis of capacity scaling from 30 to 100 bits and robustness to differentiable noise layers including color jitter, JPEG and Gaussian blur. These contributions go beyond a mere combination and offer new insights into deep watermarking.
Comparison with recent state-of-the-art methods. Direct experimental benchmarking against MBRS [
7], TSDL [
9], PIMoG [
8] and DARI-Mark [
22] under unified configurations is infeasible. Specifically, MBRS fixes 256 × 256 input resolution and square-number payload length; TSDL has no released source code; and PIMoG is specially optimized for screen-capture distortions inconsistent with our test set. For fair controlled experiments, HiDDeN [
5] is adopted as the core baseline. We supplement cross-method analysis using performance values reported in the original literature.
Advantages over state-of-the-art methods. Compared to recent advanced watermarking methods such as MBRS [
7], PIMoG [
8], and CIN [
2], our approach offers several distinct advantages. First, it is lightweight: with only 15.6 M parameters and 3.12 G FLOPs, it adds only 23% computational overhead over the HiDDeN [
5] baseline while achieving 9.3% BER reduction. In contrast, CIN has roughly three times more parameters and significantly slower training. Second, our model is interpretable: the dynamic gating evolution (0.51 → 0.89) reveals for the first time how a watermarking model gradually shifts from global context to local multi-scale features. To the best of our knowledge, no other deep watermarking work provides such transparency. Third, we systematically evaluate capacity scalability (30/64/100 bits) and robustness to multiple differentiable noise layers (color jitter, JPEG, Gaussian blur), which are rarely explored together in previous works. While some SOTA methods may achieve higher PSNR or lower BER on specific metrics, our method focuses on a balanced trade-off among imperceptibility, robustness, interpretability, and computational efficiency.
Despite these advancements, certain limitations remain to be addressed in future research. Although the model exhibits robustness against standard distortions such as Gaussian noise and brightness adjustments [
5], its performance degrades markedly under severe distortions, such as low-quality JPEG compression (e.g., Q = 50, where the BER increases to ≈0.48) or significant geometric cropping [
33,
34] (e.g., 50% random crop, where the BER reaches ≈0.35). Qualitative examples of these failure cases are shown in
Figure 10. These failure cases indicate that the watermark is not fully recoverable under such extreme distortions. They often stem from the irreversible loss of high-frequency details, a common issue even in state-of-the-art watermarking frameworks. To overcome these limitations, we plan to incorporate differentiable JPEG simulation (e.g., Diff-JPEG) and stronger geometric augmentations (e.g., random crop, rotation, scaling) into the training pipeline. Several recent works have attempted to address geometric distortions [
23], yet this remains a challenging open problem. In addition, owing to the limited memory of a single GPU (8 GB) and time constraints, higher-resolution experiments (e.g., 256 × 256) are not included in this work and are left for future study. To further enhance the system’s reliability, future work will also integrate additional differentiable noise layers to improve resistance against unseen distortions [
28]. Additionally, we intend to establish more rigorous statistical benchmarks through large-scale, repeated experiments to ensure the reproducibility and generalizability of the proposed collaborative mechanism in diverse, non-stationary environments [
35].
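For reference, the BER values quoted above can be read against the following minimal sketch of the metric, assuming binary messages that have already been thresholded to 0/1. The function name `bit_error_rate` and the toy 30-bit message are illustrative and are not taken from the released code. Because a decoder that guesses uniformly at random is expected to reach a BER of 0.5 on random bits, a value of ≈0.48 suggests that almost no message information survives the distortion.

```python
def bit_error_rate(sent_bits, decoded_bits):
    """Fraction of positions where the decoded bit differs from the sent bit."""
    if len(sent_bits) != len(decoded_bits):
        raise ValueError("messages must have the same length")
    if not sent_bits:
        raise ValueError("messages must not be empty")
    errors = sum(1 for s, d in zip(sent_bits, decoded_bits) if s != d)
    return errors / len(sent_bits)


# Toy 30-bit message (30 bits is the smallest payload studied in the paper).
message = [1, 0] * 15

# Simulate a decoder that gets three positions wrong.
decoded = list(message)
for idx in (0, 7, 29):
    decoded[idx] ^= 1  # flip the bit

ber = bit_error_rate(message, decoded)
print(f"BER = {ber:.3f}")
```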
6. Conclusions
In this paper, we addressed the inherent limitations of the HiDDeN framework in handling complex feature representations and cross-scale fusion by proposing an adaptive multi-scale blind watermarking algorithm grounded in dual-attention synergy. By integrating a dynamic multi-scale feature fusion module (MA-FFM) with a channel attention mechanism at the decoder, our architecture moves beyond the traditional single-scale extraction paradigm. This allows the network to adaptively balance local fine-grained textures with global contextual semantics during the watermark recovery process.
The experimental results on the COCO dataset demonstrate that the proposed model consistently improves on the baseline. The bit error rate (BER) is reduced by a relative 9.3%, while visual quality remains competitive, with a PSNR of 31.59 dB and an SSIM of 0.9077. Beyond quantitative gains, a key contribution of this work is the enhanced interpretability of the deep watermarking process. The observation that the dynamic gating weights evolve from 0.51 to approximately 0.89 provides empirical evidence that local multi-scale features are decisive for information restoration in high-fidelity scenarios. This evolution transforms the traditionally “black-box” embedding logic into a more transparent mechanism with discernible physical significance.
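As a reading aid for these figures, the sketch below shows how a relative BER reduction and a PSNR value are conventionally computed. It assumes an 8-bit pixel range (peak value 255) and flattened pixel lists; the helper names and the BER pair 0.0100 and 0.00907 are hypothetical illustrations, not results or code from this paper.

```python
import math


def relative_reduction(baseline, improved):
    """Relative drop of a metric with respect to its baseline value."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


def psnr(cover, stego, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    Assumption: pixel values lie in [0, max_value].
    """
    if len(cover) != len(stego) or not cover:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((c - s) ** 2 for c, s in zip(cover, stego)) / len(cover)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical BER pair chosen to give a 9.3% relative reduction.
hypothetical_drop = relative_reduction(0.0100, 0.00907)

# Toy images: every pixel of the watermarked image is shifted by 5 levels.
cover = [120.0, 64.0, 200.0, 33.0]
stego = [c + 5.0 for c in cover]
example_psnr = psnr(cover, stego)

print(f"relative BER reduction: {hypothetical_drop:.1%}")
print(f"toy PSNR: {example_psnr:.2f} dB")
```

A relative reduction is measured against the baseline BER, so it should not be read as an absolute drop of 9.3 percentage points.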
Ablation studies further confirm that the soft coupling of the MA-FFM and the SE module is essential for achieving optimal performance, as it effectively mitigates the feature interference observed in static fusion schemes. While the model maintains robust performance under standard distortions, including various types of noise and JPEG compression, performance bottlenecks persist under extreme conditions such as ultra-low quality compression or severe geometric cropping.
Compared to recent advanced watermarking methods such as MBRS and PIMoG, our work focuses on a different direction within the HiDDeN framework, namely interpretability (dynamic gating evolution 0.51 → 0.89), capacity scalability (30/64/100 bits), and robustness to multiple noise types (color jitter, JPEG, Gaussian blur). Due to differences in technical focus and code compatibility, we limit our main experimental comparisons to the HiDDeN baseline, which is the most direct reference for our architectural improvements.
Future research will focus on systematically investigating the optimal placement of the SE module (encoder vs. decoder) and its interaction with MA-FFM, as well as incorporating differentiable noise layers for end-to-end joint training to enhance resilience against non-stationary real-world distortions. Furthermore, we aim to extend this adaptive synergy mechanism to emerging frontiers, such as short-video copyright protection and the governance of AI-generated content (AIGC). Other emerging threats, such as screen-shooting attacks, have recently been addressed using cross-attention mechanisms [
36], an approach that is also worth exploring. Through extensive statistical validation, we will continue to refine the algorithm’s robustness boundaries for large-scale industrial applications.