3.1. Model Architecture
The overall architecture of our model is shown in
Figure 1. It consists of four parts. (1) Encoder: The encoder E, with parameters θ, takes the original image I_co and the watermark W_in as inputs and outputs a residual image I_res; the watermarked image I_en is then obtained as I_en = I_co + I_res. (2) Noise layer: A joint noise layer is constructed by fitting mathematical functions to various noises. The noise layer takes the watermarked image I_en as input and outputs a noisy image I_no. (3) Decoder: The decoder D, also with parameters θ, takes the noisy image I_no as input and outputs the decoded watermark W_out. (4) Adversary: The adversary, with parameters θ_adv, is trained to differentiate between the original image I_co and the watermarked image I_en. By learning to distinguish watermarked images from originals, the adversary pushes the encoder to fuse the features of the watermark and the original image more effectively, thereby improving the quality of the watermarked image.
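The data flow through the four parts can be sketched in a few lines of numpy. This is a toy stand-in, not the trained model: the encoder here produces an arbitrary small residual, and a simple additive perturbation plays the role of the noise layer; all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(cover, watermark):
    # Placeholder encoder: the real encoder is a trained CNN. Here the
    # "residual" is just a tiny pattern derived from the watermark bits.
    return 0.01 * (watermark.mean() - 0.5) * np.ones_like(cover)

cover = rng.random((32, 32, 3))           # original image I_co
watermark = rng.integers(0, 2, 30)        # 30-bit watermark W_in

residual = encode(cover, watermark)       # residual image I_res
watermarked = cover + residual            # I_en = I_co + I_res

# Noise layer stand-in: additive Gaussian perturbation of I_en.
noisy = watermarked + rng.normal(0, 0.01, watermarked.shape)
print(watermarked.shape, np.abs(watermarked - cover).max())
```

The decoder would then take `noisy` and try to recover the 30 watermark bits.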
Encoder. The encoder architecture of our model is shown on the left of
Figure 2. The encoder E is a network with trainable parameters θ that embeds the watermark into the original image. The watermark W_in is first expanded redundantly into a spatial tensor whose height and width match those of the original image. This expanded watermark is concatenated with the cover image, and the network generates a residual image I_res, which is added to the cover image to produce the watermarked image. The encoder consists of seven convolutional layers. To capture features from both the watermark and the cover image at multiple levels, the outputs of layers 2, 4, 5, and 7 are each concatenated with the expanded watermark, while the outputs of convolutional layers 3 and 6 are concatenated with the original encoded features. These multi-level convolutional mappings allow the watermark pattern to be learned more efficiently. An MSE loss function is applied during encoder training to minimize the difference between the watermarked image and the original image.
Decoder. The decoder D, with parameters θ, is trained to extract the watermark from a noisy image I_no. As shown on the right of
Figure 2, the main structure of the decoder is similar to that of the encoder, using seven convolutional layers and one fully connected layer. To ensure the final output has the same shape as the original watermark W_in, the strides of Conv1 and Conv3 are set to the default value of 1, while the strides of the remaining layers are set to 2 to reduce the feature tensor size. After Conv7, the feature maps are flattened, and a dense layer outputs a one-dimensional tensor of the same length as W_in. We also apply an MSE loss function to train the decoder, minimizing the difference between the recovered watermark W_out and the original watermark W_in so that each recovered bit matches the embedded bit.
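The per-bit agreement that this loss targets is usually evaluated as a bit error rate (BER): threshold the decoder's real-valued outputs at 0.5 and count disagreements. A minimal sketch (the example scores are made up):

```python
import numpy as np

def bit_error_rate(w_in, w_out):
    """Fraction of bits that differ after thresholding the decoder output."""
    recovered = (np.asarray(w_out) >= 0.5).astype(int)
    return np.mean(recovered != np.asarray(w_in))

w_in = np.array([1, 0, 1, 1, 0, 0, 1, 0])                      # embedded bits
w_out = np.array([0.9, 0.1, 0.8, 0.3, 0.2, 0.6, 0.7, 0.1])    # decoder scores
print(bit_error_rate(w_in, w_out))  # 0.25: two of eight bits are wrong
```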
Adversary. In our scheme, the adversary consists of three convolutional layers followed by a pooling layer for classification. The adversary predicts whether an image contains watermark information and is trained jointly with the encoder in an adversarial fashion: the adversary parameters θ_adv are updated to minimize its binary classification loss on original versus watermarked images, while the encoder parameters θ are updated to minimize the adversary's ability to detect the watermark, i.e., to make I_en indistinguishable from I_co.
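Assuming the standard GAN-style binary cross-entropy formulation (the original equations are not reproduced here, so the exact weighting is an assumption), the two updates can be sketched as:

```python
import numpy as np

def bce(pred, label):
    # Binary cross-entropy for a single probability prediction.
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -(label * np.log(pred) + (1 - label) * np.log(1 - pred))

# a_co, a_en: the adversary's predicted probability that an image is
# watermarked, for an original and a watermarked image respectively.
a_co, a_en = 0.2, 0.7

# Adversary update: classify originals as 0 and watermarked images as 1.
loss_adv = bce(a_co, 0.0) + bce(a_en, 1.0)

# Encoder update: fool the adversary into labeling I_en as original (0).
loss_enc = bce(a_en, 0.0)
print(loss_adv, loss_enc)
```

In training, `loss_adv` drives the θ_adv update and `loss_enc` is added (with a small weight) to the encoder's objective.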
Noise layer. The noise layer is a crucial component for achieving robustness. Common differentiable noises, such as crop, Gaussian filtering (GF), Gaussian noise (GN), resize, and salt and pepper (S&P) noise, can be modeled directly with existing mathematical functions. However, for non-differentiable noise types such as JPEG compression, existing fitting functions are not particularly effective. To address this issue, we introduce the refined JPEG compression module designed by Reich et al. [
26]. To the best of our knowledge, this is the first application of this scheme in the field of watermarking. The module makes quantization, rounding down, truncation, and other operations differentiable, so it can be used directly when training deep learning models while closely matching real JPEG compression. Specifically, for differentiable quantization, the module adopts the polynomial approximation proposed by Shin et al. [
27] to approximate the rounding function used in quantization.
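Shin et al.'s approximation replaces round(x) with round(x) + (x − round(x))³: the forward value stays within 0.125 of true rounding, but the cubic term provides a usable gradient where the true rounding function has none. A minimal sketch:

```python
import numpy as np

def diff_round(x):
    # Polynomial approximation of rounding: the forward value stays close
    # to round(x), while the cubic term contributes a usable derivative,
    # unlike the true (piecewise-constant) rounding function.
    return np.round(x) + (x - np.round(x)) ** 3

x = np.array([0.2, 1.9, -0.4, 3.0])
print(diff_round(x))
```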
In the quantization step, a given JPEG quality Q is mapped to a scale factor s. Following the standard (IJG) convention, the scaling factor is calculated as s = 5000/Q for Q < 50 and s = 200 − 2Q otherwise, and the base quantization table is then scaled by s/100.
Since the scaling factor s and the corresponding quantization table must be stored as integers, a differentiable floor function is necessary. The JPEG compression module uses polynomial rounding to approximate the floor function, which offers advantages over alternative schemes.
For differentiable quantization table (QT) clipping, the JPEG compression module employs a differentiable clipping technique to constrain the table values to an appropriate range.
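The (non-differentiable) integer reference computation that these modules approximate follows the standard IJG quality scaling; the base-table entries below are the first values of the standard luma table, used here only for illustration:

```python
import numpy as np

def jpeg_scale_factor(quality):
    # Standard IJG mapping from quality Q in [1, 100] to a percent scale.
    quality = max(1, min(quality, 100))
    return 5000 // quality if quality < 50 else 200 - 2 * quality

def scale_qt(base_table, quality):
    # Scale a base quantization table; entries are clipped to [1, 255].
    s = jpeg_scale_factor(quality)
    scaled = (base_table * s + 50) // 100
    return np.clip(scaled, 1, 255)

base = np.array([16, 11, 10, 16])    # first entries of the luma base table
print(jpeg_scale_factor(50))         # 100 -> table unchanged
print(scale_qt(base, 50))
```

The differentiable module replaces the integer division, floor, and clip in `scale_qt` with their smooth approximations while keeping the same forward behavior.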
3.2. Model Training and Testing Strategy
Our pipeline comprises three steps: two training stages and a testing stage that adaptively controls watermarking performance through strength factor adjustment. During training, the method learns to generate residual images that are robust to various high-intensity distortions. At test time, a strength-balanced watermarking optimization algorithm is applied to adaptively produce high-quality watermarked images under diverse noise conditions. The entire training and testing process is shown in
Figure 3.
In the first training stage, the model is trained under various strong noise distortion conditions. To enhance stability and accelerate training, only the decoder loss is used in this phase, i.e., the MSE between the original watermark W_in and the decoded watermark W_out.
At the same time, to improve the robustness of the model under combined noise conditions, we train a single model that can withstand a variety of noises, rather than training separate models for each type of noise. The noise pool N we use consists of crop, Gaussian filtering, Gaussian noise, resize, salt and pepper noise, and JPEG compression. When training the model, the watermarked image is randomly distorted by one of the noises in the noise pool N
before decoding, which ensures that the model is robust to various types of noise. By covering such aggressive distortions early in training, the model learns generalized robustness across diverse attack scenarios.
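The per-step random selection from the noise pool can be sketched as follows. The pool entries here are simple numpy stand-ins (the real crop/resize/JPEG layers are differentiable modules), and the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise pool: each entry distorts a watermarked image.
noise_pool = {
    "identity": lambda img: img,
    "gaussian_noise": lambda img: img + rng.normal(0, 0.05, img.shape),
    "dropout": lambda img: img * (rng.random(img.shape) > 0.1),
}

def apply_random_noise(watermarked):
    # One noise type is drawn at random for each training step.
    name = rng.choice(list(noise_pool))
    return name, noise_pool[name](watermarked)

img = np.ones((8, 8))
name, noisy = apply_random_noise(img)
print(name, noisy.shape)
```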
As shown by the corresponding variables in
Table 2, during training the model randomly selects noises of different types and intensities at each step. Consequently, the model learns generalized features under diverse noise conditions. The first stage of training is designed to enable the decoder to efficiently learn robust features for various types of noise.
In the second stage, building upon the robust decoder obtained in the first stage, the model is further trained using all the loss functions, so that it benefits from both the initial robustness and optimization across all parameters.
Our focus shifts toward enhancing imperceptibility while preserving robustness. To achieve this balance, we fix the distortion parameters to moderate levels, representative of realistic but less extreme perturbations, and continue training with fixed medium-intensity noise as shown in
Table 2. This design choice lets the model prioritize perceptual quality without sacrificing resilience under practical conditions, improving the quality of watermarked images while maintaining the robustness of the watermark.
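The two-stage loss schedule can be sketched as below. The weights are illustrative placeholders, not the values used in the paper:

```python
# Hypothetical two-stage loss schedule; the weights are illustrative.
def total_loss(stage, l_enc, l_dec, l_adv,
               w_enc=0.7, w_dec=1.0, w_adv=0.001):
    if stage == 1:
        # Stage 1: decoder (message) loss only, for stable robust training.
        return w_dec * l_dec
    # Stage 2: all losses, trading off imperceptibility and robustness.
    return w_enc * l_enc + w_dec * l_dec + w_adv * l_adv

print(total_loss(1, 0.5, 0.2, 0.9))  # 0.2
print(total_loss(2, 0.5, 0.2, 0.9))  # 0.7*0.5 + 1.0*0.2 + 0.001*0.9
```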
While both our method and TSDL [
20] adopt a staged training paradigm, their objectives and design philosophies differ. TSDL [
20] first prioritizes generating high-fidelity watermarked images and only subsequently incorporates robustness considerations. In contrast, our staged design is a targeted strategy for achieving strong resilience against composite distortions. The first stage directly optimizes for robustness by exposing the model to increasingly severe synthetic noise and differentiable approximations of common attacks, thereby learning distortion-invariant watermark representations early in training. In the second stage, the focus shifts toward enhancing imperceptibility while preserving robustness.
In the testing phase, to further improve the quality of the watermarked image, we design a strength-balanced watermarking optimization algorithm that trades off watermarked-image quality against decoding accuracy; the procedure is shown in Algorithm 1. The core idea is that since the watermarked image I_en is obtained by directly adding the residual image I_res to the original image I_co, both the robustness and the imperceptibility of the watermark are mainly determined by the residual image. We therefore use a strength factor S to adjust the weight of the residual image during testing: I_en = I_co + S · I_res.
By adjusting S, we aim to identify, under different noise conditions, an optimal value that ensures watermark imperceptibility while keeping the decoding error rate acceptable. In our implementation, the strength-balancing algorithm adopts a sequential search strategy: it starts from a small watermark strength and increases S by a fixed step until either the extracted watermark achieves the target bit error rate or the maximum number of attempts is reached. This design ensures a deterministic runtime: for each image, the algorithm performs a bounded number of iterations, set to 50 in our experiments. The strength factor S is bounded within a predefined range and optimized independently for each test image. Starting from the lower bound, we iteratively increase S until the extracted watermark satisfies a target BER threshold, thus achieving the best possible visual quality while preserving robustness.
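The sequential search can be sketched as follows. The `decode` callable stands in for the trained decoder (plus the noise layer) and is hypothetical, as are the toy bit pattern and threshold; the structure mirrors the search described above:

```python
import numpy as np

def strength_search(cover, residual, decode, w_in, target_ber=0.0,
                    s_min=0.1, step=0.1, max_iters=50):
    # Find the smallest strength S whose decoded watermark meets the
    # target BER; give up after max_iters attempts (50 in our experiments).
    s = s_min
    for _ in range(max_iters):
        watermarked = cover + s * residual        # I_en = I_co + S * I_res
        ber = np.mean(decode(watermarked) != w_in)
        if ber <= target_ber:
            return s, ber
        s += step
    return s, ber

# Toy setup: the "decoder" recovers bits once the residual is strong enough.
w_in = np.array([1, 0, 1, 1])
cover = np.zeros(4)
residual = 2.0 * w_in - 1.0                       # +/-1 pattern per bit
decode = lambda img: (img - cover > 0.25).astype(int)
s_opt, ber = strength_search(cover, residual, decode, w_in)
print(s_opt, ber)
```

Because the search stops at the first S that meets the BER target, the residual is kept as weak as possible, which is what preserves visual quality.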
| Algorithm 1 Method of the Strength-Balanced Watermarking Optimization |
- Input:
Original images I_co, watermarks W_in, target error_rate - Output:
BER, PSNR, SSIM, optimal strength factor S
- 1:
Initialize parameters: - 2:
S ← S_min - 3:
ΔS ← step size - 4:
iter ← 0 - 5:
max_iters ← 50 - 6:
BER ← 1 - 7:
while iter < max_iters do - 8:
Encode W_in into I_co with strength S → Generate I_res and I_en = I_co + S · I_res - 9:
Compute PSNR and SSIM between I_co and I_en - 10:
Compute BER between W_in and W_out - 11:
if BER > error_rate then - 12:
S ← S + ΔS - 13:
iter ← iter + 1 - 14:
Record BER, PSNR, SSIM - 15:
else - 16:
Break loop - 17:
end if - 18:
end while - 19:
Output S, corresponding BER, PSNR, SSIM
|