Article
Peer-Review Record

Deep Watermarking Based on Swin Transformer for Deep Model Protection

Appl. Sci. 2025, 15(10), 5250; https://doi.org/10.3390/app15105250
by Cheng-Hin Un and Ka-Cheng Choi *
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 9 April 2025 / Revised: 25 April 2025 / Accepted: 26 April 2025 / Published: 8 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The presented article focuses on practical deep model protection, which is highly relevant to intellectual property protection in AI. The article contains extensive experiments: the authors used a database of 12,200 images, applied Swin Transformers for watermarking, compared their approach with several methods (e.g., Zhang, HiNet, HiDDeN), tested resistance to attacks, and achieved minimal impact on the visual quality of the watermarked images. Overall, the article has good structure and content, but it requires modifications that will strengthen its scientific value and improve its clarity, accuracy, and formality.

I recommend the following modifications:

  1. Complete the comparison of CNN vs. Swin Transformer.
  2. Justify why you chose Swin-UNet. Have you also verified the use of other Swin Transformer-based models in the field of watermarking?
  3. Complete the links to the latest articles on the use of Swin Transformers in other types of tasks, especially from the last 2-3 years (most of the cited articles are from 2017).
  4. The article lacks statistical analysis of the results: for example, calculating the standard deviation or confidence interval.
  5. Add a sample of images where watermark extraction was not perfect and add an interpretation.
  6. There are minor grammatical and stylistic errors and unclear formulations throughout the text (e.g., p. 4, lines 135-136), as well as overly complex sentences (e.g., p. 2, lines 42-43), which make the text difficult to understand.
  7. The text uses many abbreviations that are not always explained the first time they are used.
  8. Some information is repeated in the text, for example, explaining the purpose of the watermark.
  9. The image descriptions are too long and unclear.
  10. p. 17, Table 3: the difference between "Our†" and "Our" is not sufficiently explained.
  11. In the conclusion or in the discussion, add information about what real attacks can be expected in practice. 
Comments on the Quality of English Language

I do not feel competent to assess the language level.

Author Response

Comments 1: Complete the comparison of CNN vs. Swin Transformer.

Response 1: Thank you for your valuable comment. We have revised Section 3.2 to include a more comprehensive comparison between CNN and Swin Transformer. Specifically, we added Table 1 to clearly illustrate the architectural differences and highlight the advantages of adopting the Swin Transformer in our watermark embedding framework.

 

Comments 2: Justify why you chose Swin-UNet. Have you also verified the use of other Swin Transformer-based models in the field of watermarking?

Response 2: Thank you for your valuable comment. We have revised the final part of the Introduction to enhance the logical flow: starting from the limitation of CNNs in capturing long-range dependencies (as observed in Zhang et al.'s method), we introduce the Swin Transformer for its ability to model global context more effectively. We also highlight its strong performance in tasks such as large-scale classification and medical image segmentation. Based on this, we adopt Swin-UNet as our embedding backbone, as it combines Swin Transformer’s global modeling with UNet’s input-output consistency, making it well-suited for image-to-image tasks. While we have not explored other Swin Transformer variants for watermarking, we have provided a rationale for our choice of Swin-UNet, which we believe is the most appropriate for the task at hand.
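For illustration, a minimal PyTorch sketch of such an image-to-image embedder is given below, with the Swin-UNet backbone left as a placeholder module; the class name, channel layout, and residual-style output are our assumptions for exposition, not the manuscript's actual code.

```python
import torch
import torch.nn as nn

class WatermarkEmbedder(nn.Module):
    """Wraps an image-to-image backbone (e.g. a Swin-UNet) as a watermark embedder."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # The backbone is assumed to take 4 input channels (cover + watermark)
        # and return a 3-channel map of the same spatial size, as a UNet-style
        # encoder-decoder would.
        self.backbone = backbone

    def forward(self, cover: torch.Tensor, watermark: torch.Tensor) -> torch.Tensor:
        # cover: (B, 3, H, W); watermark: (B, 1, H, W) binary map
        x = torch.cat([cover, watermark], dim=1)         # condition on the watermark
        residual = self.backbone(x)                      # same spatial size as the cover
        return torch.clamp(cover + residual, 0.0, 1.0)   # watermarked (stego) image
```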

 

Comments 3: Complete the links to the latest articles on the use of Swin Transformers in other types of tasks, especially from the last 2-3 years (most of the cited articles are from 2017).

Response 3: Thank you for your valuable comment. We agree with your suggestion and have revised the final part of the Introduction to include recent studies from the past 2–3 years that demonstrate the application of Swin Transformers in various image processing tasks. These additions better reflect the current development and relevance of Swin Transformer-based architectures.

 

Comments 4: The article lacks statistical analysis of the results: for example, calculating the standard deviation or confidence interval.

Response 4: Thank you for your valuable comment. We agree with your point, as the fluctuation in PSNR is indeed significant and largely depends on the choice of images. To address this, we have revised the PSNR representation in Table 3, expressing it as "mean ± standard deviation." We believe this approach better captures the variability in our results and provides a clearer representation of the performance across different test images.
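For readers who wish to reproduce this kind of reporting, a minimal numpy sketch of computing per-image PSNR and summarizing it as mean ± standard deviation is shown below (our own illustration; image pairs are assumed to be float arrays in [0, 1]).

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a single image pair."""
    mse = np.mean((reference - distorted) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def psnr_mean_std(pairs):
    """Summarize PSNR over a test set as (mean, standard deviation)."""
    values = np.array([psnr(ref, dist) for ref, dist in pairs])
    return values.mean(), values.std()

# e.g. reported in a table cell as f"{mean:.2f} ± {std:.2f}"
```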

 

Comments 5: Add a sample of images where watermark extraction was not perfect and add an interpretation.

Response 5: Thank you for your valuable comment. We have added Figure 8 in Section 4.3 to present several cases where watermark extraction was less effective, along with a brief explanation discussing why the extractor performs worse in such scenarios. Additionally, new experimental results have been included in Section 4.4 (Figure 10), showcasing some situations where the performance is not as optimal.

 

Comments 6: There are minor grammatical and stylistic errors and unclear formulations throughout the text (e.g., p. 4, lines 135-136), as well as overly complex sentences (e.g., p. 2, lines 42-43), which make the text difficult to understand.

Response 6: Thank you for your valuable comment. Following your suggestion, we revised several grammatically incorrect or stylistically unclear sentences to improve readability. For example:

"Such actions are highly unfair to the original model creator, but even if infringement is detected, legal avenues can be costly and time-consuming, often failing to provide an effective solution."

was revised to

"Such actions are highly unfair to the original model creator. Even when infringement is detected, pursuing legal action is often costly and time-consuming, and it rarely leads to an effective solution."

 

Comments 7: The text uses many abbreviations that are not always explained the first time they are used

Response 7: Thank you for your valuable comment. We agree with your point, and have thoroughly reviewed the entire paper. We have now provided the full forms of abbreviations the first time they appear, along with brief explanations to help readers better understand the text.

 

Comments 8: Some information is repeated in the text, for example, explaining the purpose of the watermark.

Response 8: Thank you for your valuable comment. We agree with your point, and some content was indeed repetitive. We have removed or simplified redundant sections, such as repeatedly listing the specific methods used in data augmentation. We also reduced the repeated discussion on the limitations of CNN in Zhang’s method, condensing it for clarity. Additionally, the introduction of the two optimization points at the end of Section 3.1 has been simplified, with key details retained in the relevant sections. Throughout the paper, we have also streamlined other sentences we identified as repetitive.

 

Comments 9: The image descriptions are too long and unclear.

Response 9: Thank you for your valuable comment. We agree with your suggestion and have simplified all figure captions to improve clarity and readability.

 

Comments 10: p. 17, Table 3: the difference between "Our†" and "Our" is not sufficiently explained.

Response 10: Thank you for your valuable comment. We have revised the description of Table 3 and added clarifications in the main text to explain the difference between "Our†" and "Our."

 

Comments 11: In the conclusion or in the discussion, add information about what real attacks can be expected in practice.

Response 11: Thank you for your valuable comment. We have revised the conclusion section to include information about potential real-world attacks, such as cropping, scaling, compression, and AI-based manipulations. Additionally, we have outlined possible strategies, including reducing watermark size, repeated embedding, and multi-dimensional embedding, to further enhance robustness.
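As an illustration of how such attacks can be simulated before running the extractor, a small Pillow-based sketch is given below; the crop ratio, scale factor, and JPEG quality are illustrative values, not the settings used in the paper.

```python
import io
from PIL import Image

def crop_attack(img: Image.Image, ratio: float = 0.8) -> Image.Image:
    """Keep only a corner region covering `ratio` of each side, then resize back."""
    w, h = img.size
    return img.crop((0, 0, int(w * ratio), int(h * ratio))).resize((w, h))

def scale_attack(img: Image.Image, factor: float = 0.5) -> Image.Image:
    """Downscale and upscale again, discarding high-frequency detail."""
    w, h = img.size
    return img.resize((int(w * factor), int(h * factor))).resize((w, h))

def jpeg_attack(img: Image.Image, quality: int = 50) -> Image.Image:
    """Round-trip the image through lossy JPEG compression."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```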

Reviewer 2 Report

Comments and Suggestions for Authors

The authors present an interesting approach for embedding invisible watermarks into images to enhance copyright protection. It replaces traditional convolution-based methods with the Swin-UNet framework, which captures global and local image details more effectively. The framework minimizes image quality loss during watermark embedding, achieves high extraction accuracy, and includes robust adversarial training methods to defend against attacks. This approach balances watermark invisibility and structural consistency without compromising visual fidelity. A few considerations would make the article stronger:

  1. The “Literature Review” section appears redundant and could be more cohesively integrated into the “Introduction” or merged with the “Materials and Methods” section for better narrative flow and focus.
  2. Please include an explanation of the AdaIM optimizer, especially how it adapts to the watermarking task compared with traditional optimizers like Adam or SGD. 
  3. While the reference to Zhang is acknowledged, it is essential to explicitly describe both the classifier and the normalized classifier used to detect the presence of watermarks. Include among other things how these classifiers assess the similarity between the extracted and original watermarks. This information is critical to fully evaluate the effectiveness of the watermark detection mechanism.
  4. It would significantly improve the analysis to provide zoomed-in regions of the extracted watermark images in Figure 8. This will help in visually assessing any distortions or artifacts, particularly in comparison with the method proposed by Zhang et al. 
  5. It would be valuable to demonstrate the system’s performance using QR code-like structured images as watermarks. These provide a more rigorous test of the model’s capacity to preserve high-frequency features under transformations and would offer practical insight into real-world applicability beyond binary sequence embedding.
  6. It would also be beneficial to have a detailed comparison of the training and inference speed between the Swin-UNet with data augmentation and Zhang's method.

 

Author Response

Comments 1: The “Literature Review” section appears redundant and could be more cohesively integrated into the “Introduction” or merged with the “Materials and Methods” section for better narrative flow and focus.

Response 1: Thank you for your valuable comment. We agree that integrating the literature review could improve narrative flow. However, since our method is an extension of Zhang’s work, we chose to present Zhang’s method in a separate section. This approach, although involving some repetition, was intended to help readers more clearly distinguish between the two methods and better understand the improvements made in our approach.

 

Comments 2: Please include an explanation of the AdaIM optimizer, especially how it adapts to the watermarking task compared with traditional optimizers like Adam or SGD.

Response 2: Thank you for your valuable comment, and we sincerely apologize for the confusion. We would like to clarify that the optimizer used in our study is Adam, not AdaIM. The mention of AdaIM was a writing mistake and has been corrected in the revised manuscript. Training was conducted using the Adam optimizer, with a dynamic learning rate adjustment strategy: the learning rate was reduced by a factor of 0.2 if the validation performance did not improve for three consecutive epochs. We have updated the relevant section in the manuscript to accurately reflect this.
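For clarity, the described schedule corresponds to a standard plateau-based learning-rate policy; a minimal PyTorch sketch on a toy model is shown below (interpreting "reduced by a factor of 0.2" as multiplying the learning rate by 0.2 is our assumption).

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # toy stand-in for the embedding/extraction networks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.2, patience=3
)

for epoch in range(20):
    x = torch.randn(16, 8)
    loss = nn.functional.mse_loss(model(x), x)   # stands in for the validation metric
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # lr is multiplied by 0.2 after 3 epochs without improvement
```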

 

Comments 3: While the reference to Zhang is acknowledged, it is essential to explicitly describe both the classifier and the normalized classifier used to detect the presence of watermarks. Include among other things how these classifiers assess the similarity between the extracted and original watermarks. This information is critical to fully evaluate the effectiveness of the watermark detection mechanism.

Response 3: Thank you for your valuable comment. We agree that the description of the two classifiers in our manuscript was insufficient. In response to your suggestion, we have added more detailed information in the section describing Zhang's overall framework. Specifically, we now clarify that the discriminator is trained using binary cross-entropy loss, with its output being a probability value. This output is then weighted and incorporated as part of the embedding loss during training. Additionally, we have provided further details on the watermark classifier, including the loss function used for its training. We also explain the differences between the watermark classifier and the Normalized Correlation (NC) method, highlighting the classifier's advantage in terms of robustness, especially under attack conditions.
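To make the two detection signals concrete, a short hedged sketch follows: the normalized correlation between the original and extracted watermarks, and an embedding loss in which the discriminator's probability output enters as a weighted binary cross-entropy term. The weight `lambda_adv` and the exact loss composition are illustrative, not the values used in the paper.

```python
import torch
import torch.nn.functional as F

def normalized_correlation(w_orig: torch.Tensor, w_ext: torch.Tensor, eps: float = 1e-8):
    """NC close to 1 indicates a faithful watermark extraction."""
    a, b = w_orig.flatten(), w_ext.flatten()
    return torch.dot(a, b) / (a.norm() * b.norm() + eps)

def embedding_loss(cover, stego, disc_prob, lambda_adv: float = 0.01):
    """Image-fidelity term plus a weighted adversarial term.

    `disc_prob` is the discriminator's probability that the stego image carries
    a watermark; the embedder is pushed to make this probability small.
    """
    fidelity = F.mse_loss(stego, cover)
    adversarial = F.binary_cross_entropy(disc_prob, torch.zeros_like(disc_prob))
    return fidelity + lambda_adv * adversarial
```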

 

Comments 4: It would significantly improve the analysis to provide zoomed-in regions of the extracted watermark images in Figure 8. This will help in visually assessing any distortions or artifacts, particularly in comparison with the method proposed by Zhang et al.

Response 4: Thank you for your valuable comment. We fully agree with your suggestion. To improve the visual comparison, we have added a zoomed-in row next to the clearer samples (e.g., Res16) in Figure 8. The magnified regions allow for a more intuitive observation of distortion and artifact differences in the extracted watermarks, especially when comparing with the method proposed by Zhang et al.

 

Comments 5: It would be valuable to demonstrate the system’s performance using QR code-like structured images as watermarks. These provide a more rigorous test of the model’s capacity to preserve high-frequency features under transformations and would offer practical insight into real-world applicability beyond binary sequence embedding.

Response 5: Thank you for your valuable comment. We agree with your suggestion. In our experiments, we tried using enlarged individual bits as watermark images, which, while not a true QR code watermark, can simulate a similar effect. We have added this experimental result and the corresponding discussion to the comparison in Section 4.4. We believe this addition provides a more comprehensive demonstration of our watermarking method's performance.
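For readers interested in reproducing such a watermark, a small numpy sketch of expanding a binary sequence into a block-structured, QR-code-like image is shown below; the 8x8 bit grid and 8-pixel block size are illustrative choices.

```python
import numpy as np

def bits_to_block_image(bits, grid: int = 8, block: int = 8) -> np.ndarray:
    """Expand a flat bit sequence into a (grid*block) x (grid*block) binary image."""
    assert len(bits) == grid * grid
    small = np.asarray(bits, dtype=np.float32).reshape(grid, grid)
    # np.kron tiles every bit into a block x block square of identical pixels
    return np.kron(small, np.ones((block, block), dtype=np.float32))

watermark = bits_to_block_image(np.random.randint(0, 2, 64))  # 64 x 64 image
```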

 

Comments 6: It would also be beneficial to have a detailed comparison of the training and inference speed between the Swin-UNet with data augmentation and Zhang's method.

Response 6: Thank you for your valuable comment. We agree with your point. In section 4.2, we have added descriptions regarding the inference time. However, we regret to inform you that we did not record specific training times during our experiments, and due to limited time for revision, we were unable to retrain the model to collect this data. Additionally, since our experiments were conducted on the Kaggle platform, which has limited computational resources, the training time would have limited reference value. Nevertheless, we can provide you with the number of parameters in the initial stage, which had the largest change. Our model has approximately 27.18 million parameters, while Zhang’s method has around 3.37 million parameters.
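The quoted parameter counts can be reproduced in PyTorch with a small helper such as the sketch below (our illustration; `embedder` is a hypothetical variable holding the network being measured).

```python
import torch.nn as nn

def count_parameters_millions(model: nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# e.g. count_parameters_millions(embedder) -> ~27.18 for our model,
# and ~3.37 for Zhang's method, as quoted above.
```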
