# Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks


## Abstract


## 1. Introduction

- We provide mathematical and experimental evidence for SGD and Adam showing that: (1) in contrast to SGD, the changes that Adam induces in the distribution of the weights can be easily detected when embedding watermarks following the approach in [5,6]; hence, (2) the use of Adam considerably increases the detectability of the watermark. For this analysis, we use FFDNet [14], a DNN that performs image denoising, as the host network.
- We introduce a novel method based on orthogonal projections that solves the detectability problem arising when watermarking a DNN optimized with Adam. A side effect of this method is an increased robustness against weight pruning.
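To make the embedding concrete, the following is a minimal NumPy sketch in the spirit of the regularizer-based scheme of [5,6]: the message bits are embedded by minimizing a binary cross-entropy penalty on the projections of the flattened layer weights, and extracted by thresholding those projections at zero. The dimensions, the learning rate, and the use of gradient descent on the penalty alone (rather than added to the denoising loss with strength $\lambda$) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

T, n = 64, 1024                  # watermark length and flattened layer size (illustrative)
w = rng.normal(0.0, 0.01, n)     # stand-in for the flattened host-layer weights
b = rng.integers(0, 2, T)        # binary watermark message
X = rng.normal(0.0, 1.0, (T, n)) # Gaussian projection matrix (the secret embedding key)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wm_regularizer(w):
    """Binary cross-entropy between sigmoid(X w) and the message bits."""
    y = sigmoid(X @ w)
    return float(-np.mean(b * np.log(y + 1e-12) + (1 - b) * np.log(1 - y + 1e-12)))

def wm_grad(w):
    """Gradient of the regularizer with respect to w."""
    return X.T @ (sigmoid(X @ w) - b) / T

# Embed: gradient descent on the regularizer alone; in the full scheme this
# term is added, with strength lambda, to the host network's training loss.
lam, lr = 1.0, 0.01
for _ in range(500):
    w = w - lr * lam * wm_grad(w)

# Extract: threshold the projections at zero and measure the bit error rate.
extracted = (X @ w > 0).astype(int)
ber = float(np.mean(extracted != b))
```

With these (toy) sizes the penalty drives every projection to the correct sign, so extraction recovers the message with zero bit error rate; the detectability question studied in the paper concerns what this procedure does to the histogram of `w`.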

#### Notation

## 2. Preliminaries

#### 2.1. Host Network: FFDNet

#### 2.2. Optimization Algorithms

#### 2.2.1. SGD Optimization

#### 2.2.2. Adam Optimization

#### 2.3. Digital Watermarking Algorithm

#### 2.3.1. Embedding Elements

#### 2.3.2. Embedding Process

#### 2.3.3. Detectability Issues

#### 2.3.4. Gaussian and Orthogonal Projection Vectors

## 3. Theoretical Analysis

#### 3.1. Analysis for SGD

#### 3.2. Analysis for Adam

#### 3.2.1. Mean of the Gradient

#### 3.2.2. Variance of the Gradient

#### 3.2.3. Update Term

#### 3.2.4. Rationale for the Sign Function

#### 3.2.5. A Theoretical Expression for $\Delta \mathbf{w}$

#### 3.3. The Denoising Term

#### 3.3.1. SGD

#### 3.3.2. Adam

## 4. Block-Orthonormal Projections (BOP)

## 5. Information-Theoretic Measures

## 6. Experiments and Results

#### 6.1. Experimental Set-Up

#### 6.1.1. Training the Host Network

#### 6.1.2. Watermark Embedding

#### 6.2. Experimental Results

#### 6.2.1. Empirical Denoising Gradients

#### 6.2.2. SGD

#### 6.2.3. Adam

#### 6.2.4. BOP

## 7. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Mathematical Derivations

#### Appendix A.1. Projected Weights at k = 0

#### Appendix A.2. Adam: Mean of the Gradient

#### Appendix A.3. Adam: Variance of the Gradient

#### Appendix A.4. Adam: A Projection-Based Decomposition of ${\mathbf{c}}_{j}^{T}\mathrm{sgn}(\widehat{\mathit{\phi}})$

#### Appendix A.4.1. Decomposition for Gaussian Projectors

#### Appendix A.4.2. Decomposition for Orthogonal Projectors

#### Appendix A.5. Adam: Analysis with Denoising and Watermarking

## Appendix B. Verification of Assumptions

#### Appendix B.1. Affine Growth Hypothesis for the Weights

**Figure A2.** Evolution with k of four randomly selected weights. SGD optimization with orthogonal projectors, $\lambda =20$, $T=256$.

**Figure A3.** Evolution with k of four randomly selected weights. Adam optimization with Gaussian projectors, $\lambda =1$, $T=256$.

**Figure A4.** ECDF of the correlation coefficient $\rho$ between the observed values of the weights over k and their predicted affine evolution. (**a**) SGD with orthogonal projectors, $\lambda =20$; (**b**) Adam with Gaussian projectors, $\lambda =1$, and orthogonal projectors, $\lambda =10$.

#### Appendix B.2. Negligibility of Weights at k = 0

## References

- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV ’15), Santiago, Chile, 7–13 December 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 1026–1034.
- Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access 2019, 7, 19143–19165.
- Le Merrer, E.; Pérez, P.; Trédan, G. Adversarial Frontier Stitching for Remote Neural Network Watermarking. arXiv 2017, arXiv:1711.01894.
- Adi, Y.; Baum, C.; Cissé, M.; Pinkas, B.; Keshet, J. Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring. arXiv 2018, arXiv:1802.04633.
- Uchida, Y.; Nagai, Y.; Sakazawa, S.; Satoh, S. Embedding Watermarks into Deep Neural Networks. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval (ICMR ’17), Bucharest, Romania, 6–9 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 269–277.
- Nagai, Y.; Uchida, Y.; Sakazawa, S.; Satoh, S. Digital Watermarking for Deep Neural Networks. Int. J. Multimed. Inf. Retr. 2018, 7, 3–16.
- Cox, I.J.; Kilian, J.; Leighton, F.T.; Shamoon, T. Secure Spread Spectrum Watermarking for Multimedia. IEEE Trans. Image Process. 1997, 6, 1673–1687.
- Bottou, L. Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks; Saad, D., Ed.; Cambridge University Press: Cambridge, UK, 1998.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR ’15), San Diego, CA, USA, 7–9 May 2015.
- Rouhani, B.D.; Chen, H.; Koushanfar, F. DeepSigns: A Generic Watermarking Framework for IP Protection of Deep Learning Models. arXiv 2018, arXiv:1804.00750.
- Balles, L.; Hennig, P. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. In Proceedings of the 2018 International Conference on Machine Learning (ICML ’18), Stockholm, Sweden, 10–15 July 2018.
- Wilson, A.C.; Roelofs, R.; Stern, M.; Srebro, N.; Recht, B. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS ’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4151–4161.
- Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Zhang, K.; Zuo, W.; Zhang, L. FFDNet: Toward a Fast and Flexible Solution for CNN-Based Image Denoising. IEEE Trans. Image Process. 2018, 27, 4608–4622.
- Fan, L.; Zhang, F.; Fan, H.; Zhang, C. Brief Review of Image Denoising Techniques. Vis. Comput. Ind. Biomed. Art 2019, 2, 7.
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML ’15), Lille, France, 6–11 July 2015; pp. 448–456.
- Sun, S.; Cao, Z.; Zhu, H.; Zhao, J. A Survey of Optimization Methods From a Machine Learning Perspective. IEEE Trans. Cybern. 2020, 50, 3668–3681.
- Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
- Tieleman, T.; Hinton, G. Lecture 6.5—RMSProp, COURSERA: Neural Networks for Machine Learning; University of Toronto: Toronto, ON, Canada, 2012.
- Wang, T.; Kerschbaum, F. Attacks on Digital Watermarks for Deep Neural Networks. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’19), Brighton, UK, 12–17 May 2019; pp. 2622–2626.
- Burden, R.L.; Faires, J.D. Numerical Analysis, 9th ed.; Brooks/Cole: Boston, MA, USA, 2010.
- Geyer, C.J. Practical Markov Chain Monte Carlo. Stat. Sci. 1992, 7, 493–497.
- Cachin, C. An Information-Theoretic Model for Steganography. In Information Hiding; Aucsmith, D., Ed.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1525.
- Comesaña, P. Detection and Information-Theoretic Measures for Quantifying the Distinguishability between Multimedia Operator Chains. In Proceedings of the IEEE Workshop on Information Forensics and Security (WIFS ’12), Tenerife, Spain, 2–5 December 2012.
- Barni, M.; Tondi, B. The Source Identification Game: An Information-Theoretic Perspective. IEEE Trans. Inf. Forensics Secur. 2013, 8, 450–463.
- Tassano, M.; Delon, J.; Veit, T. An Analysis and Implementation of the FFDNet Image Denoising Method. Image Process. Line 2019, 9, 1–25.
- Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; Zhang, L. Waterloo Exploration Database: New Challenges for Image Quality Assessment Models. IEEE Trans. Image Process. 2017, 26, 1004–1016.
- Franzen, R. Kodak Lossless True Color Image Suite. 1999. Available online: http://r0k.us/graphics/kodak (accessed on 22 September 2020).
- Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A Database of Human Segmented Natural Images and Its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In Proceedings of the 8th International Conference on Computer Vision (ICCV 2001), Vancouver, BC, Canada, 7–14 July 2001; pp. 416–423.
- Wilson, S.G. Digital Modulation and Coding; Pearson: London, UK, 1995.

**Figure 2.** Histograms from the embedding layer $l=2$ ($T=256$, $\lambda =1$ and k = 32,140). (**a**) Histogram of ${\mathbf{w}}^{(0)}$; (**b**) histogram of ${\mathbf{w}}^{(k)}$; (**c**) histogram of $\Delta \mathbf{w}={\mathbf{w}}^{(k)}-{\mathbf{w}}^{(0)}$.

**Figure 3.** Empirical histograms from the denoising gradients. (**a**) Distribution of the mean denoising gradient, D; (**b**) distribution of the variance of the batching noise, H.

**Figure 4.** Empirical histograms after the watermark embedding using SGD. (**a**) Histogram of ${\mathbf{w}}^{(k)}$, Gaussian projectors, $\lambda =5$, and orthogonal projectors, $\lambda =20$; (**b**) histogram of $\Delta \mathbf{w}$, Gaussian, $\lambda =5$; (**c**) histogram of $\Delta \mathbf{w}$, orthogonal, $\lambda =20$.

**Figure 6.** Empirical histograms of ${\mathbf{w}}^{(k)}$ after the watermark embedding using Adam. (**a**) Gaussian, $\lambda =0.05$ and $\lambda =1$; (**b**) orthogonal, $\lambda =0.5$ and $\lambda =10$.

**Figure 7.** Empirical histograms of $\Delta \mathbf{w}$ after the watermark embedding using Adam. (**a**) Gaussian, $\lambda =0.05$; (**b**) Gaussian, $\lambda =1$; (**c**) orthogonal, $\lambda =0.5$; (**d**) orthogonal, $\lambda =10$.

**Figure 10.** Theoretical histograms of $\Delta \mathbf{w}$ for Adam with denoising and watermarking functions using Equations (A11) and (A12). (**a**) Gaussian, $\lambda =0.05$; (**b**) Gaussian, $\lambda =1$; (**c**) orthogonal, $\lambda =0.5$; (**d**) orthogonal, $\lambda =10$.

**Figure 11.** Empirical histograms of ${\mathbf{w}}^{(k)}$ after the watermark embedding using Adam. (**a**) Gaussian, $\lambda =0.05$ and $\lambda =1$; (**b**) orthogonal, $\lambda =0.5$ and $\lambda =10$.

**Figure 12.** Empirical histograms of $\Delta \mathbf{w}$ after the watermark embedding using Adam. (**a**) Gaussian, $\lambda =0.05$; (**b**) Gaussian, $\lambda =1$; (**c**) orthogonal, $\lambda =0.5$; (**d**) orthogonal, $\lambda =10$.

**Figure 13.** (**a**) BER vs. pruning rate for Adam and BOP (pruning all layers or only the watermarked one has no impact on the BER); (**b**) PSNR vs. pruning rate for Adam and BOP on the Kodak24 dataset; (**c**) PSNR vs. pruning rate for Adam and BOP on the CBSD68 dataset.

**Table 1.** FFDNet network parameters for the grayscale and RGB models.

| | Grayscale | RGB |
|---|---|---|
| Conv layers | 15 | 12 |
| Feature maps per layer | 64 | 96 |
| Receptive field | $62\times 62$ | $50\times 50$ |
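The receptive fields in the table can be checked against the architecture: FFDNet [14] first reshapes the input into ×2-downsampled subimages and then applies a stack of stride-1 3×3 convolutions, so each 3×3 layer widens the receptive field by 2 pixels on the subimage, and the downsampling doubles it on the original image. A small sketch (the function name is ours):

```python
def ffdnet_receptive_field(num_conv_layers: int, kernel: int = 3,
                           downsample: int = 2) -> int:
    """Receptive field on the original image of a stack of stride-1
    kernel x kernel convolutions applied to subimages obtained by
    space-to-depth downsampling with the given factor."""
    rf_subimage = 1 + num_conv_layers * (kernel - 1)  # each 3x3 conv adds 2 px
    return downsample * rf_subimage

print(ffdnet_receptive_field(15))  # grayscale model: 62
print(ffdnet_receptive_field(12))  # RGB model: 50
```

Both values agree with the $62\times 62$ and $50\times 50$ entries above.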

**Table 2.** PSNR (dB) with noise level $\sigma =25$, number of iterations k needed to converge, and KLD and SIKLD between the distributions of ${\mathbf{w}}^{(k)}$ and ${\mathbf{w}}^{(0)}$.

| | SGD Gaussian | SGD Orth. | Adam Gaussian | Adam Gaussian | Adam Orth. | Adam Orth. | BOP Gaussian | BOP Gaussian | BOP Orth. | BOP Orth. |
|---|---|---|---|---|---|---|---|---|---|---|
| $\lambda$ | 5 | 20 | 0.05 | 1 | 0.5 | 10 | 0.05 | 1 | 0.5 | 10 |
| CBSD68 | 30.76 | 31.09 | 31.18 | 31.17 | 31.21 | 31.16 | 31.20 | 31.16 | 31.19 | 31.15 |
| Kodak24 | 31.66 | 32.03 | 32.13 | 32.10 | 32.15 | 32.10 | 32.15 | 32.08 | 32.14 | 32.09 |
| k | 42,780 | 98,510 | 43,590 | 32,140 | 27,110 | 57,230 | 40,880 | 14,840 | 33,180 | 7150 |
| KLD | 0.0477 | 0.0238 | 0.1149 | 0.2779 | 0.0281 | 0.8118 | 0.0463 | 0.0443 | 0.0227 | 0.0253 |
| SIKLD | 0.0468 | 0.0206 | 0.0879 | 0.2112 | 0.0280 | 0.4707 | 0.0449 | 0.0266 | 0.0197 | 0.0226 |
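The KLD and SIKLD rows can be reproduced from histograms of the weights before and after embedding. The sketch below estimates the KLD from shared-range histograms and, as an assumption on our part, takes SIKLD to be the symmetrized KLD, $(D(P\|Q)+D(Q\|P))/2$; the paper's exact estimator may differ. The Gaussian samples are stand-ins for ${\mathbf{w}}^{(0)}$ and ${\mathbf{w}}^{(k)}$.

```python
import numpy as np

def kld_from_samples(p_samples, q_samples, bins=100):
    """Estimate D(P || Q) from two sample sets via histograms on a shared range."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi), density=True)
    mask = (p > 0) & (q > 0)       # restrict to bins where both histograms are supported
    width = (hi - lo) / bins
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * width)

def sym_kld(p_samples, q_samples, bins=100):
    """Symmetrized KLD: (D(P||Q) + D(Q||P)) / 2."""
    return 0.5 * (kld_from_samples(p_samples, q_samples, bins)
                  + kld_from_samples(q_samples, p_samples, bins))

rng = np.random.default_rng(1)
w0 = rng.normal(0.0, 0.05, 100_000)  # stand-in for the weights w^(0)
wk = rng.normal(0.0, 0.06, 100_000)  # stand-in for the weights w^(k)
```

For identical samples the estimate is zero; for the two stand-in distributions above it is small but clearly positive, mirroring how the table distinguishes the barely detectable (SGD, BOP) from the strongly detectable (Adam with orthogonal projectors, $\lambda =10$) cases.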

**Table 3.** Position of both side spikes in the histograms of $\Delta \mathbf{w}$, obtained from theoretical and empirical results.

| Projectors | $\lambda$ | Theoretical | Empirical |
|---|---|---|---|
| Gaussian | 0.05 | -0.04243 | -0.04278 |
| | | 0.04239 | 0.04276 |
| | 1 | -0.03129 | -0.03119 |
| | | 0.03126 | 0.03125 |
| Orthogonal | 0.5 | -0.02710 | -0.02710 |
| | | 0.02710 | 0.02709 |
| | 10 | -0.05720 | -0.05717 |
| | | 0.05720 | 0.05718 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Cortiñas-Lorenzo, B.; Pérez-González, F.
Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks. *Entropy* **2020**, *22*, 1379.
https://doi.org/10.3390/e22121379
