Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks

Cortiñas-Lorenzo, Betty; Pérez-González, Fernando

doi:10.3390/e22121379

Open AccessArticle

Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks

by

Betty Cortiñas-Lorenzo

and

Fernando Pérez-González

^*

Atlanttic Research Center, University of Vigo, 36310 Vigo, Spain

^*

Author to whom correspondence should be addressed.

Entropy 2020, 22(12), 1379; https://doi.org/10.3390/e22121379

Submission received: 8 October 2020 / Revised: 30 November 2020 / Accepted: 4 December 2020 / Published: 6 December 2020

(This article belongs to the Special Issue Information-Theoretic Methods for Deep Learning Based Data Acquisition, Analysis and Security)

Download

Browse Figures

Versions Notes

Abstract

:

As training Deep Neural Networks (DNNs) becomes more expensive, the interest in protecting the ownership of the models with watermarking techniques increases. Uchida et al. proposed a digital watermarking algorithm that embeds the secret message into the model coefficients. However, despite its appeal, in this paper, we show that its efficacy can be compromised by the optimization algorithm being used. In particular, we found through a theoretical analysis that, as opposed to Stochastic Gradient Descent (SGD), the update direction given by Adam optimization strongly depends on the sign of a combination of columns of the projection matrix used for watermarking. Consequently, as observed in the empirical results, this makes the coefficients move in unison giving rise to heavily spiked weight distributions that can be easily detected by adversaries. As a way to solve this problem, we propose a new method called Block-Orthonormal Projections (BOP) that allows one to combine watermarking with Adam optimization with a minor impact on the detectability of the watermark and an increased robustness.

Keywords:

watermarking; deep neural networks; optimization algorithms; Adam; stochastic gradient descent; detectability

1. Introduction

Deep learning has substantially impacted technology over the last years, becoming an important center of attention for researchers all over the world. Such a great impact stems from the versatility it offers as well as the excellent results that DNNs achieve on multiple tasks, like image classification or speech recognition, which often reach and even surpass human-level performance [1,2].

However, far from being a simple task, the design of new DNNs is generally expensive not only in terms of human effort and time needed to build effective model architectures but mostly because of the large volume of suitable data that must be gathered and the vast amount of computational resources and power used for training. Consequently, businesses owning costly models are interested in protecting them from any illicit use, and this growing need has recently led researchers to a common concern on how to embed watermarks on DNNs. As a result, several frameworks for protecting the intellectual property of neural networks were proposed in the literature, and can be classified as black-box or white-box approaches. The first kind of methods (i.e., black-box) do not need to access model parameters for detecting the presence of watermarks. Instead, key inputs are introduced as special triggers to identify the original network, either by the use of the so-called adversarial examples [3] or backdoor poisoning [4].

White-box approaches, in contrast, directly affect the model parameters in order to embed watermarks. One of the most significant white-box contributions was proposed by Uchida et al. in [5,6], the algorithm under study in this paper. This watermarking framework employs a regularization term that defines a cost function for embedding the secret message. This regularizer, which transforms weights into bits by means of a projection matrix in a similar way to spread–spectrum techniques used in watermarking [7], is added to the main loss function and can be applied either at the beginning of the training phase (i.e., from scratch) or during fine-tuning steps.

On the other hand, when training a DNN, there are several optimization algorithms to choose from. With their pros and cons, some of the most popular are the classic Stochastic Gradient Descent (SGD) [8] and Adaptive Moment Estimation (Adam) [9]. We found that, in order to perform watermarking following the approach in [5,6], it is necessary to pay close attention to the optimization algorithm being used. In this paper, we prove that Adam, although being an appealing algorithm for lots of applications—especially because of its efficiency and training speed—poses a big problem when implementing this watermarking method.

As any other watermarking algorithm, it is important to meet some minimal requirements regarding fidelity, robustness, payload, efficiency, and undetectability [5,6,10]. This latter property involves the need for concealing any clue that would let unauthorized parties know if a watermark was embedded, along with its further consequences. In other words, from a steganographic point of view, watermarks should be embedded into the DNN without leaving any detectable footprint.

We show that, unlike SGD, Adam significantly alters the distribution of the coefficients at the embedding layer in a very specific way and, thus, it compromises the undetectability of the watermark. In particular, when using Adam, the histogram of the weights at the embedding layer shows that these coefficients tend to group together at both sides of zero, originating two visible spikes that grow in magnitude with the number of iterations, so that the initial distribution ends up changing completely. As we will see later, the update direction given by Adam depends on the projection matrix used for watermarking; specifically, it is based on the sign function, responsible for the symmetric two-spiked shape that emerges. This behavior is somewhat surprising, as with a pseudorandom projection matrix, one might expect that the weights evolve with random speeds; in contrast, the weights tend to move in unison in a way that is reminiscent of ants after they have established a solid path from the nest to a food source. Then, the statistical dependence of the weights with the projection matrix is weak, and this is mainly due to the information collapse induced by the sign function; this loss is invested in making the weights more conspicuous and, therefore, the watermark more easily detectable. As we will confirm, the sign phenomenon does not occur in other algorithms like SGD. The appearance of the sign was already pointed out and studied—without considering watermarking applications—by the authors in [11], who explained the adverse effects of Adam on generalization [12] as a consequence of this aspect. However, instead of delving into the general performance of the optimization algorithm itself, we show that the sign function which is involved in Adam’s update direction is detrimental for watermarking purposes, so we highlight the need for being careful with the selection of the optimization algorithm in these cases.

In this paper, we carry out a theoretical and experimental analysis and compare the results using Adam and SGD optimization. The analysis can be extended to other optimization algorithms. As a way to measure the similarity between the original weight distribution and the resulting one after the watermark embedding, we will use the Kullback–Leibler divergence (KLD) [13]. We will show that, as expected, when we use Adam, the KLD between both distributions is considerably larger than when we use SGD, thus confirming that, as opposed to SGD, Adam modifies the original distribution to a great extent. Furthermore, in order to perform this kind of watermarking and, at the same time, enjoy the advantages that Adam optimization provides, we propose a new method called Block-Orthonormal Projections (BOP), which uses a secret transformation matrix in order to reduce the detectability of the watermark generated by Adam. As we will see, BOP allows us to considerably reduce the KLD to small values which are comparable to those obtained with SGD. Therefore, we show that BOP allows us to preserve the original shape of the weight distribution.

In summary, this work makes the following two-fold contribution:

We provide mathematical and experimental evidence for SGD and Adam to show that: (1) in contrast to SGD, the changes in the distribution of weights caused by Adam can be easily detected when embedding watermarks following the approach in [5,6] and, hence, (2) the use of Adam considerably increases the detectability of the watermark. For the purpose of carrying out this analysis, we use FFDNet [14]—a DNN that performs image denoising tasks—as the host network.
We introduce a novel method based on orthogonal projections to solve the detectability problem that arises when watermarking a DNN which is being optimized with Adam. A side effect of this novel method is an increased robustness against weight pruning.

The remainder of this paper is organized as follows: Section 1.1 introduces the notation and Section 2 explains the frameworks and algorithms used in this study—host network, optimization algorithms, and watermarking method—in more detail. Section 3 presents the mathematical core that allows us to model the observed effects on the histograms of weights once the embedding process has finished. Then, in Section 4, we introduce BOP as a solution for using Adam and watermarking simultaneously, Section 5 presents the information-theoretic measures that we will implement, and Section 6 shows the experimental results. Finally, we point out some concluding remarks in Section 7. Two appendices give additional details on mathematical derivations (Appendix A) and the validity of certain assumptions (Appendix B).

Notation

In this paper, we use the following notation. Matrices and vectors are denoted by upper-case and lower-case boldface characters, respectively, while random variables and their realizations are respectively represented by upper-case and lower-case characters.

For matrix and vector operations, we proceed as follows. As an example, let

A

be a matrix. Then, its transpose is denoted by

A^{T}

. Moreover, we use

Tr [A]

to represent the trace of

A

and

{(A)}_{i, j}

to denote the

(i, j)

th element of

A

.

I_{N}

refers to the

N \times N

identity matrix. We use column vectors unless otherwise stated. In addition, we use

0

to denote a column vector of zeros and

1

for a column vector of ones. Let

w

be a column vector of length N; then,

\nabla_{w}

is the gradient operator with respect to

w

that is:

\nabla_{w} ≐ {(\frac{\partial}{\partial w_{1}}, \dots, \frac{\partial}{\partial w_{N}})}^{T}

We use the operator ∘ to denote the Hadamard (i.e., sample-wise) product and ⊗ for the Kronecker product. Finally,

E {\cdot}

and

Var {\cdot}

denote the mathematical expectation and the variance, respectively.

2. Preliminaries

2.1. Host Network: FFDNet

The rapid development of Deep Learning over the last few years has led to new advances in the field of image restoration [15]. Several Convolutional Neural Networks (CNNs) have been designed to replace classical methods and, often, they offer new competitive advantages. This is the case of FFDNet [14], which performs image denoising tasks and is used in our work as the exemplary host network that is watermarked by means of the algorithm proposed in [5,6].

Image denoising is the task of removing noise from a given image. Let

y

be the input noisy image,

x

the clean image, and

n

the noise, which is usually modeled as zero-mean Additive White Gaussian Noise (AWGN), then we have

y = x + n

and we wish to obtain an estimate

\hat{x}

of the clean image. As opposed to other CNNs for image denoising, FFDNet works on downsampled sub-images, and it is able to adapt to several noise levels using only a single network. For that purpose, a noise level map

M

is included also as an input, so that the function FFDNet aims to learn can be expressed as:

\hat{x} = F (y, M; w)

, where

w

represents the parameters of the network. Furthermore, FFDNet can handle spatially variant noise and offers competitive inference speed without sacrificing denoising performance [14].

Figure 1 shows the architecture of FFDNet. As we can see, it is composed of a downscaling operation on the input images, a nonlinear mapping consisting of Convolutional Layers, Batch Normalization steps [16], and ReLU activation functions; and, finally, an upscaling process to generate denoised images with the original size. Let

\{(y_{i}, x_{i})\}

be L noisy-clean image pairs from the training dataset, the denoising cost function is:

f_{0} (w) = \frac{1}{2 L} \sum_{i = 1}^{L} | | F (y_{i}, M_{i}; w) - x_{i} {| |}^{2}

(1)

FFDNet can be used for grayscale or color images. Table 1 shows the main differences between both configurations. As we can see, the total number of convolutional layers can be set to 15 or 12 for grayscale or RGB denoising, respectively. The number of feature maps and the size of the receptive field also differ, but this is not the case of the kernel size, which is kept to

3 \times 3

for either grayscale or RGB. In this paper, we implement the RGB version of FFDNet and proceed as follows: (1) train FFDNet from scratch using Adam optimization without embedding any watermark and (2) fine-tune the network to embed the desired watermark, using both Adam and SGD to compare the results.

2.2. Optimization Algorithms

The training of a DNN is an iterative process that makes it possible for the model to learn how to perform a given task. This is certainly the most challenging optimization problem when implementing deep learning models from scratch. In order to increase the efficiency of the training process, researchers have developed several optimization techniques in the last years, each of them with its own advantages and drawbacks [17].

An optimization algorithm determines the weight update rule that must be applied at every iteration. The goal is usually to minimize a cost function—also known as loss function—which generally compares predictions with expected values and computes an error metric that evaluates the performance of the network. The learning rate

μ

—or step size—is the hyperparameter that controls how much the weights can change after each iteration of the optimization algorithm.

In the following sections, we briefly review the mechanics of two widely known optimization algorithms: SGD and Adam.

2.2.1. SGD Optimization

SGD [8] is a classic optimization algorithm based on gradient descent and one of the most used. However, unlike standard gradient descent techniques that use all the training samples to compute the gradient of the cost function, SGD uses a small number of samples from the dataset—a minibatch—and then takes the average over these samples to get an estimate of the gradient,

\nabla_{w} f (w)

. Then, the update rule is given by:

w^{(k)} = w^{(k - 1)} - μ \nabla_{w} f (w^{(k - 1)}), k \geq 1

(2)

where

w^{(k)}

is the weight vector at iteration k and

w^{(0)}

is its initial value.

2.2.2. Adam Optimization

Adam [9] is a popular optimization algorithm that combines the ideas of AdaGrad [18] and RMSProp [19]. Some of the advantages of Adam include, among others, the fact of being easy to implement and computationally efficient, as well as being fast and suitable for complex settings with noisy and sparse gradients.

Just like SGD, Adam estimates the gradient from the samples of a minibatch, but, as opposed to SGD, it uses estimates of the first and second moments of the gradients to compute individual adaptive learning rates for each parameter in the network. In order to do that, exponential moving averages of the gradient and the squared gradient are calculated using two hyperparameters to control the decay rate,

β_{1}

and

β_{2}

, respectively. Let

m^{(0)} = 0

and

v^{(0)} = 0

be the initial values for the first and second moment vectors, respectively, then the steps of this algorithm at the kth iteration are the following [9]:

\begin{matrix} {\hat{g}}^{(k)} & = & \nabla_{w} f (w^{(k - 1)}) \end{matrix}

(3)

\begin{matrix} m^{(k)} & = & β_{1} m^{(k - 1)} + (1 - β_{1}) {\hat{g}}^{(k)} \end{matrix}

(4)

\begin{matrix} v^{(k)} & = & β_{2} v^{(k - 1)} + (1 - β_{2}) ({\hat{g}}^{(k)} \circ {\hat{g}}^{(k)}) \end{matrix}

(5)

\begin{matrix} {\hat{m}}^{(k)} & = & \frac{m^{(k)}}{(1 - β_{1}^{k})} \end{matrix}

(6)

\begin{matrix} {\hat{v}}^{(k)} & = & \frac{v^{(k)}}{(1 - β_{2}^{k})} \end{matrix}

(7)

\begin{matrix} w_{j}^{(k)} & = & w_{j}^{(k - 1)} - μ \frac{{\hat{m}}_{j}^{(k)}}{\sqrt{{\hat{v}}_{j}^{(k)}} + ϵ}, j = 1, \dots, N . \end{matrix}

(8)

where

w_{j}^{(k)}

,

{\hat{m}}_{j}^{(k)}

,

{\hat{v}}_{j}^{(k)}

are the jth element of

w^{(k)}

,

{\hat{m}}^{(k)}

,

{\hat{v}}^{(k)}

, respectively,

f (w)

is the cost function and

ϵ

is a very small number that avoids diving by zero. The default set-up for the hyperparameters is [9]:

β_{1} = 0.9

,

β_{2} = 0.999

and

ϵ = 1 \cdot 10^{- 8}

.

2.3. Digital Watermarking Algorithm

In this paper, we analyze the digital watermarking algorithm proposed in [5,6]. As we mentioned previously, we employ the fine-tune-to-embed approach; therefore, the embedding function is applied only during some additional epochs after convergence is achieved for the original task—in this case, image denoising—.

2.3.1. Embedding Elements

We wish to embed a T-bit sequence,

b \in {0, 1}^{T}

, into a certain layer l of the host DNN. For this specific layer, let

(S, S)

, I and F represent the size of the convolution filter, the depth of input to the convolutional layer and the number of filters in that layer, respectively. Then, the weights can be represented by a tensor

W_{l}

with dimensions

S \times S \times I \times F

and then rearranged to form a vector

w_{l}

of length

N = S^{2} I F

.

One important point here is that, instead of directly using the weight vector

w_{l}

, the authors in [5,6] suggest including an initial transformation of these coefficients. In order to reflect this, we must calculate the mean of

W_{l}

over the F kernels. As a result, we obtain a new flattened vector

{\hat{w}}_{l}

with M elements, where

M = S^{2} I

. This transformation can also be formulated if we introduce a new matrix

Θ

of size

N \times M

:

Θ ≐ I_{M} \otimes h

so that

{\hat{w}}_{l} = Θ^{T} w_{l}

. Here,

h

is a column vector of length F with all of its elements set to

1 / F

, i.e.,

h^{T} h = 1 / F

.

In order to move to the T-dimensional space, the authors in [5,6] introduce a secret projection matrix

Φ

. The size of this projection matrix—also referred to as regularizer parameter in [5,6]—is

M \times T

so that each column corresponds to a particular projection vector

ϕ^{(i)}

,

i = 1, \dots, T

.

For the purpose of the subsequent theoretical analysis, we pair up matrices

Φ

and

Θ

and, therefore, we ascribe the initial transformation to the projection matrix, preserving the notation with the original weight vector

w_{l}

. To that end, we define a new

N \times T

projection matrix:

\hat{Φ} = Θ Φ

(9)

so that now the projection vectors can be expressed as

{\hat{ϕ}}^{(i)} = Θ ϕ^{(i)}

.

2.3.2. Embedding Process

Now that we have introduced the basic elements, the embedding procedure can be described as follows [5,6]. Let

w

be a vector containing all the parameters of the network, then the watermarking regularizer,

f_{watermark} (w_{l})

, is added to the global cost function

f (w)

:

f (w) = f_{0} (w) + λ f_{watermark} (w_{l})

where

f_{0} (w)

is the original cost function and

λ

is the regularization parameter. The regularizer term is a composition of two functions: cross-entropy and the sigmoid,

f_{watermark} (w_{l}) ≐ - \sum_{i = 1}^{T} (b_{i} log (y_{i}) + (1 - b_{i}) log (1 - y_{i}))

(10)

with

y_{i} = 1 / (1 + exp (- {({\hat{ϕ}}^{(i)})}^{T} w_{l}))

. In order to minimize (10),

y_{i}

will approximate to the value of

b_{i}

. Therefore, the sigmoid function will force each projection

{({\hat{ϕ}}^{(i)})}^{T} w_{l}

,

i = 1, \dots, T

, to progressively move towards

+ \infty

and

- \infty

depending on whether

b_{i} = 1

or

b_{i} = 0

, respectively. For successfully embedding the secret message, it is generally enough to guarantee that each projection lies on the proper side of the horizontal axis. When this happens, we reach a Bit Error Rate (BER) of 0 and all the projected weights are aligned with their corresponding bits, that is, they are positive when the bit is 1 and, conversely, negative when the bit is 0.

2.3.3. Detectability Issues

In this paper, we employ the Random technique proposed by the authors, in which the values of the projection matrix

Φ

—before applying the transformation in (9)—are independent samples from a standard normal distribution

N (0, 1)

. Their results [5,6] show that this Random approach is the most appealing design method for

Φ

because, as they indicate, it does not significantly alter the distribution of weights at the embedding layer. However, there are some detectability issues here that should be considered.

On the one hand, the authors in [20] show that the standard deviation of the distribution of weights grows with the length of the embedded message. This information can be used by adversaries for detecting the watermark and even overwriting it.

On the other hand, one of the main conclusions of our work is that the presence or absence of alterations to the shape of the weight distributions is a consequence of the optimization algorithm used during the watermark embedding. In particular, the authors in [5,6] employ SGD with momentum in their experiments and the distributions of weights remain unchanged, yet the use of Adam would significantly alter the shape of the distributions even though we apply the same Random technique. In this paper, we will show that the results in [5,6] regarding the undetectability of the watermark do not hold when we use Adam optimization.

As an example to visualize this peculiar behavior shown by Adam when we employ the watermarking algorithm proposed in [5,6], we plot in Figure 2 the resulting histograms when

T = 256

and

λ = 1

; specifically, the histograms of the weights before and after the embedding (corresponding to k = 32,140) and the histogram of the weight variations, respectively. As we can see, the distribution of the original weights has significantly changed, turning into a two-spiked shape that could be easily detected by an adversary. The complete set of histograms will be later shown in Section 6.

2.3.4. Gaussian and Orthogonal Projection Vectors

In addition to the Random technique suggested by the authors of this watermarking algorithm—whose projection vectors will be referred to as Gaussian projection vectors in this paper—we will also implement orthonormal projectors. In order to build these kinds of projectors, we first generate the projection matrix following the Random technique; that is, samples are drawn from a standard normal distribution

N (0, 1)

. Then, from the Singular Value Decomposition of this projection matrix, we obtain an orthonormal basis for the column space so that we have

Φ^{T} Φ = I_{T}

. Notice that once we apply the initial transformation (i.e.,

\hat{Φ} = Θ Φ

) the resulting projection vectors

{\hat{ϕ}}^{(i)}

,

i = 1, \dots, T

, will still preserve the orthogonality between them, although they will not be normalized:

{(\hat{Φ})}^{T} \hat{Φ} = {(Θ Φ)}^{T} Θ Φ = Φ^{T} Θ^{T} Θ Φ = Φ^{T} (I_{M} \otimes h^{T}) (I_{M} \otimes h) Φ = \frac{1}{F} Φ^{T} Φ = \frac{1}{F} I_{T}

Therefore, these kinds of projectors will be referred to as orthogonal. As we will see later from KLD results and histograms, implementing orthogonal projectors may help us to better preserve both the original shape of the weight distribution and the denoising performance.

3. Theoretical Analysis

From now on, we will omit the sub-index l for the sake of clarity although we are always addressing the coefficients of the embedding layer. The experiments clearly illustrate that the use of Adam optimization together with the watermarking algorithm proposed in [5,6] originates noticeable changes in the distribution of weights, as we see in Figure 2. In the following analysis, we delve into the reasons why this happens. To that end, we aim to get a theoretical expression of

Δ w = w^{(k)} - w^{(0)}

. This will allow us to prove and understand the nature of the observed behavior of the weights when watermark embedding is carried out. We start off by defining vector

\hat{φ}

and matrix

\hat{Ψ}

:

\begin{matrix} \hat{φ} & ≐ & \sum_{i = 1}^{T} {\hat{ϕ}}^{(i)} \end{matrix}

(11)

\begin{matrix} \hat{Ψ} & ≐ & \sum_{i = 1}^{T} {\hat{ϕ}}^{(i)} {({\hat{ϕ}}^{(i)})}^{T} \end{matrix}

(12)

Notice that

\hat{Ψ} = {\hat{Ψ}}^{T}

. In addition, for the case of orthogonal projectors, the following properties can be straightforwardly proven:

\begin{matrix} \hat{Ψ} \hat{φ} & = & \frac{1}{F} \hat{φ} \end{matrix}

(13)

\begin{matrix} \hat{Ψ} \hat{Ψ} & = & \frac{1}{F} \hat{Ψ} \end{matrix}

(14)

These properties will come in handy later on in several theoretical derivations.

Firstly, in order to simplify the analysis and understand more clearly how the watermarking cost function impacts on the movement of the weights when using both SGD and Adam optimization, we will just consider the presence of the regularization term, that is, we will not include the denoising cost function for now. The influence of the denoising part will be studied in Section 3.3. Therefore, given our embedding cost function in (10) and assuming for simplicity (and without loss of generality) that all embedded symbols are

+ 1

, we have:

\begin{matrix} \tilde{f} (w) ≐ λ f_{watermark} (w | b = 1) = λ \sum_{i = 1}^{T} log (1 + exp (- {({\hat{ϕ}}^{(i)})}^{T} w)) \end{matrix}

If we compute the gradient of this function, we obtain:

\nabla_{w} \tilde{f} (w) = - λ \sum_{i = 1}^{T} {\hat{ϕ}}^{(i)} \frac{1}{1 + exp ({({\hat{ϕ}}^{(i)})}^{T} w)}

(15)

In order to simplify the subsequent analysis, we introduce a series of assumptions which are based on empirical observations or hypotheses that will be duly verified.

By construction, it is possible to show that the mean of

{({\hat{ϕ}}^{(i)})}^{T} w^{(0)}

is zero and its variance is

(N / F^{2}) Var {w^{(0)}}

for Gaussian projectors and

(1 / F) Var {w^{(0)}}

for orthogonal projectors (see Appendix A.1). Since the variance of the weights at the initial iteration is generally very small—in our experiments, it is

0.0012

—it can be considered that the variance of

{({\hat{ϕ}}^{(i)})}^{T} w^{(0)}

will also be small enough so that we can assume

| {({\hat{ϕ}}^{(i)})}^{T} w | ≪ 1

for all

i = 1, \dots, T

. Although this assumption might not be strictly true for all k—especially once we have crossed the linear region of the sigmoid function—it is reasonably good and it allows us to use a first-order Taylor expansion for

{(1 + exp ({({\hat{ϕ}}^{(i)})}^{T} w))}^{- 1}

around

{({\hat{ϕ}}^{(i)})}^{T} w = 0

:

{(1 + exp ({({\hat{ϕ}}^{(i)})}^{T} w))}^{- 1} \approx \frac{1}{2} - \frac{{({\hat{ϕ}}^{(i)})}^{T} w}{4}

(16)

Plugging (16) into (15) and using the definitions given above, we can write:

\nabla_{w} \tilde{f} (w) \approx - \frac{λ}{2} \hat{φ} + \frac{λ}{4} \hat{Ψ} w

(17)

We introduce now one important hypothesis in this theoretical analysis to handle the previous equation: we assume that

w^{(k)}

grows approximately affinely with k:

w^{(k)} \approx w^{(0)} + k \cdot μ \cdot η

(18)

where

η

is a vector that contains the slopes for each weight, and it is to be determined in the following sections. We hypothesize this affine-like growth for the weights and, later, we will verify that this is consistent with the rest of the theory and the experiments (see Appendix B.1 for more details). Therefore, we can write the weight variations as:

Δ w = w^{(k)} - w^{(0)} \approx k \cdot μ \cdot η

(19)

3.1. Analysis for SGD

We first analyze the behavior of SGD optimization when we implement digital watermarking as proposed in [5,6]. Recall the SGD update rule in (2). If we use the approximation for the gradient in (17) and the affine growth hypothesis for the weights introduced in (18), we have:

\nabla_{w} \tilde{f} (w) \approx - \frac{λ}{2} \hat{φ} + \frac{λ}{4} \hat{Ψ} (w^{(0)} + k μ η)

(20)

To simplify the analysis, we consider from now on that

w^{(0)} \approx 0

. We confirm the validity of this assumption in Appendix B.2. Then, we can write:

η = - \nabla_{w} \tilde{f} (w) \approx \frac{λ}{2} \hat{φ} - \frac{λ k μ}{4} \hat{Ψ} η

(21)

If we consider orthogonal projectors, we can arrive at a more explicit expression for

η

. In particular, if we multiply (21) by

\hat{Ψ}

and use the properties (13) and (14), we obtain:

\hat{Ψ} η = \frac{2 λ}{4 F - λ k μ} \hat{φ}

(22)

Then, substituting (22) into (21), we can get a more concise expression for

η

:

η \approx \frac{λ}{2} (1 - \frac{λ k μ}{4 F - λ k μ}) \hat{φ}

(23)

Thus, if F is large compared to

λ k μ

—this certainly holds for our experimental set-up, cf. Section 6.1—

η

will be approximately proportional to

\hat{φ}

. Then, the coefficients will follow an affine-like growth as we hypothesized in (18) (see Appendix B.1 for the empirical confirmation of this hypothesis). Now, the weight variations can be expressed as:

Δ w \approx k \cdot μ \cdot η = \frac{λ k μ}{2} (1 - \frac{λ k μ}{4 F - λ k μ}) \hat{φ}

(24)

As we can see, when we use SGD,

Δ w

will approximately follow a zero-mean Gaussian distribution, as induced by [9]. Because of this, and unlike Adam (as we will see later), the weights will evolve with random speeds when we embed watermarks using SGD optimization. Therefore, the impact on the original shape of the weight distribution will be small. However, the variance of the weight distribution may change considerably as stated in [20]. Since we have

Var {\hat{φ}} = T / (F N)

for orthogonal projectors, the variance of

Δ w

can be computed as:

Var {Δ w} = {[λ k μ \frac{(2 F - λ k μ)}{(4 F - λ k μ)}]}^{2} \frac{T}{F N}

Thus, considering that

w^{(0)}

and

Δ w

are uncorrelated—we check this statement in Section 6.2.2—we arrive at the following expression for the variance of the weights at the kth iteration:

Var {w^{(k)}} = Var {w^{(0)}} + {[λ k μ \frac{(2 F - λ k μ)}{(4 F - λ k μ)}]}^{2} \frac{T}{F N}

(25)

As we can see from (25), when implementing the digital watermarking algorithm in [5,6] with SGD optimization and orthogonal projectors, the variance of the resulting weight distribution might change considerably. In order to preserve the original weight distribution when using SGD, it is important to take care with the values of T, F and N, especially. In addition, the standard deviation of the weights will (approximately) increase linearly with the number of iterations so it may be also important to limit the value of k. This is in line with the expected behavior: the weights will move away from their original value and they will be further if we perform more iterations.

Because the analysis for Gaussian projectors becomes considerably difficult, in this paper, we just address the study of SGD with orthogonal projectors. A more comprehensive analysis for Gaussian projectors that can be linked to the results obtained in [20] is left for future research. Regardless of this, the whole analysis for both kinds of projection vectors will be developed in the next section for Adam optimization.

3.2. Analysis for Adam

In the next sections, we will delve into the theory behind Adam optimization for DNN watermarking. In particular, we will obtain an expression for the mean and the variance of the gradient and then, as we did with SGD, we will analyze the update term to get an expression of the weight variations.

3.2.1. Mean of the Gradient

We are interested in computing the mean of the gradient that is used in Adam. Considering

\tilde{f} (w)

as the global cost function, then, from (4), we can rewrite the mean at the kth iteration as:

m^{(k)} = (1 - β_{1}) \sum_{i = 1}^{k} β_{1}^{k - i} \nabla_{w} \tilde{f} (w^{(i)})

(26)

We use the gradient in (17) and do some derivations to find an explicit expression for

{\hat{m}}^{(k)} = m^{(k)} / (1 - β_{1}^{k})

under the hypothesis in (18). Finally, we arrive at the following expression for the bias-corrected mean gradient when k is sufficiently large (see Appendix A.2 for all the mathematical details):

\begin{matrix} {\hat{m}}^{(k)} & \approx & - \frac{λ}{2} \hat{φ} + \frac{λ}{4} \cdot \hat{Ψ} (w^{(0)} + k μ η) \\ = & \frac{λ}{2} (m_{0} + μ \cdot k \cdot m_{1}), k ≫ \frac{β_{1}}{1 - β_{1}} \end{matrix}

(27)

where

m_{0} ≐ - \hat{φ} + \frac{1}{2} \hat{Ψ} w^{(0)}

and

m_{1} ≐ \frac{1}{2} \hat{Ψ} η

. As we see from (27), the mean of the gradient also grows affinely with k.

3.2.2. Variance of the Gradient

Let

g (w) ≐ \nabla_{w} \tilde{f} (w) \circ \nabla_{w} \tilde{f} (w)

. The approximation in (17) for the jth element of this vector

g (w)

, denoted by

g_{j} (w)

, is the following:

\begin{matrix} g_{j} (w) & \approx & \frac{λ^{2}}{4} \sum_{m = 1}^{T} \sum_{l = 1}^{T} {\hat{ϕ}}_{j}^{(m)} {\hat{ϕ}}_{j}^{(l)} (1 - {({\hat{ϕ}}^{(m)})}^{T} w + \frac{1}{4} {({\hat{ϕ}}^{(m)})}^{T} w w^{T} {\hat{ϕ}}^{(l)}) \\ = & \frac{λ^{2}}{4} (a_{j} - b_{j}^{T} w + \frac{1}{4} c_{j}^{T} w w^{T} c_{j}) \end{matrix}

where:

\begin{matrix} a_{j} & ≐ & {\hat{φ}}_{j}^{2} \end{matrix}

(28)

\begin{matrix} c_{j} & ≐ & \sum_{m = 1}^{T} {\hat{ϕ}}_{j}^{(m)} {\hat{ϕ}}^{(m)} \end{matrix}

(29)

\begin{matrix} b_{j} & ≐ & {\hat{φ}}_{j} \cdot c_{j} \end{matrix}

(30)

Following the hypothesis in (18), we can write:

\begin{matrix} g_{j} (w^{(k)}) \approx \frac{λ^{2}}{4} (a_{j} - b_{j}^{T} w^{(0)} + \frac{1}{4} c_{j}^{T} w^{(0)} {(w^{(0)})}^{T} c_{j} - k μ b_{j}^{T} η + \frac{1}{2} k μ c_{j}^{T} w^{(0)} η^{T} c_{j} + \frac{1}{4} k^{2} μ^{2} c_{j}^{T} η η^{T} c_{j}) \end{matrix}

In summary, for this affine-like growth, the square gradient vector can be written as:

g_{j} (w^{(k)}) \approx \frac{λ^{2}}{4} (p_{j} + q_{j} k μ + r_{j} k^{2} μ^{2})

(31)

for some vectors

p, q, r

whose jth component can be defined as:

\begin{matrix} p_{j} & ≐ & a_{j} - b_{j}^{T} w^{(0)} + \frac{1}{4} c_{j}^{T} w^{(0)} {(w^{(0)})}^{T} c_{j} \\ q_{j} & ≐ & - b_{j}^{T} η + \frac{1}{2} c_{j}^{T} w^{(0)} η^{T} c_{j} \end{matrix}

(32)

\begin{matrix} r_{j} & ≐ & \frac{1}{4} c_{j}^{T} η η^{T} c_{j} \end{matrix}

(33)

Now, from (5), we can rewrite the variance of the gradient that is used in Adam as:

v_{j}^{(k)} = (1 - β_{2}) \sum_{i = 1}^{k} β_{2}^{k - i} g_{j} (w^{(i)})

(34)

The bias-corrected term

{\hat{v}}_{j}^{(k)}

is obtained after dividing

v_{j}^{(k)}

by

(1 - β_{2}^{k})

. Applied to the special case of (31), this yields (see Appendix A.3):

\begin{matrix} {\hat{v}}_{j}^{(k)} & = & \frac{λ^{2}}{4} (p_{j} - \frac{β_{2}}{1 - β_{2}} μ q_{j} + \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} μ^{2} r_{j}) \\ + & \frac{λ^{2}}{4 (1 - β_{2}^{k})} k (μ q_{j} - \frac{2 β_{2}}{1 - β_{2}} μ^{2} r_{j}) + \frac{λ^{2}}{4 (1 - β_{2}^{k})} k^{2} μ^{2} r_{j} \end{matrix}

3.2.3. Update Term

Because

μ

is usually very small—we use

μ = 1 \cdot 10^{- 6}

in our experiments—we can assume that

k μ

will be small enough to obtain an approximation of the update used in Adam. Recall that, for the jth weight, this is

u_{j}^{(k)} ≐ {\hat{m}}_{j}^{(k)} / (\sqrt{{\hat{v}}_{j}^{(k)}} + ϵ)

, implying that

w_{j}^{(k)} = w_{j}^{(k - 1)} - μ u_{j}^{(k)}

. Let:

s_{j} ≐ (p_{j} - \frac{β_{2}}{1 - β_{2}} μ q_{j} + \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} μ^{2} r_{j})

(35)

Then, assuming that

μ k ≪ 1

, we can make a zero-order approximation of the update term, i.e.,:

u_{j}^{(k)} = \frac{{\hat{m}}_{j}^{(k)}}{\sqrt{{\hat{v}}_{j}^{(k)}} + ϵ} \approx \frac{m_{0, j}}{\sqrt{s_{j}}}, μ k ≪ 1

(36)

This approximation is accurate enough for the set of experiments we perform. In particular, for the orthogonal case, we could deal with

k_{m a x}

= 625,000 and still get a correlation coefficient of

0.9900

between

(λ^{2} / 4) s_{j}

and

{\hat{v}}_{j}^{(k_{m a x})}

. In our experiments, we actually reach a BER of zero for values of k quite below

k_{m a x}

(cf. Section 6.2).

From (36), we observe that the updated jth coefficient approximately follows the hypothesized growth, i.e.,

w^{(k)} = w^{(0)} + k \cdot μ \cdot η

, where

η_{j} = - m_{0, j} / \sqrt{s_{j}}

. Notice that, as expected, the update does not depend on

λ

, following Adam’s property that the update is invariant to rescaling the gradients [9]. Finding a more explicit expression runs into the problem that

η

depends on

s

, which in turn is a function of

η

through (32) and (33). The following subsections are devoted to solving this problem by conjecturing a form for

η

and refining it.

To simplify the analysis, we consider from now on that

w^{(0)} \approx 0

since most of the values of the weights at the initial iteration are very small (see Figure 2a). We will verify the accuracy of this approximation in Appendix B.2.

3.2.4. Rationale for the Sign Function

Recall the expression (23) that we obtained for

η

when analyzing SGD, where we found

η

to be approximately proportional to

\hat{φ}

. Now, for Adam, we take this as a starting point, so we conjecture first that

η = γ \cdot \hat{φ}

, for some real positive

γ

. Here, we consider orthogonal projection vectors and use the property introduced in (13) and the following:

c_{j}^{T} \hat{φ} = \frac{{\hat{φ}}_{j}}{F}

In this particular case, we have the following identities:

\begin{matrix} m_{0, j} & \approx & - {\hat{φ}}_{j} \\ p_{j} & \approx & a_{j} = {\hat{φ}}_{j}^{2} \\ q_{j} & \approx & - \frac{γ}{F} {\hat{φ}}_{j}^{2} \\ r_{j} & = & \frac{γ^{2}}{4 F^{2}} {\hat{φ}}_{j}^{2} \end{matrix}

Substituting these values into (35), we find that

\sqrt{s_{j}} = sgn ({\hat{φ}}_{j}) \cdot {\hat{φ}}_{j} \cdot \sqrt{1 + (γ / F) μ β_{2} / (1 - β_{2}) + (γ^{2} / (4 F^{2})) μ^{2} (β_{2} + β_{2}^{2}) / {(1 - β_{2})}^{2}}

When we divide

m_{0, j}

by

\sqrt{s_{j}}

, we obtain:

u_{j}^{(k)} \approx \frac{- sgn ({\hat{φ}}_{j})}{\sqrt{1 + (γ / F) μ β_{2} / (1 - β_{2}) + (γ^{2} / (4 F^{2})) μ^{2} (β_{2} + β_{2}^{2}) / {(1 - β_{2})}^{2}}}

(37)

It is then clear that

η

cannot be written in the form

η = γ \cdot \hat{φ}

, as was conjectured at the beginning of this section.

3.2.5. A Theoretical Expression for $Δ w$

Although the conjectured form for

η

in Section 3.2.4 does not hold, the appearance of the sign function in (37) gives a key clue for an alternative approach, since the sign seems to reveal the reason behind the two-spiked histograms like the one shown in Figure 2c.

Therefore, let us write

η_{j}

to explicitly contain the sign of

φ_{j}

and allow

γ_{j}

to take different (non-negative) values with j to reflect the varying magnitude (recall that even in Section 3.2.4 the conjectured value could be written as

η_{j} = γ | φ_{j} | \cdot sgn (φ_{j})

). Let

γ

be the column vector containing

γ_{j}

,

j = 1, \dots, N

, then

η = γ \circ sgn (\hat{φ})

. Since

c_{j}^{T} (γ \circ sgn (\hat{φ})) = γ^{T} (c_{j} \circ sgn (\hat{φ}))

, we can write:

\begin{matrix} m_{0, j} & \approx & - {\hat{φ}}_{j} \\ p_{j} & \approx & a_{j} = {\hat{φ}}_{j}^{2} \\ q_{j} & \approx & - b_{j}^{T} η = - {\hat{φ}}_{j} γ^{T} (c_{j} \circ sgn (\hat{φ})) \\ r_{j} & = & \frac{1}{4} {[γ^{T} (c_{j} \circ sgn (\hat{φ}))]}^{2} \end{matrix}

Now,

u_{j}^{(k)} \approx \frac{- {\hat{φ}}_{j}}{\sqrt{{\hat{φ}}_{j}^{2} + μ {\hat{φ}}_{j} γ^{T} (c_{j} \circ sgn (\hat{φ})) \frac{β_{2}}{(1 - β_{2})} + \frac{μ^{2}}{4} {[γ^{T} (c_{j} \circ sgn (\hat{φ}))]}^{2} \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}}}}

In addition, thus, in order to meet the condition

η_{j} = γ_{j} \cdot sgn ({\hat{φ}}_{j})

, the following nonlinear equation should be solved for all

γ_{j}

,

j = 1, \dots, N

:

γ_{j} = \frac{| {\hat{φ}}_{j} |}{\sqrt{{\hat{φ}}_{j}^{2} + μ {\hat{φ}}_{j} γ^{T} (c_{j} \circ sgn (\hat{φ})) \frac{β_{2}}{(1 - β_{2})} + \frac{μ^{2}}{4} {[γ^{T} (c_{j} \circ sgn (\hat{φ})]}^{2} \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}}}}

(38)

This equation can be solved with a fixed-point iteration method [21]. To that end, we should initialize

γ

and then iterate the following: (1) compute the right-hand side of (38), and (2) use it to update

γ

on the left-hand side. This process will converge to the solution of (38). Even though this method can be implemented to give the specific values for each

γ_{j}

, we are more interested in obtaining a statistical characterization rather than a deterministic one. As we will see, the statistical approach offers a deeper explanation for the two-spiked distribution of

Δ w

which we ultimately seek.

We thus aim at finding the pdf of

Γ

, now considered as a random variable for which

γ_{j}

,

j = 1, \dots, N

are nothing but realizations. Once again, Equation (38) can be solved iteratively (e.g., with Markov-chain Monte Carlo methods [22]) to yield the equilibrium distribution for

Γ

. Instead, we can resort to the results in Section 6 where we conclude that the pdf of

Γ

is strongly concentrated around its mode. With this observation, it is possible to consider that

γ^{T} (c_{j} \circ sgn (\hat{φ}))

approximately corresponds to realizations of

Γ \cdot (c_{j}^{T} sgn (\hat{φ}))

.

In order to simplify the analysis even further, we are interested in decomposing

c_{j}^{T} sgn (\hat{φ})

using its statistical projection onto

{\hat{φ}}_{j}

, i.e.,

c_{j}^{T} sgn (\hat{φ}) = α \cdot {\hat{φ}}_{j} + z_{j}

. Here,

α

is a real multiplier and

z_{j}

is zero-mean noise uncorrelated with

{\hat{φ}}_{j}

. More generally, if we define matrix

C ≐ [c_{0}, \dots, c_{N}]

, then we seek to write

C^{T} sgn (\hat{φ}) = α \cdot \hat{φ} + z

. We do the analysis for the cases of Gaussian projectors and orthogonal projectors separately (refer to Appendix A.4 for the derivations). For the Gaussian case, we get:

\begin{matrix} α & = & \frac{1}{F T} \sqrt{\frac{2}{π T}} (F T^{2} + (N + F) T - F) \end{matrix}

(39)

\begin{matrix} Var (z_{j}) & = & - \frac{2}{π F^{3} T^{2}} (F T^{4} + 4 F T^{3} + (N + F) T^{2} - 4 F T + F) + \frac{T}{F^{3}} (N + F (T + 1)) \end{matrix}

(40)

On the other hand, for the orthogonal projectors, we get instead:

\begin{matrix} α & = & \sqrt{\frac{2 N}{π T F}} \end{matrix}

(41)

\begin{matrix} Var (z_{j}) & = & (1 - \frac{2}{π}) \frac{T}{F N} \end{matrix}

(42)

Recall that, by construction,

\hat{φ}

can be seen as a random vector. In fact, we have

\hat{φ} \sim N (0, I \cdot T / F^{2})

for Gaussian projection vectors, and

\hat{φ}

approximately follows

N (0, I \cdot T / (F N))

for orthogonal projectors. Let

Ξ

, Z be random variables with the distribution of a single element of

\hat{φ}

and

z

, respectively, then

q_{j}

and

r_{j}

can be seen as realizations of (approximately):

Q = - Γ Ξ (α Ξ + Z)

and

R = \frac{Γ^{2}}{4} {(α Ξ + Z)}^{2}

, so a stochastic version of (38) is:

Γ = \frac{| Ξ |}{\sqrt{Ξ^{2} + Γ μ Ξ (α Ξ + Z) \frac{β_{2}}{(1 - β_{2})} + \frac{Γ^{2} μ^{2}}{4} {(α Ξ + Z)}^{2} \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}}}}

Squaring both sides, we find that, for a given realization (

ξ

, z) of (

Ξ

, Z),

Γ

must take the positive value

γ

that satisfies the following fourth degree equation:

\frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} \frac{μ^{2}}{4} {(α ξ + z)}^{2} γ^{4} + \frac{β_{2}}{1 - β_{2}} μ ξ (α ξ + z) γ^{3} + ξ^{2} γ^{2} - ξ^{2} = 0

(43)

From (43), it is easy to generate samples

γ

of

Γ

and, accordingly, samples of

Δ w

, by recalling that:

Δ w \approx k \cdot μ \cdot γ \cdot sgn (ξ)

(44)

We note that, for the particular case when

β_{2}

is very close to 1,

\frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} \approx 2 {(\frac{β_{2}}{1 - β_{2}})}^{2}

. This simplification allows us to approximate (43) as

γ^{2} {(\frac{γ μ}{2} \cdot \frac{β_{2}}{1 - β_{2}} (α ξ + z) + ξ)}^{2} - ξ^{2} \approx 0

(45)

which leads to the following fixed-point equation:

γ \approx \frac{ξ}{\frac{γ μ}{2} \cdot \frac{β_{2}}{1 - β_{2}} (α ξ + z) + ξ}

(46)

When the noise term z is very small compared to

α ξ

(which occurs with a fairly large probability, especially for the case of orthogonal projectors), then the solution to (46), denoted by

γ_{s}

, will be independent of the value of

ξ

. This will cause the probability of

Γ

to be concentrated around

γ_{s}

, and in turn this will make the pdf

Δ w

have two spikes centered at

\pm k μ γ_{s}

. We will see these spikes appearing time and again in the experiments carried out with Adam (Section 6).

3.3. The Denoising Term

Thus far, we have considered only that our cost function is

\tilde{f} (w) = λ f_{watermark} (w)

; however, as we know, there is an additional term, the original denoising function, so our real cost function is:

f (w) = f_{0} (w) + \tilde{f} (w)

.

The gradients corresponding to this function,

f_{0} (w)

, will try to pull the weight vector towards the original optimal

w^{(0)}

in a relatively hard to model way. In order to analyze this behavior, we can approximate the gradient of the denoising function at the kth iteration with respect to the jth coefficient as a sum of a constant term,

d_{j}

, and a noisy one,

{\tilde{n}}_{j}^{(k)}

, which follows a zero-mean Gaussian distribution and is associated with the use of different training batches on each step. We will refer to this noise as batching noise. Thus, for each coefficient j, we can write:

\frac{\partial f_{0} (w^{(k)})}{\partial w_{j}} \approx d_{j} + {\tilde{n}}_{j}^{(k)}

(47)

Like we did in the previous section, we can formulate a stochastic version of (47). To that end, we notice that the constant term of this gradient,

d_{j}

, can take different values with j, as well as the variance of the batching noise,

h_{j}

that is,

{\tilde{n}}_{j}

is drawn from

N (0, h_{j})

. Therefore, in order to reflect the variability of these terms along the j-elements, we introduce two random variables with the distribution of the mean gradient and the variance of the batching noise, D and H, respectively, for which

d_{j}

and

h_{j}

are realizations. The pdf of these distributions will be obtained empirically in Section 6.2. Then, we can see

{\tilde{n}}_{j}^{(k)}

as a realization of

\tilde{N} \sim N (0, H)

.

3.3.1. SGD

Similarly to Section 3.2.5, let

Ξ

be a random variable with the distribution of

\hat{φ}

. Let (

ξ

,

δ

,

\tilde{n}

) be a realization of (

Ξ

, D,

\tilde{N}

), respectively, then for SGD using orthogonal projectors we can compute samples of

Δ w

adding both functions, i.e., denoising and watermarking:

Δ w \approx k μ (δ + \tilde{n}) + \frac{λ k μ}{2} (1 - \frac{λ k μ}{4 F - λ k μ}) ξ

(48)

3.3.2. Adam

The variance of the batching noise computed by Adam will be approximately given by the random variable V, whose realizations can be expressed as

v_{j} = \frac{1 - β_{2}}{1 - β_{2}^{k}} \sum_{i = 1}^{k} β_{2}^{(k - i)} {({\tilde{n}}_{j}^{(i)})}^{2}

. Notice that, for each realization of V, as the sum takes places over i, we must work with a fixed value

h_{j}

for the variance of the batching noise. Then, with this variance, we generate k samples of

\tilde{N}

to be used in the sum that produces

v_{j}

. With this characterization, we can easily analyze how the denoising cost function shapes the distribution of the weight variations. Notice that this analysis could be adapted for any host network. Let

δ

and

ν

be realizations of D and V, respectively, then, we can generate samples of

Δ w

without including the gradients from the watermarking function as:

Δ w \approx k \cdot μ \cdot \frac{δ}{\sqrt{δ^{2} + ν}}

Moreover, in order to get a more accurate description of the problem, we can combine both functions: denoising and watermarking. The analysis becomes somewhat complicated, but, as we will check in Section 6, the distributions resulting from this analysis do capture better the shapes observed in the empirical ones. See Appendix A.5 for the results of this analysis.

4. Block-Orthonormal Projections (BOP)

Here, we discuss BOP, the solution we propose to solve the detectability problem posed by Adam optimization when implementing the watermarking algorithm proposed in [5,6]. In order to hide the noticeable weight variations that appear when we use Adam—as seen in Figure 2—we introduce a prior transformation using a secret

N \times N

matrix

X

(the details for its construction are given below). The procedure we follow has three steps per each iteration of Adam.

Firstly, we project the weights and gradients from the embedding layer using

X

:

\begin{matrix} y & = & X w \\ \nabla_{y} f (y) & = & X \nabla_{w} f (w) \end{matrix}

Then, we run Adam optimization on the projected weights,

y

, using the projected gradients,

\nabla_{y} f (y)

, as well, i.e., steps (3)–(8) are taken using

y

and

\nabla_{y} f (y^{(k - 1)})

instead of

w

and

\nabla_{w} f (w^{(k - 1)})

, respectively. The key of BOP relies on the following: if we execute Adam on

y

instead of

w

, we can break the natural bond created by Adam between

sgn (\hat{φ})

and

w

—as we saw in the previous sections—responsible for the ant-like behavior of the weights and, consequently, the appearance of side spikes in their histograms. These undesired effects disappear when we de-project

y

using

X^{- 1}

to get back to the weight vector

w

:

\begin{matrix} w & = & X^{- 1} y \end{matrix}

In order to reduce the computational complexity and the memory requirements of this method—recall that N is generally a very large number and we must project and de-project the weights on each iteration—we consider

X

to be a block diagonal matrix with B identical

\frac{N}{B} \times \frac{N}{B}

blocks. In this way, we only have to build and work with a single block

X_{B}

, for which we can choose the size by simply adjusting the value of B. The values of this block are drawn from a standard normal distribution. In addition,

X_{B}

is built as an orthonormal matrix so that

X_{B}^{- 1} = X_{B}^{T}

. Let

y^{(i)}

and

w^{(i)}

be the ith block of

y

and

w

, respectively, both of them of length

\frac{N}{B}

; therefore, we just compute:

\begin{matrix} y^{(i)} & = & X_{B} w^{(i)} \\ \nabla_{y^{(i)}} f (y^{(i)}) & = & X_{B} \nabla_{w^{(i)}} f (w^{(i)}) \end{matrix}

After executing Adam, we can get back to

w^{(i)}

:

\begin{matrix} w^{(i)} & = & X_{B}^{T} y^{(i)} \end{matrix}

As we will see in Section 6.2.4, BOP does not significantly alter the original distribution of weights, as opposed to standard Adam. This makes it possible to enjoy the advantages of Adam optimization when we implement the watermarking algorithm in [5,6] with a minimal increase in the detectability of the watermark. In addition, this has an advantage in terms of robustness: if the adversary is not able to infer which layer is watermarked, then he/she will have to exert his/her attack (e.g., noise addition, weight pruning) on every layer thus producing a larger impact on the performance of the network as measured by the original cost function. We will discuss this fact in the experimental section.

5. Information-Theoretic Measures

As already discussed, one of the potential weaknesses of any neural network watermarking algorithm is the detectability of the watermark. An adversary that detects the presence of a watermark on a certain subset of the weights can initiate an attack to remove or alter the watermark. For this reason, it is important that the weights statistically suffer the least modification possible while of course being able to convey the desired hidden message. To measure this statistical closeness, we propose using the KLD [13] between the distributions of weights before and after the watermark embedding. Let P and Q be two discrete probability distributions defined on the same alphabet

X

; then, the KLD from Q to P is (notice that it is not symmetric):

KLD (P | | Q) ≐ \sum_{x \in X} P (x) log (\frac{P (x)}{Q (x)})

The KLD is always non-negative. The more similar the distributions P and Q are, the smaller the divergence. In the extreme case of two identical distributions, the divergence is zero.

It is interesting to note that the KLD has been proposed for similar problems in forensics, including steganographic security [23], distinguishability between forensic operators [24], or more general source identification problems [25].

In our case, the two compared distributions are those of

w^{(0)}

and

w^{(k)}

, for k just producing convergence with no decoding errors. Since the KLD is not symmetric, it remains to assign those distributions to P and Q so that the measure is as informative as possible. In particular, we are interested in properly accounting for the possible lateral spikes in the pdf of

w^{(k)}

. As those spikes often appear where the pdf of

w^{(0)}

is small if not negligible, this suggests assigning the latter pdf to Q and the former to P. However, this choice creates a problem in practice, as for some

x \in X

, the empirical probabilities are such that

P (x) \neq 0

and

Q (x) = 0

, potentially leading to an infinite divergence. To circumvent this issue related to insufficient sampling, we use an analytical approximation to Q with infinite support, after noticing that the empirical distribution of

w^{(0)}

with 1000 discrete bins (see Figure 2a) can be approximated by a zero-mean Generalized Gaussian Distribution (GGD) with shape parameter

β = 0.64

and scale parameter

α = 0.01

, (for notational coherence with the literature,

α

is used in this section to denote a different quantity than in the rest of the paper.) for which the latter controls the spread of the distribution. As a reference, the KLD between the empirical distribution of

w^{(0)}

and its GGD best-fit is

0.0177

, which is smaller than any of the KLDs that we find in Table 2. In order to compute the KLD in our experiments, we use this infinite-support symmetric distribution for Q and the empirical one of

w^{(k)}

for P after quantizing both to 1000 discrete bins.

The use of the KLD is adequate to measure the detectability in those cases where the adversary has access to information about the ’expected’ distribution of the weights. For instance, when only one layer is modified, the expected distribution can be inferred from the weights of other layers. However, this may be still too optimistic in terms of adversarial success, as while the expected shape may be preserved—and thus, inferred—across layers, the scale (directly affecting the variance) may be not so. For instance, if the original weights were expected to be zero-mean Gaussian and they still are after watermarking, the KLD (which depends on the ratio of the respective variances) may be quite large, but the adversary will not be able to determine if watermarking took place if he/she does not know what the variance should be and only measures divergence with respect to a Gaussian. To reflect this uncertainty, quite realistic in practical situations, we minimize the KLD with respect to the scale parameter

α

. This puts the adversary in a scenario where only the shape is used for detectability. Thus, let

Q_{α}

correspond to a GGD with scale parameter

α

, then we define the Scale Invariant KLD (SIKLD) as:

SIKLD ≐ min_{α} KLD (P | | Q_{α}) = min_{α} \sum_{x \in X} P (x) log (\frac{P (x)}{Q_{α} (x)})

6. Experiments and Results

In this section, we show the experimental results and we compare them to the theory that we have developed. We use MATLAB R2018b to implement the expressions obtained in Section 3 and represent the theoretical histograms. As we will see, both theory and experiments match reasonably well. In particular, for Adam optimization, we are able to reproduce the same position of the side spikes seen in the empirical histograms of

Δ w

, as well as some effects which are attributable to the influence of the denoising cost function. We will also verify the BOP method proposed in Section 4. In addition, the KLD will be computed to give a more precise measure of the similarity between the distributions of

w^{(k)}

(i.e., after the embedding) and

w^{(0)}

, when using SGD, Adam and BOP.

6.1. Experimental Set-Up

We employ the fine-tune-to-embed approach described in [5,6]. This means that the training process is divided into two phases, as we explained earlier: (1) training the host network from scratch, and (2) fine-tuning steps for embedding the watermark.

6.1.1. Training the Host Network

In order to perform the initial training of the host network FFDNet, we use the open-source implementation for PyTorch provided in [26]. We employ the FFDNet architecture for color images, which has a depth of 12 convolutional layers and 96 feature maps per layer. The training details are the same as in [26] and also the used datasets: Waterloo Exploration Database [27] for training and Kodak24 [28] for validation. We implement the cost function introduced in (1) and train 80 epochs with the milestones described in [26] on a GPU NVIDIA Titan Xp. We use Adam as the optimization algorithm with its hyperparameters set to their default values. After training the network, we test it on the CBSD68 [29] and Kodak24 datasets.

6.1.2. Watermark Embedding

Once we have trained and tested our host network, we embed our T-bit watermark,

b = 1

,

T = 256

, into the convolutional layer

l = 2

of FFDNet. In the next section, we present the results for both SGD and Adam optimization algorithms. The size of the convolutional filter is

3 \times 3

and the depth of input is

I = 96

, as well as the number of filters in the layer,

F = 96

. Therefore, we have:

M = 96 \cdot 3 \cdot 3 = 864

, and

N = M F

= 82,944. In addition, the learning rate

μ

is set to

10^{- 6}

during these fine-tuning steps and, also, we do not perform weight orthogonalization as we did during the initial training.

In addition, we use the following values for the regularizer parameter. When we use SGD we set

λ = 5

and

λ = 20

for Gaussian and orthogonal projectors, respectively. In addition, for Adam optimization, we use different values of

λ

for each configuration to better reflect the influence of the denoising function. In particular, we set

λ = 0.05

and

λ = 1

when we use Gaussian projectors, and

λ = 0.5

and

λ = 10

when we employ orthogonal projectors. We finish our embedding process when we reach a BER of zero, that is, when all the projected weights are positive—recall that all the embedded bits are set to

+ 1

—, i.e.,

{({\hat{ϕ}}^{(m)})}^{T} w > 0

, for all

m = 1, \dots, T

. Notice that these values of

λ

were selected with the goal of reaching a BER of zero in a relatively fast way and, as it can be seen, they are not straightforwardly comparable for Gaussian and orthogonal projectors. Finally, to check the validity of our proposed method BOP, we use the same values of

λ

as with Adam optimization, and set the number of blocks B to 12.

6.2. Experimental Results

Here, we present the experimental results. After the main training of the host network and before the watermark embedding, we obtain a PSNR of

31.18

dB and

32.13

dB on the CBSD68 and Kodak24 datasets, respectively, for a noise level of 25. These results are very close to those reported in [26]. Compare these values to those in Table 2, where we show the PSNR (dB) results on the CBSD68 and Kodak24 datasets for the same noise level after the watermark embedding was performed. As we see, when we embed the watermark using SGD with Gaussian projectors, the denoising performance drops about

0.45

dB, while, if we employ orthogonal projectors, the performance drops only

0.1

dB. Thus, employing orthogonal projectors with SGD optimization might be beneficial to better preserve the denoising performance. For Adam optimization and BOP, the original performance does not significantly drop and it even increases when using orthogonal projectors. Consequently, in order to keep a good performance after the watermark embedding, Adam would be preferable to SGD were it not for the conspicuousness of the weights. Our proposed method BOP is a good solution to bring the detectability of Adam down to similar levels as SGD and still enjoy the rest of advantages.

Table 2 also presents the number of iterations required for obtaining a BER of zero and the KLD and Scale Invariant KLD (SIKLD) between the distributions of

w^{(k)}

and

w^{(0)}

for each configuration.

6.2.1. Empirical Denoising Gradients

In order to analyze the influence of the denoising function, we need to get the empirical distributions of the mean denoising gradient and the variance of the batching noise. To that end, we proceed as follows: firstly, we extract the denoising gradients from the embedding layer

l = 2

and we average them for each coefficient over the number of iterations k to get the distribution of the mean. Then, the batching noise can be easily computed if, for each coefficient, we subtract the mean value from its corresponding denoising gradient value at each iteration. By computing the variance of this noise for each individual weight, we can estimate the overall distribution of the variance of the batching noise. Figure 3 shows the empirical distribution of the mean denoising gradient, D, and the variance of the batching noise, H.

6.2.2. SGD

We embed our watermark using SGD and its corresponding set-up, as we detailed in Section 6.1.2. We show the resulting histograms of

w^{(k)}

and

Δ w

in Figure 4. As we can see, SGD does not significantly alter the original distribution of weights shown in Figure 2a. This is also reflected in the SIKLD, which is very small, especially when we use orthogonal projectors (see Table 2).

Now, we check the theory we developed for SGD in Section 3.1. Recall that out theoretical analysis just covers the case of orthogonal projectors. Using (24), we can generate samples of

Δ w

without including the effect of the denoising cost function. The resulting histogram is shown in Figure 5a. Notice that the unusual appearance of this histogram can be attributed to the effects of applying the initial transformation explained in Section 2.3.1. In particular, each value of

φ

repeats F times to form vector

\hat{φ}

; hence, the discrete values in the y-axis of the histogram shown in Figure 5a. However, we see that the range of values of this theoretical histogram fits quite well the empirical one (Figure 4c). In order to get a more accurate representation, we can generate samples of

Δ w

according to (48), so that we add the effect of noise coming from the denoising cost function. As we see in Figure 5b, the resulting histogram is now very similar to the one in Figure 4c.

In addition, we confirm that (25) can be used to compute the variance of the distribution of

w^{(k)}

when we implement orthogonal projectors. Firstly, we check the hypothesis that we made regarding the uncorrelatedness between

w^{(0)}

and

Δ w

. For our particular case of

λ = 20

, the correlation coefficient between

w^{(0)}

and

Δ w

is

2.539 \cdot 10^{- 4}

, a very small value that confirms our assumption. Using (25), we have that the variance of the empirical distribution of

w^{(k)}

—red histogram in Figure 4c—is

1.192 \cdot 10^{- 3}

while the theoretical variance is

1.193 \cdot 10^{- 3}

. As we see, these values are almost identical.

6.2.3. Adam

In the following experiments, we employ Adam optimization for the watermark embedding and use the same settings as in Section 6.1.2. The resulting histograms of

w^{(k)}

and

Δ w

are shown in Figure 6 and Figure 7, respectively. As it can be observed, the shape of the weight distribution changes to a great extent for

λ = 1

and

λ = 10

. For smaller values of

λ

, since the influence of the watermarking cost function is weaker, we can avoid having a significant alteration to the original distribution shape. This is also reflected on their SIKLD values in Table 2: as

λ

increases the SIKLD also increases considerably. However, notice that, whatever the value of

λ

is, the histograms of weight variations always present the characteristic side spikes. These footprints left by Adam can increase the detectability of the watermark. In addition, when

λ

is small, we can observe in these histograms the influence of the denoising cost function: it causes the appearance of a central peak with values that spread till the location of both side spikes.

Figure 8 represents the pdf of

Γ

obtained from (43). Notice that, as we stated in Section 3.2.5, the pdf is concentrated around its mode. Figure 9 shows the histograms of

Δ w

obtained from (44) when only the watermarking loss

\tilde{f} (w)

is optimized and the denoising component is set to zero. Compare these histograms to those in Figure 7: the theory developed in Section 3.2 is able to explain the two-spiked distributions of

Δ w

. Notice that these theoretical expressions provide a good enough approximation since they allow us to predict the position of the side spikes. We show in Table 3 the values of these positions obtained from both theoretical (Figure 9) and empirical (Figure 7) results. As we can see, these side spikes are placed in almost identical positions by both theory and experiments, hence, we can confirm that the sign phenomenon in Adam is responsible for this ant-like behavior shown by the weights at the embedding layer.

Finally, in order to reflect the influence of the denoising function and obtain more realistic histograms, we can solve the fourth-degree Equation (A11) and, then, generate samples of

Δ w

according to (A12). The resulting histograms are shown in Figure 10 and are now quite similar to those in Figure 7. As it can be seen, we can emulate the central peak and the dispersion of the values of the side spikes in the histograms and, thus, we can confirm that these effects are attributable to the influence of the denoising cost function.

6.2.4. BOP

Here, we represent the empirical histograms when we implement our method BOP. Figure 11 and Figure 12 show the histograms of

w^{(k)}

and

Δ w

, respectively. As we see from these histograms and the KLD and SIKLD values in Table 2, this method allows us to remove the side spikes of the histograms and much better preserve the original shape of the weight distribution. As a result, the detectability of the watermark due to Adam optimization is strongly reduced.

A positive side effect of the undetectability of the watermark is that the robustness is increased because an adversary will not know which layer must be modified in order to alter the embedded watermark. This is illustrated in Figure 13 where we compare the robustness of standard Adam with that of BOP against weight pruning. The network is trained long enough for both Adam and BOP to guarantee a similar BER vs. pruning rate, as shown in Figure 13a. Then, the PSNR obtained after training and pruning is shown in Figure 13b for the Kodak24 dataset which illustrates the following facts: (1) BOP produces a network that is more robust to pruning in terms of PSNR, which can be valuable towards model compression; for instance, for a pruning rate of 0.35 (that has no impact on the BER of the hidden information), BOP degrades the original PSNR by about 1 dB, whereas Adam would produce a degradation of more than 3 dB. (2) This robustness might be detrimental in case it is an attacker who does the pruning in an attempt to degrade the watermark; for instance, for a pruning rate of 0.82, the BER for both Adam and BOP rises to 0.02 (see Figure 13a). For this pruning rate, the PSNR of BOP is around 25 dB while Adam gives slightly less than 24 dB. Then, in the case of BOP, the adversary would be able to produce a network that performs closer to the original in terms of PSNR—notice, however, that the degradation in both cases is quite severe, so this heavy pruning would render a denoiser with little practical use —. (3) In any case, the previous comparison would assume that the adversary knows the layer that contains the watermark; as we have properly justified, this is reasonable for Adam but not so for BOP. If the adversary does not know the layer that must be pruned, then, to achieve the same target BER, he/she must prune all the layers. In this case, for the same pruning rate of 0.82 that causes the BER to increase to 0.02, the PSNR drops to less than 18 dB.

Similar conclusions can be extracted from Figure 13c that shows the PSNR vs. pruning rate for the CBSD68 dataset. These experiments clearly show the higher robustness brought about by the undetectability of BOP, as it prevents attacks targeted to a specific layer.

7. Conclusions

Throughout this paper, we have shown the importance of being careful with the optimization algorithm when we embed watermarks following the approach in [5,6]. The choice of certain optimization algorithms whose update direction is given by the sign function can originate footprints in the distributions of weights that are easily detectable by adversaries, thus compromising the efficacy of the watermarking algorithm.

In particular, we studied the mechanisms behind SGD and Adam optimization and found that the sign phenomenon that occurs in Adam is detrimental for watermarking, since it causes the appearance of two salient side spikes on the histograms of weights. As opposed to Adam, the sign function does not appear when we use SGD. Therefore, SGD does not significantly alter the original shape of the distribution of weights although, as we showed in the theoretical analysis, it slightly increases its variance. The analysis in this paper can be extended to other optimization algorithms.

In addition, we introduced orthogonal projectors and observed that, compared to the Gaussian case, they generally preserve the original performance and weight distribution better. However, a deeper analysis on this subject is left for further research.

Finally, we presented a novel method that uses orthogonal block projections to address the use of Adam optimization together with the watermarking algorithm under study. As we checked in the empirical section, this method allows us to solve the detectability problem posed by Adam and still enjoy the rest of advantages of this optimization algorithm.

Author Contributions

Conceptualization, B.C.-L. and F.P.-G.; methodology, F.P.-G.; software, B.C.-L.; validation, B.C.-L. and F.P.-G.; formal analysis, B.C.-L. and F.P.-G.; investigation, B.C.-L.; resources, B.C.-L.; writing—original draft preparation, B.C.-L.; writing—review and editing, B.C.-L. and F.P.-G.; supervision, F.P.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially funded by the Agencia Estatal de Investigación (Spain) and the European Regional Development Fund (ERDF) under projects RODIN (PID2019-105717RB-C21) and RED COMONSENS (RED2018-102668-T). In addition, it was funded by the Xunta de Galicia and ERDF under projects ED431C 2017/53 and ED431G 2019/08.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Mathematical Derivations

Appendix A.1. Projected Weights at k = 0

We want to compute the mean and the variance of

{\hat{Φ}}^{T} w^{(0)}

. It is easy to see that the mean is zero for any

{({\hat{ϕ}}^{(i)})}^{T} w^{(0)}

,

i = 1, \dots, T

because

E {{\hat{ϕ}}_{j}^{(i)}} = 0

.

Now, in order to obtain an expression for the variance, we may compute the expectation of the trace of the covariance matrix of

{\hat{Φ}}^{T} w^{(0)}

divided by T. Therefore, if we use the properties of the trace, we have:

\begin{matrix} Var {{\hat{Φ}}^{T} w^{(0)}} & = & \frac{1}{T} E \{Tr [{\hat{Φ}}^{T} w^{(0)} {(w^{(0)})}^{T} \hat{Φ}]\} \\ = & \frac{1}{T} E \{Tr [w^{(0)} {(w^{(0)})}^{T} \hat{Φ} {\hat{Φ}}^{T}]\} \\ = & \frac{1}{T} \sum_{i, j} E \{{(w^{(0)} {(w^{(0)})}^{T})}_{i, j} {(\hat{Φ} {\hat{Φ}}^{T})}_{i, j}\} \end{matrix}

We can write:

E \{{(w^{(0)} {(w^{(0)})}^{T})}_{i, j} {(\hat{Φ} {\hat{Φ}}^{T})}_{i, j}\} = E {w_{i}^{(0)} w_{j}^{(0)}} E \{\sum_{m = 1}^{T} {\hat{ϕ}}_{i}^{(m)} {\hat{ϕ}}_{j}^{(m)}\}

(A1)

It is straightforward to see that, for the off-diagonal terms, i.e.,

i \neq j

, the expectation in (A1) is zero. For the N diagonal terms, i.e.,

i = j

, we have instead:

E \{{(w_{i}^{(0)})}^{2}\} E \{\sum_{m = 1}^{T} {({\hat{ϕ}}_{i}^{(m)})}^{2}\} = T \cdot Var {w^{(0)}} Var {{\hat{ϕ}}^{(m)}}

where

Var {{\hat{ϕ}}^{(m)}}

is the variance of any projection vector

{\hat{ϕ}}^{(m)}

,

m = 1, \dots, T

. When we use Gaussian projectors, we have

Var {{\hat{ϕ}}^{(m)}} = 1 / F^{2}

and when we use orthogonal projectors, we have

Var {{\hat{ϕ}}^{(m)}} = 1 / (F N)

. Therefore, the variance for Gaussian projectors is

Var {{\hat{Φ}}^{T} w^{(0)}} = \frac{N}{F^{2}} Var {w^{(0)}}

while for orthogonal projector is

Var {{\hat{Φ}}^{T} w^{(0)}} = \frac{1}{F} Var {w^{(0)}}

Appendix A.2. Adam: Mean of the Gradient

In order to find an expression for the mean of the gradient, we substitute (17) into (26) and obtain:

\begin{matrix} m^{(k)} & = & - \frac{λ}{2} \cdot \hat{φ} \cdot (1 - β_{1}) \sum_{i = 1}^{k} β_{1}^{k - i} + \frac{λ}{4} \cdot \hat{Ψ} \cdot (1 - β_{1}) \sum_{i = 1}^{k} β_{1}^{k - i} w^{(i)} \\ = & - \frac{λ}{2} \cdot \hat{φ} \cdot (1 - β_{1}^{k}) + \frac{λ}{4} \cdot \hat{Ψ} \cdot (1 - β_{1}) \sum_{i = 1}^{k} β_{1}^{k - i} w^{(i)} \end{matrix}

The bias-corrected mean gradient is:

\begin{matrix} {\hat{m}}^{(k)} = \frac{m^{(k)}}{1 - β_{1}^{k}} = - \frac{λ}{2} \hat{φ} + \frac{λ}{4} \cdot \hat{Ψ} \cdot \frac{1 - β_{1}}{1 - β_{1}^{k}} \sum_{i = 1}^{k} β_{1}^{k - i} w^{(i)} \end{matrix}

The main difficulty in finding an explicit expression for

{\hat{m}}^{(k)}

is the last sum which requires a closed-form expression for

w^{(i)}

, for all

i = 0, \dots, k

, which in turn depends on the Adam adaptation. This easily leads to a difference equation whose solution is cumbersome. Alternatively, we start by conjecturing the affine-growth in (18) and write:

\begin{matrix} \frac{1 - β_{1}}{1 - β_{1}^{k}} \sum_{i = 1}^{k} β_{1}^{k - i} w^{(i)} & = & \frac{1 - β_{1}}{1 - β_{1}^{k}} \sum_{i = 1}^{k} β_{1}^{k - i} (w^{(0)} + i \cdot μ η) \\ = & w^{(0)} + μ η \frac{1 - β_{1}}{1 - β_{1}^{k}} \sum_{i = 1}^{k} i \cdot β_{1}^{k - i} \\ = & w^{(0)} + μ η (\frac{k}{1 - β_{1}^{k}} - \frac{β_{1}}{1 - β_{1}}) \end{matrix}

Therefore, under the affine-growth hypothesis, we can write:

\begin{matrix} {\hat{m}}^{(k)} = - \frac{λ}{2} \hat{φ} + \frac{λ}{4} \cdot \hat{Ψ} (w^{(0)} + μ η (\frac{k}{1 - β_{1}^{k}} - \frac{β_{1}}{1 - β_{1}})) \end{matrix}

and arrive at the expression in (27).

Appendix A.3. Adam: Variance of the Gradient

To get an expression for the variance of the gradient, we plug (31) into (34) and write:

\begin{matrix} v_{j}^{(k)} & = & \frac{λ^{2}}{4} (1 - β_{2}) \sum_{i = 1}^{k} β_{2}^{k - i} (p_{j} + q_{j} μ i + r_{j} μ^{2} i^{2}) \\ = & \frac{λ^{2}}{4} (1 - β_{2}) p_{j} \sum_{i = 1}^{k} β_{2}^{k - i} + \frac{λ^{2}}{4} (1 - β_{2}) q_{j} μ \sum_{i = 1}^{k} i β_{2}^{k - i} + \frac{λ^{2}}{4} (1 - β_{2}) r_{j} μ^{2} \sum_{i = 1}^{k} i^{2} β_{2}^{k - i} \\ = & \frac{λ^{2}}{4} (1 - β_{2}^{k}) (p_{j} - \frac{β_{2}}{1 - β_{2}} μ q_{j} + \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} μ^{2} r_{j}) + \frac{λ^{2}}{4} k (μ q_{j} - \frac{2 β_{2}}{1 - β_{2}} μ^{2} r_{j}) + \frac{λ^{2}}{4} k^{2} μ^{2} r_{j} \end{matrix}

Appendix A.4. Adam: A Projection-Based Decomposition of $c_{j}^{T} sgn (\hat{φ})$

We want to write

c_{j}^{T} sgn (\hat{φ}) = α \cdot {\hat{φ}}_{j} + z_{j}

. To find

α

, we compute the cross-correlation with

\hat{φ}

, i.e.,

E {{\hat{φ}}^{T} C^{T} sgn (\hat{φ})} = α E {| | \hat{φ} {| |}^{2}}

. Then,

z

is simply

C^{T} sgn (\hat{φ}) - α \cdot \hat{φ}

. In the following, we show separately the derivations for the Gaussian and orthogonal projectors.

Appendix A.4.1. Decomposition for Gaussian Projectors

Recall that the projection vectors have i.i.d. components drawn from a

N (0, 1)

distribution. Since the quantity of interest

E {{\hat{φ}}^{T} C^{T} sgn (\hat{φ})}

is a scalar, we will use the trace to manipulate the involved matrices, as follows:

\begin{matrix} E {{\hat{φ}}^{T} C^{T} sgn (\hat{φ})} & = & Tr [E {{\hat{φ}}^{T} C^{T} sgn (\hat{φ})}] = E \{Tr [C^{T} sgn (\hat{φ}) {\hat{φ}}^{T}]\} \\ = & \sum_{i, j} E \{{(C \circ (sgn (\hat{φ}) {\hat{φ}}^{T}))}_{i, j}\} = \sum_{i, j} \sum_{m = 1}^{T} \sum_{n = 1}^{T} E {{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i}) {\hat{ϕ}}_{j}^{(m)} {\hat{ϕ}}_{j}^{(n)}} \end{matrix}

where

{\hat{ϕ}}_{i}^{(m)} = θ_{i} ϕ^{(m)}

,

m = 1, \dots, T

,

i = 1, \dots, N

, and

θ_{i}

is the ith row of matrix

Θ

. In order to compute the previous expectation, we need to consider separately the diagonal terms and the off-diagonal ones.

We start with the off-diagonal elements (i.e.,

i \neq j

). Here, we also need to distinguish the following cases:

⌊\frac{i}{F}⌋ \neq ⌊\frac{j}{F}⌋

, satisfied by

N (N - F)

elements, and

⌊\frac{i}{F}⌋ = ⌊\frac{j}{F}⌋

, which applies for the remaining

N (F - 1)

off-diagonal terms. Therefore, in the first subcase, we have that

θ_{i} \neq θ_{j}

, so

{\hat{ϕ}}_{i}^{(m)} \neq {\hat{ϕ}}_{j}^{(m)}

and:

\sum_{m = 1}^{T} \sum_{n = 1}^{T} E {{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i}) {\hat{ϕ}}_{j}^{(m)} {\hat{ϕ}}_{j}^{(n)}} = \sum_{m = 1}^{T} E {{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})} E {{({\hat{ϕ}}_{j}^{(m)})}^{2}}

(A2)

In order to compute the expectation

E {{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})}

in (A2), we write:

\begin{matrix} {\hat{φ}}_{i} & = & {\hat{ϕ}}_{i}^{(m)} + \sum_{\overset{l = 1}{i \neq m}}^{T} {\hat{ϕ}}_{i}^{(l)} \\ = & {\hat{ϕ}}_{i}^{(m)} + n_{i} \end{matrix}

where

n_{i}

can be seen as a realization of an i.i.d. Gaussian distribution with zero-mean and variance

(T - 1) / F^{2}

. Assume without loss of generality that

{\hat{ϕ}}_{i}^{(m)} > 0

. Then, the probability that

sgn ({\hat{φ}}_{i}) = - 1

is

p ≐ Q ({\hat{ϕ}}_{i}^{(m)} F / \sqrt{T - 1})

, (

Q (x) ≐ \frac{1}{\sqrt{2 π}} \int_{x}^{\infty} e^{- x^{2} / 2} d x

.) and

{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})

will take the value

- {\hat{ϕ}}_{i}^{(m)}

with probability p, and

{\hat{ϕ}}_{i}^{(m)}

with probability

1 - p

. Then, the distribution of

{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})

will correspond to that of a random variable Y constructed as follows: letting

X \sim N (0, 1 / F^{2})

, we define Y as

Y ≐ | X | \cdot (1 - 2 Q (| X | F / \sqrt{T - 1}))

. Then,

E {Y} = \frac{F}{\sqrt{2 π}} \int_{0}^{\infty} 2 x (1 - 2 Q (\frac{x F}{\sqrt{T - 1}})) e^{- x^{2} F^{2} / 2} d x

It can be shown that this integral gives [30]:

E {Y} = \frac{1}{F} \sqrt{\frac{2}{π T}} ≐ μ_{Y}

(A3)

Therefore, when

i \neq j

and

⌊\frac{i}{F}⌋ \neq ⌊\frac{j}{F}⌋

, we can compute (A2) as:

\sum_{m = 1}^{T} E {{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})} E {{({\hat{ϕ}}_{j}^{(m)})}^{2}} = \frac{T}{F^{2}} μ_{Y}

When

i \neq j

and

⌊\frac{i}{F}⌋ = ⌊\frac{j}{F}⌋

, we have that

θ_{i} = θ_{j}

, so

{\hat{ϕ}}_{i}^{(m)} = {\hat{ϕ}}_{j}^{(m)}

. Thus:

\begin{matrix} \sum_{m = 1}^{T} \sum_{n = 1}^{T} E {{({\hat{ϕ}}_{i}^{(m)})}^{2} {\hat{ϕ}}_{i}^{(n)} sgn ({\hat{φ}}_{i})} & = & \sum_{m = 1}^{T} E {{({\hat{ϕ}}_{i}^{(m)})}^{3} sgn ({\hat{φ}}_{i})} + \sum_{m = 1}^{T} \sum_{\overset{n = 1}{n \neq m}}^{T} E {{({\hat{ϕ}}_{i}^{(m)})}^{2} {\hat{ϕ}}_{i}^{(n)} sgn ({\hat{φ}}_{i})} \\ = & T ξ_{Y^{'}} + T (T - 1) E {{({\hat{ϕ}}_{i}^{(m)})}^{2} {\hat{ϕ}}_{i}^{(n)} sgn ({\hat{φ}}_{i})} \end{matrix}

(A4)

where

ξ_{Y^{'}}

can be computed as the mean of

Y^{'}

defined in a similar way as above, i.e.,: let

X \sim N (0, 1 / F^{2})

, we define

Y^{'}

as

Y^{'} ≐ {| X |}^{3} \cdot (1 - 2 Q (| X | F / \sqrt{T - 1}))

. Then:

ξ_{Y^{'}} ≐ E {Y^{'}} = \frac{F}{\sqrt{2 π}} \int_{0}^{\infty} 2 x^{3} (1 - 2 Q (\frac{x F}{\sqrt{T - 1}})) e^{- x^{2} F^{2} / 2} d x

Using Mathematica, it is found that:

ξ_{Y^{'}} = \sqrt{\frac{2}{π T}} \frac{3 T - 1}{T F^{3}} = \frac{(3 T - 1)}{T F^{2}} μ_{Y}

(A5)

Unfortunately, the double integral required to compute the expectation in (A4) does not seem to admit a closed-form solution. On the other hand, its numerical computation through Monte Carlo integration is rather straightforward. Let

X, Y \sim N (0, 1 / F^{2})

and

Z \sim N (0, \frac{T - 2}{F^{2}})

, all mutually independent, then the desired expectation is

E {X Y^{2} sgn (X + Y + Z)} ≐ κ (T)

, where, with

κ (T)

, we stress the fact that the integral depends on T alone. The result is represented in Figure A1. It is possible to show that, as

T \to \infty

,

κ (T) \to \frac{μ_{Y}}{F^{2}}

.

Figure A1.

F^{2} \cdot κ (T) / μ_{Y}

vs. T

Figure A1.

F^{2} \cdot κ (T) / μ_{Y}

vs. T

Then, for moderately large T, the sum in (A4) is approximately

μ_{Y} (T^{2} + 2 T - 1) / F^{2}

.

For the

i = j

case (i.e., diagonal terms), we obtain the same result as (A4). Therefore, when computing

\sum_{i, j} E {{(C \circ (sgn (\hat{φ}) {\hat{φ}}^{T}))}_{i, j}}

, there are

N (N - F)

terms with value

μ_{Y} T / F^{2}

, and

N F

terms with approximate value

μ_{Y} (T^{2} + 2 T - 1) / F^{2}

, hence:

\begin{matrix} E {{\hat{φ}}^{T} C^{T} sgn (\hat{φ})} \approx \frac{N μ_{Y}}{F^{2}} (N T + F (T^{2} + T - 1)) \end{matrix}

In addition,

E {| | \hat{φ} {| |}^{2}}} = E {{\hat{φ}}^{T} \hat{φ}} = E {φ^{T} Θ^{T} Θ φ} = \frac{M T}{F} = \frac{N T}{F^{2}}

, so we have:

\begin{matrix} α & = & \frac{E {{\hat{φ}}^{T} C^{T} sgn (\hat{φ})}}{E {| | \hat{φ} | |^{2}}}} \approx \frac{F^{2} N μ_{Y} (N T + F (T^{2} + T - 1))}{F^{2} N T} \\ = & \frac{μ_{Y}}{T} (F T^{2} + (N + F) T - F) \end{matrix}

(A6)

We are interested now in measuring the variance of

z_{j}

,

j = 1, \dots, N

, which, by construction, we assume to be i.i.d. First, we note that the covariance matrix is:

C^{T} sgn (\hat{φ}) sgn ({\hat{φ}}^{T}) C - α^{2} \hat{φ} {\hat{φ}}^{T}

Then, the variance of

z_{j}

will be the expectation of the trace of this matrix divided by N. The second term is immediate to compute, as

E {Tr [\hat{φ} {\hat{φ}}^{T}]} = N T / F^{2}

, so we focus next on

Tr [C^{T} sgn (\hat{φ}) sgn ({\hat{φ}}^{T}) C] = Tr [C C^{T} sgn (\hat{φ}) sgn ({\hat{φ}}^{T})]

. Again, we can write:

E {Tr [C C^{T} sgn (\hat{φ}) sgn ({\hat{φ}}^{T})]} = E \{\sum_{i, j} {(C C^{T})}_{i, j} {(sgn (\hat{φ}) sgn ({\hat{φ}}^{T}))}_{i, j}\}

Note that:

\begin{matrix} C C^{T} & = & \sum_{m = 1}^{T} \sum_{l = 1}^{T} {\hat{ϕ}}^{(m)} {({\hat{ϕ}}^{(m)})}^{T} {\hat{ϕ}}^{(l)} {({\hat{ϕ}}^{(l)})}^{T} \\ = & \sum_{m = 1}^{T} \sum_{l = 1}^{T} (\sum_{n = 1}^{N} {\hat{ϕ}}_{n}^{(m)} {\hat{ϕ}}_{n}^{(l)}) {\hat{ϕ}}^{(m)} {({\hat{ϕ}}^{(l)})}^{T} \end{matrix}

Then,

E {{(C C^{T})}_{i, j} {(sgn (\hat{φ}) sgn ({\hat{φ}}^{T}))}_{i, j}} = \sum_{m = 1}^{T} \sum_{l = 1}^{T} \sum_{n = 1}^{N} E \{{\hat{ϕ}}_{n}^{(m)} {\hat{ϕ}}_{n}^{(l)} {\hat{ϕ}}_{i}^{(m)} {\hat{ϕ}}_{j}^{(l)} sgn ({\hat{φ}}_{i}) sgn ({\hat{φ}}_{j})\}

(A7)

We analyze first the off-diagonal terms, i.e.,

i \neq j

, when

⌊\frac{i}{F}⌋ \neq ⌊\frac{j}{F}⌋

, and consider separately the cases where

m \neq l

and

m = l

. For the first case, the right-hand side of (A7) can be written as:

\begin{matrix} \sum_{m = 1}^{T} \sum_{\overset{l = 1}{l \neq m}}^{T} \sum_{\overset{n = 1}{⌊\frac{n}{F}⌋ \neq ⌊\frac{i}{F}⌋ \neq ⌊\frac{j}{F}⌋}}^{N} E \{{\hat{ϕ}}_{n}^{(m)}\} E \{{\hat{ϕ}}_{n}^{(l)}\} E \{{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})\} E \{{\hat{ϕ}}_{j}^{(l)} sgn ({\hat{φ}}_{j})\} \\ + & F \sum_{m = 1}^{T} \sum_{\overset{l = 1}{l \neq m}}^{T} E \{{({\hat{ϕ}}_{i}^{(m)})}^{2} {\hat{ϕ}}_{i}^{(l)} sgn ({\hat{φ}}_{i})\} E \{{\hat{ϕ}}_{j}^{(l)} sgn ({\hat{φ}}_{j})\} \\ + & F \sum_{m = 1}^{T} \sum_{\overset{l = 1}{l \neq m}}^{T} E \{{({\hat{ϕ}}_{j}^{(l)})}^{2} {\hat{ϕ}}_{j}^{(m)} sgn ({\hat{φ}}_{j})\} E \{{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})\} \\ = & 2 F \sum_{m = 1}^{T} \sum_{\overset{l = 1}{l \neq m}}^{T} E \{{({\hat{ϕ}}_{i}^{(m)})}^{2} {\hat{ϕ}}_{i}^{(l)} sgn ({\hat{φ}}_{i})\} E \{{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})\} \\ = & \frac{2 μ_{Y}^{2}}{F} T (T - 1) \end{matrix}

When

⌊\frac{i}{F}⌋ \neq ⌊\frac{j}{F}⌋

and

m = l

, the right-hand side of (A7) can be written as:

\begin{matrix} \sum_{m = 1}^{T} \sum_{\overset{n = 1}{⌊\frac{n}{F}⌋ \neq ⌊\frac{i}{F}⌋ \neq ⌊\frac{j}{F}⌋}}^{N} E \{{({\hat{ϕ}}_{n}^{(m)})}^{2}\} E \{{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})\} E \{{\hat{ϕ}}_{j}^{(m)} sgn ({\hat{φ}}_{j})\} \\ + & F \sum_{m = 1}^{T} E \{{({\hat{ϕ}}_{i}^{(m)})}^{3} sgn ({\hat{φ}}_{i})\} E \{{\hat{ϕ}}_{j}^{(m)} sgn ({\hat{φ}}_{j})\} + F \sum_{m = 1}^{T} E \{{({\hat{ϕ}}_{j}^{(m)})}^{3} sgn ({\hat{φ}}_{j})\} E \{{\hat{ϕ}}_{i}^{(m)} sgn ({\hat{φ}}_{i})\} \\ = & \frac{1}{F^{2}} (T (N - 2 F) μ_{Y}^{2} + 2 F μ_{Y}^{2} (3 T - 1)) \end{matrix}

Now, we consider the rest of elements which satisfy that

⌊\frac{i}{F}⌋ = ⌊\frac{j}{F}⌋

, which are the N diagonal terms and the remaining

N (F - 1)

off-diagonal ones. The right-hand side of (A7) is:

\begin{matrix} \sum_{m = 1}^{T} \sum_{l = 1}^{T} \sum_{n = 1}^{N} E {{\hat{ϕ}}_{n}^{(m)} {\hat{ϕ}}_{n}^{(l)} {\hat{ϕ}}_{i}^{(m)} {\hat{ϕ}}_{i}^{(l)}} = F \sum_{m = 1}^{T} \sum_{\overset{l = 1}{l \neq m}}^{T} E \{{({\hat{ϕ}}_{i}^{(m)})}^{2}\} E \{{({\hat{ϕ}}_{i}^{(l)})}^{2}\} \\ + & \sum_{m = 1}^{T} \sum_{\overset{n = 1}{⌊\frac{n}{F}⌋ \neq ⌊\frac{i}{F}⌋}}^{N} E \{{({\hat{ϕ}}_{n}^{(m)})}^{2}\} E \{{({\hat{ϕ}}_{i}^{(m)})}^{2}\} + F \sum_{m = 1}^{T} E \{{({\hat{ϕ}}_{i}^{(m)})}^{4}\} \\ = & \frac{1}{F^{4}} (F T (T - 1) + T (N - F) + 3 F T) \end{matrix}

where, for the last summand, we have used the fact that, for a zero-mean Gaussian random variable X with variance

σ^{2}

,

E {X^{4}} = 3 σ^{4}

. Then, the sum in (A7) is:

\begin{matrix} \sum_{m = 1}^{T} \sum_{l = 1}^{T} \sum_{n = 1}^{N} E \{{\hat{ϕ}}_{n}^{(m)} {\hat{ϕ}}_{n}^{(l)} {\hat{ϕ}}_{i}^{(m)} {\hat{ϕ}}_{j}^{(l)} sgn ({\hat{φ}}_{i}) sgn ({\hat{φ}}_{j})\} \\ = & N (N - F) [\frac{2 μ_{Y}^{2}}{F} T (T - 1) + \frac{1}{F^{2}} (T (N - 2 F) μ_{Y}^{2} + 2 F μ_{Y}^{2} (3 T - 1))] \\ + & \frac{N}{F^{3}} (F T (T - 1) + T (N - F) + 3 F T) \end{matrix}

and the resulting variance of

z_{j}

is:

\begin{matrix} Var {z_{j}} & = & (N - F) (\frac{2 μ_{Y}^{2}}{F} T (T - 1) + \frac{μ_{Y}^{2}}{F^{2}} T (N - 2 F) + \frac{2 μ_{Y}^{2}}{F} (3 T - 1)) \\ + & \frac{1}{F^{3}} (F T (T - 1) + T (N - F) + 3 F T) - \frac{T}{F^{2}} μ_{Y}^{2} {(N + \frac{F}{T} (T^{2} + T - 1)])}^{2} \\ = & - \frac{μ_{Y}^{2}}{F T} (F T^{4} + 4 F T^{3} + (N + F) T^{2} - 4 F T + F) + \frac{T}{F^{3}} (N + F (T + 1)) \end{matrix}

(A8)

Appendix A.4.2. Decomposition for Orthogonal Projectors

Now, for orthogonal projectors, we take a different route. Again, we define matrix

C ≐ [c_{0}, \dots, c_{N}]

, and we note that

C = {\hat{Ψ}}^{T} = \hat{Ψ}

. Recall also properties (13) and (14). We can collect all products

c_{j}^{T} sgn (\hat{φ})

,

j = 1, \dots, N

in a vector obtained as

C^{T} sgn (\hat{φ}) = \hat{Ψ} sgn (\hat{φ})

.

We are interested in computing the cross-product of this vector and

\hat{φ}

. Then, we can write:

{\hat{φ}}^{T} \hat{Ψ} sgn (\hat{φ}) = \frac{1}{F} {\hat{φ}}^{T} sgn (\hat{φ})

Assuming i.i.d. components in

\hat{φ}

, this implies that, for the orthogonal projectors, the cross-correlation of

c_{j}^{T} sgn (\hat{φ})

and

{\hat{φ}}_{j}

is the same as

\frac{1}{F} sgn ({\hat{φ}}_{j})

and

{\hat{φ}}_{j}

. We can then compute

E {{\hat{φ}}_{j} sgn ({\hat{φ}}_{j})} = E {| {\hat{φ}}_{j} |}

. Modeling

{\hat{φ}}_{j}

as

N (0, T / (F N))

, we find that

E {| {\hat{φ}}_{j} |} = \sqrt{\frac{2 T}{π F N}}

. Furthermore,

E {| | \hat{φ} {| |}^{2}} = E {{\hat{φ}}^{T} \hat{φ}} = E {φ^{T} Θ^{T} Θ φ} = \frac{T}{F}

.

The existence of a positive cross-correlation suggests writing

\frac{1}{F} sgn ({\hat{φ}}_{j}) = α {\hat{φ}}_{j} + {\tilde{n}}_{j}

, with

α

a suitable positive constant and

\tilde{n}

a zero-mean noise vector uncorrelated with

\hat{φ}

. Taking the cross-product and the expectation:

\frac{1}{F} E {{\hat{φ}}^{T} sgn (\hat{φ})} = \frac{N}{F} E {| {\hat{φ}}_{j} |} = α E {| | \hat{φ} {| |}^{2}} + E {{\hat{φ}}^{T} \tilde{n}} = α E {| | \hat{φ} {| |}^{2}}

Therefore, we find that:

α = \frac{N \cdot E {| {\hat{φ}}_{j} |}}{F \cdot E {| | \hat{φ} | |^{2}}} = \sqrt{\frac{2 N}{π T F}}

(A9)

Now, it is easy to measure the second order moment of

{\tilde{n}}_{j}

since:

\begin{matrix} E {{\tilde{n}}_{j}^{2}} & = & E {{(\frac{1}{F} sgn ({\hat{φ}}_{j}) - α {\hat{φ}}_{j})}^{2}} \\ = & \frac{1}{F^{2}} + α^{2} E {{\hat{φ}}_{j}^{2}} - \frac{2 α}{F} E {{\hat{φ}}_{j} sgn ({\hat{φ}}_{j})} \\ = & \frac{1}{F^{2}} (1 - \frac{2}{π}) \end{matrix}

With the above characterization, we can see that:

\begin{matrix} C^{T} sgn (\hat{φ}) & = & F α \hat{Ψ} \hat{φ} + F \hat{Ψ} \tilde{n} \\ = & α \hat{φ} + z \end{matrix}

Therefore, we have:

\begin{matrix} Tr [E {z z^{T}}] = F^{2} \cdot Tr [E {\hat{Ψ} \tilde{n} {\tilde{n}}^{T} \hat{Ψ}}] = F \cdot E {Tr [\tilde{n} {\tilde{n}}^{T} \hat{Ψ}]} = F \cdot E \{\sum_{i, j} {(\tilde{n} {\tilde{n}}^{T})}_{i, j} {(\hat{ψ})}_{i, j}\} \end{matrix}

Then,

E \{{(\tilde{n} {\tilde{n}}^{T})}_{i, j} {(\hat{ψ})}_{i, j}\} = E \{{\tilde{n}}_{i} {\tilde{n}}_{j}\} E \{\sum_{m = 1}^{T} {\hat{ϕ}}_{i}^{(m)} {\hat{ϕ}}_{j}^{(m)}\}

When

⌊\frac{i}{F}⌋ \neq ⌊\frac{j}{F}⌋

, the expectation above is zero. For the

N F

remaining terms, that is, when

⌊\frac{i}{F}⌋ = ⌊\frac{j}{F}⌋

, we have

{\tilde{n}}_{i} = {\tilde{n}}_{j}

and

{\hat{ϕ}}_{i}^{(m)} = {\hat{ϕ}}_{j}^{(m)}

. Thus:

E {{\tilde{n}}_{j}^{2}} \sum_{m = 1}^{T} E \{{({\hat{ϕ}}_{j}^{(m)})}^{2}\} = \frac{1}{F^{2}} (1 - \frac{2}{π}) \frac{T}{N F} = \frac{T}{N F^{3}} (1 - \frac{2}{π})

Finally, we can compute the variance of

z_{j}

as follows:

\begin{matrix} Var {z_{j}} = \frac{1}{N} Tr [E {z z^{T}}] = (1 - \frac{2}{π}) \frac{T}{F N} \end{matrix}

(A10)

Appendix A.5. Adam: Analysis with Denoising and Watermarking

Here, we join the denoising and watermarking cost functions and follow a similar approach to that in Section 3.2.5. Let

c

be a random vector of length N, now we should build a random projection matrix,

\hat{Φ} = Θ Φ

—with Gaussian or orthogonal projectors—, as a basis to generate samples of

Ξ

from (11) and to build

c

following (29).

We also consider that

η = η_{d} + η_{w m}

is a random vector of length N, where

η_{d}

and

η_{w m}

represent the denoising and watermarking update terms, respectively. The components of

η_{d}

can be computed from realizations of D and V like in Section 3.3 that is,

η_{d} = δ / (\sqrt{δ^{2} + ν})

. On the other hand,

η_{w m}

has the same definition than in Section 3.2.5.

Let

Ω

be a random variable with the same distribution as

c^{T} η_{d}

and let

M_{0}

, P, Q and R be random variables for which

m_{0, j}

,

p_{j}

,

q_{j}

are

r_{j}

are realizations, respectively. Then, we have:

\begin{matrix} M_{0} & = & - \frac{λ}{2} Ξ - D \\ P & = & Ξ^{2} \\ Q & = & - Ξ Ω - Γ Ξ (α Ξ + Z) \\ R & = & \frac{1}{4} (Ω^{2} + 2 Γ (α Ξ + Z) Ω + Γ^{2} {(α Ξ + Z)}^{2}) \end{matrix}

Additionally, let S be a random variable for which

s_{j}

is a realization. Now, it must include the contribution due to the denoising cost function so that:

\begin{matrix} S & = & \frac{λ^{2}}{4} (P - \frac{β_{2}}{1 - β_{2}} μ Q + \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} μ^{2} R) + D^{2} + V \\ = & \frac{λ^{2}}{4} [Ξ^{2} + \frac{β_{2}}{1 - β_{2}} μ (Ξ Ω + Γ Ξ (α Ξ + Z)) \\ + & \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} \frac{μ^{2}}{4} (Ω^{2} + 2 Γ (α Ξ + Z) Ω + Γ^{2} {(α Ξ + Z)}^{2})] + D^{2} + V \end{matrix}

Therefore, for a given realization (

ξ

, z,

δ

,

ν

) of (

Ξ

, Z, D, V), we must solve the following fourth degree equation to get samples of

Γ

(notice that, in order to generate samples, we must previously build the random vectors

η_{d}

and

c

):

A_{1} γ^{4} + A_{2} γ^{3} + A_{3} γ^{2} + A_{4} γ + A_{5} = 0

(A11)

where:

\begin{matrix} A_{1} & ≐ & \frac{λ^{2}}{4} \frac{(β_{2} + β_{2}^{2})}{{(1 - β_{2})}^{2}} \frac{μ^{2}}{4} {(α ξ + z)}^{2} \end{matrix}

A_{2} = A_{21} + A_{22}

with

A_{21}

and

A_{22}

given by:

\begin{matrix} A_{21} & ≐ & \frac{λ^{2}}{4} \frac{β_{2}}{(1 - β_{2})} μ ξ (α ξ + z) \\ A_{22} & ≐ & \frac{λ^{2}}{8} \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} μ^{2} (α ξ + z) (Ω + \frac{δ}{\sqrt{δ^{2} + ν}} sgn (ξ) (α ξ + z)) \end{matrix}

A_{3} = A_{31} + A_{32} + A_{33} + A_{34}

with the following definitions:

\begin{matrix} A_{31} & ≐ & \frac{λ^{2}}{4} ξ^{2} \\ A_{32} & ≐ & \frac{λ^{2}}{4} \frac{β_{2}}{1 - β_{2}} μ ξ (Ω + 2 \frac{δ}{\sqrt{δ^{2} + ν}} sgn (ξ) (α ξ + z)) \\ A_{33} & ≐ & \frac{λ^{2}}{16} \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} μ^{2} (Ω^{2} + \frac{δ^{2}}{δ^{2} + ν} {(α ξ + z)}^{2} + 4 \frac{δ}{\sqrt{δ^{2} + ν}} sgn (ξ) (α ξ + z) Ω) \\ A_{34} & ≐ & δ^{2} + ν \end{matrix}

A_{4} = A_{41} + A_{42} + A_{43} + A_{44}

where:

\begin{matrix} A_{41} & ≐ & \frac{λ^{2}}{2} \frac{δ}{\sqrt{δ^{2} + ν}} sgn (ξ) ξ^{2} \\ A_{42} & ≐ & \frac{λ^{2}}{4} \frac{β_{2}}{1 - β_{2}} μ \frac{δ}{\sqrt{δ^{2} + ν}} ξ (2 sgn (ξ) Ω + \frac{δ}{\sqrt{δ^{2} + ν}} (α ξ + z)) \\ A_{43} & ≐ & \frac{λ^{2}}{8} \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} μ^{2} \frac{δ}{\sqrt{δ^{2} + ν}} Ω (\frac{δ}{\sqrt{δ^{2} + ν}} (α ξ + z) + sgn (ξ) Ω) \\ A_{44} & ≐ & 2 \frac{δ}{\sqrt{δ^{2} + ν}} sgn (ξ) (δ^{2} + ν) \end{matrix}

Finally,

A_{5} = A_{51} + A_{52} + A_{53} + A_{54}

with:

\begin{matrix} A_{51} & ≐ & \frac{λ^{2}}{4} ξ^{2} (\frac{δ^{2}}{δ^{2} + ν} - 1) \\ A_{52} & ≐ & \frac{λ^{2}}{4} \frac{β_{2}}{1 - β_{2}} μ \frac{δ^{2}}{δ^{2} + ν} ξ Ω \\ A_{53} & ≐ & \frac{λ^{2}}{16} \frac{β_{2} + β_{2}^{2}}{{(1 - β_{2})}^{2}} μ^{2} \frac{δ^{2}}{δ^{2} + ν} Ω^{2} \\ A_{54} & ≐ & λ δ ξ \end{matrix}

Finally, we can generate samples of

Δ w

as:

Δ w \approx k \cdot μ \cdot (\frac{δ}{\sqrt{δ^{2} + ν}} + γ \cdot sgn (ξ))

(A12)

Appendix B. Verification of Assumptions

Appendix B.1. Affine Growth Hypothesis for the Weights

In order to check the affine growth hypothesis that we introduced in (18), we extract the weight values from the embedding layer on each iteration. Figure A2 and Figure A3 show the evolution with k of four randomly selected weights when we respectively use: (i) SGD with orthogonal projectors, and (ii) Adam with Gaussian projectors. As is evident, the hypothesis holds quite accurately for these particular examples.

For a more general approach encompassing all weights, we carry out lineal regression on the evolution of each individual weight with k. Then, for each

j \in {1, \dots, N}

, we measure the correlation coefficient,

ρ_{j}

, between the observed values—the experimental weight values—and the predictor values. Figure A4 represents the Empirical Cumulative Distribution Function (ECDF) of

ρ_{j}

when using SGD with orthogonal projectors and Adam with both Gaussian and orthogonal projectors. We can state that, when we use Gaussian projectors with Adam optimization, for

90 %

of the weights at the embedding layer

ρ_{j} > 0.9410

and for

80 %

of them

ρ_{j} > 0.9970

. Moreover, the affine hypothesis is even stronger when using orthogonal projectors, both with SGD or Adam optimization, since for

95 %

of the weights at the embedding layer

ρ_{j} > 0.9975

.

Because these percentages show a high-linear behavior, we can confirm the validity of the affine hypothesis for the weights that is key to our theoretical analysis.

Figure A2. Evolution with k of four randomly selected weights. SGD optimization with orthogonal projectors,

λ = 20

,

T = 256

.

Figure A2. Evolution with k of four randomly selected weights. SGD optimization with orthogonal projectors,

λ = 20

,

T = 256

.

Figure A3. Evolution with k of four randomly selected weights. Adam optimization with Gaussian projectors,

λ = 1

,

T = 256

.

Figure A3. Evolution with k of four randomly selected weights. Adam optimization with Gaussian projectors,

λ = 1

,

T = 256

.

Figure A4. ECDF of the correlation coefficient

ρ

between the observed values of the weights over k and their predicted affine evolution. (a) SGD with orthogonal projectors,

λ = 20

; (b) Adam with Gaussian,

λ = 1

and orthogonal projectors,

λ = 10

.

Figure A4. ECDF of the correlation coefficient

ρ

between the observed values of the weights over k and their predicted affine evolution. (a) SGD with orthogonal projectors,

λ = 20

; (b) Adam with Gaussian,

λ = 1

and orthogonal projectors,

λ = 10

.

Appendix B.2. Negligibility of Weights at k = 0

We first check the validity of the approximation

w^{(0)} \approx 0

introduced in the analysis for SGD optimization. From (23), we can get

η

from the parameter settings detailed in Section 6.1 for the orthogonal case. We compute the gradient

g_{\nabla} ≐ \nabla_{w} f (w)

following (20) and also its approximation

g_{\nabla}^{'} ≐ - (λ / 2) \hat{φ} + (λ k μ / 4) \hat{Ψ} η

. In order to prove that

w^{(0)} \approx 0

, we can show that the preserved term, that is, the approximation

g^{'}

, is much larger than the discarded term

g_{\nabla} - g_{\nabla}^{'} = (λ / 4) \hat{Ψ} w^{(0)}

, using the Normalized Mean Squared Error (NMSE). Let

x

be a N-length vector and

x^{'}

its approximation, then the corresponding NMSE is defined as

NMSE = | | x - x^{'} {| |}^{2} / | | x^{'} {| |}^{2}

. Using this definition on

g_{\nabla}

and

g_{\nabla}^{'}

, we get an NMSE of

3.6464 \cdot 10^{- 6}

, a very small value that supports our assumption

w^{(0)} \approx 0

.

Now, we check the same approximation for the Adam analysis. From (43), we can generate samples of

γ

using the parameter settings in Section 6.1. Let us recall the following definitions:

\begin{matrix} m_{0} & = & - \hat{φ} + \frac{1}{2} \hat{Ψ} w^{(0)} \\ p_{j} & = & a_{j} - b_{j}^{T} w^{(0)} + \frac{1}{4} c_{j}^{T} w^{(0)} {(w^{(0)})}^{T} c_{j} \\ q_{j} & = & - b_{j}^{T} η + \frac{1}{2} c_{j}^{T} w^{(0)} η^{T} c_{j} \end{matrix}

for

j = 1, \dots, N

. Additionally, we define their corresponding approximations as:

\begin{matrix} m_{0}^{'} & = & - \hat{φ} \\ p_{j}^{'} & = & a_{j} \\ q_{j}^{'} & = & - b_{j}^{T} η \end{matrix}

For Gaussian projectors, we get a NMSE of

3.9803 \cdot 10^{- 3}

,

5.2153 \cdot 10^{- 3}

and

1.7148 \cdot 10^{- 3}

, for the approximations made for

m_{0}

,

p

and

q

, respectively. For orthogonal projectors, we get a NMSE of

3.6093 \cdot 10^{- 6}

,

5.1049 \cdot 10^{- 6}

and

1.8010 \cdot 10^{- 6}

, for the approximations made for vectors

m_{0}

,

p

and

q

, respectively. As we see, these are small enough values to consider that our hypothesis

w^{(0)} \approx 0

is reasonable for the analysis in Section 3.2.

References

He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV ’15), Santiago, Chile, 7–13 December 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 1026–1034. [Google Scholar]
Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access 2019, 7, 19143–19165. [Google Scholar] [CrossRef]
Le Merrer, E.; Pérez, P.; Trédan, G. Adversarial Frontier Stitching for Remote Neural Network Watermarking. arXiv 2017, arXiv:1711.01894. [Google Scholar] [CrossRef] [Green Version]
Adi, Y.; Baum, C.; Cissé, M.; Pinkas, B.; Keshet, J. Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring. arXiv 2018, arXiv:1802.04633. [Google Scholar]
Uchida, Y.; Nagai, Y.; Sakazawa, S.; Satoh, S. Embedding Watermarks into Deep Neural Networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR ’17), Bucharest, Romania, 6–9 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 269–277. [Google Scholar]
Nagai, Y.; Uchida, Y.; Sakazawa, S.; Satoh, S. Digital Watermarking for Deep Neural Networks. Int. J. Multimed. Inf. Retr. 2018, 7, 3–16. [Google Scholar] [CrossRef] [Green Version]
Cox, I.J.; Kilian, J.; Leighton, F.T.; Shamoon, T. Secure Spread Spectrum Watermarking for Multimedia. IEEE Trans. Image Process. 1997, 6, 1673–1687. [Google Scholar] [CrossRef] [PubMed]
Bottou, L. Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks; Saad, D., Ed.; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR ’15), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Rouhani, B.D.; Chen, H.; Koushanfar, F. DeepSigns: A Generic Watermarking Framework for IP Protection of Deep Learning Models. arXiv 2018, arXiv:1804.00750. [Google Scholar]
Balles, L.; Hennig, P. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. In Proceedings of the 2018 International Conference on Machine Learning (ICML ’18), Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
Wilson, A.C.; Roelofs, R.; Stern, M.; Srebro, N.; Recht, B. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS ’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA; pp. 4151–4161. [Google Scholar]
Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Zhang, K.; Zuo, W.; Zhang, L. FFDNet: Toward a Fast and Flexible Solution for CNN-Based Image Denoising. IEEE Trans. Image Process. 2018, 27, 4608–4622. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Fan, L.; Zhang, F.; Fan, H.; Zhang, C. Brief Review of Image Denoising Techniques. Vis. Comput. Ind. Biomed. Art 2019, 2, 7. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML ’15), Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
Sun, S.; Cao, Z.; Zhu, H.; Zhao, J. A Survey of Optimization Methods From a Machine Learning Perspective. IEEE Trans. Cybern. 2020, 50, 3668–3681. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
Tieleman, T.; Hinton, G. Lecture 6.5—RMSProp, COURSERA: Neural Networks for Machine Learning; University of Toronto: Toronto, ON, Canada, 2012. [Google Scholar]
Wang, T.; Kerschbaum, F. Attacks on Digital Watermarks for Deep Neural Networks. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’19), Brighton, UK, 12–17 May 2019; pp. 2622–2626. [Google Scholar]
Burden, R.L.; Faires, J.D. Numerical Analysis, 9th ed.; Brooks/Cole: Boston, MA, USA, 2010. [Google Scholar]
Geyer, C.J. Practical Markov Chain Monte Carlo. Stat. Sci. 1992, 7, 493–497. [Google Scholar] [CrossRef]
Cachin, C. An Information-Theoretic Model for Steganography. In Information Hiding; Lecture Notes in Computer Science; Aucsmith, D., Ed.; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1525. [Google Scholar]
Comesaña, P. Detection and information theoretic measures for quantifying the distinguishability between multimedia operator chains. In Proceedings of the IEEE Workshop on Information Forensics and Security (WIFS12), Tenerife, Spain, 2–5 December 2012. [Google Scholar]
Barni, B.; Tondi, B. The Source Identification Game: An Information-Theoretic Perspective. IEEE Trans. Inf. Forensics Secur. 2013, 8, 450–463. [Google Scholar] [CrossRef]
Tassano, M.; Delon, J.; Veit, T. An Analysis and Implementation of the FFDNet Image Denoising Method. Image Process. Line 2019, 9, 1–25. [Google Scholar] [CrossRef]
Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; Zhang, L. Waterloo Exploration Database: New Challenges for Image Quality Assessment Models. IEEE Trans. Image Process. 2017, 26, 1004–1016. [Google Scholar]
Franzen, R. Kodak Lossless True Color Image Suite. 1999. Available online: http://r0k.us/graphics/kodak (accessed on 22 September 2020).
Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In Proceedings of the 8th Int’l Conf. Computer Vision (ICCV 2001), Vancouver, BC, Canada, 7–14 July 2001; pp. 416–423. [Google Scholar]
Wilson, S.G. Digital Modulation and Coding; Pearson: London, UK, 1995. [Google Scholar]

Figure 1. Architecture of the host network FFDNet.

Figure 2. Histograms from the embedding layer

l = 2

(

T = 256

,

λ = 1

and k = 32,140). (a) histogram of

w^{(0)}

; (b) histogram of

w^{(k)}

; (c) histogram of

Δ w = w^{(k)} - w^{(0)}

.

Figure 2. Histograms from the embedding layer

l = 2

(

T = 256

,

λ = 1

and k = 32,140). (a) histogram of

w^{(0)}

; (b) histogram of

w^{(k)}

; (c) histogram of

Δ w = w^{(k)} - w^{(0)}

.

Figure 3. Empirical histograms from the denoising gradients (a) distribution of the mean denoising gradient, D; (b) distribution of the variance of the batching noise, H.

Figure 4. Empirical histograms after the watermark embedding using SGD. (a) histogram of

w^{(k)}

, Gaussian,

λ = 5

, and orthogonal projectors,

λ = 20

; (b) histogram of

Δ w

, Gaussian,

λ = 5

; (c) histogram of

Δ w

, orthogonal,

λ = 20

.

Figure 4. Empirical histograms after the watermark embedding using SGD. (a) histogram of

w^{(k)}

, Gaussian,

λ = 5

, and orthogonal projectors,

λ = 20

; (b) histogram of

Δ w

, Gaussian,

λ = 5

; (c) histogram of

Δ w

, orthogonal,

λ = 20

.

Figure 5. Theoretical histograms of

Δ w

for SGD, orthogonal projectors,

λ = 20

. (a) only watermarking function, Equation (24); (b) including denoising and watermarking functions, Equation (48).

Figure 5. Theoretical histograms of

Δ w

for SGD, orthogonal projectors,

λ = 20

. (a) only watermarking function, Equation (24); (b) including denoising and watermarking functions, Equation (48).

Figure 6. Empirical histograms of

w^{(k)}

after the watermark embedding using Adam. (a) Gaussian,

λ = 0.05

and

λ = 1

; (b) orthogonal,

λ = 0.5

and

λ = 10

.

Figure 6. Empirical histograms of

w^{(k)}

after the watermark embedding using Adam. (a) Gaussian,

λ = 0.05

and

λ = 1

; (b) orthogonal,

λ = 0.5

and

λ = 10

.

Figure 7. Empirical histograms of

Δ w

after the watermark embedding using Adam. (a) Gaussian,

λ = 0.05

; (b) Gaussian,

λ = 1

; (c) orthogonal,

λ = 0.5

; (d) orthogonal,

λ = 10

.

Figure 7. Empirical histograms of

Δ w

after the watermark embedding using Adam. (a) Gaussian,

λ = 0.05

; (b) Gaussian,

λ = 1

; (c) orthogonal,

λ = 0.5

; (d) orthogonal,

λ = 10

.

Figure 8. Pdf of

Γ

, Gaussian projectors.

Figure 8. Pdf of

Γ

, Gaussian projectors.

Figure 9. Theoretical histograms of

Δ w

for Adam with only watermarking function using Equations (43) and (44). (a) Gaussian,

λ = 0.05

; (b) Gaussian,

λ = 1

; (c) orthogonal,

λ = 0.5

; (d) orthogonal,

λ = 10

.

Figure 9. Theoretical histograms of

Δ w

for Adam with only watermarking function using Equations (43) and (44). (a) Gaussian,

λ = 0.05

; (b) Gaussian,

λ = 1

; (c) orthogonal,

λ = 0.5

; (d) orthogonal,

λ = 10

.

Figure 10. Theoretical histograms of

Δ w

for Adam with denoising and watermarking functions using Equations (A11) and (A12). (a) Gaussian,

λ = 0.05

; (b) Gaussian,

λ = 1

; (c) orthogonal,

λ = 0.5

; (d) orthogonal,

λ = 10

.

Figure 10. Theoretical histograms of

Δ w

for Adam with denoising and watermarking functions using Equations (A11) and (A12). (a) Gaussian,

λ = 0.05

; (b) Gaussian,

λ = 1

; (c) orthogonal,

λ = 0.5

; (d) orthogonal,

λ = 10

.

Figure 11. Empirical histograms of

w^{(k)}

after the watermark embedding using Adam. (a) Gaussian,

λ = 0.05

and

λ = 1

; (b) orthogonal,

λ = 0.5

and

λ = 10

.

Figure 11. Empirical histograms of

w^{(k)}

after the watermark embedding using Adam. (a) Gaussian,

λ = 0.05

and

λ = 1

; (b) orthogonal,

λ = 0.5

and

λ = 10

.

Figure 12. Empirical histograms of

Δ w

after the watermark embedding using Adam. (a) Gaussian,

λ = 0.05

; (b) Gaussian,

λ = 1

; (c) orthogonal,

λ = 0.5

; (d) orthogonal,

λ = 10

.

Figure 12. Empirical histograms of

Δ w

after the watermark embedding using Adam. (a) Gaussian,

λ = 0.05

; (b) Gaussian,

λ = 1

; (c) orthogonal,

λ = 0.5

; (d) orthogonal,

λ = 10

.

Figure 13. (a) BER vs. pruning rate for Adam and BOP (pruning all layers or only the watermarked one does not have any impact on BER); (b) PSNR vs. pruning rate for Adam and BOP for the Kodak24 dataset; (c) PSNR vs. pruning rate for Adam and BOP for the CBSD68 dataset.

Table 1. FFDNet configurations for grayscale and RGB image denoising.

	Grayscale	RGB
Conv layers	15	12
Feature maps per layer	64	96
Receptive field	$62 \times 62$	$50 \times 50$

Table 2. PSNR (dB) results with noise level

σ = 25

, number of iterations k needed to converge, KLD and SIKLD between the distributions of

w^{(k)}

and

w^{(0)}

.

Table 2. PSNR (dB) results with noise level

σ = 25

, number of iterations k needed to converge, KLD and SIKLD between the distributions of

w^{(k)}

and

w^{(0)}

.

	SGD		Adam				BOP
	Gaussian	Orth.	Gaussian		Orth.		Gaussian		Orth.
$λ$	5	20	$0.05$	1	$0.5$	10	$0.05$	1	$0.5$	10
CBSD68	$30.76$	$31.09$	$31.18$	$31.17$	$31.21$	$31.16$	$31.20$	$31.16$	$31.19$	$31.15$
Kodak24	$31.66$	$32.03$	$32.13$	$32.10$	$32.15$	$32.10$	$32.15$	$32.08$	$32.14$	$32.09$
k	42,780	98,510	43,590	32,140	27,110	57,230	40,880	14,840	33,180	7150
KLD	$0.0477$	$0.0238$	$0.1149$	$0.2779$	$0.0281$	$0.8118$	$0.0463$	$0.0443$	$0.0227$	$0.0253$
SIKLD	$0.0468$	$0.0206$	$0.0879$	$0.2112$	$0.0280$	$0.4707$	$0.0449$	$0.0266$	$0.0197$	$0.0226$

Table 3. Position of both side spikes in the histograms of

Δ w

obtained from theoretical and empirical results.

Table 3. Position of both side spikes in the histograms of

Δ w

obtained from theoretical and empirical results.

	$λ$	Theoretical	Empirical
Gaussian	$0.05$	$- 0.04243$	$- 0.04278$
	$0.05$	$0.04239$	$0.04276$
	1	$- 0.03129$	$- 0.03119$
	1	$0.03126$	$0.03125$
Orthogonal	$0.5$	$- 0.02710$	$- 0.02710$
	$0.5$	$0.02710$	$0.02709$
	10	$- 0.05720$	$- 0.05717$
	10	$0.05720$	$0.05718$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cortiñas-Lorenzo, B.; Pérez-González, F. Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks. Entropy 2020, 22, 1379. https://doi.org/10.3390/e22121379

AMA Style

Cortiñas-Lorenzo B, Pérez-González F. Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks. Entropy. 2020; 22(12):1379. https://doi.org/10.3390/e22121379

Chicago/Turabian Style

Cortiñas-Lorenzo, Betty, and Fernando Pérez-González. 2020. "Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks" Entropy 22, no. 12: 1379. https://doi.org/10.3390/e22121379

APA Style

Cortiñas-Lorenzo, B., & Pérez-González, F. (2020). Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks. Entropy, 22(12), 1379. https://doi.org/10.3390/e22121379

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks

Abstract

1. Introduction

Notation

2. Preliminaries

2.1. Host Network: FFDNet

2.2. Optimization Algorithms

2.2.1. SGD Optimization

2.2.2. Adam Optimization

2.3. Digital Watermarking Algorithm

2.3.1. Embedding Elements

2.3.2. Embedding Process

2.3.3. Detectability Issues

2.3.4. Gaussian and Orthogonal Projection Vectors

3. Theoretical Analysis

3.1. Analysis for SGD

3.2. Analysis for Adam

3.2.1. Mean of the Gradient

3.2.2. Variance of the Gradient

3.2.3. Update Term

3.2.4. Rationale for the Sign Function

3.2.5. A Theoretical Expression for Δ w

3.3. The Denoising Term

3.3.1. SGD

3.3.2. Adam

4. Block-Orthonormal Projections (BOP)

5. Information-Theoretic Measures

6. Experiments and Results

6.1. Experimental Set-Up

6.1.1. Training the Host Network

6.1.2. Watermark Embedding

6.2. Experimental Results

6.2.1. Empirical Denoising Gradients

6.2.2. SGD

6.2.3. Adam

6.2.4. BOP

7. Conclusions

Author Contributions

Funding

Conflicts of Interest

Appendix A. Mathematical Derivations

Appendix A.1. Projected Weights at k = 0

Appendix A.2. Adam: Mean of the Gradient

Appendix A.3. Adam: Variance of the Gradient

Appendix A.4. Adam: A Projection-Based Decomposition of c j T sgn ( φ ^ )

Appendix A.4.1. Decomposition for Gaussian Projectors

Appendix A.4.2. Decomposition for Orthogonal Projectors

Appendix A.5. Adam: Analysis with Denoising and Watermarking

Appendix B. Verification of Assumptions

Appendix B.1. Affine Growth Hypothesis for the Weights

Appendix B.2. Negligibility of Weights at k = 0

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.2.5. A Theoretical Expression for $Δ w$

Appendix A.4. Adam: A Projection-Based Decomposition of $c_{j}^{T} sgn (\hat{φ})$