Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks

As training Deep Neural Networks (DNNs) becomes more expensive, the interest in protecting the ownership of the models with watermarking techniques increases. Uchida et al. proposed a digital watermarking algorithm that embeds the secret message into the model coefficients. However, despite its appeal, in this paper, we show that its efficacy can be compromised by the optimization algorithm being used. In particular, we found through a theoretical analysis that, as opposed to Stochastic Gradient Descent (SGD), the update direction given by Adam optimization strongly depends on the sign of a combination of columns of the projection matrix used for watermarking. Consequently, as observed in the empirical results, this makes the coefficients move in unison giving rise to heavily spiked weight distributions that can be easily detected by adversaries. As a way to solve this problem, we propose a new method called Block-Orthonormal Projections (BOP) that allows one to combine watermarking with Adam optimization with a minor impact on the detectability of the watermark and an increased robustness.


Introduction
Deep learning has substantially impacted technology over the last years, becoming an important center of attention for researchers all over the world. Such a great impact stems from the versatility it offers as well as the excellent results that DNNs achieve on multiple tasks, like image classification or speech recognition, which often reach and even surpass human-level performance [1,2]. However, far from being a simple task, the design of new DNNs is generally expensive not only in terms of human effort and time needed to build effective model architectures but mostly because of the large volume of suitable data that must be gathered and the vast amount of computational resources and power used for training. Consequently, businesses owning costly models are interested in protecting them from any illicit use, and this growing need has recently led researchers to a common concern on how to embed watermarks on DNNs. As a result, several frameworks for protecting the intellectual property of neural networks were proposed in the literature, and can be classified as black-box or white-box approaches. The first kind of methods (i.e., black-box) do not need to access model parameters for detecting the presence of watermarks. Instead, key inputs are introduced as special triggers to identify the original network, either by the use of the so-called adversarial examples [3] or backdoor poisoning [4].
White-box approaches, in contrast, directly affect the model parameters in order to embed watermarks. One of the most significant white-box contributions was proposed by Uchida et al. in [5,6], the algorithm under study in this paper. This watermarking framework employs a regularization term

• We provide mathematical and experimental evidence for SGD and Adam to show that: (1) in contrast to SGD, the changes in the distribution of weights caused by Adam can be easily detected when embedding watermarks following the approach in [5,6] and, hence, (2) the use of Adam considerably increases the detectability of the watermark. For the purpose of carrying out this analysis, we use FFDNet [14], a DNN that performs image denoising tasks, as the host network.

• We introduce a novel method based on orthogonal projections to solve the detectability problem that arises when watermarking a DNN which is being optimized with Adam. A side effect of this novel method is an increased robustness against weight pruning.
The remainder of this paper is organized as follows: Section 1 introduces the notation and Section 2 explains in more detail the frameworks and algorithms used in this study (host network, optimization algorithms, and watermarking method). Section 3 presents the mathematical core that allows us to model the observed effects on the histograms of weights once the embedding process has finished. Then, in Section 4, we introduce BOP as a solution for using Adam and watermarking simultaneously, Section 5 presents the information-theoretic measures that we will implement, and Section 6 shows the experimental results. Finally, we point out some concluding remarks in Section 7. Two appendices give additional details on mathematical derivations (Appendix A) and the validity of certain assumptions (Appendix B).

Notation
In this paper, we use the following notation. Matrices and vectors are denoted by upper-case and lower-case boldface characters, respectively, while random variables and their realizations are respectively represented by upper-case and lower-case characters.
For matrix and vector operations, we proceed as follows. As an example, let A be a matrix. Then, its transpose is denoted by $A^T$. Moreover, we use Tr[A] to represent the trace of A and $(A)_{i,j}$ to denote the (i, j)th element of A. $I_N$ refers to the N × N identity matrix. We use column vectors unless otherwise stated. In addition, we use 0 to denote a column vector of zeros and 1 for a column vector of ones. Let w be a column vector of length N; then, $\nabla_w$ is the gradient operator with respect to w, that is:

$$\nabla_w = \left[\frac{\partial}{\partial w_1}, \cdots, \frac{\partial}{\partial w_N}\right]^T$$

We use the operator • to denote the Hadamard (i.e., sample-wise) product and ⊗ for the Kronecker product. Finally, E{·} and Var{·} denote the mathematical expectation and the variance, respectively.

Host Network: FFDNet
The rapid development of Deep Learning over the last few years has led to new advances in the field of image restoration [15]. Several Convolutional Neural Networks (CNNs) have been designed to replace classical methods and, often, they offer new competitive advantages. This is the case of FFDNet [14], which performs image denoising tasks and is used in our work as the exemplary host network that is watermarked by means of the algorithm proposed in [5,6].

Image denoising is the task of removing noise from a given image. Let y be the input noisy image, x the clean image, and n the noise, which is usually modeled as zero-mean Additive White Gaussian Noise (AWGN); then we have y = x + n and we wish to obtain an estimate $\hat{x}$ of the clean image. As opposed to other CNNs for image denoising, FFDNet works on downsampled sub-images, and it is able to adapt to several noise levels using only a single network. For that purpose, a noise level map M is also included as an input, so that the function FFDNet aims to learn can be expressed as $\hat{x} = F(y, M; w)$, where w represents the parameters of the network. Furthermore, FFDNet can handle spatially variant noise and offers competitive inference speed without sacrificing denoising performance [14].

Figure 1 shows the architecture of FFDNet. As we can see, it is composed of a downscaling operation on the input images; a nonlinear mapping consisting of Convolutional Layers, Batch Normalization steps [16], and ReLU activation functions; and, finally, an upscaling process to generate denoised images with the original size. Let $\{(y_i, x_i)\}_{i=1}^{L}$ be L noisy-clean image pairs from the training dataset, with $M_i$ the noise level map of the ith pair; the denoising cost function is:

$$f_0(w) = \frac{1}{2L}\sum_{i=1}^{L} \left\| F(y_i, M_i; w) - x_i \right\|^2 \tag{1}$$

FFDNet can be used for grayscale or color images. Table 1 shows the main differences between both configurations. As we can see, the total number of convolutional layers can be set to 15 or 12 for grayscale or RGB denoising, respectively. 
The number of feature maps and the size of the receptive field also differ, but this is not the case of the kernel size, which is kept to 3 × 3 for either grayscale or RGB. In this paper, we implement the RGB version of FFDNet and proceed as follows: (1) train FFDNet from scratch using Adam optimization without embedding any watermark and (2) fine-tune the network to embed the desired watermark, using both Adam and SGD to compare the results.

Optimization Algorithms
The training of a DNN is an iterative process that makes it possible for the model to learn how to perform a given task. This is certainly the most challenging optimization problem when implementing deep learning models from scratch. In order to increase the efficiency of the training process, researchers have developed several optimization techniques in the last years, each of them with its own advantages and drawbacks [17].
An optimization algorithm determines the weight update rule that must be applied at every iteration. The goal is usually to minimize a cost function (also known as loss function), which generally compares predictions with expected values and computes an error metric that evaluates the performance of the network. The learning rate µ (or step size) is the hyperparameter that controls how much the weights can change after each iteration of the optimization algorithm.
In the following sections, we briefly review the mechanics of two widely known optimization algorithms: SGD and Adam.

SGD Optimization
SGD [8] is a classic optimization algorithm based on gradient descent and one of the most widely used. However, unlike standard gradient descent techniques that use all the training samples to compute the gradient of the cost function, SGD uses a small number of samples from the dataset (a minibatch) and then takes the average over these samples to get an estimate of the gradient, $\nabla_w f(w)$. Then, the update rule is given by:

$$w^{(k)} = w^{(k-1)} - \mu \nabla_w f(w^{(k-1)}) \tag{2}$$

where $w^{(k)}$ is the weight vector at iteration k and $w^{(0)}$ is its initial value.
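As a minimal sketch of the update rule in (2), the following NumPy snippet averages per-sample gradients over a minibatch and takes one step against that estimate. The least-squares toy problem, the helper names (`sgd_step`, `toy_grad`) and the step size are illustrative assumptions and are unrelated to the FFDNet experiments.

```python
import numpy as np

def sgd_step(w, grad_fn, batch, mu):
    """One SGD update: w <- w - mu * (gradient averaged over the minibatch)."""
    g = np.mean([grad_fn(w, sample) for sample in batch], axis=0)
    return w - mu * g

# Hypothetical toy problem: per-sample loss 0.5 * (a.w - t)^2.
def toy_grad(w, sample):
    a, t = sample
    return (a @ w - t) * a

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
data = [(a, a @ w_true) for a in rng.normal(size=(256, 2))]

w = np.zeros(2)  # w^(0)
for k in range(2000):
    idx = rng.choice(len(data), size=16, replace=False)  # draw a minibatch
    w = sgd_step(w, toy_grad, [data[i] for i in idx], mu=0.05)
```

Because the toy targets are noise-free, the iterates approach `w_true`; with noisy data the minibatch estimate would make the trajectory fluctuate around the minimizer instead.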

Adam Optimization
Adam [9] is a popular optimization algorithm that combines the ideas of AdaGrad [18] and RMSProp [19]. Among its advantages, Adam is easy to implement, computationally efficient, fast, and suitable for complex settings with noisy and sparse gradients.
Just like SGD, Adam estimates the gradient from the samples of a minibatch, but, as opposed to SGD, it uses estimates of the first and second moments of the gradients to compute individual adaptive learning rates for each parameter in the network. To do so, exponential moving averages of the gradient and the squared gradient are calculated, with two hyperparameters, $\beta_1$ and $\beta_2$, controlling the respective decay rates. Let $m^{(0)} = 0$ and $v^{(0)} = 0$ be the initial values for the first and second moment vectors, respectively; then the steps of this algorithm at the kth iteration are the following [9]:

$$g^{(k)} = \nabla_w f(w^{(k-1)}) \tag{3}$$
$$m^{(k)} = \beta_1 m^{(k-1)} + (1-\beta_1)\, g^{(k)} \tag{4}$$
$$v^{(k)} = \beta_2 v^{(k-1)} + (1-\beta_2)\, g^{(k)} \bullet g^{(k)} \tag{5}$$
$$\hat{m}^{(k)} = \frac{m^{(k)}}{1-\beta_1^k} \tag{6}$$
$$\hat{v}^{(k)} = \frac{v^{(k)}}{1-\beta_2^k} \tag{7}$$
$$w^{(k)} = w^{(k-1)} - \mu \frac{\hat{m}^{(k)}}{\sqrt{\hat{v}^{(k)}} + \epsilon} \tag{8}$$

where $\hat{m}^{(k)}$ and $\hat{v}^{(k)}$ are the bias-corrected estimates of the first and second moments, respectively, f(w) is the cost function and $\epsilon$ is a very small number that avoids dividing by zero; the square root and the division in (8) are applied element-wise. The default set-up for the hyperparameters is [9]: $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 1 \cdot 10^{-8}$.
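The Adam steps described above can be sketched in NumPy as follows. The function name `adam_step` and the toy gradient are assumptions made for illustration; the default hyperparameters are the ones quoted from [9]. The example also hints at the property that matters later in the paper: on the first iteration every coordinate moves by roughly µ, regardless of the gradient magnitude, so only the sign of the gradient matters.

```python
import numpy as np

def adam_step(w, g, m, v, k, mu=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration (k starts at 1); returns updated (w, m, v)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment moving average
    v = beta2 * v + (1 - beta2) * g * g      # second-moment moving average
    m_hat = m / (1 - beta1 ** k)             # bias correction
    v_hat = v / (1 - beta2 ** k)
    w = w - mu * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy gradient with very different magnitudes per coordinate.
w0 = np.zeros(3)
g = np.array([5.0, -0.01, 200.0])
w1, m1, v1 = adam_step(w0, g, np.zeros(3), np.zeros(3), k=1)
# After one step, each coordinate has moved by about mu against the sign of g.
```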

Digital Watermarking Algorithm
In this paper, we analyze the digital watermarking algorithm proposed in [5,6]. As we mentioned previously, we employ the fine-tune-to-embed approach; therefore, the embedding function is applied only during some additional epochs after convergence is achieved for the original task (in this case, image denoising).

Embedding Elements
We wish to embed a T-bit sequence, $b \in \{0, 1\}^T$, into a certain layer l of the host DNN. For this specific layer, let (S, S), I and F represent the size of the convolution filter, the depth of input to the convolutional layer and the number of filters in that layer, respectively. Then, the weights can be represented by a tensor $W_l$ with dimensions S × S × I × F and then rearranged to form a vector $w_l$ of length $N = S^2 I F$.
One important point here is that, instead of directly using the weight vector $w_l$, the authors in [5,6] suggest including an initial transformation of these coefficients. In order to reflect this, we must calculate the mean of $W_l$ over the F kernels. As a result, we obtain a new flattened vector $\hat{w}_l$ with M elements, where $M = S^2 I$. This transformation can also be formulated if we introduce a new matrix Θ of size N × M:

$$\hat{w}_l = \Theta^T w_l, \qquad \Theta = I_M \otimes h$$

Here, h is a column vector of length F with all of its elements set to 1/F, i.e., $h^T h = 1/F$. In order to move to the T-dimensional space, the authors in [5,6] introduce a secret projection matrix Φ. The size of this projection matrix (also referred to as regularizer parameter in [5,6]) is M × T, so that each column corresponds to a particular projection vector $\phi^{(i)}$, i = 1, · · · , T.
For the purpose of the subsequent theoretical analysis, we pair up matrices Φ and Θ and, therefore, we ascribe the initial transformation to the projection matrix, preserving the notation with the original weight vector $w_l$. To that end, we define a new N × T projection matrix:

$$\tilde{\Phi} = \Theta \Phi$$

so that now the projection vectors can be expressed as $\tilde{\phi}^{(i)} = \Theta \phi^{(i)}$.
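The embedding elements can be sketched numerically as follows. This is an illustration under stated assumptions: the layer sizes are toy values, the weight tensor is flattened in row-major order (so the filter index varies fastest), and the averaging matrix is built as a Kronecker product of an identity with h, which is one way to obtain an N × M matrix that averages over the F kernels. The names `Theta`, `Phi` and `Phi_tilde` are hypothetical.

```python
import numpy as np

S, I, F, T = 3, 4, 8, 16            # toy sizes (assumed, not from the paper)
M, N = S * S * I, S * S * I * F
rng = np.random.default_rng(0)

W = rng.normal(scale=0.03, size=(S, S, I, F))   # layer weights W_l
w = W.reshape(N)                    # row-major flattening: filter index is fastest

h = np.full((F, 1), 1.0 / F)        # averaging vector, h^T h = 1/F
Theta = np.kron(np.eye(M), h)       # N x M averaging matrix (assumed construction)
w_hat = Theta.T @ w                 # mean over the F kernels, length M

Phi = rng.normal(size=(M, T))       # secret projection matrix (Random technique)
Phi_tilde = Theta @ Phi             # N x T matrix acting directly on w
projections = Phi_tilde.T @ w       # same values as Phi.T @ w_hat
```

Folding the averaging step into the projection matrix is what lets the later analysis work directly with the original weight vector.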

Embedding Process
Now that we have introduced the basic elements, the embedding procedure can be described as follows [5,6]. Let w be a vector containing all the parameters of the network; then the watermarking regularizer, $f_{\text{watermark}}(w_l)$, is added to the global cost function f(w):

$$f(w) = f_0(w) + \lambda f_{\text{watermark}}(w_l) \tag{10}$$

where $f_0(w)$ is the original cost function and λ is the regularization parameter. The regularizer term is a composition of two functions, the cross-entropy and the sigmoid:

$$f_{\text{watermark}}(w_l) = -\sum_{i=1}^{T} \left[ b_i \log(y_i) + (1-b_i)\log(1-y_i) \right]$$

with $y_i = 1/\left(1 + \exp(-(\tilde{\phi}^{(i)})^T w_l)\right)$, where $\tilde{\phi}^{(i)}$ is the ith projection vector after the initial transformation. In order to minimize (10), $y_i$ will approach the value of $b_i$. Therefore, the sigmoid function will force each projection $(\tilde{\phi}^{(i)})^T w_l$, i = 1, · · · , T, to progressively move towards +∞ or −∞ depending on whether $b_i = 1$ or $b_i = 0$, respectively. For successfully embedding the secret message, it is generally enough to guarantee that each projection lies on the proper side of the horizontal axis. When this happens, we reach a Bit Error Rate (BER) of 0 and all the projected weights are aligned with their corresponding bits, that is, they are positive when the bit is 1 and, conversely, negative when the bit is 0.
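The regularizer and the BER criterion can be illustrated with a short NumPy sketch. For brevity, and as an assumption of this example only, the N × T projection matrix is drawn directly as a scaled Gaussian matrix and the denoising term is left out, so plain gradient descent acts on the watermark regularizer alone. The function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def watermark_loss_and_grad(w, Phi_tilde, b):
    """Cross-entropy between the bits b and the sigmoid of the projections."""
    y = sigmoid(Phi_tilde.T @ w)
    y_safe = np.clip(y, 1e-12, 1 - 1e-12)            # avoid log(0)
    loss = -np.sum(b * np.log(y_safe) + (1 - b) * np.log(1 - y_safe))
    grad = Phi_tilde @ (y - b)                        # gradient w.r.t. w
    return loss, grad

def bit_error_rate(w, Phi_tilde, b):
    """A bit is decoded as 1 when its projection is positive."""
    decoded = (Phi_tilde.T @ w > 0).astype(int)
    return np.mean(decoded != b)

rng = np.random.default_rng(0)
N, T = 288, 16                                        # toy sizes (assumed)
Phi_tilde = rng.normal(size=(N, T)) / np.sqrt(N)      # stand-in projection matrix
b = rng.integers(0, 2, size=T)                        # secret message
w = rng.normal(scale=0.01, size=N)                    # small initial weights

loss_start, _ = watermark_loss_and_grad(w, Phi_tilde, b)
for _ in range(200):                                  # regularizer-only descent
    _, grad = watermark_loss_and_grad(w, Phi_tilde, b)
    w -= 1.0 * grad
loss_end, _ = watermark_loss_and_grad(w, Phi_tilde, b)
ber = bit_error_rate(w, Phi_tilde, b)
```

In this toy setting the projections quickly move to the correct side of zero, which is all that is needed for a BER of 0.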

Detectability Issues
In this paper, we employ the Random technique proposed by the authors, in which the values of the projection matrix Φ (before applying the transformation in (9)) are independent samples from a standard normal distribution N(0, 1). Their results [5,6] show that this Random approach is the most appealing design method for Φ because, as they indicate, it does not significantly alter the distribution of weights at the embedding layer. However, there are some detectability issues here that should be considered.
On the one hand, the authors in [20] show that the standard deviation of the distribution of weights grows with the length of the embedded message. This information can be used by adversaries for detecting the watermark and even overwriting it.
On the other hand, one of the main conclusions of our work is that the presence or absence of alterations to the shape of the weight distributions is a consequence of the optimization algorithm used during the watermark embedding. In particular, the authors in [5,6] employ SGD with momentum in their experiments and the distributions of weights remain unchanged, yet the use of Adam would significantly alter the shape of the distributions even though we apply the same Random technique. In this paper, we will show that the results in [5,6] regarding the undetectability of the watermark do not hold when we use Adam optimization.
As an example to visualize this peculiar behavior shown by Adam when we employ the watermarking algorithm proposed in [5,6], we plot in Figure 2 the resulting histograms when T = 256 and λ = 1; specifically, the histograms of the weights before and after the embedding (corresponding to k = 32,140) and the histogram of the weight variations, respectively. As we can see, the distribution of the original weights has significantly changed, turning into a two-spiked shape that could be easily detected by an adversary. The complete set of histograms will be later shown in Section 6.

Gaussian and Orthogonal Projection Vectors
In addition to the Random technique suggested by the authors of this watermarking algorithm (whose projection vectors will be referred to as Gaussian projection vectors in this paper), we will also implement orthonormal projectors. In order to build these kinds of projectors, we first generate the projection matrix following the Random technique; that is, samples are drawn from a standard normal distribution N(0, 1). Then, from the Singular Value Decomposition of this projection matrix, we obtain an orthonormal basis for the column space so that we have $\Phi^T \Phi = I_T$. Notice that once we apply the initial transformation (i.e., $\tilde{\Phi} = \Theta\Phi$) the resulting projection vectors $\tilde{\phi}^{(i)}$, i = 1, · · · , T, will still preserve the orthogonality between them, although they will not be normalized:

$$\tilde{\Phi}^T \tilde{\Phi} = \Phi^T \Theta^T \Theta \Phi = \frac{1}{F} \Phi^T \Phi = \frac{1}{F} I_T$$

Therefore, these kinds of projectors will be referred to as orthogonal. As we will see later from KLD results and histograms, implementing orthogonal projectors may help us to better preserve both the original shape of the weight distribution and the denoising performance.
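A minimal sketch of this construction, assuming toy layer sizes and the same Kronecker-based averaging matrix used for illustration elsewhere (both are assumptions of this example): the left singular vectors of a Gaussian matrix give an orthonormal Φ, and after the averaging transformation the columns stay orthogonal with squared norm 1/F.

```python
import numpy as np

S, I, F, T = 3, 4, 8, 16                    # toy sizes (assumed)
M = S * S * I
rng = np.random.default_rng(0)

G = rng.normal(size=(M, T))                 # Random technique: N(0, 1) samples
U, _, _ = np.linalg.svd(G, full_matrices=False)
Phi = U                                     # M x T, orthonormal columns

Theta = np.kron(np.eye(M), np.full((F, 1), 1.0 / F))   # averaging matrix (assumed form)
Phi_tilde = Theta @ Phi                     # N x T orthogonal projectors

gram_before = Phi.T @ Phi                   # identity
gram_after = Phi_tilde.T @ Phi_tilde        # identity scaled by 1/F
```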

Theoretical Analysis
From now on, we will omit the sub-index l for the sake of clarity, although we are always addressing the coefficients of the embedding layer. The experiments clearly illustrate that the use of Adam optimization together with the watermarking algorithm proposed in [5,6] gives rise to noticeable changes in the distribution of weights, as we see in Figure 2. In the following analysis, we delve into the reasons why this happens. To that end, we aim to get a theoretical expression of $\Delta w = w^{(k)} - w^{(0)}$. This will allow us to prove and understand the nature of the observed behavior of the weights when watermark embedding is carried out.

We start off by defining vector $\bar{\phi}$ and matrix $\tilde{\Psi}$, where $\tilde{\phi}^{(i)} = \Theta\phi^{(i)}$ are the transformed projection vectors and $\tilde{\Phi}$ is the N × T matrix that has them as columns:

$$\bar{\phi} = \sum_{i=1}^{T} \tilde{\phi}^{(i)} = \tilde{\Phi}\mathbf{1}, \qquad \tilde{\Psi} = \sum_{i=1}^{T} \tilde{\phi}^{(i)} (\tilde{\phi}^{(i)})^T = \tilde{\Phi}\tilde{\Phi}^T$$

Notice that $\tilde{\Psi} = \tilde{\Psi}^T$. In addition, for the case of orthogonal projectors, the following properties can be straightforwardly proven:

$$\tilde{\Psi}\bar{\phi} = \frac{1}{F}\bar{\phi} \tag{13}$$
$$\tilde{\Psi}\tilde{\Psi} = \frac{1}{F}\tilde{\Psi} \tag{14}$$

These properties will come in handy later on in several theoretical derivations.

Firstly, in order to simplify the analysis and understand more clearly how the watermarking cost function impacts the movement of the weights when using both SGD and Adam optimization, we will just consider the presence of the regularization term, that is, we will not include the denoising cost function for now. The influence of the denoising part will be studied in Section 3.3. Therefore, given our embedding cost function in (10) and assuming for simplicity (and without loss of generality) that all embedded symbols are +1, we have:

$$\tilde{f}(w) = \lambda f_{\text{watermark}}(w) = -\lambda \sum_{i=1}^{T} \log \frac{1}{1 + \exp(-(\tilde{\phi}^{(i)})^T w)}$$

If we compute the gradient of this function, we obtain:

$$\nabla_w \tilde{f}(w) = -\lambda \sum_{i=1}^{T} \frac{\tilde{\phi}^{(i)}}{1 + \exp((\tilde{\phi}^{(i)})^T w)} \tag{15}$$

In order to simplify the subsequent analysis, we introduce a series of assumptions which are based on empirical observations or hypotheses that will be duly verified. By construction, it is possible to show that the mean of $(\tilde{\phi}^{(i)})^T w^{(0)}$ is zero and its variance is $(N/F^2)\mathrm{Var}\{w^{(0)}\}$ for Gaussian projectors and $(1/F)\mathrm{Var}\{w^{(0)}\}$ for orthogonal projectors (see Appendix A.1).
Since the variance of the weights at the initial iteration is generally very small (in our experiments, it is 0.0012), it can be considered that the variance of $(\tilde{\phi}^{(i)})^T w^{(0)}$ will also be small enough so that we can assume $|(\tilde{\phi}^{(i)})^T w| \ll 1$ for all i = 1, · · · , T. Although this assumption might not be strictly true for all k (especially once we have crossed the linear region of the sigmoid function), it is reasonably good and it allows us to use a first-order Taylor expansion:

$$\frac{1}{1 + \exp((\tilde{\phi}^{(i)})^T w)} \approx \frac{1}{2} - \frac{1}{4}(\tilde{\phi}^{(i)})^T w \tag{16}$$

Plugging (16) into (15) and using the definitions given above, we can write:

$$\nabla_w \tilde{f}(w) \approx -\frac{\lambda}{2}\bar{\phi} + \frac{\lambda}{4}\tilde{\Psi} w \tag{17}$$

We introduce now one important hypothesis in this theoretical analysis to handle the previous equation: we assume that $w^{(k)}$ grows approximately affinely with k:

$$w^{(k)} \approx w^{(0)} + k\mu\eta \tag{18}$$

where η is a vector that contains the slopes for each weight, and it is to be determined in the following sections. We hypothesize this affine-like growth for the weights and, later, we will verify that this is consistent with the rest of the theory and the experiments (see Appendix B.1 for more details). Therefore, we can write the weight variations as:

$$\Delta w = w^{(k)} - w^{(0)} \approx k\mu\eta$$
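The quality of the first-order approximation can be checked numerically. The sketch below is an illustration under assumptions: toy layer sizes, orthogonal projectors built from an SVD, all embedded bits equal to 1, the sum vector taken as the sum of the transformed projection vectors, and the matrix Ψ taken as the product of the transformed projection matrix with its transpose. It compares the exact watermark gradient with its linearized form for weights whose variance matches the value 0.0012 quoted in the text.

```python
import numpy as np

S, I, F, T, lam = 3, 4, 8, 16, 1.0          # toy sizes and lambda (assumed)
M, N = S * S * I, S * S * I * F
rng = np.random.default_rng(0)

U, _, _ = np.linalg.svd(rng.normal(size=(M, T)), full_matrices=False)
Phi_tilde = np.kron(np.eye(M), np.full((F, 1), 1.0 / F)) @ U   # orthogonal projectors

phi_bar = Phi_tilde.sum(axis=1)             # sum of the projection vectors
Psi = Phi_tilde @ Phi_tilde.T               # symmetric N x N matrix

w = rng.normal(scale=np.sqrt(0.0012), size=N)      # small initial weights
proj = Phi_tilde.T @ w                              # projections, |proj| << 1

g_exact = -lam * Phi_tilde @ (1.0 / (1.0 + np.exp(proj)))   # exact gradient, bits = 1
g_lin = -0.5 * lam * phi_bar + 0.25 * lam * Psi @ w          # linearized gradient

rel_err = np.linalg.norm(g_exact - g_lin) / np.linalg.norm(g_exact)
```

With projections this small, the relative error is negligible, which supports using the linearized gradient in the rest of the analysis; the same arrays also satisfy the two orthogonal-projector properties used later.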

Analysis for SGD
We first analyze the behavior of SGD optimization when we implement digital watermarking as proposed in [5,6]. Recall the SGD update rule in (2). If we use the approximation for the gradient in (17) and the affine growth hypothesis for the weights introduced in (18), with $\bar{\phi}$ and $\tilde{\Psi}$ denoting the vector and the matrix defined at the beginning of Section 3, we have:

$$w^{(0)} + k\mu\eta \approx w^{(0)} + (k-1)\mu\eta + \frac{\mu\lambda}{2}\bar{\phi} - \frac{\mu\lambda}{4}\tilde{\Psi}\left(w^{(0)} + (k-1)\mu\eta\right)$$

To simplify the analysis, we consider from now on that $w^{(0)} \approx 0$. We confirm the validity of this assumption in Appendix B.2. Then, we can write:

$$\eta \approx \frac{\lambda}{2}\bar{\phi} - \frac{\lambda (k-1)\mu}{4}\tilde{\Psi}\eta \tag{21}$$

If we consider orthogonal projectors, we can arrive at a more explicit expression for η. In particular, if we multiply (21) by $\tilde{\Psi}$ and use the properties (13) and (14), we obtain:

$$\tilde{\Psi}\eta \approx \frac{2\lambda}{4F + \lambda(k-1)\mu}\bar{\phi} \tag{22}$$

Then, substituting (22) into (21), we can get a more concise expression for η:

$$\eta \approx \frac{2\lambda F}{4F + \lambda(k-1)\mu}\bar{\phi} \tag{23}$$

Thus, if F is large compared to λkµ (this certainly holds for our experimental set-up, cf. Section 6.1), η will be approximately proportional to $\bar{\phi}$, i.e., $\eta \approx (\lambda/2)\bar{\phi}$. Then, the coefficients will follow an affine-like growth as we hypothesized in (18) (see Appendix B.1 for the empirical confirmation of this hypothesis). Now, the weight variations can be expressed as:

$$\Delta w \approx \frac{k\mu\lambda}{2}\bar{\phi}$$

As we can see, when we use SGD, ∆w will approximately follow a zero-mean Gaussian distribution, as induced by [9]. Because of this, and unlike Adam (as we will see later), the weights will evolve with random speeds when we embed watermarks using SGD optimization. Therefore, the impact on the original shape of the weight distribution will be small. However, the variance of the weight distribution may change considerably as stated in [20]. Since we have $\mathrm{Var}\{\bar{\phi}\} = T/(FN)$ for orthogonal projectors, the variance of ∆w can be computed as:

$$\mathrm{Var}\{\Delta w\} \approx \frac{k^2\mu^2\lambda^2 T}{4FN}$$

Thus, considering that $w^{(0)}$ and ∆w are uncorrelated (we check this statement in Section 6.2.2), we arrive at the following expression for the variance of the weights at the kth iteration:

$$\mathrm{Var}\{w^{(k)}\} \approx \mathrm{Var}\{w^{(0)}\} + \frac{k^2\mu^2\lambda^2 T}{4FN} \tag{25}$$

As we can see from (25), when implementing the digital watermarking algorithm in [5,6] with SGD optimization and orthogonal projectors, the variance of the resulting weight distribution might change considerably. 
In order to preserve the original weight distribution when using SGD, it is especially important to choose the values of T, F and N with care. In addition, the standard deviation of the weights will (approximately) increase linearly with the number of iterations, so it may also be important to limit the value of k. This is in line with the expected behavior: the weights move away from their original values, and they move further the more iterations we perform.
Because the analysis for Gaussian projectors becomes considerably difficult, in this paper, we just address the study of SGD with orthogonal projectors. A more comprehensive analysis for Gaussian projectors that can be linked to the results obtained in [20] is left for future research. Regardless of this, the whole analysis for both kinds of projection vectors will be developed in the next section for Adam optimization.
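The SGD behavior described in this section can be reproduced with a small simulation. This is a sketch under assumptions that are not the paper's experimental set-up: toy layer sizes, orthogonal projectors, all bits equal to 1, zero initial weights, the watermark term as the only cost, and a learning rate chosen so that λkµ stays small compared to F. It compares the simulated weight variation with the prediction that ∆w is approximately (kµλ/2) times the sum of the projection vectors.

```python
import numpy as np

S, I, F, T = 3, 4, 8, 16                    # toy sizes (assumed)
M, N = S * S * I, S * S * I * F
lam, mu, k = 1.0, 1e-3, 500                 # lam * k * mu = 0.5, small versus F
rng = np.random.default_rng(0)

U, _, _ = np.linalg.svd(rng.normal(size=(M, T)), full_matrices=False)
Phi_tilde = np.kron(np.eye(M), np.full((F, 1), 1.0 / F)) @ U
phi_bar = Phi_tilde.sum(axis=1)             # sum of the projection vectors

w = np.zeros(N)                             # w^(0) = 0
for _ in range(k):
    # Exact watermark gradient for all-ones bits (no Taylor approximation).
    grad = -lam * Phi_tilde @ (1.0 / (1.0 + np.exp(Phi_tilde.T @ w)))
    w = w - mu * grad                       # plain gradient step
dw = w                                      # Delta w, since w^(0) = 0

dw_pred = 0.5 * k * mu * lam * phi_bar      # predicted weight variation
rel_err = np.linalg.norm(dw - dw_pred) / np.linalg.norm(dw_pred)

second_moment = np.mean(dw ** 2)            # phi_bar has zero mean in expectation
var_pred = (k * mu * lam) ** 2 * T / (4 * F * N)
```

In this setting each weight moves at its own speed, proportional to its entry in the sum vector, so the histogram of ∆w inherits the bell shape of that vector rather than collapsing into spikes.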

Analysis for Adam
In the next sections, we will delve into the theory behind Adam optimization for DNN watermarking. In particular, we will obtain an expression for the mean and the variance of the gradient and then, as we did with SGD, we will analyze the update term to get an expression of the weight variations.

Mean of the Gradient
We are interested in computing the mean of the gradient that is used in Adam. Considering $\tilde{f}(w)$ as the global cost function, then, from (4), we can rewrite the mean at the kth iteration as:

$$m^{(k)} = (1-\beta_1)\sum_{i=1}^{k} \beta_1^{k-i}\, \nabla_w \tilde{f}(w^{(i-1)})$$

We use the gradient in (17) and do some derivations to find an explicit expression for $\hat{m}^{(k)} = m^{(k)}/(1 - \beta_1^k)$ under the hypothesis in (18). Finally, we arrive at the following expression for the bias-corrected mean gradient when k is sufficiently large (see Appendix A.2 for all the mathematical details): As we see from (27), the mean of the gradient also grows affinely with k.

Variance of the Gradient
The approximation in (17) for the jth element of this vector g(w), denoted by g j (w), is the following: Following the hypothesis in (18), we can write: In summary, for this affine-like growth, the square gradient vector can be written as: for some vectors p, q, r whose jth component can be defined as: Now, from (5), we can rewrite the variance of the gradient that is used in Adam as: The bias-corrected termv . Applied to the special case of (31), this yields (see Appendix A.

Update Term
Because µ is usually very small (we use $\mu = 1 \cdot 10^{-6}$ in our experiments), we can assume that kµ will be small enough to obtain an approximation of the update used in Adam. Recall that, for the jth weight, this is $u_j$. Let: Then, assuming that $\mu k \ll 1$, we can make a zero-order approximation of the update term, i.e.,: This approximation is accurate enough for the set of experiments we perform. In particular, for the orthogonal case, we could deal with $k_{max}$ = 625,000 and still get a correlation coefficient of 0.9900 between $(\lambda^2/4)s_j$ and $\hat{v}_j$. In our experiments, we actually reach a BER of zero for values of k quite below $k_{max}$ (cf. Section 6.2).
From (36), we observe that the updated jth coefficient approximately follows the hypothesized growth, i.e., Notice that, as expected, the update does not depend on λ, following Adam's property that the update is invariant to rescaling the gradients [9]. Finding a more explicit expression runs into the problem that η depends on s, which in turn is a function of η through (32) and (33). The following subsections are devoted to solving this problem by conjecturing a form for η and refining it.
To simplify the analysis, we consider from now on that w (0) ≈ 0 since most of the values of the weights at the initial iteration are very small (see Figure 2a). We will verify the accuracy of this approximation in Appendix B.2.

Rationale for the Sign Function
Recall the expression (23) that we obtained for η when analyzing SGD, where we found η to be approximately proportional to $\bar{\phi}$. Now, for Adam, we take this as a starting point, so we conjecture first that $\eta = \gamma \cdot \bar{\phi}$, for some real positive γ. Here, we consider orthogonal projection vectors and use the property introduced in (13) and the following: In this particular case, we have the following identities: Substituting these values into (35), we find that When we divide $m_{0,j}$ by $\sqrt{s_j}$, we obtain: It is then clear that η cannot be written in the form $\eta = \gamma \cdot \bar{\phi}$, as was conjectured at the beginning of this section.

A Theoretical Expression for ∆w
Although the conjectured form for η in Section 3.2.4 does not hold, the appearance of the sign function in (37) gives a key clue for an alternative approach, since the sign seems to reveal the reason behind the two-spiked histograms like the one shown in Figure 2c.
Therefore, let us write η j to explicitly contain the sign of ϕ j and allow γ j to take different (non-negative) values with j to reflect the varying magnitude (recall that even in Section 3.2.4 the conjectured value could be written as η j = γ|ϕ j | · sgn(ϕ j )). Let γ be the column vector containing γ j , In addition, thus, in order to meet the condition η j = γ j · sgn(φ j ), the following nonlinear equation should be solved for all γ j , j = 1, · · · , N : This equation can be solved with a fixed-point iteration method [21]. To that end, we should initialize γ and then iterate the following: (1) compute the right-hand side of (38), and (2) use it to update γ on the left-hand side. This process will converge to the solution of (38). Even though this method can be implemented to give the specific values for each γ j , we are more interested in obtaining a statistical characterization rather than a deterministic one. As we will see, the statistical approach offers a deeper explanation for the two-spiked distribution of ∆w which we ultimately seek.
We thus aim at finding the pdf of Γ, now considered as a random variable for which $\gamma_j$, j = 1, · · · , N are nothing but realizations. Once again, Equation (38) can be solved iteratively (e.g., with Markov-chain Monte Carlo methods [22]) to yield the equilibrium distribution for Γ. Instead, we can resort to the results in Section 6 where we conclude that the pdf of Γ is strongly concentrated around its mode. With this observation, it is possible to consider that $\gamma^T (c_j \bullet \mathrm{sgn}(\bar{\phi}))$ approximately corresponds to realizations of $\Gamma \cdot (c_j^T \mathrm{sgn}(\bar{\phi}))$. In order to simplify the analysis even further, we are interested in decomposing $c_j^T \mathrm{sgn}(\bar{\phi})$ using its statistical projection onto $\bar{\phi}_j$, i.e., $c_j^T \mathrm{sgn}(\bar{\phi}) = \alpha \cdot \bar{\phi}_j + z_j$. Here, α is a real multiplier and $z_j$ is zero-mean noise uncorrelated with $\bar{\phi}_j$. More generally, if we define matrix $C \doteq [c_0, \cdots, c_N]$, then we seek to write $C^T \mathrm{sgn}(\bar{\phi}) = \alpha \cdot \bar{\phi} + z$. We do the analysis for the cases of Gaussian projectors and orthogonal projectors separately (refer to Appendix A.4 for the derivations). For the Gaussian case, we get: On the other hand, for the orthogonal projectors, we get instead: Recall that, by construction, $\bar{\phi}$ can be seen as a random vector. In fact, we have $\bar{\phi} \sim N(0, I \cdot T/F^2)$ for Gaussian projection vectors, and $\bar{\phi}$ approximately follows $N(0, I \cdot T/(FN))$ for orthogonal projectors. Let Ξ, Z be random variables with the distribution of a single element of $\bar{\phi}$ and z, respectively; then $q_j$ and $r_j$ can be seen as realizations of (approximately) $Q = -\Gamma\Xi(\alpha\Xi + Z)$ and $R = \frac{\Gamma^2}{4}(\alpha\Xi + Z)^2$, so a stochastic version of (38) is: Squaring both sides, we find that, for a given realization (ξ, z) of (Ξ, Z), Γ must take the positive value γ that satisfies the following fourth degree equation: From (43), it is easy to generate samples γ of Γ and, accordingly, samples of ∆w, by recalling that: We note that, for the particular case when $\beta_2$ is very close to 1, . 
This simplification allows us to approximate (43) as which leads to the following fixed-point equation: When the noise term z is very small compared to αξ (which occurs with a fairly large probability, especially for the case of orthogonal projectors), then the solution to (46), denoted by $\gamma_s$, will be independent of the value of ξ. This will cause the probability of Γ to be concentrated around $\gamma_s$, and in turn this will make the pdf of ∆w have two spikes centered at $\pm k\mu\gamma_s$. We will see these spikes appearing time and again in the experiments carried out with Adam (Section 6).
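The two-spiked behavior can be illustrated with a small simulation that runs Adam on the watermark term alone. This is a sketch under assumptions (toy layer sizes, orthogonal projectors, all bits equal to 1, zero initial weights, a short run with a learning rate larger than the one used in the paper) and it does not solve the fixed-point equation of this section; it only shows that, while kµ is small, every weight travels roughly the same distance kµ in the direction given by the sign of its entry in the sum of the projection vectors.

```python
import numpy as np

S, I, F, T = 3, 4, 8, 16                    # toy sizes (assumed)
M, N = S * S * I, S * S * I * F
lam, mu, k_max = 1.0, 1e-4, 100             # short run, so k * mu << 1
beta1, beta2, eps = 0.9, 0.999, 1e-8        # default Adam hyperparameters
rng = np.random.default_rng(0)

U, _, _ = np.linalg.svd(rng.normal(size=(M, T)), full_matrices=False)
Phi_tilde = np.kron(np.eye(M), np.full((F, 1), 1.0 / F)) @ U
phi_bar = Phi_tilde.sum(axis=1)             # sum of the projection vectors

w, m, v = np.zeros(N), np.zeros(N), np.zeros(N)
for k in range(1, k_max + 1):
    # Exact watermark gradient for all-ones bits.
    g = -lam * Phi_tilde @ (1.0 / (1.0 + np.exp(Phi_tilde.T @ w)))
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    w = w - mu * (m / (1 - beta1 ** k)) / (np.sqrt(v / (1 - beta2 ** k)) + eps)
dw = w                                      # Delta w, since w^(0) = 0

# Fraction of weights that moved in the direction of sgn(phi_bar).
sign_agreement = np.mean(np.sign(dw) == np.sign(phi_bar))
# Typical travelled distance relative to k * mu.
median_ratio = np.median(np.abs(dw)) / (k_max * mu)
```

A histogram of `dw` in this toy run concentrates near two values of opposite sign, in contrast with the bell-shaped variation obtained with plain gradient steps.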

The Denoising Term
Thus far, we have considered only that our cost function is $\tilde{f}(w) = \lambda f_{\text{watermark}}(w)$; however, as we know, there is an additional term, the original denoising function, so our real cost function is:

$$f(w) = f_0(w) + \lambda f_{\text{watermark}}(w)$$

The gradients corresponding to this function, $f_0(w)$, will try to pull the weight vector towards the original optimal $w^{(0)}$ in a way that is relatively hard to model. In order to analyze this behavior, we can approximate the gradient of the denoising function at the kth iteration with respect to the jth coefficient as a sum of a constant term, $d_j$, and a noisy one, $\tilde{n}_j^{(k)}$, which follows a zero-mean Gaussian distribution and is associated with the use of different training batches on each step. We will refer to this noise as batching noise. Thus, for each coefficient j, we can write:

$$\frac{\partial f_0(w^{(k)})}{\partial w_j} \approx d_j + \tilde{n}_j^{(k)} \tag{47}$$

As we did in the previous section, we can formulate a stochastic version of (47). To that end, we notice that the constant term of this gradient, $d_j$, can take different values with j, as can the variance of the batching noise, $h_j$; that is, $\tilde{n}_j^{(k)}$ is drawn from $N(0, h_j)$. Therefore, in order to reflect the variability of these terms along the j-elements, we introduce two random variables with the distribution of the mean gradient and the variance of the batching noise, D and H, respectively, for which $d_j$ and $h_j$ are realizations. The pdf of these distributions will be obtained empirically in Section 6.2. Then, we can see $\tilde{n}_j^{(k)}$ as a realization of $\tilde{N} \sim N(0, H)$.

SGD
Similarly to Section 3.2.5, let Ξ be a random variable with the distribution of $\bar{\phi}$. Let $(\xi, \delta, \tilde{n})$ be a realization of $(\Xi, D, \tilde{N})$, respectively; then for SGD using orthogonal projectors we can compute samples of ∆w adding both functions, i.e., denoising and watermarking:

Adam
The variance of the batching noise computed by Adam will be approximately given by the random variable V, whose realizations can be expressed as v j = 1−β 2 Notice that, for each realization of V, as the sum takes places over i, we must work with a fixed value h j for the variance of the batching noise. Then, with this variance, we generate k samples ofÑ to be used in the sum that produces v j . With this characterization, we can easily analyze how the denoising cost function shapes the distribution of the weight variations. Notice that this analysis could be adapted for any host network. Let δ and ν be realizations of D and V, respectively, then, we can generate samples of ∆w without including the gradients from the watermarking function as: Moreover, in order to get a more accurate description of the problem, we can combine both functions: denoising and watermarking. The analysis becomes somewhat complicated, but, as we will check in Section 6, the distributions resulting from this analysis do capture better the shapes observed in the empirical ones. See Appendix A.5 for the results of this analysis.

Block-Orthonormal Projections (BOP)
Here, we discuss BOP, the solution we propose to the detectability problem posed by Adam optimization when implementing the watermarking algorithm proposed in [5,6]. In order to hide the noticeable weight variations that appear when we use Adam (as seen in Figure 2), we introduce a prior transformation using a secret N × N matrix X (the details of its construction are given below). The procedure we follow has three steps for each iteration of Adam.
Firstly, we project the weights and gradients from the embedding layer using X:

$$y^{(k-1)} = X w^{(k-1)}, \qquad \nabla_y f(y^{(k-1)}) = X \nabla_w f(w^{(k-1)}).$$

Then, we run Adam optimization on the projected weights, y, using the projected gradients, $\nabla_y f(y)$, i.e., steps (3)-(8) are taken using y and $\nabla_y f(y^{(k-1)})$ instead of w and $\nabla_w f(w^{(k-1)})$, respectively. The key to BOP is the following: if we execute Adam on y instead of w, we break the natural bond created by Adam between $\mathrm{sgn}(\tilde{\varphi})$ and w which, as we saw in the previous sections, is responsible for the ant-like behavior of the weights and, consequently, for the appearance of side spikes in their histograms. These undesired effects disappear when we de-project y using $X^{-1}$ to get back to the weight vector w:

$$w^{(k)} = X^{-1} y^{(k)}.$$

In order to reduce the computational complexity and the memory requirements of this method (recall that N is generally a very large number and we must project and de-project the weights on each iteration), we consider X to be a block diagonal matrix with B identical blocks of size $\frac{N}{B} \times \frac{N}{B}$. In this way, we only have to build and work with a single block $X_B$, whose size we can choose by simply adjusting the value of B. The values of this block are drawn from a standard normal distribution. In addition, $X_B$ is built as an orthonormal matrix so that $X_B^{-1} = X_B^T$. Let $y^{(i)}$ and $w^{(i)}$ be the ith blocks of y and w, respectively, both of length $N/B$; therefore, we just compute:

$$y^{(i)} = X_B w^{(i)}, \qquad i = 1, \ldots, B.$$

After executing Adam, we can get back to $w^{(i)}$:

$$w^{(i)} = X_B^T y^{(i)}, \qquad i = 1, \ldots, B.$$

As we will see in Section 6.2.4, BOP does not significantly alter the original distribution of weights, as opposed to standard Adam. This makes it possible to enjoy the advantages of Adam optimization when we implement the watermarking algorithm in [5,6] with a minimal increase in the detectability of the watermark.
In addition, this has an advantage in terms of robustness: if the adversary is not able to infer which layer is watermarked, then he/she will have to exert his/her attack (e.g., noise addition, weight pruning) on every layer thus producing a larger impact on the performance of the network as measured by the original cost function. We will discuss this fact in the experimental section.
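The three steps of a BOP iteration (project, run Adam on the projected variables, de-project) can be illustrated with a small standard-library sketch. It is not the authors' implementation: the Adam update below is the textbook one and is assumed to match steps (3)-(8), the block is orthonormalized with Gram-Schmidt as one possible construction, and all function names, sizes and hyperparameter values are hypothetical.

```python
import math
import random


def orthonormal_block(n, rng):
    """Draw standard-normal rows and orthonormalize them (Gram-Schmidt)."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for r in rows:  # remove the components along the rows already kept
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:
            rows.append([a / norm for a in v])
    return rows


def transpose(m):
    return [list(col) for col in zip(*m)]


def project(xb, vec):
    """Apply the block xb to every consecutive block of vec (block-diagonal X).

    The length of vec is assumed to be a multiple of the block size.
    """
    n = len(xb)
    out = []
    for start in range(0, len(vec), n):
        chunk = vec[start:start + n]
        out.extend(sum(a * b for a, b in zip(row, chunk)) for row in xb)
    return out


def bop_adam_step(w, grad, xb, state, lr=1e-6, b1=0.9, b2=0.999, eps=1e-8):
    """One BOP iteration: project w and grad, take an Adam step on y, de-project."""
    y = project(xb, w)
    gy = project(xb, grad)
    state["k"] += 1
    k = state["k"]
    new_y = []
    for j, (yj, gj) in enumerate(zip(y, gy)):
        # The Adam moments are tracked in the projected domain.
        state["m"][j] = b1 * state["m"][j] + (1 - b1) * gj
        state["v"][j] = b2 * state["v"][j] + (1 - b2) * gj * gj
        m_hat = state["m"][j] / (1 - b1 ** k)
        v_hat = state["v"][j] / (1 - b2 ** k)
        new_y.append(yj - lr * m_hat / (math.sqrt(v_hat) + eps))
    return project(transpose(xb), new_y)  # inverse of the block is its transpose


# Toy example: N = 12 weights split into B = 3 blocks of size 4.
rng = random.Random(0)
N, B = 12, 3
xb = orthonormal_block(N // B, rng)
w = [rng.gauss(0.0, 0.03) for _ in range(N)]
g = [rng.gauss(0.0, 1.0) for _ in range(N)]
state = {"k": 0, "m": [0.0] * N, "v": [0.0] * N}
w_new = bop_adam_step(w, g, xb, state)
```

In this sketch the sign-like Adam update acts on y, so the individual components of the weight change are mixed by the transposed block instead of all taking the same magnitude, while the overall norm of the step is preserved because the block is orthonormal.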

Information-Theoretic Measures
As already discussed, one of the potential weaknesses of any neural network watermarking algorithm is the detectability of the watermark. An adversary that detects the presence of a watermark on a certain subset of the weights can initiate an attack to remove or alter the watermark. For this reason, it is important that the weights statistically suffer the least modification possible while, of course, still conveying the desired hidden message. To measure this statistical closeness, we propose using the KLD [13] between the distributions of weights before and after the watermark embedding. Let P and Q be two discrete probability distributions defined on the same alphabet $\mathcal{X}$; then, the KLD from Q to P is (notice that it is not symmetric):

$$D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$

The KLD is always non-negative. The more similar the distributions P and Q are, the smaller the divergence. In the extreme case of two identical distributions, the divergence is zero.
It is interesting to note that the KLD has been proposed for similar problems in forensics, including steganographic security [23], distinguishability between forensic operators [24], or more general source identification problems [25].
In our case, the two compared distributions are those of $w^{(0)}$ and $w^{(k)}$, with k the iteration at which the embedding just converges with no decoding errors. Since the KLD is not symmetric, it remains to assign those distributions to P and Q so that the measure is as informative as possible. In particular, we are interested in properly accounting for the possible lateral spikes in the pdf of $w^{(k)}$. As those spikes often appear where the pdf of $w^{(0)}$ is small if not negligible, this suggests assigning the latter pdf to Q and the former to P. However, this choice creates a problem in practice, as for some $x \in \mathcal{X}$ the empirical probabilities are such that $P(x) \neq 0$ and $Q(x) = 0$, potentially leading to an infinite divergence. To circumvent this issue, which is related to insufficient sampling, we use an analytical approximation to Q with infinite support, after noticing that the empirical distribution of $w^{(0)}$ with 1000 discrete bins (see Figure 2a) can be approximated by a zero-mean Generalized Gaussian Distribution (GGD) with shape parameter β = 0.64 and scale parameter α = 0.01, the latter controlling the spread of the distribution (for notational coherence with the literature, α is used in this section to denote a different quantity than in the rest of the paper). As a reference, the KLD between the empirical distribution of $w^{(0)}$ and its GGD best fit is 0.0177, which is smaller than any of the KLDs that we find in Table 2. In order to compute the KLD in our experiments, we use this infinite-support symmetric distribution for Q and the empirical one of $w^{(k)}$ for P, after quantizing both to 1000 discrete bins. The use of the KLD is adequate to measure the detectability in those cases where the adversary has access to information about the 'expected' distribution of the weights. For instance, when only one layer is modified, the expected distribution can be inferred from the weights of other layers.
However, this may still be too optimistic in terms of adversarial success: while the expected shape may be preserved (and thus inferred) across layers, the scale, which directly affects the variance, may not be. For instance, if the original weights were expected to be zero-mean Gaussian and they still are after watermarking, the KLD (which depends on the ratio of the respective variances) may be quite large, but the adversary will not be able to determine whether watermarking took place if he/she does not know what the variance should be and only measures the divergence with respect to a Gaussian. To reflect this uncertainty, which is quite realistic in practical situations, we minimize the KLD with respect to the scale parameter α. This puts the adversary in a scenario where only the shape is used for detectability. Thus, letting $Q_\alpha$ correspond to a GGD with scale parameter α, we define the Scale Invariant KLD (SIKLD) as:

$$\mathrm{SIKLD}(P) = \min_{\alpha} D(P \| Q_\alpha).$$
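The two measures can be sketched as follows. This is a minimal illustration rather than the code used for Table 2: it assumes the usual GGD density β/(2αΓ(1/β))·exp(−(|x|/α)^β), quantizes Q by evaluating the density at the bin centers, replaces the minimization over α by a simple grid search, and uses synthetic Gaussian samples as a stand-in for the weights. The function names, the histogram range and the grid are hypothetical choices.

```python
import math
import random


def ggd_pdf(x, alpha, beta):
    """Zero-mean Generalized Gaussian density with scale alpha and shape beta."""
    coef = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return coef * math.exp(-((abs(x) / alpha) ** beta))


def histogram(samples, lo, hi, bins):
    """Empirical probabilities; samples outside [lo, hi) fall into the edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        idx = min(bins - 1, max(0, int((s - lo) / width)))
        counts[idx] += 1
    total = float(len(samples))
    return [c / total for c in counts]


def quantized_ggd(alpha, beta, lo, hi, bins):
    """GGD probabilities on the same bins (density at bin centers, renormalized)."""
    width = (hi - lo) / bins
    mass = [ggd_pdf(lo + (i + 0.5) * width, alpha, beta) * width for i in range(bins)]
    total = sum(mass)
    return [m / total for m in mass]


def kld(p, q):
    """D(P || Q); bins with P(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sikld(p, beta, lo, hi, bins, alphas):
    """Scale Invariant KLD: minimum of D(P || Q_alpha) over a grid of scales."""
    return min(kld(p, quantized_ggd(a, beta, lo, hi, bins)) for a in alphas)


# Stand-in for the watermarked weights: synthetic zero-mean Gaussian samples.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.03) for _ in range(20000)]
lo, hi, bins = -0.15, 0.15, 1000
p = histogram(weights, lo, hi, bins)
kld_fixed = kld(p, quantized_ggd(0.01, 0.64, lo, hi, bins))
alphas = [0.01 * 1.1 ** i for i in range(-20, 21)]  # grid that contains 0.01
sikld_value = sikld(p, 0.64, lo, hi, bins, alphas)
```

Because the grid contains α = 0.01, the SIKLD obtained this way can never exceed the fixed-scale KLD, which matches its interpretation as the divergence seen by an adversary who only knows the shape.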

Experiments and Results
In this section, we show the experimental results and we compare them to the theory that we have developed. We use MATLAB R2018b to implement the expressions obtained in Section 3 and represent the theoretical histograms. As we will see, both theory and experiments match reasonably well. In particular, for Adam optimization, we are able to reproduce the same position of the side spikes seen in the empirical histograms of ∆w, as well as some effects which are attributable to the influence of the denoising cost function. We will also verify the BOP method proposed in Section 4. In addition, the KLD will be computed to give a more precise measure of the similarity between the distributions of w (k) (i.e., after the embedding) and w (0) , when using SGD, Adam and BOP.

Experimental Set-Up
We employ the fine-tune-to-embed approach described in [5,6]. This means that the training process is divided into two phases, as we explained earlier: (1) training the host network from scratch, and (2) fine-tuning steps for embedding the watermark.

Training the Host Network
In order to perform the initial training of the host network FFDNet, we use the open-source PyTorch implementation provided in [26]. We employ the FFDNet architecture for color images, which has a depth of 12 convolutional layers and 96 feature maps per layer. The training details and the datasets are the same as in [26]: the Waterloo Exploration Database [27] for training and Kodak24 [28] for validation. We implement the cost function introduced in (1) and train for 80 epochs with the milestones described in [26] on an NVIDIA Titan Xp GPU. We use Adam as the optimization algorithm with its hyperparameters set to their default values. After training the network, we test it on the CBSD68 [29] and Kodak24 datasets.

Watermark Embedding
Once we have trained and tested our host network, we embed our T-bit watermark, b = 1 (i.e., all bits set to +1) with T = 256, into the convolutional layer l = 2 of FFDNet. In the next section, we present the results for both SGD and Adam optimization algorithms. The size of the convolutional filter is 3 × 3, and both the depth of the input and the number of filters in the layer are 96, i.e., I = 96 and F = 96. Therefore, we have M = 96 · 3 · 3 = 864 and N = MF = 82,944. In addition, the learning rate µ is set to $10^{-6}$ during these fine-tuning steps and, unlike in the initial training, we do not perform weight orthogonalization.
The values of the regularizer parameter are chosen as follows. When we use SGD, we set λ = 5 and λ = 20 for Gaussian and orthogonal projectors, respectively. For Adam optimization, we use two different values of λ for each configuration to better reflect the influence of the denoising function. In particular, we set λ = 0.05 and λ = 1 when we use Gaussian projectors, and λ = 0.5 and λ = 10 when we employ orthogonal projectors. We finish our embedding process when we reach a BER of zero, that is, when all the projected weights are positive (recall that all the embedded bits are set to +1), i.e., $(\tilde{\varphi}^{(m)})^T w > 0$ for all m = 1, …, T. Notice that these values of λ were selected with the goal of reaching a BER of zero relatively quickly and, as can be seen, they are not directly comparable for Gaussian and orthogonal projectors. Finally, to check the validity of our proposed method BOP, we use the same values of λ as with Adam optimization, and set the number of blocks B to 12.
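The stopping rule above can be written as a short check. This is a toy sketch: `extract_bits` and `bit_error_rate` are hypothetical helper names, the projection vectors are given as plain lists, and a bit is decoded as +1 when its projection is positive and as −1 otherwise.

```python
def extract_bits(projectors, w):
    """Decode one bit per projection vector: +1 if its projection of w is positive."""
    bits = []
    for phi in projectors:
        projection = sum(p * x for p, x in zip(phi, w))
        bits.append(1 if projection > 0 else -1)
    return bits


def bit_error_rate(decoded, embedded):
    """Fraction of decoded bits that differ from the embedded ones."""
    errors = sum(1 for d, e in zip(decoded, embedded) if d != e)
    return errors / len(embedded)


# Toy example with T = 3 projection vectors and N = 2 weights.
projectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.2]
decoded = extract_bits(projectors, w)            # projections: 0.5, -0.2, 0.3
ber = bit_error_rate(decoded, [1] * len(projectors))
embedding_done = ber == 0.0                      # stop fine-tuning once this holds
```

In the toy example the second projection is negative, so one of the three bits is wrong and the embedding would continue.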

Experimental Results
Here, we present the experimental results. After the main training of the host network and before the watermark embedding, we obtain a PSNR of 31.18 dB and 32.13 dB on the CBSD68 and Kodak24 datasets, respectively, for a noise level of 25. These results are very close to those reported in [26].
Compare these values to those in Table 2, where we show the PSNR (dB) results on the CBSD68 and Kodak24 datasets for the same noise level after the watermark embedding was performed. As we see, when we embed the watermark using SGD with Gaussian projectors, the denoising performance drops by about 0.45 dB while, if we employ orthogonal projectors, the performance drops by only 0.1 dB. Thus, employing orthogonal projectors with SGD optimization might be beneficial to better preserve the denoising performance. For Adam optimization and BOP, the original performance does not drop significantly, and it even increases when using orthogonal projectors. Consequently, in order to keep a good performance after the watermark embedding, Adam would be preferable to SGD were it not for the conspicuousness of the weights. Our proposed method BOP is a good solution to bring the detectability of Adam down to levels similar to those of SGD while still enjoying its other advantages. Table 2 also presents the number of iterations required for obtaining a BER of zero, and the KLD and Scale Invariant KLD (SIKLD) between the distributions of $w^{(k)}$ and $w^{(0)}$ for each configuration.

Empirical Denoising Gradients
In order to analyze the influence of the denoising function, we need to get the empirical distributions of the mean denoising gradient and the variance of the batching noise. To that end, we proceed as follows: firstly, we extract the denoising gradients from the embedding layer l = 2 and we average them for each coefficient over the number of iterations k to get the distribution of the mean. Then, the batching noise can be easily computed if, for each coefficient, we subtract the mean value from its corresponding denoising gradient value at each iteration. By computing the variance of this noise for each individual weight, we can estimate the overall distribution of the variance of the batching noise. Figure 3 shows the empirical distribution of the mean denoising gradient, D, and the variance of the batching noise, H.
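The procedure just described amounts to a per-coefficient mean and variance over the iterations. The sketch below assumes the denoising gradients have already been extracted into a list indexed by iteration and then by coefficient; the function name is hypothetical and the variance is normalized by the number of iterations (the population variance), which is an assumption.

```python
def gradient_statistics(grads):
    """Per-coefficient mean gradient and batching-noise variance.

    grads[k][j] is the denoising gradient of coefficient j at iteration k.
    """
    k = len(grads)
    n = len(grads[0])
    means = [sum(g[j] for g in grads) / k for j in range(n)]
    # Batching noise = gradient minus its mean; its variance is taken over k.
    variances = [sum((g[j] - means[j]) ** 2 for g in grads) / k for j in range(n)]
    return means, variances


# Toy example: two iterations, two coefficients.
means, variances = gradient_statistics([[1.0, 0.0], [3.0, 0.0]])
```

The histograms of `means` and `variances` over all coefficients then play the role of the empirical distributions of D and H.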

SGD
We embed our watermark using SGD with the set-up detailed in Section 6.1.2. We show the resulting histograms of $w^{(k)}$ and $\Delta w$ in Figure 4. As we can see, SGD does not significantly alter the original distribution of weights shown in Figure 2a. This is also reflected in the SIKLD, which is very small, especially when we use orthogonal projectors (see Table 2). Now, we check the theory we developed for SGD in Section 3.1. Recall that our theoretical analysis covers only the case of orthogonal projectors. Using (24), we can generate samples of $\Delta w$ without including the effect of the denoising cost function. The resulting histogram is shown in Figure 5a. Notice that the unusual appearance of this histogram can be attributed to the effects of applying the initial transformation explained in Section 2.3.1. In particular, each value of $\phi$ repeats F times to form the vector $\tilde{\varphi}$; hence the discrete values on the y-axis of the histogram shown in Figure 5a. However, we see that the range of values of this theoretical histogram fits the empirical one (Figure 4c) quite well. In order to get a more accurate representation, we can generate samples of $\Delta w$ according to (48), so that we add the effect of the noise coming from the denoising cost function. As we see in Figure 5b, the resulting histogram is now very similar to the one in Figure 4c.
In addition, we confirm that (25) can be used to compute the variance of the distribution of $w^{(k)}$ when we implement orthogonal projectors. Firstly, we check the hypothesis that we made regarding the uncorrelatedness between $w^{(0)}$ and $\Delta w$. For our particular case of λ = 20, the correlation coefficient between $w^{(0)}$ and $\Delta w$ is $2.539 \cdot 10^{-4}$, a very small value that supports our assumption. The variance of the empirical distribution of $w^{(k)}$ (red histogram in Figure 4c) is $1.192 \cdot 10^{-3}$, while the theoretical variance obtained from (25) is $1.193 \cdot 10^{-3}$. As we see, these values are almost identical.
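The role of the uncorrelatedness check can be made explicit with the general variance identity for a sum. The following is only a sketch: it assumes that (25), which is not reproduced here, amounts to adding the variances of $w^{(0)}$ and $\Delta w$, and ρ denotes the correlation coefficient between them.

```latex
\begin{align}
\mathrm{Var}\{w^{(k)}\} &= \mathrm{Var}\{w^{(0)} + \Delta w\} \nonumber\\
&= \mathrm{Var}\{w^{(0)}\} + \mathrm{Var}\{\Delta w\}
   + 2\rho\sqrt{\mathrm{Var}\{w^{(0)}\}\,\mathrm{Var}\{\Delta w\}}, \nonumber\\
\left|2\rho\sqrt{\mathrm{Var}\{w^{(0)}\}\,\mathrm{Var}\{\Delta w\}}\right|
&\le |\rho|\left(\mathrm{Var}\{w^{(0)}\} + \mathrm{Var}\{\Delta w\}\right). \nonumber
\end{align}
```

The bound follows from the inequality between the geometric and arithmetic means. With $\rho = 2.539 \cdot 10^{-4}$, the cross term is therefore at most about 0.03% of the sum of the two variances, so dropping it is a mild approximation.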

Adam
In the following experiments, we employ Adam optimization for the watermark embedding and use the same settings as in Section 6.1.2. The resulting histograms of $w^{(k)}$ and $\Delta w$ are shown in Figures 6 and 7, respectively. As can be observed, the shape of the weight distribution changes to a great extent for λ = 1 and λ = 10. For smaller values of λ, since the influence of the watermarking cost function is weaker, we can avoid a significant alteration of the original distribution shape. This is also reflected in their SIKLD values in Table 2: as λ increases, the SIKLD also increases considerably. However, notice that, whatever the value of λ, the histograms of weight variations always present the characteristic side spikes. These footprints left by Adam can increase the detectability of the watermark. In addition, when λ is small, we can observe in these histograms the influence of the denoising cost function: it causes the appearance of a central peak with values that spread up to the location of both side spikes. Figure 8 represents the pdf of Γ obtained from (43). Notice that, as we stated in Section 3.2.5, the pdf is concentrated around its mode. Figure 9 shows the histograms of $\Delta w$ obtained from (44) when only the watermarking loss $\tilde{f}(w)$ is optimized and the denoising component is set to zero. Compare these histograms to those in Figure 7: the theory developed in Section 3.2 is able to explain the two-spiked distributions of $\Delta w$. Notice that these theoretical expressions provide a good enough approximation, since they allow us to predict the position of the side spikes. We show in Table 3 the values of these positions obtained from both theoretical (Figure 9) and empirical (Figure 7) results. As we can see, these side spikes are placed in almost identical positions by both theory and experiments; hence, we can confirm that the sign phenomenon in Adam is responsible for the ant-like behavior shown by the weights at the embedding layer.
Finally, in order to reflect the influence of the denoising function and obtain more realistic histograms, we can solve the fourth-degree Equation (A11) and then generate samples of $\Delta w$ according to (A12). The resulting histograms are shown in Figure 10 and are now quite similar to those in Figure 7. As can be seen, we can emulate the central peak and the dispersion of the values of the side spikes in the histograms and, thus, we can confirm that these effects are attributable to the influence of the denoising cost function.

BOP
Here, we present the empirical histograms obtained when we implement our method BOP. Figures 11 and 12 show the histograms of $w^{(k)}$ and $\Delta w$, respectively. As we see from these histograms and the KLD and SIKLD values in Table 2, this method allows us to remove the side spikes of the histograms and to preserve the original shape of the weight distribution much better. As a result, the detectability of the watermark due to Adam optimization is strongly reduced. A positive side effect of the undetectability of the watermark is that the robustness is increased, because an adversary will not know which layer must be modified in order to alter the embedded watermark. This is illustrated in Figure 13, where we compare the robustness of standard Adam with that of BOP against weight pruning. The network is trained long enough for both Adam and BOP to guarantee a similar BER vs. pruning rate, as shown in Figure 13a. Then, the PSNR obtained after training and pruning is shown in Figure 13b for the Kodak24 dataset, which illustrates the following facts: (1) BOP produces a network that is more robust to pruning in terms of PSNR, which can be valuable for model compression; for instance, for a pruning rate of 0.35 (which has no impact on the BER of the hidden information), BOP degrades the original PSNR by about 1 dB, whereas Adam would produce a degradation of more than 3 dB. (2) This robustness might be detrimental if it is an attacker who does the pruning in an attempt to degrade the watermark; for instance, for a pruning rate of 0.82, the BER for both Adam and BOP rises to 0.02 (see Figure 13a). For this pruning rate, the PSNR of BOP is around 25 dB, while Adam gives slightly less than 24 dB. Then, in the case of BOP, the adversary would be able to produce a network that performs closer to the original in terms of PSNR (notice, however, that the degradation in both cases is quite severe, so this heavy pruning would render a denoiser of little practical use).
(3) In any case, the previous comparison assumes that the adversary knows the layer that contains the watermark; as we have argued, this is reasonable for Adam but not for BOP. If the adversary does not know the layer that must be pruned, then, to achieve the same target BER, he/she must prune all the layers. In this case, for the same pruning rate of 0.82 that causes the BER to increase to 0.02, the PSNR drops to less than 18 dB. Similar conclusions can be drawn from Figure 13c, which shows the PSNR vs. pruning rate for the CBSD68 dataset. These experiments clearly show the higher robustness brought about by the undetectability of BOP, as it prevents attacks targeted at a specific layer.

Conclusions
Throughout this paper, we have shown the importance of being careful with the optimization algorithm when we embed watermarks following the approach in [5,6]. The choice of certain optimization algorithms whose update direction is given by the sign function can originate footprints in the distributions of weights that are easily detectable by adversaries, thus compromising the efficacy of the watermarking algorithm.
In particular, we studied the mechanisms behind SGD and Adam optimization and found that the sign phenomenon that occurs in Adam is detrimental for watermarking, since it causes the appearance of two salient side spikes on the histograms of weights. As opposed to Adam, the sign function does not appear when we use SGD. Therefore, SGD does not significantly alter the original shape of the distribution of weights although, as we showed in the theoretical analysis, it slightly increases its variance. The analysis in this paper can be extended to other optimization algorithms.
In addition, we introduced orthogonal projectors and observed that, compared to the Gaussian case, they generally preserve the original performance and weight distribution better. However, a deeper analysis on this subject is left for further research.
Finally, we presented a novel method that uses orthogonal block projections to address the use of Adam optimization together with the watermarking algorithm under study. As we checked in the empirical section, this method allows us to solve the detectability problem posed by Adam and still enjoy the other advantages of this optimization algorithm.

Funding: This work was partially funded by the Agencia Estatal de Investigación (Spain) and the European Regional Development Fund (ERDF) under projects RODIN (PID2019-105717RB-C21) and RED COMONSENS (RED2018-102668-T). In addition, it was funded by the Xunta de Galicia and ERDF under projects ED431C 2017/53 and ED431G 2019/08.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Mathematical Derivations
Appendix A.1. Projected Weights at k = 0
We want to compute the mean and the variance of $\tilde{\Phi}^T w^{(0)}$. It is easy to see that the mean is zero for any $(\tilde{\varphi}^{(i)})^T w^{(0)}$, i = 1, …, T, because $E\{\tilde{\varphi}^{(i)}_j\} = 0$. Now, in order to obtain an expression for the variance, we may compute the expectation of the trace of the covariance matrix of $\tilde{\Phi}^T w^{(0)}$ divided by T. Therefore, if we use the properties of the trace, we have: We can write: It is straightforward to see that, for the off-diagonal terms, i.e., $i \neq j$, the expectation in (A1) is zero. For the N diagonal terms, i.e., $i = j$, we have instead: where $\mathrm{Var}\{\tilde{\varphi}^{(m)}\}$ is the variance of any projection vector $\tilde{\varphi}^{(m)}$, m = 1, …, T. When we use Gaussian projectors, we have $\mathrm{Var}\{\tilde{\varphi}^{(m)}\} = 1/F^2$ and, when we use orthogonal projectors, we have $\mathrm{Var}\{\tilde{\varphi}^{(m)}\} = 1/(FN)$. Therefore, the variance for Gaussian projectors is

Appendix A.2. Adam: Mean of the Gradient
In order to find an expression for the mean of the gradient, we substitute (17) into (26) and obtain: The bias-corrected mean gradient is: The main difficulty in finding an explicit expression for $\hat{m}^{(k)}$ is the last sum, which requires a closed-form expression for $w^{(i)}$, for all i = 0, …, k, which in turn depends on the Adam adaptation. This easily leads to a difference equation whose solution is cumbersome. Alternatively, we start by conjecturing the affine growth in (18) and write: Therefore, under the affine-growth hypothesis, we can write: and arrive at the expression in (27).

Appendix A.3. Adam: Variance of the Gradient
To get an expression for the variance of the gradient, we plug (31) into (34) and write: We want to write $c_j^T \mathrm{sgn}(\tilde{\varphi}) = \alpha \cdot \tilde{\varphi}_j + z_j$. To find α, we compute the cross-correlation with $\tilde{\varphi}$, i.e., $E\{\tilde{\varphi}^T C^T \mathrm{sgn}(\tilde{\varphi})\} = \alpha E\{\|\tilde{\varphi}\|^2\}$. Then, z is simply $C^T \mathrm{sgn}(\tilde{\varphi}) - \alpha \cdot \tilde{\varphi}$. In the following, we show separately the derivations for the Gaussian and orthogonal projectors.

Appendix A.4.1. Decomposition for Gaussian Projectors
Recall that the projection vectors have i.i.d. components drawn from a N (0, 1) distribution. Since the quantity of interest E{φ T C T sgn(φ)} is a scalar, we will use the trace to manipulate the involved matrices, as follows: (m) , m = 1, · · · , T, i = 1, · · · , N, and θ i is the ith row of matrix Θ. In order to compute the previous expectation, we need to consider separately the diagonal terms and the off-diagonal ones.
We start with the off-diagonal elements (i.e., i = j). Here, we also need to distinguish the following cases: i F = j F , satisfied by N(N − F) elements, and i F = j F , which applies for the remaining N(F − 1) off-diagonal terms. Therefore, in the first subcase, we have that θ i = θ j , soφ (m) i =φ (m) j and: In order to compute the expectation E{φ where n i can be seen as a realization of an i.i.d. Gaussian distribution with zero-mean and variance (T − 1)/F 2 . Assume without loss of generality thatφ (m) i > 0. Then, the probability that sgn(φ i ) = −1 is i sgn(φ i ) will take the value −φ i sgn(φ i ) will correspond to that of a random variable Y constructed as follows: letting X ∼ N (0, 1/F 2 ), we define Y as Y .
= |X| · 1 − 2Q(|X|F/ √ T − 1) . Then, It can be shown that this integral gives [30]: Therefore, when i = j and i F = j F , we can compute (A2) as: j . Thus: where ξ Y can be computed as the mean of Y defined in a similar way as above, i.e.,: let X ∼ Using Mathematica, it is found that: Unfortunately, the double integral required to compute the expectation in (A4) does not seem to admit a closed-form solution. On the other hand, its numerical computation through Monte Carlo integration is rather straightforward. Let X, Y ∼ N (0, 1/F 2 ) and Z ∼ N (0, T−2 F 2 ), all mutually independent, then the desired expectation is E{XY 2 sgn(X + Y + Z)} . = κ(T), where, with κ(T), we stress the fact that the integral depends on T alone. The result is represented in Figure A1. It is possible to show that, as Then, for moderately large T, the sum in (A4) is approximately µ Y (T 2 + 2T − 1)/F 2 . For the i = j case (i.e., diagonal terms), we obtain the same result as (A4). Therefore, when computing ∑ i,j E{ C • (sgn(φ)φ T ) i,j }, there are N(N − F) terms with value µ Y T/F 2 , and NF terms with approximate value µ Y (T 2 + 2T − 1)/F 2 , hence: In addition, E{||φ|| 2 }} = E{φ Tφ } = E{ϕ T Θ T Θϕ} = MT F = NT F 2 , so we have: We are interested now in measuring the variance of z j , j = 1, · · · , N, which, by construction, we assume to be i.i.d. First, we note that the covariance matrix is: Then, the variance of z j will be the expectation of the trace of this matrix divided by N. The second term is immediate to compute, as E{Tr[φφ T ]} = NT/F 2 , so we focus next on Tr[C T sgn(φ)sgn(φ T )C] = Tr[CC T sgn(φ)sgn(φ T )]. Again, we can write: Note that: We analyze first the off-diagonal terms, i.e., i = j, when i F = j F , and consider separately the cases where m = l and m = l. 
For the first case, the right-hand side of (A7) can be written as: When i F = j F and m = l, the right-hand side of (A7) can be written as: Now, we consider the rest of elements which satisfy that i F = j F , which are the N diagonal terms and the remaining N(F − 1) off-diagonal ones. The right-hand side of (A7) is: where, for the last summand, we have used the fact that, for a zero-mean Gaussian random variable X with variance σ 2 , E{X 4 } = 3σ 4 . Then, the sum in (A7) is: and the resulting variance of z j is:
We are interested in computing the cross-product of this vector and $\tilde{\varphi}$. Then, we can write: Assuming i.i.d. components in $\tilde{\varphi}$, this implies that, for the orthogonal projectors, the cross-correlation of $c_j^T \mathrm{sgn}(\tilde{\varphi})$ and $\tilde{\varphi}_j$ is the same as that of $\frac{1}{F}\mathrm{sgn}(\tilde{\varphi}_j)$ and $\tilde{\varphi}_j$. We can then compute $E\{\tilde{\varphi}_j \mathrm{sgn}(\tilde{\varphi}_j)\} = E\{|\tilde{\varphi}_j|\}$. Modeling $\tilde{\varphi}_j$ as $\mathcal{N}(0, T/(FN))$, we find that $E\{|\tilde{\varphi}_j|\} = \sqrt{2T/(\pi F N)}$. Furthermore, $E\{\|\tilde{\varphi}\|^2\} = E\{\tilde{\varphi}^T \tilde{\varphi}\} = E\{\phi^T \Theta^T \Theta \phi\} = T/F$. The existence of a positive cross-correlation suggests writing $\frac{1}{F}\mathrm{sgn}(\tilde{\varphi}_j) = \alpha \tilde{\varphi}_j + \tilde{n}_j$, with α a suitable positive constant and $\tilde{n}$ a zero-mean noise vector uncorrelated with $\tilde{\varphi}$. Taking the cross-product and the expectation: Therefore, we find that: Now, it is easy to measure the second-order moment of $\tilde{n}_j$ since: With the above characterization, we can see that: $C^T \mathrm{sgn}(\tilde{\varphi}) = F\alpha\Psi\tilde{\varphi} + F\Psi\tilde{n} = \alpha\tilde{\varphi} + z$. Therefore, we have: Then, when i F ≠ j F, the expectation above is zero. For the NF remaining terms, that is, when i F = j F, we have $\tilde{n}_i = \tilde{n}_j$ and $\tilde{\varphi}^{(m)}_i = \tilde{\varphi}^{(m)}_j$. Thus: Finally, we can compute the variance of $z_j$ as follows:

Appendix A.5. Adam: Analysis with Denoising and Watermarking
Here, we join the denoising and watermarking cost functions and follow a similar approach to that in Section 3.2.5. Let c be a random vector of length N. We should now build a random projection matrix, $\tilde{\Phi} = \Theta\Phi$ (with Gaussian or orthogonal projectors), as a basis to generate samples of Ξ from (11) and to build c following (29).
We also consider that $\eta = \eta_d + \eta_{wm}$ is a random vector of length N, where $\eta_d$ and $\eta_{wm}$ represent the denoising and watermarking update terms, respectively. The components of $\eta_d$ can be computed from realizations of D and V as in Section 3.3, that is, $\eta_d = \delta / \sqrt{\delta^2 + \nu}$. On the other hand, $\eta_{wm}$ has the same definition as in Section 3.2.5.
Let Ω be a random variable with the same distribution as $c^T \eta_d$ and let $M_0$, P, Q and R be random variables for which $m_{0,j}$, $p_j$, $q_j$ and $r_j$ are realizations, respectively. Then, we have: Additionally, let S be a random variable for which $s_j$ is a realization. Now, it must include the contribution due to the denoising cost function so that: Therefore, for a given realization (ξ, z, δ, ν) of (Ξ, Z, D, V), we must solve the following fourth-degree equation to get samples of Γ (notice that, in order to generate samples, we must previously build the random vectors $\eta_d$ and c): where: (αξ + z) 2 A 2 = A 21 + A 22 with A 21 and A 22 given by: Finally, A 5 = A 51 + A 52 + A 53 + A 54 with: Finally, we can generate samples of $\Delta w$ as:

Appendix B. Verification of Assumptions
Appendix B.1. Affine Growth Hypothesis for the Weights
In order to check the affine growth hypothesis that we introduced in (18), we extract the weight values from the embedding layer on each iteration. Figures A2 and A3 show the evolution with k of four randomly selected weights when we respectively use: (i) SGD with orthogonal projectors, and (ii) Adam with Gaussian projectors. As is evident, the hypothesis holds quite accurately for these particular examples.
For a more general approach encompassing all weights, we carry out a linear regression on the evolution of each individual weight with k. Then, for each j ∈ {1, …, N}, we measure the correlation coefficient, $\rho_j$, between the observed values (the experimental weight values) and the predictor values. Figure A4 represents the Empirical Cumulative Distribution Function (ECDF) of $\rho_j$ when using SGD with orthogonal projectors and Adam with both Gaussian and orthogonal projectors. We can state that, when we use Gaussian projectors with Adam optimization, $\rho_j > 0.9410$ for 90% of the weights at the embedding layer and $\rho_j > 0.9970$ for 80% of them. Moreover, the affine hypothesis is even stronger when using orthogonal projectors, with either SGD or Adam optimization, since $\rho_j > 0.9975$ for 95% of the weights at the embedding layer.
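The per-weight check described above can be sketched as follows. This is an illustration with hypothetical function names: each weight trajectory is fitted by ordinary least squares against the iteration index, and ρ is the correlation coefficient between the observed values and the fitted ones. A trajectory that is exactly constant has zero fitted variance and is not handled by this sketch.

```python
import math


def linear_fit_correlation(values):
    """Fit values[k] ~ a*k + b and return the correlation between data and fit."""
    n = len(values)
    ks = list(range(n))
    mean_k = sum(ks) / n
    mean_v = sum(values) / n
    skk = sum((k - mean_k) ** 2 for k in ks)
    skv = sum((k - mean_k) * (v - mean_v) for k, v in zip(ks, values))
    slope = skv / skk
    intercept = mean_v - slope * mean_k
    fitted = [slope * k + intercept for k in ks]
    mean_f = sum(fitted) / n
    cov = sum((v - mean_v) * (f - mean_f) for v, f in zip(values, fitted))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_f = sum((f - mean_f) ** 2 for f in fitted)
    return cov / math.sqrt(var_v * var_f)


def ecdf(samples, x):
    """Empirical CDF: fraction of samples less than or equal to x."""
    return sum(1 for s in samples if s <= x) / len(samples)


# Toy trajectories: one exactly affine, one with a small perturbation.
affine = [0.1 * k + 0.5 for k in range(10)]
perturbed = [v + (0.02 if k % 2 else -0.02) for k, v in enumerate(affine)]
rhos = [linear_fit_correlation(affine), linear_fit_correlation(perturbed)]
fraction_below = ecdf(rhos, 0.9999)
```

Applying `linear_fit_correlation` to every weight of the embedding layer and evaluating `ecdf` on the resulting values gives a curve of the kind shown in Figure A4.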
Because these percentages show a highly linear behavior, we can confirm the validity of the affine hypothesis for the weights, which is key to our theoretical analysis.

We first check the validity of the approximation $w^{(0)} \approx 0$ introduced in the analysis for SGD optimization. From (23), we can get η from the parameter settings detailed in Section 6.1 for the orthogonal case. We compute the gradient g ∇ .