Dynamics of Fourier Modes in Torus Generative Adversarial Networks

Generative Adversarial Networks (GANs) are powerful Machine Learning models capable of generating fully synthetic samples of a desired phenomenon with a high resolution. Despite their success, the training process of a GAN is highly unstable, and typically it is necessary to implement several accessory heuristics to the networks to reach an acceptable convergence of the model. In this paper, we introduce a novel method to analyze the convergence and stability in the training of Generative Adversarial Networks. For this purpose, we propose to decompose the objective function of the adversarial min-max game defining a periodic GAN into its Fourier series. By studying the dynamics of the truncated Fourier series for the continuous Alternating Gradient Descent algorithm, we are able to approximate the real flow and to identify the main features of the convergence of the GAN. This approach is confirmed empirically by studying the training flow in a 2-parametric GAN aiming to generate an unknown exponential distribution. As a byproduct, we show that convergent orbits in GANs are small perturbations of periodic orbits, so the Nash equilibria are spiral attractors. This theoretically justifies the slow and unstable training observed in GANs.


Introduction
Since their very inception, Generative Adversarial Networks (GANs) have revolutionized the areas of Machine Learning and Deep Learning. They address very successfully one of the most outstanding problems in pattern recognition: given a collection of examples of a certain phenomenon that we want to replicate, construct a generative model able to create new, completely synthetic instances following the same patterns as the original ones. Ideally, the goal would be to capture the underlying pattern so subtly that no external critic would be able to distinguish between real samples and synthesized instances.
The proposal of Goodfellow et al. [13] is to confront two neural networks in an adversarial game to solve this problem. More precisely, the proposal was to consider a neural network G playing the role of a generator agent, and a network D acting as discriminator. The discriminator D is trained to distinguish as accurately as possible between real samples and fake/synthetic samples. On the other hand, G aims to generate synthetic instances of such high quality that D is barely able to distinguish them from real data. The two networks are, thus, in effective competition. When, as a byproduct of this competition, the agents reach an optimal point, we get a generator able to produce almost indistinguishable synthetic samples as well as a discriminator very proficient in classifying real and fake instances.
The way in which these networks are trained to reach this optimal point is through a common objective function. Explicitly, in [13] it is proposed to consider the function

F(θ_D, θ_G) = E_Ω [log D_θD(X)] + E_Λ [log(1 − D_θD(G_θG))],

where θ_D are the inner weights of D, θ_G the weights of G, Ω is the probability space of the real data and Λ is the latent probability space from which G samples noise to be transformed into synthetic instances. In this manner, F is essentially the error that D suffers in the classification problem between real and fake examples, so D tries to maximize it and G to minimize it. Hence, it gives rise to a non-convex min-max game, and the goal of the training process is to reach a Nash equilibrium of it.
Several training approaches have been proposed to reach these Nash equilibria, but the most widely used method is the so-called Alternating Gradient Descent (AGD). Roughly speaking, the idea is to, alternately, train D by tuning θ_D with cost function F and weights θ_G fixed and, after a certain number of epochs, to reverse the roles and update θ_G with cost function −F and weights θ_D fixed. This optimization procedure has led to astonishing results, particularly in the domain of image processing and generation. Using several architectures and sophisticated multi-level training, GANs are able to generate images of such high quality that a human eye is not capable of distinguishing them from real images [16].
Despite these achievements, the stability of the AGD algorithm for GANs is a major issue. In [21], the authors proved that the Nash equilibria for GANs are locally stable provided that some ideal conditions on the optimality of the equilibria are fulfilled. Nevertheless, these conditions may be unfeasible, as shown in [19], so actual convergence and stability are not guaranteed in real applications. In particular, one of the most challenging problems arising during the training of GANs is the so-called mode collapse [12]. This state is characterized by a generator that has degenerated into a network that is only able to generate a single synthetic sample (or a very small number of them) with almost no variation, and which the discriminator confuses with a real sample (typically, because the synthetic sample is actually very close to a real one). In this state, the system is no longer a generative model, but simply a copier of real data.
Furthermore, by construction, neural network-based GANs have some intrinsic constraints on their expressivity that lead to very unrealistic synthetic samples in contexts far from image generation. For instance, neural networks produce a smooth output function, which makes it very difficult for GANs to deal with the generation of real samples drawn from a discrete distribution (e.g. according to an exponential distribution) [18], or with some drastic semantic restrictions (e.g. non-negative values for counters) [10]. These scenarios do not typically appear in image generation, but are common in other domains like data augmentation for Machine Learning [1]. These problems lead to additional inconveniences for stable convergence and usually give rise to highly unstable models that require very handcrafted stopping criteria and optimization heuristics.
A multitude of works have been oriented towards a deeper understanding of the instability of the training of GANs, as well as towards proposing solutions. A thorough theoretical study of the sources of instability and their causes can be found in [2], and in [6,7] the authors analyze the real capability of the GAN to learn the distribution through both a theoretical and an empirical approach. In addition, in order to mitigate the instability of the training, in [25] the authors propose a collection of heuristic methods, through variations of the standard backpropagation algorithm, that contribute to stabilizing the training process of GANs. Moreover, in [23] the use of regularization procedures is proposed to speed up the convergence.
Another very active research line is the proposal of alternative models for GANs that guarantee better convergence. It is well known that the key reason why a GAN should capture the original distribution is that it implicitly optimizes the Jensen-Shannon divergence (JSD) between the real underlying distribution and the generated distribution of the synthetic data [13]. In order to change this framework, in [3] the authors propose to modify the cost function in such a way that the new GAN does not optimize JSD but an Earth-mover distance known as the Wasserstein distance, giving rise to the celebrated Wasserstein Generative Adversarial Networks (WGANs). In a similar vein, in [22] it is proposed to use the f-divergence (a divergence in the spirit of the Kullback-Leibler divergence) as the criterion for training GANs. Even genetic algorithms have been used to stabilize the training process, as in [27], where the authors applied genetic programming to optimize the use of different adversarial training objectives and evolved a population of generators to adapt to the discriminator, which acts as the hostile environment driving evolution. Nevertheless, despite all these efforts, no master method is currently available and, hence, assuring a fast, or even effective, convergence of GANs is an open problem.
Our contribution. In this paper we propose a novel method to analyze the convergence of GANs through Fourier analysis. Concretely, we propose to approximate the objective function F by its Fourier series, truncated with enough precision that the local dynamics of F can be understood by means of a trigonometric polynomial.
Recall that any function F(θ): T^n → C defined on the n-dimensional torus T^n = (S^1)^n (equivalently, a Z^n-periodic function on R^n) can be decomposed into a series of complex exponential functions, known as its Fourier series,

F(θ) = Σ_{m ∈ Z^n} α_m e^{2πi m·θ},

where the series is indexed by the so-called Fourier modes or frequencies, m, defined on the rectangular lattice Z^n ⊆ R^n. In principle, the previous equality must be understood as a decomposition in the Hilbert space of square-integrable functions, L²(T^n). However, if F has enough regularity, then the Fourier series on the right hand side also converges uniformly to the original function F. This implies that, taking enough Fourier modes, F can be effectively approximated by a truncated Fourier series. Moreover, if F is real-valued, expressing the complex exponential as a combination of sine and cosine functions, we obtain an approximation of F by a trigonometric polynomial, Θ(F).
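This truncation can be illustrated numerically. The following sketch approximates a smooth 1-periodic function by its truncated Fourier series and measures the sup-norm error; the test function exp(sin(2πt)) and the grid sizes are our illustrative choices, not examples from the paper.

```python
import numpy as np

# Uniform convergence of the truncated Fourier series of a smooth periodic
# function (illustrative choice of F, not taken from the paper).
def F(t):
    return np.exp(np.sin(2 * np.pi * t))

def fourier_coeff(m, K=4096):
    # alpha_m = int_0^1 F(t) exp(-2*pi*i*m*t) dt, approximated by an equispaced
    # Riemann sum (spectrally accurate for smooth periodic integrands)
    t = np.arange(K) / K
    return np.mean(F(t) * np.exp(-2j * np.pi * m * t))

def truncated(N, t):
    # Theta_N(F)(t) = sum_{|m| <= N} alpha_m exp(2*pi*i*m*t)
    modes = np.arange(-N, N + 1)
    coeffs = np.array([fourier_coeff(m) for m in modes])
    return (np.exp(2j * np.pi * np.outer(t, modes)) @ coeffs).real

t = np.linspace(0, 1, 200, endpoint=False)
errs = [np.max(np.abs(F(t) - truncated(N, t))) for N in (1, 2, 4, 8)]
print(errs)  # sup-norm error drops rapidly as N grows
```

For a smooth function the coefficients decay super-polynomially, so a handful of modes already gives a very accurate trigonometric polynomial.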
This approximation can be applied to the study of the convergence of GANs as follows. The continuous version of the AGD algorithm can be thought of as a path of weights, (θ_D(t), θ_G(t)), depending on the time parameter t ∈ R. In particular, (θ_D(0), θ_G(0)) are the initial random weights of the GAN and (θ_D(t), θ_G(t)) determine the state of the networks after training for a time t > 0. In this manner, if we seek to increase F(θ_D, θ_G) in the direction θ_D, and to decrease it in the direction θ_G, the AGD gives rise to a system of Ordinary Differential Equations (ODEs) given by

θ̇_D = ∇_D F(θ_D, θ_G),   θ̇_G = −∇_G F(θ_D, θ_G),

where θ̇_D and θ̇_G denote the derivatives of the functions θ_D(t) and θ_G(t) with respect to the time t. This flow aims to converge to a Nash equilibrium of the objective function F of the GAN and, for this reason, we will refer to it as the Nash flow.
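As a toy illustration of this flow (the bilinear objective below is our choice, not the GAN objective of the paper), one can integrate the Nash flow of F(θ_D, θ_G) = θ_D·θ_G numerically; its equilibrium at the origin is a center, so the orbit is periodic rather than convergent:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Continuous-time Nash flow for the toy objective F(a, b) = a * b
# (a plays the role of theta_D, b of theta_G).  D ascends F, G descends it:
#   a' =  dF/da =  b
#   b' = -dF/db = -a
def nash_field(t, y):
    a, b = y
    return [b, -a]

sol = solve_ivp(nash_field, (0.0, 2 * np.pi), [1.0, 0.0],
                rtol=1e-10, atol=1e-12)
yT = sol.y[:, -1]
print(yT)  # after time 2*pi the orbit closes up: (0, 0) is a center
```

This is the simplest instance of the non-convergent periodic behavior discussed throughout the paper.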
However, in many interesting cases the function F may be very involved and lack an analytic closed-form expression which would enable an explicit analysis (e.g. even in the toy example of Equation (13) the cost function is analytically intractable). To address this problem, we propose to approximate F by its truncated Fourier series, Θ(F). In this way, at least locally, the dynamics of the original Nash flow can be read from the solutions to the simplified system

θ̇_D = ∇_D Θ(F)(θ_D, θ_G),   θ̇_G = −∇_G Θ(F)(θ_D, θ_G).

In order to analyze this system of ODEs, we propose a novel method focused on studying the dynamics of the Nash flow on Fourier basis functions and on subsequent further approximations. As we will see, for the Nash flow of a basic trigonometric function, the Nash equilibria are not attractors of the flow but centers, that is, they are surrounded by periodic orbits that spin around the critical point. When we consider more Fourier modes in the Fourier expansion of F, these periodic orbits may break, leading to spiral attractors or spiral repulsors. The conditions that bifurcate the centers into spiral sinks or sources can be given explicitly in terms of the combinatorics of the considered Fourier modes.
This provides a theoretical justification of the empirically observed instability of GAN training: the convergent orbits towards a Nash equilibrium are mere perturbations of periodic orbits, falling slowly and spirally to the optimal point. For this reason, small variations in the training hyperparameters, like the learning rate, the number of epochs or the batch size, may lead to very different dynamics, which confers on the training its characteristic instability. In addition, in this paper we will empirically evaluate this method against a GAN that aims to generate samples according to an unknown exponential distribution. To facilitate the visualization, we consider a simple GAN, with a 1-dimensional parameter space for each network, in such a way that the Nash flow can be plotted as a planar path. We will show that the proposed approach allows us to understand the simplified dynamics of the GAN and to extract qualitative information about the Nash flow.
It is worth mentioning that, in order to have a natural Fourier series, the considered objective function F of the GAN must be periodic. This may seem unrealistic in real-life GANs, but it is actually not a very strong condition. Usually, seeking to prove theoretical results about the convergence of GANs, many works force F to have compact support (for instance, to assure that it is Lipschitz, as in WGANs). In practice, this is accomplished by clipping the output of the generator and discriminator functions for large inputs. This artificially turns the objective function into a periodic function and, thus, it can be studied through the method introduced in this paper. We expect that this work will open the door to new methods for analyzing and quantifying the convergence of GANs by importing well-established techniques of harmonic analysis and dynamical systems on closed manifolds, as studied in global analysis.
The structure of this paper is as follows. In Section 2 we review the theoretical fundamentals of GANs and their associated objective function and training method. In Section 2.1 we briefly sketch some basic concepts of Morse theory, a very successful theory that allows us to relate analytic properties of the function to be optimized with the topological properties of the underlying space. In Section 2.2 we introduce the Nash flow and discuss some of the problems arising for its convergence. In Section 3 we introduce torus GANs and, particularly, in Section 3.1 we explain how to perform Fourier analysis on the torus. Section 4 is devoted to the analysis of the Nash flow for truncated Fourier series, both for basic functions (Sections 4.1 and 4.2) and for more complicated combinations (Sections 4.3 and 4.4). In addition, in Section 5 the empirical testing of this method is performed, with comparisons between the real dynamics and the predicted ideal dynamics. Finally, in Section 7 we summarize some of the key ideas of this paper and sketch some lines of future work.
Acknowledgements. The authors thank David Fontecha and María del Mar González for their careful reading of this manuscript and for pointing out several typos in a previous version. This work was supported in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant 833685 (SPIDER).

GANs dynamics
As introduced by Goodfellow in [13], a GAN network is a competitive model in which two intelligent agents (typically two neural networks) compete to improve their performance and to generate very precise samples according to a given distribution.
To be precise, let X : Ω → R^d be a d-dimensional random vector, defined on a certain probability space Ω. This random vector X should be understood as a very complex phenomenon whose instances we would like to replicate. For this purpose, we consider two functions

D: R^d × Θ_D → [0, 1],   G: Λ × Θ_G → R^d,

called the discriminator and the generator, respectively. Here, Λ is a probability space, called the latent space, and Θ_D, Θ_G are two given topological spaces. These functions should be seen as parametric families of functions D_θD: R^d → [0, 1] and G_θG: Λ → R^d. The aim of the GAN is to tune the parameters θ_D and θ_G in such a way that, given x ∈ R^d, D_θD(x) intends to predict whether x = X(ω) for some ω ∈ Ω or not, i.e. whether x is compatible with being a real instance or it is a fake datum. Observe that, throughout this paper, we will follow the convention that D_θD(x) is the probability of being a real instance, so D_θD(x) = 1 means that D_θD is sure that x is real and D_θD(x) = 0 means that D_θD is sure that x is fake. On the other hand, the generative function, G_θG, is a d-dimensional random vector that seeks to converge in distribution to the original distribution X. Typically, the probability space Λ is R^l with a certain standard probability distribution λ, such as the spherical normal distribution or a uniform distribution on the unit cube.
Remark 2.1. In typical applications in Machine Learning, Ω is given by a finite set Ω = {x 1 , . . . , x N }, with x i ∈ R d , and endowed with a discrete probability (typically, the uniform one) so X is just the identity function. In customary applications of GANs, we have that the instances x i are images, represented by their pixel map, so the objective of the GAN is to generate new images as similar as possible to the ones in the dataset Ω.
The competition appears because the agents D and G try to improve objectives that cannot be simultaneously satisfied. On the one hand, D tries to improve its performance in the classification problem but, on the other hand, G tries to generate results as good as possible in order to cheat D. To be precise, recall that a perfect fit for the classification problem for D_θD is given by D_θD(x) = 1 if x is an instance of X and D_θD(x) = 0 if not. Hence, the L¹ error made by D_θD with respect to perfect classification is

E(θ_D, θ_G) = E_Ω [1 − D_θD(X)] + E_Λ [D_θD(G_θG)],

where E_Ω and E_Λ denote the mathematical expectation on Ω and Λ, respectively. In this way, the objective of D_θD is to minimize E while the goal of G_θG is to maximize it. It is customary in the literature to consider as objective the function 1 − E and to weight the error with a certain smooth concave function f : R → R. In this way, the final cost function is

F(θ_D, θ_G) = E_Ω [f(D_θD(X))] + E_Λ [f(−D_θD(G_θG))].

Remark 2.2. Typical choices for the weight function f are f(s) = −log(1 + exp(−s)), as in the original paper of Goodfellow [13], or f(s) = s as in the Wasserstein GAN [4].
However, in sharp contrast with what is typical in Machine Learning, the aim of the GAN is not to maximize/minimize F. The objectives of the D and G agents are opposed: while D tries to maximize F, the generator tries to minimize it. In this vein, the objective of the GAN is to solve the min-max problem

(2)   min_{θ_G} max_{θ_D} F(θ_D, θ_G).

In the case that the latent space Λ is naturally equipped with a topology (as in the case Λ = (R^l, λ)), it is customary to require that F : Θ_D × Θ_G → R is a continuous function. In addition, in our case Θ_G and Θ_D will be differentiable manifolds, so we will require that both D and G are C² maps in both arguments and, thus, F is a differentiable function on Θ_D × Θ_G.
To be precise, the algorithm proposed by Goodfellow [13] suggests freezing the internal weights of G and using it to generate a batch of fake examples from Λ. With this set of fake instances and another batch of real instances created using X (i.e. sampling randomly from the dataset of real instances), we train D to improve its accuracy in the classification problem with the usual backpropagation (i.e. gradient descent) method. Afterwards, we freeze the weights of D, we sample a batch of latent data of Λ (i.e. we randomly sample noise using the latent distribution) and we use it to train G using gradient descent with objective function θ_G → E_Λ f(−D(G_θG)). Finally, we can alternate this process as many times as needed until we reach the desired results. Several metrics have been proposed to quantify this performance, especially in the domain of image generation, like the Inception Score (IS) [25], the Fréchet Inception Distance (FID) [15] or perceptual similarity measures [26]. For a survey of these techniques, please refer to [9].
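The alternating scheme just described can be sketched in a few lines. The following is a schematic version of AGD for a 1-parameter discriminator and generator; the toy concave-convex objective, the step sizes and the numerical gradients are our illustrative choices, not the exact setup of [13].

```python
import numpy as np

# Schematic Alternating Gradient Descent (AGD): train D for k_D steps on F
# with theta_G frozen, then train G for k_G steps on -F with theta_D frozen,
# and repeat.
def agd(F, theta_D, theta_G, lr=0.1, k_D=5, k_G=5, rounds=200, h=1e-5):
    grad = lambda f, x: (f(x + h) - f(x - h)) / (2 * h)  # 1-d numerical gradient
    for _ in range(rounds):
        for _ in range(k_D):   # D ascends F in theta_D
            theta_D += lr * grad(lambda d: F(d, theta_G), theta_D)
        for _ in range(k_G):   # G descends F in theta_G
            theta_G -= lr * grad(lambda g: F(theta_D, g), theta_G)
    return theta_D, theta_G

# Toy concave-convex objective with Nash equilibrium at (0, 0):
# D maximizes -d**2 (pushing d to 0), G minimizes g**2 (pushing g to 0).
F = lambda d, g: -d**2 + g**2
print(agd(F, 0.7, -0.5))
```

In real GANs each inner step is a minibatch backpropagation pass over the network weights, but the alternation structure is exactly this one.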
2.1. Review of Morse theory. Let us suppose for a moment that, instead of looking for solutions of (2), we were seeking local maxima of F. In this situation, the standard approach in Machine Learning is to consider the Morse flow, also known as the gradient ascent flow. For it, let us fix Riemannian metrics on Θ_D and Θ_G. Using them, we can compute the gradients of F along Θ_D and Θ_G, denoted ∇_D F and ∇_G F respectively. Then, the Morse flow is the differentiable flow on Θ_D × Θ_G generated by the vector field ∇F = (∇_D F, ∇_G F). Explicitly, it is given by the system of ODEs

(3)   θ̇_D = ∇_D F(θ_D, θ_G),   θ̇_G = ∇_G F(θ_D, θ_G).

This flow has been the object of very intense study in the context of differential geometry and geometric topology. For instance, it is the crucial tool used in Smale's proof of the Poincaré conjecture in high dimension [20], and has been successfully used to understand the topology of moduli spaces of solutions to highly non-linear Partial Differential Equations coming from theoretical physics [8], among others.
Obviously, the critical points of the system (3) are exactly the critical points of F, in the sense that the differential dF|_(θ⁰_D, θ⁰_G) = 0. In order to control the dynamics of this ODE around a critical point, a key concept is the notion of the index of a critical point.
Given a critical point p of F, its index, ind(p), is the number of negative eigenvalues of the Hessian of F at p, so 0 ≤ ind(p) ≤ d_D + d_G, where d_D and d_G are the dimensions of Θ_D and Θ_G respectively. Here, the Hessian is the matrix of second derivatives

Hess(F) = [ ∂²F/∂θ_D²      ∂²F/∂θ_D∂θ_G ]
          [ ∂²F/∂θ_G∂θ_D   ∂²F/∂θ_G²    ].

In particular, the only sinks of the Morse flow are precisely the local maxima of F, at which Hess(F) is negative-definite and, thus, ind(p) = d_D + d_G. Another important fact that we will use is the following topological interpretation of the indices, known as the Poincaré-Hopf theorem. It claims that, if Θ_D and Θ_G are compact and F is Morse, then

(4)   Σ_{p ∈ Crit(F)} (−1)^{ind(p)} = χ(Θ_D × Θ_G).

Here, Crit(F) denotes the (finite) set of critical points of F and χ is the Euler characteristic of the space.
2.2. The Nash flow. Now, let us come back to our optimization problem (2). Despite the simplicity of the formulation of the cost function, this problem is very far from being trivial. The best scenario would be to obtain a so-called Nash equilibrium, that is, a point (θ⁰_D, θ⁰_G) such that F(θ_D, θ⁰_G) ≤ F(θ⁰_D, θ⁰_G) ≤ F(θ⁰_D, θ_G) for all θ_D ∈ Θ_D and θ_G ∈ Θ_G, so that no agent can improve its objective by unilaterally changing its parameters.
Remark 2.5. A Nash equilibrium is in particular a critical point of F .
In this vein, it is natural to consider a differentiable flow analogous to (3) but converging to Nash equilibria. For this purpose, fix Riemannian metrics on Θ_D and Θ_G as above and consider the gradient ∇F = (∇_D F, ∇_G F). Now, we twist the gradient to consider the Nash vector field

N(F) = (∇_D F, −∇_G F).

Definition 2.6. The Nash flow is the differentiable flow on Θ_D × Θ_G generated by the Nash vector field N(F). Explicitly, it is the system of ODEs

θ̇_D = ∇_D F(θ_D, θ_G),   θ̇_G = −∇_G F(θ_D, θ_G).

This flow (or, more precisely, the associated discrete-time version known as the AGD flow) has been intensively used for training GANs from their very inception. Already in Goodfellow's seminal paper [13], this flow is proposed as a method for seeking Nash equilibria of the game (2).
To understand the dynamics of the Nash flow, let us study it around a critical point. Working in a local chart around a critical point, with an adapted basis, we have that the differential of the Nash vector field is the Nash Hessian

N Hess(F) = [ ∂²F/∂θ_D²       ∂²F/∂θ_D∂θ_G ]
            [ −∂²F/∂θ_G∂θ_D   −∂²F/∂θ_G²   ].

In this manner, in a small neighborhood of a critical point (θ⁰_D, θ⁰_G) ∈ Θ_D × Θ_G of F (in particular, around a Nash equilibrium), the dynamics are determined by the linearized system

(θ̇_D, θ̇_G) = N Hess(F)|_(θ⁰_D, θ⁰_G) · (θ_D − θ⁰_D, θ_G − θ⁰_G).

However, in sharp contrast with the Morse flow, even if F has non-degenerate critical points, it may happen that the Nash equilibria are not attractors. For instance, if the Nash Hessian has vanishing diagonal (as in Section 4.2), then periodic orbits arise around the critical point and the flow is non-convergent.
Nonetheless, this behavior can be controlled. Suppose for simplicity that d_D = d_G = 1 (higher dimensional scenarios can be treated analogously by splitting the tangent space). In that case, the eigenvalues of N Hess(F) are either both real or a pair of complex conjugates.
• If the eigenvalues are real, around a Nash equilibrium both eigenvalues must be non-positive, since in the usual Hessian the diagonal entries have different signs. Hence, the Nash equilibrium is a non-repulsor of the Nash flow. Moreover, if F is Morse, then the eigenvalues do not vanish and, thus, the Nash equilibrium is an attractor.
• If the eigenvalues are complex conjugates, say λ, λ̄ ∈ C, then the dynamics are controlled by the real part of λ, Re(λ). There is an invariant way of computing this quantity through the trace of N Hess(F), since

2 Re(λ) = tr N Hess(F) = ∂²F/∂θ_D² − ∂²F/∂θ_G².

Observe that this is nothing but the wave operator acting on F. In the case that this trace is negative, the critical point is an attractor with spiral dynamics; if it is positive, it is a repulsor; and if it vanishes, it is a center with surrounding periodic orbits.
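This case analysis translates directly into a small classifier of critical points from the 2x2 Nash Hessian. The matrices below are hypothetical examples chosen to exercise each branch.

```python
import numpy as np

# Classify a critical point of the Nash flow from its 2x2 Nash Hessian
#   NHess = [[ F_dd,  F_dg],
#            [-F_gd, -F_gg]],
# following the trace criterion above: for a complex-conjugate pair, the
# trace equals 2*Re(lambda), so its sign decides spiral attraction/repulsion.
def classify(NH, tol=1e-12):
    ev = np.linalg.eigvals(NH)
    if np.max(np.abs(ev.imag)) < tol:   # both eigenvalues real
        if np.max(ev.real) < 0:
            return "attractor"
        if np.min(ev.real) > 0:
            return "repulsor"
        return "saddle"
    tr = np.trace(NH)                   # = 2*Re(lambda) for a conjugate pair
    if tr < -tol:
        return "spiral attractor"
    if tr > tol:
        return "spiral repulsor"
    return "center"

print(classify(np.array([[0.0, 1.0], [-1.0, 0.0]])))    # vanishing diagonal
print(classify(np.array([[-0.1, 1.0], [-1.0, -0.1]])))  # small negative trace
print(classify(np.array([[-1.0, 0.0], [0.0, -2.0]])))   # real negative spectrum
```

The first matrix has a vanishing diagonal, the situation of Section 4.2, and is classified as a center; perturbing the diagonal towards negative trace turns it into a spiral attractor.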
It is worth mentioning that, in the case of GANs, the function F of (2) to be optimized does not define a convex-concave game so, in general, the convergence of the usual training methods through the Nash flow is not guaranteed [21]. Under some ideal assumptions on the behaviour of the game around the Nash equilibrium points, in [21] the authors proved that the Nash flow is locally asymptotically stable. However, the hypotheses needed to apply this result are quite strong and seem to be unfeasible in practice. For instance, in [19], the authors show an example of a very simple GAN, the so-called Dirac GAN, for which the usual gradient descent does not converge.

Torus GANs
From now on, let us focus on a very particular case of GAN, which we shall call a torus GAN. Let us denote by T^n = S^1 × ... × S^1 (n times) the n-dimensional torus. Then, we will take as parameter spaces Θ_D = T^{d_D} and Θ_G = T^{d_G}. In this way, the cost functional becomes a function

F: T^{d_D} × T^{d_G} → R.

Remark 3.1. This particular choice is not as arbitrary as it may seem at first sight. In the end, a torus GAN is any GAN in which the generator and discriminator are periodic functions of their parameters θ_D and θ_G for some large enough period. In standard neural network-based GANs it is customary to clip the output of the neural network in order to prevent the internal weights from becoming arbitrarily large. This is particularly important for Wasserstein GANs, where the objective function is required to be Lipschitz and this is achieved by forcing the cost function to have compact support. In this way, after clipping, both the generator and the discriminator agents are periodic functions and, thus, they define a torus GAN.
Working on the torus has important consequences for the dynamics of the Morse flow. Some of them are the following:
• Divergent orbits are not allowed. Since T^n is compact, standard results on prolongability of short-time solutions show that the orbits of any vector flow cannot blow up. Intuitively, they cannot escape by tending to infinity. In particular, if F is a Morse function, all the orbits in the Morse flow must converge to a critical point. This is a consequence of the fact that, along a non-constant orbit θ(t) of the Morse flow, the function F is strictly increasing, since

d/dt F(θ(t)) = ⟨∇F(θ(t)), θ̇(t)⟩ = |∇F(θ(t))|² > 0.

Thus, since F is bounded, the flow is forced to converge to a constant orbit, that is, to a critical point of F. This prevents the appearance of periodic orbits in the Morse flow. In the Nash flow, this may no longer hold and periodic orbits may arise (as in Section 4.2).
• Topological restrictions. The Euler characteristic of T^n is χ(T^n) = χ(S^1)^n = 0. Hence, equation (4) implies that

Σ_{p ∈ Crit(F)} (−1)^{ind(p)} = 0.

In other words, there is the same number of critical points of even index as of odd index. In particular, if d_D = d_G = 1, there are as many saddle points (which are points of index 1) as maxima and minima (which are points of index 2 or 0).
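This vanishing signed count can be checked numerically. The sketch below enumerates the critical points of the illustrative Morse function F(θ_1, θ_2) = sin(2πθ_1) sin(2πθ_2) on T² (our choice of example) and verifies that the Poincaré-Hopf sum is zero.

```python
import numpy as np
from itertools import product

# Poincare-Hopf check on T^2 for F(t1, t2) = sin(2 pi t1) sin(2 pi t2):
# sum over critical points of (-1)^{ind(p)} must equal chi(T^2) = 0.
def index(H):
    # Morse index = number of negative eigenvalues of the Hessian
    return int(np.sum(np.linalg.eigvalsh(H) < 0))

total = 0
# All critical points of this F lie on the quarter-integer lattice (k1/4, k2/4);
# we enumerate it and keep the points where the gradient vanishes.
for k1, k2 in product(range(4), repeat=2):
    t1, t2 = k1 / 4.0, k2 / 4.0
    g = 2 * np.pi * np.array(
        [np.cos(2*np.pi*t1) * np.sin(2*np.pi*t2),
         np.sin(2*np.pi*t1) * np.cos(2*np.pi*t2)])
    if np.linalg.norm(g) > 1e-9:
        continue
    H = (2*np.pi)**2 * np.array(
        [[-np.sin(2*np.pi*t1)*np.sin(2*np.pi*t2),  np.cos(2*np.pi*t1)*np.cos(2*np.pi*t2)],
         [ np.cos(2*np.pi*t1)*np.cos(2*np.pi*t2), -np.sin(2*np.pi*t1)*np.sin(2*np.pi*t2)]])
    total += (-1) ** index(H)
print(total)
```

For this F one finds 4 saddles, 2 maxima and 2 minima, so the signed count is 4·(+1) + 4·(−1) = 0, as predicted.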

3.1.
Fourier analysis in the torus. In order to understand the cost function F of a torus GAN, we shall apply techniques of harmonic analysis to it. We will suppose that the reader is familiar with basic notions of Fourier and harmonic analysis, like Hilbert spaces and orthogonal Schauder bases on them. Otherwise, please refer to [24].
Let us view T^n = R^n/Z^n, so that functions on T^n are functions on R^n which are 1-periodic in each variable. Recall that a fundamental result of Fourier analysis is that the space L²(T^n) of complex-valued square-integrable functions on T^n is a Hilbert space with inner product given by

⟨F, G⟩ = ∫_{T^n} F(θ) \overline{G(θ)} dθ.

Moreover, this space is spanned by the orthonormal basis of functions

e_m(θ) = e^{2πi m·θ},

where m = (m_1, ..., m_n) ∈ Z^n, θ = (θ_1, ..., θ_n) ∈ T^n and m·θ = m_1θ_1 + ... + m_nθ_n is the standard inner product. In other words, any F ∈ L²(T^n) can be uniquely written as a sum

F(θ) = Σ_{m ∈ Z^n} α_m e_m(θ),

in the sense that this sum is convergent in L²(T^n) and converges to F. This expression is referred to as the Fourier series of F. The coefficients α_m are called the Fourier coefficients or the Fourier modes of F. Using the orthogonality of the functions e_m(θ), they can be obtained as

α_m = ⟨F, e_m⟩ = ∫_{T^n} F(θ) e^{−2πi m·θ} dθ.

In principle, the convergence of the Fourier series to F is only in the L² sense (c.f. [11] for a Fourier series of a continuous function not converging pointwise everywhere, or [17] for an everywhere divergent Fourier series of an L¹ function). However, if F is C¹, since we are working on a compact space, it is automatically Hölder and, thus, its Fourier series converges uniformly [28]. This means that, for every ε > 0,

sup_θ | F(θ) − Σ_{|m| ≤ N} α_m e_m(θ) | < ε

for all N large enough. Similar approximations can be obtained for the k first derivatives of F if it has enough regularity (concretely, if it is C^{k+1}).
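On an equispaced grid, the coefficient integrals above reduce to a discrete Fourier transform. The following sketch recovers the coefficients of a known trigonometric polynomial on T² via a 2-d FFT; the test function and grid size are our illustrative choices.

```python
import numpy as np

# Fourier coefficients on T^2:
#   alpha_m = int_{T^2} F(theta) exp(-2 pi i m.theta) dtheta,
# approximated by np.fft.fft2 on a K x K equispaced grid (the DFT divided
# by K**2 gives alpha_m at index (m1 mod K, m2 mod K)).
K = 64
t = np.arange(K) / K
T1, T2 = np.meshgrid(t, t, indexing="ij")
F = np.sin(2*np.pi*T1) * np.sin(2*np.pi*2*T2)   # only modes m = (+-1, +-2)

alpha = np.fft.fft2(F) / K**2
# sin(a) sin(b) = (1/4) (e^{i(a-b)} + e^{-i(a-b)} - e^{i(a+b)} - e^{-i(a+b)})
print(alpha[1, 2])    # coefficient of m = (1, 2):  -1/4
print(alpha[1, -2])   # coefficient of m = (1, -2): +1/4
```

For a trigonometric polynomial the grid values determine the coefficients exactly, so the recovered values match the analytic ones up to rounding.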
This approximation is very useful for estimating the associated flow. Recall that, using Grönwall's inequality [14], if X, Y are two Lipschitz vector fields, then there exists a constant M > 0 such that their associated flows θ(t) and ϑ(t), with the same initial condition, satisfy

‖θ(t) − ϑ(t)‖ ≤ (e^{Mt} − 1)/M · sup‖X − Y‖

for all t ≥ 0. In other words, for medium times, the flow of X may be approximated through the flow of Y.
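A numerical illustration of this estimate, for the simple 1-dimensional fields X(u) = −u and Y(u) = −u + ε (our choice, with Lipschitz constant M = 1):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Gronwall-type flow comparison: for M-Lipschitz fields X, Y started at the
# same point,  |theta(t) - vartheta(t)| <= (sup|X - Y| / M) (exp(M t) - 1).
# Here X(u) = -u, Y(u) = -u + eps, so M = 1 and sup|X - Y| = eps.
eps, M = 0.05, 1.0
sol_X = solve_ivp(lambda t, u: -u,       (0, 3), [1.0],
                  dense_output=True, rtol=1e-10, atol=1e-12)
sol_Y = solve_ivp(lambda t, u: -u + eps, (0, 3), [1.0],
                  dense_output=True, rtol=1e-10, atol=1e-12)

ts = np.linspace(0, 3, 100)
gap = np.abs(sol_X.sol(ts) - sol_Y.sol(ts))[0]
bound = (np.exp(M * ts) - 1) * eps / M
print(gap.max(), bound.max())
```

Here both flows are explicit (e^{−t} and ε + (1 − ε)e^{−t}), so the actual gap ε(1 − e^{−t}) stays well below the exponential bound.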
Remark 3.2. The previous estimation implies that, locally, the dynamics of the flows θ(t) and ϑ(t) are similar. In particular, this is useful for analyzing convergence around critical points. Nevertheless, the global dynamics of θ(t) and ϑ(t) may be quite different, say, they may have different numbers of critical points.
In our context, this idea can be exploited as follows. Let us denote by

Θ_N(F)(θ) = Σ_{|m| ≤ N} α_m e_m(θ)

the truncated Fourier series of F. If F is C², then ∇F and ∇Θ_N(F) are close vector fields and, thus, for N large enough,

‖θ(t) − θ_N(t)‖ ≤ (e^{Mt} − 1)/M · sup‖∇F − ∇Θ_N(F)‖,

where θ(t) is the Morse flow for F and θ_N(t) is the Morse flow for Θ_N(F). Working verbatim with the Nash vector fields, we obtain similar estimates for the solutions of the Nash flow.

Dynamics of Fourier basis
In this section, we focus on the Nash flow of truncated approximations of Fourier series of a C 2 function F . As we mentioned above, these solutions approximate quite well the real Nash flow of F for short times (particularly, around critical points).
For the sake of simplicity, in this section we shall focus on the 2-dimensional case in which Θ_D = Θ_G = S^1, so that Θ_D × Θ_G = T². Moreover, we will truncate the Fourier series at level N = 2. Similar arguments can be carried out in higher dimension and with a more accurate precision of the Fourier series, with similar results, but the calculations become more involved.
First of all, let us re-write the Fourier series of F as a trigonometric polynomial. Recall that the trigonometric functions can be obtained from the complex exponential as

cos(2πθ) = (e^{2πiθ} + e^{−2πiθ})/2,   sin(2πθ) = (e^{2πiθ} − e^{−2πiθ})/(2i).
Since the function F is real-valued, we can group the conjugate coefficients α_m and α_{−m} to obtain a formula for the Fourier series in terms of trigonometric functions as

F(θ_1, θ_2) = Σ_{m_1, m_2 ≥ 0} Σ_{α, β ∈ Z_2} a^{α,β}_{m_1,m_2} τ_α(2π m_1 θ_1) τ_β(2π m_2 θ_2),

where τ_0 = sin and τ_1 = cos. The coefficients a^{α,β}_{m_1,m_2} are real numbers that can be obtained as

a^{α,β}_{m_1,m_2} = δ_{m_1,m_2} ∫_{T²} F(θ_1, θ_2) τ_α(2π m_1 θ_1) τ_β(2π m_2 θ_2) dθ_1 dθ_2,

where δ_{m_1,m_2} is a normalization coefficient such that δ_{m_1,m_2} = 1 if m_1 = m_2 = 0; δ_{m_1,m_2} = 2 if either m_1 = 0 and m_2 > 0, or m_1 > 0 and m_2 = 0; and δ_{m_1,m_2} = 4 if m_1, m_2 > 0.
To shorten notation, from now on we shall denote

Λ^{α,β}_{m_1,m_2}(θ_1, θ_2) = τ_α(2π m_1 θ_1) τ_β(2π m_2 θ_2),

where τ_0 = sin and τ_1 = cos. This notation is particularly useful because, for any α, β ∈ Z_2,

∂Λ^{α,β}_{m_1,m_2}/∂θ_1 = (−1)^α 2π m_1 Λ^{α+1,β}_{m_1,m_2},   ∂Λ^{α,β}_{m_1,m_2}/∂θ_2 = (−1)^β 2π m_2 Λ^{α,β+1}_{m_1,m_2},

where the sums α + 1 and β + 1 are interpreted as sums in Z_2.
From this expression of the Fourier series, we will approximate the dynamics of the Nash flow for F by truncating the Fourier series. In particular, we sort the coefficients a^{α,β}_{m_1,m_2} in decreasing order of their absolute value. Looking only at the two largest coefficients, and normalizing so that the leading coefficient is 1, we will consider the approximation to F

(6)   Θ = Λ^{α,β}_{m_1,m_2} + μ Λ^{γ,δ}_{n_1,n_2},

where α, β, γ, δ ∈ Z_2, (m_1, m_2) are the leading Fourier modes, (n_1, n_2) are the second largest modes, and |μ| < 1.

4.1.
Nash flow for single variable Fourier basis. From now on, we aim to analyze the Nash flow for a truncated Fourier series. As we will see in Section 5, from it we can envisage the global dynamics of the Nash flow for the objective function of a GAN.
For this reason, in many cases the effect of the ∆ 1 and the ∆ 2 parts to the dynamics is negligible and can be ignored.

4.2. Nash flow for Fourier basis. In this section, we shall analyze the dynamics of the Nash flow for the remaining Fourier basis. For this purpose, let us consider the function $\Lambda^{\alpha,\beta}_{m_1,m_2}$, for some $\alpha, \beta \in \mathbb{Z}_2$ and $m_1, m_2 \geq 1$. The Nash vector field associated to it is
$$V(\theta_1, \theta_2) = \left( \frac{\partial \Lambda^{\alpha,\beta}_{m_1,m_2}}{\partial \theta_1}, \; -\frac{\partial \Lambda^{\alpha,\beta}_{m_1,m_2}}{\partial \theta_2} \right). \qquad (7)$$
Recall that if $(\theta_1, \theta_2) \in \mathbb{T}^2$ is a zero of $\Lambda^{\alpha,\beta}_{m_1,m_2}$, then it satisfies $4\theta_1 m_1 \equiv 2k_1 + \alpha \mod 4\mathbb{Z}$, or $4\theta_2 m_2 \equiv 2k_2 + \beta \mod 4\mathbb{Z}$, for some $k_1, k_2 \in \mathbb{Z}$. In other words, if we take into account the periodicity of the function $\Lambda^{\alpha,\beta}_{m_1,m_2}$, the common zeros of both trigonometric factors are given by
$$\theta_1 = \frac{2k_1 + \alpha}{4 m_1}, \qquad \theta_2 = \frac{2k_2 + \beta}{4 m_2},$$
for $0 \leq k_1 < 2m_1$ and $0 \leq k_2 < 2m_2$. Observe that all these values are different, so $\Lambda^{\alpha,\beta}_{m_1,m_2}$ has $4m_1m_2$ zeros of this form.
Coming back to Equation (7), we observe that if $(\theta_1, \theta_2) \in \mathbb{T}^2$ is a critical point of the Nash vector field (i.e. a critical point of $\Lambda^{\alpha,\beta}_{m_1,m_2}$), then it satisfies one of the following two possibilities: either both trigonometric factors attain an extremum, which happens at the points
(I) $4\theta_1 m_1 \equiv 2k_1 + 1 - \alpha$ and $4\theta_2 m_2 \equiv 2k_2 + 1 - \beta \mod 4\mathbb{Z}$;
or both trigonometric factors vanish, which happens at the points
(II) $4\theta_1 m_1 \equiv 2k_1 + \alpha$ and $4\theta_2 m_2 \equiv 2k_2 + \beta \mod 4\mathbb{Z}$.
Beware of the change of sign in the coefficients of $\alpha$ and $\beta$ for points of type (I). This is just a matter of notational convenience, as will be shown below. Equivalently, these conditions can be written explicitly in terms of the congruences above, taking $0 \leq k_1 < 2m_1$ and $0 \leq k_2 < 2m_2$. Thus, the Nash vector field has $8m_1m_2$ critical points: $4m_1m_2$ critical points of type (I) and $4m_1m_2$ of type (II).
Regarding the Nash Hessian, it can be computed explicitly by differentiating Equation (7) again. Evaluated at a critical point of the form (I), it is diagonal with two non-vanishing real eigenvalues of opposite sign. These are all saddle points for the Nash flow, with an attractive direction and a repulsive direction.
On the other hand, the Nash Hessian evaluated at a critical point of the form (II) is anti-diagonal, with purely imaginary eigenvalues. In this situation, we get a center critical point, with periodic orbits around it and no convergent flow lines. This dynamic is depicted in Figure 1. Putting together these calculations, we have proven the following result (Proposition 4.1): the Nash flow of a single basis function $\Lambda^{\alpha,\beta}_{m_1,m_2}$ has $8m_1m_2$ critical points, half of them saddles and half of them centers; in particular, it has no stable equilibria.
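This saddle/center dichotomy can be checked numerically. Below is a small Python sketch, under the assumed sign convention that the first player follows gradient ascent and the second gradient descent; which of the two families of critical points are saddles and which are centers depends on the phase indices, but the half-and-half split is visible.

```python
import numpy as np

# Nash field of Lambda(t1, t2) = sin(2*pi*m1*t1) * sin(2*pi*m2*t2), under the
# assumed convention: ascent in the first variable, descent in the second.
m1, m2 = 2, 1

def nash_field(t1, t2):
    d1 = 2*np.pi*m1 * np.cos(2*np.pi*m1*t1) * np.sin(2*np.pi*m2*t2)
    d2 = -2*np.pi*m2 * np.sin(2*np.pi*m1*t1) * np.cos(2*np.pi*m2*t2)
    return np.array([d1, d2])

def classify(t1, t2, eps=1e-6):
    """Type of a critical point from the eigenvalues of the Jacobian:
    real eigenvalues of opposite sign -> saddle, purely imaginary -> center."""
    J = np.zeros((2, 2))
    for j in range(2):
        dt = np.zeros(2)
        dt[j] = eps
        J[:, j] = (nash_field(t1 + dt[0], t2 + dt[1])
                   - nash_field(t1 - dt[0], t2 - dt[1])) / (2 * eps)
    ev = np.linalg.eigvals(J)
    return "saddle" if np.all(np.abs(ev.imag) < 1e-4) else "center"

print(classify(0.0, 0.0))              # both sine factors vanish  -> center
print(classify(1/(4*m1), 1/(4*m2)))    # both factors are extremal -> saddle
```

At the first point the Jacobian is anti-diagonal (eigenvalues $\pm i c$, a center); at the second it is diagonal with entries of opposite sign (a saddle), matching the Hessian computation above.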

4.3. Nash flow for simplified truncated Fourier series. In [19] it is proven that, under some ideal conditions, the Nash flow associated to the cost function of a GAN has stable Nash equilibria. For this reason, according to Proposition 4.1, these cost functions cannot be basis functions of the Fourier series. In other words, their Fourier approximation (6) is non-trivial. Hence, in order to capture the actual dynamics of the GAN flow, let us consider a general truncated Fourier series of the form $\Theta = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu \Lambda^{\gamma,\delta}_{n_1,n_2}$, for some $\alpha, \beta, \gamma, \delta \in \mathbb{Z}_2$, $-1 \leq \mu \leq 1$, and Fourier modes $m_1, m_2, n_1, n_2 \geq 1$.
At this point, we have the following two options.

4.4. Nash flow for general truncated Fourier series. In the general case, the calculation is similar but more involved. To alleviate notation, let us consider the auxiliary functions $\sigma_0(\theta) = \mathrm{sign}(\sin(2\pi\theta))$ and $\sigma_1(\theta) = \mathrm{sign}(\cos(2\pi\theta))$, with the customary assumption that the sign function vanishes at zero. These maps are just the sign functions of the trigonometric factors and, if needed, they may be extended to the whole real line by periodicity.
Remark 4.4. Even though half of the critical points near the points of the form (II) are attractors for the Nash flow of $\Theta$, the dynamics are a small perturbation of a center. In this manner, the convergence is slow, spiraling tightly towards the Nash equilibrium. This theoretically justifies the slow and badly conditioned convergence observed in GAN networks.
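The remark can be illustrated on a toy linear model. Assume (our simplification) that the linearization of the Nash flow near such an equilibrium has eigenvalues $-\varepsilon \pm ic$ with $0 < \varepsilon \ll c$, i.e. trace close to zero. The sketch below shows that simultaneous gradient steps (explicit Euler) on this spiral diverge unless the step size is small, while the continuous flow contracts only at the slow rate $e^{-\varepsilon t}$.

```python
import numpy as np

def radius_after(dt, n_steps, eps=0.05, c=10.0):
    """Distance to the equilibrium after n_steps simultaneous gradient steps
    (explicit Euler) on the linear spiral with eigenvalues -eps +/- i*c."""
    J = np.array([[-eps, c], [-c, -eps]])   # trace -2*eps, close to zero
    step = np.eye(2) + dt * J               # one Euler step
    x = np.linalg.matrix_power(step, n_steps) @ np.array([1.0, 0.0])
    return float(np.linalg.norm(x))

# Same total time T = 10 in both runs; only the step size changes.
print(radius_after(dt=1e-2, n_steps=1000))     # > 1: the iterates spiral out
print(radius_after(dt=1e-4, n_steps=100000))   # < 1: slow spiral convergence
```

Per step the radius is multiplied by $|1 + dt(-\varepsilon + ic)|$, which exceeds $1$ whenever $dt$ is too large relative to $2\varepsilon/(\varepsilon^2 + c^2)$; this is the discrete-time face of the "perturbed center" behavior.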

Empirical analysis
In this section, we show empirically how these Fourier approximations can be useful for understanding the convergence in the training of GANs. For this purpose, we will consider a simple model for a 2-parametric torus GAN (i.e. with $d_D = d_G = 1$) and we shall analyze its convergence by means of its truncated Fourier series.
In the notation of Section 3, we shall take $d = 1$ (1-dimensional real data) and the parameter spaces will be $\Theta_D = \Theta_G = S^1$. The latent space will be $\Lambda = [0,1] \subseteq \mathbb{R}$ with the uniform probability (standard Lebesgue measure). Fix a periodic function $\chi: S^1 \to \mathbb{R}$ and choose a 1-parametric continuous distribution $\mathcal{D}_\xi$ depending on the parameter $\xi \in \mathbb{R}$, with cumulative distribution function $F_\xi$ and probability density function $f_\xi$. For fixed $\omega \in S^1$, the real data $X$ will be sampled according to the distribution $X \sim \mathcal{D}_{\chi(\omega)}$.
As discriminator function, for $\theta_1 \in S^1$, we consider the function $D_{\theta_1}: \mathbb{R} \to \mathbb{R}$ given by Equation (11).
On the other hand, for $\theta_2 \in S^1$, the generator will be the function $G_{\theta_2}: \Lambda = [0,1] \to \mathbb{R}$ given by $G_{\theta_2}(z) = F^{-1}_{\chi(\theta_2)}(z)$, where $F^{-1}_\xi$ is the quantile function of $\mathcal{D}_\xi$ (Equation (12)). With these choices of generator and discriminator, and taking as weight function $f(t) = -\log(1 + \exp(-t))$ as in [13], the cost functional (1) reduces to the expression (13).

Remark 5.1. These choices of shapes for the discriminator and generator functions are justified by [13, Proposition 1]. There, it is proven that, for a fixed generator $G$ with transformed probability density function $f_G$, the optimal discriminator $D_{\theta_1^0}$ is given by Equation (14). On the other hand, recall that if $\Lambda = [0,1]$ is endowed with the uniform probability, then $F^{-1}_\xi: \Lambda = [0,1] \to \mathbb{R}$ is a random variable with distribution $\mathcal{D}_\xi$. Thus, in our case, $G_{\theta_2}$ is a random variable with distribution $\mathcal{D}_{\chi(\theta_2)}$ and, therefore, transformed density $f_{\chi(\theta_2)}$.
In this vein, the goal of the generator $G$ given by (12) is to adjust $\theta_2$ to reach the value $\theta_2 = \omega$, for which $G$ generates exactly the real data. On the other side, for a fixed parameter $\theta_2$ of $G$, the discriminator $D$ given by (11) aims to tune $\theta_1$ to the value $\theta_1 = \theta_2$, for which $D$ is the perfect discriminator (14).
For the purposes of these experiments, we will fix the underlying distribution $\mathcal{D}_\xi$ to be the exponential distribution with mean $1/\xi$, and $\chi(\theta) = (\sin(\pi\theta)^2 + 1)/2$. Recall that, in this situation, $f_\xi(x) = \xi e^{-\xi x}$ and $F_\xi(x) = 1 - e^{-\xi x}$. In this way, the generator (12) takes the explicit form $G_{\theta_2}(z) = F^{-1}_{\chi(\theta_2)}(z) = -\log(1-z)/\chi(\theta_2)$, and the discriminator (11) is written analogously in terms of $f_{\chi(\theta_1)}$. Moreover, from now on we fix $\omega = 1/4$, so that $\chi(\omega) = 3/4$. The resulting probability density and cumulative distribution functions of the real data are plotted in Figure 3.
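The toy model can be instantiated in a few lines. The following Python sketch assumes $\chi(\theta) = (\sin(\pi\theta)^2 + 1)/2$ (consistent with $\chi(1/4) = 3/4$) and implements the generator as the exponential quantile function, i.e. inverse transform sampling.

```python
import numpy as np

chi = lambda theta: (np.sin(np.pi * theta)**2 + 1) / 2   # chi(1/4) = 3/4
quantile = lambda xi, z: -np.log(1 - z) / xi             # F_xi^{-1} for Exp(xi)

def generator(theta2, z):
    """G_{theta2}: latent z ~ U[0,1) mapped to an Exp(chi(theta2)) sample."""
    return quantile(chi(theta2), z)

rng = np.random.default_rng(0)
z = rng.random(200_000)
omega = 0.25
samples = generator(omega, z)     # real data: Exp(3/4), true mean 4/3

print(round(chi(omega), 3))       # 0.75
print(round(samples.mean(), 2))   # close to 4/3
```

Sampling the generator at $\theta_2 = \omega$ indeed reproduces the real distribution, as the empirical mean converges to $1/\chi(\omega) = 4/3$.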
With this choice of real distribution, the generator function, as well as the transformed probability density function, are plotted in Figure 4, and the discriminator function is shown in Figure 5.
In addition, in Figure 6 we show graphically the cost function $F(\theta_1, \theta_2)$ of (13) on $\mathbb{T}^2$. The numerical approximation of the integrals in (13) has been carried out with the Simpson rule. The function was sampled at 225 knot points and subsequently interpolated by means of a multiquadric radial basis interpolation. Observe that one of the Nash equilibria of $F$ is at $(\theta_1, \theta_2) = (1/4, 1/4)$.

Now, let us decompose $F$ according to its Fourier series. In Table 1 we show the modes with the largest absolute Fourier coefficients. These coefficients have been computed using the formulae of Section 4, by applying rectangular quadrature as numerical integration method and looking at the modes with $1 \leq m_1, m_2 \leq 10$.

Table 1. Fourier modes of the cost function for the torus GAN. The ten modes with the largest absolute value of their associated coefficient are shown. The last column shows the ratio between each Fourier coefficient and the largest coefficient.
The associated Nash flow is depicted in Figure 8. As can be checked there, the critical points near the points of type (II) are (approximately) centers for $s \leq 3$. The reason for this behavior is twofold. In the following, let $(\theta_1^0, \theta_2^0)$ denote the critical point of type (II) under consideration.
• For $s \leq 2$, we have that $\nabla\Theta_s|_{(\theta_1^0, \theta_2^0)} = 0$ since, in the gradient, there is always a term with a factor $\cos(2\pi\theta)$ that vanishes at these points. Hence, the critical point of $\Theta_s$ is exactly at $(\theta_1^0, \theta_2^0)$. Nevertheless, since all the terms $\Lambda^{\alpha,\beta}_{m_1,m_2}$ appearing in the Fourier series have equal phase indices $(\alpha, \beta) = (1, 1)$, as mentioned in Section 4.3 we still have that the Nash Hessian has the form (8) with vanishing diagonal entries. Hence, the critical point $(\theta_1^0, \theta_2^0)$ is still a center.
• For $s = 3$, we find that $\nabla\Theta_3|_{(\theta_1^0, \theta_2^0)} \neq 0$, so a new critical point $(\tilde\theta_1, \tilde\theta_2)$ appears near $(\theta_1^0, \theta_2^0)$. Nevertheless, for this new mode we have that $m_1^3 = m_2^3 = 2$, so Equation (10) still vanishes, proving that the new critical point is still a center.

Methodology for practical applications
The discussion of Sections 4 and 5 opens the door to a practical application of the analysis techniques introduced in this paper to study the convergence of real-world GANs. Observe that, in general, the knowledge of the underlying cost function $F$ (cf. Equation (1)) of a GAN is very limited. Indeed, several metrics have been proposed in the literature to monitor the evolution of the training of a GAN. These metrics provide a way of measuring the convergence of the GAN indirectly, but they fall short of a thorough analysis of the cost function. Nevertheless, using the techniques introduced in this paper, we will show that it is possible to methodically analyze the dynamics of the Nash flow for the GAN problem through the partial sums of the Fourier series of the cost function. It is remarkable that this valuable information about the behaviour of the training process cannot be directly extracted from $F$ itself.
In this section, we aim to organize the previous analysis into a precise methodology that can be applied in practice. As will become clear, this process was implicit in the reasoning provided in Section 5. The proposed process of analysis comprises the following steps:
(1) Evaluate the cost function $F(\theta_D, \theta_G)$ on a uniform grid of the parameters $(\theta_D, \theta_G)$ (the weights of the two neural networks forming the GAN in the deep learning framework). Observe that for these evaluations it is not necessary to train the GAN: the sampling process amounts to fixing the weights of the networks and computing the mean prediction error of the discriminator against real and synthetic instances. No optimization of the weights must be carried out.
(2) Estimate the Fourier coefficients of $F$ from these samples, and sort the modes in decreasing order of the absolute value of their coefficients.
(3) Analyze the Nash flow of the truncated Fourier series $\Theta_s$ for increasing truncation levels $s$, studying the nature of the critical points as in Section 4.
After this process, we will have found a truncation level $s_0$ such that the local dynamics of $\Theta_{s_0}$ around its critical points are conjugated to the local dynamics of $F$ around its Nash equilibria. This information can be exploited to analyze the training process of the GAN. For instance, if the convergence to the critical point is very slow, in the sense that the trace of the Nash Hessian is close to zero, then a difficult convergence of the training process should be expected. This will lead to remarkable instabilities during the learning process that may prevent the system from converging with a raw gradient descent optimization procedure. In that case, the obtained results strongly suggest that several heuristics for stabilizing the training process must be implemented. Additionally, since the equilibria are spiral attractors, if the learning rate of the gradient descent method is not small enough, the discrete time approximation may not converge. In that case, the information about the convergence rate in the simplified Fourier model can be used to properly anneal the learning rate, leading to a much more stable convergence.
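The learning-rate consideration admits a quantitative sketch in the linearized model. Assuming (as before) that the Nash Hessian at the equilibrium has eigenvalues $-\varepsilon \pm ic$, a simultaneous gradient step of size $dt$ contracts if and only if $|1 + dt(-\varepsilon + ic)| < 1$, which yields the bound below.

```python
def max_stable_lr(eps, c):
    """Largest step size for which explicit simultaneous gradient steps on a
    spiral with eigenvalues -eps +/- i*c still contract to the equilibrium:
    |1 + dt*(-eps + i*c)| < 1  <=>  dt < 2*eps / (eps**2 + c**2)."""
    return 2 * eps / (eps**2 + c**2)

# A nearly-centric spiral (trace close to zero) forces a tiny learning rate.
print(max_stable_lr(0.05, 10.0))   # ~1e-3
print(max_stable_lr(1.0, 10.0))    # ~2e-2
```

The bound makes explicit how a trace close to zero (slow spiral) shrinks the admissible learning rate, supporting the annealing strategy suggested above.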
Despite the utility of the proposed methodology, it suffers from several issues that must be addressed in future work to obtain an efficient analysis procedure. The first one is an obvious bottleneck: the sampling process of the cost function on the parameters $(\theta_D, \theta_G)$ may require a huge number of samples due to the curse of dimensionality. Nevertheless, it is important to mention that it is not necessary to use a very dense grid, since we want to understand the Fourier modes of the cost function $F$ and not to obtain a detailed picture of the landscape of $F$. This largely alleviates the sampling process and makes it feasible.
Another possible solution is to sample not on the whole $(\theta_D, \theta_G)$-space, but on a smaller dimensional subspace concentrating the flow. For that purpose, the GAN can be trained for some epochs, after which the flow will have entered a certain 'convergence subspace' that encloses the long-time evolution of the flow. This subspace can be estimated by several methods, for instance by considering the subspace generated by the last $k \geq 1$ gradient vectors obtained in the training process. In that case, instead of working in the high dimensional $(\theta_D, \theta_G)$-space, we can restrict our analysis to the $k$-dimensional affine space generated by these vectors. This is a much smaller subspace in which the sampling process can be carried out. Nevertheless, proposing other efficient methods of sampling that enable accurate approximations of the Fourier series of $F$ is an interesting topic for future work.
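The subspace estimation suggested above can be sketched as follows; `convergence_subspace` is a hypothetical helper that orthonormalizes the last $k$ gradients with a QR factorization.

```python
import numpy as np

def convergence_subspace(grad_history, k=3):
    """Orthonormal basis of the subspace spanned by the last k gradient
    vectors of the training run (grad_history: list of 1-D arrays)."""
    G = np.stack(grad_history[-k:], axis=1)   # shape: (n_params, k)
    Q, _ = np.linalg.qr(G)                    # columns form an orthonormal basis
    return Q

# Mock gradient history for a 1000-parameter model.
rng = np.random.default_rng(1)
grads = [rng.standard_normal(1000) for _ in range(10)]
Q = convergence_subspace(grads, k=3)

# The cost can now be sampled on theta_0 + Q @ u, for u in a grid over R^k.
print(Q.shape)                                # (1000, 3)
print(bool(np.allclose(Q.T @ Q, np.eye(3))))  # True
```

Sampling the cost on the low-dimensional grid $\theta_0 + Qu$ replaces the unfeasible dense grid in the full parameter space.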
Another important remark is that estimating the Fourier series through the FFT is much more efficient than the quadrature methods used in Section 5. However, it may also lead to poorer estimations of the Fourier coefficients. This inaccuracy may produce errors when choosing the leading Fourier modes if their importances (the absolute values of their Fourier coefficients) are similar. To avoid these problems, all the possible permutations of these similar modes (say, modes whose coefficients differ by less than a fixed threshold) must be considered during the analysis of the Nash flow of the Fourier series.
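The tie-handling above can be made concrete. This sketch enumerates every ordering of the leading modes that is consistent with the estimation error, permuting modes whose coefficient magnitudes differ by less than a threshold.

```python
import itertools

def mode_orderings(coeffs, tol=1e-3):
    """Yield all orderings of the modes compatible with the tolerance:
    modes whose |coefficient| differ by less than tol are treated as tied."""
    ranked = sorted(coeffs.items(), key=lambda kv: -abs(kv[1]))
    groups, current = [], [ranked[0]]
    for item in ranked[1:]:
        if abs(abs(current[-1][1]) - abs(item[1])) < tol:
            current.append(item)          # still within the tie group
        else:
            groups.append(current)
            current = [item]
    groups.append(current)
    for combo in itertools.product(*(itertools.permutations(g) for g in groups)):
        yield [mode for group in combo for mode, _ in group]

# Two modes with nearly equal estimated coefficients produce two orderings.
coeffs = {(1, 1): 1.0, (2, 1): 0.5001, (1, 2): 0.5000, (3, 1): 0.1}
orders = list(mode_orderings(coeffs))
print(len(orders))    # 2
print(orders[0][0])   # (1, 1): the leading mode is unambiguous
```

Each ordering then yields a candidate truncated series whose Nash flow is analyzed; the conclusions are robust if all candidates exhibit the same critical-point structure.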

Conclusions
In this paper we have studied a novel approach to analyze in depth the convergence of GAN networks on tori. This is an outstanding open problem in Machine Learning and Deep Learning that prevents GANs from being suitable for use in arbitrary domains, such as feature generation outside the world of image processing.
In this paper, we propose to decompose the cost function of a GAN into its Fourier modes and to envisage the dynamics around the Nash equilibria through its truncated Fourier approximation. For that purpose, we have performed a thorough analysis of the dynamics of trigonometric series with one and two terms. Roughly speaking, this analysis has shown that if we truncate the Fourier series at its first mode, all the critical points are saddles or centers surrounded by periodic orbits. When we add subtler Fourier modes to the approximation, this dynamic may be preserved or may bifurcate to give rise to spiral attractors or repulsors. This dynamic is essentially determined by the trace of the Nash Hessian of the cost function. Hence, following this idea, in this paper we have exhibited explicitly the bifurcation condition for the Nash flow of the truncated Fourier approximations. These conditions have an involved shape, taking into account the monotonicity of the trigonometric functions in a neighborhood of the critical point, but eventually the conditions are very explicit and can be easily checked. As a byproduct of this analysis, we have observed that, even though the Nash equilibria are stable points as proven in [19], the dynamics of the training process are close to a center and the convergence is slow and spiral.
To test this idea, we have conducted an experimental analysis with a torus GAN toy model. Through this example, we have observed that the number and distribution of the critical points is determined by the first Fourier mode. Nevertheless, it was necessary to reach the fourth Fourier term to discover the attractive dynamics, as predicted in the GAN literature. Comparing the approximated flow with the real flow, we observe that the approximation is able to replicate not only the local but also the global dynamics of the real GAN. We expect that this work will be useful for quantifying the complexity and convergence properties of GANs. To show how this theoretical analysis can be put into practice, in Section 6 we propose a methodology of analysis that enables a characterization of the training dynamics of real-world GANs by means of the techniques developed in this work. From the obtained information about the convergence of the learning process of the networks, several improvements for stabilizing the training can be implemented, like a progressive reduction of the learning rate to adapt to the geometry of the spiral flow.
It is worth mentioning that the results presented in this paper do not only apply to torus toy models, but also to more realistic networks. It may seem at first sight that standard GANs do not fulfil the periodicity requirement to be defined on a torus. However, in many cases, the outputs of the generator and the discriminator networks are clipped for large enough inputs. This fix is crucial to maintain several required analytic properties, such as the Lipschitz condition for Wasserstein GANs [4]. After this clipping, the GAN actually turns into a torus GAN, since the generator and discriminator functions are periodic (with a large period). In this manner, most of the regular GANs used in image generation and feature generation fit in the framework introduced in this paper. This is crucial, since dynamics on a closed manifold are deeply related to the underlying topology, for instance through the Poincaré–Hopf theorem or deeper Morse-like results.
Nevertheless, much work must be done before this project can be turned into a reality. First, in order to compute the Fourier series of the cost function, we had to sample the cost function of the GAN at a dense mesh of weights. Using this sampling, we were able to estimate the Fourier coefficients through standard quadrature techniques, such as the Simpson rule. In shallow networks with few neurons a similar approach can be applied, but for deeper networks this dense sampling is unfeasible. For this reason, better methods for estimating the Fourier coefficients of the cost function are needed, maybe by exploiting the analytic and harmonic properties of the trigonometric functions. In addition, to illustrate the method, in this paper we have carried out all the calculations on a 2-dimensional torus. The computations in higher dimensional tori may follow similar lines, but a thorough analysis of the bifurcation conditions in the higher dimensional setting is definitely not obvious.
Summarizing, in this paper we have introduced a novel method for understanding the dynamics of GANs through harmonic analysis. We have shown that, despite the fact that the Nash equilibria of the GAN are stable, the convergence is a perturbation of a center and, thus, slow and complicated. The method has allowed us to identify a simplified model of the dynamics that may be useful for tuning several hyper-parameters of the GANs used, such as the learning rate or the number of epochs to be trained. We expect that this work will open the door to new methods for studying the dynamics of GANs by using harmonic analysis and transcendental methods.