Mixture Density Conditional Generative Adversarial Network Models (MD-CGAN)

Generative Adversarial Networks (GANs) have gained significant attention in recent years, with particularly impressive applications in computer vision. In this work, we present a Mixture Density Conditional Generative Adversarial Model (MD-CGAN), in which the generator is a Gaussian mixture model, with a focus on time series forecasting. Compared to vision, applications of GAN models to time series have been more limited. We show that our model is capable of estimating a probabilistic posterior distribution over forecasts and that, in comparison to a set of benchmark methods, the MD-CGAN model performs well, particularly in situations where noise is a significant component of the time series. Further, by using a Gaussian mixture model that allows for a flexible number of mixture coefficients, the MD-CGAN offers posterior distributions that are non-Gaussian.


I. INTRODUCTION
Generative Adversarial Networks (GANs) have been one of the many breakthroughs in Deep Learning methods in recent years. Several variations of the model have been proposed since the method was first introduced in [2]. One of the most popular variations is the Conditional Generative Adversarial Network (CGAN) [3], in which the generator and discriminator are both conditioned on some observed information. Within time series forecasting, future values are conditioned on information observed from the past, either from the time series itself, exogenous data or a combination of the two. This makes the CGAN approach particularly useful for time series prediction.
Most applications of (C)GANs have been within computer vision and, to a lesser extent, in natural language processing. There has also been some use of GAN models for simulations in a variety of settings [7], [8], [9].
The literature on the application of GAN models to problems associated with time series is, to date, limited. However, some work shows the potential usefulness of the method. For example, [4] apply a recurrent GAN to generate realistic synthetic medical data series. In [5] a GAN is used to forecast high-frequency stock datasets, and [6] use GANs to generate missing values for incomplete time series.
In this work, we present a method that expands on the CGAN algorithm. In our model, the generator estimates a multimodal posterior distribution via a finite mixture of Gaussians. Unlike most variations of GAN models, in which the generator makes a point estimate, the MD-CGAN is capable of estimating a flexible probability distribution. This paper is set out as follows: in Section II we present the structure of the MD-CGAN model. In Section III we test the model on various datasets and discuss the results. Finally, in Section IV, we conclude.
April 9, 2020. The authors are with the Machine Learning Research Group and the Oxford-Man Institute of Quantitative Finance, University of Oxford, Oxford, UK (e-mail: jz@robots.ox.ac.uk, sjrob@robots.ox.ac.uk).

II. THE MD-CGAN MODEL FRAMEWORK
We consider a time series, $y_t$. Our aim is to infer the posterior over some $y_{t'}$ with $t' > t$, conditioned on a set of observations which we denote $x_t$. In order to form the posterior distribution we model the conditional density $p(y_t \mid x_t)$ as an adversarial network. To achieve this we use a Mixture Density Network (MDN) model, similar to the one presented in [1], for the generator $G$. The inputs to the generator network are $x_t$ and $z_g$, where $z_g$ is a collection of samples from a normal distribution, $p(z_n)_g = \mathcal{N}(0, \mathrm{var}_{\mathrm{data}})$. The outputs of $G_t(x_t, z_g)$ are the parameters of the Gaussian mixture model, with the mixing coefficient, standard deviation (s.d.) and mean of the $i$-th component denoted $\alpha_i$, $\sigma_i$ and $\mu_i$. As first proposed in [1], we achieve this by using latent variables $s = \{s_\alpha, s_\sigma, s_\mu\}$, conditioned on the inputs. The mapping $[x_t, z_g] \to s \to \{\alpha_i, \sigma_i, \mu_i\}$ is modelled via our network. As the mixing coefficients must satisfy $\sum_i \alpha_i = 1$, we map $s_\alpha$ to $\alpha$ via the softmax function. The elements of $\sigma$ are strictly positive, so we adopt $\sigma_i = \exp(s_{\sigma,i})$. Finally, the means can be mapped directly from the latent variables, hence $\mu_i = s_{\mu,i}$.
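The mapping from latent variables to mixture parameters described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `latent_to_mixture_params` and the example latent values are hypothetical, and the network that would produce the latent variables is omitted.

```python
import math

def latent_to_mixture_params(s_alpha, s_sigma, s_mu):
    """Map latent network outputs to Gaussian-mixture parameters."""
    # Softmax (shifted by the maximum for numerical stability):
    # the mixing coefficients are positive and sum to 1.
    shift = max(s_alpha)
    exps = [math.exp(v - shift) for v in s_alpha]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # The exponential keeps every standard deviation strictly positive.
    sigma = [math.exp(v) for v in s_sigma]
    # The means are taken directly from the latent variables.
    mu = list(s_mu)
    return alpha, sigma, mu

# Hypothetical latent values for m = 3 components (not from the paper).
alpha, sigma, mu = latent_to_mixture_params(
    [0.2, -1.0, 0.5], [-0.7, 0.0, 0.3], [0.1, 0.4, 0.9]
)
print(alpha, sigma, mu)
```

Subtracting the maximum before exponentiating does not change the softmax result; it only avoids overflow for large latent values.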
The above formalism allows us to directly model the predictive likelihood conditioned on an input, and the likelihood of $G$ conditioned on the observations $x_t$ and samples $z_g$, as a Gaussian mixture in which $m$ is the number of mixture components.
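For reference, a Gaussian mixture density with the parameters defined above is conventionally written as follows. This is the standard mixture-density form from the MDN literature [1]; the symbol $\mathcal{L}$ and the explicit dependence of each parameter on $(x_t, z_g)$ are notational assumptions rather than a transcription of the authors' expression.

```latex
\mathcal{L}(y_t \mid x_t, z_g)
  = \sum_{i=1}^{m} \alpha_i(x_t, z_g)\,
    \frac{1}{\sqrt{2\pi}\,\sigma_i(x_t, z_g)}
    \exp\!\left( -\frac{\bigl(y_t - \mu_i(x_t, z_g)\bigr)^2}
                       {2\,\sigma_i(x_t, z_g)^2} \right),
\qquad \sum_{i=1}^{m} \alpha_i(x_t, z_g) = 1 .
```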
As in the CGAN model, the discriminator, $D$, is also conditioned on $x_t$. The input to the discriminator model is, by where $\sigma_a$ is the s.d. of the true $y_t$ in the MDN model and the output is $x_t$. For true values of $y_t$, $\sqrt{2\pi}\,\sigma_a \mathcal{L}(y_t)$ is maximized. The generator tries to 'fool' the discriminator by generating $G_t$ such that $\sqrt{2\pi}\,\sigma_a \mathcal{L}(G_t)$ is maximized. The loss function for the generator, $L_G$ in Equation 1, reflects this concept. The discriminator network, on the other hand, tries to differentiate between true $y_t$ values and the pseudo-values created by the generator. The loss function for the discriminator, $L_D$ in Equation 2, reflects this, where the lowest value is achieved when $\sqrt{2\pi}\,\sigma_a \mathcal{L}(y_t)$ is maximal (unity) and $\mathcal{L}(G_t(x_t, z_g))$ is minimal (zero).
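To see why a likelihood scaled by $\sqrt{2\pi}\,\sigma_a$ peaks at unity, consider the simplest case of a single Gaussian component. The short derivation below assumes $m = 1$ with a hypothetical mean $\mu$ and s.d. $\sigma_a$; it illustrates the scaling only and is not taken from the source.

```latex
\sqrt{2\pi}\,\sigma_a\,\mathcal{L}(y)
  = \sqrt{2\pi}\,\sigma_a \cdot \frac{1}{\sqrt{2\pi}\,\sigma_a}
    \exp\!\left( -\frac{(y - \mu)^2}{2\sigma_a^2} \right)
  = \exp\!\left( -\frac{(y - \mu)^2}{2\sigma_a^2} \right) \in (0, 1],
\qquad \text{with equality to } 1 \text{ iff } y = \mu .
```

Under this assumption the scaled quantity behaves like a score in $(0, 1]$ that decays towards zero as $y$ moves away from $\mu$.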
The algorithm follows the steps in Algorithm 1.

Algorithm 1 MD-CGAN Algorithm
for j steps do
Update the discriminator by descending its stochastic gradient.
Update the generator by descending its stochastic gradient.

Figure 1 illustrates the structure and the interaction between the generator and the discriminator for the MD-CGAN model.
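The alternating updates of Algorithm 1 can be sketched schematically in Python. The nesting used here (j discriminator steps followed by one generator step, repeated over an outer loop) is an assumption borrowed from the original GAN training schedule of [2], since the exact loop structure is not recoverable from the text. The names `train_alternating`, `n_iterations` and the two counting functions are hypothetical stand-ins for real gradient steps.

```python
def train_alternating(n_iterations, j, update_discriminator, update_generator):
    """Run j discriminator updates, then one generator update, per iteration.

    ASSUMPTION: this nesting follows the schedule of the original GAN [2].
    """
    for _ in range(n_iterations):
        for _ in range(j):
            update_discriminator()  # one descent step on the discriminator loss
        update_generator()          # one descent step on the generator loss

# Stand-in update functions that only count calls; real ones would
# compute stochastic gradients and update the network weights.
calls = {"discriminator": 0, "generator": 0}

def count_discriminator_step():
    calls["discriminator"] += 1

def count_generator_step():
    calls["generator"] += 1

train_alternating(
    n_iterations=3,
    j=2,
    update_discriminator=count_discriminator_step,
    update_generator=count_generator_step,
)
print(calls)
```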

A. Comparison with other Learning Models
We compare the MD-CGAN model to the Mixture Density Network model (MDN) [1], the CGAN model [3] and a standard neural network (SNN).
We perform experiments on four datasets: the Mackey-Glass chaotic dataset, the sunspot dataset [11], US initial jobless claims (USIJC, weekly intervals, [12]), and the EURUSD foreign exchange daily rates (EURUSD FX rate). For consistency, the generator in the MD-CGAN has the same structure as the MDN, the generator of the CGAN and the SNN in our experiments. For each dataset, the series is split into training and out-of-sample test sets. The training sets in all our experiments comprise 2000 samples (save for the sunspot dataset, which has 1000 data points) and the test sets consist of the 400 data points following the training set. All algorithms take as input the last k data points, with k = 5 for the purpose of our experiments; this value is not optimized, and is chosen to allow simple comparisons across methods. All data sets are normalized to the [0,1] interval, again to allow for simpler comparison across data and methods. We further note that both CGAN and SNN make point estimate predictions, whilst MD-CGAN and MDN estimate posterior distributions.
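The data preparation described above (normalization to [0,1], inputs formed from the last k = 5 points, and a chronological train/test split) can be sketched as follows. The function names are hypothetical, and computing the minimum and maximum over the full series is an assumption, since the text does not say which portion of the data supplies the normalization statistics. A short synthetic series stands in for the real datasets.

```python
def minmax_normalize(series):
    """Rescale a series to the [0, 1] interval."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("constant series cannot be min-max normalized")
    # ASSUMPTION: statistics are taken over the whole series.
    return [(v - lo) / (hi - lo) for v in series]

def make_lagged_pairs(series, k=5):
    """Each input is the last k values; the target is the next value."""
    inputs = [series[i - k:i] for i in range(k, len(series))]
    targets = [series[i] for i in range(k, len(series))]
    return inputs, targets

def split_train_test(inputs, targets, n_train, n_test):
    """Chronological split: the test set directly follows the training set."""
    train = (inputs[:n_train], targets[:n_train])
    test = (inputs[n_train:n_train + n_test], targets[n_train:n_train + n_test])
    return train, test

# Synthetic stand-in series (30 points, values 0..10); not real data.
raw = [float((3 * i) % 11) for i in range(30)]
normalized = minmax_normalize(raw)
inputs, targets = make_lagged_pairs(normalized, k=5)
(train_x, train_y), (test_x, test_y) = split_train_test(inputs, targets, 20, 5)
print(len(train_x), len(test_x))
```

With the sizes quoted in the text, the call would instead use `n_train=2000` (1000 for the sunspot data) and `n_test=400`.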
To enable a simple comparison, we therefore report the mean-square error (MSE) for all methods. The number of mixture components, m, is set to unity (we vary this in Section III-D), to further enable comparison. The most-likely value (which for m = 1 is the mean) of the predictive distribution is taken as the forecast value for both the MDN and MD-CGAN models.
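A point forecast and its MSE can be computed from predicted mixture parameters as sketched below. For m = 1 the forecast is exactly the component mean, as stated above. For m > 1 this sketch approximates the most-likely value by the component mean with the highest mixture density, which is an assumption made for illustration and not necessarily the procedure used by the authors. All names and numbers are hypothetical.

```python
import math

def mixture_density(y, alpha, sigma, mu):
    """Gaussian-mixture density evaluated at y."""
    return sum(
        a * math.exp(-0.5 * ((y - m) / s) ** 2) / (math.sqrt(2 * math.pi) * s)
        for a, s, m in zip(alpha, sigma, mu)
    )

def point_forecast(alpha, sigma, mu):
    """Approximate mode: the component mean with the highest mixture density."""
    return max(mu, key=lambda c: mixture_density(c, alpha, sigma, mu))

def mean_square_error(forecasts, targets):
    return sum((f - t) ** 2 for f, t in zip(forecasts, targets)) / len(targets)

# Hypothetical single-component predictions (alpha, sigma, mu) and targets.
predictions = [([1.0], [0.1], [0.50]), ([1.0], [0.2], [0.70])]
targets = [0.45, 0.80]
forecasts = [point_forecast(*p) for p in predictions]
mse = mean_square_error(forecasts, targets)
print(forecasts, mse)
```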

B. Data & one-step forecasts
Mackey-Glass and Sunspot time series: We start our experiments by looking at one-step-ahead forecasts on two well-known data sets, the Mackey-Glass chaotic time series [10] and the Sunspot data set [11]. One-step forecast error comparisons are given in Table I. We note that both GAN models (CGAN and MD-CGAN) are outperformed, in terms of MSE, on these datasets. We attribute this to the high signal-to-noise ratio of these data sets and thus the low levels of observed noise in the test data. In these circumstances, the 'adversarial advantage' [3] of GAN approaches, which offers robustness to input perturbations (in this case noise), is less important.
In the next experiments we thus add normally distributed noise (30% by amplitude) to the test data (from a GAN perspective, these input perturbations are, in effect, treated as adversarial attacks). We note that no noise is added to the training dataset. MSEs are presented in Table I, referenced as "Mackey-Glass with Noise" and "Sunspot with Noise" respectively. We note that, under this noisy-observation paradigm, the GAN models perform best, confirming their adversarial robustness.
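The noise injection on the test data can be sketched as below. The text specifies normally distributed noise at 30% by amplitude but does not define "amplitude"; this sketch assumes it means the peak-to-peak range of the series, so the noise s.d. is 0.30 times that range. The function name, the seed argument and the example series are hypothetical.

```python
import random

def add_gaussian_noise(series, fraction=0.30, seed=0):
    """Add zero-mean Gaussian noise with s.d. = fraction * amplitude."""
    rng = random.Random(seed)  # seeded for reproducibility
    # ASSUMPTION: "amplitude" is read as the peak-to-peak range of the series.
    amplitude = max(series) - min(series)
    noise_std = fraction * amplitude
    return [v + rng.gauss(0.0, noise_std) for v in series]

# Hypothetical normalized test series; the training data is left untouched.
clean_test = [0.0, 0.25, 0.5, 0.75, 1.0]
noisy_test = add_gaussian_noise(clean_test, fraction=0.30, seed=1)
print(noisy_test)
```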
US initial jobless claims (USIJC) and Euro-Dollar foreign exchange (EURUSD FX) daily rate: Financial time series are highly stochastic, and we expect GAN approaches to be well-suited to forecasting in these circumstances. We consider two financial time series, namely the USIJC and EURUSD FX datasets. Test set forecasts are shown in Figure 2 and the associated MSEs in Table I. We note that, for these data sets, the GAN approaches perform well and our method, MD-CGAN, has the lowest errors.

C. Financial forecasts over longer horizons
One-step forecasts were presented in Subsection III-B. Here we extend the analysis of the financial data to longer horizons. All models were used to make estimates over a horizon of ten weeks for an extended set of financial time series: the USIJC and EURUSD FX rate (as previously), WTI crude oil spot prices (WTI, [13]), Henry Hub Natural Gas spot prices (Nat Gas, [13]), and the CBOE Volatility Index (VIX, [14]). A ten-week horizon represents a 50-step forecast for the daily datasets (FX, WTI, Nat Gas and VIX) and a 10-step forecast for the weekly USIJC dataset. Further, we perform comparisons against standard econometric linear models, namely a fifth-order autoregressive, AR(5), model and the martingale, or AR(0), model, in which the forecast is the last observed datum. Taking the martingale model as a baseline, we present in Table II the mean-square errors as a ratio to the martingale model error. We note that the MD-CGAN approach delivers ratios below unity and provides the lowest error of all models in this scenario.
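The martingale baseline and the error ratio reported relative to it can be sketched as follows. The function names, the toy series and the model forecasts are hypothetical; only the definition of the martingale forecast (the last observed datum) and the use of an MSE ratio come from the text.

```python
def mean_square_error(forecasts, targets):
    return sum((f - t) ** 2 for f, t in zip(forecasts, targets)) / len(targets)

def martingale_forecasts(series, horizon):
    """Forecast each y[t + horizon] by the last observed value y[t]."""
    return series[:len(series) - horizon]

def mse_ratio_to_martingale(model_forecasts, series, horizon):
    """MSE of a model divided by the MSE of the martingale baseline."""
    targets = series[horizon:]
    baseline = mean_square_error(martingale_forecasts(series, horizon), targets)
    return mean_square_error(model_forecasts, targets) / baseline

# Toy series and hypothetical model forecasts for the targets series[2:].
series = [1.0, 2.0, 4.0, 7.0, 11.0]
model_forecasts = [4.5, 6.0, 10.0]
ratio = mse_ratio_to_martingale(model_forecasts, series, horizon=2)
print(ratio)
```

A ratio below one means the model beats the martingale baseline at that horizon; for the daily datasets in the text the horizon would be 50 steps, and 10 for the weekly USIJC series.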

D. Multi-modal posterior predictions
Finally, we compare the performance of MD-CGAN over varying numbers of mixture components. In all the previous experiments we set m = 1 (hence the model produced a single Gaussian predictive posterior). Here we briefly present the results for the five finance datasets with m ∈ {1, 2, 3}. We report (negative) log-likelihood measures (as we do not compare against point-value models in this section) and consider one-step forecasts on all the data sets. Table III presents the performance across data sets for varying numbers of mixture components in the posterior prediction. Performance improves for some datasets when m > 1. We do not attempt to infer m, though choosing its value based on performance on a set of cross-validation data would be an option, as would enforcing regularization over the mixture model posterior through more extensive use of Bayesian inference. We leave these extensions for future research.
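A negative log-likelihood score for mixture predictions can be computed as sketched below, using the log-sum-exp trick for numerical stability. Averaging over test points is an assumption, since the text does not say whether the reported values are sums or means; the function names and example values are hypothetical. Mixing coefficients are assumed strictly positive, as a softmax output is.

```python
import math

def mixture_log_density(y, alpha, sigma, mu):
    """Log of a Gaussian-mixture density at y, computed via log-sum-exp."""
    terms = [
        math.log(a) - math.log(s) - 0.5 * math.log(2 * math.pi)
        - 0.5 * ((y - m) / s) ** 2
        for a, s, m in zip(alpha, sigma, mu)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def mean_negative_log_likelihood(targets, params):
    """Average NLL; params holds one (alpha, sigma, mu) triple per target."""
    total = sum(mixture_log_density(y, *p) for y, p in zip(targets, params))
    return -total / len(targets)

# Hypothetical predictions: one component versus two identical components.
nll_single = mean_negative_log_likelihood([0.0], [([1.0], [1.0], [0.0])])
nll_double = mean_negative_log_likelihood(
    [0.0], [([0.5, 0.5], [1.0, 1.0], [0.0, 0.0])]
)
print(nll_single, nll_double)
```

The two values agree because splitting one Gaussian into two identical components leaves the density unchanged, which is a quick sanity check on the implementation.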

IV. CONCLUSION
In this paper we presented the MD-CGAN model, an extension of the CGAN [3] method. In the experiments considered, we find that the MD-CGAN outperforms all other models on the noisy Mackey-Glass and Sunspot datasets, as well as on all financial time series, for both short- and long-term forecast horizons. As a GAN model, our approach retains adversarial robustness, most notably when noise is extensively present in the data, making the approach particularly well suited to dealing with financial data. Furthermore, our MD-CGAN model can effectively estimate a flexible posterior distribution, in contrast to standard GAN models. Exploiting the rich, multimodal posterior distribution is not reported in detail here but will feature in follow-up work. In summary, the MD-CGAN model combines the advantageous features of both probabilistic forecasting and GAN methods. We see this as a particularly useful approach for dealing with time series in which noise is significant (such as in financial data) and for providing robust, long-term forecasts beyond simple point estimates.