Scenario generation for market risk models using generative neural networks

In this research, we show how to expand existing approaches of using generative adversarial networks (GANs) as economic scenario generators (ESGs) to a whole internal market risk model, with enough risk factors to model the full bandwidth of investments for an insurance company and with a one-year time horizon as required by Solvency 2. We demonstrate that the results of a GAN-based internal model are similar to those of regulatory approved internal models in Europe. Therefore, GAN-based models can be seen as a data-driven alternative for market risk modeling.


Introduction
Generating realistic scenarios of how the financial markets might behave in the future is one key component of internal market risk models used by insurance companies for Solvency 2 purposes. Currently, these are built using economic scenario generators (ESGs) based mainly on financial mathematical models, see Bennemann [2011] and Pfeifer and Ragulina [2018]. These ESGs require strong assumptions on the behavior of the risk factors and their dependencies, are time-consuming to calibrate and it is difficult in this framework to model complex dependencies.
An alternative method for scenario generation can be a special type of neural network called generative adversarial networks (GANs), invented by Goodfellow et al. [2014]. This architecture, consisting of two competing neural networks, has gained a lot of attention due to its ability to generate realistic-looking images, see Aggarwal et al. [2021].
As financial data, at least for liquid instruments, is consistently available, GANs are used in various areas of finance, including market prediction, tuning of trading models, portfolio management and optimization, synthetic data generation and diverse types of fraud detection, see Eckerli and Osterrieder [2021]. Henry-Labordere [2019], Lezmi et al. [2020], Fu et al. [2019], Wiese et al. [2019] and Ni et al. [2020] have already used GANs for scenario generation in the financial sector. The focus of their research was the generation of financial time series for a limited number of risk factors (up to 6) or a single asset class. To the best of our knowledge, there is no research performing a full value-at-risk calculation for an insurance portfolio based on GAN-generated scenarios.
In this work, we perform a market risk calculation for typical insurance portfolios using a GAN instead of a classical ESG. We base our research on publicly available financial data from Bloomberg. The research of Ngwenduna and Mbuvha [2021], Cote et al. [2020] and Kuo [2019] also uses a GAN in an actuarial context, but they use it to generate publicly available data from restricted data or to create new data for solving the issue of having imbalanced data sets. However, the methods introduced in those papers could be used when dealing with illiquid instruments where no consistent tabular data is available.
In this research we
• expand the scenario generation by a GAN to a complete market risk calculation serving Solvency 2 purposes in insurance companies and
• compare the results of a GAN-based ESG to the ESG approaches implemented in regulatory approved market risk models in Europe.
As a novelty, this research shows that there is an alternative way of market risk modeling beyond traditional ESGs which can compete with regulatory approved models, as it performs well in the EIOPA (European Insurance and Occupational Pensions Authority) benchmarking study. Therefore, the proof of concept of whether a GAN can serve as an ESG for market risk modeling is successful. The paper is structured as follows: In Section 2, we provide some background both on market risk calculation under Solvency 2 and on GANs. The MCRCS (market and credit risk comparison study), a benchmarking exercise for approved market risk models in Europe conducted annually by EIOPA, is also introduced in Section 2. Section 3 explains how GANs can be used as ESGs, how they can be included in an internal model process and how GANs can be implemented. A comparison of the results of a GAN-based internal model with the results of the models in the MCRCS study is presented in Section 4. A discussion of the stability of these results is also included in this section. Section 5 concludes and provides an overview of the differences between this GAN-based approach and traditional methods.

Background
Before we present our work, we give a short introduction to the two main topics involved: economic scenario generators (ESG) and their usage for market risk calculation under Solvency 2 and generative adversarial networks (GANs).

Market risk calculation under Solvency 2
In 2016, a new regulation for insurance companies in Europe was introduced: Solvency 2. One central requirement is the calculation of the solvency capital requirement, called SCR. The amount of SCR depends on the risks to which the insurance company is exposed, see e.g. Gründl et al. [2019, Chapter 4]. The eligible capital of an insurance company is then compared with the SCR to determine whether the eligible capital is sufficient to cover all the risks taken by the insurance company.
The solvency capital requirement equals the Value-at-Risk (VaR) at a 99.5%-level for a time horizon of one year, see Bennemann [2011, Chapter 2.3]. A mathematical definition of the VaR and a derivation of its usage in this context can be found in Denuit et al. [2006, p. 69].
The risk of an insurer can be divided into six different modules: market risk, health underwriting risk, counterparty default risk, life underwriting risk, non-life underwriting risk and operational risk. The modules themselves consist of sub-modules, see EIOPA [2014]. Market risk, for example, consists of the six sub-modules interest rate, equity, property, spread, currency and concentration risk.
The SCR can be calculated using either the standard model or an internal model. For the standard model, the regulatory framework sets specific rules for the calculation for each risk encountered by the insurance company, defined in European Commission [2015]. Each internal model has to cover the same types of risks as the standard model and must be approved by local supervisors to ensure accordance with the principles of Solvency 2.
In this work, we will focus on the calculation of the market risk of a non-life insurer. However, the methods presented here can be applied to other risks, too. The reason for selecting market risk here is threefold:
• the underlying data in the financial market is publicly available and equal for all insurers,
• market risk forms a major part of the SCR of an insurance company (EIOPA [2021b, p. 22] states that market risk accounts for 53% of the net solvency capital requirement before diversification benefits; this varies between life (59%) and non-life (43%) insurers) and
• a comprehensive benchmark exercise, the "market and credit risk comparison study" (MCRCS) conducted by EIOPA, is available for comparison of the results.
Current internal models for market risk often use Monte-Carlo simulation techniques to derive the risk of the (sub)modules and then use correlations or copulas for aggregation, see Bennemann [2011, p. 189] and Pfeifer and Ragulina [2018]. The basis of the Monte-Carlo simulation is a scenario generation performed by an "economic scenario generator" ESG. A definition of an ESG can be found in Pedersen et al. [2016, p. 7]: Definition 1. An economic scenario generator (ESG) is a computer-based model of an economic environment that is used to produce simulations of the joint behavior of financial market values and economic variables.
The ESG implements financial-mathematical models for all relevant risk factors (e.g. interest rate, equity) and their dependencies. Under those scenarios, the investment and liabilities portfolio of the insurer is evaluated and the risk is given by the 0.5%-percentile of the loss in these scenarios.

Introduction to the MCRCS study
Since 2017, EIOPA has conducted an annual study called the market and credit risk comparison study, abbreviated MCRCS. According to the instructions from EIOPA MCRCS Project Group [2020b], the "primary objective of the MCRCS is to compare market and credit risk model outputs for a set of realistic asset portfolios". All insurance undertakings with significant exposure in EUR and with an approved internal model are asked to participate, see EIOPA [2021a]. In the study as of year-end 2019, 21 insurance companies from 8 different countries of the European Union participated.
All participants have to model the risk of 104 different synthetic instruments. Those comprise all relevant asset classes, i.e. risk-free interest rates, sovereign bonds, corporate bonds, equity indices, property, foreign exchange and some derivatives. A detailed overview of the synthetic instruments that are used in this study can be found in EIOPA MCRCS Project Group [2020a].
Additionally, those instruments are grouped into ten different asset-only benchmark portfolios, two liability-only benchmark portfolios and ten combined portfolios. These portfolios "should reflect typical asset risk profiles of European insurance undertakings", see EIOPA MCRCS Project Group [2020b, Section 2]. This analysis sheds light on the interaction and dependencies between the risk factors.
The combined portfolios are linear combinations of the asset and liability benchmark portfolios, e.g. BMP1+L1 combines the asset-only benchmark portfolio BMP1 with liability portfolio L1. Figure 2.1 presents the asset-type composition of the asset-only benchmark portfolios. All asset portfolios consist mainly of fixed income securities (86% to 94%), as this forms the main investment focus of insurance companies. However, there are significant differences in ratings, durations and in the weighting between sovereign and corporate exposures.
The two liability profiles are assumed to be zero-bond based, so they represent the liabilities of non-life insurance companies and differ in their durations (13.1 years vs. 4.6 years).
Annually, EIOPA publishes a detailed article of the MCRCS exercise. It provides an anonymized comparison of the risk charges of the different insurance companies' market risk models by portfolios, instruments and some additional analysis, e.g. dependencies of the risk factors. The study for year-end 2019 can be found on the EIOPA homepage, see EIOPA [2021a]. We will use the results of this study for comparison in Section 4.

Generative adversarial networks
Generative adversarial networks, called GANs, are an architecture consisting of two neural networks that interact with each other. GANs were introduced in 2014 by Goodfellow et al. [2014] and have gained a lot of attention since because of their promising results, especially in image generation. A good introduction to GANs can be found in Goodfellow et al. [2014], Goodfellow [2016] and Chollet [2018]. According to Motwani and Parmar [2020] and Li et al. [2020], GANs are one of the dominant methods for the generation of realistic and diverse examples in the domains of computer vision, image generation, image style transfer, text-to-image translation, time-series synthesis, natural language processing, etc.
Other popular methods for the generation of data based on empirical observations are variational autoencoders and fully visible belief networks, see Goodfellow [2016, p. 14].
Technically, a GAN consists of two neural networks, named generator and discriminator. The discriminator network is trained to distinguish real data points from "fake" data points and assigns every given data point a probability of being real. The input to the generator network is random noise stemming from a so-called latent space. The generator is trained to produce data points that look like real data points and would be classified by the discriminator as real with high probability. Formally, we can define GANs as follows, see Goodfellow et al. [2014, Chapter 3] and Wiese et al. [2020, Chapter 4]. For this purpose, let $(\Omega, \mathcal{F}, P)$ be a probability space and $N_X, N_Z \in \mathbb{N}$. Furthermore, assume that $X$ and $Z$ are $\mathbb{R}^{N_X}$- and $\mathbb{R}^{N_Z}$-valued random variables, respectively. The random variable $Z$ represents the latent random noise variable and $X$ the targeted random variable. $(\mathbb{R}^{N_Z}, \mathcal{B}(\mathbb{R}^{N_Z}))$ is called the latent space. Usually, $N_Z > N_X$ is chosen, see Goodfellow [2016, p. 18]. The goal of the GAN is to train a generator network such that the generator mapping of the random variable $Z$ to $\mathbb{R}^{N_X}$ has the same distribution as the target variable $X$. Let us first define the generator formally:
Definition 2. A neural network $G_{\theta_G}: \mathbb{R}^{N_Z} \to \mathbb{R}^{N_X}$ with parameter space $\Theta_G$ and $\theta_G \in \Theta_G$ is called a generator.
The counterpart of the generator in this game is the discriminator, which assigns to each generated or real data point $x \in \mathbb{R}^{N_X}$ a probability of being a realization of the target distribution.
Definition 3. A neural network $D_{\theta_D}: \mathbb{R}^{N_X} \to [0, 1]$ with parameter space $\Theta_D$ and $\theta_D \in \Theta_D$ is called a discriminator.
Given these two neural networks, we can now define a GAN as in Goodfellow [2016, Chapter 3]:
Definition 4. A GAN (generative adversarial network) is a network consisting of a discriminator $D_{\theta_D}$ and a generator $G_{\theta_G}$. The parameters $\theta_D \in \Theta_D$ and $\theta_G \in \Theta_G$ of both networks are trained to optimize the value function
$$V(G_{\theta_G}, D_{\theta_D}) = \mathbb{E}\left[\log D_{\theta_D}(X)\right] + \mathbb{E}\left[\log\left(1 - D_{\theta_D}(G_{\theta_G}(Z))\right)\right],$$
with $X$ the targeted random variable and $Z$ the latent random noise variable.
Z can be sampled from the latent space with any chosen distribution; however, one usually uses a normal distribution for Z, see Chollet [2018, Chapter 8.5.2]. The value function V has been defined in Goodfellow [2016]. Where does it come from? The discriminator $D_{\theta_D}$ has to distinguish between real and fake samples, i.e. solve a binary classification problem, see Goodfellow [2016, Chapter 3.2]. The discrete version of the value function V corresponds to the binary cross-entropy loss for the discriminator, which is usually used for binary classification problems solved by neural networks. A definition of the binary cross-entropy loss can be found e.g. in Ho and Wookey [2019, Section 3].
The optimization (the GAN objective) is given by
$$\min_{\theta_G} \max_{\theta_D} V(G_{\theta_G}, D_{\theta_D}).$$
That means that the discriminator is optimized to distinguish real samples from generated samples, whereas the generator tries to 'fool' the discriminator by generating samples so good that the discriminator is not able to distinguish them from real ones. A detailed derivation of the optimization of the objective and the modifications that can be used in GAN training can be found in Goodfellow [2016, p. 22] and Wiese et al. [2020, p. 9].
In the inner loop, $V(G_{\theta_G}, D_{\theta_D})$ takes its maximum value if the discriminator correctly assigns a value of 1 to all "real" data points and a value of 0 to all generated data points. The parameters $\theta_D$ of the discriminator network are optimized to fulfill this task. In the outer loop, the generator tries to fool the discriminator, and its parameters $\theta_G$ are optimized to maximize $D_{\theta_D}(G_{\theta_G}(Z))$, meaning that the discriminator shall assign a high probability of being real to the "fake" data points. To optimize the value function in practice, the training alternates between k steps of optimizing $D_{\theta_D}$ and one step of optimizing $G_{\theta_G}$; k is one of the hyperparameters of the GAN. The parameters of the neural networks $D_{\theta_D}$ and $G_{\theta_G}$ are initialized randomly.
An algorithm for the GAN training can be found in Goodfellow et al. [2014, Chapter 4, Algorithm 1]. In every iteration of the training process, the two neural networks are trained in turns, while the parameters of the other network are kept fixed. We provide here a short version of this algorithm:
Algorithm 1: GAN training with SGD (stochastic gradient descent) as optimizer, see Goodfellow et al. [2014, Chapter 4, Algorithm 1]. The discriminator is trained k times more often than the generator, the dimension of the latent space Z is $N_Z$, and $M \in \mathbb{N}$ is the batch size. All of these are hyperparameters of the GAN. The learning rates of the SGD algorithm are $\gamma_D$ and $\gamma_G$.
• Initialize parameters $w$ for the discriminator and $\theta$ for the generator.
• For each optimization step of the SGD:
  - for k steps do
    * Randomly draw a sample batch $\{x_1, ..., x_M\}$ of size M from the data generating distribution
    * Randomly sample $\{z_1, ..., z_M\}$ independent realizations of the random variable Z
    * Update the parameters $w$ of the discriminator (with fixed parameters $\theta$ of the generator):
      $$w \leftarrow w + \gamma_D \nabla_w \frac{1}{M} \sum_{m=1}^{M} \left[\log D_w(x_m) + \log\left(1 - D_w(G_\theta(z_m))\right)\right].$$
  - Randomly sample $\{z_1, ..., z_M\}$ independent realizations of the random variable Z
  - Update the parameters $\theta$ of the generator (with fixed parameters $w$ of the discriminator):
      $$\theta \leftarrow \theta - \gamma_G \nabla_\theta \frac{1}{M} \sum_{m=1}^{M} \log\left(1 - D_w(G_\theta(z_m))\right).$$
Despite their success, GANs remain difficult to train, as stated e.g. in Motwani and Parmar [2020]. The largest issue with GANs according to Goodfellow [2016, p. 34] is the non-convergence often observed in practice. This is due to a GAN not being a normal optimization task but a dynamic system seeking an equilibrium between two forces, see Chollet [2018, Chapter 8.5.2] and Salimans et al. [2016, p. 2]. Each time any parameter of one component, either the discriminator or the generator, is modified, the dynamic system is perturbed, see Motwani and Parmar [2020, p. 2]. Mazumdar et al. [2020, p. 25] provide a mathematical analysis of this non-convergence issue and state that "first, the equilibrium [of the GAN] that is sought is generally a saddle point and second, the dynamics of GANs are complex enough to admit limit cycles". Therefore, the model architecture and the hyperparameters have to be chosen carefully. Unfortunately, at the moment, there is no way to tell which hyperparameters and which architecture will perform best in the training, see Motwani and Parmar [2020, p. 2]. Therefore, some kind of validation has to be performed to control convergence of the GAN. One possible validation measure is the Wasserstein distance, see Section 3.4.
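To make the alternating scheme concrete, the following minimal sketch (not the network used later in this paper) trains a linear generator against a logistic-regression discriminator on a one-dimensional Gaussian target. It uses the common non-saturating generator loss instead of the pure minimax form; the target distribution, learning rates and parameterization are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30.0, 30.0)))

# Generator G(z) = a*z + b, discriminator D(x) = sigmoid(c*x + d)
a, b = 1.0, 0.0            # generator parameters (theta)
c, d = 0.1, 0.0            # discriminator parameters (w)
lr, k, M = 0.05, 2, 64     # learning rate, discriminator steps, batch size

for _ in range(5000):
    for _ in range(k):                    # k discriminator updates, generator fixed
        x = rng.normal(3.0, 1.0, M)       # "real" samples from the target N(3, 1)
        z = rng.normal(0.0, 1.0, M)       # latent noise
        g = a * z + b                     # "fake" samples
        grad_real = sigmoid(c * x + d) - 1.0   # d/du of -log sigmoid(u)
        grad_fake = sigmoid(c * g + d)         # d/du of -log(1 - sigmoid(u))
        c -= lr * (np.mean(grad_real * x) + np.mean(grad_fake * g))
        d -= lr * (np.mean(grad_real) + np.mean(grad_fake))
    # one generator update with the non-saturating loss -log D(G(z)), discriminator fixed
    z = rng.normal(0.0, 1.0, M)
    g = a * z + b
    dLdg = (sigmoid(c * g + d) - 1.0) * c
    a -= lr * np.mean(dLdg * z)
    b -= lr * np.mean(dLdg)

fake = a * rng.normal(0.0, 1.0, 10000) + b   # samples from the trained generator
```

After training, the generated samples should roughly match the target distribution; with richer targets, actual neural networks replace the two linear maps, but the alternating update structure stays the same.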
The evolution of the quality of the output during training can be visualized, in the case of a two-dimensional data distribution, in Figure 2.3. The red data here represents the empirical data to be learned, whereas the blue data is generated at the current stage of training of the generator. For the illustration, we use the same number of red and blue dots in each figure. One can clearly see that the generated data matches the empirical data more closely the more training iterations of the GAN have taken place.
Figure 2.3: Scatterplots of results of a GAN for a two-dimensional data distribution at different training iterations, red = empirical data, blue = generated data

Methodology and data

Workflow of a GAN-based internal model
The strength of GANs is exactly what ESGs should be good at: producing samples of an unknown distribution based on empirical examples of that distribution. Therefore, we will apply a GAN as an ESG. This is a different task from re-sampling, as explained e.g. in Yu [2002, p. 3-7], where the goal is to use subsets or bootstrapping from the empirical data to derive data sets with the same statistical properties as the empirical data. As we only have about 20 years of financial market history and need to evaluate a 0.5%-percentile, i.e. a one-in-200-years event, we need to produce financial data that has not yet occurred in the financial market but could have. Therefore, re-sampling is not an option. The dependency modelling in a GAN-based ESG is similar to using an empirical copula for the risk factors in a classical Monte-Carlo-based ESG. Figure 3.1 shows that a GAN, in contrast to re-sampling, really generates new scenarios. In this figure, we show scatterplots of four different risk factor pairs: 5-year vs. 10-year interest rates, Eurostoxx50 vs. German government bond spreads, Italian vs. German government bond spreads and AAA vs. BBB corporate credit spreads. Details on the data can be found in Sections 3.2 and 3.3.
The orange dots in the figure represent the 50,000 scenarios generated by the GAN for these risk factors. The blue dots are the 4330 empirical data points used for GAN training. The structure of the orange dots mimics the structure of the blue dots, but the GAN also generates new results that are not found within the blue dots. In re-sampling, the data would not go beyond the boundaries of the blue dots. For all 50,000 generated 46-dimensional scenario data points, we calculated the Euclidean distance to the nearest empirical data point in the 46-dimensional space. This distance is always above 0 and varies between 0.3 and 5.7. Figure 3.2 shows the histogram of those 50,000 minimum distances. In re-sampling, the distance between generated and empirical data points always equals 0. This illustrates that the GAN really generates new scenarios: it creates new and distinct scenarios that capture the intrinsic features of the historical data. Fu et al. [2019, p. 3] already noted in their paper that a GAN, as a non-parametric method, can be applied to learn the correlation and volatility structures of historical time series data and produce unlimited real-like samples that have the same characteristics as the empirically observed time series. Fu et al. [2019, Chapter 5] tested this with two stocks and calculated a 1-day VaR.
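The minimum-distance check described above can be reproduced in a few lines; the arrays below are random stand-ins for the actual 46-dimensional empirical returns and GAN scenarios.

```python
import numpy as np

rng = np.random.default_rng(1)
empirical = rng.normal(size=(1500, 46))   # stand-in for the 4330 empirical returns
generated = rng.normal(size=(500, 46))    # stand-in for GAN-generated scenarios

# For each generated scenario, Euclidean distance to the closest empirical point.
def min_distances(gen, emp):
    return np.array([np.sqrt(((emp - g) ** 2).sum(axis=1)).min() for g in gen])

dist = min_distances(generated, empirical)
# Re-sampled points would sit exactly on empirical points (distance 0);
# strictly positive distances show that the scenarios are genuinely new.
```

For the 50,000 actual scenarios, the same computation would simply use larger arrays; chunking or a k-d tree can speed it up if needed.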
In our work here, we demonstrate how to expand this to a whole internal model, with enough risk factors to model the full bandwidth of investments for an insurance company and with a one-year time horizon as required by Solvency 2. Diagram 3.3 schematically illustrates the workflow of a GAN-based internal model. It follows the process of a classical internal model, see Bennemann [2011, p. 177 ff.] and Gründl et al. [2019, p. 82], except that the ESG step is performed by a GAN instead of a Monte-Carlo simulation. Details on the steps are provided in the next sections and are similar to the steps taken in Wiese et al. [2020, Chapter 6].

Data selection
For the purpose of this article, we choose to model the risk charge for the ten MCRCS asset-only benchmark portfolios, the two liability-only portfolios and for the ten combined asset-liability portfolios. We are especially interested in the ability of the GAN to generate the joint movement of the risk factors, and not the distribution of single instruments. Therefore, we want to simulate only instruments and risk factors that are needed in the benchmark portfolios.
Out of the 104 instruments in the MCRCS study, only 71 are included in the benchmark portfolios; the remaining 33 single instruments are left out of our model. To model the included instruments, we have to select financial time series for the relevant risk factors (e.g. equity indices, interest rate buckets), as most of the instruments themselves are not traded and therefore have no time series. We found 46 risk factors to be sufficient to evaluate those instruments because several instruments depend on the same risk factor.
All relevant financial data will be derived from Bloomberg. An aggregated view of these risk factors together with the Bloomberg sources ("ticker") can be found in Appendix A. A mapping table indicating for each of the 71 instruments which risk factors are used for calculation can be found in Appendix B. In the same appendix, comments on any approximations used are found, too.
EUR swap rates and EUR corporate yields are available on a daily basis in Bloomberg since 25.02.2002 and 28.03.2002, respectively. Since the study is conducted as of year-end 2019, we take the data pool from end of March 2002 until December 2019 as the basis of our GAN training. We therefore use 4588 daily observations of 46 time series, covering almost 18 years, to train the GAN.

Data preparation
For Solvency 2, we need to model the market risk with a time horizon of one year, but we have daily observations. According to Yoon et al. [2019, p. 1], this temporal setting poses a particular challenge for generative modeling. One solution is to use the daily data to train the GAN model and then use some autocorrelation function to generate an annual time series based on daily returns, see Fu et al. [2019] and Deutsch [2004, Chapter 34]. However, an autoregressive model can only be used if strong assumptions about the underlying processes are made.
Another solution is to use overlapping rolling windows of annual returns on a daily basis to train the model. We decide to calculate the returns in rolling annual time windows for all available days. This method is used e.g. in Wiese et al. [2020, p. 16] and also in EIOPA [2021a, p. 17]. Deutsch [2004] describes the procedure in Section 32.3 and explains the drawback that consecutive annual returns have a high autocorrelation, as they differ only in one daily return. However, we make the assumption that each of these one-year returns is just one possible scenario of how the risk factors can shift within one year, even though this implies the input data is no longer independent. This, however, is similar to classical ESGs, where rolling windows are often used for calibration because there is not enough data available to use disjoint annual returns.
The second decision to be made is how to calculate the returns. Returns can be calculated as simple differences between the two time points in question, as relative returns, as log-returns, or using other transformations. EIOPA [2021a] in its analyses uses simple differences for interest rates and credit spreads and relative returns for equities, real estate and foreign exchange. The reasoning behind this is that in the low interest rate environment, relative returns do not make sense in many cases. Therefore, we will stick with this scheme.
To calculate rolling one-year returns, we make the assumption that one year has 258 trading days. This is obtained by dividing the number of daily observations we have for the risk factors (T = 4588) by the number of years from which the data originate (17.8 years).
That means, for each risk factor with value $s_t$ at time $t \in \{1, ..., T - 258\}$, with $t$ being the day index in the data set, the rolling annual return is calculated as
$$r_t = \frac{s_{t+258}}{s_t} - 1.$$
For interest rates and credit spreads, however, we calculate absolute rolling returns as
$$r_t = s_{t+258} - s_t.$$
Therefore, the training data for the GAN comprises T − 258 = 4330 observations of annual returns for each of the 46 risk factors. A neural network does not work properly when the input data is at different scales, see Chollet [2018, Chapter 3.6.2]. Therefore, we normalize the return data $r_t$ by risk factor (subtracting the mean and dividing by the standard deviation). As GANs require no split of the data into training/test/validation sets, we can use all the data for training, see Goodfellow [2016, Chapter 3].
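The data preparation steps described above (overlapping one-year windows, relative vs. absolute returns, per-factor normalization) can be sketched as follows; the two input series are synthetic stand-ins, and the 258-day year matches the convention stated earlier.

```python
import numpy as np

rng = np.random.default_rng(2)
T, H = 4588, 258                      # daily observations, trading days per year

# Synthetic stand-ins for two risk-factor series: an equity index level
# and an interest rate (in decimal), both observed daily.
equity = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, T)))
rate = 0.01 + np.cumsum(rng.normal(0.0, 0.0002, T))

# Overlapping rolling one-year returns: relative returns for equity-type
# factors, absolute differences for interest rates and credit spreads.
rel_ret = equity[H:] / equity[:-H] - 1.0
abs_ret = rate[H:] - rate[:-H]

# Normalize per risk factor: subtract the mean, divide by the standard deviation.
def normalize(r):
    return (r - r.mean()) / r.std()

train = np.column_stack([normalize(rel_ret), normalize(abs_ret)])
```

With the actual 46 risk-factor series, the resulting training matrix has the 4330 × 46 shape used for the GAN.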

Implementation of a GAN-based ESG
We implement the GAN in the programming language Python, which offers many packages that are useful in data science contexts. For pre- and post-processing we use the packages pandas, numpy, scipy and matplotlib. Our GAN implementation itself is based on the package keras, as described in Chollet [2018, Chapter 8.5].
It runs in a cloud environment with 144 virtual CPUs and 48 GiB RAM on a 3rd generation Intel Xeon processor. The training time for this GAN is about half an hour; the generation of the 50,000 output scenarios takes less than one minute. We used the following configuration for the GAN-based ESG:
• 4 layers each for discriminator and generator
• Initialization of generator and discriminator weights using a multivariate normal distribution with mean = 0 and std = 0.02
• LeakyReLU activation functions, except for the output layers, which use a sigmoid (discriminator) and a linear (generator) activation function.
We use the Adam optimizer and the regularization technique batch normalization after each of the hidden layers in the network. The loss function is binary cross-entropy.
LeakyReLU hereby is an alternative to the popular activation function ReLU and is defined as
$$\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0, \\ \alpha x & \text{otherwise.} \end{cases}$$
In our implementation, we set α = 0.2. The Adam optimizer is used here instead of the SGD of Algorithm 1. Given $w_k$ and the gradient $g_k$ of the loss at $w_k$, the Adam optimizer updates the weights of the generator and discriminator networks to $w_{k+1}$ as
$$m_{k+1} = \beta_1 m_k + (1 - \beta_1)\, g_k, \qquad v_{k+1} = \beta_2 v_k + (1 - \beta_2)\, g_k \odot g_k,$$
$$\hat m_{k+1} = \frac{m_{k+1}}{1 - \beta_1^{k+1}}, \qquad \hat v_{k+1} = \frac{v_{k+1}}{1 - \beta_2^{k+1}},$$
$$w_{k+1} = w_k - \gamma\, \frac{\hat m_{k+1}}{\sqrt{\hat v_{k+1}} + \epsilon},$$
with $m_0$ and $v_0$ being null vectors and ⊙ the Hadamard product. The parameters used in our implementation, as well as definitions and explanations of these functions and terms, can be found e.g. in Viehmann [2019] and Chollet [2018].
According to Goodfellow [2016, Chapter 5.2] "it is not clear how to quantitatively evaluate generative models. Models that obtain good likelihood can generate bad samples, and models that generate good samples can have poor likelihood. There is no clearly justified way to quantitatively score samples". According to Theis et al. [2015, p. 1] "generative models need to be evaluated directly with respect to the application(s) they were intended for".
In the literature where scenario generation for financial or similar data is done by GANs or other generative models, there is no clear favorite measure for validating the results. Many papers focus on visual inspection of histograms, scatterplots, etc., together with the comparison of some basic statistics (mean, standard deviation, skewness, kurtosis, percentiles, (auto)correlation); an example for this evaluation method is Franco-Pedroso et al. As the marginal distributions of the risk factors are of great importance for internal models, we will use the Wasserstein distance in this paper to evaluate whether the generated data matches the empirical data. The Wasserstein distance is commonly used to calculate the distance between two probability distribution functions, as mentioned in Borji [2019]. The definition of the Wasserstein distance can be found e.g. in Hallin et al. [2021, p. 5]. We use here the univariate Wasserstein distance with p = 1 as defined in Hallin et al. [2021, eq. 1]:
Definition 5. For univariate distributions P and Q with distribution functions F and G, the p-Wasserstein distance is given by the $L_p$-distance
$$W_p(P, Q) = \left( \int_0^1 \left| F^{-1}(u) - G^{-1}(u) \right|^p \, du \right)^{1/p}.$$
One could alternatively use other metrics to measure the distance between two probability distribution functions, e.g. the Kullback-Leibler divergence. We decided for the Wasserstein distance as it is easily interpretable, easily implementable and often used in the machine learning context. Figure 3.4 shows the development of the Wasserstein distances between the empirical distribution functions of the training and the generated data for all 46 risk factors in our GAN. We can see in this graph that the Wasserstein distance for all risk factors decreases over the training iterations in this configuration. To arrive at the GAN architecture mentioned above, we trained 25 different GANs varying in the number of layers and the number of neurons per layer.
For each of the 25 configurations, we then calculated the maximum of the Wasserstein distances over all risk factors. We then compared the minimum of this maximal Wasserstein distance over the training iterations across the 25 configurations and chose the configuration with the lowest value. Details and results of this procedure can be found in Appendix C.
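For equal-sized empirical samples, the 1-Wasserstein distance reduces to the mean absolute difference of the sorted samples, and the selection rule becomes a min-max over a distance matrix. The distance values below are synthetic stand-ins, not results of actual GAN trainings.

```python
import numpy as np

def wasserstein_1(x, y):
    """1-Wasserstein distance between two equal-sized empirical samples:
    the mean absolute difference of the sorted samples."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

rng = np.random.default_rng(3)
real = rng.normal(0.0, 1.0, 5000)

# Illustrative model selection: for each candidate configuration, record the
# Wasserstein distances of all 46 risk factors (stand-in values here), take
# the maximum per configuration, and keep the configuration whose maximum
# distance is smallest.
distances = rng.uniform(0.01, 0.5, size=(25, 46))   # 25 configs x 46 risk factors
worst_per_config = distances.max(axis=1)
best_config = int(worst_per_config.argmin())
```

A sanity check of the metric: shifting a sample by a constant c yields a distance of exactly c, which makes the values directly interpretable on the scale of the normalized returns.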
We now use the trained generator with this configuration to generate 50,000 financial scenarios for all risk factors.

Valuation of financial instruments and portfolio aggregation
As mentioned before, we want to evaluate the benchmark portfolios of the MCRCS study, which comprise 71 instruments. So, first, we have to evaluate the instruments in each scenario, applying the generated data for the 46 risk factors. For this task, we use the following formulas and methods.

Zero-coupon bond valuation
All interest- and spread-related instruments are zero-coupon bonds and can be valued with present value discounting for scenario n = 1, ..., 50000 as
$$ZC^n_\tau = \left(1 + r^0_\tau + \Delta r^n_\tau + s^0_\tau + \Delta s^n_\tau\right)^{-\tau},$$
where $ZC^n_\tau$ denotes the value of a zero-coupon bond with maturity τ, with starting interest rate $r^0_\tau$ and shift $\Delta r^n_\tau$ in scenario n, and starting spread $s^0_\tau$ with shift $\Delta s^n_\tau$ in scenario n. τ hereby equals the maturity of the bond as found in Table 5 in Appendix B. The starting time for the MCRCS study is year-end 2019. For details on the valuation of zero-coupon bonds, we refer to Albrecht and Maurer [2016, Chapter 8.4].
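A minimal sketch of this present value discounting, assuming the rate and spread shifts enter the discount factor additively as described above; the maturities, rates and shift sizes in the example are illustrative.

```python
import numpy as np

def zc_value(tau, r0, dr=0.0, s0=0.0, ds=0.0):
    """Present value of a zero-coupon bond with maturity tau (years) under a
    scenario shift dr of the interest rate and ds of the credit spread."""
    return (1.0 + r0 + dr + s0 + ds) ** (-tau)

# Example: 10-year corporate zero-coupon bond, 1% starting rate, 50 bp spread.
base = zc_value(10, 0.01, s0=0.005)
# Scenario: interest rates up 100 bp, spreads up 20 bp.
stressed = zc_value(10, 0.01, dr=0.01, s0=0.005, ds=0.002)
pnl = stressed - base    # negative: the bond loses value when rates/spreads rise
```

Applied per scenario, this yields the full P&L vector of each bond across the 50,000 generated shifts.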
For the default and migration process of corporate bonds, we use the rating migration matrix from Ratings [2018]. This is the S&P average European one-year corporate transition rate for the years 1981-2018, where we assume that high-yield bonds are B-rated. Since the credit spread can be interpreted as reflecting the probability of a bond defaulting, we scale the downgrade probabilities in the scenarios according to the percentage by which the generated spread exceeds the starting spread. For sovereign bonds, we analogously assume a default according to the rating of the country, scaled using the same methodology as for corporate bonds. The recovery rate is set at 45%.
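One possible sketch of such spread-driven scaling of a migration-matrix row; the exact scaling rule and the renormalization via the stay-probability are our own assumptions for illustration, and the probabilities are made up:

```python
import numpy as np

def scale_downgrades(row, j, factor):
    """Scale the downgrade probabilities of a migration-matrix row
    (entries to the right of the current rating index j) by `factor`
    and absorb the difference in the stay-probability so the row
    still sums to one."""
    out = np.asarray(row, dtype=float).copy()
    extra = out[j + 1:].sum() * (factor - 1.0)
    out[j + 1:] *= factor
    out[j] -= extra
    return out

# Illustrative: generated spread 50% above the starting spread -> factor 1.5
row = scale_downgrades([0.00, 0.90, 0.08, 0.02], j=1, factor=1.5)
print(row, row.sum())
```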

Equity and property instrument valuation
For equities and property, the market value of the instruments is scaled with the percentage shift of the respective risk factor to value those instruments in scenario n.

Valuation of the liabilities
The method used for discounting the liabilities in Solvency 2 is laid out in EIOPA [2019]. The Solvency 2 framework assumes that the interest rate market is only liquid up to a maturity of 20 years. In this period, a credit risk adjustment of 10 basis points is deducted before discounting. Beyond that point, the yield curve is extrapolated towards a so-called ultimate forward rate ("UFR"), which is the 1-year forward rate valid at a maturity of 60 years. The UFR is updated annually and was 3.9% at year-end 2019. The extrapolation method is based on Smith and Wilson [2001] and its features are discussed e.g. in Viehmann [2019] and Lagerås and Lindholm [2016]. In our implementation, after generating the scenarios, we use this extrapolation method to derive the risk-free yield curve and discount the two liability benchmark portfolios accordingly.

Portfolio aggregation
After calculating the profit and loss of each relevant instrument in each scenario, the portfolio aggregation is straightforward using the weights given in the EIOPA MCRCS study, see EIOPA MCRCS Project Group [2020a]. The risk charge then is the 0.5%-percentile of the 50,000 scenarios for each portfolio.
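The aggregation step can be sketched as a weighted sum over instruments followed by an empirical quantile; the P&L data and weights below are illustrative, not the MCRCS figures:

```python
import numpy as np

rng = np.random.default_rng(1)
n_scen = 50_000

# Illustrative instrument-level P&L per scenario (3 instruments) and weights
pnl = rng.normal(0.0, 0.05, size=(n_scen, 3))
weights = np.array([0.5, 0.3, 0.2])

portfolio_pnl = pnl @ weights                 # aggregate with portfolio weights
var_99_5 = np.percentile(portfolio_pnl, 0.5)  # 0.5%-percentile = 99.5% VaR (a loss)
print(round(var_99_5, 4))
```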

Comparison of GAN results with the results of the MCRCS study
Now we can compare the results of our GAN-based model for both risk factors and benchmarking portfolios with the risk derived from approved internal models in Europe using the results of the MCRCS study. The study for year-end 2019 can be found on EIOPA's homepage, see EIOPA [2021a].
The results on risk-factor basis are analyzed in Section 4.1 based on the shocks generated or implied by the ESGs in the study. A shock hereby is defined in EIOPA [2021a, p. 11] as

Definition 6. A shock is the absolute change of a risk factor over a one-year time horizon. Depending on the type of risk factor, the shocks can either be two-sided (e.g. interest rates 'up/down') or one-sided (e.g. credit spreads 'up').

This metric takes into account the undertakings' individual risk measure definitions and is based on the 0.5% and 99.5% quantiles for two-sided risk factors and the 99.5% quantile for one-sided risk factors, respectively.
The main comparison between the results for the benchmark portfolios is based on the risk charge, which is defined in EIOPA MCRCS Project Group [2020b, Section 2]. We analyze this in Section 4.2.
Definition 7. The risk charge is the ratio of the modeled Value at Risk (99.5%, one year horizon) and the provided market value of the portfolio.
The results are presented in diagrams showing the 10%-, 25%-, 75%- and 90%-percentiles of all insurance companies participating in the study. There are 21 participants, and only insurers with at least some exposure to the respective risk factor are shown. For our comparison, we want to compare our results with the whole bandwidth of internal models in Europe. Therefore, we calculated an implied mean and standard deviation based on the 10%- and 90%-percentiles under the assumption of normally distributed results. Based on this, we derived theoretical 1%- and 99%-percentiles, shown as boxes for each maturity / sub-type of the risk factors resp. portfolios. The given 10%- and 90%-percentiles are shown as frames inside the boxes.
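The implied 1%- and 99%-percentiles can be derived from the published 10%- and 90%-percentiles under the normality assumption; a minimal sketch, where the input shocks are illustrative and not actual MCRCS figures:

```python
from scipy.stats import norm

def implied_percentiles(p10, p90):
    """Fit a normal distribution to given 10%- and 90%-percentiles and
    return the implied 1%- and 99%-percentiles."""
    mean = 0.5 * (p10 + p90)
    std = (p90 - p10) / (norm.ppf(0.9) - norm.ppf(0.1))
    return mean + norm.ppf(0.01) * std, mean + norm.ppf(0.99) * std

# Illustrative shock percentiles of 1.0% and 3.0%
p1, p99 = implied_percentiles(0.010, 0.030)
print(round(p1, 4), round(p99, 4))
```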
We enrich those MCRCS results with a blue dot representing the shock resp. risk charge for that risk factor or portfolio generated by our GAN-based model.
After that, in Section 4.3, we use the results shown in a so-called excursus focusing on the market development during the Covid-19 crisis to compare the results of the approved models with our GAN-based model as well. The dependency structures of the internal models are analyzed using joint quantile exceedance as a metric in EIOPA [2021a, Chapter 5.2.6]. The comparison to the GAN results can be found in Section 4.4. The section concludes with an examination of the stability of the GAN output.

Comparison on risk-factor level
In this work, we will show the comparison of the five most important risk factor categories (corporate and sovereign credit spread, equity, interest rate up and down). Other risk factors show a similar behaviour. For corporate as well as sovereign credit spreads in Figure 4.1, we see a very good alignment between the GAN-based model and the approved internal models. The Ireland sovereign spread has been excluded from the comparison as only five participants submitted results for this risk factor. The shock reported in EIOPA [2021a, p. 26] varies between 1.1% and 3.5%, which seems inconsistent with the sharp increase of up to 8.3% within 12 months that Ireland experienced during the financial crisis. Therefore, the data quality here does not seem sufficient for a comparison. On the equity side in Figure 4.2, the shifts are also similar for most of the risk factors. For the FTSE100, the GAN produces less severe shocks than most of the other models. This behaviour, however, can actually be found in the training data, as the FTSE100 is less volatile than the other indices over the time frame used in GAN training. So the GAN here produces plausible results.
For interest rates, however, the picture is a bit more complex, as Figure 4.3 illustrates: The up-shifts generated by the GAN-based model are within the boxes for all buckets. However, for longer maturities, the shifts tend to be at the lower end of the boxes. This effect is due to the time span of the data used for training the GAN, in which interest rates were mostly decreasing.
For the down-shifts, we can observe the following behaviour: Short-term interest rates are below the boxes, whereas the middle- and longer-term interest rates are inside the boxes. This can be explained by the interest rate development: The time span used for training the GAN shows a sharp decrease of interest rates especially in the short term, whereas longer-term interest rates behaved more stably. This behaviour is mimicked by the GAN. In traditional ESGs, in addition to the longer time span used for calibration, expert judgement by the insurers often leads to a lower bound on how negative interest rates can become. One of the most common arguments for a lower bound on interest rates according to Grasselli and Lipton [2019, Chapter 4] is that instead of investing money at negative interest rates, asset managers could also convert the money into cash and store it. However, the conversion of large amounts of money into cash poses many practical issues and is therefore unrealistic. Grasselli and Lipton [2019] use this argument to derive a cash-related physical lower bound of about -0.5%. Danthine [2017, Chapter 2] states that the lower bound for interest rates is not far below -0.75% in the current environment. For illustration purposes, we introduced light blue dots in Figure 4.3 where we limited the down-shift to -1.9% in the GAN (as this is the value of the 10%-percentile). This, however, does not change the results on the portfolio level as presented in Section 4.2 significantly (difference in VaR always below 0.1%). If an insurance company wishes to limit downside interest rate shifts, this would be a reasonable approach.

Comparison on portfolio level
First, we show the comparison of the risk charges for the ten asset-only benchmark portfolios of the MCRCS study: The risk charge of the GAN-based model fits well to the risk charges of the established models and always lies within the gray boxes. The blue dot tends to be at the lower part of the boxes for these portfolios. This is due to the fact that increasing interest rates constitute a main risk. As we showed in Section 4.1, the shocks of the GAN-based model for increasing interest rates are at the lower part of the boxes for longer maturities, too, which explains this behaviour.
The two liability-only portfolios differ by duration: L1 has a duration of 13.1 years versus a 4.6-year duration of L2. For the liability-only portfolios, too, the risk charge of the GAN-based model fits well to the risk charges of the established models and always lies within the gray boxes. Here, the blue dot tends to be at the upper part of the boxes. The risk charge of the liability-only portfolios is caused by scenarios with decreasing interest rates. In the risk factor comparison of the interest rate down shock in Section 4.1, the down-shifts of the GAN-based model tend to be more severe for most maturities than for the average of the other models. Therefore, it seems plausible for the resulting portfolio risk to be at the upper part, too.

Comparison of the Covid-19 backtesting results
In a so-called excursus, EIOPA [2021a, p. 17-21], the study examines whether the turmoil in the financial markets following the Covid-19 crisis in spring 2020 is covered by the generated scenarios of the tested models. This can be seen as a kind of backtesting exercise, as this event was not part of the calibration / training process for the ESGs. In the study, EIOPA calculates for each benchmark portfolio P the worst one-year rolling return over the observation window including the Covid-19 crisis, denoted WorstCase_P. Let F_{P,M} be the empirical distribution function under model M of the returns of portfolio P over the 50,000 scenarios, relative to the current market value. By

α_{P,M} = F_{P,M}(WorstCase_P)

we denote the probability that a relative loss at least as severe as in the Covid-19 crisis occurs to portfolio P under model M. EIOPA [2021a, p. 21] states that "the Covid-19 related market impacts can certainly be seen as significant. From a general perspective of internal market risk models, there is no evidence that this could be interpreted as an event beyond the scope of application of these models." This means that the worst-case losses of this time period should, for every benchmark portfolio, be within the loss distribution generated by the models, and ideally α_{P,M} is above 0.5%. The worst case for most portfolios occurs during the Covid-19-related market turmoil in the first half of 2020, see EIOPA [2021a, p. 19].
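The exceedance probability α_{P,M} is just the empirical distribution function of the scenario returns evaluated at the worst-case return; a minimal sketch with made-up returns:

```python
import numpy as np

def alpha(returns, worst_case):
    """Empirical probability that a scenario return is at least as severe
    as the observed worst-case return, i.e. the empirical CDF of the
    scenario returns evaluated at worst_case."""
    return float(np.mean(np.asarray(returns) <= worst_case))

# Illustrative scenario returns and a Covid-19-style worst-case return
scenario_returns = np.array([-0.60, -0.50, -0.10, 0.05, 0.20])
print(alpha(scenario_returns, worst_case=-0.55))
```

In the study, this value would be computed per portfolio P over the 50,000 generated scenarios.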
Explicit comparison of the values α_{P,M} across the models is only provided for the asset-only benchmark portfolio BMP1 and the asset-liability benchmark portfolio BMP1+L1. The boxes in the graph have the same meaning as in Sections 4.1 and 4.2 above. The results match those of Section 4.2; in this examination, too, the GAN-based model behaves similarly to the other models. For the other benchmark portfolios, we also calculate the implied percentiles of the WorstCase return in spring 2020 and the corresponding value α_{P,GAN} for the GAN-based model. In Table 1, the value α_{P,GAN} = 3.5% for portfolio BMP1+L1 means that in 3.5% of the scenarios, the GAN-based model generated a return more severe than −58.2%, with −58.2% being the WorstCase return of portfolio BMP1+L1 encountered during the Covid-19 crisis.
As displayed in Table 1, for all portfolios α P,GAN for the GAN-based model is above 0.5% which is in line with the EIOPA expectations mentioned above.

Comparison of joint quantile exceedance results
The market-risk dependency structures in the models are examined in EIOPA [2021a, Chapter 5.2.6] on a risk-factor basis. Results are only presented for the comparison of the joint quantile exceedance, which is defined as follows:

Definition 8. The bivariate joint quantile exceedance probability (JQE) is the joint probability that both risk factors simultaneously surpass the same quantile.
For the comparison, a percentile of 80% is used. This is a compromise between having enough data to examine and focusing on the tail of the distribution. For independent risk factors, the joint quantile exceedance therefore equals JQE = 20% · 20% = 4%. If the risk factors have a correlation of 1, the JQE equals 20%; for a correlation of -1, the JQE is 0%. In the study, EIOPA [2021a, p. 34], the matrix of JQEs is presented as boxplots for all pairs of 7 selected risk factors. Note that the joint quantile exceedance of a risk factor with itself is not shown.
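The JQE at the 80% level can be estimated from samples as the share of observations in which both risk factors exceed their own 80%-quantile; a sketch with illustrative, perfectly dependent data:

```python
import numpy as np

def joint_quantile_exceedance(x, y, q=0.80):
    """Share of observations in which both risk factors simultaneously
    exceed their respective q-quantile."""
    tx, ty = np.quantile(x, q), np.quantile(y, q)
    return float(np.mean((x > tx) & (y > ty)))

x = np.arange(100.0)
print(joint_quantile_exceedance(x, x))   # comonotone pair: 0.20
print(joint_quantile_exceedance(x, -x))  # countermonotone pair: 0.0
```

These two extreme cases reproduce the theoretical bounds of 20% and 0% mentioned above.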
We show here the results of the comparison for one credit spread, one equity and one interest rate risk factor. Other comparisons follow a similar pattern. For the pairwise JQE results, the GAN-based model always lies within the boxes of the internal models of the MCRCS study. Therefore, the dependency structure generated by the GAN-based ESG seems to resemble the dependency structures used in internal market risk models in Europe.

Stability of GAN results
One important question for an internal model is how stable the results of the GAN-based ESG are. Since the GAN is initialized with random parameters and the sampling for the batches in training is random as well as the generation of the random variables in latent space, we want to test whether the results in the previous sections are stable for different GAN runs.
We trained four different GANs with the same architecture (but different random initializations) and used each of these four trained generators five times to generate 50,000 scenarios for each of the 46 risk factors, leading to 20 sets of 50,000 scenarios each. For these 20 sets, we check how stable the resulting risk charge for the risk factors is, in particular whether the up- and down-shifts in each of the sets lie within the gray boxes from the EIOPA MCRCS study used above. Figure 4.9 illustrates this for four different risk factors, namely the 5-year interest rate up- and down-shift, the 5-year AA corporate credit spread and the 5-year German sovereign credit spread. For interest rate down-shifts, we display the absolute values in the graph. We again present the gray boxes from the MCRCS study, but instead of one blue dot for the GAN result, we show 20 coloured dots, each representing one of the 20 runs. As the dots overlap, not all of them are visible in the graph. For all risk factors, including those not shown in the figure, we verified that all shifts lie within the gray boxes for each of the 20 runs, except for the interest rate outliers commented on in Section 4.1.
For a quantitative stability check, we first calculate, for each of the 46 risk factors in each of the 20 scenario sets, the 0.5%-percentile and the 99.5%-percentile (equalling the up- and down-shifts used in the MCRCS study). Then, for each risk factor and each percentile, we compute the empirical first and third quartiles Q1 and Q3 over the 20 runs and report the coefficient of quartile variation

CQV = (Q3 − Q1) / (Q3 + Q1),

see Kokoska and Zwillinger [2000, eq. (2.30)]. Appendix A shows the whole table of results for all risk factors and both percentiles. The coefficient of quartile variation for both percentiles lies between 0.5% and 10.0%. This indicates that the results are stable over different runs of the GAN.
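The coefficient of quartile variation is straightforward to compute per risk factor and percentile; a sketch with toy data standing in for the 20 per-run shift values:

```python
import numpy as np

def coefficient_of_quartile_variation(values):
    """CQV = (Q3 - Q1) / (Q3 + Q1) of a sample, cf. Kokoska and
    Zwillinger (2000), eq. (2.30)."""
    q1, q3 = np.percentile(values, [25, 75])
    return (q3 - q1) / (q3 + q1)

# Illustrative: 0.5%-percentile shifts of one risk factor over several runs
shifts = np.array([0.0, 2.0, 4.0, 6.0])  # toy data, not our actual shifts
print(coefficient_of_quartile_variation(shifts))
```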
Overall, the GAN-based model shows results comparable to the certified internal models in Europe in every dimension of the study. Therefore, the proof of concept that a GAN can serve as an ESG for market risk modeling is successful. Note that from a regulatory perspective it is desirable that all approved internal models for market risk lead to comparable results. Therefore, this research indicates that a GAN-based model can be seen as an appropriate alternative way of market risk modeling.

Conclusions and discussion of results
In this research, we have shown how a generative adversarial network (GAN) can serve as an economic scenario generator (ESG) for the calculation of market risk in insurance companies. We used data from Bloomberg to model financial instruments and to derive financial scenarios of how they can behave over a one-year time horizon. We applied this to the risk factors and benchmark portfolios of the EIOPA MCRCS study, which reflect typical market risk profiles of European insurance undertakings. We have shown that the results of the GAN-based model are comparable to the currently used ESGs, which are usually based on Monte-Carlo simulation with financial mathematical models. Hence, our research indicates that this approach could also serve as a regulatory approved model, as it performs well in the EIOPA benchmark study.
Compared to current approaches, a GAN-based ESG approach does not require assumptions about the development of the risk factors, e.g. on the drift of equities or on the negativity of interest rates. The only assumption in a GAN-based model is that "it is all in the data". This is similar to the assumption made in the calibration of a traditional ESG where the empirical returns in the past serve as calibration targets.
The dependencies in a GAN-based model are automatically retrieved from the empirical data. In modern risk models, dependencies are often modeled using copulas. A certain copula has to be selected for each dependency and must be fitted so that the desired dependencies are reached. In practice, the modeling of dependencies is very difficult.
Calibration of the financial models to match the empirical data is a task that has to be performed regularly by risk managers to keep the models up to date. This is a cumbersome process and there is no standard process for calibration, see DAV (Deutsche Aktuarsvereinigung e.V.) [2015, Chapter 2.1]. This task is not needed for GAN-based models which makes them easier to use. If new data are to be included, the GAN simply has to be fed with the new data. Once the configuration and hyperparameter optimization of the GAN has been set up, the training process is fairly straightforward.
One drawback of a GAN-based ESG is the fact that it relies purely on events that have happened in the past in the financial markets and cannot e.g. produce new dependencies that are not included in the data the model is trained with. Classical financial models aim to derive a theory based on developments in the past and can therefore probably produce scenarios a GAN cannot come up with. Moreover, the classical models are easier to interpret and explain to a Board or a regulator whereas a GAN can be considered a "black box".
However, it is probable that a GAN-based ESG adapts faster to a regime switch in one of the model's risk factors: a classical ESG requires a new financial model to be developed and implemented, whereas a GAN just has to be trained with new data. To some extent, this behaviour can be obtained in classical ESGs, too, if recent data is weighted more heavily than older data.
In summary, a GAN-based internal market risk model is feasible and can be seen as either an alternative to classical internal models or as a benchmark.

Data Availability Statement
All the financial data used in this research is available via Bloomberg with the tickers provided in the appendix.

Conflicts of interest

The following explains the approximations and simplifications used due to data availability reasons:
• As most participants in the study, we do not distinguish between different types of corporate bond spreads, i.e. financial and non-financial corporates are modeled with the same data. As written in EIOPA [2021a, p. 24], this is a simplification used by two thirds of the participants.
• Since there is no long time series available for the required supranational paper issued by the ESM (European Stability Mechanism), we use the German spreads as an approximation instead.
• There is no reliable daily data source for AAA-rated and high-yield bonds in Bloomberg. For AAA-rated bonds, like most participants in the study, we instead use covered bond spreads, which are also rated AAA, see EIOPA [2021a, p. 24]. The most frequent data for high-yield bonds that we found can be derived from the Merrill Lynch spread index, which is a weekly index.
• For real estate, there is no direct transaction based data available on high frequency. The most frequent direct real estate data is available on a monthly basis. We will therefore use an index representing Real Estate Investment Trusts (REITs) and stocks from Real Estate Holding & Development Companies. As there is no index to be found that is geography specific for the real estate holdings in the study, we will use a diversified European index for all real estate instruments.
• As the liquidity of government bonds becomes thin at longer maturities, we use the 10-year spreads for both the 10- and 20-year bonds in the study.
As the training of a neural network requires a lot of data, we use daily data wherever possible. For some risk factors, single data points are missing; we replace them with the preceding data point. If longer data periods are missing for spreads or interest rates, we replace this part with the time series of the same risk factor with a different maturity. For the Ireland government bond spreads, where there is a four-month period in which this is not possible, we fill the gap using regression and interpolation with the Portugal spreads, which were at a similar level at that time.
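The treatment of single missing data points can be sketched with pandas (the series below is illustrative, not our Bloomberg data):

```python
import numpy as np
import pandas as pd

# Illustrative daily spread series with a single missing observation
spreads = pd.Series([0.012, 0.013, np.nan, 0.014],
                    index=pd.date_range("2019-01-01", periods=4, freq="B"))

# Replace a missing data point with the preceding one (forward fill)
filled = spreads.ffill()
print(filled.tolist())
```

Longer gaps are handled differently, as described above, via same-factor time series of other maturities or regression on similar series.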
These approximations in the risk factors are adequate for the purpose of this paper. If some of these risk factors are very important to an insurance company that wants to adopt this concept, the data can either be sourced from a different data provider or another data-enrichment technique can be used.
High-yield bond spreads are the only instruments for which we have only weekly rather than daily data. There, we use the rolling 12-month absolute returns for all weekly data points and interpolate between these points using a regression on the BBB spreads.
For bonds that are not risk-free, usually yields instead of spreads are available. For this exercise, we transform them into spreads by subtracting the relevant interest rate.
C Optimization of GAN architecture using Wasserstein distance

A full optimization is not possible since, see Motwani and Parmar [2020, Chapter 2], the "selection of the GAN model for a particular application is a combinatorial exploding problem with a number of possible choices and their orderings. It is computationally impossible for researchers to explore the entire space." So, in our work, we trained different GANs with
• the number of layers for generator and discriminator varying between 2, 4, 6 and 8,
• the number of neurons for generator and discriminator varying between 100, 200 and 400.
During the experiments, we fixed the following choices:
• Batch size is M = 200.
• k = 10 training iterations for the generator in each discriminator training.
• The dimension of the latent space is 200; the distribution of Z is multivariate normal with mean = 0 and std = 0.02.
• Initialization of generator and discriminator using a multivariate normal distribution with mean = 0 and std = 0.02.
• We use LeakyReLU with α = 0.2 as activation function, except for the output layers, which use a sigmoid (discriminator) and a linear (generator) activation. Additionally, we apply the regularization technique batch normalization after each of the hidden layers in the network.
• We use the Adam optimizer with the parameters given in Section 3.4 in Equation 3.1.
To arrive at one evaluation figure per GAN configuration, we aggregate the 46 Wasserstein distances in each training iteration for each tested GAN configuration n by defining the target function

tf_n = min over the training iterations of max_{i=1,...,46} W^n_i,

with W^n_i being the Wasserstein distance between the empirical distribution functions of the training data and the generated data for risk factor i in GAN configuration n, as mentioned in Section 3.4. As we test 16 different configurations for the number of layers and 9 different configurations for the number of neurons, n varies between 1 and 25.
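Given a history of per-iteration Wasserstein distances for one configuration, the target function is a max over risk factors followed by a min over iterations; a sketch with a made-up distance history:

```python
import numpy as np

def target_function(w_history):
    """tf_n for one GAN configuration: the minimum over training
    iterations of the maximum Wasserstein distance over all risk
    factors. w_history has shape (n_iterations, n_risk_factors)."""
    return w_history.max(axis=1).min()

# Illustrative distance history: 3 iterations x 4 risk factors
# (in our setting the second axis would have 46 entries)
w = np.array([[0.9, 0.8, 0.7, 0.6],
              [0.4, 0.5, 0.3, 0.2],
              [0.3, 0.6, 0.2, 0.1]])
print(target_function(w))
```

The configuration with the smallest tf_n is then selected, as described in Appendix C.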
First, we run the GAN with 16 different configurations of layers for the discriminator D and the generator G, with 200 neurons in each layer of the network. For each neural network, we use between 2 and 8 layers in steps of 2. The resulting tf_n-values are the following (the number of layers in D in the columns, those of G in the rows):

Table 6: tf_n-values for varying number of layers in both networks

The minimum of our target function tf_n is reached when both neural networks have 4 layers (marked in bold). We cannot confirm in this experiment the thesis of Goodfellow [2016, p. 33] that the discriminator is usually deeper than the generator. What we can clearly see is that the number of layers plays an important role in the performance of the GAN.
Next, we run the GAN with 9 different configurations of neurons per hidden layer for the discriminator D and the generator G. In this experiment, the number of neurons per layer is always equal within the respective neural network. The resulting tf_n-values are given analogously (the number of neurons per layer in D in the columns, those of G in the rows). The minimum of our target function tf_n is reached with the discriminator having 400 and the generator having 200 neurons per layer (marked in bold). This is the configuration we then used in our further research.