A generative adversarial network approach to calibration of local stochastic volatility models

We propose a fully data-driven approach to calibrate local stochastic volatility (LSV) models, circumventing in particular the ad hoc interpolation of the volatility surface. To achieve this, we parametrize the leverage function by a family of feed-forward neural networks and learn their parameters directly from the available market option prices. This should be seen in the context of neural SDEs and (causal) generative adversarial networks: we generate volatility surfaces by specific neural SDEs, whose quality is assessed by quantifying, in an adversarial manner, distances to market prices. The minimization of the calibration functional relies strongly on a variance reduction technique based on hedging and deep hedging, which is interesting in its own right: it allows model prices and model implied volatilities to be computed accurately using only small sets of sample paths. For numerical illustration we implement a SABR-type LSV model and conduct a thorough statistical performance analysis on many samples of implied volatility smiles, showing the accuracy and stability of the method.


Introduction
Each day a crucial task is performed in financial institutions all over the world: the calibration of stochastic models to current market or historical data. So far the model choice has been driven not only by the capacity to capture empirically observed market features well, but also by the computational tractability of the calibration process. This is now undergoing a big change, since machine learning technologies offer new perspectives on model calibration.
Calibration is the choice of one model from a pool of models, given current market and historical data, possibly with some information on their significance. Depending on the nature of the data this is considered an inverse problem or a problem of statistical inference. We consider here current market data, e.g. volatility surfaces, and therefore emphasize the inverse-problem point of view. We stress, however, that the ultimate goal of calibration is to include both data sources simultaneously. In this respect machine learning might help considerably.
We can distinguish three kinds of machine-learning-inspired approaches for calibration to current market prices. First, having solved the inverse problem several times already, one can learn from this experience (i.e. training data) the calibration map from market data to model parameters directly. Let us mention here one of the pioneering papers, by A. Hernandez [26], which applied neural networks to learn this calibration map in the context of interest rate models. This was taken up in [11] for calibrating more complex mixture models. Second, one can learn the map from model parameters to market prices and then invert this map, possibly with machine learning technology. In the context of rough volatility modeling, see [17], such approaches have turned out to be very successful: we refer here to [3] and the references therein. Third, the calibration problem is considered as the search for a model which generates given market prices, using technology from generative adversarial networks, first introduced by [19]. This means parametrizing the model pool in a way which is accessible to machine learning techniques and interpreting the inverse problem as a training task for a generative network, whose quality is assessed by an adversary. We pursue this approach in the present article.
1.1. Generative adversarial approaches in finance. At first sight it might seem a bit unexpected to embed calibration problems in the realm of generative adversarial networks, which are rather applied in areas like photorealistic image generation: a generative adversarial network, here mainly in its causal interpretation, is a neural network (mostly of recurrent type) G_θ depending on parameters θ, which transports a standard input law P_I (e.g., in our case of LSV modeling, the law of a Brownian motion W together with an exogenously given stochastic volatility process α) to a target output law P_O. In the context of financial modeling, P_O corresponds to the "true" law of the market, deduced from data (e.g., the empirical measure) and not necessarily fully specified. It can correspond either to the physical or to a risk-neutral measure, or even to both, depending on the goal.
Denoting the push-forward of P_I under the transport map G_θ by G_θ∗P_I, the goal is to find parameters θ such that G_θ∗P_I ≈ P_O. For this purpose appropriate distance functions have to be used. Standard examples include entropies, integral distances, Wasserstein or Radon distances, etc. The adversarial aspect appears when the chosen distance is represented as a supremum over certain classes of functions, which can themselves be parametrized via neural networks of a certain type. This leads to a game, often of zero-sum type, between the generator and the adversary. In our case the measure P_O is not even fully specified; only certain functionals, namely the given set of market prices, are provided. Therefore we shall measure the distance by the following neo-classical calibration functional

$$ \inf_{\theta} \sup_{\gamma} \sum_{C \in \mathcal{C}} w^{\gamma}_{C} \, \big( \mathbb{E}_{G_\theta * \mathbb{P}_I}[C] - \mathbb{E}_{\mathbb{P}_O}[C] \big)^2, $$

where 𝒞 is the class of option payoffs and w^γ_C are weights, which are parametrized by γ to account for the adversarial part. Similarly one could also consider distances of Wasserstein type (in their dual representation)

$$ \inf_{\theta} \sup_{\gamma} \big( \mathbb{E}_{G_\theta * \mathbb{P}_I}[f_\gamma] - \mathbb{E}_{\mathbb{P}_O}[f_\gamma] \big), $$

where γ parametrizes appropriate function classes, e.g. approximations of Lipschitz functions, neural networks or again payoffs of options.
In general we can consider distance functions d_γ such that the game between generator and adversary appears as

$$ \inf_{\theta} \sup_{\gamma} \; d_\gamma\big( G_\theta * \mathbb{P}_I, \, \mathbb{P}_O \big). \tag{1.1} $$

A distance function here just maps two laws on path space to a non-negative real number, allowing for minima if P_O is fixed: we assume neither symmetry, nor a triangle inequality, nor definiteness.
The advantage of this point of view is two-fold: (i) we have access to the unreasonable effectiveness of modeling by neural networks, i.e. the counter-intuitive, highly over-parametrized ansatz of considering a local stochastic volatility model as a recurrent neural network is beneficial through its miraculous generalization and regularization properties; (ii) the game-theoretic view disentangles realistic price generation from discriminating certain options, e.g. by putting higher weights on them. Notice that (1.1) is not the usual form of GAN problems, since the adversarial distance d_γ is non-linear in P_I and P_O, but we believe that it is worth taking this abstract point of view.
There is no reason why these generative models, if sufficient computing power is available, should not take market price data as inputs, too. From the point of view of generative adversarial networks, this would correspond to actually learning a map G_{θ, market prices}, such that for any configuration of market prices one instantaneously has a generative model which produces those prices. This requires just a rich data source of typical market prices (and computing power!), and would ultimately connect the first and third approaches to calibration.
Even though it is usually not considered like that, one can also view the generative model as an engine producing a likelihood on path space: given historical data this would allow, with precisely the same technology, a maximum likelihood approach. In this case P_O is just the empirical measure of the one observed trajectory, which is inserted into the likelihood function on path space. This then falls in the realm of generative approaches that appear in the literature under the name "market generators". Here the goal is to mimic precisely the behavior and features of historical market trajectories. This line of research has recently been pursued in e.g. [31, 43, 25, 6, 2].
From a bird's eye perspective this machine learning approach to calibration might just look like a standard inverse problem with another parametrized family of functions. We, however, insist on one important difference: implicit or explicit regularization (see e.g. [24]), which always appears in machine learning applications and which is cumbersome to mimic in classical inverse problems, is one of the secrets of this success.
In general, machine learning approaches are becoming more and more prolific in mathematical finance. Concrete applications include hedging [5], portfolio selection [16], stochastic portfolio theory [37, 12], optimal stopping [4], optimal transport and robust finance [15], stochastic games and control problems [28], as well as high-dimensional non-linear partial differential equations (PDEs) [22, 29]. Machine learning also allows for new insights into structural properties of financial markets, as investigated in [39]. For an exhaustive overview of machine learning applications in mathematical finance, in particular for option pricing and hedging, we refer to [36].
1.2. Local stochastic volatility models as neural SDEs. In the present article we focus on the calibration of local stochastic volatility (LSV) models, which is still an intricate task, both from a theoretical and a practical point of view. LSV models, going back to [32, 34], combine classical stochastic volatility with local volatility to achieve both a good fit to time series data and, in principle, a perfect calibration to the implied volatility smiles and skews. In these models, the discounted price process (S_t)_{t≥0} of an asset satisfies

$$ dS_t = S_t \, L(t, S_t) \, \alpha_t \, dW_t, \tag{1.3} $$

where (α_t)_{t≥0} is some stochastic process taking values in ℝ, L(t, s) is the so-called leverage function (a sufficiently regular function) depending on time and the current value of the asset, and W is a one-dimensional Brownian motion. Note that the stochastic volatility process α can be very general and could for instance be chosen as a rough volatility model. By slight abuse of terminology we call α stochastic volatility, even though strictly speaking it is not the volatility of the log price of S (we can, however, imagine L to be close to 1).
For notational simplicity we consider here the one-dimensional case, but the setup easily translates to a multivariate situation with several assets and a matrix valued analog of α as well as a matrix valued leverage function.
The leverage function L is the crucial part of this model. It allows, in principle, to perfectly calibrate the implied volatility surface seen on the market. In order to achieve this goal, L has to satisfy

$$ L^2(t, s) = \frac{\sigma^2_{\mathrm{Dup}}(t, s)}{\mathbb{E}[\alpha_t^2 \mid S_t = s]}, \tag{1.4} $$

where σ_Dup denotes Dupire's local volatility function (see [14]). For the derivation of (1.4), we refer to [21]. Note that (1.4) is an implicit equation for L, as L is needed for the computation of E[α_t² | S_t = s]. This in turn means that the stochastic differential equation (SDE) for the price process (S_t)_{t≥0} is actually a McKean-Vlasov SDE, since the law of S_t enters the characteristics of the equation. Existence and uniqueness results for this equation are not at all obvious, since the coefficients do not satisfy any kind of standard conditions, like for instance Lipschitz continuity in the Wasserstein space. Existence of a short-time solution of the associated non-linear Fokker-Planck equation for the density of (S_t)_{t≥0} was shown in [1] under certain regularity assumptions on the initial distribution. As stated in [20] (see also [30], where existence for a simplified version of an LSV model is proved), a very challenging and still open problem is to characterize the set of stochastic volatility parameters for which LSV models exist uniquely for a given market implied volatility surface. Despite these intriguing existence issues, LSV models have attracted, due to their appealing feature of a potentially perfect smile calibration and their econometric properties, a lot of attention from the calibration and implementation point of view. We refer to [20, 21, 10] for Monte Carlo methods, to [34, 40] for PDE methods based on non-linear Fokker-Planck equations, and to [38] for inverse problem techniques. Within these approaches the particle approximation method for the McKean-Vlasov SDE proposed in [20, 21] works impressively well, as very few paths have to be used in order to achieve very accurate calibration results.
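The fixed-point relation (1.4) underlies the particle method of [20, 21]: the conditional expectation E[α_t² | S_t = s] is estimated from simulated particles, e.g. by binning or kernel regression. The following toy sketch in Python illustrates the binning estimator with synthetic particles and a hypothetical constant Dupire volatility; all names and values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of the conditional expectation E[alpha_t^2 | S_t = s]
# in (1.4), estimated from simulated particles by binning.
# sigma_dup is a hypothetical (constant) Dupire local volatility.
N = 100_000
S = rng.lognormal(mean=0.0, sigma=0.2, size=N)      # particle prices S_t
alpha = 0.3 * np.exp(0.1 * rng.standard_normal(N))  # particle vols alpha_t
sigma_dup = 0.25                                    # assumed Dupire vol at (t, s)

# Bin particles by price and average alpha^2 within each bin.
bins = np.quantile(S, np.linspace(0.0, 1.0, 21))
idx = np.clip(np.digitize(S, bins) - 1, 0, 19)
cond_mean = np.array([np.mean(alpha[idx == k] ** 2) for k in range(20)])

# Leverage on the bin grid: L^2(t, s) = sigma_dup^2 / E[alpha^2 | S_t = s].
L2 = sigma_dup ** 2 / cond_mean
print(L2.shape)  # one leverage value per price bin
```

In the particle method this estimate is recomputed at every time step, so that the simulated particles and the leverage function are updated jointly.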
In the current paper we propose an alternative, fully data-driven approach, circumventing in particular the interpolation of the volatility surface, which is necessary in several other approaches in order to compute Dupire's local volatility. To achieve this, we learn or train the leverage function L so as to generate the available market option prices accurately. This approach allows, in principle, to take all kinds of options into account. In other words, we are not limited to plain vanilla European call options, in contrast to many existing methods. Setting T_0 = 0 and denoting by T_1 < T_2 < · · · < T_n the maturities of the available options, we parametrize the leverage function L(t, s) via a family of neural networks F_i: ℝ → ℝ with weights θ_i ∈ Θ_i, i.e.

$$ L(t, s \mid \theta) = 1 + F_i(s \mid \theta_i), \qquad t \in [T_{i-1}, T_i), \quad i = 1, \dots, n. $$
We here consider for simplicity only the univariate case. In the multivariate situation, 1 is simply replaced by the identity matrix and the neural networks become matrix valued. This is the way we parametrize the transport map G_θ introduced in Section 1.1. Indeed, this leads to so-called neural SDEs (see [18] for related work), which, in the case of time-inhomogeneous Itô SDEs, just means parametrizing the drift μ(·, · | θ) and volatility σ(·, · | θ) by neural networks with parameters θ. In our case there is no drift, and the volatility function (for the price) reads as

$$ \sigma(t, s \mid \theta) = s \, \alpha_t \, L(t, s \mid \theta). $$

The solution measure of the neural SDE corresponds to the transport G_θ∗P_I, where P_I is the law of (W, α).
Progressively, for each maturity, the weights of the neural networks are learned by optimizing the calibration criterion (1.1), where we allow for more general loss functions than just the square, i.e.

$$ \inf_{\theta} \sup_{\gamma} \sum_{j} w^{\gamma}_{j} \, \ell^{\gamma}\big( \pi^{\mathrm{mod}}_j(\theta) - \pi^{\mathrm{mkt}}_j \big). \tag{1.5} $$

We here write π^mod_j(θ) and π^mkt_j for the model and market option prices, which correspond exactly to expected values under G_θ∗P_I and P_O, respectively. Moreover, for every fixed γ, ℓ^γ is a nonlinear, nonnegative, convex function with ℓ^γ(0) = 0 and ℓ^γ(x) > 0 for x ≠ 0, measuring the distance between model and market prices. The parameters w^γ_j, for fixed γ, denote some weights, e.g. of vega type (compare [9]), which allow us to match implied volatility data rather than pure prices, our actual goal, very well.
Notice that, as is somehow quite typical for financial applications, we need to guarantee a very high accuracy of approximation, whence a variance reduction technique for approximating the model prices is crucial for this learning task. Notice also that, due to the general structure of the adversary, no diversification effects can be expected: we have to deal with nonlinear functions of expectations with respect to P_I.
The precise algorithm is outlined in Sections 3 and 4, where we also conduct a thorough statistical performance analysis. The implementation relies strongly on a variance reduction technique based on hedging and deep hedging: it allows to compute accurate model prices π^mod(θ) for training purposes with only up to 5 · 10⁴ trajectories. Let us remark that we do not aim to compete with existing algorithms, e.g. the particle method of [20, 21], in terms of speed, but rather to provide a generic data-driven algorithm that is universally applicable to all kinds of options, also in multivariate situations, without resorting to Dupire-type volatilities.
The remainder of the article is organized as follows. Section 2 introduces the variance reduction technique based on hedge control variates, which is crucial in our optimization tasks. In Section 3 we explain our calibration method, in particular how to optimize (1.5). The details of the numerical implementation and the results of the statistical performance analysis are then given in Section 4 as well as Section 6. In Appendix A we state stability theorems for stochastic differential equations depending on parameters; these are applied to neural SDEs when calculating derivatives with respect to the parameters of the neural networks. In Appendix B we recall preliminaries on deep learning, giving a brief overview of universal approximation properties of artificial neural networks and briefly explaining stochastic gradient descent. Finally, Appendix C contains alternative optimization approaches to (1.5).

Variance reduction for pricing and calibration via hedging and deep hedging
This section is dedicated to introducing a generic variance reduction technique for Monte Carlo pricing and calibration, using hedging portfolios as control variates. This method will be crucial in our LSV calibration presented in Section 3. For similar considerations we refer to [41].
Consider, on a finite time horizon T > 0, a financial market in discounted terms with r traded instruments (Z_t)_{t∈[0,T]} following an ℝ^r-valued stochastic process on some filtered probability space (Ω, F, (F_t)_{t∈[0,T]}, Q). Here, Q is a risk-neutral measure and (F_t)_{t∈[0,T]} is supposed to be right continuous. In particular, we suppose that (Z_t)_{t∈[0,T]} is an r-dimensional square integrable martingale with càdlàg paths.
Let C be an F_T-measurable random variable describing the payoff of some European option at maturity T > 0. Then the usual Monte Carlo estimator for the price of this option is given by

$$ \widehat{\pi} = \frac{1}{N} \sum_{k=1}^{N} C_k, $$

where (C_1, …, C_N) are i.i.d. with the same distribution as C and N ∈ ℕ. This estimator can easily be modified by adding a stochastic integral with respect to Z. Indeed, consider a strategy (h_t)_{t∈[0,T]} ∈ L²(Z) and some constant c. Denote the stochastic integral with respect to Z by I = (h • Z)_T and consider the estimator

$$ \widehat{\pi} = \frac{1}{N} \sum_{k=1}^{N} \big( C_k - c \, I_k \big), $$

where (I_1, …, I_N) are i.i.d. with the same distribution as I. Then, for any (h_t)_{t∈[0,T]} ∈ L²(Z) and c, this estimator is still an unbiased estimator of the price of the option with payoff C, since the expected value of the stochastic integral vanishes. If we denote by π a generic sample of the payoff C and set H = I, then the variance of the estimator is given by

$$ \operatorname{Var}(\widehat{\pi}) = \frac{1}{N} \big( \operatorname{Var}(\pi) + c^2 \operatorname{Var}(H) - 2c \operatorname{Cov}(\pi, H) \big). $$

This becomes minimal by choosing

$$ c = \frac{\operatorname{Cov}(\pi, H)}{\operatorname{Var}(H)}. $$
With this choice, we have

$$ \operatorname{Var}(\widehat{\pi}) = \frac{1}{N} \big( 1 - \operatorname{Corr}^2(\pi, H) \big) \operatorname{Var}(\pi). $$

In particular, in the case of a perfect (pathwise) hedge we have Corr(π, H) = 1 and the estimator π̂ has zero variance, since in this case

$$ C = \mathbb{E}[C] + (h \bullet Z)_T \quad \text{almost surely}. $$

Therefore it is crucial to find a good approximate hedging portfolio such that Corr²(π, H) becomes large. This is the subject of Sections 2.1 and 2.2 below.
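The variance-optimal coefficient c = Cov(π, H)/Var(H) can itself be estimated from the simulated samples. The following one-step toy sketch (the payoff, the mean-zero control and all names are illustrative assumptions, not the paper's setup) shows the construction and the resulting variance reduction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Control variate sketch: payoff samples C_k, a mean-zero control H_k
# (hedge P&L), and the variance-optimal coefficient c* = Cov(C, H)/Var(H).
N = 50_000
Z = rng.standard_normal(N)
C = np.maximum(np.exp(0.2 * Z - 0.02) - 1.0, 0.0)  # toy call payoff samples
H = np.exp(0.2 * Z - 0.02) - 1.0                   # mean-zero control, E[H] = 0

c_star = np.cov(C, H)[0, 1] / np.var(H)
plain = C
hedged = C - c_star * H                            # unbiased, lower variance

print(np.var(hedged) < np.var(plain))  # True: variance is reduced
```

Both estimators have the same expectation; only the variance changes, exactly as in the formula above.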
2.1. Black-Scholes Delta hedge. In many cases of local stochastic volatility models of the form (1.3) and options depending only on the terminal value of the price process, a Delta hedge computed in the Black-Scholes model works well. Indeed, let C = g(S_T) and let π^g_BS(t, T, s, σ) be the price at time t of this claim in the Black-Scholes model. Here, s stands for the price variable and σ for the volatility parameter in the Black-Scholes model; moreover, we indicate the dependency on the maturity T as well. Then choosing as hedging instrument only the price S itself and as approximate hedging strategy

$$ h_t = \partial_s \pi^{g}_{\mathrm{BS}}\big(t, T, S_t, \alpha_t L(t, S_t)\big) $$

usually already yields a considerable variance reduction. In fact, it is often even sufficient to consider α_t alone to achieve satisfying results, i.e., to take

$$ h_t = \partial_s \pi^{g}_{\mathrm{BS}}\big(t, T, S_t, \alpha_t\big). \tag{2.4} $$

This reduces the computational costs for the evaluation of the hedging strategies even further.
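As a sketch of this Delta-hedge control variate, the following toy example simulates Black-Scholes paths (standing in for the LSV dynamics (1.3), so the hedge here is nearly exact) and accumulates the hedge P&L along each path; all parameter values are illustrative assumptions.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def bs_delta(s, k, tau, sigma):
    # Black-Scholes call delta with zero rates: Phi(d1).
    d1 = (np.log(s / k) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau))
    return 0.5 * (1.0 + _erf(d1 / np.sqrt(2.0)))

rng = np.random.default_rng(2)
N, M, T, sigma, K = 20_000, 50, 1.0, 0.3, 1.0
dt = T / M
S = np.full(N, 1.0)
I = np.zeros(N)                      # accumulated hedge P&L (h . S)_T
for m in range(M):
    tau = T - m * dt                 # time to maturity, always > 0 here
    h = bs_delta(S, K, tau, sigma)   # Delta evaluated along the path, cf. (2.4)
    dS = sigma * S * rng.standard_normal(N) * np.sqrt(dt)
    I += h * dS
    S += dS
payoff = np.maximum(S - K, 0.0)

print(np.var(payoff - I) < np.var(payoff))  # True: much smaller variance
```

In the LSV setting the same loop runs over paths of (1.3), with the Black-Scholes delta evaluated at the current spot and volatility.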

2.2. Hedging strategies as neural networks - deep hedging.
Alternatively, in particular when the number of hedging instruments becomes higher, one can learn the hedging strategy by parametrizing it via neural networks.For a brief overview on neural networks and relevant notation used below, we refer to Appendix B.
Let the payoff again be a function of the terminal values of the hedging instruments, i.e., C = g(Z_T). Then, in Markovian models, it makes sense to specify the hedging strategy via a function of time and the current state, which in turn corresponds to an artificial neural network (t, z) → h(t, z | δ) ∈ NN_{r+1,r} with weights denoted by δ in some parameter space ∆ (see Notation B.4). Following the approach in [5, Remark 3], an optimal hedge for the claim C with given market price π^mkt can be computed via

$$ \inf_{\delta \in \Delta} \mathbb{E}\Big[ u\big( \pi^{\mathrm{mkt}} + (h(\cdot, \cdot \mid \delta) \bullet Z)_T - C \big) \Big] $$

for some convex loss function u: ℝ → ℝ₊. If u(x) = x², which is often used in practice, this corresponds to a quadratic hedging criterion.
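To illustrate the quadratic criterion, the following sketch replaces the neural network h(t, z | δ) by a strategy linear in two simple features, for which the quadratic hedging problem reduces to least squares; the one-asset toy market and all names are assumptions for illustration only, not the paper's deep-hedging implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Quadratic hedging sketch: minimize E[(pi_mkt + (h . Z)_T - C)^2] over
# strategies h(t, z | delta) that are linear in the features (1, z), so the
# problem becomes ordinary least squares instead of SGD over a network.
N, M = 40_000, 10
Z = np.full(N, 1.0)
F = np.zeros((N, 2))                  # accumulated feature-weighted increments
for m in range(M):
    dZ = 0.2 * Z * rng.standard_normal(N) / np.sqrt(M)  # martingale increments
    F[:, 0] += dZ                     # strategy component: constant in z
    F[:, 1] += Z * dZ                 # strategy component: linear in z
    Z = Z + dZ
C = np.maximum(Z - 1.0, 0.0)          # claim C = g(Z_T)
pi_mkt = C.mean()                     # stand-in market price

delta, *_ = np.linalg.lstsq(F, C - pi_mkt, rcond=None)
residual = pi_mkt + F @ delta - C

print(np.var(residual) < np.var(C))   # hedging shrinks the quadratic error
```

With a neural network parametrization the same objective is minimized by stochastic gradient descent over δ, as described next.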
In order to tackle this optimization problem we can apply stochastic gradient descent, because we fall in the realm of problem (B.1); denote the corresponding stochastic objective function by Q(δ). The optimal hedging strategy h(·, · | δ*) for an optimizer δ* can then be used to define the control variate. As always in this article, we shall assume that the activation functions σ of the neural network as well as the convex loss function u are smooth; hence we can calculate derivatives with respect to δ in a straightforward way. This is important for applying stochastic gradient descent, see Section B.2. We shall show that the gradient of Q(δ) is given by

$$ \nabla_\delta Q(\delta) = \mathbb{E}\Big[ u'\big( \pi^{\mathrm{mkt}} + (h(\cdot, \cdot \mid \delta) \bullet Z)_T - C \big) \, \big( \nabla_\delta h(\cdot, \cdot \mid \delta) \bullet Z \big)_T \Big], $$

i.e. we are allowed to move the gradient inside the stochastic integral, and that approximations with simple processes, as we use in practice, converge to the correct quantities. To ensure this property, we shall apply the following theorem, which follows directly from results in Section A.
Theorem 2.1. For ε ≥ 0, let Z^ε be a solution of a stochastic differential equation as described in Theorem A.3, with drivers Y = (Y¹, …, Y^d), functionally Lipschitz operators F^{ε,i}_j, i = 1, …, r, j = 1, …, d, and a process (J^{ε,1}, …, J^{ε,r}), which here, for all ε ≥ 0, is simply J 1_{{t=0}}(t) for some constant vector J ∈ ℝ^r. Let (ε, t, z) → f^ε(t, z) be a map such that the associated bounded càglàd integrand processes converge in ucp as ε → 0.

Proof. Consider the extended system consisting of the equation for Z^ε together with the stochastic integral in question. We obtain existence, uniqueness and stability for the second equation by Theorem A.3, and from it ucp convergence of the integrand of the first equation: since stochastic integration is continuous with respect to the ucp topology, we obtain the result.
The following corollary establishes the announced properties: we may move the gradient inside the stochastic integral, and the derivatives of a discretized integral, with a discretized version of Z and approximations of the hedging strategies, are close to the derivatives of the limit object.
Corollary 2.2. Let, for ε > 0, Z^ε denote a discretization of the process of hedging instruments Z ≡ Z⁰ such that the conditions of Theorem 2.1 are satisfied. Denote, for ε ≥ 0, the corresponding hedging strategies by (t, z, δ) → h^ε(t, z | δ), given by neural networks in NN_{r+1,r} whose activation functions are bounded and C¹ with bounded derivatives.
Proof. To prove (i), we apply Theorem 2.1: by the neural network assumptions, the required bounds hold (with the supremum taken over some compact set). Concerning (ii), we apply again Theorem 2.1, this time with the discretized strategies.

Calibration of LSV models
Consider an LSV model as of (1.3), defined on some filtered probability space (Ω, F, (F_t)_{t∈[0,T]}, Q), where Q is a risk-neutral measure. We assume the stochastic process α to be fixed. This can for instance be achieved by first calibrating the pure stochastic volatility model with L ≡ 1 and fixing the corresponding parameters.
Our main goal is to determine the leverage function L in perfect accordance with market data. We here consider only European call options, but our approach allows, in principle, to take all kinds of other options into account.
Due to the universal approximation properties outlined in Appendix B (Theorem B.3), and in the spirit of neural SDEs, we choose to parametrize L via neural networks. More precisely, set T_0 = 0 and let 0 < T_1 < · · · < T_n = T denote the maturities of the available European call options to which we aim to calibrate the LSV model. We then specify the leverage function L(t, s) via a family of neural networks, i.e.,

$$ L(t, s \mid \theta) = 1 + F_i(s \mid \theta_i), \qquad t \in [T_{i-1}, T_i), \tag{3.1} $$

where F_i ∈ NN_{1,1} for i = 1, …, n (see Notation B.4). For notational simplicity we shall often omit the dependence on θ_i ∈ Θ_i. However, when needed we write for instance S_t(θ), where θ then stands for all parameters θ_i used up to time t.
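A minimal sketch of the parametrization (3.1), with one small feed-forward network per maturity interval; the network sizes and the tanh activation are illustrative choices, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(4)

# Leverage parametrization (3.1): one network F_i per interval [T_{i-1}, T_i),
# and L(t, s | theta) = 1 + F_i(s | theta_i). Times t >= T_n are not handled.
maturities = [0.25, 0.5, 1.0]          # T_1 < T_2 < T_3

def make_net(width=16):
    # One-hidden-layer network R -> R with small random initial weights.
    return {
        "W1": 0.1 * rng.standard_normal((1, width)),
        "b1": np.zeros(width),
        "W2": 0.1 * rng.standard_normal((width, 1)),
        "b2": np.zeros(1),
    }

nets = [make_net() for _ in maturities]

def leverage(t, s):
    # Pick the network for the interval containing t, then L = 1 + F_i(s).
    i = next(k for k, T in enumerate(maturities) if t < T)
    p = nets[i]
    h = np.tanh(np.atleast_2d(s).T @ p["W1"] + p["b1"])
    return 1.0 + (h @ p["W2"] + p["b2"]).ravel()

print(leverage(0.3, np.array([0.9, 1.0, 1.1])).shape)  # (3,)
```

Initializing the networks near zero makes L ≈ 1 at the start of training, i.e. the model starts close to the pure stochastic volatility model.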
For purposes of training, similarly as in Section 2.2, we shall need to calculate derivatives of the LSV process with respect to θ.

Theorem 3.1. Let (t, s, θ) → L(t, s | θ) be of the form (3.1), where the neural networks (s, θ_i) → F_i(s | θ_i) are bounded and C¹ with bounded and Lipschitz continuous derivatives, for all i = 1, …, n. Then the directional derivative ∂_θ S_t(θ) in direction θ̃ satisfies the linear equation

$$ d\,\partial_\theta S_t(\theta) = \alpha_t \Big( \partial_s \big( s\, L(t, s \mid \theta) \big)\big|_{s = S_t(\theta)} \, \partial_\theta S_t(\theta) + S_t(\theta)\, \nabla_\theta L\big(t, S_t(\theta) \mid \theta\big) \cdot \tilde{\theta} \Big) \, dW_t \tag{3.2} $$

with initial value 0. This can be solved by variation of constants and leads to well-known backward propagation schemes.
Proof. First note that Theorem A.2 implies the existence and uniqueness of S(θ) for every θ. Here, the driving process is one-dimensional and given by Y = W, and the coefficient operator is functionally Lipschitz, so that Theorem A.2 implies the assertion. These conditions are implied by the form of L(t, s | θ) and the conditions on the neural networks F_i.
To prove the form of the derivative process we apply Theorem A.3. In the terminology of Theorem A.3, the relevant coefficient converges in ucp to its limit, whence ucp convergence of the first term in (3.3). The convergence of terms two and three is clear. The last one follows again from the fact that the family {s → ∇_θ L(t, s | θ + εθ̃) | ε ∈ [0, 1]} is equicontinuous, which is again a consequence of the form of the neural networks.
By the assumptions on the derivatives, F^{0,3} is functionally Lipschitz. Hence Theorem A.2 yields the existence of a unique solution to (3.2) and Theorem A.3 implies convergence.

Remark 3.2. (i) For the pure existence and uniqueness of S(θ), with L(t, s | θ) of the form (3.1), it suffices that the neural networks s → F_i(s | θ_i) are bounded and Lipschitz, for all i = 1, …, n (see also Remark A.4). (ii) Similarly as in Theorem 2.1, we can also consider a discretization S^ε(θ) and conclude analogously the form of its derivative process. If the corresponding coefficients converge in ucp, then so does the derivative process.
Theorem 3.1 guarantees the existence and uniqueness of the derivative process, which allows us to set up gradient-based search algorithms for training.
In view of this let us now come to the precise optimization task as already outlined in Section 1.2.
To ease notation we shall here omit the dependence of the weights w and the loss function ℓ on the parameter γ. For each maturity T_i, we assume J_i options with strikes K_{ij}, j ∈ {1, …, J_i}, to be given. The calibration functional for the i-th maturity is then of the form

$$ \sum_{j=1}^{J_i} w_{ij} \, \ell\big( \pi^{\mathrm{mod}}_{ij}(\theta_i) - \pi^{\mathrm{mkt}}_{ij} \big), \tag{3.4} $$

where π^mod_{ij}(θ_i) (respectively π^mkt_{ij}) denotes the model (respectively market) price of an option with maturity T_i and strike K_{ij}. Moreover, ℓ: ℝ → ℝ₊ is some nonnegative, nonlinear, convex loss function (e.g. the square or the absolute value) with ℓ(0) = 0 and ℓ(x) > 0 for x ≠ 0, measuring the distance between market and model prices. Finally, the w_{ij} denote some weights, e.g. of vega type (compare [9]), which allow us to match implied volatility data rather than pure prices, our actual goal, very well. Notice that we allow the weights to be trained and adapted during a run of our algorithm.
We solve the minimization problems (3.4) iteratively: we start with maturity T_1 and fix θ_1. This then enters the computation of π^mod_{2j}(θ_2), and thus (3.4) for maturity T_2, etc. To simplify notation in the sequel, we shall therefore drop the index i, so that for a generic maturity T > 0 and J options, (3.4) becomes

$$ \sum_{j=1}^{J} w_j \, \ell\big( \pi^{\mathrm{mod}}_j(\theta) - \pi^{\mathrm{mkt}}_j \big). $$

Since the model prices are given by

$$ \pi^{\mathrm{mod}}_j(\theta) = \mathbb{E}\big[ (S_T(\theta) - K_j)^+ \big], \tag{3.5} $$

the calibration task amounts to finding a minimizer of

$$ f(\theta) = \sum_{j=1}^{J} w_j \, \ell\Big( \mathbb{E}\big[ (S_T(\theta) - K_j)^+ \big] - \pi^{\mathrm{mkt}}_j \Big). \tag{3.7} $$

As ℓ is a nonlinear function, this is not of the expected-value form of problem (B.1). Hence standard stochastic gradient descent, as outlined in Appendix B.2, cannot be applied in a straightforward manner.
We shall tackle this problem via hedge control variates as introduced in Section 2. In the following we explain this in more detail.

3.1. Minimizing the calibration functional. Consider the standard Monte Carlo estimator for π^mod_j(θ),

$$ \widehat{\pi}^{\mathrm{mod}}_j(\theta) = \frac{1}{N} \sum_{n=1}^{N} Q_j(\theta)(\omega_n), \qquad Q_j(\theta)(\omega) := \big( S_T(\theta)(\omega) - K_j \big)^+, $$

for i.i.d. samples {ω_1, …, ω_N} ∈ Ω, so that the Monte Carlo version of (3.7) reads

$$ \widehat{f}(\theta) = \sum_{j=1}^{J} w_j \, \ell\Big( \frac{1}{N} \sum_{n=1}^{N} Q_j(\theta)(\omega_n) - \pi^{\mathrm{mkt}}_j \Big). \tag{3.8} $$

Since the Monte Carlo error decreases as 1/√N, the number of simulations N has to be chosen large (≈ 10⁸) in order to approximate the true model prices in (3.5) well. Note that implied volatility, to which we actually aim to calibrate, is even more sensitive. As stochastic gradient descent is not directly applicable due to the nonlinearity of ℓ, it seems necessary at first sight to compute the gradient of the whole function f̂(θ) in order to minimize (3.8). With N ≈ 10⁸ this is, however, computationally very expensive, leads to numerical instabilities, and does not allow to find a minimum in the high-dimensional parameter space Θ in a reasonable amount of time.
One very expedient remedy is to apply hedge control variates, as introduced in Section 2, as a variance reduction technique. This allows to reduce the number of samples N in the Monte Carlo estimator considerably, to only up to 5 · 10⁴ sample paths.
Assume that we have r hedging instruments (including the price process S), denoted by (Z_t)_{t∈[0,T]}, which are square integrable martingales under Q taking values in ℝ^r. Consider, for j = 1, …, J, strategies h_j ∈ L²(Z) and define the hedged payoffs

$$ X_j(\theta)(\omega) := \big( S_T(\theta)(\omega) - K_j \big)^+ - (h_j \bullet Z)_T(\omega). $$

The calibration functionals (3.7) and (3.8) can then simply be modified by replacing Q_j(θ)(ω) by X_j(θ)(ω), so that we end up minimizing

$$ \widehat{f}(\theta) = \sum_{j=1}^{J} w_j \, \ell\Big( \frac{1}{N} \sum_{n=1}^{N} X_j(\theta)(\omega_n) - \pi^{\mathrm{mkt}}_j \Big). \tag{3.10} $$

To tackle this task, we apply the following variant of gradient descent: starting with an initial guess θ^(0), we iteratively compute

$$ \theta^{(k+1)} = \theta^{(k)} - \eta_k \, \nabla_\theta \widehat{f}\big( \theta^{(k)} \big) $$

for some learning rate η_k and i.i.d. samples (ω_1, …, ω_N). These samples can either be chosen to be the same in each iteration or be newly sampled in each update step. The difference between these two approaches is negligible, since N is chosen so as to yield a small Monte Carlo error, whence the gradient is nearly deterministic. In our numerical experiments we sample anew in each update step.
Concerning the choice of the hedging strategies, we can parametrize them as in Section 2.2 via neural networks and find the optimal weights δ by minimizing

$$ \frac{1}{N} \sum_{n=1}^{N} u\Big( \pi^{\mathrm{mkt}}_j + \big( h_j(\cdot, \cdot \mid \delta) \bullet Z \big)_T(\omega_n) - \big( S_T(\theta)(\omega_n) - K_j \big)^+ \Big) \tag{3.12} $$

over i.i.d. samples {ω_1, …, ω_N} ∈ Ω and some loss function u, with θ fixed. This means iterating the two optimization procedures, i.e. minimizing (3.10) for θ (with fixed δ) and (3.12) for δ (with fixed θ). Clearly the Black-Scholes hedge ansatz of Section 2.1 works as well, in this case without any additional optimization with respect to the hedging strategies.
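A structural sketch of the gradient-descent core of this scheme: a two-parameter leverage function stands in for the neural networks, gradients are taken by finite differences instead of backpropagation, the hedge control variate is omitted for brevity, and the target prices and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Gradient descent on f(theta) = sum_j w_j * (pihat_j(theta) - pi_mkt_j)^2,
# with model prices estimated by Monte Carlo under fixed random numbers.
# Hypothetical leverage: L(s | theta) = 1 + theta[0] + theta[1] * (s - 1).
strikes = np.array([0.9, 1.0, 1.1])
pi_mkt = np.array([0.12, 0.06, 0.03])    # illustrative target prices
w = np.ones_like(strikes)
N, M, T, alpha = 20_000, 25, 0.5, 0.3
dW = rng.standard_normal((N, M)) * np.sqrt(T / M)  # common random numbers

def model_prices(theta):
    S = np.full(N, 1.0)
    for m in range(M):
        L = 1.0 + theta[0] + theta[1] * (S - 1.0)
        S = S + S * alpha * L * dW[:, m]
    return np.array([np.maximum(S - K, 0.0).mean() for K in strikes])

def f(theta):
    return np.sum(w * (model_prices(theta) - pi_mkt) ** 2)

theta = np.zeros(2)
for k in range(30):
    grad = np.array([(f(theta + 1e-4 * e) - f(theta - 1e-4 * e)) / 2e-4
                     for e in np.eye(2)])
    theta -= 5.0 * grad                  # fixed learning rate eta

print(f(theta) < f(np.zeros(2)))         # calibration error decreased
```

In the full method, θ is the collection of network weights, the gradient comes from backpropagation through the discretized SDE, and the payoffs inside the Monte Carlo average are replaced by the hedged payoffs X_j(θ).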
For alternative approaches to minimizing (3.7), we refer to Appendix C.

Numerical Implementation
In this section we discuss the numerical implementation of the proposed calibration method.
We implement our approach in TensorFlow, taking advantage of GPU-accelerated computing. All computations are performed on a single-GPU NVIDIA GeForce RTX 2080 machine. For the implied volatility computations, we rely on the Python py_vollib library. Recall that an LSV model is given on some filtered probability space (Ω, F, (F_t)_{t∈[0,T]}, Q) by

$$ dS_t = S_t \, \alpha_t \, L(t, S_t) \, dW_t, \qquad S_0 > 0, $$

for some stochastic process α. When calibrating to data, it is therefore necessary to make further specifications. We calibrate the following SABR-type LSV model, with parameters ν > 0, ρ ∈ [−1, 1] and initial values α_0 > 0, S_0 > 0; here, B and W are two correlated Brownian motions.

Remark 4.2. We shall often work in log-price coordinates for S. In particular, we can then consider L as a function of X := log S rather than S. Denoting this parametrization again by L, we therefore have L(t, X) instead of L(t, S), and the model dynamics read as

$$ dX_t = -\tfrac{1}{2} \alpha_t^2 L^2(t, X_t)\, dt + \alpha_t L(t, X_t)\, dW_t. $$

Definition 4.1. The SABR-LSV model is specified via the SDE

$$ dS_t = S_t \, \alpha_t \, L(t, S_t) \, dW_t, \qquad d\alpha_t = \nu \, \alpha_t \, dB_t, \qquad d\langle W, B \rangle_t = \rho \, dt, $$

with parameter ν > 0, correlation ρ ∈ [−1, 1] and initial values α_0 > 0, S_0 > 0, where B and W are two correlated Brownian motions.
Note that α is a geometric Brownian motion; in particular, its closed-form solution is available and given by

$$ \alpha_t = \alpha_0 \exp\big( \nu B_t - \tfrac{1}{2}\nu^2 t \big). $$

For the rest of the paper we set S_0 = 1.
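A minimal simulation sketch of this model: Euler steps for S combined with exact simulation of α via its closed form. The trivial leverage L ≡ 1 is a placeholder for the trained networks, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# SABR-LSV simulation: Euler scheme for S, exact GBM formula for alpha,
#   alpha_t = alpha0 * exp(nu * B_t - 0.5 * nu^2 * t),
# with corr(W, B) = rho.
N, M, T = 10_000, 100, 1.0
nu, rho, alpha0, S0 = 0.5, -0.7, 0.3, 1.0
dt = T / M

L = lambda t, s: np.ones_like(s)       # placeholder for the neural leverage

S = np.full(N, S0)
B = np.zeros(N)
for m in range(M):
    t = m * dt
    alpha = alpha0 * np.exp(nu * B - 0.5 * nu**2 * t)
    dB = rng.standard_normal(N) * np.sqrt(dt)
    dW = rho * dB + np.sqrt(1 - rho**2) * rng.standard_normal(N) * np.sqrt(dt)
    S = S + S * alpha * L(t, S) * dW
    B = B + dB

print(abs(S.mean() - S0) < 0.02)       # martingale check, up to MC error
```

In the calibration the placeholder L is replaced by the per-maturity networks of (3.1), and the same paths feed both the option payoffs and the hedge control variates.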

4.1. Implementation of the calibration method.
We now present a proper numerical test and demonstrate the effectiveness of our approach on a family of typical market smiles (instead of just one calibration example). We consider as ground truth a situation where market smiles are produced by a parametric family. By randomly sampling smiles from this family, we then show that they can be calibrated up to small errors, which we analyze statistically.
4.1.1. Ground truth assumption. We start by specifying the ground truth assumption. It is known that a discrete set of prices can be exactly calibrated by a local volatility model using Dupire's volatility function, if an appropriate interpolation method is chosen. Hence, any market-observed smile data can be reproduced by the following model (we assume a zero riskless rate and define X = log(S)):

$$ dS_t = \sigma_{\mathrm{Dup}}(t, X_t) S_t \, dW_t, \quad \text{or equivalently} \quad dX_t = -\tfrac{1}{2}\sigma^2_{\mathrm{Dup}}(t, X_t)\, dt + \sigma_{\mathrm{Dup}}(t, X_t)\, dW_t, \tag{4.1} $$

where σ_Dup denotes Dupire's local volatility function [14]. Our ground truth assumption consists in supposing that the function σ_Dup (or, to be more precise, σ²_Dup) can be chosen from a parametric family. Such parametric families for local volatility models have been discussed in the literature, see e.g. [8] or [7]. In the latter, the authors introduce a family of local volatility functions a_ξ indexed by parameters ξ = (p_1, p_2, σ_0, σ_1, σ_2) and p_0 = 1 − (p_1 + p_2), subject to suitable constraints.
In Figure 1(a) we show plots of implied volatilities for different slices for a realistic choice of parameters. As one can see, the produced smiles seem to be unrealistically flat. Hence we modify the local volatility function a_ξ to produce more pronounced and more realistic smiles. To be precise, we define a new family of local volatility functions a_ξ indexed by the set of parameters ξ. We fix the choice of the parameters γ_i, β_i, λ_i, κ as given in Table 1.

1.1  20  10  10  0.005  0.001  0.5
Table 1. Fixed parameters for the ground truth assumption a²_ξ.

Note that a²_ξ is not defined at t = 0. When doing a Monte Carlo simulation, we simply replace a²_ξ(0, x) with a²_ξ(∆_t, x), where ∆_t is the time increment of the Monte Carlo simulation.
What is left to be specified are the parameters ξ = (p_1, p_2, σ_0, σ_1, σ_2), with p_0 = 1 − p_1 − p_2. This motivates our statistical test for the performance evaluation of our method. To be precise, our ground truth assumption is that all observable market prices are explained by a variation of the parameters ξ. For illustration, we plot implied volatilities for this modified local volatility function in Figure 1(b).

4.1.2. Performance test. We now come to the evaluation of our proposed method. We want to calibrate the SABR-LSV model to market prices generated by the previously formulated ground truth assumption. This corresponds to randomly sampling the parameter ξ of the local volatility function a_ξ and computing prices according to (4.3). Calibrating the SABR-LSV model, i.e. finding the parameters ν, ρ, the initial volatility α_0 and the unknown leverage function L, to these prices and repeating this multiple times then allows for a statistical analysis of the errors.
As explained in Section 3, we consider European call options with maturities T_1 < ... < T_n and denote the strikes for a given maturity T_i by K_ij, j ∈ {1, ..., J_i}. To compute the ground truth prices for these European calls we use an Euler discretization of (4.3) with time step ∆_t = 1/100. Prices are then obtained by a variance reduced Monte Carlo estimator using 10^7 Brownian paths and a Black-Scholes Delta hedge variance reduction as described previously. For a given parameter set ξ, we use the same Brownian paths for all strikes and maturities.
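The variance reduction just described can be sketched as follows. For simplicity, a constant hedge volatility sigma_hedge is plugged into the Black-Scholes Delta, whereas the paper uses the running instantaneous volatility; the function names are ours, not the paper's.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF, vectorized via math.erf
    return 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def bs_delta(S, K, sigma, tau):
    """Black-Scholes call Delta with zero rates."""
    tau = np.maximum(tau, 1e-12)
    d1 = (np.log(S / K) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau))
    return norm_cdf(d1)

def hedged_mc_price(S_paths, K, T, sigma_hedge):
    """Variance reduced estimator: terminal payoff minus the stochastic
    integral of a Black-Scholes Delta hedge along each path.
    S_paths has shape (n_paths, n_steps + 1) on a uniform time grid."""
    n_paths, n_grid = S_paths.shape
    dt = T / (n_grid - 1)
    payoff = np.maximum(S_paths[:, -1] - K, 0.0)
    hedge = np.zeros(n_paths)
    for j in range(n_grid - 1):
        tau = T - j * dt  # remaining time to maturity
        delta = bs_delta(S_paths[:, j], K, sigma_hedge, tau)
        hedge += delta * (S_paths[:, j + 1] - S_paths[:, j])
    estimator = payoff - hedge  # the hedge term has zero expectation
    return estimator.mean(), estimator.std() / np.sqrt(n_paths)
```

Since the hedge integral has zero expectation, the estimator is unbiased; when the hedge volatility is close to the realized volatility, its variance collapses up to discretization error.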
Overall, in this test, we consider n = 4 maturities with J_i = 20 strike prices for all i = 1, ..., 4. The values for T_i are given in Figure 2(a). For the strikes K_ij we choose evenly spaced points; the smallest and largest strikes per maturity are determined by the values of k_i given in Figure 2(b).
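Since the exact strike rule and the values of k_i are only given in Figure 2(b), the following sketch merely illustrates the construction with hypothetical bounds, symmetric around S_0; both the k_i values and the symmetry are assumptions for illustration.

```python
import numpy as np

# Hypothetical per-maturity half-widths k_i (illustrative placeholders);
# strikes are 20 evenly spaced points between S0*(1 - k_i) and S0*(1 + k_i)
S0 = 1.0
k = {0.25: 0.10, 0.5: 0.15, 0.75: 0.20, 1.0: 0.25}
strikes = {T: np.linspace(S0 * (1 - ki), S0 * (1 + ki), 20) for T, ki in k.items()}
```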
We now specify a distribution under which we draw the parameters for our test. The components are all drawn independently from each other under the uniform distribution on the respective intervals given below.
- When necessary we adjust p_2 so that 1 − p_1 − p_2 ≥ 0 is satisfied.
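The sampling step can be sketched as follows; the actual sampling intervals are not reproduced here, so the bounds below are purely illustrative placeholders.

```python
import numpy as np

# Hypothetical sampling intervals (illustrative only, not the paper's)
BOUNDS = {"p1": (0.0, 0.4), "p2": (0.0, 0.4),
          "sigma0": (0.1, 0.3), "sigma1": (0.1, 0.3), "sigma2": (0.1, 0.3)}

def sample_xi(rng):
    """Draw one parameter set, each component uniform on its interval."""
    xi = {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}
    # Adjust p2 if necessary so that p0 = 1 - p1 - p2 >= 0
    if xi["p1"] + xi["p2"] > 1.0:
        xi["p2"] = 1.0 - xi["p1"]
    xi["p0"] = 1.0 - xi["p1"] - xi["p2"]
    return xi

rng = np.random.default_rng(42)
samples = [sample_xi(rng) for _ in range(150)]
```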
We can now generate data by the following scheme.
• For each m, compute prices of European calls for maturities T_i and strikes K_ij for i = 1, ..., n = 4 and j = 1, ..., 20 according to (4.3) using 10^7 Brownian trajectories (for each m we use new trajectories).
• Store these prices.
The second part consists in calibrating each of these surfaces and storing pertinent values, for which we conduct a statistical analysis. In the following we describe the procedure in detail. Recall that we specify the leverage function L(t, x) via a family of neural networks F_i ∈ N N_{1,1} (see Notation B.4). Each F_i is specified as a 3-hidden-layer feed forward network where the dimension of each of the hidden layers is 50. As activation function we choose σ = tanh. As before, we denote the parameters of F_i by θ_i and the corresponding parameter space by Θ_i.
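This network architecture can be sketched in TensorFlow as follows; the precise way the four networks F_i combine into the leverage function L is part of the specification above and omitted here, so the sketch only constructs the networks.

```python
import tensorflow as tf

def make_leverage_net():
    """One network F_i: a feed forward net with three hidden layers of
    width 50 and tanh activations, mapping the log-price to a scalar."""
    return tf.keras.Sequential(
        [tf.keras.layers.Dense(50, activation="tanh") for _ in range(3)]
        + [tf.keras.layers.Dense(1)]
    )

# One network per maturity interval, i = 1, ..., 4
nets = [make_leverage_net() for _ in range(4)]
```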
As closed form pricing formulas are not available for such an LSV model, we compute model prices with the variance reduced Monte Carlo estimator as of (3.10); the details of this pricing method are specified below.

We start by pointing out that from the 150 sampled market smiles, two smiles caused difficulty, in the sense that our implied volatility computation failed due to the remaining Monte Carlo error. By increasing the training parameters slightly, this issue can be mitigated, but the resulting calibrated implied volatility errors stay large out of the money, where the smiles are extreme. Hence, we opt to remove those two samples from the following statistical analysis.
Further, we identify six smiles for which the calibration of at least one slice has failed. Here, we say the calibration of a slice has failed if the maximum error of implied volatility is larger than 0.01. Let us make the following remark: in all these examples the training got stuck in a local minimum, and a second drawing of the initial parameters actually led to satisfying results. A straightforward parallelization to reduce the additional time due to redrawing parameters is of course possible. In the following, however, we keep the data of the six failed calibrations as they are when presenting results.
In Figure 3 we show calibration results for a typical example of randomly generated market data. From this it is already visible that the worst case calibration error (which occurs out of the money) typically ranges between 20 and 40 basis points. The corresponding calibration result for the leverage function L² is given in Figure 4.
Let us note that our method achieves a high calibration accuracy for the considered range of strikes and across all considered maturities. However, the further away from ATM, the more challenging the calibration becomes, as can be seen in the results of a worst case analysis of calibration errors in Figures 6 and 7. There we show the mean as well as different quantiles of the data.
We present a histogram of calibration times in Figure 5. There, we see that a typical calibration of all slices finishes in under 30 minutes, and only rarely do we face the situation where a higher number of optimization steps is needed before the abort criterion is activated. Since our approach allows for straightforward parallelization strategies, these times can be reduced significantly by changing to a multi-GPU setup.

Conclusion
We have demonstrated how the parametrization by means of neural networks can be used to calibrate local stochastic volatility models to implied volatility data. We make the following remarks: (i) The method we presented does not require any form of interpolation for the implied volatility surface, since we do not calibrate via Dupire's formula. As the interpolation is usually done ad hoc, this is a desirable feature of our method. (ii) It is possible to "plug in" any stochastic variance process, such as rough volatility processes, as long as an efficient simulation of trajectories is possible. (iii) The multivariate extension is straightforward. (iv) The level of accuracy of the calibration result is very high, making the presented method of interest by this feature alone. (v) The method can be significantly accelerated by applying distributed computation methods in the context of multi-GPU computational concepts. (vi) The presented algorithm is further able to deal with path-dependent options, since all computations are done by means of Monte Carlo simulations. In particular, it qualifies for joint calibration to S&P and VIX options. This is investigated in a companion paper. (vii) We stress again the advantages of the generative adversarial network point of view: the adversarial choice of the weights leads to very accurate generative neural SDE models. We believe that this is a crucial feature in the joint calibration of S&P and VIX options.

Plots
This section contains the relevant plots for the numerical test outlined in Section 4.

Consider an optimization problem of the form

(B.1) inf_{θ∈Θ} f(θ),  f(θ) = E[Q(θ)],

where Q denotes some stochastic objective function Q : Ω × Θ → R, (ω, θ) ↦ Q(θ)(ω), that depends on parameters θ taking values in some space Θ.
The classical method for solving generic optimization problems for some differentiable objective function f (not necessarily of the expected value form as in (B.1)) is to apply a gradient descent algorithm: starting with an initial guess θ^(0), one iteratively defines

(B.2) θ^(k+1) = θ^(k) − η_k ∇f(θ^(k))

for some learning rate η_k. Under suitable assumptions, θ^(k) converges for k → ∞ to a local minimum of the function f. One insight of deep learning is that stochastic gradient descent methods, going back to the stochastic approximation algorithms proposed by Robbins and Monro [35], are much more efficient. To apply these, it is crucial that the objective function f is linear in the sampling probabilities; in other words, f needs to be of the expected value form as in (B.1). In the simplest form of stochastic gradient descent, the true gradient of f is approximated by the gradient at a single sample Q(θ)(ω), which reduces the computational cost considerably. In the updating step for the parameters θ as in (B.2), f is then replaced by Q(θ)(ω_k), hence

(B.3) θ^(k+1) = θ^(k) − η_k ∇_θ Q(θ^(k))(ω_k).

The algorithm passes through all samples ω_k of the so-called training data set, possibly several times (specified by the number of epochs), and performs the update until an approximate minimum is reached.
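As a toy instance of the update (B.3), consider Q(θ)(ω) = (θ − ω)² with ω ~ N(2, 1), so that f(θ) = E[Q(θ)] is minimized at θ = 2; with a decreasing Robbins-Monro step size the iterates converge to this minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
for k in range(5000):
    omega = rng.normal(2.0, 1.0)   # draw one training sample
    grad = 2.0 * (theta - omega)   # gradient of Q(theta)(omega) in theta
    eta = 1.0 / (k + 10)           # decreasing Robbins-Monro step size
    theta -= eta * grad            # stochastic update as in (B.3)
```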
A compromise between computing the true gradient of f and the gradient at a single sample Q(θ)(ω) is to compute the gradient over a subsample of size N_batch, called a (mini-)batch, so that Q(θ^(k))(ω_k) used in the update (B.3) is replaced by the mini-batch average

(B.4) (1/N_batch) Σ_{i=1}^{N_batch} Q(θ^(k))(ω_{k N_batch + i})

for independent draws ω_1, ..., ω_N (the same N samples can be used for each strike K_j). Equivalently, we have an analogous representation for independent draws ω_1, ..., ω_2N. The corresponding update rule then holds for k ∈ {0, 1, ..., N/N_batch − 1}.
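For the same kind of toy objective, Q(θ)(ω) = (θ − ω)² with ω ~ N(2, 1), a mini-batch version averages the sample gradients before each update, reducing the variance of the parameter iterates at a fixed learning rate.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, eta, N_batch = 0.0, 0.1, 32
for k in range(500):
    batch = rng.normal(2.0, 1.0, size=N_batch)  # one mini-batch of draws
    grad = np.mean(2.0 * (theta - batch))       # averaged gradient as in (B.4)
    theta -= eta * grad                         # mini-batch update
```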
Clearly, we can now modify and improve the estimator by again using hedge control variates and replacing Q_j(θ) by X_j(θ) as defined in (3.9).
Our ground truth model is now specified as in (4.1) with σ_Dup replaced by a_ξ, i.e.

(4.3) dX_t = a_ξ(t, X_t) dW_t − (1/2) a_ξ(t, X_t)² dt.

Fig. 1. Implied volatility of the original parametric family a_ξ (a) versus our modification a_ξ (b).

Fig. 2. (a) Maturities used to generate data for the calibration test. (b) Parameters k_1, k_2, k_3, k_4 that define the strikes of the call options to which we calibrate.

In the subsequent pseudocode, the index i stands for the maturities, N for the number of samples used in the variance reduced Monte Carlo estimator as of (3.10), and k for the updating step in the gradient descent.

Algorithm 4.3.
# Initialize the network parameters
initialize θ_1, ..., θ_4
# Define initial number of trajectories and initial step
N, k = 400, 1
# The time discretization for the MC simulations is ∆_t = 1/100
for i = 1, ..., 4:
    do:
        Simulate N trajectories of the SABR-LSV process up to time T_i,
            compute the payoffs
        Compute the stochastic integral of the Black-Scholes Delta hedge
            against these trajectories as of (2.3)
        Compute the calibration functional as of (3.10) with ℓ(x) = x²
            and normalized vega weights w̄_j for strike K_ij, given by
            w̄_j = w_j / Σ_{l=1}^{20} w_l with w_j = 1/v_ij, where v_ij is
            the Black-Scholes vega for strike K_ij, the corresponding
            market implied volatility and the maturity T_i
        Update θ_i as in (3.11), but with the more sophisticated ADAM
            optimizer with learning rate 10^-3
        Update the parameter N, the condition nextslice, and compute
            model prices according to Specification 4.4
        k = k + 1
    while not nextslice

Specification 4.4.
# Compute model prices π_model for slice i via MC simulation
# using 10^7 trajectories. Apply the Black-Scholes Delta
# hedge for variance reduction.

We update the parameters in the above algorithm according to the following rules:
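A schematic TensorFlow/ADAM training step matching the pseudocode can look as follows; simulate_model_prices stands in for the (differentiable) variance reduced Monte Carlo pricer, and all names are placeholders rather than the actual implementation.

```python
import tensorflow as tf

# ADAM optimizer with learning rate 10^-3, as in the pseudocode
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def calibration_step(net, simulate_model_prices, market_prices, w):
    """One gradient update for the network of the current slice.
    `simulate_model_prices` must map the network to model prices in a
    differentiable way; `market_prices` and the normalized vega weights
    `w` are fixed inputs."""
    with tf.GradientTape() as tape:
        model_prices = simulate_model_prices(net)
        # Calibration functional with ell(x) = x^2 and vega weights
        loss = tf.reduce_sum(w * tf.square(model_prices - market_prices))
    grads = tape.gradient(loss, net.trainable_variables)
    optimizer.apply_gradients(zip(grads, net.trainable_variables))
    return loss
```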

Fig. 3. Left column: Implied volatilities for the calibrated model together with the data (market) implied volatilities for a typical example of sampled market prices. Right column: Calibration errors obtained by subtracting model implied volatilities from the data implied volatilities. Each row corresponds to one of the four available maturities.

Fig. 5. Histogram of calibration times of the statistical test introduced in Section 4.1.2. One calibration in this histogram corresponds to the number of minutes before the calibration of all slices is finished.

Fig. 6. Boxplots for the first (above) and second (below) slice. Depicted are the mean (horizontal line) as well as the 0.95, 0.70, 0.3, 0.15 quantiles for the maximum of absolute calibration errors along all strikes.

Fig. 7. Boxplots for the third (above) and fourth (below) slice. Depicted are the mean (horizontal line) as well as the 0.95, 0.70, 0.3, 0.15 quantiles for the maximum of absolute calibration errors along all strikes.
For the variance reduced Monte Carlo estimator as of (3.10) we always use a standard Euler SDE discretization with step size ∆_t = 1/100. As variance reduction method, we implement the running Black-Scholes Delta hedge with the instantaneous running volatility of the price process, i.e. L(t, S_t)α_t is plugged into the formula for the Black-Scholes Delta as in (2.3). The only parameter that remains to be specified is the number of trajectories used for the Monte Carlo estimator, which is done in Algorithm 4.3 and Specification 4.4. As a first calibration step, we calibrate the SABR model (i.e. (4.1) with L ≡ 1) to the market prices and fix the calibrated SABR parameters ν, ρ and α_0. For the remaining parameters θ_i, i = 1, ..., 4, we apply Algorithm 4.3 until all parameters are calibrated.