Abstract
We propose an accurate data-driven numerical scheme to solve stochastic differential equations (SDEs), by taking large time steps. The SDE discretization is built up by means of the polynomial chaos expansion method, on the basis of accurately determined stochastic collocation (SC) points. By employing an artificial neural network to learn these SC points, we can perform Monte Carlo simulations with large time steps. Basic error analysis indicates that this data-driven scheme results in accurate SDE solutions in the sense of strong convergence, provided the learning methodology is robust and accurate. With a method variant called the compression–decompression collocation and interpolation technique, we can drastically reduce the number of neural network functions that have to be learned, so that computational speed is enhanced. As a proof of concept, 1D numerical experiments confirm a high-quality strong convergence error when using large time steps, and the novel scheme outperforms some classical numerical SDE discretizations. Some applications, here in financial option valuation, are also presented.
1. Introduction
The highly successful deep learning paradigm () receives a lot of attention in science and engineering. Within numerical mathematics, the machine learning methodology has, for example, successfully entered the field of numerically solving partial differential equations (PDEs) (; ; ; ; ; ). The aim with machine learning is then either to speed up the solution process or to solve high-dimensional problems that are not easily handled by traditional numerical methods.
In this paper, we develop a highly accurate numerical discretization scheme for scalar stochastic differential equations (SDEs), which is based on taking possibly large discrete time steps. We “learn” to take large time steps, with the help of the stochastic collocation Monte Carlo sampler (SCMC) proposed by (), and by using an artificial neural network (ANN), within the classical supervised learning context.
SDEs are widely used to describe uncertain phenomena, in physics, finance, epidemics, amongst others, as a means to model and quantify uncertainty. The corresponding solutions are stochastic processes. Numerical approximation of the solution to an SDE is standard practice, as an analytic solution is typically not available. The most commonly known technique to solve SDEs is based on the Monte Carlo (MC) simulation, for which the SDE first needs to be discretized. There are quite a few applications that could benefit from an accurate and efficient numerical method on the basis of a large time step discretization (), for example, in finance, the valuation of path-dependent financial derivatives or financial risk management where counterparty credit risk plays a role.
Basically, there are two ways to measure the convergence rate of discrete solutions to SDEs, by means of the approximation to the sample path or by the approximation to the corresponding distribution. This way, strong and weak convergence of numerical SDE solutions have been defined, respectively (see ). Weak convergence, the convergence in distributional sense, is often addressed in the literature. Moment-matching, for example, is a basic technique used to improve weak convergence. Strong pathwise convergence is particularly challenging, and requires accurate conditional distributions. There are natural approaches to improve strong convergence properties, e.g., by adding higher-order terms or by using finer time grids. However, these are nontrivial and costly, especially when considering multi-dimensional SDEs.
We aim to develop highly accurate numerical schemes by means of deep learning, for which the strong error of the discretization does not depend on the size of the simulation time step. For this, we employ the SCMC method as an efficient approach for approximating (conditional) distribution functions. The distribution function of interest is expanded as a polynomial in terms of a random variable, which is cheap to sample from at given collocation points, and interpolation takes place between these points. The resulting big time step discretization, in which the SCMC methodology is combined with deep learning, is called the Seven-League scheme1 here, and we abbreviate it by the 7L scheme.
There are different reasons to learn stochastic collocation points instead of the sample paths directly. Stochastic collocation points have a specific physical meaning, which makes the data-driven scheme explainable. Monte Carlo sample paths are random, while collocation points are path independent and deterministic (i.e., representing key features of a probability distribution), which simplifies the learning process when using neural networks. Unlike the SCMC method, which provides accurate Monte Carlo samples given a constant time step and for one specific instance of the SDE parameters, the 7L methodology enables us to generate samples for a wide range of time steps and for many different instances of the model parameters (i.e., for a family of SDEs), by means of a neural network that learns the evolution of the collocation points over time.
The 7L scheme is composed of two separate phases under the framework of supervised learning, i.e., an offline (training) phase and an online (prediction) phase. The training phase, which usually requires heavy computation and many datasets, is done only once and offline. The prediction phase, which is a computationally cheap and highly efficient process, can be performed in an online fashion.
This paper serves to show that the proposed methodology performs very well for scalar SDEs. This work can thus be seen as a proof of concept. Higher-dimensional extensions of the 7L scheme will be presented in forthcoming work. Because this methodology is, however, “grid-based”, we will not reach extremely high dimensions without additional enhancements. This is left for future work.
The remainder of this paper is organized as follows. In Section 2, SDEs, the discretization, stochastic collocation, and the connection between SDE discretizations and the SCMC method are introduced. In Section 3, the data-driven methodology to address large time step simulation of SDEs, i.e., the 7L scheme, is explained. ANNs will be used as function approximators to learn the stochastic (conditional) collocation points. A brief description of their details is given in Section 3.3. In Section 4, we introduce a compression–decompression technique to accelerate the computation. This latter efficient variant is named the 7L-CDC scheme (i.e., seven-league compression–decompression scheme). Section 5 presents numerical experiments to show the performance of the proposed approach. Furthermore, the corresponding error is analyzed. Section 6 concludes.
2. Stochastic Differential Equations and Stochastic Collocation
We first describe the basic, well-known SDE setting, and explain our notation.
2.1. SDE Basics
We work with a real-valued random variable , defined on the probability space with filtration , sample space , -algebra and probability measure . For the time evolution of , consider the generic scalar Itô SDE,
with the drift term , the diffusion term , model parameters , Wiener process , and given initial value . When the drift and diffusion terms satisfy some regularity conditions (e.g., the global Lipschitz continuity ()), the existence and uniqueness of the solution of (1) are guaranteed. The cumulative distribution function of , , is available and the corresponding density function, evolving over time, is described by the Fokker–Planck equation ().
With a discretization in time interval , with equidistant time step , the discrete random variable at time is denoted by . Traditional numerical schemes have been designed based on Itô’s lemma, in a similar fashion as the Taylor expansion is used to discretize deterministic ODEs and PDEs. The basic discretization, for each Monte Carlo path, is the Euler–Maruyama scheme (), which reads,
where is a realization (i.e., a number) from random variable , which represents the numerical approximation to the exact solution at time point , and a realization is drawn from the random variable X, which here follows the standard normal distribution . Moreover, (a number) will be used as the notation for a realization of .
In addition, the Milstein discretization () reads,
where represents the derivative with respect to of . When the drift and diffusion terms are independent of time t, the SDE is called time-invariant.
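To make these two classical discretizations concrete, the following minimal Python sketch implements the Euler–Maruyama update and the Milstein correction term on a fine time grid, using geometric Brownian motion (treated later, in Section 5) as the example SDE; the function name and default arguments are illustrative only.

```python
import numpy as np

def simulate_gbm_paths(s0, mu, sigma, T, n_steps, n_paths, scheme="euler", seed=0):
    """Small-time-step Monte Carlo for GBM, dS = mu*S dt + sigma*S dW.

    scheme="euler"    : Euler-Maruyama update (strong order 0.5).
    scheme="milstein" : adds the Ito correction 0.5*sigma^2*S*(dW^2 - dt) (strong order 1.0).
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    s = np.full(n_paths, s0, dtype=float)
    for _ in range(n_steps):
        dw = np.sqrt(dt) * rng.standard_normal(n_paths)
        s_new = s + mu * s * dt + sigma * s * dw
        if scheme == "milstein":
            s_new += 0.5 * sigma**2 * s * (dw**2 - dt)
        s = s_new
    return s  # samples of S(T), one per path
```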
Two error convergence criteria are commonly used to measure the SDE discretization accuracy; that is, the convergence in the weak and strong sense. Strong convergence, which is of our interest here, is defined as follows.
Definition 1.
Let be the exact solution of an SDE at time , its discrete approximation with time step converges in the strong sense, with order , if there exists a constant K such that
It is well-known that the Euler–Maruyama scheme (2) has strong convergence , while the Milstein scheme (3) has = 1.0. When deriving high-order schemes for SDEs, the rules of Itô calculus must be respected (). As a result, there will be eight terms in a Taylor SDE scheme with = 1.5, and twelve with = 2.0, and the computational complexity increases. As a consequence, higher-order schemes are involved and somewhat expensive. Convergence of the numerical solution as the time step tends to zero is guaranteed, but the computational costs increase significantly to achieve accurate solutions.
The generic form of the above mentioned numerical schemes to solve the Itô SDE is as follows,
where m represents the number of polynomial terms, and the coefficients are pre-defined and equation-dependent. For the sake of notational convenience, let . For example, for the Euler–Maruyama scheme (2), with , we have
while for the Milstein scheme, with , it follows that
With these explicit coefficients we arrive at the probability distribution of the random variable,
These discrete SDE schemes are based on a series of transformations of the previous realization to approximate the conditional distribution,
A numerical scheme is thus essentially based on the conditional sampling . The Euler–Maruyama scheme draws from a normal distribution, with a specific mean and variance, to approximate the distribution in the next time point, while the Milstein scheme combines a normal and a chi-squared distribution. Similarly, we can derive the stochastic collocation methods.
2.2. Stochastic Collocation Method
Let us assume two random variables, Y and X, where the latter one is cheaper to sample from (e.g., X is a Gaussian random variable). These two scalar random variables are connected, via,
where is a uniformly distributed random variable, and are supposed to be cumulative distribution functions (CDFs). Note that and are random variables following the same uniform distribution. and are supposed to be strictly increasing functions, so that the following inversion holds true,
where and are samples (numbers) from Y and X, respectively. The mapping function, , connects the two random variables and guarantees that equals , in distributional sense and also element-wise. The mapping function should be approximated, i.e., , by a function which is computationally cheap. When function is available, we may generate “expensive” samples, from Y, by using the cheaper random samples from X.
The stochastic collocation Monte Carlo method (SCMC) developed in () aims to find an accurate mapping function in an efficient way. The basic idea is to employ Equation (11) at specific collocation points and approximate the function by a suitable monotonic interpolation between these points. This procedure, see Algorithm 1, reduces the number of expensive inversions to obtain many samples from .
Algorithm 1: SCMC Method
Taking an interpolation function of degree (with , as we need at least two collocation points), as an example, the following steps need to be performed:
The SCMC method parameterizes the distribution function by imposing probability constraints at the given collocation points. Taking the Lagrange interpolation as an example, we can expand function in the form of polynomial chaos,
Monotonicity of interpolation is an important requirement, particularly when dealing with peaked probability distributions.
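As an illustration of Algorithm 1, the sketch below (an illustrative Python fragment, not the authors' implementation) maps cheap standard-normal draws to approximate samples of an "expensive" variable Y: the expensive inverse CDF is evaluated only at the collocation points, with polynomial interpolation in between.

```python
import numpy as np
from scipy.stats import norm
from scipy.interpolate import BarycentricInterpolator

def scmc_samples(target_ppf, n_samples, m=5, seed=0):
    """SCMC sampling sketch: Y approximated by g_m(X) with X ~ N(0,1).

    target_ppf : inverse CDF of Y (the only "expensive" function; called m times).
    m          : number of collocation points (polynomial degree m-1).
    """
    rng = np.random.default_rng(seed)
    # Optimal collocation points for X ~ N(0,1): Gauss-Hermite (probabilists') nodes.
    x_nodes, _ = np.polynomial.hermite_e.hermegauss(m)
    # Expensive inversions at the collocation points only: y_j = F_Y^{-1}(F_X(x_j)).
    y_nodes = target_ppf(norm.cdf(x_nodes))
    # Interpolated mapping g_m(x), applied to many cheap normal samples.
    g_m = BarycentricInterpolator(x_nodes, y_nodes)
    return g_m(rng.standard_normal(n_samples))

# Example: approximate lognormal samples from only five inversions,
# samples = scmc_samples(lambda p: np.exp(0.2 * norm.ppf(p)), 100_000)
```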
The Cameron-Martin Theorem () states that any distribution can be approximated by a polynomial chaos approximation based on the normal distribution, but also other random variables may be used for X (see, for example, ).
3. Methodology
For our purposes, given , the conditional variable can be written as,
where the coefficients are now functions of realization . Equation (13), for large m-values, holds for any , particularly also for large . As such the scheme can be interpreted as an almost exact simulation scheme for an SDE under consideration. By the scheme in (13) we can thus take large time steps in a highly accurate discretization scheme. More specifically, a sample from the known distribution X can be mapped onto a corresponding unique sample of the conditional distribution by the coefficient functions.
There are essentially two possibilities for using an ANN in the framework of the stochastic collocation method, the first being to directly learn the (time-dependent) polynomial coefficients, , in (13), the second to learn the collocation points, . The two methods are equivalent mathematically, but the latter, our method of choice, appears more stable and flexible. Here, we explain how to learn the collocation points, , which is then followed by inferring the polynomial coefficients. When the stochastic collocation points at time are known, the coefficients in (13) can be easily computed.
An SDE solution is represented by its cumulative distribution at the collocation points, plus a suitable accurate interpolation . In other words, the SCMC method forces the distribution functions (the target and the numerical approximation) to strictly match at the collocation points over time. The collocation points are dynamic and evolve with time.
3.1. Data-Driven Numerical Schemes
Calculating the conditional distribution function requires generating samples conditionally on previous realizations of the stochastic process. Based on a general polynomial expression, the conditional sample, in discrete form, is defined as follows,
where , and the coefficients , at time , are functions of the variables , see Equations (6) and (7).
In the case of a Markov process, the future does not depend on past values, given the present. Given , the random variable only depends on the increment . The process has independent increments, and the conditional distribution at time given information up to time only depends on the information at .
Similar to these coefficient functions, the m conditional stochastic collocation points at time , , with , can be written as a functional relation,
A closed-form expression for function is generally not available. Finding the conditional collocation points can however be formulated as a regression problem.
It is well-known that neural networks can be utilized as universal function approximators (). We then generate random data points in the domain of interest and the ANN should “learn the mapping function ”, in an offline ANN training stage. The SCMC method is here used to compute the corresponding collocation points at each time point, which are then stored to train the ANN, in a supervised learning fashion (see, for example, ).
3.2. The Seven-League Scheme
Next, we detail the generation of the stochastic collocation points to create the training data. Consider a stochastic process , , where represents the maximum time horizon for the process that we wish to sample from. When the analytical solution of the SDE is not available (and we cannot use an exact simulation scheme with large time steps), a classical numerical scheme is employed, based on tiny constant time increments , i.e., a discretization in the time-wise direction with grid points , to generate a sufficient number of samples at each time point , from which the corresponding cumulative distribution functions are approximated highly accurately. Note that the training samples are generated with a fixed time discretization, and, in the case of a Markov process, one can easily obtain discrete values for the underlying stochastic process for many possible time increments, e.g., , . With the obtained samples, we approximate the corresponding marginal collocation points at time , as follows,
where, with integer , , , represent the approximate collocation points of at time , and are optimal collocation points of variable X. For simplicity, consider , so that the points are known analytically and do not depend on time point . In the case of a normal distribution, these points are known quadrature points, and tabulated, for example, in (). After this first step, we have the set of collocation points, , for and . Subsequently, the from (16) are used as the ground-truth to train the ANN.
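A possible implementation of Equation (16), under the assumption that a batch of fine-grid Monte Carlo samples at one time point is already available, replaces the exact inverse CDF by the empirical quantile function (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def marginal_sc_points(samples_at_t, m=5):
    """Approximate the m marginal collocation points of Y(t_i) from MC samples:
    y_j = F_Y^{-1}(F_X(x_j)), with x_j the Gauss-Hermite nodes of N(0,1) and
    F_Y^{-1} replaced by the empirical quantile function of the samples."""
    x_nodes, _ = np.polynomial.hermite_e.hermegauss(m)
    return np.quantile(samples_at_t, norm.cdf(x_nodes))
```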
In the second step, we determine the conditional collocation points. For each time step and collocation point indexed by j, a nested Monte Carlo simulation is then performed to generate the conditional samples. Similar to the first step, we obtain the conditional collocation points from each of these sub-simulations using (16). With being an integer representing the number of conditional collocation points, the above process yields the following set of conditional collocation points,
where is a conditional collocation point, and , , . Note that, in the case of Markov processes, the above generic procedure can be simplified by just varying the initial value instead of running a nested Monte Carlo simulation. Specifically, we then set , and to generate the corresponding conditional collocation points.
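For the Markov shortcut mentioned above, a sketch could restart the fine-grid simulation from each marginal collocation point and reuse the quantile-based routine; it relies on the illustrative helpers `simulate_gbm_paths` and `marginal_sc_points` from the earlier sketches and again uses GBM as the concrete instance.

```python
import numpy as np

def conditional_sc_points(y_marginal, mu, sigma, dt_large, m_c=5,
                          n_inner=100_000, n_sub_steps=100):
    """Conditional SC points one large step dt_large ahead, per marginal SC point.

    Instead of a nested Monte Carlo simulation, each marginal collocation point y_j
    serves as a new initial value for a fine-grid sub-simulation (Markov property)."""
    rows = []
    for y_j in y_marginal:
        inner = simulate_gbm_paths(y_j, mu, sigma, dt_large,
                                   n_sub_steps, n_inner, scheme="milstein")
        rows.append(marginal_sc_points(inner, m=m_c))
    return np.asarray(rows)   # shape (len(y_marginal), m_c)
```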
The inverse function, , is often not known analytically, and needs to be derived numerically. An efficient procedure for this is presented in (). Of course, it is well-known that the computation of is equivalent to the computation of the quantile at level p.
We encounter essentially four types of stochastic collocation (SC) points: are called the original SC points, are original conditional collocation points, are the marginal SC points, and are the conditional SC points. For example, is conditional on a realization . In the context of Markov processes, the marginal SC points only depend on the initial value , thus they are a special type of conditional collocation points, i.e., . When a previous realization happens to be a collocation point, e.g., , we have , which will be used to develop a variation of the 7L scheme in Section 4.
When the data generation is completed, the ANNs are trained in a supervised-learning fashion, on the generated SC points to approximate the function H in (15), giving us a learned function . This is called the training phase. With the trained ANNs, we can approximate new collocation points, and develop a numerical solver for SDEs, which is the Seven-League scheme (7L), see Algorithm 2. Figure 1 gives a schematic illustration of Monte Carlo sample paths that are generated by the 7L scheme.
Algorithm 2: 7L Scheme
Remark 1
(Lagrange interpolation issue). In the case of classical Lagrange interpolation, Vandermonde matrix A should not get too large, as the matrix would then suffer from ill-conditioning. However, when employing orthogonal polynomials, this drawback is removed. More details can be found in ().

Figure 1.
Schematic diagram of the 7L scheme. Here conditional SC points, represented by ■, are conditional on a previous realization, denoted by ★. “Conditional PDF” is the conditional probability density function, defined by these conditional SC points. The density function, which is not required by 7L, is plotted only for illustration purposes.
When the approximation errors from the ANN and SCMC techniques are sufficiently small, the strong convergence properties of the 7L scheme can be estimated, as follows,
where time step is used to define the ANN training data-set, and the actual time step is used for ANN prediction, with . Based on the trained 7L scheme, the strong error, , thus, does not grow with the actual time step . In particular, let us assume , for example , when employing the Euler–Maruyama scheme with time step during the ANN learning phase, we expect a strong convergence of , which then equals , while the use of the Milstein scheme during training would result in accuracy. When , the time step during the learning phase is 100 times smaller than , which has a corresponding effect on the overall scheme’s accuracy in terms of its strong, pathwise convergence. The maximum value of the time step in the 7L scheme can be set up to for a Markov process. With a time step , we solve the SDE in an iterative way until the actual terminal time T, which can be much larger than the training time horizon . An error analysis can be found in Section 5.2.
3.3. The Artificial Neural Network
The ANN to learn the conditional collocation points is detailed in this subsection. Neural networks can be utilized as powerful functions to approximate a nonlinear relationship. In fact, we will employ a rather basic fully-connected neural network configuration for our learning task.
A fully connected neural network, without skip connections, can be described as a composition function, i.e.,
where represents the input variables, being the hidden parameters (i.e., weights and biases), the number of hidden layers. We can expand the hidden parameters as,
where and represent the weight matrix and the bias vector, respectively, in the ℓ-th hidden layer.
Each hidden-layer function, , takes input signals from the output of the previous layer, computes an inner product of weights and inputs, and adds a bias. It then passes the resulting value through an activation function to generate the output.
Let denote the output of the j-th neuron in the ℓ-th layer. Then,
where , , and is a nonlinear transfer function (i.e., activation function). With a specific configuration, including the architecture, the hidden parameters, activation functions, and other specific operations (e.g., drop out), the ANN in (21) becomes a deterministic, complicated, composite function.
Supervised machine learning () is used here to determine the weights and biases; the ANN should learn the mapping from a given input to a given output, so that, for a new input, the corresponding output will be accurately approximated. The ANN methodology basically consists of two phases. During the (time-consuming, but offline) training phase, the ANN learns the mapping from many input-output samples, while in the online testing phase, the trained model is used to very rapidly approximate output values for new parameter sets.
In a supervised learning context, the loss function measures the distance between the target function and the function implied by the ANN. During the training phase, there are many known data samples available, which are represented by input-output pairs . With a user-defined loss function , training neural networks is formulated as
where the hidden parameters are estimated to approximate the function of interest in a certain norm. More specifically, in our case, the input, , equals , and the output, , represents the collocation points , as in Equation (18). In the domain of interest , we have a collection of data points , , and their corresponding collocation points , which form a vector of input-output pairs . For example, using the -norm, the discrete form of the loss function reads,
One of the popular approaches for training ANNs is to optimize the hidden parameters via backpropagation, for instance, using stochastic gradient descent ().
4. An Efficient Large Time Step Scheme: Compression–Decompression Variant
The 7L scheme employs the ANNs to generate the conditional collocation points for all samples of a previous time point, see Figure 1b. The extensive use of ANNs in the methodology has an impact on the method’s computational complexity.
In order to speed up the data-driven 7L scheme procedure, we introduce a compression–decompression (CDC) variant, in the online validation phase. Please note that the offline learning phase is identical for both variants. The so-called 7L-CDC scheme, to be developed in this section, only uses the ANNs to determine the conditional collocation points for the optimal collocation points of a previous time point. All other samples will be computed by means of accurate interpolation. The computational complexity is reduced when the chosen interpolation is computationally cheaper than using ANNs.
By the compression–decompression procedure, Monte Carlo sample paths based on SDEs can be recovered from a 3D matrix. We then employ the 7L scheme procedure only to compute the entries of the encoded matrices at time point (see Section 4.1), which leads to a reduction of the computational cost in many cases.
Next, we will explain the process of recovering the sample paths from a known matrix C using the decompression method.
4.1. CDC Variant
With a discretization , we define a 3D matrix , which consists of entries in total. Recall that represents the number of collocation points and the number of conditional collocation points, N the number of grid points over time. and may vary with time points (in case of an adaptive scheme, for example), but we use constant values for and . For each time point , we construct a 2D matrix ,
with , , the original SC points, the marginal collocation points (depending on the initial value ), and , , the k-th original conditional SC points, the conditional SC points (depending on the marginal collocation points ). We thus represent the conditional SC points, , by matrix elements . The two empty cells in (26) are not addressed in the computation. Moreover, at the last time point, , is not needed.
Remark 2
(Time-dependent elements). As the original collocation points, and , do not depend on time, we can remove the first row and the first column of matrix to obtain a time-dependent version, , with the following elements,
An entry in matrix can be computed by the trained ANNs, as follows,
using the marginal SC points,
where , , represents the ANN function which approximates the j-th collocation point when , and the j-th conditional collocation point when . When , . Figure 2 shows an example of the distribution of the conditional SC points when and . When the matrices have been defined, all sample paths are compressed into a structured matrix. In other words, matrix contains all of the information needed to perform the Monte Carlo simulation of the SDEs, apart from the interpolation technique.
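The assembly of the time-dependent entries could be sketched as follows; `ann_marginal` and `ann_cond` are stand-ins for the trained networks of Equations (28) and (29), so the function signatures here are assumptions rather than the actual interface.

```python
import numpy as np

def build_cdc_matrix(ann_marginal, ann_cond, y0, theta, dt, n_steps, m=5, m_c=5):
    """Sketch of the time-dependent part of the 7L-CDC matrix C.

    ann_marginal(y0, theta, t)  -> m marginal SC points of Y(t), given Y(0) = y0
    ann_cond(y_prev, theta, dt) -> m_c conditional SC points one step dt ahead
    C[i, j, :] holds the conditional SC points at t_{i+1}, conditional on the
    j-th marginal SC point at t_i."""
    C = np.empty((n_steps, m, m_c))
    for i in range(n_steps):
        if i == 0:
            y_marg = np.full(m, y0)   # at t_0 the distribution is degenerate
        else:
            y_marg = ann_marginal(y0, theta, i * dt)
        for j, y_j in enumerate(y_marg):
            C[i, j, :] = ann_cond(y_j, theta, dt)
    return C
```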

Figure 2.
Schematic illustration of matrix C, with five marginal SC points and five conditional SC points. The conditional SC points are dependent on the realization connected to the corresponding marginal SC point.
The resulting matrix C will be decompressed to generate Monte Carlo sample paths with the help of an interpolation. The process of decompression is straightforward given a matrix . In addition to the interpolation process in SCMC (see Equation (19)), an interpolation is needed to compute conditional collocation points for previous realizations, based on the matrix .
Suppose a vector of samples at time , and we wish to generate samples of . For a specific sample , we need to calculate conditional SC points. To obtain the k-th () conditional SC point, we take marginal collocation points and their k-th conditional collocation points, using functions (28) and (29), to form input-output pairs . This combination gives us the interpolation function , through which we can obtain the k-th conditional SC point of at time point ,
As a result, for each sample , we obtain interpolation nodes, which form a set of pairs, ,, up to , which are used to determine the interpolation function required by SCMC. Afterwards, to generate a new sample , the mapping function defines a conditional sample by taking in a random sample from X,
The choice of the appropriate number of (conditional) collocation points is a trade-off between the computational cost and the required accuracy. When the number of collocation points tends to infinity, the 7L-CDC scheme will resemble the 7L scheme from Section 3.1. A schematic picture is presented in Figure 3.
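For a single path and a single large time step, the decompression step could be sketched as below. PCHIP is used for both interpolations here, and the extrapolation outside the outer marginal SC points is an implementation choice of this sketch, not a prescription from the text.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def next_sample(y_prev, y_marginal, cond_slice, x_nodes, rng):
    """One large 7L-CDC time step for one path.

    y_marginal : m marginal SC points at the current time t_i (strictly increasing)
    cond_slice : (m, m_c) slice of C; row j holds the conditional SC points of the
                 j-th marginal SC point
    x_nodes    : m_c collocation points of the standard normal (Gauss-Hermite nodes)."""
    m_c = cond_slice.shape[1]
    # Step 1: interpolate each conditional SC point as a function of the realization y_prev.
    y_cond = np.array([PchipInterpolator(y_marginal, cond_slice[:, k])(y_prev)
                       for k in range(m_c)])
    # Step 2: SCMC mapping g_{m_c}: a cheap normal draw -> conditional sample at t_{i+1}.
    g = PchipInterpolator(x_nodes, y_cond)
    return float(g(rng.standard_normal()))
```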

Figure 3.
Schematic diagram of the 7L-CDC scheme at time . (a): marginal SC points, corresponding to Equation (29). (b): sample paths generated by 7L-CDC. The triple {2,1,3}, in the picture, represents the third conditional SC point, dependent on the first marginal SC point at time point . The above procedure is also applicable to other time points.
Remark 3
(Computation time). During the online phase of the method, the total computation time of the large time step schemes consists of essentially two parts, calculation of the conditional SC points, and generating random samples by interpolation (the second part). The difference between the 7L and 7L-CDC schemes is found in the computation of the conditional SC points, the generation of the samples is identical for both schemes.
In this first part, for the 7L-CDC scheme, the work consists of setting up matrix C by the ANNs and computing the conditional SC points by the interpolation. In matrix C, there are elements that are computed by the ANNs, where N represents the number of time points, the number of collocation points and the number of conditional collocation points. Depending on the collocation points, the interpolation is based on conditional collocation points for each path. For the 7L scheme, elements, where M is the total number of paths, are computed by the ANNs. The time ratio between the 7L-CDC and 7L schemes is found to be,
with the computational time of the ANN (i.e., the function ), for the interpolation (i.e., the function in (30)), which is a polynomial function of . Given the fact that the number of sample paths is typically much larger than the number of SC points ,
When the employed interpolation is computationally cheaper than the evaluation of the ANNs, the 7L-CDC scheme thus requires fewer computations than the 7L scheme.
4.2. Interpolation Techniques
To define the function in (13) or in (30), we will compare three different interpolation techniques.
A bijective mapping function is obtained by the monotonic piecewise cubic Hermite interpolating polynomial (PCHIP) (). Assuming there are multiple data points, , using,
the derivatives at the points are computed as a weighted average,
where and . At each data point, the first derivative is guaranteed to be continuous, and a cubic spline is used to interpolate between the data points. If , then . PCHIP requires more computations than a Lagrange interpolation, but it results in a monotonic function, which is generally advantageous.
The convergence of the stochastic collocation method is not really dependent on the monotonicity of the mapping function, so an interpolation based on Lagrange polynomials is possible in practice. The barycentric version of Lagrange interpolation (), our second interpolation technique, provides a rapid and stable interpolation scheme, which is applied when using Lagrange interpolation in our numerical experiments. With the help of the basic Lagrange interpolation expressions, however, we can conveniently perform theoretical analysis.
The third technique is based on choosing the interpolation points carefully (e.g., as the Chebyshev zeros) to achieve a stable interpolation. The Chebyshev interpolation () is of the form,
where are interpolation basis functions, here Chebyshev orthogonal polynomials, up to degree . The Chebyshev nodes in the interval are computed as,
When the polynomial degree increases, the Chebyshev interpolation retains uniform convergence. In financial mathematics, Chebyshev interpolation has been successfully used, for example, to compute parametric option prices and implied volatility in (; ; ). When the interpolation points are not Chebyshev nodes (e.g., Gauss quadrature points), the Chebyshev coefficients can be estimated by means of a least squares regression, which is also called the Chebyshev fit. In such case, the coefficients in (14) can be explicitly computed, in contrast to the barycentric Lagrange interpolation.
The selection of a suitable interpolation technique depends on various factors, for instance, speed, monotonicity, availability of coefficients. These three interpolation methods will be compared in the numerical section.
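The snippet below contrasts the three interpolants on the same set of five collocation nodes; the node values are made up purely for illustration.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator, BarycentricInterpolator
from numpy.polynomial import chebyshev as cheb

x_nodes = np.polynomial.hermite_e.hermegauss(5)[0]   # Gauss-Hermite nodes (not Chebyshev nodes)
y_nodes = np.exp(0.2 * x_nodes)                       # made-up monotone node values
x_eval = np.linspace(x_nodes[0], x_nodes[-1], 201)

g_pchip = PchipInterpolator(x_nodes, y_nodes)(x_eval)        # monotone by construction
g_bary = BarycentricInterpolator(x_nodes, y_nodes)(x_eval)   # stable Lagrange form
coeffs = cheb.chebfit(x_nodes, y_nodes, 4)                    # "Chebyshev fit" (least squares;
g_cheb = cheb.chebval(x_eval, coeffs)                         #  exact here, since deg+1 = #nodes)
```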
4.3. Pathwise Sensitivity
Often in computations with stochastic variables, we wish to determine the derivatives of the variables of interest, the so-called pathwise sensitivities. This is generally not a trivial exercise in a Monte Carlo setting, see, for example, the discussions in (; ; ; ). With our new large time step schemes, we determine the pathwise sensitivities of the computed stochastic variables in a natural way, based on the available information in the (conditional) SC points and the interpolation. In this section, we derive the pathwise sensitivity of the state variable with respect to model parameters .
The first derivative with respect to parameter of the conditional distribution in Equation (13) reads,
where is the total derivative of function and are basis functions, which do not depend on the model parameters. For the derivative in (33) at time , the expression of the ANN (21), given the specific activation function, is available. So, the function is analytically differentiable. As a result, can be easily computed, by means of automatic differentiation in the machine learning framework. Thus, we arrive at the sensitivity of a sample path with respect to model parameters, as follows,
5. Numerical Experiments
In this section with numerical experiments we will give evidence of the high quality of our numerical SDE solver, by analyzing in detail its components. For this purpose, we mainly focus on the Geometric Brownian Motion SDE, which reads,
where the model parameters are the constant drift and volatility coefficients, i.e., , and the initial value is given by at time . For (35) a continuous-time analytic expression for the asset price at time t is available, i.e.,
where , and is governed by the lognormal distribution. The derivative of the stock price with respect to volatility is available in closed form, and reads,
This expression will be used as the reference value of the sensitivity obtained from the 7L discretization.
Furthermore, the Ornstein–Uhlenbeck process is explained and also analyzed, in Section 5.3.2. We will employ the large time step discretization, in which the conditional collocation points are computed by the trained ANN, and compare the results of the novel scheme with those obtained by the Milstein SDE discretization.
5.1. ANN Training Details
GBM and the OU process are Markov processes, so the conditional distribution at time given information up to time only depends on the information at time . The ANN (15) will therefore be used for the conditional stochastic collocation points, with , for GBM, and for the OU process (as will be discussed in Section 5.3.2).
Regarding the size of the compression–decompression matrix, the more conditional collocation points, the better the accuracy of the 7L-CDC method. A 5 × 5 matrix size (i.e., five marginal and five conditional SC points) is preferred, taking into account the computing effort and the accuracy. In () it has been discussed and shown that highly accurate approximations could already be obtained with a small number of collocation points.
As the first method component, we evaluate the quality of the ANN which defines the collocation points, for the GBM dynamics. For this purpose, random points (i.e., sets of input parameters) are generated by using Latin hypercube sampling (LHS) in the domain of interest for the three parameters , see Table 1. As the second step, for each point, a Monte Carlo method is employed to simulate the discretized SDE based on the tiny time step . We use an Euler–Maruyama time discretization for this purpose, with the number of time points and the time horizon . At each time step, , the conditional distribution function is computed, based on the many generated MC paths. This way, the resulting collocation points for the “big time step”, , are also obtained, to form the required training dataset.

Table 1.
Training data, . Here is an example for training on five SC points.
We set , , . The amount of training data used is given by samples in total, which are divided into an ANN training () and an ANN testing () set.
The ANN hyperparameters have an impact on the errors from the optimization related to training the ANN, as well as on the model performance. The approximation capacity does not only depend on the number of hidden parameters, but also on the network structure (i.e., on the width and depth of the network). In principle, deep neural networks have more powerful expressiveness than shallow neural networks. The fully connected neural network employed is composed of one input layer, one output layer and four hidden layers. Each hidden layer consists of 50 neurons, with Softplus, i.e., , as the activation function (). Before training the ANN, the hidden parameters are initialized via the Glorot technique (). Training proceeds in batches. At each iteration, a variant of SGD, the Adam () optimization algorithm, which implements adaptive learning rates, randomly selects a portion of the training samples according to the batch size, to calculate the gradient for updating the hidden parameters. In an epoch, all training samples have been processed by the optimizer. The mean squared error (MSE), which measures the distance between the ground-truth and the model values in supervised learning, is used to update the hidden parameters during training. The mean absolute error (MAE), i.e., , is also computed, as the pathwise error of the 7L scheme is related to the maximum absolute difference in the approximated collocation points (see the derivation in Section 5.2 and the results in Section 5.3).
The training process starts with a relatively large learning rate (i.e., ) to avoid getting stuck in local optima. After 1000 epochs, the learning rate is reduced to , followed by training 500 more epochs, to achieve a steady convergence. Afterwards, the trained ANN is evaluated on the testing dataset, with the results presented in Figure 4 (for two of the collocation points) and Table 2. Clearly, the predicted values fit very well with the true values of the stochastic collocation points. This implies that the trained ANNs reach a highly satisfactory generalization, and generate accurate and robust approximation results for all five collocation points.
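A compact Keras sketch of this configuration is given below. The layout (four hidden layers of 50 Softplus neurons, Glorot initialization, Adam, MSE loss with MAE monitored) follows the text; the learning-rate values and the batch size are placeholders, since the exact values are not stated here.

```python
import tensorflow as tf

def build_7l_ann(n_inputs=4, n_outputs=5, n_hidden=4, width=50):
    """Fully connected network for the conditional collocation points (sketch)."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(width, activation="softplus",
                                    kernel_initializer="glorot_uniform",
                                    input_shape=(n_inputs,)))
    for _ in range(n_hidden - 1):
        model.add(tf.keras.layers.Dense(width, activation="softplus",
                                        kernel_initializer="glorot_uniform"))
    model.add(tf.keras.layers.Dense(n_outputs))   # linear output: the collocation points
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse", metrics=["mae"])
    return model

# Two-stage schedule (epoch counts as in the text; rates and batch size assumed):
# model.fit(X_train, Y_train, batch_size=1024, epochs=1000)
# model.optimizer.learning_rate.assign(1e-4)
# model.fit(X_train, Y_train, batch_size=1024, epochs=500)
```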

Figure 4.
The goodness-of-fit on test dataset. Two scatter plots show the relation between the predicted values and the ground truth.

Table 2.
The approximation performance on test data set.
5.2. Numerical Error Analysis, the Lagrangian Case
There are essentially two approximation errors in the 7L scheme, a neural network approximation error when generating the collocation points, and an SCMC error when representing the conditional distribution function.
Considering d inputs, the neural network may approximate any function , from the function space , where the derivatives up to order are Lipschitz continuous (). The input and output variables can be normalized to the unit interval . With a fixed network architecture during training, the approximation error can be assessed, as follows.
Theorem 1.
From (), given any , there exists a neural network which is capable of approximating any function with error , based on the following configuration:
- at least piece-wise activation functions,
- at least hidden layers and weights and computation units, where depends on the parameters d and n.
When the architecture is dynamic, the error bound can be further reduced, as shown in () and (). One of the assumptions is that the ANNs are sufficiently trained, so that the optimization error is negligible.
The error from the SCMC methodology was derived in (). The optimal collocation points, , , correspond to the zeros of an orthogonal polynomial. In the case of Lagrange interpolation, when the collocation method can be connected to Gauss quadrature, we have
with , the difference between the target and the SC approximated function, the weight function, and the quadrature weights. When the Gauss–Hermite quadrature is used with m collocation points, the approximation error of the CDF can be estimated as,
where and the distance function
with . In other words, the error of approximating the target CDF converges exponentially to zero when the number of corresponding collocation points increases.
At each time point , the process is approximated using the collocation method, by a polynomial , i.e., in the case of classical Lagrange interpolation, using ,
where the collocation points . Because of the use of an ANN, the collocation points are not exact, but they are approximated with , where represents the ANN approximated value. The error associated with can be estimated as in (). The impact of on the obtained output distribution needs to be assessed. Let denote the approximate function based on the predicted ANN collocation points , and x a random sample from the standard normal distribution . The approximation error, in the strong sense, is given by
Note that the interpolation functions are identical as they depend solely on the x values. We arrive at the following error related to the ANNs,
Considering the error introduced by SCMC in (39), the total pathwise error reads
In other words, the expected pathwise error can be bounded by the approximation CDF error plus the largest difference in the ANN approximated collocation points.
Kolmogorov–Smirnov Test
The Kolmogorov–Smirnov test, calculating the supremum of a set of distances, is used to measure the nonparametric distance between two empirical cumulative distribution functions. We perform the two-sample Kolmogorov–Smirnov test, as follows,
where and are two empirical cumulative distribution functions, one from the 7L-CDC solution and the other one from the reference distribution. We take the analytic solution of the GBM as the reference distribution.
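In code, the test can be carried out with `scipy.stats.ks_2samp`; the sketch below compares 7L-CDC samples with exact GBM samples drawn according to Equation (36) (array and parameter names are illustrative).

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_against_exact_gbm(samples_7lcdc, s0, mu, sigma, t, seed=0):
    """Two-sample KS test of 7L-CDC samples at time t against exact GBM samples."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(len(samples_7lcdc))
    exact = s0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)
    return ks_2samp(samples_7lcdc, exact)   # (KS statistic, p-value)
```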
Remark 4
(Time horizon for 7L-CDC). The information in Table 1 is used to train the mapping function between a realization (including marginal SC points) and its conditional SC points, via Equation (28). For the marginal SC points in Equation (29), however, we need training data up to terminal time T. So, we generate a second dataset in which the time reaches (the terminal time of interest) and the upper value for equals 5. These two datasets are merged into one set in order to train the ANNs for the 7L-CDC methodology.
Figure 5 shows the Kolmogorov–Smirnov test at different time points based on 10,000 samples. We focus on the CDC methodology here, and compare the accuracy with the different interpolation methods in the figure. Clearly, the KS statistic and also the corresponding p-values for the 7L-CDC schemes are much better than those of the Milstein scheme in Figure 5. This is an indication that the CDFs that originate from the 7L-CDC schemes resemble the target CDF much better, with high confidence. In addition, unlike the Milstein scheme, the 7L-CDC schemes exhibit an almost constant difference between the approximated and target CDFs with increasing time.

Figure 5.
The Kolmogorov–Smirnov test: , with 10,000 samples. When we have a small KS statistic or a large p-value, the hypothesis that the distributions of the two sets of random samples are the same can not be rejected.
We will also analyze the costs of the different interpolation methods within 7L-CDC. The two steps which require interpolation are the computation of the conditional collocation points and the generation of conditional samples. The computational speed of the 7L-CDC scheme depends on the employed interpolation method, see Table 3. In general, to generate a solution with the same strong order in the numerical error, the Milstein scheme will require more computation time, here about 27 s, while the 7L scheme needs 13 s and 7L-CDC (barycentric version) 5 s when . The larger the time step, the more computation time will be saved.

Table 3.
The CPU running time (s) to reach the same accuracy (CPU: E3-1240, 3.40 GHz): simulating 10,000 sample paths until terminal time , based on marginal/conditional SC points. Here, for the 7L scheme, PCHIP is used as the interpolant in Step 3 of Algorithm 1.
Note that, in order to achieve a similar accuracy in the strong sense, the Euler–Maruyama scheme requires a much finer time grid, by a factor of , than the 7L scheme. When is sufficiently large, the 7L-CDC scheme outperforms the Euler–Maruyama scheme, in terms of both accuracy and speed. For example, in Table 3, when , and when . In other words, the “online version” of the Euler–Maruyama discretization is computationally slower than the online phase of the 7L scheme to achieve the same accuracy. Additionally, the computational time of the 7L scheme can be further reduced by parallelization, for example, using GPUs.
5.3. Pathwise Error Convergence
In this section, we compare the pathwise errors of our proposed novel discretization with those of the classical discretization schemes.
5.3.1. GBM Process
We analyze here the strong convergence properties of the new methodology for the GBM process. For GBM, the exact path is given by the expression (36). The random number, which is drawn from , is the same for the exact solution (36), the novel schemes (14) and the Milstein scheme (3). The pathwise differences between the numerical schemes and the exact simulation are plotted in Figure 6. When , the 7L-CDC scheme produces clearly more accurate paths than the Milstein scheme, in terms of the pathwise error with respect to the exact path.

Figure 6.
Paths generated by 7L-CDC: time step , GBM with , , . The paths obtained with Chebyshev interpolation, which are not plotted, are identical to those obtained with Lagrange interpolation in this case.
As shown in Figure 7, the 7L-CDC scheme gives rise to flat, almost constant, strong and weak error convergence curves for many different -values, suggesting a small, constant convergence error even with large time steps . The Milstein scheme has the strong order of convergence , so that a larger time step gives rise to a larger error. When the time step becomes small, more time points are needed to reach a time T, and then the resulting recursive error of the 7L-CDC scheme increases.

Figure 7.
The strong error is estimated as , see Equation (4) and the weak error by , see () for details on the computation of the convergence rate. There are sample paths in total.
The number of conditional collocation points, by which the conditional distribution at a next time point is mostly determined, has a significant contribution to the convergence order of the 7L-CDC scheme. As mentioned, we found empirically that five conditional collocation points are preferable in terms of computing effort versus accuracy. CDC matrix C is then of size ; that is, at each time point, there are five collocation points and each of these has five conditional collocation points.
5.3.2. Ornstein–Uhlenbeck Process
In the case of Markov processes, any SDE which can be solved by the Euler–Maruyama discretization can be solved by our ANN methodology, with improved strong convergence properties. We also wish to confirm the strong convergence properties for another stochastic process in this section.
The mean reverting Ornstein–Uhlenbeck (OU) process () is defined as,
with the long term mean of , the speed of mean reversion, and the volatility. The initial value is , and the model parameters are . Its analytical solution is given by,
with , . Equation (45) is used to compute the reference value for the pathwise error and the strong convergence.
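Since the conditional distribution of the OU process is Gaussian, the exact reference paths can be generated one (arbitrarily large) step at a time; a minimal sketch, with parameter names chosen here purely for illustration, reads:

```python
import numpy as np

def ou_exact_step(y_prev, theta_bar, kappa, sigma, dt, z):
    """Exact transition of dY = kappa*(theta_bar - Y) dt + sigma dW over a step dt,
    with z a standard-normal draw; used as the reference for the pathwise error."""
    mean = theta_bar + (y_prev - theta_bar) * np.exp(-kappa * dt)
    std = sigma * np.sqrt((1.0 - np.exp(-2.0 * kappa * dt)) / (2.0 * kappa))
    return mean + std * z
```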
We employ the same data-driven procedure as for GBM to discretize and solve the OU process. In the training phase, the Euler–Maruyama scheme (2) is used to discretize the OU dynamics and generate the dataset. Note that the Milstein and Euler schemes are identical in the case of the OU process. As the OU process is a Markov process, we again can vary to find the relation between the conditional SC points and the marginal SC points (i.e., as in Equation (28)). Similar to Table 1, we employ five SC points to learn within the ANN, with , , , , see Section 5.1 for the details of the training process.
After the training, the obtained ANNs will be applied to solve the OU process with specific parameters and details of our interest. We provide an example in Figure 8, which confirms that the sample paths generated by 7L-CDC are as accurate as the exact solution, and the error, in the sense of strong convergence, stays close to zero even with a large time step.

Figure 8.
Paths and strong convergence for the OU process, using , , , . The sample paths with barycentric, Chebyshev and PCHIP interpolation overlap for the 7L-CDC scheme. There are five marginal and five conditional SC points at each time point.
5.4. Applications in Finance
The possibility to take large time steps and still get accurate SDE solutions is certainly interesting in computational finance, as there are several financial products that are updated on a daily basis (think of an over-night interest rate), whereas the monitoring of financial contracts and of risk management quantities is typically only done on a weekly, monthly, or even yearly basis. In such situations, our novel scheme will be useful. Research into large time step simulations is state-of-the-art in computational finance, see the exact (and almost exact) Monte Carlo simulation papers, e.g., (); () for the SABR and Heston stochastic volatility asset dynamics, respectively.
5.4.1. The Asian Option
Moreover, the strong convergence property of an SDE discretization is important in many cases. When valuing so-called path-dependent options, for example, improved strong convergence enhances the convergence of a Monte Carlo simulation. Options are governed by their pay-off function (i.e., the option value at the final time of the contract, ). Here we consider a path-dependent exotic option, the so-called European-style Asian option, which has a payoff that is based on a time-averaged underlying stock price. For example, the pay-off of a fixed strike Asian option is given by
where T is the option contract’s expiry time, and is the predetermined strike price. Here, denotes the discrete arithmetic average of the stock prices over monitoring dates ,
where is the observed stock price at time , . Averaging thus takes place in the time-wise direction, and we consider pricing financial options based on the discrete arithmetic average of a number of stock prices.
We assume here that the underlying stock price follows Geometric Brownian motion, as in Equation (35), under the risk-neutral measure, meaning , where r is the risk-free interest rate. There is a cash account , governed by . The value of European-style Asian option is then given by
Because the pay-off is clearly a path-dependent quantity for such options, it is expected that the improved strong convergence obtained with the 7L-CDC variant will result in superior accuracy, as compared to classical numerical discretization schemes.
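A generic Monte Carlo pricer for the fixed-strike arithmetic Asian call can be written against any one-step transition, so the same code can be driven by an exact GBM step, a Milstein step, or the 7L(-CDC) large-step mapping; the `sampler` interface below is an assumption made for this sketch.

```python
import numpy as np

def asian_call_mc(sampler, s0, r, K, T, n_monitor, n_paths, seed=0):
    """Fixed-strike arithmetic Asian call by Monte Carlo.

    sampler(s_prev, dt, z) -> stock prices one monitoring interval later."""
    rng = np.random.default_rng(seed)
    dt = T / n_monitor
    s = np.full(n_paths, s0, dtype=float)
    running_sum = np.zeros(n_paths)
    for _ in range(n_monitor):
        s = sampler(s, dt, rng.standard_normal(n_paths))
        running_sum += s
    payoff = np.maximum(running_sum / n_monitor - K, 0.0)
    return np.exp(-r * T) * payoff.mean()   # discounted risk-neutral expectation
```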
The relative error is presented, which is defined as
where is based on the exact GBM Monte Carlo simulation. As shown in Table 4, the 7L-CDC scheme gives highly accurate Asian option prices, compared to the Milstein scheme. As the accuracy of Asian option prices depends directly on the accuracy of the realized paths, an increasing number of monitoring dates will give rise to higher accuracy by 7L-CDC.

Table 4.
Pricing a European-style Asian option with a fixed strike price, using , , , , the number of sample paths .
Next, we focus on the Asian option’s sensitivity. The sensitivity of the option price with respect to volatility is called Vega, which can be computed in a pathwise fashion (see Chapter 7 in ), as follows,
where N represents the number of grid points over time. The chain rule is employed to derive the sensitivity. First of all, we compute the gradient of the payoff function with respect to the underlying stock price, by
Then, the derivative of the stock price at time with respect to the model parameter, , can be found with the trained ANNs, as given by Equation (33). Vega can be estimated by,
When there are M sample paths, we have,
As shown in Table 1, there are four input arguments in the function (note that the drift term equals the interest rate, i.e., , in the risk-neutral world). Since the previous realization is a function of the model parameters (here r and ), at time , the total derivative of with respect to the volatility in Equation (34) becomes
where is known at the previous time point. Like simulating the Monte Carlo paths, the calculation of this derivative is done iteratively. Figure 9a compares the pathwise sensitivities obtained via Equations (37) and (51). Clearly, the pathwise derivative by the 7L scheme is very similar to the analytical solution. Figure 9b confirms that the ANN methodology computes a highly accurate Asian option Vega by means of the above pathwise sensitivity. Summarizing, the sensitivity with respect to model parameters can highly accurately be obtained from the trained ANNs. As the 7L-CDC scheme is composed of marginal and conditional collocation points, the above procedure of computing the pathwise sensitivity is also applicable to the variant 7L-CDC, by using the chain rule.
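As a sketch of how the ANN contribution to this chain rule can be evaluated with automatic differentiation (TensorFlow is used here for illustration; the input layout [previous realization, r, sigma, dt] is an assumption about the trained model):

```python
import tensorflow as tf

def d_sc_points_d_sigma(model, y_prev, r, sigma, dt):
    """Jacobian of the ANN-predicted conditional SC points with respect to sigma,
    i.e., the ANN part of the pathwise Vega chain rule."""
    sigma_t = tf.constant([[sigma]], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(sigma_t)
        inp = tf.concat([tf.constant([[y_prev, r]], dtype=tf.float32),
                         sigma_t,
                         tf.constant([[dt]], dtype=tf.float32)], axis=1)
        sc_points = model(inp)                  # shape (1, m)
    return tape.jacobian(sc_points, sigma_t)    # one derivative per SC point
```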

Figure 9.
Path-wise estimator of Vega: Exact Vega is calculated by means of the central finite difference. The parameters are , , , , , , .
5.4.2. Bermudan Option Valuation
When dealing with so-called Bermudan options, the option contract holder has the right (but not the obligation) to exercise the option contract at a finite number of pre-specified dates up to final time T. At an exercise date, when the holder decides to exercise the Bermudan option, she immediately obtains the current payoff value of the contract. Alternatively, she may also wait until the next exercise opportunity. The Bermudan option can be exercised at the following set of exercise dates, , with a constant time difference, , for any .
In this experiment, we compare the performance of the new 7L-CDC discretization scheme with a classical scheme. Valuation of the Bermudan option will take place by means of the well-known Longstaff–Schwartz Monte Carlo (LSMC) method (), a least squares Monte Carlo method. The Longstaff–Schwartz algorithm is presented, for convenience, in the Appendix A.
The difference between a large time step simulation and a classical simulation, like the Milstein scheme, is that a classical scheme requires additional time steps to be taken between the early-exercise dates of the Bermudan option, while with the 7L-CDC scheme, we can perform a one-step Monte Carlo simulation between adjacent early-exercise dates, without any intermediate grid points.
We also assume here that the underlying stock price follows Geometric Brownian motion, as in Equation (35), under the risk-neutral measure, with . A Bermudan put option, with risk-free interest rate , pay-off function with strike price and initial stock price , is priced based on Monte Carlo paths. The matrix size within the 7L-CDC scheme is set to . The terminal time is with a constant time step . The random seed is chosen to be zero when drawing random numbers. We compare the relative errors , where is computed with the help of a Monte Carlo method based on the exact simulation of GBM (36).
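For reference, a compact (and simplified) sketch of the Longstaff–Schwartz valuation is given below; the quadratic polynomial regression basis and the array layout are choices made for this illustration, and Algorithm A1 in the appendix describes the full procedure.

```python
import numpy as np

def bermudan_put_lsmc(paths, K, r, dt):
    """Longstaff-Schwartz value of a Bermudan put on pre-simulated paths.

    paths : (n_paths, n_exercise + 1) array; column 0 is S(0), the remaining
            columns hold the stock values at the exercise dates (from the exact,
            Milstein, or 7L-CDC simulation)."""
    n_cols = paths.shape[1]
    cashflow = np.maximum(K - paths[:, -1], 0.0)          # exercise value at T
    for i in range(n_cols - 2, 0, -1):                     # backward over exercise dates
        s = paths[:, i]
        disc_cf = np.exp(-r * dt) * cashflow
        itm = (K - s) > 0.0                                # regress on in-the-money paths only
        if itm.any():
            coeffs = np.polyfit(s[itm], disc_cf[itm], 2)   # continuation-value regression
            continuation = np.polyval(coeffs, s)
            exercise = np.maximum(K - s, 0.0)
            cashflow = np.where(itm & (exercise > continuation), exercise, disc_cf)
        else:
            cashflow = disc_cf
    return np.exp(-r * dt) * cashflow.mean()               # discount from first exercise date
```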
As shown in Table 5, the option prices based on the 7L-CDC Monte Carlo simulation are highly satisfactory, and the related error does not increase with larger time steps . In contrast, a larger time step gives rise to significant pricing errors, in the case of the Milstein discretization.

Table 5.
Bermudan put option prices based on large time step Monte Carlo simulations.
Remark 5.
In principle, a sample value can be any rational number. So, a path value may reach a larger stock price than the prescribed upper bound in Table 1. Stock prices outside the training interval are called outliers. Outliers did not appear in the experiments of Table 5. As an alternative method to avoid the appearance of outliers, one may scale the asset price, to remove the dependence on the initial value. For example, GBM can be scaled by . Using Itô’s lemma, we obtain a drift-less process, where the initial value . The following formula returns the original variable, . In such a case, scaling guarantees a fixed initial value, for example, .
6. Conclusions and Outlook
We developed a data-driven numerical solver for stochastic differential equations, by which large time step simulations can be carried out accurately in the sense of strong convergence. With a combination of artificial neural networks and the stochastic collocation Monte Carlo method, a small number of stochastic collocation points are learned by the ANN to approximate a nonlinear function which can be used to compute the unknown collocation points. Theoretical analysis indicates that the numerical error is controllable and does not increase when the simulation time step increases.
There are several advantages to the proposed approach. The powerful expressive ability of neural networks enables the ANNs to accurately approximate the stochastic collocation points. The compression–decompression method reduces the computational costs, so that the numerical method can be applied in practice. In finance, the proposed big time step methodology will be highly beneficial for the generation of paths for path-dependent financial option contracts or in risk management applications.
As an outlook, it will be relevant to extend the introduced methodology to higher-dimensional or more involved SDE dynamics. We will define multi-dimensional stochastic collocation points (i.e., by means of a tensor) for a multi-dimensional system of SDEs, and choose a Convolutional Neural Network () to efficiently process these collocation points. Of course, this may not trivially generalize to truly high-dimensional systems, but approximation in moderate dimensions should be possible. The computational speed can be further improved by parallel computation, for example, on GPUs. Non-Markovian processes may also be solved with a large time step by the proposed ANN method, where the conditional collocation points depend on past realizations. Fractional Brownian motion () forms a relevant example, which is used for the simulation of rough volatility in finance (). In such a context, advanced variants of fully connected neural networks, e.g., recurrent neural networks (RNNs) or long short-term memory (LSTM) networks (see a review in ), are recommended when approximating the nonlinear transition probability function, for example, Equation (13).
As another outlook, multilevel Monte Carlo (MLMC) methods, as developed by ( , ), form an interesting research direction for our accurate, large time step discretization schemes. It is well known that the strong convergence properties of SDE discretizations impact the efficiency of MLMC methods.
Author Contributions
Writing—original draft preparation, S.L.; SCMC method and research advice, L.A.G.; Supervision, C.W.O. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
S.L. would like to thank the China Scholarship Council (CSC) for the financial support.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Longstaff–Schwartz Algorithm
For convenience, we detail the Longstaff–Schwartz Monte Carlo (LSMC) algorithm here (see Algorithm A1).
Algorithm A1: 7L Scheme Longstaff–Schwartz Algorithm
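Since the typeset algorithm is not reproduced here, the following minimal Python sketch outlines the regression-based backward induction for a Bermudan put on paths that are simulated only at the exercise dates. The exact GBM transition again stands in for the learned 7L/7L-CDC sampler, and the polynomial basis degree is an illustrative choice.

```python
import numpy as np

def lsmc_bermudan_put(s0, strike, r, sigma, T, n_exercise, n_paths, deg=3, seed=0):
    """Longstaff-Schwartz valuation of a Bermudan put on one-step MC paths."""
    rng = np.random.default_rng(seed)
    dt = T / n_exercise
    disc = np.exp(-r * dt)
    # Simulate paths only at the exercise dates (one large step per date);
    # the exact GBM transition stands in for the learned 7L/7L-CDC sampler.
    s = np.empty((n_exercise + 1, n_paths))
    s[0] = s0
    for m in range(n_exercise):
        z = rng.standard_normal(n_paths)
        s[m + 1] = s[m] * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
    payoff = lambda x: np.maximum(strike - x, 0.0)
    value = payoff(s[-1])                      # cashflow at maturity
    # Backward induction over the early-exercise dates.
    for m in range(n_exercise - 1, 0, -1):
        value *= disc                          # discount cashflow to date m
        exercise = payoff(s[m])
        itm = exercise > 0                     # regress on in-the-money paths only
        if itm.any():
            coeff = np.polyfit(s[m, itm], value[itm], deg)
            continuation = np.polyval(coeff, s[m, itm])
            value_itm = value[itm]
            ex_now = exercise[itm] > continuation
            value_itm[ex_now] = exercise[itm][ex_now]
            value[itm] = value_itm
    return disc * value.mean()                 # discount back to t = 0

print(lsmc_bermudan_put(s0=100.0, strike=110.0, r=0.1, sigma=0.3,
                        T=1.0, n_exercise=10, n_paths=100_000))
```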
Note
1. With seven-league boots, we are marching through the time-wise direction, see also https://en.wikipedia.org/wiki/Seven-league_boots, accessed on 1 September 2020.
References
- Bar-Sinai, Yohai, Stephan Hoyer, Jason Hickey, and Michael P. Brenner. 2019. Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences of the United States of America 116: 15344–49.
- Beck, Christian, Sebastian Becker, Philipp Grohs, Nor Jaafari, and Arnulf Jentzen. 2018. Solving stochastic differential equations and Kolmogorov equations by means of deep learning. arXiv arXiv:1806.00421.
- Berrut, Jean-Paul, and Lloyd N. Trefethen. 2004. Barycentric Lagrange Interpolation. SIAM Review 46: 501–17.
- Broadie, Mark, and Özgür Kaya. 2006. Exact simulation of stochastic volatility and other affine jump diffusion processes. Operations Research 54: 217–31.
- Cameron, Robert H., and William T. Martin. 1947. The orthogonal development of nonlinear functionals in series of Fourier-Hermite functionals. Annals of Mathematics 48: 385–92.
- Capriotti, Luca. 2010. Fast Greeks by Algorithmic Differentiation. Journal of Computational Finance 14: 3–35.
- Cybenko, George. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2: 303–14.
- Fritsch, Frederick N., and Ralph E. Carlson. 1980. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis 17: 238–46.
- Gaß, Maximilian, Kathrin Glau, Mirco Mahlstedt, and Maximilian Mair. 2018. Chebyshev interpolation for parametric option pricing. Finance and Stochastics 22: 701–31.
- Gatheral, Jim, Thibault Jaisson, and Mathieu Rosenbaum. 2018. Volatility is rough. Quantitative Finance 18: 933–49.
- Giles, Michael B. 2008. Multilevel Monte Carlo Path Simulation. Operations Research 56: 607–17.
- Giles, Michael B. 2015. Multilevel Monte Carlo methods. Acta Numerica 24: 259–328.
- Giles, Michael B., and Paul Glasserman. 2006. Smoking adjoints: Fast Monte Carlo Greeks. Risk 19: 88–92.
- Glasserman, Paul. 2004. Monte Carlo Methods in Financial Engineering. New York: Springer.
- Glau, Kathrin, Paul Herold, Dilip B. Madan, and Christian Pötz. 2019. The Chebyshev method for the implied volatility. Journal of Computational Finance 23: 1–31.
- Glau, Kathrin, and Mirco Mahlstedt. 2019. Improved error bound for multivariate Chebyshev polynomial interpolation. International Journal of Computer Mathematics 96: 2302–14.
- Glorot, Xavier, and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. Paper presented at the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, May 13–15; pp. 249–56.
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge: MIT Press.
- Grzelak, Lech A. 2019. The collocating local volatility framework—A fresh look at efficient pricing with smile. International Journal of Computer Mathematics 96: 2209–28.
- Grzelak, Lech A., Jeroen Witteveen, Maria Suarez-Taboada, and Cornelis W. Oosterlee. 2019. The stochastic collocation Monte Carlo sampler: Highly efficient sampling from expensive distributions. Quantitative Finance 19: 339–56.
- Han, Jiequn, Arnulf Jentzen, and E. Weinan. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences of the United States of America 115: 8505–10.
- Jain, Shashi, Álvaro Leitao, and Cornelis W. Oosterlee. 2019. Rolling Adjoints: Fast Greeks along Monte Carlo scenarios for early-exercise options. Journal of Computational Science 33: 95–112.
- Karatzas, Ioannis, and Steven E. Shreve. 1988. Brownian Motion and Stochastic Calculus. New York: Springer.
- Kingma, Diederik P., and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv arXiv:1412.6980.
- LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521: 436–44.
- Leitao, Álvaro, Lech A. Grzelak, and Cornelis W. Oosterlee. 2017. On a one time-step Monte Carlo simulation approach of the SABR model: Application to European options. Applied Mathematics and Computation 293: 461–79.
- Li, Xingjie, Fei Lu, and Felix X. F. Ye. 2021. ISALT: Inference-based schemes adaptive to large time-stepping for locally Lipschitz ergodic systems. arXiv arXiv:2102.12669.
- Longstaff, Francis A., and Eduardo S. Schwartz. 2001. Valuing American Options by Simulation: A Simple Least-Squares Approach. The Review of Financial Studies 14: 113–47.
- Mandelbrot, Benoit B., and John W. Van Ness. 1968. Fractional Brownian Motions, Fractional Noises and Applications. SIAM Review 10: 422–37.
- Milstein, Grigori N. 1975. Approximate integration of stochastic differential equations. Theory of Probability and Its Applications 19: 557–62.
- Montanelli, Hadrien, and Qiang Du. 2019. New Error Bounds for Deep ReLU Networks Using Sparse Grids. SIAM Journal on Mathematics of Data Science 1: 78–92.
- Nwankpa, Chigozie, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. 2018. Activation Functions: Comparison of trends in Practice and Research for Deep Learning. arXiv arXiv:1811.03378.
- Oosterlee, Cornelis W., and Lech A. Grzelak. 2019. Mathematical Modeling and Computation in Finance. Singapore: World Scientific.
- Platen, Eckhard. 1999. An introduction to numerical methods for stochastic differential equations. Acta Numerica 8: 197–246.
- Raissi, Maziar, Paris Perdikaris, and George E. Karniadakis. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378: 686–707.
- Risken, Hannes. 1984. The Fokker-Planck Equation: Methods of Solution and Applications. Springer Series in Synergetics; New York: Springer.
- Rivlin, Theodore J. 1990. Chebyshev Polynomials: From Approximation Theory to Algebra and Number Theory. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts; Hoboken: Wiley.
- Sirignano, Justin, and Konstantinos Spiliopoulos. 2018. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics 375: 1339–64.
- Uhlenbeck, George E., and Leonard Ornstein. 1930. On the Theory of the Brownian Motion. Physical Review 36: 823–41.
- Xie, You, Erik Franz, Mengyu Chu, and Nils Thuerey. 2018. TempoGAN: A Temporally Coherent, Volumetric GAN for Super-Resolution Fluid Flow. ACM Transactions on Graphics 37: 1–15.
- Yarotsky, Dmitry. 2017. Error bounds for approximations with deep ReLU networks. Neural Networks 94: 103–14.
- Yu, Yong, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. 2019. A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures. Neural Computation 31: 1235–70.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).