Quasi-Regression Monte-Carlo Method for Semi-Linear PDEs and BSDEs

Abstract: In this work we design a novel and efficient quasi-regression Monte Carlo algorithm to approximate the solution of discrete-time backward stochastic differential equations (BSDEs), and we analyze the convergence of the proposed method. To tackle problems in high dimensions, we propose suitable projections of the solution and efficient parallelizations of the algorithm, taking advantage of powerful many-core processors such as graphics processing units (GPUs).


Introduction
In this work we are interested in numerically approximating the solution (X, Y, Z) of a decoupled forward-backward stochastic differential equation of the form

$$X_t = x_0 + \int_0^t b(s, X_s)\,\mathrm{d}s + \int_0^t \sigma(s, X_s)\,\mathrm{d}W_s, \qquad Y_t = g(X_T) + \int_t^T f(s, X_s, Y_s)\,\mathrm{d}s - \int_t^T Z_s\,\mathrm{d}W_s. \tag{1}$$

The terminal time T > 0 is fixed. These equations are considered in a filtered probability space (Ω, F, P, (F_t)_{0≤t≤T}) supporting a q ≥ 1 dimensional Brownian motion W. In this representation, X is a d-dimensional adapted continuous process (called the forward component), Y is a scalar adapted continuous process and Z is a q-dimensional progressively measurable process. Regarding terminology, g(X_T) is called the terminal condition and f the driver.
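As a minimal illustration of this setup, the forward component X can be discretized by an Euler scheme; the following one-dimensional Python sketch uses hypothetical Ornstein-Uhlenbeck-type coefficients b and σ chosen purely for illustration (none of these names come from the paper).

```python
import numpy as np

def euler_paths(x0, b, sigma, T, N, M, rng):
    """Simulate M Euler paths of dX = b(t,X) dt + sigma(t,X) dW on [0,T].

    b and sigma are illustrative coefficient functions; the sketch keeps
    d = q = 1 so that each path is a single row of the output array.
    """
    dt = T / N
    X = np.empty((M, N + 1))
    X[:, 0] = x0
    for j in range(N):
        t = j * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=M)  # Brownian increments
        X[:, j + 1] = X[:, j] + b(t, X[:, j]) * dt + sigma(t, X[:, j]) * dW
    return X

rng = np.random.default_rng(0)
# Mean-reverting drift and constant diffusion, chosen only for illustration.
X = euler_paths(x0=0.5, b=lambda t, x: -x, sigma=lambda t, x: 0.3,
                T=1.0, N=50, M=10_000, rng=rng)
```

For this toy drift, the empirical mean of X at time T should stay close to x0·e^{-T}, which gives a quick sanity check of the discretization.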

Results
Our aim is to solve the discrete-time dynamic programming equation (DPE)

$$Y_i = \mathbb{E}\Big[\, g(X_N) + \sum_{j=i}^{N-1} f_j\big(X_j, Y_{j+1}\big)\,\Delta_j \;\Big|\; \mathcal{F}_{t_i} \Big], \qquad i = N-1, \dots, 0, \tag{3}$$

where (X_i)_{0≤i≤N} denotes the Euler scheme of X on a time grid 0 = t_0 < ... < t_N = T with steps Δ_j := t_{j+1} − t_j, and where f_j(x, y) := f(t_j, x, y), f being the driver in (1). In other words, our subsequent scheme will approximate the solutions to (3) and to the semi-linear parabolic PDE

$$\partial_t u(t, x) + \mathcal{A} u(t, x) + f\big(t, x, u(t, x)\big) = 0 \quad \text{for } t < T, \qquad u(T, \cdot) = g(\cdot), \tag{5}$$

where A denotes the infinitesimal generator of the forward diffusion X.
One important observation is that, due to the Markov property of the Euler scheme, for every i there exist measurable deterministic functions y_i : R^d → R such that Y_i = y_i(X_i), almost surely. A second crucial observation is that the value functions y_i(·) are independent of how we initialize the forward component; our subsequent algorithm takes advantage of this. For instance, let X_i^i be a random variable in R^d with some distribution ν and, for j ≥ i, let X_j^i be the Euler-scheme evolution started from X_i^i at time t_i; it writes

$$X_{j+1}^i = X_j^i + b\big(t_j, X_j^i\big)\,\Delta_j + \sigma\big(t_j, X_j^i\big)\,\Delta W_j, \qquad i \le j \le N-1.$$

This flexibility property w.r.t. the initialization then writes

$$y_i(x) = \mathbb{E}\Big[\, g\big(X_N^i\big) + \sum_{j=i}^{N-1} f_j\big(X_j^i, y_{j+1}(X_{j+1}^i)\big)\,\Delta_j \;\Big|\; X_i^i = x \Big] \quad \text{for } \nu\text{-a.e. } x. \tag{7}$$

Approximating the solution to (3) is actually achieved by approximating the functions y_i(·). In this way, we are directly approximating the solution to the semi-linear PDE (5). In order to better control the truncation error, we define a weighted modification of y_i by

$$y_i^{(q)}(x) := \frac{y_i(x)}{(1 + |x|^2)^{q/2}};$$

for q = 0, y_i^{(q)} and y_i coincide. The previous DPE (7) becomes

$$y_i^{(q)}(x) = \mathbb{E}\Big[\, \frac{g\big(X_N^i\big) + \sum_{j=i}^{N-1} f_j\big(X_j^i, y_{j+1}(X_{j+1}^i)\big)\,\Delta_j}{(1 + |X_i^i|^2)^{q/2}} \;\Big|\; X_i^i = x \Big].$$

The introduction of the polynomial factor (1 + |X_i^i|^2)^{q/2} gives higher flexibility in the error analysis: it ensures that y_i^{(q)} decreases faster at infinity, which will provide nicer estimates on the approximation error when dealing with the Fourier basis.
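The weighted responses behind this DPE can be sketched in a few lines of Python. The sketch below assumes the multi-step representation above, with the (unknown) continuation functions y_{j+1} replaced by a hypothetical known function purely for illustration; all names here are illustrative, not the paper's implementation.

```python
import numpy as np

def weighted_responses(X, dt, f, g, y_next, q):
    """Compute the weighted response S_i for each simulated path (row of X).

    X[:, 0] holds samples of X_i^i ~ nu and X[:, j] the Euler evolution at
    time t_{i+j}; f(t, x, y) is the driver, g the terminal condition,
    y_next a stand-in for the unknown continuation functions y_{j+1},
    and q >= 0 the weight exponent (d = 1 to keep the sketch minimal).
    """
    N = X.shape[1] - 1
    S = g(X[:, -1])                       # terminal condition g(X_N^i)
    for j in range(N):                    # accumulate the driver terms
        t = j * dt
        S += f(t, X[:, j], y_next(X[:, j + 1])) * dt
    # Divide by the polynomial weight evaluated at the initial point.
    return S / (1.0 + X[:, 0] ** 2) ** (q / 2.0)

rng = np.random.default_rng(1)
X = np.cumsum(rng.normal(0.0, 0.1, size=(5000, 11)), axis=1)   # toy paths
S = weighted_responses(X, dt=0.1, f=lambda t, x, y: -0.5 * y,
                       g=np.cos, y_next=np.cos, q=2)
```

Averaging such responses conditionally on X_i^i = x would recover y_i^{(q)}(x); the next section replaces that conditional expectation by a projection on an orthonormal basis.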
Then we define proper basis functions φ_k which satisfy orthogonality properties in R^d and which span an L^2 space. It turns out that the measure defining this L^2 space has to coincide with the sampling measure of X_i^i ∼ ν. Our strategy for defining such basis functions is to start from the trigonometric basis on [0, 1]^d and then to apply appropriate changes of variable: later, this transform will allow us to easily quantify the approximation error when truncating the basis. Using the notation

$$S_i\big(X_{i:N}^i\big) := \frac{g\big(X_N^i\big) + \sum_{j=i}^{N-1} f_j\big(X_j^i, y_{j+1}(X_{j+1}^i)\big)\,\Delta_j}{(1 + |X_i^i|^2)^{q/2}},$$

we can rewrite the exact solution as

$$y_i^{(q)}(x) = \mathbb{E}\big[\, S_i\big(X_{i:N}^i\big) \;\big|\; X_i^i = x \big].$$

Under mild conditions on f, g and ν, S_i(X_{i:N}^i) is square-integrable, and therefore

$$y_i^{(q)}(x) = \sum_{k \in \mathbb{N}^d} \alpha_{i,k}^{(q)}\,\phi_k(x)$$

for some coefficients (α_{i,k}^{(q)} : k ∈ N^d). Using the orthonormality property of the basis functions φ_k,

$$\alpha_{i,k}^{(q)} = \mathbb{E}\big[\, \phi_k\big(X_i^i\big)\, S_i\big(X_{i:N}^i\big) \big],$$

and these coefficients are estimated, on a finite set Γ of multi-indices, by their empirical counterparts: for all k ∈ Γ,

$$\bar{\alpha}_{i,k}^{(q)} := \frac{1}{M} \sum_{m=1}^{M} \phi_k\big(X_i^{i,m}\big)\, S_i\big(X_{i:N}^{i,m}\big),$$

where (X_{i:N}^{i,m})_{1 ≤ m ≤ M} are independent simulated paths.
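To make this quasi-regression step concrete, the following one-dimensional sketch estimates the coefficients of a known function against the orthonormal cosine basis of L^2([0, 1]) by plain Monte Carlo averaging, with ν uniform; the target function and truncation level are illustrative only, standing in for the responses S_i.

```python
import numpy as np

def cosine_basis(k, x):
    """Orthonormal cosine basis of L^2([0,1], dx): phi_0 = 1, phi_k = sqrt(2) cos(k pi x)."""
    return np.ones_like(x) if k == 0 else np.sqrt(2.0) * np.cos(k * np.pi * x)

def quasi_regression(h, K, M, rng):
    """Estimate alpha_k = E[phi_k(U) h(U)], U ~ nu = Uniform(0,1), by averaging.

    No linear system is solved: orthonormality of the basis w.r.t. the
    sampling measure reduces the regression to plain Monte Carlo means.
    """
    U = rng.random(M)     # samples of X_i^i ~ nu
    S = h(U)              # responses (an exact function here, for illustration)
    return np.array([np.mean(cosine_basis(k, U) * S) for k in range(K)])

rng = np.random.default_rng(2)
# Target chosen so its exact coefficients are [1, 1, 0, 0, ...].
h = lambda x: 1.0 + np.sqrt(2.0) * np.cos(np.pi * x)
alpha = quasi_regression(h, K=4, M=200_000, rng=rng)
```

Because the sampling measure matches the measure of orthonormality, each coefficient is an unbiased Monte Carlo mean; no normal-equations solve (and hence no matrix inversion) is needed, which is what makes the scheme attractive on GPUs.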

Discussion
An implementation on GPUs of the GQRMDP algorithm is proposed. It comprises two kernels: the first simulates the paths of the forward process and computes the associated responses; the second computes the regression coefficients (α_{i,k}^{(q)}, k ∈ Γ). In the first kernel, the initial value of each simulated path of the forward process is stored in a device vector in global memory, to be read later by the second kernel. In order to minimize the number of memory transactions and therefore maximize performance, all accesses to global memory have been implemented in a coalesced way. The random numbers needed for the path generation of the forward process were generated on the fly (inline generation), taking advantage of the NVIDIA cuRAND library [1] and the MRG32k3a generator proposed by L'Ecuyer in [2]; inside this kernel the random number generator is thus called as needed. Another approach would be to pre-generate the random numbers in a separate preliminary kernel, store them in GPU global memory and read them back from device memory in the next kernel. Both alternatives have advantages and drawbacks; in this work we have chosen inline generation, since this option is faster and saves global memory. Besides, register spilling was not observed in the implementation, and the quality of the obtained solutions is similar to the accuracy of traditional sequential CPU solutions obtained with more complex random number generators. In the second kernel, in order to compute the regression coefficients, a parallelization not only over the multi-indices k ∈ Γ but also over the simulations 1 ≤ m ≤ M was proposed: blocks of threads parallelize the outer loop over k ∈ Γ, whilst the threads inside each block carry out in parallel the inner loop traversing the vectors of the responses and the simulations.
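In serial terms, the two-level parallelization of the second kernel (blocks over k ∈ Γ, threads reducing over the simulations m) amounts to a matrix-vector product between the basis evaluations and the response vector. The NumPy sketch below mirrors that structure with illustrative names; it is a model of the data layout, not the CUDA implementation itself.

```python
import numpy as np

def coefficients_kernel(phi_vals, S):
    """Mimic the second GPU kernel: one 'block' per multi-index k, with the
    per-block reduction over the M simulations expressed as a dot product.

    phi_vals : (K, M) array, phi_vals[k, m] = phi_k(X_i^{i,m})
    S        : (M,)   array of responses S_i(X_{i:N}^{i,m})
    """
    M = S.shape[0]
    # Each row-times-vector product is the reduction a thread block performs.
    return phi_vals @ S / M

rng = np.random.default_rng(3)
M, K = 100_000, 8
U = rng.random(M)                        # samples of X_i^i, uniform for simplicity
# Cosine basis evaluations, laid out row-wise as in the (K, M) kernel grid.
phi_vals = np.vstack([np.ones(M)] +
                     [np.sqrt(2.0) * np.cos(k * np.pi * U) for k in range(1, K)])
S = np.sin(np.pi * U)                    # illustrative responses
alpha = coefficients_kernel(phi_vals, S)
```

Laying the basis evaluations out row-wise, as here, is what makes the per-block memory traversals contiguous, matching the coalesced-access strategy described above.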

Conflicts of Interest:
The authors declare no conflict of interest.