Next Article in Journal
Stochastic Mortality Modelling for Dependent Coupled Lives
Previous Article in Journal
Portfolio Optimization under Correlation Constraint
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Assessing Asset-Liability Risk with Neural Networks

RiskLab, ETH Zurich, 8092 Zurich, Switzerland
*
Author to whom correspondence should be addressed.
Risks 2020, 8(1), 16; https://doi.org/10.3390/risks8010016
Submission received: 30 November 2019 / Revised: 31 January 2020 / Accepted: 4 February 2020 / Published: 9 February 2020

Abstract

:
We introduce a neural network approach for assessing the risk of a portfolio of assets and liabilities over a given time period. This requires a conditional valuation of the portfolio given the state of the world at a later time, a problem that is particularly challenging if the portfolio contains structured products or complex insurance contracts which do not admit closed form valuation formulas. We illustrate the method on different examples from banking and insurance. We focus on value-at-risk and expected shortfall, but the approach also works for other risk measures.

1. Introduction

Different financial risk management problems require an assessment of the risk of a portfolio of assets and liabilities over a given time period. Banks, mutual funds and hedge funds usually calculate value-at-risk numbers for different risks over a day, a week or longer time periods. Under Solvency II, insurance companies have to calculate one-year value-at-risk, whereas the Swiss Solvency Test demands a computation of one-year expected shortfall. A determination of these risk figures requires a conditional valuation of the portfolio given the state of the world at a later time, called risk horizon. This is particularly challenging if the portfolio contains structured products or complicated insurance policies which do not admit closed form valuation formulas. In theory, the problem can be approached with nested simulation, which is a two-stage procedure. In the outer stage, different scenarios are generated to model how the world could evolve until the risk horizon, whereas in the inner stage, cash flows occurring after the risk horizon are simulated to estimate the value of the portfolio conditionally on each scenario; see, e.g., Lee (1998), Glynn and Lee (2003), Gordy and Juneja (2008), Broadie et al. (2011) or Bauer et al. (2012). While nested simulation can be shown to converge for increasing sample sizes, it is often too time-consuming to be useful in practical applications. A more pragmatic alternative, usually used for short risk horizons, is the delta-gamma method, which approximates the portfolio loss with a second order Taylor polynomial; see, e.g., Rouvinez (1997), Britten-Jones and Schaefer (1999) or Duffie and Pan (2001). If first and second derivatives of the portfolio with respect to the underlying risk factors are accessible, the method is computationally efficient, but its accuracy depends on how well the second order Taylor polynomial approximates the true portfolio loss. Similarly, the replicating portfolio approach approximates future cashflows with a portfolio of liquid instruments that can be priced efficiently; see e.g., Wüthrich (2016), Pelsser and Schweizer (2016), Natolski and Werner (2017) or Cambou and Filipović (2018). Building on work on American option valuation (see e.g., Carriere 1996; Longstaff and Schwartz 2001; Tsitsiklis and Van Roy 2001), Broadie et al. (2015) as well as Ha and Bauer (2019) have proposed to regress future cash flows on finitely many basis functions depending on state variables known at the risk horizon. This gives good results in a number of applications. However, typically, for it to work well, the basis functions have to be chosen well-adapted to the problem.
In this paper we use a neural network approach to approximate the value of the portfolio at the risk horizon. Since our goal is to estimate tail risk measures such as value-at-risk and expected shortfall, we employ importance sampling when training the networks and estimating the risk figures. In addition, we try different regularization techniques and test the adequacy of the neural network approximations using the defining property of the conditional expectation. Neural networks have also been used for the calculation of solvency capital requirements by Hejazi and Jackson (2017), Fiore et al. (2018) and Castellani et al. (2019). But they all exclusively focus on value-at-risk, do not make use of importance sampling and apply neural networks in a slightly different way. Hejazi and Jackson (2017) develop a neural network interpolation scheme within a nested simulation framework. Fiore et al. (2018) and Castellani et al. (2019) both use reduced-size nested Monte Carlo to generate training samples for the calibration of the neural network. Here, we directly regress future cash-flows on a neural network without producing training samples.
The remainder of the paper is organized as follows. In Section 2, we set up the underlying risk model, introduce the importance sampling distributions we use in the implementation of our method and recall how value-at-risk and expected shortfall can be estimated from a finite sample of simulated losses. Section 3 discusses the training, validation and testing of our network approximation of the conditional valuation functional. In Section 4 we illustrate our approach on three typical risk calculation problems: the calculation of risk capital for a single put option, a portfolio of different call and put options and a variable annuity contract with guaranteed minimum income benefit. Section 5 concludes the paper.

2. Asset-Liability Risk

We denote the current time by 0 and are interested in the value of a portfolio of assets and liabilities at a given risk horizon τ > 0 . Suppose all relevant events taking place until time τ are described by a d-dimensional random vector X = ( X 1 , , X d ) defined on a measurable space ( Ω , F ) that is equipped with two equivalent probability measures, P and Q . We think of P as the real-world probability measure and use it for risk measurement. Q is a risk-neutral probability measure specifying the time- τ value of the portfolio through
V = v ( X ) + E Q i = 1 I N τ N t i C t i | X ,
where v is a measurable function from R d to R describing the part of the portfolio whose value is directly given by the underlying risk factors X 1 , , X d ; C t 1 , , C t I are random cash flows occurring at times τ < t 1 < < t I ; and N τ , N t 1 , , N t I model the evolution of a numeraire process used for discounting.
Our goal is to compute ρ ( L ) for a risk measure ρ and the time τ net liability L = V , which can be written as
L = E Q Y X
for
Y = v ( X ) i = 1 I N τ N t i C t i .
Our approach works for a wide class of risk measures ρ . But for the sake of concreteness, we concentrate on value-at-risk and expected shortfall. We follow the convention of McNeil et al. (2015) and define value-at-risk at a level α ( 0 , 1 ) , such as 0.95 , 0.99 or 0.995 , as the left α -quantile
VaR α ( L ) : = min x R : P [ L x ] α = sup x R : P [ L x ] > 1 α
and expected shortfall as
ES α ( L ) : = 1 1 α α 1 VaR u ( L ) d u .
There exist different definitions of VaR and ES in the literature. But if the distribution function F L of L is continuous and strictly increasing, they are all equivalent1.

2.1. Conditional Expectations as Minimizing Functions

Let P Q be the probability measure on F given by
P Q [ A ] : = R d Q [ A X = x ] π ( d x ) ,
where π is the distribution of X under P and Q [ A X = x ] is a regular conditional version of Q given X. We assume that Y belongs to L 2 ( Ω , F , P Q ) . Then L is of the form L = l ( X ) for a measurable function l : R d R minimizing the mean squared distance
E P Q l ( X ) Y 2 ;
see, e.g., Bru and Heinich (1985). Note that l minimizes (3) if and only if
l ( x ) = arg min u R R ( u y ) 2 Q [ Y d y X = x ] for   π almost   all   x R d ,
where Q [ Y d y X = x ] is a regular conditional Q -distribution of Y given X. This shows that l is unique up to π -almost sure equality and can alternatively be characterized by
l ( x ) = arg min u R R ( u y ) 2 Q [ Y d y X = x ] for   ν almost   all   x R d
for any probability measure ν on R d that is equivalent to π . In particular, if  P ν is the probability measure on σ ( X ) under which X has distribution ν , and Y is in L 2 ( Ω , F , P ν Q ) , then l can be determined by minimizing
E P ν Q l ( X ) Y 2
instead of (3). Since we are going to approximate the expectation (4) with averages of Monte Carlo samples, this will give us some flexibility in simulating ( X , Y ) .

2.2. Monte Carlo Estimation of Value-at-Risk and Expected Shortfall

Let X 1 , , X n be independent P -simulations of X. By X ( 1 ) , , X ( n ) we denote the same sample reordered so that
L ( 1 ) = l ( X ( 1 ) ) L ( n ) = l ( X ( n ) ) .
To obtain P -simulation estimates of VaR α ( L ) and ES α ( L ) , we apply the VaR and ES definitions (1)–(2) to the empirical measure 1 n i = 1 n δ L ( i ) . This yields
VaR ^ α ( n ) : = L ( j ) and ES ^ α ( n ) : = 1 1 α i = 1 j 1 L ( i ) n + 1 j 1 ( 1 α ) n L ( j ) ,
where
j = min i 1 , , n : i / n > 1 α ;
see also McNeil et al. (2015). It is well known that if F L is differentiable at VaR α ( L ) with F L ( VaR α ( L ) ) > 0 , then VaR ^ α ( n ) is an asymptotically normal consistent estimator of order n 1 / 2 ; see, e.g., David and Nagaraja (2003). If in addition, L is square-integrable, the same is true for ES ^ α ( n ) ; see Zwingmann and Holzmann (2016).

2.3. Importance Sampling

Assume now that X is of the form X = h ( Z ) for a transformation h : R k R d and a k-dimensional random vector Z with density f : R k R + . To give more weight to important outcomes, we introduce a second density g on R k satisfying g ( z ) > 0 for all z R k , where f ( z ) > 0 . Let Z g be a k-dimensional random vector with density g and Z g , 1 , , Z g , n independent simulations of Z g . We reorder the simulations such that
L g , ( 1 ) = l h ( Z g , ( 1 ) ) L g , ( n ) = l h ( Z g , ( n ) )
and consider the random weights
w ( i ) = f ( Z g , ( i ) ) n g ( Z g , ( i ) ) .
Since f ( Z g ) / g ( Z g ) is integrable, one obtains from the law of large numbers that for every threshold x R ,
i = 1 n 1 l h ( Z g , ( i ) ) x w ( i )
is an unbiased consistent estimator of the exceedance probability
P [ L x ] = P l h ( Z ) x = E 1 l h ( Z g ) x f ( Z g ) g ( Z g ) .
If in addition, g can be chosen so that f ( Z g ) / g ( Z g ) is square-integrable, it follows from the central limit theorem that (6) is asymptotically normal with a standard deviation of order n 1 / 2 . In any case, i = 1 n w ( i ) converges to 1 and the random measure i = 1 n w ( i ) δ l h ( Z g , ( i ) ) approximates the distribution of L. Accordingly, we adapt the P -simulation estimators (5) by replacing i / n with m = 1 i w ( m ) . This yields the g-simulation estimates
VaR ^ α g ( n ) : = L g , ( j ) and ES ^ α g ( n ) : = 1 1 α i = 1 j 1 w ( i ) L g , ( i ) + 1 1 1 α i = 1 j 1 w ( i ) L g , ( j ) ,
where
j = min i 1 , , n : m = 1 i w ( m ) > 1 α .
If the α -quantile l α of L = l h ( Z ) were known, the exceedance probability P L l α could be estimated by means of (6) with x = l α . To make the procedure efficient, one would try to find a density g on R k from which it is easy to sample and such that the variance
Var 1 l h ( Z g ) l α f ( Z g ) g ( Z g ) = E 1 l h ( Z g ) l α f ( Z g ) 2 g ( Z g ) 2 E 1 l h ( Z g ) l α f ( Z g ) g ( Z g ) 2 ,
becomes as small as possible. Since
E 1 l h ( Z g ) l α f ( Z g ) g ( Z g ) = E 1 l h ( Z ) l α = P [ L l α ]
does not depend on g, it can be seen that g is a good importance sampling (IS) density if
E 1 l h ( Z g ) l α f ( Z g ) 2 g ( Z g ) 2 = E 1 l h ( Z ) l α f ( Z ) g ( Z )
is small. We use the same criterion as a basis to find a good IS density for estimating VaR α and ES α ; see, e.g., Glasserman (2003) for more background on importance sampling.

2.4. The Case of an Underlying Multivariate Normal Distribution

In many applications X can be modelled as a deterministic transformation of a random vector with a multivariate normal distribution. Examples are included in Section 4 below. In this case, it can be written as
X = u ( A Z )
for a function u : R p R d , a  p × k -matrix A and a k-dimensional random vector Z with a standard normal density f. To keep the problem tractable, we look for a suitable IS density g in the class of N k ( m , I k ) -densities f m with different mean vectors m R k and covariance matrix equal to the k-dimensional identity matrix I k . To determine a good choice of m, let v R p be the vector with components
v i = 1 if   l u   is   increasing   i n   y i 1 if   l u   is   decreasing   i n   y i 0 else .
If A T v = 0 , we choose m = 0 . Otherwise, v T A Z / A T v 2 is standard normal. Denote its α -quantile by z α . Then Z falls into the region
D = z R k : v T A z / A T v 2 z α
with probability 1 α , and  l u ( A z ) tends to be large for z D . We choose m as the maximizer of f ( z ) subject to z D , which leads to
m = A T v A T v 2 z α .
It is easy to see that this yields
f ( z ) < f m ( z ) for   all   z D .
So, if D is sufficiently similar to the region z R k : l u ( A z ) l α , it can be seen from (7) that this choice of IS density will yield a reduction in variance.

3. Neural Network Approximation

Usually, the distribution of the risk factor vector X is assumed to be known. For instance, in all our examples in Section 4, they are of the form (8). On the other hand, in many real-world applications, there is no closed form expression for the loss function l : R d R mapping X to L. Therefore, we approximate L = l ( X ) with l θ ( X ) for a neural network l θ : R d R ; see e.g., Goodfellow et al. (2016). We concentrate on feedforward neural networks of the form
l θ = ψ a J θ φ a J θ φ a 1 θ ,
where
  • q 0 = d , q J = 1 , and  q 1 , , q J 1 are the numbers of nodes in the hidden layers 1 , , J 1 ;
  • a j θ are affine functions of the form a j θ ( x ) = A j x + b j for matrices A j R q j × q j 1 and vectors b j R q j , for j = 1 , , J ;
  • φ is a non-linear activation function used in the hidden layers and applied component-wise. In the examples in Section 4 we choose φ = tanh ;
  • ψ is the final activation function. For a portfolio of assets and liabilities a natural choice is ψ = i d . To keep the presentation simple, we will consider pure liability portfolios with loss L > 0 in all our examples below. Accordingly, we choose ψ = exp .
The parameter vector θ consists of the components of the matrices A j and vectors b j , j = 1 , , J . So it lives in R q for q = j = 1 J q j ( q j 1 + 1 ) . It remains to determine the architecture of the network (that is, J and q 1 , , q J 1 ) and to calibrate θ R q . Then VaR and ES figures can be estimated as described in Section 2 by simulating l θ ( X ) .

3.1. Training and Validation

In a first step we take the network architecture (J and q 1 , , q J 1 ) as given and try to find a minimizer of θ E ( l θ ( X ) Y ) 2 , where the expectation is either with respect to P Q or P ν Q for an IS distribution ν on R d . To do that we simulate realizations ( X m , Y m ) , m = 1 , , M 1 + M 2 , of  ( X , Y ) under the corresponding distribution. The first M 1 simulations are used for training and the other M 2 for validation. More precisely, we employ a stochastic gradient descent method to minimize the Monte Carlo approximation based on the training samples
1 M 1 m = 1 M 1 l θ ( X m ) Y m 2
of E ( l θ ( X ) Y ) . At the same time we use the validation samples to check whether
1 M 2 m = M 1 + 1 M 1 + M 2 l θ ( X m ) Y m 2
is decreasing as we are updating θ .

3.1.1. Regularization through Tree Structures

If the number q of parameters is large, one needs to be careful not to overfit the neural network. For instance, in the extreme case, the  network could be so flexible that it can bring (9) down to zero even in cases where the true conditional expectation l ( X ) = E Q [ Y X ] is not equal to Y. To prevent this, one can generate the training samples by first simulating N 1 realizations X i of X and then for every X i , drawing N 2 simulations Y i , j from the conditional distribution of Y given X i . In the simple example of Section 4.1, we chose N 2 = 1 . In Section 4.2 and Section 4.3 we used N 2 = 5 .

3.1.2. Stochastic Gradient Descent

In principle, one can use any numerical method to minimize (9). But  stochastic gradient descent methods have proven to work well for neural networks. We refer to  Ruder (2016) for an overview of different (stochastic) gradient descent algorithms. Here, we randomly2 split the M 1 training samples into b mini-batches of size B. Then we update θ based on the θ -gradients of
1 B m = ( i 1 ) B + 1 i B l θ ( X m ) Y m 2 , i = 1 , , b .
We use batch normalization and Adam updating with the default values from TensorFlow.
After b gradient steps, all of the training data have been used once and the first epoch is complete. For further epochs, we reshuffle the training data, form new mini-batches and perform b more gradient steps. The procedure is repeated until the training error (9) stops to decrease or the validation error (10) starts to increase.

3.1.3. Initialization

We follow standard practice and initialize the parameters of the network randomly. The final operation of the network is
x ψ ( A J x + b J ) .
Since the network tries to approximate Y, and in all our examples below we use ψ = exp , we initialize the last bias as
b J 0 = log 1 M 1 m = 1 M 1 Y m .
For the other parameters we use Xavier initialization; see Glorot and Bengio (2010).

3.2. Backtesting the Network Approximation

After having determined an approximate minimizer θ R q for a given network architecture, one can test the adequacy of the approximation l θ ( X ) of the true conditional expectation l ( X ) = E Q [ Y X ] . The quality of the approximation depends on different aspects:
(i)
Generalization error:
The true conditional expectation E Q Y X is of the form l ( X ) for the unique3 measurable function l : R d R minimizing the mean squared distance E l ( X ) Y 2 . To approximate l we choose a network architecture and try to find a θ R q that minimizes the empirical squared distance
1 M 1 m = 1 M 1 l θ ( X m ) Y m 2 .
But if the samples ( X m , Y m ) , m = 1 , , M 1 , do not represent the distribution of ( X , Y ) well, (11) might not be a good approximation of the true expectation E l θ ( X ) Y 2 .
(ii)
Numerical minimization method:
The minimization of (11) usually is a complex problem, and one has to employ a numerical method to find an approximate solution θ . The  quality of l θ ( X ) will depend on the performance of the numerical method being used.
(iii)
Network architecture:
It is well known that feedforward neural networks with one hidden layer have the universal approximation property; see, e.g., Cybenko (1989); Hornik et al. (1989) or Leshno et al. (1993). That is, they can approximate any continuous function uniformly on compacts to any degree of accuracy if the activation function is of a suitable form and the hidden layers contain sufficiently many nodes. As a consequence, E ( l θ ( X ) l ( X ) ) 2 can be made arbitrarily small if the hidden layer is large enough and θ is chosen appropriately. However, we do not know in advance how many nodes we need. And moreover, feedforward neural networks with two or more hidden layers have shown to yield better results in different applications.
Since we simulate from an underlying model, we are able to choose the size M 1 of the training sample large and train extensively. In addition, for any given network architecture, we also evaluate the empirical squared distance (11) on the validation set ( X m , Y m ) , m = M 1 + 1 , , M 2 . So we suppose the generalization error is small and our numerical method finds a good approximate minimizer θ of (11). However, since we do not know whether a given network architecture is flexible enough to provide a good approximation to the true loss function l, we test for each trained network whether it satisfies the defining properties of a conditional expectation.
The loss function l : R d R is characterized by
E [ l ( X ) ξ ( X ) ] = E [ Y ξ ( X ) ]
for all measurable functions ξ : R d R such that Y ξ ( X ) is integrable. Ideally, we would like l θ to satisfy the same condition. But there will be an approximation error, and (12) cannot be checked for all measurable functions ξ : R d R satisfying the integrability condition. Therefore, we select finitely many measurable subsets B i R d , i = 1 , , I . Then we generate M 3 more samples ( X m , Y m ) of ( X , Y ) and test whether the differences
(a)
m = M 2 + 1 M 1 + M 2 + M 3 l θ ( X m ) Y m / M 3
(b)
m = M 2 + 1 M 1 + M 2 + M 3 l θ ( X m ) Y m l θ ( X m ) / M 3
(c)
m = M 2 + 1 M 1 + M 2 + M 3 l θ ( X m ) Y m 1 B i ( X m ) / M 3
are sufficiently close to zero. If this is not the case, we change the network architecture and train again.

4. Examples

As examples, we study three different risk assessment problems from banking and insurance. For comparison we generated realizations ( X m , Y m ) of ( X , Y ) under P Q as well as P ν Q , where  ν  is the IS distribution on R d obtained by changing the distribution of X. In all our examples, X has a transformed normal distribution as in Section 2.4. In each case we used 1.5 million simulations with mini-batches of size 10,000 for training, 500,000 simulations for validation and 500,000 for backtesting. After training and testing the network, we simulated another 500,000 realizations of X, once under P and then under P ν , to estimate VaR 99.5 % ( L ) and ES 99 % ( L ) .
We implemented the algorithms in Python. To train the networks we used the TensorFlow package.

4.1. Single Put Option

As a first example, we consider a liability consisting of a single put option with strike price K = 100 and maturity T = 1 / 3 on an underlying asset starting from s 0 = 100 and evolving according to
d S t = μ S t d t + σ S t d W t P = r S t d t + σ S t d W t Q
for an interest rate r = 1 % , a drift μ = 5 % , a volatility σ = 20 % , a P -Brownian motion W P and a Q -Brownian motion W Q . As risk horizon we choose τ = 1 / 52 . The time τ -value of this liability is
L = e r ( T τ ) E Q ( K S T ) + S τ .
Using Itô’s formula, one can write
S τ = s 0 exp ( μ σ 2 / 2 ) τ + σ τ Z and S T = S τ exp ( r σ 2 / 2 ) ( T τ ) + σ T τ V ,
where Z and V are two independent standard normal random variables under P Q . It is well-known that L is of the form P ( S τ , r , σ , T τ ) , where P is the Black–Scholes formula for a put option. This allows us to calculate reference values for VaR α ( L ) and ES α ( L ) .
To train the neural network approximation of L, we simulated realizations of the pair ( X , Y ) for X = S τ and Y = e r ( T τ ) ( K S T ) + . For comparison, we first simulated according to P Q and then according to P ν Q for an IS distribution P ν . Clearly, L is decreasing in Z. Therefore, we chose P ν so as to make Z normally distributed with mean equal to the 1 α -quantile of a standard normal and variance 1. Since x P ( x , r , σ , T τ ) is a simple one-dimensional function, we selected a network with a single hidden layer containing 5 nodes. This is sufficient to approximate P and faster to train than a network with more hidden layers. We did not use the tree structure of Section 3.1.1 for training (that is, N 2 = 1 ) and trained the network over 40 epochs.
It can be seen in Figure 1 that the empirical squared distance on both, the training and validation data, decreases with a very similar decaying pattern. This provides a first validation of our approximation procedure.
Figure 2 shows the empirical evaluation of (a) and (b) of Section 3.2 on the test data after each training epoch.
Similarly, Figure 3 illustrates the empirical evaluation of (c) of Section 3.2 on the test data after each training epoch for the sets
B 1 = x R : x < s 40 % and B 2 = x R : x > s 70 % ,
where s β denotes the β -quantile of X = S τ .
Training, validation and testing with IS worked similarly.
Once the network has been trained and tested, one can estimate VaR and ES numbers. Figure 4 shows our results for increasing sample sizes. The left panel shows our estimate of VaR 99.5 % ( L ) without and with IS. Plugging the 0.5%-quantile of S τ into the Black–Scholes formula gives a reference value of 8.3356. Our method yielded 8.3358 without and 8.3424 with importance sampling. The right panel shows our results for ES 99 % ( L ) . Transforming simulations of S τ with the Black–Scholes formula and using the empirical ES estimate (5) resulted in a reference value of 8.509. Without importance sampling, the neural network learned a value of 8.456 versus 8.478 with importance sampling. It can be seen that in both cases, IS made the method more efficient. It has to be noted that for increasing sample sizes, the VaR and ES estimates converge to the corresponding risk figures in the neural network model, which are not exactly equal to their analogs in the correct model. But it can be seen that the blue lines are close to their final values after very few simulations.

4.2. Portfolio of Call and Put Options

In our second example we introduce a larger set of risk factors. We consider a portfolio of 20 short call and put options on different underlying assets with initial prices s 0 i > 0 and dynamics
d S t i = μ i S t i d t + σ i S t i d W t P , i = r S t i d t + σ i S t i d W t Q , i
for P -Brownian motions W P , i and Q -Brownian motions W Q , i , i = 1 , , 20 such that ( W P , 1 , , W P , 20 ) is a multivariate Gaussian process under P with an instantaneous correlation of 30% between different components and ( W Q , 1 , , W Q , 20 ) is a multivariate Gaussian process under Q , also with instantaneous correlation of 30% between different components. We set s 0 i = 100 for all i = 1 , , 20 and r = 1 % . The drift and volatility parameters are assumed to be μ i = μ 10 + i = ( 3 + i / 2 ) % and σ i = σ 10 + i = ( 15 + i ) % , i = 1 , , 10 . As in the first example, we choose a maturity of T = 1 / 3 and a risk horizon of τ = 1 / 52 . We assume all options have the same strike price K = 100 . Then the time- τ value of the liability is
L = e r ( T τ ) E i = 1 10 ( S T i K ) + + i = 11 20 ( K S T i ) + | X ,
where X is the vector S τ 1 , , S τ 20 .
In this example we trained a neural network with two hidden layers containing 15 nodes each. We first simulated according to P Q and trained for 100 epochs. Figure 5 shows the decay of the empirical squared distance on the training and validation data set.
After training and testing the network under P Q , we did the same under P ν Q for the IS distributions ν resulting from the procedure of Section 2.4 for α = 99.5 % (for VaR) and 99 % (for ES). The two plots in Figure 6 show the empirical evaluations of (a) and (b) on the test data under the IS measure P ν Q corresponding to α = 99 % .
As an additional test, we consider the two sets
B 1 = { x R 20 : x i > s 20 % i   and   x 10 + i < s 80 % 10 + i   for   i = 1 , 2 , 3 }
and
B 2 = { x R 20 : x i < s 80 % i   and   x 10 + i > s 20 % 10 + i   for   i = 1 , 2 , 3 } ,
where s β i is the β -quantile of S τ i under P ν Q , and evaluate (c) on the test data generated under the IS distribution P ν Q corresponding to α = 99 % . The results are depicted in Figure 7.
After training and testing, we generated simulations of X to estimate VaR 99.5 % ( L ) and ES 99 % ( L ) . The convergence for increasing sample sizes is shown in Figure 8. The reference values, 104.92 for 99.5 % -VaR and 105.59 for 99 % -ES, were obtained from the empirical estimates (5) by simulating S τ and using the Black–Scholes formula for each of the 20 options. The neural network estimates of 99.5 % -VaR were 104.56 without and 104.48 with IS. Those of 99 % -ES were 105.03 without and 104.65 with IS. In both cases, IS made the procedure more efficient.

4.3. Variable Annuity with GMIB

As a third example we study a variable annuity (VA) with guaranteed minimum income benefit (GMIB). We consider the contract analyzed by Ha and Bauer (2019) using polynomial regression. At time 0 the contract is sold to an x-year old policyholder. If she is still alive at maturity T, she can choose between the balance of an underlying investment account and a lifetime annuity. Therefore, in case of survival, the time-T value of the contract is
max S T , b a x + T ( T ) ,
where S T is the account value, b a guaranteed rate and a x + T ( T ) the time-T value of an annuity paying one unit of currency to a ( x + T ) -year old policyholder at times T + 1 , T + 2 , . . . for as long as the person lives.
The contract is exposed to three types of risk: investment risk, interest rate risk and mortality risk. We suppose the log-account value q t = log ( S t ) , the interest rate r t and the mortality rate μ x + t of our policyholder start from known constants q 0 , r 0 , μ x and for x + t 120 , have P -dynamics
d q t = m 1 2 σ S 2 d t + σ S d W t P , S , d r t = ζ ( γ r t ) d t + σ r d W t P , r , d μ x + t = κ μ x + t d t + σ μ d W t P , μ ,
for given parameters m , ζ , γ , κ , σ S , σ r , σ μ and P -Brownian motions W P , S , W P , r and W P , μ forming a three-dimensional Gaussian process with instantaneous correlations ρ 12 , ρ 13 and ρ 23 . We assume that our policyholder does not live longer than 120 years. Therefore, we set μ x + t for x + t > 120 . The dynamics for x + t 120 under the risk-neutral probability Q are assumed to be
d q t = r t 1 2 σ S 2 d t + σ S d W t Q , S , d r t = ζ ( γ ¯ r t ) d t + σ r d W t Q , r , d μ x + t = κ μ x + t d t + σ μ d W t Q , μ ,
where W Q , S , W Q , r , W Q , μ are Q -Brownian motions constituting a three-dimensional Gaussian process with the same instantaneous correlations as the corresponding P -Brownian motions. As Ha and Bauer (2019), we assume there is no risk premium for mortality and a constant risk premium λ for interest rate risk, such that γ ¯ = γ λ σ r / ζ . Provided that the policyholder is still alive at the risk horizon τ < T , the value of the contract at that time is
L = E Q e τ T r s + μ x + s d s max e q T , b a x + T ( T ) X ,
where we denote X = ( q τ , r τ , μ x + τ ) . Discounting with r s + μ x + s takes into account that the policyholder might die between τ and T. On the other hand, a possible death time between 0 and τ is not considered. This results in a conservative estimate of the capital requirement for the issuer of the contract. Alternatively, one could model the loss as I A L , where A Ω is the event that the policyholder survives until time τ .
We follow Ha and Bauer (2019) and set x = 55 , τ = 1 , T = 15 , b = 10.792 , q 0 = 4.605 , m = 5 % , σ S = 18 % , r 0 = 2.5 % , ζ = 25 % , γ = 2 % , σ r = 1 % , λ = 2 % , μ x = 1 % , κ = 7 % , σ μ = 0.12 % , ρ 12 = 30 % , ρ 13 = 6 % , ρ 23 = 4 % . Then
a x + T ( T ) = k = 1 50 k E x + T ( T ) ,
where
k E x + t ( t ) = E Q e t t + k r s + μ x + s d s r t , μ x + t
is the time-t value of a pure endowment contract with maturity t + k . Since r and μ are affine, one has
k E x + t ( t ) = F ( t , k , r t , μ x + t )
for the function F ( t , k , r t , μ x + t ) = A ( t , t + k ) e B r ( t , t + k ) r t B μ ( t , t + k ) μ x + t , with A , B r , and B μ as given in the Appendix of Ha and Bauer (2019). Moreover, one can write
L = T τ E x + τ ( τ ) E Q E max e q T , b a x + T ( T ) X
for the probability measure Q E given by
d Q E d Q = exp 0 T r s + μ x + s d s E Q exp 0 T r s + μ x + s d s .
Under P , X = ( q τ , r τ , μ τ ) is a three-dimensional normal vector, and the conditional Q E -distribution of ( q T , r T , μ x + T ) given X is normal too (the precise form of these distributions is given in the Appendix of Ha and Bauer (2019)). This makes it possible to efficiently simulate ( X , Y ) under P Q E , where
Y = F ( τ , T τ , r τ , μ x + τ ) max e q T , b k = 1 50 F ( T , k , r T , μ x + T ) .
To approximate L we chose a network with two hidden layers containing 4 nodes each and trained it for 40 epochs. Figure 9 shows the empirical squared distance without IS on the training and validation data set.
The panels in Figure 10 illustrate the empirical evaluations of the test criteria (a) and (b) from Section 3.2.
To test (c) from Section 3.2 we considered the sets
B 1 = x R 3 : x 1 > q 70 %   and   x 2 < r 30 % and B 2 = x R 3 : x 1 < q 30 %   and   x 2 > r 70 %
where q β and r β denote the β -quantiles of q τ and r τ under P Q E ; see Figure 11.
We also trained and tested under P ν Q E for an IS distribution ν on R d . Our results are reported in Figure 12. Our estimate of VaR 99.5 % ( L ) was 138.64 without and 138.52 with IS. In comparison, the VaR estimate obtained by Ha and Bauer (2019) using 37 monomials and 40 million simulations is 139.74. Our estimates of ES 99 % ( L ) came out as 141.12 without and 142.12 with IS. There exist no reference values for this case.

5. Conclusions

In this paper we have developed a deep learning method for assessing the risk of an asset-liability portfolio over a given time horizon. It first computes a neural network approximation of the portfolio value at the risk horizon. Then the approximation is used to estimate a risk measure, such as value-at-risk or expected shortfall, from Monte Carlo simulations. We have investigated how to choose the architecture of the network, how to learn the network parameters under a suitable importance sampling distribution and how to test the adequacy of the network approximation. We have illustrated the approach by computing value-at-risk and expected shortfall in three typical risk assessment problems from banking and insurance. In all cases the approach has worked efficiently and produced accurate results.

Author Contributions

P.C., J.E., and M.V.W. have contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

As SCOR Fellow, John Ery thanks SCOR for financial support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bauer, Daniel, Andreas Reuss, and Daniela Singer. 2012. On the calculation of the solvency capital requirement based on nested simulations. ASTIN Bulletin 42: 453–99. [Google Scholar]
  2. Britten-Jones, Mark, and Stephen Schaefer. 1999. Non-linear Value-at-Risk. Review of Finance, European Finance Association 5: 155–80. [Google Scholar]
  3. Broadie, Mark, Yiping Du, and Ciamac Moallemi. 2011. Efficient risk estimation via nested sequential estimation. Management Science 57: 1171–94. [Google Scholar] [CrossRef] [Green Version]
  4. Broadie, Mark, Yiping Du, and Ciamac Moallemi. 2015. Risk estimation via regression. Operations Research 63: 1077–97. [Google Scholar] [CrossRef] [Green Version]
  5. Bru, Bernard, and Henri Heinich. 1985. Meilleures approximations et médianes conditionnelles. Annales de l’I.H.P. Probabilités et Statistiques 21: 197–224. [Google Scholar]
  6. Cambou, Mathieu, and Damir Filipović. 2018. Replicating portfolio approach to capital calculation. Finance and Stochastics 22: 181–203. [Google Scholar] [CrossRef]
  7. Carriere, Jacques F. 1996. Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics and Economics 19: 19–30. [Google Scholar] [CrossRef]
  8. Castellani, Gilberto, Ugo Fiore, Zelda Marino, Luca Passalacqua, Francesca Perla, Salvatore Scognamiglio, and Paolo Zanetti. 2019. An Investigation of Machine Learning Approaches in the Solvency II Valuation Framework. SSRN Preprint. [Google Scholar]
  9. Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2: 303–14. [Google Scholar] [CrossRef]
  10. David, Herbert A., and Haikady N. Nagaraja. 2003. Order Statistics, 3rd ed. Wiley Series in Probability and Statistics; Hoboken: Wiley. [Google Scholar]
  11. Duffie, Darrell, and Jun Pan. 2001. Analytical Value-at-Risk with jumps and credit risk. Finance and Stochastics 5: 155–80. [Google Scholar] [CrossRef]
  12. Fiore, Ugo, Zelda Marino, Luca Passalacqua, Francesca Perla, Salvatore Scognamiglio, and Paolo Zanetti. 2018. Tuning a deep learning network for Solvency II: Preliminary results. In Mathematical and Statistical Methods for Actuarial Sciences and Finance: MAF 2018. Berlin: Springer. [Google Scholar]
  13. Glasserman, Paul. 2003. Monte Carlo Methods in Financial Engineering. Stochastic Modelling and Applied Probability. Berlin: Springer. [Google Scholar]
  14. Glorot, Xavier, and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics 9: 249–56. [Google Scholar]
  15. Glynn, Peter, and Shing-Hoi Lee. 2003. Computing the distribution function of a conditional expectation via Monte Carlo: Discrete conditioning spaces. ACM Transactions on Modeling and Computer Simulation 13: 238–58. [Google Scholar]
  16. Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge: MIT Press. [Google Scholar]
  17. Gordy, Michael B., and Sandeep Juneja. 2008. Nested simulation in portfolio risk measurement. Management Science 56: 1833–48. [Google Scholar] [CrossRef] [Green Version]
  18. Ha, Hongjun, and Daniel Bauer. 2019. A least-squares Monte Carlo approach to the estimation of enterprise risk. Working Paper. Forthcoming. [Google Scholar]
  19. Hejazi, Seyed Amir, and Kenneth R. Jackson. 2017. Efficient valuation of SCR via a neural network approach. Journal of Computational and Applied Mathematics 313: 427–39. [Google Scholar] [CrossRef] [Green Version]
  20. Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2: 359–66. [Google Scholar] [CrossRef]
  21. Lee, Shing-Hoi. 1998. Monte Carlo Computation of Conditional Expectation Quantiles. Ph.D. thesis, Stanford University, Stanford, CA, USA. [Google Scholar]
  22. Leshno, Moshe, Vladimir Lin, Allan Pinkus, and Shimon Schocken. 1993. Multilayer feedforward networks with a non-polynomial activation function can approximate any function. Neural Networks 6: 861–67. [Google Scholar] [CrossRef] [Green Version]
  23. Longstaff, Francis, and Eduardo Schwartz. 2001. Valuing American options by simulation: A simple least-squares approach. The Review of Financial Studies 14: 113–47. [Google Scholar] [CrossRef] [Green Version]
  24. McNeil, Alexander J., Rüdiger Frey, and Paul Embrechts. 2015. Quantitative Risk Management: Concepts, Techniques and Tools, rev. ed. Princeton: Princeton University Press. [Google Scholar]
  25. Natolski, Jan, and Ralf Werner. 2017. Mathematical analysis of replication by cashflow matching. Risks 5: 13. [Google Scholar] [CrossRef] [Green Version]
  26. Pelsser, Antoon, and Janina Schweizer. 2016. The difference between LSMC and replicating portfolio in insurance liability modeling. European Actuarial Journal 6: 441–94. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  27. Rouvinez, Christophe. 1997. Going Greek with VaR. Risk Magazine 10: 57–65. [Google Scholar]
  28. Ruder, Sebastian. 2016. An overview of gradient descent optimization algorithms. arXiv arXiv:1609.04747. [Google Scholar]
  29. Tsitsiklis, John, and Benjamin Van Roy. 2001. Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks 12: 694–703. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  30. Wüthrich, Mario V. 2016. Market-Consistent Actuarial Valuation. EAA Series; Berlin: Springer. [Google Scholar]
  31. Zwingmann, Tobias, and Hajo Holzmann. 2016. Asymptotics for the expected shortfall. arXiv arXiv:1611.07222. [Google Scholar]
1.
More precisely, VaR α is usually defined as an α - or (1 α ) -quantile depending on whether it is applied to L or L . So up to the sign convention, all VaR definitions coincide if F L is strictly increasing. Similarly, different definitions of ES are equivalent if F L is continuous.
2.
If the training data is generated according to a tree structure as in Section 3.1.1, one can either group the simulations ( X i , Y i , j ) , i = 1 , , N 1 , j = 1 , , N 2 , so that pairs with the same X i -component stay together or not. In our implementations, both methods gave similar results.
3.
More precisely, uniqueness holds if functions are identified that agree π -almost surely.
Figure 1. Empirical squared distance without IS during training (left) and on the validation data (right).
Figure 1. Empirical squared distance without IS during training (left) and on the validation data (right).
Risks 08 00016 g001
Figure 2. Empirical evaluation of (a) (left) and (b) (right) of Section 3.2.
Figure 2. Empirical evaluation of (a) (left) and (b) (right) of Section 3.2.
Risks 08 00016 g002
Figure 3. Empirical evaluation of (c) of Section 3.2 for the sets B 1 (left) and B 2 (right).
Figure 3. Empirical evaluation of (c) of Section 3.2 for the sets B 1 (left) and B 2 (right).
Risks 08 00016 g003
Figure 4. Convergence of the empirical 99.5%-VaR (left) and 99%-ES (right) without IS (orange) and with IS (blue) compared to the reference values obtained from the Black–Scholes formula (green).
Figure 4. Convergence of the empirical 99.5%-VaR (left) and 99%-ES (right) without IS (orange) and with IS (blue) compared to the reference values obtained from the Black–Scholes formula (green).
Risks 08 00016 g004
Figure 5. Empirical squared distance without IS during training (left) and on the validation data (right).
Figure 5. Empirical squared distance without IS during training (left) and on the validation data (right).
Risks 08 00016 g005
Figure 6. Empirical evaluation of (a) (left) and (b) (right) of Section 3.2 under P ν Q corresponding to α = 99 % .
Figure 6. Empirical evaluation of (a) (left) and (b) (right) of Section 3.2 under P ν Q corresponding to α = 99 % .
Risks 08 00016 g006
Figure 7. Empirical evaluation of (c) of Section 3.2 for the sets B 1 (left) and B 2 (right) under P ν Q corresponding to α = 99 % .
Figure 7. Empirical evaluation of (c) of Section 3.2 for the sets B 1 (left) and B 2 (right) under P ν Q corresponding to α = 99 % .
Risks 08 00016 g007
Figure 8. Convergence of the empirical 99.5%-VaR (left) and 99%-ES (right) without IS (orange) and with IS (blue) compared to the reference values obtained from the Black–Scholes formula (green).
Figure 8. Convergence of the empirical 99.5%-VaR (left) and 99%-ES (right) without IS (orange) and with IS (blue) compared to the reference values obtained from the Black–Scholes formula (green).
Risks 08 00016 g008
Figure 9. Empirical squared distance without IS during training (left) and on the validation data (right).
Figure 9. Empirical squared distance without IS during training (left) and on the validation data (right).
Risks 08 00016 g009
Figure 10. Empirical evaluation of (a) (left) and (b) (right) of Section 3.2 under P Q E .
Figure 10. Empirical evaluation of (a) (left) and (b) (right) of Section 3.2 under P Q E .
Risks 08 00016 g010
Figure 11. Empirical evaluation of (c) of Section 3.2 for the sets B 1 (left) and B 2 (right) under P Q E .
Figure 11. Empirical evaluation of (c) of Section 3.2 for the sets B 1 (left) and B 2 (right) under P Q E .
Risks 08 00016 g011
Figure 12. Convergence of the empirical 99.5%-VaR (left) and 99%-ES (right) without IS (orange) and with IS (blue) compared to the reference value from Ha and Bauer (2019) (green).
Figure 12. Convergence of the empirical 99.5%-VaR (left) and 99%-ES (right) without IS (orange) and with IS (blue) compared to the reference value from Ha and Bauer (2019) (green).
Risks 08 00016 g012

Share and Cite

MDPI and ACS Style

Cheridito, P.; Ery, J.; Wüthrich, M.V. Assessing Asset-Liability Risk with Neural Networks. Risks 2020, 8, 16. https://doi.org/10.3390/risks8010016

AMA Style

Cheridito P, Ery J, Wüthrich MV. Assessing Asset-Liability Risk with Neural Networks. Risks. 2020; 8(1):16. https://doi.org/10.3390/risks8010016

Chicago/Turabian Style

Cheridito, Patrick, John Ery, and Mario V. Wüthrich. 2020. "Assessing Asset-Liability Risk with Neural Networks" Risks 8, no. 1: 16. https://doi.org/10.3390/risks8010016

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop