3.2. Neural Networks
Haykin (1994) defined an NN as a “massively parallel distributed processor made up of simple processing units that has a natural propensity for storing experiential knowledge and making it available for use”.
The best part about considering an NN approach is that one does not need to assume any underlying pattern for the data. That is, NNs are a model-free, purely data-driven approach: patterns are captured using an almost exhaustive search, provided the data are regular enough (e.g., no infinite variance). The only assumption one needs to consider concerns the ability of the NN to approximate the output function. As we discuss below, this is an extremely mild assumption due to the existence of the universal approximation theorem.
For our pricing of American put options, we collected market data on put prices (output), but also on the same inputs the model in Equation (1) requires: price of the underlying asset, volatility, interest rate, dividend yield, the option's strike price, and maturity. As we will see in Section 4.4, in our simplest NN, some of these inputs will not even be considered.
For exposition simplicity, we briefly present the dynamics of a feedforward NN with just one hidden layer, as represented in
Figure 1. The architecture’s logic allows for more than one hidden layer, which would add more connections and weights between the hidden layers only. In our options pricing application, we use both a model with just one hidden layer and another with several.
A feedforward NN can be summarized as follows:
The first layer, ($x_1$ to $x_d$), represents the input layer, where $d$ is the number of input variables. The second layer is the hidden layer, ($h_1$ to $h_k$), where $k$ is the number of nodes. Finally, $f$ represents the output variable. The nodes in the same layer are not connected to each other, but each node is connected to each node from the neighboring layer. This connection is given by the weighted sum of the values of the previous nodes.
Starting with the connection between the input layer and the hidden layer, let $x_i$ represent an input node and $h_j$ the $j$-th node from the hidden layer; then each hidden node is obtained as:
$$h_j = g\left(\sum_{i=1}^{d} w_{i,j}\, x_i + b_j\right),$$
where $d$ is the number of input variables, $w_{i,j}$ is the weight of the input layer node $i$ with respect to the hidden node $j$, $b_j$ is the bias node, and $g$ is an activation function.
As in the hidden layer, the output node also depends on an activation function, with the weighted sum of the hidden nodes as the argument. Once we have a value for each $h_j$, the output of the function is given by:
$$f = g\left(\sum_{j=1}^{k} v_j\, h_j + b\right),$$
where $f$ is the output value, $k$ is the number of nodes in the hidden layer, $v_j$ is the weight of the node $h_j$, $b$ is the bias, and $g$ is also an activation function.
The bias node is an extra node added to each layer (except the output layer). It is an extra argument to the activation function and acts as the independent term in a regression model (i.e., as a scaler). While the nodes are connected with the input layer through the weights, the bias node is not connected to the input layer. The argument of the activation function depends on the input nodes and the respective weights and is then used to complete the connection from the input nodes to each hidden node. It is the activation function that scales the argument to a different range, introducing non-linearity and making the model susceptible to non-linear combinations between the input variables.
The universal approximation theorem states that, under mild assumptions on the activation function, a feedforward network with one hidden layer and a finite number of neurons is able to approximate continuous functions on compact subsets of $\mathbb{R}^d$. Intuitively, this theorem states that, when given appropriate parameters, a simple NN can represent a wide variety of functions. One of the first versions of the theorem was proven by Cybenko (1989), for sigmoid activation functions.
Leshno et al. (1993) later showed that the class of deep neural networks is a universal approximator if and only if the activation function is not polynomial, although for an NN with a single hidden layer, the width of such networks would need to be exponentially large.
Hornik et al. (1989) showed that it is the multilayer feedforward architecture itself, rather than the specific choice of activation function, that gives NNs the potential of being universal approximators. The usage of NNs as approximators for complex and multidimensional problems was reinforced by Barron (1993) with a result on the accuracy of function approximation, which implies that the number of elements of an NN does not have to increase exponentially with the space dimension to keep errors at a low level.
Besides the classical sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$, which can be seen on the left of Figure 2, several other activation functions have been considered in the literature.
Krizhevsky et al. (2012) empirically compared the sigmoid function to a nonlinear function called Rectified Linear Units (ReLU), concluding that the latter consistently improves NN training, decreasing its error.
Lu et al. (2017) showed that an NN of width $n+4$ with ReLU activation functions is able to approximate any Lebesgue-integrable function on an $n$-dimensional input space with respect to the $L^1$ distance, if the network depth is allowed to grow.
Hanin and Sellke (2017) showed that an NN of width $n+1$ would suffice to approximate any continuous function of $n$-dimensional input variables.
Xu et al. (2015) investigated the use of variants of ReLU, such as leaky ReLU. Although the authors recognized the need for a rigorous theoretical treatment of their empirical results, they showed that leaky ReLU seems to work consistently better than the original ReLU.
The general equation for both the classical ReLU and leaky ReLU is given by:
$$g(x) = \max(x, \alpha x) = \begin{cases} x, & x \geq 0 \\ \alpha x, & x < 0, \end{cases}$$
where $\alpha = 0$ yields the classical ReLU and a small positive $\alpha$ yields a leaky ReLU. The classical ReLU and an instance of a leaky ReLU are also represented on the right of Figure 2. Following the most recent literature, in our option pricing NN, we opted for leaky ReLU (with a small positive $\alpha$) instead of the traditional sigmoid function or standard ReLU; this choice has the advantage of avoiding zero gradients.
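A minimal sketch of this piecewise-linear activation. The slope $\alpha = 0.01$ used here is purely illustrative; any small positive value gives a leaky ReLU, and $\alpha = 0$ recovers the classical ReLU.

```python
def leaky_relu(x, alpha=0.01):
    # g(x) = max(x, alpha * x): returns x for x >= 0 and alpha * x otherwise;
    # alpha = 0 recovers the classical ReLU
    return max(x, alpha * x)

y_pos = leaky_relu(3.0)    # positive inputs pass through unchanged
y_neg = leaky_relu(-2.0)   # negative inputs keep a small, nonzero gradient
```

The nonzero slope on the negative side is what avoids the zero-gradient ("dying ReLU") problem mentioned above.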
We note that the universal approximation theorem states that, given appropriate parameters, we can represent a wide variety of interesting functions using a simple NN, but it does not touch upon the algorithmic learnability of those parameters. The most common learning procedure was introduced by Rumelhart et al. (1985) and is known as backpropagation. Under this setting, the weights of the nodes are iteratively adjusted in order to minimize the so-called cost function.
The cost function rates how well the NN does on the whole set of observations. Taking the Mean Squared Error (MSE) as an example of a cost function, we get:
$$MSE(\theta) = \frac{1}{n} \sum_{j=1}^{n} \left( f_j(\theta) - y_j \right)^2,$$
where $n$ is the number of observations, $f_j(\theta)$ is the NN output value for $\theta$ the set of parameters, and $y_j$ are the real observed market values.
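As a sketch, the MSE cost function can be computed directly from the definition (the variable names are illustrative):

```python
def mse(outputs, targets):
    # mean squared error between NN outputs f_j(theta) and observed values y_j
    n = len(outputs)
    return sum((f_j - y_j) ** 2 for f_j, y_j in zip(outputs, targets)) / n

# two predictions match exactly, one misses by 2, so the MSE is (0 + 0 + 4) / 3
err = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```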
Backpropagation adjusts the connection weights to correct for each error found. The error amount is effectively split among the connections. Technically, backpropagation computes the gradient of the cost function at a given state with respect to the weights. The weight updates are commonly done using Gradient Descent (GD),
$$w_{t+1} = w_t - \eta \nabla C(w_t),$$
where $t$ is the iteration step we refer to, $\eta$ is the learning rate, and $\nabla C(w_t)$ is the gradient of the cost function. The choice of $\eta$ should be made carefully, as values near one could cause the algorithm to be unstable and oscillate, resulting in missing a global/local minimum in one iteration, whereas values near zero can converge to a non-optimal solution and also slow the convergence to a solution. See LeCun et al. (2012) for an overview of learning rate issues.
The backpropagation learning algorithm starts with random or pre-selected weights at $t = 0$. Then, for each observation, weights are updated at step $t$ according to:
$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w_t},$$
where the cost function gradients, $\partial C / \partial w$, can easily be obtained by applying the chain rule:
$$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial \text{out}} \cdot \frac{\partial \text{out}}{\partial \text{net}} \cdot \frac{\partial \text{net}}{\partial w},$$
where $\text{net}$ and $\text{out}$ stand for the input and output of each NN node, respectively. Following this logic for every observation, the total error should decrease with each additional iteration step.
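To make the update rule concrete, the following toy sketch fits a single-weight linear model $f(x) = wx$ to data generated by $y = 2x$. The model, data, learning rate, and iteration count are all illustrative assumptions; the point is only the mechanics of $w_{t+1} = w_t - \eta\, \partial C / \partial w_t$.

```python
# observations (x, y) generated by y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, eta = 0.0, 0.05        # initial weight and learning rate
for _ in range(200):      # iteration steps t
    # chain rule for the squared-error cost C = mean((w*x - y)^2):
    # dC/dw = mean(2 * (w*x - y) * x)
    grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
    w -= eta * grad       # w_{t+1} = w_t - eta * dC/dw_t
# w converges toward the true slope 2.0
```

With $\eta$ too large the iterates would oscillate or diverge; too small, and 200 steps would not be enough, illustrating the learning-rate trade-off discussed above.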
Stochastic Gradient Descent (SGD) was proposed by Bottou (2010) to deal with large-scale problems. In SGD, instead of updating the parameters after iterating once through the full training set, i.e., after one epoch, we update the weights after randomly choosing one observation. It is expected that the gradient based on the single observation is an approximation of the expected gradient of the training set. In our option application, we use a variation of SGD where, instead of updating the weights after each observation, we set a size for a subset, called the batch size, and update the parameters for each randomly chosen subset of that size within the same epoch. This process accelerates the computation time, prevents numerical instability, and is more efficient in large datasets, as shown in LeCun et al. (2012) and Bottou (2010). Therefore, when calibrating an NN, one needs to set the number of epochs (iterations) and the batch (random subsets) size that conditions the number of parameter updates per epoch.
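The mini-batch variant described above can be sketched as follows. This is a generic illustration, not the paper's calibration code: the helper names, the single-weight toy model, and all hyperparameter values are assumptions.

```python
import random

def minibatch_sgd(data, w, eta, epochs, batch_size, grad_fn):
    # one parameter update per randomly drawn mini-batch, so several updates
    # per epoch, rather than one per full pass (GD) or one per observation (SGD)
    for _ in range(epochs):
        random.shuffle(data)                       # draw batches at random
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= eta * grad_fn(w, batch)           # update on this batch only
    return w

# toy usage: single-weight linear model f(x) = w*x fitted to y = 2x,
# with the squared-error gradient averaged over the batch
grad_fn = lambda w, batch: sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
w = minibatch_sgd([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)],
                  0.0, 0.02, 100, 2, grad_fn)
```

Here the number of epochs and the batch size jointly determine the number of parameter updates, as noted above: each epoch performs `len(data) / batch_size` updates.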
Finally, in machine learning methods, it is common to use data scaled to a specific range (zero to one, for example) in order to increase the accuracy of the models and allow the loss function to find a global, or local, minimum. The transformation we use is given by:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x$ is the value to normalize and $x_{\min}$ and $x_{\max}$ are, respectively, the minimum and maximum values of the range. This transformation is also done because the input variables have different scales. Output variables may be scaled or not, as they have a unique scale.
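A minimal sketch of this min-max transformation, taking the range bounds from the data itself:

```python
def min_max_scale(values):
    # rescale each value to [0, 1] via x' = (x - x_min) / (x_max - x_min)
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

scaled = min_max_scale([10.0, 20.0, 30.0])  # endpoints map to 0 and 1
```

In practice each input variable (price, volatility, rate, etc.) would be scaled separately with its own $x_{\min}$ and $x_{\max}$, precisely because the variables have different scales.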
The learning algorithm of the multilayer perceptron could be affected by the different scales of the variables, and the loss function can fail to converge to the local, or global, minimum. This remains an active area of research in machine learning. The problem of finding a local or global minimum was first presented by Rumelhart et al. (1985), where the authors concluded that although the learning algorithm finds a solution in almost every practical try, it does not guarantee that a solution can be found. In terms of the global minimum, Choromanska et al. (2015) focused their study on the non-convexity of the loss function, which leads learning algorithms to find a local minimum instead of a global one. The authors showed that, although there is a lack of theoretical support for optimization algorithms in NNs, the global minimum is not relevant in practice because reaching it may lead to overfitting the model.