Neural Network Pricing of American Put Options

In this study, we use Neural Networks (NNs) to price American put options. We propose two NN models—a simple one and a more complex one—and compare their performance with that of the Least-Squares Monte Carlo (LSM) method. This study relies on American put option market prices for four large U.S. companies—Procter and Gamble Company (PG), Coca-Cola Company (KO), General Motors (GM), and Bank of America Corp (BAC). Our dataset comprises all options traded from December 2018 to March 2019. Although, on average, both NN models perform better than LSM, the simpler model (NN Model 1) performs quite close to LSM. The second NN model, however, substantially outperforms the other models, with an RMSE ca. 40% lower than that of LSM. The lower RMSE is consistent across all companies, strike levels, and maturities. In summary, all methods show good accuracy; however, after calibration, NNs produce better results in terms of both execution time and Root Mean Squared Error (RMSE).


Introduction
This study compares two different methods to price American put options. While call options give their holder the right to buy the underlying asset at a pre-defined price, the strike value, put options give the right to sell it. European-style options can only be exercised at a pre-defined fixed date, the maturity. American-style options, instead, can be exercised at any moment until maturity, leading to an optimal stopping time problem.
The case of American put options is particularly hard to solve, and there are no closed-form solutions.
Here, we focus on the comparison of two methods: the least-squares Monte Carlo method, a simulation method first presented by Longstaff and Schwartz (2001), and a machine learning method, Neural Networks (NNs).
Most studies on NNs explore mainly European-style options. That is the case of Hutchinson et al. (1994), the first article where NNs were trained to price options, but also of Bennell and Sutcliffe (2004); Garcia and Gençay (2000); Gradojevic et al. (2009); Yao et al. (2000); Liu et al. (2019). A very recent exception is Jang and Lee (2019), which looked into market prices of S&P 100 American put options.

Literature Overview
The pricing of American style options is a classical problem in financial mathematics. This type of problem is associated with a moving boundary that is related to the optimal time to exercise the option. Various approaches have been proposed and developed for the valuation problem of American options. Methods based on solving partial differential equations include finite differences schemes as introduced by Brennan and Schwartz (1977) and the finite element method as presented by Allegretto et al. (2001).
In the class of simulation and tree approaches, the most popular techniques include the binomial and trinomial tree methods and the Least-Squares Monte Carlo (LSM) method proposed in Longstaff and Schwartz (2001). Zhao (2018) studied the pricing of put and call options by comparing eight different valuation methods on a one-dimensional scale. He compared numerical methods as the binomial and trinomial tree methods and the explicit and implicit finite difference and concluded that the binomial method had the best overall performance in terms of both time and accuracy. In the simulation methods class, he analyzed the simulated tree from Broadie et al. (1997), the bundling technique from Tilley (1993), and the LSM method, with LSM showing the best results also in terms of both accuracy and execution time.
The fundamental idea of the LSM approach is that the optimal exercise strategy is obtained by the conditional expectation of the payoff from maintaining the option alive. This conditional expectation is estimated using the cross-sectional information in the simulated paths by use of least squares, allowing accurately valuing American options by simulation. It should be noted that simulation methods tend to be more attractive when compared to other numerical schemes when the complexity of the option grows or the number of the underlying assets increases.
The consistency and convergence rates of the LSM were investigated by Clément et al. (2002). Variations of the original regression techniques have also been explored. Examples are nonparametric least squares regression estimates as in Egloff (2005) and Egloff et al. (2007), the smoothing spline regression estimates of Kohler (2008), or the recursive kernel regression estimates by Barty et al. (2008). In Kohler et al. (2010), the authors presented a further modification of the LSM algorithm using neural networks instead of regressions, proving the convergence of the proposed algorithm and applying it to pricing American put options written on several underlyings, using simulated data under Black-Scholes assumptions and the Heston model.
Artificial Neural Networks (NN) were firstly mathematically formalized in McCulloch and Pitts (1943) as a biologically-inspired programming paradigm that enables a computer to learn from observable data. The introduction of the error backpropagation learning algorithm by Rumelhart et al. (1985) has since contributed to the increase in popularity of NNs in many research fields. NNs and deep learning currently provide the most efficient solution to many problems in image recognition, speech recognition, or natural language processing. They have also been applied to predict and/or classify economic and financial variables such as the gross domestic product as in Tkacz (2001), the unemployment rate in Moshiri and Brown (2004), inflation in Choudhary and Haider (2012), exchange rates in Pacelli et al. (2011), or even to predict financial crashes as in Rotundo (2004).
In particular, for pricing financial derivatives, many studies have been pointing to the advantages of using NN as the main or complementary tool. For instance, Hutchinson et al. (1994) proposed the use of learning networks for estimating the price of European options and argued that learning networks are able to recover the Black-Scholes formula using a two year training set of daily options prices and that the resulting network could be used to obtain prices and successfully delta-hedge options out-of-sample. Garcia and Gençay (2000) estimated a generalized option pricing formula that has a functional format similar to the Black-Scholes formula through a feedforward NN model and obtained small delta-hedging errors relative to the hedging performance of the Black-Scholes model. Yao et al. (2000) used backpropagation NN to forecast option prices of Nikkei 225 index futures and showed that for volatile markets, an NN option pricing model can outperform the traditional Black-Scholes model. However, they pointed out that the Black-Scholes model is still adequate for pricing at-the-money options and suggested to partition the data according to moneyness when applying neural networks. In Bennell and Sutcliffe (2004), a comparison between the performance of Black-Scholes with an NN was carried out for pricing European call options on the FTSE 100 index, and they concluded that for out-of-the-money options, the NN is clearly superior to Black-Scholes and comparable for other cases excluding the deep in-the-money and long maturity options. Gradojevic et al. (2009) proposed a Modular Neural Network (MNN) model to obtain prices for S&P-500 European call options. The modules based on moneyness and time to maturity of the options improve the out-of-sample performance when compared with standard feedforward neural network option pricing models.
While most studies on NN, dealing with real observed market data, focus on European options, the empirical study of Jang and Lee (2019) targeted S&P 100 American put options prices and also pointed out the better performance of NNs when compared with the classical financial option models.
Recently,  showed how NN can be used to provide estimates for the optimal value for other stopping time problems, such as the pricing of Bermudan max-call and callable multi barrier reverse convertible options and the problem of optimally stopping a fractional Brownian motion, providing an alternative for simulation methods for high-dimensional problems.  focused on high dimensionality American and Bermudan options.
In this study, we consider the one-dimensional American put pricing case and compare the classical LSM method, as proposed in Longstaff and Schwartz (2001), with the more recent approach of using pure NN and deep learning methods; the idea is to compare the "best" of the pure simulation methods (according to the comparative study of Zhao 2018) with the latter. An advantage of the NN approach is the fact that it is a model-free, real market data-driven approach.

Least-Squares Monte Carlo Method
The LSM implementation can be understood as a three-step procedure:
• First, a model must be chosen and, using standard Monte Carlo techniques, a large number of paths for the underlying risky asset is simulated.
Here, we assume stock prices S follow a geometric Brownian motion (GBM) whose dynamics, under the risk-neutral measure, are given by:

dS_t = (r − q) S_t dt + σ S_t dW_t, (1)

where W_t is a Wiener process, r is the risk-free interest rate, and q and σ are the associated stock dividend yield and volatility, respectively.
• Second, a backward induction process is conducted in which a value is recursively assigned to each state at every time step. At each point of the option cash-flow matrix, the exercise value is known, but the continuation value needs to be estimated. LSM uses regressions, considering the in-the-money states of the option cash-flow matrix at each time step and measurable functions of underlyings at the same time step and state, to estimate the continuation value of the option.
In this study, we consider a fifth-degree polynomial, so the continuation function is estimated as:

Y = a_0 + a_1 X + a_2 X^2 + · · · + a_5 X^5 + ε, (2)

where Y is the continuation value, X is the simulated stock value, and ε the residual error.
• Finally, when all states are valued for every time step, we calculate the value of the option by passing through all the time steps and states and by making an optimal decision on option exercise at every step for a particular price path and the value of the payoff that would result.
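The three steps above can be sketched in a minimal Python/NumPy implementation. This is an illustrative sketch, not the authors' code; it follows the text's choices (GBM paths, a degree-five polynomial regression on in-the-money paths, 1000 paths and 50 time steps as defaults), and all function and variable names are ours:

```python
import numpy as np

def lsm_american_put(S0, K, r, q, sigma, T, n_paths=1000, n_steps=50, seed=0):
    """Least-Squares Monte Carlo price of an American put under GBM (sketch)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    disc = np.exp(-r * dt)

    # Step 1: simulate GBM paths under the risk-neutral measure.
    z = rng.standard_normal((n_paths, n_steps))
    log_increments = (r - q - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    S = S0 * np.exp(np.hstack([np.zeros((n_paths, 1)),
                               np.cumsum(log_increments, axis=1)]))

    # Step 2: backward induction, estimating continuation values by
    # least-squares regression over in-the-money paths.
    cashflow = np.maximum(K - S[:, -1], 0.0)  # payoff at maturity
    for t in range(n_steps - 1, 0, -1):
        cashflow *= disc                      # discount one step back
        itm = K - S[:, t] > 0
        if itm.sum() > 6:                     # degree-5 fit needs >= 6 points
            coeffs = np.polyfit(S[itm, t], cashflow[itm], 5)
            continuation = np.polyval(coeffs, S[itm, t])
            exercise = K - S[itm, t]
            early = exercise > continuation
            cashflow[np.where(itm)[0][early]] = exercise[early]

    # Step 3: discount back to time zero and average across paths.
    return disc * cashflow.mean()
```

For an at-the-money one-year put, the sketch produces values in line with the Black-Scholes European price plus a small early-exercise premium.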
The LSM approach is easily implemented, but it depends on the chosen model (model risk), on the approximation method for the continuation value, and on the simulation decisions such as the number of paths, step size, etc.
In terms of underlying models, one could have considered alternatives, such as jump processes and processes with stochastic volatility, among many others. Here, we opted for the simplest possible model, GBM, but we did consider Bloomberg's implied volatility to value each different option, so, to some extent, we include volatility smiles in the standard GBM model. For each simulation, we used a total of 1000 paths and 50 time steps.


Neural Networks
Haykin (1994) defined an NN as a "massively parallel distributed processor made up of simple processing units that has a natural propensity for storing experiential knowledge and making it available for use". A major advantage of the NN approach is that one does not need to assume any underlying model for the data: NNs are a model-free, purely data-driven approach. Patterns are captured through an almost exhaustive search, provided the data are regular enough (e.g., no infinite variance). The only assumption one needs is on the ability of the NN to approximate the output function. As we discuss below, this is an extremely mild assumption due to the universal approximation theorem.
For our pricing of American put options, we collected market data on put prices (output), but also on the same inputs the model in Equation (1) requires: price of the underlying asset, volatility, interest rate, dividend yield, options' strike price, and maturity. As we will see in Section 4.4, in our simplest NN, some of these inputs will not even be considered.
We consider an NN with a feedforward architecture as represented in Figure 1. Although different architectures can be considered, for tackling the problem of option pricing, the feedforward dynamic is the most widely used. Examples of this include Tang and Fishwick (1993), Hutchinson et al. (1994), Yao et al. (2000), and Garcia and Gençay (2000).
For simplicity of exposition, we briefly present the dynamics of a feedforward NN with just one hidden layer, as represented in Figure 1. The architecture's logic allows for more than one hidden layer, which would add more connections and weights between the hidden layers only. In our option pricing application, we use both a model with just one hidden layer and another with several.
Figure 1. Feedforward neural network structure, where X_i, with i = 1, · · · , d, represent the input nodes, H_k, with k = 1, · · · , K, the hidden nodes, and f the output; the activation functions φ and ψ are as described in Equations (3) and (4), respectively.
A feedforward NN can be summarized as follows:
• The first layer, X_1 to X_d, is the input layer, where d is the number of input variables. The second layer is the hidden layer, H_1 to H_K, where K is the number of nodes. Finally, f represents the output variable. Nodes in the same layer are not connected to each other, but each node is connected to every node of the neighboring layer. This connection is given by the weighted sum of the values of the previous nodes.
• Starting with the connection between the input layer and the hidden layer, let X_i represent an input node and H_k the k-th node of the hidden layer; then each hidden node is obtained as:

H_k = φ( Σ_{i=1}^{d} w_{k,i} X_i + b_k ), (3)

where d is the number of input variables, w_{k,i} is the weight of input node i with respect to hidden node k, b_k is the bias node, and φ is an activation function.
• As in the hidden layer, the output node also depends on an activation function, with the weighted sum of the hidden nodes as its argument. Once we have a value for each H_k, the output is given by:

f = ψ( Σ_{k=1}^{K} v_k H_k + b ), (4)

where f is the output value, K is the number of nodes in the hidden layer, v_k is the weight of node H_k, b is the bias, and ψ is also an activation function.
The bias node is an extra node added to each layer (except the output layer). It is an extra argument to the activation function and acts as the independent term in a regression model (i.e., as a scaler). While the nodes are connected with the input layer through the weights, the bias node is not connected to the input layer. The argument of the activation function depends on the input nodes and the respective weights and is then used to complete the connection from the input nodes to each hidden node. It is the activation function that scales the argument to a different range, introducing non-linearity and making the model susceptible to non-linear combinations between the input variables.
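Equations (3) and (4) amount to a few lines of linear algebra. The following NumPy sketch (function and argument names are ours, not from the paper) implements the single-hidden-layer forward pass:

```python
import numpy as np

def forward(x, W, b_hidden, v, b_out, phi, psi):
    """Single-hidden-layer feedforward pass, following Equations (3) and (4):
    H_k = phi(sum_i w_{k,i} x_i + b_k) and f = psi(sum_k v_k H_k + b)."""
    H = phi(W @ x + b_hidden)   # hidden layer: weighted inputs plus bias
    return psi(v @ H + b_out)   # output: weighted hidden nodes plus bias
```

With identity activations, the network collapses to an affine model; the non-linearity comes entirely from phi and psi.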
The universal approximation theorem states that, under mild assumptions on the activation function, a feedforward network with one hidden layer and a finite number of neurons is able to approximate continuous functions on compact subsets of R^n. Intuitively, this theorem states that, when given appropriate parameters, a simple NN can represent a wide variety of functions. One of the first versions of the theorem was proven by Cybenko (1989), for sigmoid activation functions. Leshno et al. (1993) later showed that the class of deep neural networks is a universal approximator if and only if the activation function is not polynomial, although for an NN with a single hidden layer, the width of such networks would need to be exponentially large. Hornik et al. (1989) showed that it is the multilayer feedforward architecture itself, rather than the specific choice of activation function, that gives NNs the potential of being universal approximators. The usage of NNs as approximators for complex and multidimensional problems was reinforced by Barron (1993) with a result regarding the accuracy of the approximation of functions, which implies that the number of elements of an NN does not have to increase exponentially with the space dimension to maintain errors at a low level.
Besides the classical sigmoid function, f(x) = 1 / (1 + e^{−x}), which can be seen on the left of Figure 2, several other activation functions have been considered in the literature. Krizhevsky et al. (2012) empirically compared the sigmoid function to a nonlinear function called the Rectified Linear Unit (ReLU), concluding that it consistently improves NN training, decreasing its error. Lu et al. (2017) showed that an NN of width n + 4 with ReLU activation functions is able to approximate any Lebesgue integrable function on an n-dimensional input space with respect to the L^1 distance, if the network depth is allowed to grow. Hanin and Sellke (2017) showed that an NN of width n + 1 suffices to approximate any continuous function of n-dimensional input variables. Xu et al. (2015) investigated variants of ReLU, such as leaky ReLU. Although the authors recognized the need for a rigorous theoretical treatment of their empirical results, they showed that leaky ReLU seems to work consistently better than the original ReLU.
Figure 2. The sigmoid function (left) and the ReLU family (right), as in Equation (5). The difference between the ReLU functions is in the negative part of the function, where in the case of leaky ReLU (red line), a can take a small value (0.01), while for ReLU, a is zero.
The general equation for both the classical ReLU and leaky ReLU is given by:

f(x) = x, if x > 0; ax, otherwise, for 0 ≤ a < 1. (5)

The classical ReLU and an instance of a leaky ReLU are represented on the right of Figure 2. Following the most recent literature, in our option pricing NN, we opted for leaky ReLU (with a = 0.1) instead of the traditional sigmoid function or the standard ReLU; leaky ReLU has the advantage of avoiding zero gradients.
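The ReLU family of Equation (5) is straightforward to implement. A possible NumPy version, with the leaky parameter defaulting to the a = 0.1 value used in our models (a = 0 recovers the classical ReLU):

```python
import numpy as np

def leaky_relu(x, a=0.1):
    """Leaky ReLU of Equation (5): f(x) = x for x > 0 and a*x otherwise,
    with 0 <= a < 1; a = 0 gives the classical ReLU."""
    return np.where(x > 0, x, a * x)
```

Because the negative branch has slope a > 0, the gradient never vanishes, which is the stated motivation for preferring leaky ReLU.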
We note that the universal approximation theorem states that, given appropriate parameters, we can represent a wide variety of interesting functions using simple NN, but it does not touch upon the algorithmic learnability of those parameters. The most common learning procedure was introduced by Rumelhart et al. (1985) and is known as backpropagation. Under this setting, the weights of the nodes are iteratively adjusted in order to minimize the so-called cost function.
The cost function rates how well the NN does on the whole set of observations. Taking the Mean Squared Error (MSE) as an example of a cost function, we get:

C(θ) = (1/n) Σ_{j=1}^{n} ( f_j(θ) − f*_j )^2, (6)

where n is the number of observations, f_j(θ) is the NN output value, for θ the set of parameters, and f*_j is the real observed market value. Backpropagation adjusts the connection weights to correct for each error found; the error amount is effectively split among the connections. Technically, backpropagation computes the gradient of the cost function at a given state with respect to the weights. The weight updates are commonly done using Gradient Descent (GD):

θ_{t+1} = θ_t − α ∇C(θ_t), (7)

where t is the iteration step, α is the learning rate, and ∇C the gradient of the cost function. The choice of α should be made carefully, as values near one could cause the algorithm to become unstable and oscillate, missing a global/local minimum in one iteration, whereas values near zero can converge to a non-optimal solution and also slow the convergence.
See LeCun et al. (2012) for an overview of learning rate issues. The backpropagation learning algorithm:
• Starts with random or pre-selected weights at t = 0.
• Then, for each observation, the weights are updated at step t according to:

w_{k,i}(t+1) = w_{k,i}(t) − α ∇w_{k,i},  v_k(t+1) = v_k(t) − α ∇v_k, (8)

where the cost function gradients, ∇w_{k,i} and ∇v_k, can easily be obtained by applying the chain rule, e.g.:

∇w_{k,i} = ∂C/∂w_{k,i} = (∂C/∂f) (∂f/∂H_k^out) (∂H_k^out/∂H_k^in) (∂H_k^in/∂w_{k,i}), (9)

where "in" and "out" stand for the input and output of each NN node, respectively.
• Following this logic for every observation, the total error should decrease with each additional iteration step.
Stochastic Gradient Descent (SGD) was proposed by Bottou (2010) to deal with large-scale problems. In SGD, instead of updating the parameters after iterating once through the full training set, i.e., after one epoch, we update the weights after randomly choosing one observation. It is expected that the gradient based on the single observation is an approximation of the expected gradient of the training set. In our option application, we use a variation of the SGD, where instead of updating the weights after each observation, we set a size for a subset called the batch size and update the parameters for each randomly chosen subset of that size, within the same epoch. This process accelerates the computation time, prevents numerical instability, and is more efficient to use in large datasets, as shown in LeCun et al. (2012) and Bottou (2010). Therefore, when calibrating an NN, one needs to set the number of epochs (iterations) and the batch (random subsets) size that conditions the number of parameter updates per epoch.
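The epoch/batch mechanics described above can be illustrated on a toy problem. The sketch below (a stand-in for the NN cost function; all names are ours) runs mini-batch SGD on a noiseless linear least-squares problem, reshuffling the data each epoch and updating the weights once per randomly drawn batch:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, epochs=2000, batch_size=4, seed=0):
    """Mini-batch SGD for a linear model y ~ X w under the MSE cost.
    One epoch is a full pass over the shuffled training set; the
    parameters are updated once per batch within each epoch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad                       # gradient-descent update
    return w
```

On consistent (noiseless) data, the iterates converge to the exact least-squares solution; with noisy data, a decaying learning rate is typically needed.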
Finally, in machine learning methods, it is common to scale data to a specific range (zero to one, for example) in order to increase the accuracy of the models and to allow the loss function to find a global, or local, minimum. The transformation we use is given by:

x'_i = (x_i − X_min) / (X_max − X_min), (10)

where x_i is the value to normalize and X_min and X_max are, respectively, the minimum and maximum values of the range. This transformation is also motivated by the fact that the input variables have different scales; output variables may be scaled or not, as they have a unique scale. The learning algorithm of the multilayer perceptron can be affected by the different scales of the variables, and the loss function can fail to converge to the local, or global, minimum. This is currently an active area of research in machine learning. The problem of finding a local or global minimum was first presented by Rumelhart et al. (1985), where the authors concluded that although the learning algorithm finds a solution in almost every practical try, it does not guarantee that a solution can be found. In terms of the global minimum, Choromanska et al. (2015) focused their study on the non-convexity of the loss function, which leads learning algorithms to find a local minimum instead of the global minimum. The authors showed that, although there is a lack of theoretical support for optimization algorithms in NNs, the global minimum is not relevant in practice because reaching it may lead to overfitting.
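The min-max transformation of Equation (10) can be sketched as follows (illustrative helper, not the authors' code; library routines such as scikit-learn's MinMaxScaler implement the same mapping):

```python
import numpy as np

def min_max_scale(x):
    """Min-max normalization: maps each value of x linearly into [0, 1]
    via (x_i - X_min) / (X_max - X_min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```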

Data Description, Treatment, and Statistics
We collected our data from Bloomberg. For each option, besides daily close prices on the option itself, we collected underlying stock prices, strike prices, maturities, volumes, and implied volatilities 1. We also retrieved the quarterly dividend paid per share by each company throughout the studied period. For the risk-free interest rate, U.S. Treasury rates for different maturities were used; for each option, we selected the rate with the maturity closest to that of the option 2.
We applied a liquidity filter-a minimum amount of 20 trades per trading day-to ensure our analysis was based on reliable data. Furthermore, due to Bloomberg gathering information on trades during the last day of trading of an option, some trades presented zero maturity, so we had to remove those observations. Finally, we eliminated all missing values.
From the original 37,952 observations, we ended up with 21,111, which is still a much larger number than what can be found in the literature. Compare, for instance, with Kelly and Shorish (1994), Yao et al. (2000), or Kohler et al. (2010), who used 1369, 17,790, and 2000 observations, respectively. Table 1 presents the basic statistics of our sample input variables; Figure 3 shows the associated histograms and Figure 4 the box plots.
1 For implied volatility, we used the value determined by Bloomberg's quantitative analytics department "Equity Implied Volatility Surface".
The input variables are: Bloomberg's implied volatility, moneyness (ratio between stock price and strike price), maturity (time to maturity, here reported in days), dividend yields (also known as relative dividends), and the interest rate (FED risk-free rates). Besides implied volatility, which varied mostly across the four different underlying companies, the variables that most impacted the output variable (put prices) were moneyness and maturity. Figure 5 shows the put price histogram and a moneyness versus maturity heat graph of put prices. In terms of the interest rate, there were no big variations in the term structure throughout the three months under consideration. In addition, the short end of the term structure was quasi-flat.
We considered as At-The-Money (ATM) those options within a five percent deviation from the current stock price. A put option is In-The-Money (ITM) if the stock price is below the strike price, which means that the moneyness, defined before as stock price divided by strike price, is below one. Therefore, below the 0.95 threshold are the ITM options, and above the 1.05 threshold are the Out-of-The-Money (OTM) options. Table 2 shows the percentage of each category for each underlying company. From Figure 3, we can also see that the moneyness distribution was heavily centered around one, with some extreme values in OTM options, and that, as expected, maturity was heavily skewed and almost fully concentrated in maturities below one year (less than 365 days). Table 2 emphasizes that only 16% of the options in our sample were ITM and only about 18% had maturities longer than six months. Although both the moneyness and maturity "biases" were expected, as short-term maturities and ATM/OTM options are, by far, the most traded in the market, this is an issue that may influence the NN models' precision when learning about ITM and/or longer maturity options.
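The moneyness buckets described above can be encoded as a small helper (the function name is ours; the 0.95/1.05 thresholds follow the five percent ATM band in the text):

```python
def moneyness_bucket(stock_price, strike):
    """Classify a put option by moneyness (stock price / strike):
    below 0.95 is ITM, above 1.05 is OTM, otherwise ATM."""
    m = stock_price / strike
    if m < 0.95:
        return "ITM"
    if m > 1.05:
        return "OTM"
    return "ATM"
```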
Descriptive statistics per company can also be found in Table 3. Each underlying company has different characteristics, which positively impacts the machine learning process.

Training and Test Datasets
In order to train/develop and then test our NN models, we divided our data into a training dataset and a test dataset. The training dataset is the set of examples used to train the learning process parameters (weights, etc.) and to fine-tune (calibrate) the model. The test dataset is used to provide an unbiased evaluation of the final model fitted on the training dataset. As is common in machine learning, the division into the two datasets was made randomly, with the training set consisting of 80% of the full dataset. It was, thus, on the remaining 20%—still more than 4000 options—that we tested the performance of both our NN models and the LSM method.
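A random 80/20 split of this kind can be sketched as follows (illustrative helper with names of our choosing; in practice, library routines such as scikit-learn's train_test_split do the same):

```python
import numpy as np

def random_split(data, train_frac=0.8, seed=0):
    """Randomly partition `data` into a training set (train_frac of the
    observations) and a test set (the remainder)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    cut = int(train_frac * len(data))
    train = [data[i] for i in idx[:cut]]
    test = [data[i] for i in idx[cut:]]
    return train, test
```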
Using Kolmogorov-Smirnov tests, one can check, feature-by-feature, whether the distributions of the test and training sets are similar. Here, with the random sample selection, similar distributions were expected, as confirmed in Table 4 and Figure 6. If we were to use our NN models on other samples, it would seem good practice to check that the training set used to train a particular NN model and the test set we want to apply the model to are in fact representative of the same reality.
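Such a feature-by-feature comparison can be run with SciPy's two-sample Kolmogorov-Smirnov test. A sketch on synthetic data (the variable names and the simulated feature are illustrative only):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(size=1000)  # e.g., one input feature, training set
test_feature = rng.normal(size=250)    # same feature, test set

res = ks_2samp(train_feature, test_feature)
stat, p_value = res.statistic, res.pvalue
# A high p-value means we cannot reject that the two samples come from
# the same distribution, i.e., the split looks representative.
```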

LSM Specification
As mentioned in the Methodology Section, we considered the underlying stock to follow a GBM as in Equation (1). Although the equation variables r (interest rate), q (dividend yield), and σ (volatility) were assumed constant, it is important to recall that, in our test set, each option had a different set of parameters.
Besides the four different underlying stocks, we also had different trading dates t, with different initial stock values, dividend yields, different option maturities T, and interest rate observations that depended on both the trading date t and the maturity T. Finally, we also opted to use Bloomberg implied volatilities for σ, which are option specific. In practical terms, this meant that for each option of our test set—4222 options—we had to simulate the underlying asset and use LSM at each step of the cash-flow option matrix (as explained in Section 3.1) to obtain the output—its put price. For each simulation, we used a total of 1000 paths and 50 time steps.
By simulating the underlyings using option specific details, we improved over the standard GBM model, as we also took into account market data, in particular the existence and evolution of interest rates' term structure and volatility smiles, over the test sample period.

NN Models' Calibration
In order to minimize the error of an NN model, we needed to calibrate it within the training dataset. We used some beforehand-made decisions and also optimized parameters.
For both of our NN models, in order to start the learning process, we gave random initial weights, drawn from the normal distribution with mean zero and standard deviation 0.5, to the hidden nodes. These starting values were a random choice and changed immediately after the first learning cycle ended. In both cases, we also considered a learning rate α = 0.005 with a decay per epoch of 1 × 10−6. The learning rate defines the size of the corrective steps that the algorithm takes to adjust for the errors obtained for each observation and leads to a trade-off: a higher learning rate reduces the training time, at the cost of lower accuracy, while a lower learning rate takes longer, but is more accurate.
For our NN Model 1, we opted to consider an NN as simple as possible with only one hidden layer, and a minimal number of inputs. In particular, given the small changes in the dividend yield and interest rates, we did not even consider them as inputs.
In our NN Model 2, we considered an increased number of inputs, adding to all the regular option inputs also dummy variables per company and the average (training set) put price per company. The dummy variables take only zero or one as values. The introduction of such variables is an alternative to training a different NN for each company. The average put price per company works as an anchor, making the learning process faster. In terms of hidden layers, we compared NNs with 2, 3, and 4 layers and opted for 3 layers as the best performing model.
All other parameters were optimized, i.e., the alternatives were compared through a cross-validation test, where we looked at the training dataset MSE as the comparison metric:

MSE = (1/n) Σ_{i=1}^{n} e_i^2,  e_i = y_i − y*_i, (11)

where e_i is the error of each prediction, with y_i representing the prediction of the model and y*_i the true value of the option.
In NN Model 1, for the number of nodes in our one hidden layer, we tested from three to 10 nodes. When comparing the values (see Figure 7), we noted that 7, 8, and 9 nodes were the ones with the lowest variance in MSE and also the lowest mean values. For NN Model 1, we thus opted for nine hidden nodes in our one hidden layer. Likewise, for NN Model 2, we optimized the number of nodes for the chosen three layers, considering between three and 20 nodes for each layer. The optimization was performed simultaneously over the three layers. Considering all combinations, the best performing solution was 16 nodes for Layer 1, 8 for Layer 2, and 4 for Layer 3. After testing the three activation functions previously mentioned—sigmoid, ReLU, and leaky ReLU—a leaky ReLU with a = 0.1 performed best in both models (recall Figure 2).
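For illustration, the chosen NN Model 2 architecture (three hidden layers of 16, 8, and 4 nodes with leaky ReLU activations, a = 0.1, and a single linear output node) can be sketched as a plain NumPy forward pass. The input dimension d and all names below are ours; the random normal initialization (mean 0, standard deviation 0.5) follows the calibration described above:

```python
import numpy as np

def leaky_relu(x, a=0.1):
    return np.where(x > 0, x, a * x)

def init_params(d, hidden=(16, 8, 4), seed=0):
    """Random normal initial weights (mean 0, sd 0.5) for layers d->16->8->4->1."""
    rng = np.random.default_rng(seed)
    sizes = [d, *hidden, 1]
    weights = [rng.normal(0.0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.normal(0.0, 0.5, m) for m in sizes[1:]]
    return weights, biases

def nn_model2_forward(x, weights, biases):
    """Forward pass: leaky ReLU on each hidden layer, linear output node."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = leaky_relu(W @ h + b)
    out = weights[-1] @ h + biases[-1]
    return float(out[0])
```

In practice, the fitting itself was done with TensorFlow/Keras; this sketch only makes the layer dimensions and activations concrete.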
An epoch is defined as the complete dataset training cycle. In terms of the updating sample size, commonly referred to as batch size, we assumed a batch size equal to 64. For our NN Model 1, we used 400 epochs, while for NN Model 2, we used 3000.
Finally, while in NN Model 1, we scaled both the input and output variables, in NN Model 2, we did not normalize the output variable. For this reason, the MSE values in Figure 8 have different ranges. Once we had decided the NN models' architectures and optimized their parameters—summarized in Table 5—the final step was the weight learning process (as described in Section 3.2), also known as network fitting.
In terms of learning curves for our NN models, in Figure 8, we can see the MSE curve of NN Model 1 dropping very quickly to 0.001. NN Model 2 also seemed to decrease very fast, but a closer look at Figure 9 shows that there was still room for improvement. Notable also was the high variance when converging to the local optimum. We can also observe that for NN Model 1, the learning curve for the validation set lies below the learning curve for the training set, which can indicate that Model 1 is too simple to reproduce the pricing dynamics, as it presents signs of an under-fitted model. On the contrary, we observed a different behavior for NN Model 2, as we included more layers and variables to reflect the non-linearity and complexity of the relationships between input and output variables.

Results
In this section, we compare our NN models' results with those of the LSM method, using the RMSE as the error measure and the execution time to compare how long each method takes to price the options.
The methods were implemented in Python Version 3.7.3 and run on a MacBook Pro 14.1 (macOS Version 10.14.6) with an Intel Core i5 2.3 GHz processor, 2133 MHz memory, and 8 GB of RAM.
The fitting of our NN models to the training set took only about 2 and 5 min for NN Model 1 and NN Model 2, respectively, thanks to the TensorFlow and Keras Python modules, which substantially decreased the fitting time. NN Model 1 was faster than NN Model 2 because of its simpler architecture and smaller number of input variables.
LSM took about 0.38 s to price a single option. Once fitted, the NN models' pricing of an option was immediate. Even taking the total fitting time into account and dividing it by the number of options in the test set, we got 0.07 s per option for NN Model 2, less than 20% of the LSM time, and 0.03 s for NN Model 1.
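The amortized figure can be checked with back-of-the-envelope arithmetic; the test-set size below is a hypothetical value inferred from the reported numbers, not stated in the text:

```python
# Back-of-the-envelope check of the amortized NN pricing cost.
fit_seconds = 5 * 60           # NN Model 2 fitting time (~5 min)
n_test_options = 4300          # hypothetical test-set size (inferred)
nn_per_option = fit_seconds / n_test_options   # ~0.07 s per option

lsm_per_option = 0.38          # reported LSM time per option
ratio = nn_per_option / lsm_per_option         # well under 20%
```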
To compare the model results, we used the RMSE, given by RMSE = MSE^0.5, with the MSE as previously defined in Equation (11), but now computed only over the test dataset. Table 6 presents the RMSE results. The RMSE can be interpreted as the average USD ($) deviation from the true option value. Overall, NN Model 2 was the best performing model, presenting a total average deviation of $0.16 per option, while for NN Model 1, the average deviation was $0.22, and for LSM, $0.26. This means that overall, for our randomly selected test dataset, on average, the NN models outperformed LSM, considered in the literature one of the most accurate methods to price American options. Procter & Gamble was the company with the highest deviations, no matter the model, probably because its underlying was the one with the highest spot values (recall Table 3). Comparing across models, NN Model 2 was the one where this effect was least pronounced.
These are, of course, just general results. Below, we analyze the performance of each model, both in absolute and relative RMSE terms, per moneyness and maturity class. Table 7 presents results per company and for different moneyness and maturity. Figure 10 illustrates the same results. Since these are absolute results, i.e., values in USD ($), naturally, the more expensive ITM options tended to present higher deviations than ATM or OTM options. See Figure 10a. This bias, or "smirk", is present on the graphs of the four companies. More importantly, in all instances, our NN Model 2 (orange line) had the lowest errors. When comparing our naive NN Model 1 (blue line) and LSM (grey line), the evidence was mixed. For the underlyings GM and KO, NN Model 1 did better than LSM, no matter the moneyness class. When BAC stock was the underlying, LSM showed lower errors than NN Model 1, with error levels above, but close to, NN Model 2. Finally, for the PG stock, LSM and NN Model 1 showed very close results to one another, with NN Model 1 performing slightly better for ITM options and LSM slightly better for ATM and OTM options.
In terms of maturities, the longer the maturity, the more expensive the option. This expected bias in model performance was evident for NN Model 1 and LSM, but interestingly enough not for NN Model 2, which tended to present a quasi-flat absolute error term structure. See Figure 10b. As before, our NN Model 2 performed better than the other two models for all maturities and companies, performing remarkably well in the case of options with maturities longer than six months. When dealing with a variety of options with various prices, the relative RMSE gives the average error as a percentage of option prices. Figure 11 and Table 8 present the relative results, with RMSE as a percentage of put prices. (TensorFlow is an open-source library used in machine learning models; for further details, see Abadi et al. (2016). Keras is an NN library capable of running on top of TensorFlow and other libraries; it provides an API for building deep learning models quickly and efficiently.)
The overall message is the same as with the absolute results: NN Model 2 was the best model, and our naive NN Model 1 performed close to LSM, beating it in some instances. The ranges of relative errors of NN Model 2 were the smallest in all categories of moneyness and maturity. OTM options had the highest percentage errors for all models; we must, however, remember that these are relatively inexpensive options. Still, for the underlyings under consideration, NN Model 2 errors for OTM ranged from 9.53% to 15.95%, while for NN Model 1, they ranged from 23.32% to 38.60%, and for LSM from 19.66% to 44.14%. ITM options were the ones presenting the smallest ranges of relative errors, between 4.93% and 9.45%, depending on the model and/or company under analysis. The "winning" model, NN Model 2, presented errors as a percentage of option prices considerably lower than the alternative models: around 5% for ITM, 10% for ATM, 12% for OTM, 20% for maturities of less than one month, 12% for maturities between one and six months, and 6% for maturities over six months.
To conclude the analysis of our NN models and get a better feeling for the relative importance of each input, we used a permutation method. For each variable, we randomly shuffled the column of that variable, ceteris paribus, and re-ran the predictions on both NN models, comparing the new RMSE with the original RMSE from the unpermuted data, in terms of percentage deviation. Figure 12 shows the ordered feature importance. For instance, when in NN Model 2, we randomized the interest rate values, the total RMSE increased by 27%, which was not as substantial as the absence of a correct strike price, which increased the RMSE by approximately 6000%. It is not surprising that receiving the correct spot and/or strike prices as input was key to obtaining the correct option prices. The graphs are more interesting when read in comparative terms. For instance, it is clear that for NN Model 2, the dummy variables were important, with relevance similar to the option maturity and greater than implied volatility. Table 9 shows a measure of the relative importance of each input for both NN models. This relative importance was computed using the permutation technique results, dividing each input's RMSE percent increase by the sum of all inputs' RMSE percent increases, for each model.
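The permutation method just described can be sketched generically as follows; this is an illustrative implementation, not the authors' code, and the toy "model" and data are invented for the example:

```python
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Percent increase in RMSE when each input column is shuffled,
    ceteris paribus -- a generic sketch of the permutation method."""
    rng = np.random.default_rng(seed)
    base_rmse = np.sqrt(np.mean((predict(X) - y) ** 2))
    importance = {}
    for j in range(X.shape[1]):
        X_perm = X.copy()
        rng.shuffle(X_perm[:, j])  # randomize one column only
        rmse = np.sqrt(np.mean((predict(X_perm) - y) ** 2))
        importance[j] = 100.0 * (rmse - base_rmse) / base_rmse
    return importance

# Toy usage: a "model" that depends only on column 0; column 1 is a
# constant, so shuffling it leaves the RMSE unchanged.
X = np.column_stack([np.linspace(0.0, 1.0, 50), np.ones(50)])
y = 2.0 * X[:, 0] + 0.05  # predictions are off by a constant 0.05
imp = permutation_importance(lambda X: 2.0 * X[:, 0], X, y)
```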

Conclusions
Neural Networks (NNs) have many advantages when it comes to option pricing, as explored throughout this work. Still, some disadvantages remain, such as the necessity of a complete and large dataset to train the modeling structure. This implies, for instance, that more exotic options and, in particular, over-the-counter options cannot be priced as reliably by an NN as derivatives more readily available in the markets and with a higher trading volume. A further note of advice is that, when using an NN, the model is calibrated to past data, which means that any change in the future composition of financial markets, for example, a big financial crisis, might modify option values worldwide, leading to mispriced options and the need to retrain the models.
In terms of the limitations of the analysis presented in this study, we note that although the number of options used here was larger than in most studies in the literature, it was still scarce, especially when compared to the datasets used to calibrate NN models in other fields. That is the case, for instance, in image recognition where NNs are the main methodology applied.
Regarding execution time, while LSM produced results in a reasonable amount of time for a single option, if a more complicated underlying process, for example, a jump process with stochastic volatility, were added to the LSM, the execution time might be compromised. On the contrary, once calibrated, the NN had an immediate execution time and could beat traditional methods also in terms of error performance. This study compared the classical LSM with two proposed NN models, i.e., our aim was to compare a pure simulation method to pure NN alternatives. Thus, we did not consider "mixed models" where NNs are used within the LSM approach to estimate continuation values (as in Kohler et al. (2010)).
This study focused on training two NN models to a dataset from American put options with four companies' stocks as the underlyings; each of the stocks from a different sector of activity. Despite using all traded options on those underlyings, we considered a relatively small time frame: from December 2018 until March 2019. It could be meaningful to analyze if an NN trained for a specific sector would yield better results, and if an NN trained on a complete year would alter the outcome of the results.
Nonetheless, the results of this study were extremely promising, in terms of NN applicability. Our simplest (naive) model, NN Model 1, performed at a level similar to or better than LSM, which is considered a reference in the empirical literature. As for the proposed NN Model 2, it outperformed by far both LSM and NN Model 1. The results were robust across underlyings, maturities, and levels of moneyness.
Should NNs continue to deliver good results, investment companies could start using them to price an option and complete a trade instantaneously.

Funding: This work was partially supported by Project CEMAPRE/REM-UIDB/05069/2020, financed by FCT/MCTES (Portuguese Science Foundation) through national funds.