A GARCH Model with Artificial Neural Networks

Abstract: In this paper, we incorporate a GARCH model into an artificial neural network (ANN) for financial volatility modeling and estimate the parameters in Tensorflow. Our goal is to better predict stock volatility. We evaluate the performance of the models using the mean absolute errors of powers of the out-of-sample returns between 2 March 2018 and 28 February 2020. Our results show that our modeling procedure with an ANN can outperform the standard GARCH(1,1) model with standardized Student's t distribution. Our variable importance analysis shows that Net Debt/EBITDA is among the six most important predictor variables in all of the neural network models we have examined. The main contribution of this paper is that we propose a Long Short-Term Memory (LSTM) model with a GARCH framework, because LSTM can systematically take into consideration potential nonlinearity in the volatility structure at different time points. One advantage of our research is that the proposed models are easy to implement: they can be run in Tensorflow, a Python package that enables fast and automatic optimization. Another advantage is that the proposed models enable variable importance analysis.


Introduction
GARCH models are popular in the finance and economics literature. Bollerslev [1] introduced a GARCH(1,1) model in which the residual term follows a standardized Student's t distribution. Nikolaev et al. [2] suggested that the GARCH model takes into account the dynamics of time-dependent variance and therefore incorporated a GARCH model into nonlinear neural networks. So and Yu [3] compared multiple GARCH models on their performance in value at risk estimation. So and Wong [4] developed expected shortfall and median shortfall estimation under GARCH models. Chen and So [5] extended the GARCH model to have threshold nonlinearity. Chen et al. [6] studied Markov switching GARCH models for volatility forecasting. So and Xu [7] used GARCH models to capture more intraday information. Efimova and Serletis [8] modeled oil, natural gas, and electricity markets using a multivariate GARCH model because it allows for rich dynamics in the variance-covariance structure; their finding is that higher precision in key estimates can be achieved by using a less computationally demanding model. Kristjanpoller and Minutolo [9] extended their model by applying an artificial neural network (ANN) to the GARCH model. Zhipeng and Shenghong [10] extended the BEKK-GARCH model by applying a Markov regime switching method in which the incorporated model changes with time; consequently, their model building process requires much more time and computational power. In addition, their objective functions are deterministic, while our objective function takes into account the probabilistic nature of returns: their objective functions are the heteroscedasticity-adjusted mean squared error and the heteroscedasticity-adjusted mean absolute error, whereas ours is the negative loglikelihood function.
There are clear differences between our proposed models and the model of Kim and Won [16]. Firstly, their LSTM model inputs include the parameters of GARCH, EGARCH, and EWMA at different times in order to measure the trend factors at different times, indicating that they carried out multiple optimizations to calculate the inputs. This requires more training time and may accumulate errors in the inputs owing to the substantial number of objective functions being optimized. Secondly, their model is more complex than our proposed models, meaning that it is more prone to overfitting. Thirdly, their objective functions are deterministic, while our objective function takes into account the probabilistic nature of returns: their objective functions are the mean absolute error (MAE), the mean squared error (MSE), the heteroscedasticity-adjusted MSE, and the heteroscedasticity-adjusted MAE, whereas ours is the negative loglikelihood function.
The main contribution of this paper is that it proposes an LSTM model with a GARCH framework. LSTM can systematically take into consideration the potential nonlinearity in a volatility structure at different time points. Therefore, our models are more robust and comprehensive in predicting the volatility of returns. To examine the accuracy of our models, we visualize the recurrent network model performances and evaluate metrics of model performance. To visualize the performances, we compare the actual returns with the predicted standard deviations and compare the actual squared returns with the predicted variances. In general, the predicted standard deviations increase when the returns fluctuate rapidly. For some of the models, the actual $r_t^2$ and the predicted variance are very close. In addition, to evaluate metrics for the recurrent network models, we calculate the mean absolute errors of absolute returns for the recurrent network models and compare them with those of the GARCH(1,1) model with standardized Student's t distribution. The results show that two of the recurrent network models outperform the GARCH(1,1) model with standardized Student's t distribution. Because the models are complex and our aim is to understand them better, we calculate the importance of each variable in the data set. The idea of variable importance in this paper is the difference in loglikelihood after removing the variable of interest. Our results show that Net Debt/EBITDA is important in all of the recurrent network models we have examined.
There are two main advantages to the proposed models. The first is that they are easy to implement because the program runs are carried out in Tensorflow. As mentioned in Heaton et al. [19] and Chen et al. [20], it is convenient to set up ANN architectures in Tensorflow. The computational efficiency, demonstrated in our numerical experiments, is also an attractive feature of Tensorflow. In Nikolaev et al. [13] and Kristjanpoller and Minutolo [9], it was necessary to write down and compute the derivatives. In Tensorflow, with a standard layer such as LSTM, the optimization is automatic and fast once the objective function, the network structure, and a commonly used optimization algorithm are defined. The second advantage is that the proposed models enable variable importance analysis. Nikolaev et al. [13], Nikolaev et al. [2], Kristjanpoller and Minutolo [9], and Kristjanpoller and Minutolo [14] did not provide variable importance analysis and therefore did not allow readers a thorough understanding of which variables are important in their models. Our paper provides a metric for assessing variable importance to demonstrate which variables matter.
Our paper is structured as follows. Section 2 presents our proposed network architecture and the objective function. Section 3 describes the data and its processing. Section 4 describes the training specifications. Section 5 presents the results of running the models.

Network Architecture
To forecast variance, the methodology in this study follows a hybrid model: an LSTM model within a GARCH framework. The model first assumes that the return $r_t$ follows a standardized Student's t distribution with conditional variance $h_t$. The conditional variance $h_t$ is then modeled by LSTM. The objective function for parameter estimation is the negative loglikelihood.
The GARCH model is commonly used for return modeling because it takes into account the dynamic nature of the conditional variance of the return. The GARCH(1,1) model can be expressed as
$$r_t \mid \mathcal{F}_{t-1} \sim N(0, h_t), \quad (1)$$
$$h_t = \alpha_0 + \alpha_1 r_{t-1}^2 + \beta_1 h_{t-1}. \quad (2)$$
The stationarity condition of the GARCH(1,1) model is $\alpha_1 + \beta_1 < 1$. In Formulas (1) and (2), the return at time t, $r_t$, follows a normal distribution with mean zero and variance $h_t$, given the information $\mathcal{F}_{t-1}$ up to time $t-1$. The variance at time t is a linear combination of the conditional variance and the squared return at time $t-1$.
In the GARCH(1,1) model, the current conditional variance is a linear function of the conditional variance and the squared return on the previous day. However, this assumption may not hold because the relationship may not be linear. Nikolaev et al. [13] showed that a dynamic recurrent network yields results with improved statistical performance. Nikolaev et al. [2] showed an improved result after using a nonlinear recurrent network. The target variable return can be modeled as
$$r_t = \sqrt{h_t}\,\varepsilon_t, \quad \varepsilon_t \sim T(0, 1, \upsilon),$$
where $T(0, 1, \upsilon)$ is Student's t distribution with mean 0, variance 1, and degrees of freedom $\upsilon$. Assume that $x_t$ is an information vector at time t.
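As an illustration of Formulas (1) and (2) with standardized Student's t innovations, the following sketch simulates a GARCH(1,1) path in NumPy. The parameter values are hypothetical, chosen only for the demonstration:

```python
import numpy as np

def simulate_garch11_t(n, alpha0, alpha1, beta1, nu, seed=0):
    """Simulate r_t = sqrt(h_t) * eps_t with h_t = alpha0 + alpha1*r_{t-1}^2
    + beta1*h_{t-1}, where eps_t follows a standardized Student's t."""
    rng = np.random.default_rng(seed)
    # scale classical t draws so that the innovations have unit variance
    eps = rng.standard_t(nu, size=n) * np.sqrt((nu - 2) / nu)
    r, h = np.empty(n), np.empty(n)
    h[0] = alpha0 / (1 - alpha1 - beta1)  # start at the unconditional variance
    r[0] = np.sqrt(h[0]) * eps[0]
    for t in range(1, n):
        h[t] = alpha0 + alpha1 * r[t - 1] ** 2 + beta1 * h[t - 1]
        r[t] = np.sqrt(h[t]) * eps[t]
    return r, h

r, h = simulate_garch11_t(5000, alpha0=0.05, alpha1=0.08, beta1=0.90, nu=8)
```

Here $\alpha_1 + \beta_1 = 0.98 < 1$, so the stationarity condition holds and every simulated $h_t$ remains positive.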
The architecture of $h_t$ is shown in Figure 1. In the model, we use the data on the previous 20 days $(x_{t-20}, \ldots, x_{t-1})$ to predict the conditional variance of the current return, $h_t$. If the period used for predicting the variance of the current return is too long, the risk is that prediction accuracy decreases because the earlier data may not reflect the future return structure. If the period is too short, the risk is that the available information is not fully utilized to predict the variance of the returns. However, the number of days of data used for prediction can be changed according to users' preferences.
Assume that the input row vector to the LSTM cell is $x_{\ell-1}$. The LSTM cell can be represented by
$$y_j^{out}(\ell) = f_j^{out}\Big(\sum_m w_{j,m}^{out} x_{\ell-1,m}\Big), \quad (3)$$
$$y_j^{in}(\ell) = f_j^{in}\Big(\sum_m w_{j,m}^{in} x_{\ell-1,m}\Big), \quad (4)$$
$$s_{c_j}(\ell) = y_j^{F}(\ell)\, s_{c_j}(\ell-1) + y_j^{in}(\ell)\, g_{c_j}\Big(\sum_m w_{j,m} x_{\ell-1,m}\Big), \quad (5)$$
$$y_{c_j}(\ell) = y_j^{out}(\ell)\, u\big(s_{c_j}(\ell)\big), \quad (6)$$
where $x_{\ell-1,m}$ is the m-th element of $x_{\ell-1}$ for $\ell = t-19, \ldots, t$, and $y_j^F(\ell)$ is the forget gate activation with parameters $w_{j,m}^F$. The terminology of LSTM models can be found in Hochreiter and Schmidhuber [21] and Gers et al. [22]. Here, $w_{j,m}^{out}$, $w_{j,m}^{in}$, and $w_{j,m}^{F}$ are parameters to be estimated in the model; $f_j^{out}(\cdot)$ and $f_j^{in}(\cdot)$ are the activation functions of the output gate and input gate, respectively, and the default activation functions in Tensorflow are sigmoid functions; $s_{c_j}$ is the internal state; $g_{c_j}(\cdot)$ is an activation function for cells, and the default in Tensorflow is the hyperbolic tangent function; $u(\cdot)$ is a differentiable function that scales the memory cell output from the internal state, and the default $u(\cdot)$ in Tensorflow is the hyperbolic tangent function. Equations (3) and (4) compute the output and input gate activations, respectively. In Equation (5), the internal state $s_{c_j}(\ell)$ is calculated by adding the squashed and gated input to the gated previous internal state $s_{c_j}(\ell-1)$. Then, in Equation (6), the cell output $y_{c_j}(\ell)$ is calculated by multiplying the output gate activation with the internal state squashed by $u(\cdot)$. LSTM models can help capture the long-distance relationship between volatility and the explanatory variables. He et al. [23] compared an LSTM model with other neural network models when predicting students' performance in a specific course. They discussed that LSTM can memorize longer sequences and achieve better results for tasks requiring an understanding of long-distance relationships.
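To make Equations (3)-(6) concrete, here is a minimal NumPy sketch of one step of such a cell, using sigmoid gates and tanh squashing as in the Tensorflow defaults. The weights are random placeholders, not fitted parameters, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, s_prev, w_out, w_in, w_F, w_g):
    """One LSTM step: x is the input row vector, s_prev the previous internal state."""
    y_out = sigmoid(w_out @ x)                   # (3) output gate activation
    y_in = sigmoid(w_in @ x)                     # (4) input gate activation
    y_F = sigmoid(w_F @ x)                       # forget gate activation
    s = y_F * s_prev + y_in * np.tanh(w_g @ x)   # (5) internal state update
    y = y_out * np.tanh(s)                       # (6) cell output
    return y, s

rng = np.random.default_rng(0)
n_units, n_features = 4, 17
w = [rng.normal(size=(n_units, n_features)) * 0.1 for _ in range(4)]
y, s = lstm_cell_step(rng.normal(size=n_features), np.zeros(n_units), *w)
```

Because the output gate is a sigmoid and the squashed state lies in $(-1, 1)$, every cell output is bounded in magnitude by one.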
After running 20 LSTM cells as specified in Figure 1, we use a deep network to output $h_t$. Assume that $D(\cdot)$ is a deep neural network as in Feng et al. [24]. Then, $h_t$ can be represented by
$$h_t = \mathrm{softplus}\big(D(y_c(t))\big), \quad (7)$$
where $D(\cdot)$ is a deep learning architecture consisting of dense layers and $y_c(t)$ collects the LSTM cell outputs. In the deep learning architecture with function $D(\cdot)$, the number of hidden units and the number of layers are as specified in Feng et al. [24]. The activation functions need not be Rectified Linear Unit (ReLU) functions; we experimented with different activation functions and found that the model performance depends on the choice.
In the neural network, backpropagation and the chain rule are used when computing the gradients of the objective function. The LSTM layer is the first layer in Tensorflow and, therefore, we can use the default functions without worrying about gradient issues. However, for the activation functions in $D(\cdot)$, we prefer activation functions that do not cause vanishing/exploding gradient issues. To assess this, we tried a ReLU and the identity function. In our experiment, the identity function works better for this network because ReLU may introduce a zero gradient in some regions.
To model non-negative volatility, a mapping from the neural network output to a non-negative number is required. Dugas et al. [25] introduced the softplus function as the primitive of the sigmoid. They discussed that the softplus function is differentiable and has positive derivatives. Senior and Lei [26] explained that the reason for using a softplus function in a neural network is that, unlike a ReLU, a softplus function does not have regions with zero gradient, which makes training easier. They found that a neural network with a softplus function shows improved performance and faster convergence. Consequently, we apply a softplus function to the output from the LSTM cells in (7).
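A numerically stable softplus, $\mathrm{softplus}(z) = \log(1 + e^z)$, can be sketched as follows; its output is strictly positive for any real input, which is exactly what a variance requires:

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)) computed stably via logaddexp; strictly positive,
    # smooth, and with gradient sigmoid(z), which never vanishes exactly
    return np.logaddexp(0.0, z)
```

For example, softplus(0) = log 2 ≈ 0.693, and even a large negative network output maps to a small positive variance rather than to zero.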

Objective Function
We follow the approach of Nikolaev et al. [2], who applied a maximum likelihood approach in their proposed models. Assume that $g(\cdot; \mu, \sigma^2, \upsilon)$ denotes Student's t density function with mean $\mu$, variance $\sigma^2$, and degrees of freedom $\upsilon$. The objective function is
$$-\sum_t \log g(r_t; 0, h_t, \upsilon),$$
where $h_t$ is the conditional variance produced by the network. The default optimization in Tensorflow is to minimize the objective function, and therefore the negative loglikelihood objective function is used.
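As a sketch of this objective, the negative loglikelihood of the standardized Student's t density can be written directly with log-gamma functions (SciPy is assumed to be available):

```python
import numpy as np
from scipy.special import gammaln

def neg_loglik_std_t(r, h, nu):
    """-sum_t log g(r_t; 0, h_t, nu) for the standardized Student's t density
    with mean 0, variance h_t, and degrees of freedom nu (nu > 2)."""
    const = gammaln((nu + 1) / 2) - gammaln(nu / 2) - 0.5 * np.log(np.pi * (nu - 2))
    ll = const - 0.5 * np.log(h) - (nu + 1) / 2 * np.log1p(r ** 2 / ((nu - 2) * h))
    return -np.sum(ll)
```

In Tensorflow, the same expression written with tensor operations serves as the loss handed to the optimizer, and automatic differentiation supplies the gradients.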

Data Processing
The target stock is the S&P 500 Index. After the removal of data entries for holidays, the data consist of 17 variables with 5032 observations between 29 February 2000 and 28 February 2020 from the Bloomberg Terminal. In the data, the target variable is the daily percentage change in prices. Assume that $P_t$ is the price at time t. According to the definition from Bloomberg, the daily percentage change in prices can be written as
$$r_t = \frac{P_t - P_{t-1}}{P_{t-1}} \times 100.$$
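This is the standard one-period percentage change; with pandas it is one line (the price series below is hypothetical, used only to illustrate the formula):

```python
import pandas as pd

# hypothetical price series standing in for the S&P 500 levels
prices = pd.Series([100.0, 101.0, 99.99, 102.0])
# r_t = (P_t - P_{t-1}) / P_{t-1} * 100
returns = prices.pct_change().mul(100).dropna()
```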
Denote an indicator function as $I(\cdot)$. Variables in the data set and their summary statistics are as follows: Table 1 shows the S&P 500 Index variables collected from the Bloomberg terminal. In Table 1, Total Assets, Total Current Assets, Net Debt Per Share, Book Value per Share, and Enterprise Value have substantially higher means, standard deviations, and ranges than the other variables, indicating that additional effort in normalization or standardization may be needed to simplify the training of the neural networks.
We then carried out the following steps to create training and testing instances for the program runs:

1. Read the data with pandas in Python.
2. Normalize the data set. Assume that $x$ is a vector of values of one variable. We normalize it by $(x - \min(x)) / (\max(x) - \min(x))$.
3. Create the instances by labeling the data on the previous 20 days as the input and the current return as the target variable.
4. Split the data set into training instances (90%) and testing instances (10%).
5. Shuffle the training instances.
Normalization is crucial for simplifying the tasks in a neural network. Shavit [27] concluded that proper normalization simplifies the neural network's task and improves its ability to control the process. Jayalakshmi and Santhakumaran [28] showed that the performance of diabetes classification depends on the normalization method. Yin et al. [29] applied min-max normalization to images. Because normalization is widely used and can help simplify the task of the neural network, we opted to apply a normalization technique. The interrelationship of the data after normalization can be visualized by the correlation matrix graph shown in Figure 2. Most of the variables do not have a strong linear relationship, justifying the use of a neural network. After normalization, the instances are created by labeling the data on the previous 20 days as the input and the current return as the target variable. As shown in Figure 1, the inputs and outputs are $(x_{t-20}, x_{t-19}, \ldots, x_{t-1})$ and $r_t$ for $t = 21, \ldots, 4529$, respectively. The next step is to split the data set into training instances (90%) and testing instances (10%). The instances with the 90% earliest dates are training instances and the rest are testing instances. The target variables for training are between 28 March 2000 and 1 March 2018, while the target variables for testing are between 2 March 2018 and 28 February 2020. The final step is to shuffle the training instances so as to avoid an effect from the ordering of the training instances.
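The whole processing pipeline (min-max normalization, 20-day windowing, a chronological 90/10 split, and shuffling of the training instances only) can be sketched in NumPy on synthetic data standing in for the 17 Bloomberg variables:

```python
import numpy as np

def minmax(x):
    # column-wise min-max normalization to [0, 1]
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def make_instances(X, r, window=20):
    """Inputs: the previous `window` days of features; target: the current return."""
    inputs = np.stack([X[t - window:t] for t in range(window, len(r))])
    return inputs, r[window:]

rng = np.random.default_rng(0)
X = minmax(rng.normal(size=(100, 17)))   # synthetic stand-in for the data set
r = rng.normal(size=100)
inputs, targets = make_instances(X, r)

n_train = int(0.9 * len(targets))        # earliest 90% of dates -> training
perm = rng.permutation(n_train)          # shuffle training instances only
train_x, train_y = inputs[:n_train][perm], targets[:n_train][perm]
test_x, test_y = inputs[n_train:], targets[n_train:]
```

The test set is left in chronological order; only the training instances are shuffled, so the out-of-sample evaluation remains a genuine forecast.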

Training
We have the following specifications for the program runs:

1. The training instances are split into batches of size 500 for optimization.
2. Adam, an adaptive learning rate algorithm, is used for optimization.
3. The number of epochs the training can tolerate without improvement (the early-stopping patience) is 10.
4. Student's t distribution in Python is standardized as in GARCH-t (Bollerslev [1], So and Xu [7]).
During training, the instances are split into batches of size 500 for optimization. According to Goodfellow et al. [31], with the minibatch method, the gradient is computed on more than one but fewer than the full set of training examples. They listed a number of reasons for using minibatches in deep neural networks; in particular, the accuracy of the gradient estimate offers less than linear returns as the batch size grows. They also noted that, in order to maintain stability given the high variance in the estimate of the gradient, a small learning rate needs to be set.
To address the problem of setting appropriate learning rates at each epoch, Adam [30], an adaptive learning rate algorithm, is used in the neural network training. A core idea of Adam is that it incorporates bias corrections. Commonly used optimization algorithms such as RMSProp [32] accumulate first-order and second-order moments in (exponential) moving average equations but do not take the bias correction into account. Goodfellow et al. [31] discussed that, with early stopping, the algorithm can halt whenever overfitting begins to occur. The number of epochs the training can tolerate without improvement, referred to as patience in Keras, is set to 10. The maximum number of epochs is 1000, which enables the program to continue running until early stopping. Special care is needed in defining the distribution within the Tensorflow package: Student's t distribution in Python needs to be standardized by the factor $\sqrt{(\upsilon - 2)/\upsilon}$ in order to produce the desired results.
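The early-stopping rule can be sketched independently of Keras: training halts once the validation loss has failed to improve for `patience` consecutive epochs. The following is a plain-Python stand-in for the logic of Keras's callback:

```python
def stopping_epoch(val_losses, patience=10):
    """Return the (0-based) epoch at which early stopping would halt training."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0       # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch           # patience exhausted: stop here
    return len(val_losses) - 1         # ran to the maximum number of epochs

# losses improve for three epochs, then plateau: training stops
# ten epochs after the plateau begins
halt = stopping_epoch([5.0, 4.0, 3.0] + [3.0] * 15, patience=10)
```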

Empirical Results
We have produced the network diagrams shown in Figure 3 for reference. The shape of each neural network and the names of the layers are shown in order to give a more extensive view of each architecture. For example, in Figure 3a, the input layer is of dimension 20 × 17 because we use 20 days of data with 17 variables. Architectures such as LSTM(32)+dense(16) and LSTM(32)+dense(16-8) are shown in the captions of Figure 4. LSTM(32)+dense(16) means an LSTM layer with 32 units followed by a dense layer with 16 units. LSTM(32)+dense(16-8) means an LSTM layer with 32 units followed by two dense layers with 16 units and 8 units, respectively. As LSTM(32)+dense(16-8) contains one more layer than LSTM(32)+dense(16), it contains more parameters. The reason for combining an LSTM layer with different dense layers is that we would like to see whether the difference in dense layers results in different out-of-sample mean absolute errors.
The program runs last less than two minutes on CPU-compiled Tensorflow. All of the program runs converge and stop early. To ensure that the program always returns the same results, the random seed in the neural networks is set to zero. After training, we have also plotted the training and testing negative loglikelihoods over the epochs. If the experimental designs were not good, the negative loglikelihoods would not decrease over time. Figure 4 shows the negative loglikelihood over training. As the negative loglikelihood decreases over training for all models except LSTM(32)+dense(64-32-16-8), the overall performance on the training set improved over training. We have also plotted the actual $r_t$ and the predicted standard deviation for the validation set. If the actual $r_t$ fluctuates rapidly and the predicted standard deviation does not increase, the model does not have high predictive power. Figure 5 shows that, except for LSTM(32), the predicted standard deviations increase when the returns fluctuate rapidly. To review the performance of the neural networks, the squared returns and the predicted variances in the testing set are plotted. In Figure 6, except for LSTM(32), the actual $r_t^2$ and the predicted variance are very close.
Forsberg and Ghysels [33] discussed that absolute returns predict volatility better; they showed that absolute returns are immune to jumps. From So and Xu [7] and Winkelbauer [34], we can derive the moments of absolute returns in the GARCH model as
$$E\big(|r_t|^k \mid \mathcal{F}_{t-1}\big) = h_t^{k/2} (\upsilon - 2)^{k/2} \frac{\Gamma\big(\frac{k+1}{2}\big)\,\Gamma\big(\frac{\upsilon - k}{2}\big)}{\sqrt{\pi}\,\Gamma\big(\frac{\upsilon}{2}\big)}, \quad 0 < k < \upsilon, \quad (8)$$
where $\Gamma(\cdot)$ is the gamma function. So and Xu [7] discussed that the use of $|r_t|^{1.5}$ can make the evaluation more robust. The out-of-sample mean absolute errors (MAE) in Table 2 are evaluated between 2 March 2018 and 28 February 2020. The model does not change when evaluating the errors. The following steps are taken when computing the MAE:

1. Forecast the one-step-ahead $h_t$.
2. Compute the expectations of the variables shown in Table 2 by Equation (8).
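Equation (8) can be evaluated numerically via log-gamma functions for stability; the sketch below is a minimal implementation under that formula:

```python
import numpy as np
from scipy.special import gammaln

def abs_return_moment(h, k, nu):
    """E(|r_t|^k | F_{t-1}) under a standardized Student's t with variance h,
    valid for 0 < k < nu, as in Equation (8)."""
    log_m = (k / 2) * np.log(h * (nu - 2)) \
        + gammaln((k + 1) / 2) + gammaln((nu - k) / 2) \
        - 0.5 * np.log(np.pi) - gammaln(nu / 2)
    return np.exp(log_m)
```

A quick sanity check: for k = 2 the gamma terms cancel and the expression reduces to $h_t$ itself, the conditional variance.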
Variable importance is essential for variable selection. However, there is no universal way to evaluate variable importance across models. Breiman [36] discussed that, for randomly generated input features with weak correlations, convergence is very slow and the result is poor in the random forest model. He permuted the values of the feature of interest when calculating variable importance. Two Python packages have been developed to help researchers calculate variable importance. Korobov and Lopuhin [37] suggested replacing the feature with random noise when calculating variable importance. Pedregosa et al. [38] suggested permuting the input columns of the features.
In addition to the two Python packages, caret [39], an R package that has been used for Exam Predictive Analytics by the Society of Actuaries, takes variable importance into consideration when developing machine learning models. Kuhn [39] defined different kinds of variable importance for different models. For example, in linear regression models, variable importance is defined as the absolute value of the t-statistic for each model parameter used. For the recursive partitioning model, variable importance is defined as the reduction in the loss function (e.g., mean squared error) attributed to each variable at each split, tabulated over the tree.
To observe which variables are more important, we perform a variable importance analysis. The variable of interest in the input is replaced with zero. Then, the new loglikelihood, denoted $L_{reduced}$, is computed. Assume that $L_{full}$ is the original loglikelihood. The variable importance for the variable is calculated as $2(L_{full} - L_{reduced})$.
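This zero-out procedure can be sketched generically; `loglik_fn` below is a hypothetical stand-in for a fitted model's loglikelihood evaluated on a set of instances. In the toy example, the loglikelihood depends only on variable 0, so only its importance is nonzero:

```python
import numpy as np

def variable_importance(loglik_fn, inputs):
    """2 * (L_full - L_reduced) for each variable, where L_reduced is the
    loglikelihood after replacing that variable's column with zero."""
    full = loglik_fn(inputs)
    imp = np.empty(inputs.shape[-1])
    for j in range(inputs.shape[-1]):
        reduced = inputs.copy()
        reduced[..., j] = 0.0            # zero out the variable of interest
        imp[j] = 2.0 * (full - loglik_fn(reduced))
    return imp

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20, 3))          # 5 instances, 20 days, 3 variables
y = X[..., 0].mean(axis=1)               # toy target driven by variable 0 only
loglik_fn = lambda a: -np.sum((y - a[..., 0].mean(axis=1)) ** 2)
imp = variable_importance(loglik_fn, X)
```

Zeroing variable 0 destroys the toy fit, so its importance is large and positive, while the irrelevant variables score exactly zero.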
The intuition for this likelihood-ratio metric is that the models we are using are probability models and the metric for evaluation is the loglikelihood. Note that the variable importance can be negative because the variable may not predict the variances well. Figure 7 shows the variable importance for the six models. In Figure 7, Net Debt/EBITDA is among the six most important variables for all models. In Figure 7e, the importance of Net Debt/EBITDA is extremely high compared with the other variables.

Conclusions
A nonlinear GARCH model is of great importance to researchers and academics because the relationship between the current variance and the variances in the previous period may not be linear. This paper proposes a GARCH(1,1) model with an LSTM network for the conditional variance in order to examine how important neural networks are when modeling stock returns. The appropriateness of deploying a recurrent neural network is investigated by visualizing the actual returns and predicted standard deviations, visualizing the actual squared returns and predicted variances, and tabulating the mean absolute errors. Our results show that, for most of the neural network models, the predicted standard deviations increase when the returns fluctuate rapidly. The actual $r_t^2$ and the predicted variance are very close, and two of the recurrent network models outperform the GARCH(1,1) model with the standardized Student's t distribution in terms of the MAE for $|r_t|$ and $|r_t|^{1.5}$. As some tested models do not outperform the GARCH(1,1) model, the proposed LSTM model should be assessed carefully before being applied for the prediction of returns. Our variable importance analysis shows that Net Debt/EBITDA is among the six most important predictor variables in all of the neural network models that we have examined. For the practical implementation of our ANN-GARCH approach, we can follow what we did here: split a data sample into a training set and a testing set. Since the computation is fast, we can consider a large number of ANN architectures based on experience and perform the out-of-sample testing shown in Table 2. Then, we can identify promising architectures for predicting future volatility.

Figure 2 .
Figure 2. Heat map of the correlation matrix of the data after normalization. The colors indicate the magnitudes of the correlations.

Figure 3 .
Figure 3. Architectures to be tested. The first dimension in each layer indicates the batch size. The implementation of the neural network does not require the input of the batch size; therefore, the first dimension for each layer is a question mark. For example, the first layers of the models are input layers and the input matrices are of shape (20, 17). The input and output for the input layer are (?, 20, 17). As the first layers are the same and the outputs of the final layers are of dimension one for all models, the architectures are named according to the output units of the layers excluding the first and final layers. For Figure 3c, after removing the input layer and the final layer, the LSTM layer with 32 output units is followed by a dense layer with 16 output units and a dense layer with 8 output units. Therefore, it is named LSTM(32)+dense(16-8).

Figure 5. Actual $r_t$ and the predicted standard deviation. To visualize the model performances, we plotted the actual $r_t$ and the predicted standard deviation in the same graphs. Except for LSTM(32)+dense(32), when the actual $r_t$ fluctuates more rapidly, the predicted standard deviation increases.

Figure 6. Actual $r_t^2$ and the predicted variance. To visualize the model performances, we plotted the actual $r_t^2$ and the predicted variance in the same graphs. Except for LSTM(32)+dense(32), the actual $r_t^2$ and the predicted variance are very close.

Figure 7 .
Figure 7. Variable importance. To visualize variable importance, we plotted the variable importance for each model. The higher the variable importance, the more important the variable. For all models, Net Debt/EBITDA is very important.

Table 2 .
Out-of-sample mean absolute errors (MAE) evaluated between 2 March 2018 and 28 February 2020. To evaluate the performances of the models, we computed the MAE for $|r_t|$, $|r_t|^{1.5}$, and $r_t^2$. The smaller the MAE, the better the performance.