Multi-Transformer: A New Neural Network-Based Architecture for Forecasting S&P Volatility
arXiv:2109.12621v1 [q-fin.CP] 26 Sep 2021

Events such as the Financial Crisis of 2007-2008 or the COVID-19 pandemic caused significant losses to banks and insurance entities. They also demonstrated the importance of using accurate equity risk models and having a risk management function able to implement effective hedging strategies. Stock volatility forecasts play a key role in the estimation of equity risk and, thus, in the management actions carried out by financial institutions. Therefore, this paper aims to propose more accurate stock volatility models based on novel machine and deep learning techniques. This paper introduces a neural network-based architecture, called Multi-Transformer. Multi-Transformer is a variant of Transformer models, which have already been successfully applied in the field of natural language processing. Indeed, this paper also adapts traditional Transformer layers in order to be used in volatility forecasting models. The empirical results obtained in this paper suggest that the hybrid models based on Multi-Transformer and Transformer layers are more accurate and, hence, they lead to more appropriate risk measures than other autoregressive algorithms or hybrid models based on feed forward layers or long short-term memory cells.


Introduction
Since the Financial Crisis of 2007-2008, financial institutions have enhanced their risk management framework in order to meet the new regulatory requirements set by Solvency II or Basel III. These regulations have the aim of measuring the risk profile of financial institutions and minimizing losses from unexpected events such as the European sovereign debt crisis or the COVID-19 pandemic. Even though banks and insurance entities have reduced their losses thanks to the efforts made in recent years, unexpected events still cause remarkable losses to financial institutions. Thus, efforts are still required to further enhance market and equity risk models in which stock volatility forecasts play a fundamental role. Volatility, understood as a measure of an asset's uncertainty [1,2], is not directly observed in stock markets. Thus, taking into consideration the stock market movements, a statistical model is applied in order to compute the volatility of a security.
GARCH-based models [3,4] are widely used for volatility forecasting purposes. This family of models is especially relevant because it takes into consideration the volatility clustering observed by [5]. Nevertheless, as the persistence of conditional variance tends to be close to zero, Refs. [6][7][8][9] developed more flexible variations of the traditional GARCH models. In addition, the models introduced by [10] (EGARCH) and [11] (GJR-GARCH) take into consideration that stock volatility behaves differently depending on the market trend, bearish or bullish. Multivariate GARCH models were developed by [12,13]. Bollerslev et al. [14] applied the previous model to financial time series, while [15] introduced a time-varying multivariate GARCH. Dynamic conditional correlation GARCH, BEKK-GARCH and Factor-GARCH were other variants of this family that were developed by [16][17][18], respectively. Finally, it is worth mentioning that, in contrast to classical GARCH, the first-order zero-drift GARCH model (ZD-GARCH) proposed by [19] is non-stationary regardless of the sign of the Lyapunov exponent and, thus, it can be used for studying heteroscedasticity and conditional heteroscedasticity together.
Another relevant family is composed of stochastic volatility models. As they assume that volatility follows its own stochastic process, these models are widely used in combination with the Black-Scholes formula to price derivatives. The most popular process of this family is the Heston [20] model, which assumes that volatility follows a Cox-Ingersoll-Ross [21] process and stock returns a Brownian motion. The main challenge of the Heston model is the estimation of its parameters. Refs. [22,23] proposed a generalized method of moments to obtain the parameters of the stochastic process, while [24][25][26][27] used a simulation approach to estimate them. Other relevant stochastic volatility processes are the Hull-White [28] and SABR [29] models.
The last relevant family is composed of those models based on machine and deep learning techniques. Even though GARCH models can be considered part of the machine learning toolkit, they are treated here as a separate family because of their importance in the field of stock volatility. Thus, this family comprises the models based on the remaining machine and deep learning algorithms, such as artificial neural networks [30], gradient boosting with regression trees [31], random forests [32] or support vector machines [33]. Refs. [34][35][36] applied machine learning techniques such as support vector machines or hidden Markov models to forecast financial time series. Hamid and Iqbal [37] applied artificial neural networks (ANNs) to demonstrate that the implied volatility forecasted by this algorithm is more accurate than that of Barone-Adesi and Whaley models.
ANNs have also been combined with other statistical models with the aim of improving the forecasting power of individual ANNs. The most common approach applied in the field of stock volatility is merging GARCH-based models with ANNs. Refs. [38][39][40][41][42][43][44] developed different architectures based on this approach for stock volatility forecasting purposes. All these authors demonstrated that hybrid models outperform traditional GARCH models in the field of stock volatility forecasting. It is also worth mentioning the contribution of [45], who combined different GARCH models with ANNs in order to compare their predictive power. ANN-GARCH models have also been applied to forecast the volatility of other financial time series such as metals [46,47] or oil [48,49]. Apart from the combination with GARCH-based models, ANNs have been merged with other models for volatility forecasting purposes. Ramos-Pérez et al. [50] merged ANNs, random forests, support vector machines (SVM) and gradient boosting with regression trees in order to forecast S&P500 volatility. This model outperformed a hybrid model based on feed forward layers and GARCH. Vidal and Kristjanpoller [51] proposed an architecture based on convolutional neural networks (CNNs) and long short-term memory (LSTM) units to forecast gold volatility. LSTMs were also used by [52] to forecast currency exchange rate volatility. Finally, GARCH models have not only been merged with ANNs: Peng et al. [53] combined SVM with GARCH-based models in order to predict cryptocurrency volatility.
The aim of this paper is to introduce a more accurate stock volatility model based on an innovative machine and deep learning technique. For this purpose, this paper introduces hybrid models that merge Transformer and Multi-Transformer layers with other approaches such as GARCH-based algorithms or LSTM units. Multi-Transformer layers, which are also introduced in this paper, are based on the Transformer architecture developed by [54]. Transformer layers have been successfully implemented in the field of natural language processing (NLP). Indeed, the models developed by [55,56] demonstrated that Transformer layers are able to outperform traditional NLP models. Thus, this recently developed architecture is currently considered the state of the art in the field of NLP. In contrast to LSTM, Transformer layers do not incorporate recurrence in their structure. This novel structure relies on a multi-head attention mechanism and positional embeddings in order to forecast time series. As [54] developed Transformer for NLP purposes, positional embeddings are used in combination with word embeddings. The problem faced in this paper is the forecasting of stock volatility and, thus, the word embedding is not needed and the positional embedding has been modified as explained in Section 2.4.
In contrast to Transformer, Multi-Transformer randomly selects different subsets of training data and merges several multi-head attention mechanisms to produce the final output. Following the intuition of bagging, the aim of this architecture is to improve the stability and accuracy of the attention mechanism. It is worth mentioning that the GARCH-based algorithms used in combination with Transformer and Multi-Transformer layers are GARCH, EGARCH, GJR-GARCH, TrGARCH, FIGARCH and AVGARCH.
Therefore, three main contributions are provided by this study. First, Transformer layers are adapted in order to forecast stock volatility. In addition, an extension of the previous structure is presented (Multi-Transformer). Second, this paper demonstrates that merging Transformer and Multi-Transformer layers with other models leads to more accurate volatility forecasting models. Third, the proposed stock volatility models generate appropriate risk measures in low and high volatility regimes. The Python implementation of the volatility models proposed in this paper is available in this repository.
As shown by the extensive literature included in this section, stock volatility forecasting has been a relevant topic not only for financial institutions and regulators but also for academia. As financial markets can suffer drastic sudden drops, it is highly desirable to use models that can adequately forecast volatility. It is also useful to have indicators that can accurately measure risk. This paper makes use of recent deep and machine learning techniques to create more accurate stock volatility models and appropriate equity risk measures.
The rest of the paper is organized as follows: Section 2 describes the dataset, the measures used for validating the volatility forecasts and provides a look at the volatility models used as benchmark. Then, this section presents the volatility forecasting models proposed in this paper, which are based on Transformer and Multi-Transformer layers. As NLP Transformers need to be adapted in order to be used for volatility forecasting purposes and Multi-Transformer layers are introduced by this paper, explanations about the theoretical background of these structures are also given. The analysis of empirical results is presented in Section 3. Finally, the results are discussed in Section 4, followed by concluding remarks in Section 5.

Materials and Methods
This section is divided into five subsections. The first one (Section 2.1) describes the data for fitting the models. The measures for validating the accuracy and value at risk (VaR) of each stock volatility model are explained in Section 2.2. Section 2.3 presents the stock volatility models and algorithms used for benchmarking purposes. Section 2.4 explains the adaptation of Transformer layers in order to be used for volatility forecasting purposes and, finally, the Multi-Transformer layers and the models based on them are presented in Section 2.5.

Data and Model Inputs
The proposed architectures and benchmark models are fitted using the rolling window approach (see Figure 1). This widely used methodology has been applied in finance, among others, by [57][58][59][60]. Rolling window uses a fixed sample length for fitting the model and, then, the following step is forecasted. As in this paper the window size is set to 650 and the forecast horizon to 1, the proposed and benchmark models are fitted using the last 650 S&P trading days and, then, the next day volatility is forecasted. This process is repeated until the whole period under analysis is forecasted. The periods used as training and testing sets are defined at the end of this subsection.
The input variables of the proposed models are the daily logarithmic returns ($r_{t-i}$) and the standard deviation of the last five daily logarithmic returns. As Multi-Transformer, Transformer and LSTM layers are able to manage time series, the last 10 observations of both variables are taken into consideration for fitting these layers.
In accordance with other studies such as [38] or [50], the realized volatility, $\sigma_{i,t}$, is used as response variable for the models based on ANNs. It is defined as the standard deviation of the future logarithmic returns $r_t, \ldots, r_{t+i-1}$ around their mean $E[r_f] = \sum_{n=0}^{i-1} r_{t+n}/i$, with $i = 5$.
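The construction of the inputs and the response variable described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names and the toy price series are hypothetical, and the sample standard deviation (divisor n - 1, as computed by `statistics.stdev`) is an assumption, since the normalization is not stated in the text.

```python
import math
import statistics


def log_returns(prices):
    """Daily logarithmic returns r_t = ln(P_t / P_{t-1})."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def lagged_inputs(returns, t, lags=10, n=5):
    """Rows (r_{t-j}, sigma_{t-j}) for j = lags, ..., 1 (oldest first).

    sigma_{t-j} is the standard deviation of the n returns ending at t-j.
    Assumption: sample standard deviation (divisor n - 1).
    """
    rows = []
    for j in range(lags, 0, -1):
        window = returns[t - j - n + 1 : t - j + 1]  # r_{t-j-n+1}, ..., r_{t-j}
        rows.append((returns[t - j], statistics.stdev(window)))
    return rows


def realized_volatility(returns, t, i=5):
    """Standard deviation of the future returns r_t, ..., r_{t+i-1}."""
    return statistics.stdev(returns[t : t + i])


# Toy price path (hypothetical data, only to make the sketch runnable).
prices = [100.0 + 0.5 * k + 2.0 * math.sin(k) for k in range(40)]
r = log_returns(prices)
t = 20
x = lagged_inputs(r, t)        # 10 rows of (return, 5-day standard deviation)
y = realized_volatility(r, t)  # response variable for the same date
```

Each row of `x` pairs one lagged return with the standard deviation of the five returns ending on that lag, which matches the ten-observation lag used by the LSTM, Transformer and Multi-Transformer layers.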
The dataset for fitting and evaluating the volatility forecasting models contains market data of S&P from 1 January 2008 to 31 December 2020. The optimum configuration of the models is obtained by applying the rolling window approach and selecting the configuration which minimizes the error (RMSE) in the period going from 1 January 2008 to 31 December 2015. The optimum configuration in combination with the rolling window methodology is applied in order to forecast the volatility contained in the testing set (from 1 January 2016 to 31 December 2020). The empirical results presented in Section 3.2 are based on the forecasts of the testing set.
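The rolling window procedure can be illustrated with a short sketch. The function `rolling_forecasts` and its `fit_predict` argument are hypothetical names introduced here; `fit_predict` stands for any routine that fits a model on one window and returns the one-step-ahead forecast, so the toy predictor below (the window mean) is only a placeholder for the volatility models.

```python
def rolling_forecasts(series, fit_predict, window=650):
    """One-step-ahead forecasts using a fixed-length rolling window.

    For every position `end`, the model is fitted on the `window`
    observations that precede it and the next value is forecasted.
    """
    forecasts = []
    for end in range(window, len(series)):
        train = series[end - window : end]
        forecasts.append(fit_predict(train))
    return forecasts


# Placeholder predictor (assumption): forecast the mean of the window.
def window_mean(train):
    return sum(train) / len(train)


demo = rolling_forecasts(list(range(10)), window_mean, window=3)
```

With the configuration used in the paper, `window=650` and a forecast horizon of one day, the loop produces one forecast per trading day after the first 650 observations.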

Models Validation
This subsection presents the measures selected for validating and comparing the performance of the benchmark models with the algorithms proposed in this paper.
The mean absolute error (MAE) and the root mean squared error (RMSE) have been selected for validating the forecasting power of the different stock volatility models:
$$MAE = \frac{1}{N}\sum_{t=1}^{N} \left|\sigma_{i,t} - \hat{\sigma}_{i,t}\right|$$
$$RMSE = \sqrt{\frac{1}{N}\sum_{t=1}^{N} \left(\sigma_{i,t} - \hat{\sigma}_{i,t}\right)^2}$$
where N is the total number of observations, $\sigma_{i,t}$ the realized volatility and $\hat{\sigma}_{i,t}$ its forecast.
The validation carried out by this study is not only interested in the accuracy, but also in the appropriateness of the risk measures generated by the different stock volatility forecasting models. In accordance with the Solvency II Directive, 99.5% VaR has been selected as risk measure. Although Solvency II has the aim of obtaining the yearly VaR, the calculations carried out in this paper are based on a daily VaR in order to have more data points and, thus, more robust conclusions on the performance of the different models. The parametric approach developed by [61] is used for validating the different VaR estimations. The aim of this test is to accept (or reject) the hypothesis that the number of VaR exceedances is aligned with the confidence level selected for calculating the risk measure. In addition to the previous test, the approach suggested by [62] is also applied in order to validate the appropriateness of VaR.
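The validation measures can be sketched as follows. The error functions follow the usual definitions of MAE and RMSE. The exceedance test is written in the proportion-of-failures likelihood-ratio form commonly associated with Kupiec; that this is the exact variant used in [61] is an assumption, and all function names are hypothetical.

```python
import math


def mae(actual, forecast):
    """Mean absolute error between realized and forecasted volatility."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error between realized and forecasted volatility."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof(exceedances, n_obs, coverage=0.995):
    """Likelihood-ratio test that the exceedance rate equals 1 - coverage.

    Returns the LR statistic and its p-value under a chi-square
    distribution with one degree of freedom.
    """
    p = 1.0 - coverage
    p_hat = exceedances / n_obs
    log_null = _xlogy(n_obs - exceedances, 1.0 - p) + _xlogy(exceedances, p)
    log_alt = _xlogy(n_obs - exceedances, 1.0 - p_hat) + _xlogy(exceedances, p_hat)
    lr = max(-2.0 * (log_null - log_alt), 0.0)
    # Survival function of a chi-square(1) variable: erfc(sqrt(x / 2)).
    return lr, math.erfc(math.sqrt(lr / 2.0))


lr_ok, p_ok = kupiec_pof(exceedances=5, n_obs=1000)     # observed rate equals 0.5%
lr_bad, p_bad = kupiec_pof(exceedances=30, n_obs=1000)  # far too many exceedances
```

A p-value above the chosen significance level means that the hypothesis of a correct exceedance rate is not rejected.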

Benchmark Models
This subsection introduces the benchmark models used in this paper: GARCH, EGARCH, AVGARCH, GJR-GARCH, TrGARCH, FIGARCH and two architectures that combine GARCH-based algorithms with ANN and LSTM, respectively. The GARCH-based algorithms are fitted assuming that innovations, $\epsilon_t$, follow a Student's t-distribution. Thus, the returns generated by these models follow a conditional t-distribution [63].
The generalized autoregressive conditional heteroskedasticity (GARCH) model developed by [4] has been widely used for stock volatility forecasting purposes. In the GARCH(p,q) specification, $\omega_i$, $\alpha_i$ and $\beta_i$ are the parameters to be estimated, $r_{t-i}$ the previous returns and $\sigma^2_{t-i}$ the last observed volatility. As previously stated, innovations ($\epsilon_t$) follow a Student's t-distribution.
The absolute value GARCH [64], AVGARCH(p,q), is similar to the traditional GARCH model. In this case, the absolute values of the previous returns and volatilities are taken into consideration to forecast the volatility $\hat{\sigma}_t$.
As volatility behaves differently depending on the market tendency, models such as EGARCH, GJR-GARCH or TrGARCH were developed. EGARCH(p,q) [10] models the logarithm of stock volatility, where $\omega_i$, $\alpha_i$, $\beta_i$ and $\gamma_i$ are the parameters to be estimated and $e_t = r_t/\sigma_t$. In the GJR-GARCH(p,o,q) developed by [11], as with the EGARCH model, $\omega_i$, $\alpha_i$, $\beta_i$ and $\gamma_i$ are the parameters to be estimated, and the indicator $I_{[r_{t-1}<0]}$ takes the value of 1 when the subscript condition is met and 0 otherwise. The Threshold GARCH(p,o,q) (TrGARCH) model has the same set of parameters, $\omega_i$, $\alpha_i$, $\beta_i$ and $\gamma_i$, as the previous two architectures.
The last GARCH-based algorithm used in this paper is the fractionally integrated GARCH (FIGARCH) model developed by [65]. Its conditional variance dynamics are expressed in terms of the lag operator $L$ and the fractional differencing parameter $d$.
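For reference, the six GARCH-type recursions named in this subsection are commonly written as below. This is a sketch in the notation of the surrounding text (returns $r_t$, standardized returns $e_t = r_t/\sigma_t$, indicator $I_{[\cdot]}$), not a transcription of the paper's own equations. Several details are assumptions taken from the standard formulations: the intercept, indexed as $\omega_i$ in the text, is written as a single $\omega$; the EGARCH centring term $E|e_{t-i}|$ and the limits of its asymmetry sum are assumed; the indicator is applied at every lag $i$ rather than only at $t-1$; and the FIGARCH(1,d,1) parameters $\phi$ and $\beta$ are not named in the text.

```latex
\begin{align}
\text{GARCH}(p,q):\quad
  & \hat{\sigma}_t^2 = \omega + \sum_{i=1}^{p} \alpha_i r_{t-i}^2
    + \sum_{i=1}^{q} \beta_i \sigma_{t-i}^2 \\
\text{AVGARCH}(p,q):\quad
  & \hat{\sigma}_t = \omega + \sum_{i=1}^{p} \alpha_i |r_{t-i}|
    + \sum_{i=1}^{q} \beta_i \sigma_{t-i} \\
\text{EGARCH}(p,q):\quad
  & \log \hat{\sigma}_t^2 = \omega
    + \sum_{i=1}^{p} \alpha_i \left(|e_{t-i}| - E|e_{t-i}|\right)
    + \sum_{i=1}^{p} \gamma_i e_{t-i}
    + \sum_{i=1}^{q} \beta_i \log \sigma_{t-i}^2 \\
\text{GJR-GARCH}(p,o,q):\quad
  & \hat{\sigma}_t^2 = \omega + \sum_{i=1}^{p} \alpha_i r_{t-i}^2
    + \sum_{i=1}^{o} \gamma_i r_{t-i}^2 I_{[r_{t-i}<0]}
    + \sum_{i=1}^{q} \beta_i \sigma_{t-i}^2 \\
\text{TrGARCH}(p,o,q):\quad
  & \hat{\sigma}_t = \omega + \sum_{i=1}^{p} \alpha_i |r_{t-i}|
    + \sum_{i=1}^{o} \gamma_i |r_{t-i}| I_{[r_{t-i}<0]}
    + \sum_{i=1}^{q} \beta_i \sigma_{t-i} \\
\text{FIGARCH}(1,d,1):\quad
  & \hat{\sigma}_t^2 = \omega + \beta \sigma_{t-1}^2
    + \left[1 - \beta L - (1 - \phi L)(1 - L)^d\right] r_t^2
\end{align}
```

The two asymmetric variance models (GJR-GARCH and TrGARCH) differ only in whether the recursion is written on the variance with squared returns or on the standard deviation with absolute returns.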
In addition to the previous approaches, two other hybrid models based on merging autoregressive algorithms with ANNs and LSTMs are also used as benchmark. Figure 2 shows the architecture of ANN-GARCH and LSTM-GARCH. The inputs of the algorithms are the following:
• The last daily logarithmic return, $r_{t-1}$, for the ANN-GARCH and the last ten in the case of the LSTM-GARCH (as explained in Section 2.1).
• The standard deviation of the last five daily logarithmic returns, computed around their mean $E[r] = \sum_{i=1}^{n} r_{t-i}/n$ with $n = 5$. As with the previous input variable, the last standard deviation is considered in the ANN-GARCH, whereas the last ten are taken into consideration by the LSTM-GARCH architecture.
The GARCH-based algorithms included within the ANN-GARCH and LSTM-GARCH models are the six algorithms previously presented in this same subsection (GARCH, EGARCH, AVGARCH, GJR-GARCH, TrGARCH, FIGARCH). As explained in Section 2.1, the realized volatility, $\sigma_{i,t}$, is used as response variable to train the models. This variable is the standard deviation of the future logarithmic returns around their mean $E[r_f] = \sum_{n=0}^{i-1} r_{t+n}/i$. In this paper, $i = 5$.
As shown in Figure 2, the input of the ANN-GARCH model is processed by two feed forward layers with dropout regularization. These layers have 16 and 8 neurons, respectively. The final output is produced by a feed forward layer with one neuron. In the case of the LSTM-GARCH, inputs are processed by an LSTM layer with 32 units and two feed forward layers with 8 and 1 neurons, respectively, in order to produce the final forecast.

Transformer-Based Models
Before explaining the volatility models based on Transformer layers (see Figure 3), all the modifications applied to their architecture are presented in this subsection. As previously stated, Transformer layers [54] were developed for NLP purposes. Thus, some modifications are needed in order to apply this layer for volatility forecasting purposes. In contrast to LSTM, recurrence is not present in the architecture of Transformer layers. The two main components used by these layers in order to deal with time series are the following:
• Positional encoder. In the architecture of [54], the position of each observation is encoded with sine and cosine wave functions:
$$PE_{(pos,2i)} = \sin\left(pos/1000^{2i/dim}\right) \qquad (14)$$
$$PE_{(pos,2i+1)} = \cos\left(pos/1000^{2i/dim}\right) \qquad (15)$$
where dim is the total number of explanatory variables (or word embedding dimension in NLP) used as input in the model, pos is the position of the observation within the time series and i = (1, 2, . . . , dim − 1). This positional encoder modifies the input data depending on the lag of the time series and the embedding dimension used for the words. As volatility models do not use words as inputs, the positional encoder is modified in order to avoid any variation of the inputs depending on the number of time series used as input. Thus, the positional encoder suggested in this paper changes depending on the lag, but it remains the same across the different explanatory variables introduced in the model. As in the previous case, a wave function of the position, $PE_{pos}$, plays the role of positional encoder, where pos = (0, 1, . . . , $N_{pos}$ − 1) is the position of the observation within the time series and $N_{pos}$ the maximum lag.
• Multi-Head attention. It can be considered the key component of the Transformer layers proposed by [54]. As shown in Figure 3, Multi-Head attention is composed of several scaled dot-product attention units running in parallel. Scaled dot-product attention is computed as follows:
$$Attention(Q, K, V) = softmax\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Q, K and V are input matrices and $d_k$ the number of input variables taken into consideration within the dot-product attention mechanism.
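The difference between the two positional encoders can be illustrated with a short sketch. The first function follows the sine and cosine encoder with the base as printed in the text (the encoder of [54] uses a base of 10000) and with the column convention of [54], where column k uses exponent 2·(k // 2)/dim. The second function is hypothetical: the text only states that the modified encoder is a wave function of the lag that is identical across explanatory variables, so the cosine used here is an assumed stand-in, not the paper's formula.

```python
import math


def sinusoidal_encoding(n_pos, dim, base=1000.0):
    """Encoder that varies with the position and with the variable index.

    Even columns use the sine and odd columns the cosine of
    pos / base ** (2 * i / dim), with i = column // 2.
    """
    table = []
    for pos in range(n_pos):
        row = []
        for k in range(dim):
            angle = pos / base ** (2 * (k // 2) / dim)
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def lag_only_encoding(n_pos, dim):
    """Encoder that varies with the lag only (same value for every variable).

    Assumption: the wave cos(pi * pos / (n_pos - 1)) is an illustrative
    choice; the exact function used in the paper is not given in the text.
    """
    span = max(n_pos - 1, 1)
    return [[math.cos(math.pi * pos / span)] * dim for pos in range(n_pos)]


pe_nlp = sinusoidal_encoding(n_pos=10, dim=2)
pe_lag = lag_only_encoding(n_pos=10, dim=2)
```

In `pe_lag` every row holds a single repeated value, so adding it to the inputs shifts all explanatory variables of a given lag by the same amount, which is the property the modified encoder is meant to have.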
Multi-Head attention splits the explanatory variables into different groups or 'heads' in order to run the different scaled dot-product attention units in parallel. Once the different heads are calculated, the outputs are concatenated (Concat operator) and connected to a feed forward layer with linear activation. Thus, the Multi-Head attention mechanism has the following expression:
$$MultiHead(Q, K, V) = Concat(head_1, \ldots, head_h)\,W^{O}$$
$$head_i = Attention(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
where h is the number of heads. It is also worth mentioning that all the matrices of parameters ($W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$) are trained using feed forward layers with linear activations.
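The attention mechanism described above can be written with plain lists to make each operation visible. This is a minimal sketch, not the implementation used in the paper: the helper names are hypothetical, the weight matrices are passed in explicitly instead of being trained, and the same input is used as query, key and value (self-attention), which is an assumption about how the layer is applied.

```python
import math


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax of one row of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with d_k the number of columns of K."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, heads, w_o):
    """Concatenate the outputs of all heads and project them with W^O.

    `heads` is a list of (W_q, W_k, W_v) tuples, one tuple per head.
    """
    outputs = [
        scaled_dot_product_attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    concat = [sum((out[r] for out in outputs), []) for r in range(len(x))]
    return matmul(concat, w_o)


# Toy example: identical keys give uniform weights, so each output row
# is the average of the rows of V.
uniform = scaled_dot_product_attention(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
)
```

Each output row is a convex combination of the rows of V, with weights that depend on how closely the corresponding query matches each key.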
In addition to the scaled dot-product and the Multi-Head attention mechanisms, Figure 3 shows the Transformer layers used in this paper. As suggested by [54], the Multi-Head attention is followed by a normalization, a feed forward layer with ReLU activation and, again, a normalization layer. Transformer layers also include two residual connections [66]. Thanks to these connections, the model will decide by itself if the training of some layers needs to be skipped during some phases of the fitting process.
The modified version of Transformer layers explained in the previous paragraphs is used in the volatility models presented in Figure 4. The T-GARCH architecture proposed in this paper merges the six GARCH algorithms presented in Section 2.3 with Transformer and feed forward layers in order to forecast $\hat{\sigma}_{i,t}$. In addition to the previous algorithms and layers, TL-GARCH includes an LSTM with 32 units. In this last model, the temporal structure of the data is recognized and modelled by the LSTM layer and, thus, no positional encoder is needed in the Transformer layer. Both models have the following characteristics:
• Adaptive Moment Estimation (ADAM) is the algorithm used for updating the weights of the feed forward, LSTM and Transformer layers. This algorithm takes into consideration current and previous gradients in order to implement a progressive adaptation of the initial learning rate. The values suggested by [67] for the ADAM parameters are used in this paper and the initial learning rate is set to δ = 0.01.
• The feed forward layers with dropout present in both models have 8 neurons, while the output layer has just one.
• The level of dropout regularization θ [68] is optimized with the training set mentioned in Section 2.1.
• The loss function used for weights optimization and back propagation purposes is the mean squared error.
• Batch size is equal to 64 and the models are trained for 5000 epochs in order to obtain the final weights.
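The weight update performed by ADAM can be sketched for a single scalar weight. The learning rate of 0.01 comes from the text; the values beta1 = 0.9, beta2 = 0.999 and eps = 1e-8 are the defaults usually associated with ADAM and are assumed here to be the "values suggested by [67]". The function name and the quadratic toy loss are hypothetical.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update of a scalar weight.

    m and v are running averages of the gradient and the squared gradient,
    and t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy problem (assumption): minimize the squared error (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for step in range(1, 2001):
    gradient = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, gradient, m, v, step)
```

Because the update divides the averaged gradient by the root of the averaged squared gradient, the first step has a size close to the learning rate regardless of the gradient scale, and later steps shrink as the gradients become small relative to their history.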

Multi-Transformer-Based Models
This subsection presents the Multi-Transformer layers and the volatility models based on them. The Multi-Transformer architecture proposed in this paper is a variant of the Transformer layers proposed by [54]. The main differences between both architectures are the following:
• As shown in Figure 5, Multi-Transformer layers generate T different random samples of the input data. In the volatility models proposed in this paper, 90% of the observations of the database are randomly selected in order to compute the different samples.
As with the Transformer architecture applied in this paper, the positional encoder used is $PE_{pos}$ instead of $PE_{(pos,2i)}$ and $PE_{(pos,2i+1)}$. The aim of the Multi-Transformer layers introduced in the paper is to improve the stability and accuracy by applying bagging [69] to the attention mechanism. This technique is usually applied to algorithms such as linear regression, neural networks or decision trees. Instead of applying the procedure to all the data that are input into the model, the proposed methodology applies bagging only to the attention mechanism of the layer architecture.
The computational power required by bagging is one of the main limitations of this technique. As Multi-Transformer applies bagging to the attention mechanisms, their weights are trained several times in each epoch. Nevertheless, bagging is not applied to the rest of the layer weights and, thus, this partially offsets the previous limitation. It is also worth mentioning that bagging preserves the bias and this may result in underfitting.
On the other hand, this technique should bring two main advantages to the Multi-Transformer layer. First, bagging significantly reduces the error variance. Second, the aggregation of learners using this technique leads to a higher accuracy and reduces the risk of overfitting.
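The bagging idea behind the Multi-Transformer layer can be sketched in two steps: draw T random subsets containing 90% of the observations, then average the outputs of the T attention mechanisms. The sketch below is illustrative only. Sampling without replacement and an element-wise mean are assumptions, since the text does not detail either choice, and both function names are hypothetical.

```python
import random


def draw_subsets(n_obs, n_samples, fraction=0.9, seed=0):
    """Indices of `n_samples` random subsets holding `fraction` of the data.

    Assumption: each subset is drawn without replacement.
    """
    rng = random.Random(seed)
    size = max(1, round(fraction * n_obs))
    return [sorted(rng.sample(range(n_obs), size)) for _ in range(n_samples)]


def average_outputs(outputs):
    """Element-wise mean of the output matrices of the attention units."""
    n = len(outputs)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*outputs)]


subsets = draw_subsets(n_obs=100, n_samples=4)
merged = average_outputs([[[1.0, 2.0]], [[3.0, 4.0]]])
```

In a full model each subset would be used to train one attention mechanism, and `average_outputs` would combine their outputs before the result is passed to the rest of the layer.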
The structure of the volatility models based on Multi-Transformer layers (Figure 6) is similar to the architectures presented in Section 2.4. The MT-GARCH merges Multi-Transformer and feed forward layers with the six GARCH models presented in Section 2.3. In addition to the previous algorithms and layers, MTL-GARCH adds an LSTM with 32 units. The rest of the characteristics such as the optimizer, the number of neurons of the feed forward layers or the level of dropout regularization are the same as those presented in the previous section for T-GARCH and TL-GARCH.
The risk measures of ANN-GARCH, LSTM-GARCH and all the models introduced by this paper (Sections 2.4 and 2.5) are calculated assuming that daily log-returns follow a non-standardized Student's t-distribution with standard deviation equal to the forecasts made by the volatility models. It is worth mentioning that Student's t-distribution generates more appropriate risk measures than the normal distribution due to the shape of its tail [70,71]. In addition, this assumption is in line with the GARCH-based models used as benchmark and the inputs of the hybrid models presented in this paper.
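The link between the volatility forecast and the risk measure can be made explicit under a few stated assumptions. If $T_\nu$ is a standard Student's t variable with $\nu > 2$ degrees of freedom, its variance is $\nu/(\nu-2)$, so the scale parameter has to be shrunk for the standard deviation of the return to equal the forecast $\hat{\sigma}_{i,t}$. The degrees of freedom $\nu$, the location $\mu$ and the sign convention (VaR reported as a positive loss) are assumptions that are not specified in the text.

```latex
\begin{align}
r_t &= \mu + s\,T_\nu, \qquad
s = \hat{\sigma}_{i,t}\sqrt{\frac{\nu-2}{\nu}}
\quad\Longrightarrow\quad \operatorname{sd}(r_t) = \hat{\sigma}_{i,t}, \\
\mathrm{VaR}_{99.5\%} &= -\left(\mu + \hat{\sigma}_{i,t}\sqrt{\frac{\nu-2}{\nu}}\;
t_\nu^{-1}(0.005)\right)
\end{align}
```

Here $t_\nu^{-1}(0.005)$ is the 0.5% quantile of the standard t-distribution, which is negative, so the resulting VaR is positive for small $\mu$. A VaR exceedance occurs on any day whose realized log-return falls below $-\mathrm{VaR}_{99.5\%}$.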

Results
In this section, the forecasts and the risk measures of the volatility models presented in previous sections are compared with the ones obtained from the benchmark models. In addition, the following subsection shows the optimum hyperparameters of the benchmark and proposed hybrid volatility models.

Fitting of Models Based on Neural Networks
As explained in Section 2.1, the rolling window approach ([57-60] among others) is applied for fitting the algorithms. The training set used for optimizing the level of dropout regularization contains S&P returns and observed volatilities from 1 January 2008 to 31 December 2015. Table 1 presents the error by model and level of θ. The results of the optimization process reveal that θ = 0 generates higher error rates than the rest of the possible values regardless of the model. This means that models based on architectures such as Transformer, LSTM or feed forward layers need an appropriate level of regularization in order to avoid overfitting. According to the results, this is especially relevant for ANN-GARCH, where the error strongly depends on the level of regularization. The dropout level that minimizes the error of each model is selected.

Comparison against Benchmark Models
Once the optimum dropout level of each of the proposed volatility forecasting models based on Transformer and Multi-Transformer is selected, their performance is compared with the benchmark models (traditional GARCH processes, ANN-GARCH and LSTM-GARCH) presented in Section 2.3. Tables 2 and 3 present the validation error (RMSE and MAE) by year and model. The column 'Total' shows the error of the whole test period (from 1 January 2016 to 31 December 2020). The main conclusions drawn from these tables are the following:
• Traditional GARCH processes are outperformed by models based on merging artificial neural network architectures such as feed forward, LSTM or Transformer layers with the outcomes of autoregressive algorithms (also named hybrid models).
• The comparison between ANN-GARCH and the rest of the volatility forecasting models based on artificial neural networks (LSTM-GARCH, T-GARCH, TL-GARCH, MT-GARCH and MTL-GARCH) reveals that feed forward layers lead to less accurate forecasts than other architectures. Multi-Transformer, Transformer and LSTM layers are designed to process sequential data such as time series and, thus, the volatility models based on these layers are more accurate than ANN-GARCH.
• Merging Multi-Transformer and Transformer layers with LSTMs leads to more accurate predictions than traditional LSTM-based architectures. Indeed, TL-GARCH achieves better results than LSTM-GARCH, even though the number of weights of TL-GARCH is significantly lower. Thus, Transformer layers, which were introduced for NLP purposes, and the Multi-Transformer layers derived from them can be adapted as described in Sections 2.4 and 2.5 in order to generate more accurate volatility forecasting models. It is also worth mentioning that Multi-Transformer layers, which were also introduced in this paper, lead to more accurate forecasts thanks to their ability to average several attention mechanisms. In fact, the model that achieves the lowest MAE and RMSE is a mixture of Multi-Transformer and LSTM layers (MTL-GARCH).
To enhance the analysis of the results shown in Tables 2 and 3, Figure 7 collects the RMSE and the observed volatility by year. Notice that only the most accurate GARCH-based model is shown in order to improve the visualization of the graph. The black dashed line shows that the observed volatility of 2020 was significantly higher than in the rest of the years due to the turmoil caused by the COVID-19 outbreak. As expected, the error of every model is also higher in 2020 because the market volatility was more unpredictable than in the rest of the years. Nevertheless, it has to be mentioned that the 2020 forecasts of traditional autoregressive algorithms are significantly less accurate than those of hybrid models based on architectures such as LSTM, Transformer or Multi-Transformer layers.
Although the observed volatility is lower in years before 2020, autoregressive models are also outperformed by hybrid models. Nevertheless, the difference between both sets of models is remarkably lower.
The p-values of the Kupiec and Christoffersen tests by volatility model and year are shown in Tables 4 and 5, respectively. In contrast to the approach suggested by Kupiec, Christoffersen test is not only focused on the total number of exceedances, but it also takes into consideration the number of consecutive VaR exceedances. As stated in Section 2.2, the risk measure and confidence level (99.5% VaR) selected are in line with Solvency II Directive. This regulation sets the principles for calculating the capital requirements and assessing the risk profile of the insurance companies based in the European Union. This law covers not only the underwriting risks but also financial risks such as the potential losses due to variations on the interest rate curves or the equity prices.
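The way consecutive exceedances enter the second test can be illustrated with the independence component of the Christoffersen approach, which compares the exceedance rate after a quiet day with the rate after an exceedance. This is a sketch under assumptions: whether the paper reports this component alone or the joint conditional coverage statistic is not stated in the text, and the function names are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def christoffersen_independence(hits):
    """Likelihood-ratio test of independence of VaR exceedances.

    `hits` is a sequence of 0/1 flags (1 = exceedance) with at least two
    entries. Returns the LR statistic and its chi-square(1) p-value.
    """
    n = [[0, 0], [0, 0]]  # n[i][j]: days in state j that follow a day in state i
    for prev, cur in zip(hits, hits[1:]):
        n[prev][cur] += 1
    (n00, n01), (n10, n11) = n
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    log_null = _xlogy(n00 + n10, 1.0 - pi) + _xlogy(n01 + n11, pi)
    log_alt = (_xlogy(n00, 1.0 - pi01) + _xlogy(n01, pi01)
               + _xlogy(n10, 1.0 - pi11) + _xlogy(n11, pi11))
    lr = max(-2.0 * (log_null - log_alt), 0.0)
    return lr, math.erfc(math.sqrt(lr / 2.0))


clustered = [0] * 50 + [1] * 5 + [0] * 50  # five exceedances in a row
lr_cl, p_cl = christoffersen_independence(clustered)
lr_none, p_none = christoffersen_independence([0] * 100)
```

A run of consecutive exceedances, as in `clustered`, produces a small p-value even when the total number of exceedances is moderate, which is the behaviour a test based only on the exceedance count cannot detect.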
The column 'Total' of Tables 4 and 5 reveals that only TL-GARCH, MT-GARCH and MTL-GARCH produce appropriate risk measures (p-value higher than 0.05 in both tests) for the period 2016-2020. The rest of the models fail at least one of the tests and, thus, their risk measures cannot be considered appropriate for that period. As with any other statistical test, the higher the number of data points, the more relevant the outcomes obtained from the test. That is the reason why this analysis focuses on the 'Total' column and not on the specific results obtained by year. The results by year show that most of the models fail the test in 2020 due to the high level of volatility produced by the COVID-19 pandemic.
According to these results, the stock volatility models introduced in this paper (T-GARCH, TL-GARCH, MT-GARCH and MTL-GARCH) produce more accurate estimations and appropriate risk measures in most of the cases. Regarding the accuracy of the models, the difference observed in 2020, when COVID-19 caused a significant turmoil in the stock market, is especially remarkable. Concerning the appropriateness of equity risk measures, three out of four models based on Transformer and Multi-Transformer pass the Kupiec and Christoffersen tests for the period 2016-2020, while all the benchmark models fail at least one of them. Notice that the proposed models are compared with other approaches belonging to their own family (ANN-GARCH and LSTM-GARCH) and autoregressive models belonging to the GARCH family.

Discussion
This paper introduced a set of volatility forecasting models based on Transformer and Multi-Transformer layers. As Transformer layers were developed for NLP purposes [54], their architecture was adapted in order to generate stock volatility forecasting models. Multi-Transformer layers, which are introduced in this paper, aim to improve the stability and accuracy of Transformer layers by applying bagging to the attention mechanism. The predictive power and risk measures generated by the proposed volatility forecasting models (T-GARCH, TL-GARCH, MT-GARCH and MTL-GARCH) were compared with traditional GARCH processes and other hybrid models based on LSTM and feed forward layers.
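The idea of applying bagging to the attention mechanism can be sketched as follows. This is only an illustration under stated assumptions, not the architecture specified in the paper: it uses single-head scaled dot-product attention, lets each attention mechanism see a random subset of time steps as keys and values, and averages the outputs. The function names, the 90% sampling fraction, the sampling without replacement and the random weight matrices are all hypothetical choices.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Scaled dot-product attention."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values


def bagged_attention(x, weights, sample_frac=0.9, rng=None):
    """Average several attention mechanisms fed with random subsamples.

    x:       array of shape (time steps, features).
    weights: list of (w_q, w_k, w_v) projection matrices, one triple per
             attention mechanism (hypothetical parametrization).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_steps = x.shape[0]
    n_keep = max(1, int(round(sample_frac * n_steps)))
    outputs = []
    for w_q, w_k, w_v in weights:
        # Assumption: each mechanism attends to a random subset of time steps.
        idx = np.sort(rng.choice(n_steps, size=n_keep, replace=False))
        q = x @ w_q
        k = x[idx] @ w_k
        v = x[idx] @ w_v
        outputs.append(attention(q, k, v))
    # Bagging step: average the outputs of the individual mechanisms.
    return np.mean(outputs, axis=0)


# Toy usage with random inputs: 10 time steps, 4 features, 5 mechanisms.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
weights = [tuple(rng.normal(size=(4, 3)) for _ in range(3)) for _ in range(5)]
out = bagged_attention(x, weights, rng=rng)
print(out.shape)
```

Averaging over mechanisms trained on different subsamples is the usual variance-reduction argument behind bagging; with a sampling fraction of 1.0 and a single mechanism the sketch reduces to ordinary attention.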
Three main outcomes were drawn from the empirical results. First, hybrid models based on LSTM, Transformer or Multi-Transformer layers outperform traditional autoregressive algorithms and hybrid models based on feed forward layers. The validation error by year shows that this difference is more relevant in 2020, when the volatility of the S&P 500 was significantly higher than in previous years due to the COVID-19 pandemic. Volatility forecasting models are mainly used for pricing derivatives and assessing the risk profile of financial institutions. As the most relevant shocks to the solvency position of financial institutions and to derivative prices are observed in high volatility regimes, the accuracy of these models is particularly important in years such as 2020.
The higher performance of hybrid models has also been demonstrated by [38][39][40][41][42][43][44]. These papers merged traditional GARCH models with feed forward layers to predict stock market volatility. This type of model has also shown superior performance in other financial fields such as oil market volatility [48,49] and metals price volatility [46,47]. Notice that this paper not only presents a comparison with traditional autoregressive models but also shows that Transformer and Multi-Transformer layers can lead to more accurate volatility estimations than other hybrid models.
Second, Multi-Transformer layers lead to more accurate volatility forecasting models than Transformer layers. As expected, applying bagging to the attention mechanism has a positive impact on the performance of the models presented in this paper. It is also remarkable that the empirical results demonstrate that merging LSTM with Transformer or Multi-Transformer layers also has a positive impact on model performance. On the one hand, the volatility forecasting model based on Multi-Transformer and LSTM layers (named MTL-GARCH) achieves the best results in the period 2016-2020. On the other hand, merging Transformer with LSTM layers (TL-GARCH) leads to a lower error rate than the hybrid model based only on LSTM layers (LSTM-GARCH), even though the number of weights of the former is significantly lower. Thus, the use of Transformer layers can lead to simpler and more accurate volatility forecasting models. Notice that Transformer layers are already considered the state of the art thanks to BERT [55] and GPT-3 [56]. These models have been successfully used for sentence prediction, conversational response generation, sentiment classification, coding and writing fiction, among others.
Third, the results of the Kupiec and Christoffersen tests revealed that only the risk estimations made by MTL-GARCH, TL-GARCH and MT-GARCH can be considered appropriate for the period 2016-2020, whereas traditional autoregressive algorithms and hybrid models based on feed forward and LSTM layers failed at least one of the tests. As previously stated, volatility plays a key role not only in risk management but also in derivative valuation models. Thus, using a volatility model that generates appropriate risk measures can lead to more accurate derivative valuation.

Conclusions
Transformer layers are the state of the art in natural language processing. Indeed, the performance of this layer has surpassed that of any previous model in this field [56]. As Transformer layers were specifically created for natural language processing, they need to be modified in order to be used for other purposes. This is probably one of the main reasons why this layer has not yet been extended to other fields. This paper provides the modifications needed to apply this layer to stock volatility forecasting. The results shown in this paper demonstrate that Transformer layers can also outperform the main stock volatility models.
Following the intuition of bagging [69], this paper introduces Multi-Transformer layers. This novel architecture has the aim of improving the stability and accuracy of the attention mechanism, which is the core of Transformer layers. According to the results, it can be concluded that this procedure improves the accuracy of stock volatility models based on Transformer layers.
Leaving aside the comparisons between Transformer and Multi-Transformer layers, the hybrid models based on them have outperformed autoregressive algorithms and other models based on feed forward layers and LSTMs. The architecture of these hybrid models (T-GARCH, TL-GARCH, MT-GARCH and MTL-GARCH) is also provided in this paper.
According to the results, it is also worth noting that the risk estimations based on these models are especially appropriate. The VaR of most of these models can be considered accurate even in years such as 2020, when the COVID-19 pandemic caused remarkable turmoil in the stock market.
Consequently, the empirical results obtained with the hybrid models based on Transformer and Multi-Transformer layers suggest that further investigation should be conducted into their possible application to derivative valuation, in which volatility plays a key role. In addition, the models can be extended by merging Transformer or Multi-Transformer layers with other algorithms (such as gradient boosting with trees or random forest) or by modifying some key assumptions of the attention mechanism.