Article

Deep Learning and Transformer Architectures for Volatility Forecasting: Evidence from U.S. Equity Indices

by
Gergana Taneva-Angelova
* and
Dimitar Granchev
Faculty of Economics and Social Sciences, University of Plovdiv Paisii Hilendarski, 4000 Plovdiv, Bulgaria
*
Author to whom correspondence should be addressed.
J. Risk Financial Manag. 2025, 18(12), 685; https://doi.org/10.3390/jrfm18120685
Submission received: 4 November 2025 / Revised: 23 November 2025 / Accepted: 26 November 2025 / Published: 2 December 2025
(This article belongs to the Special Issue Quantitative Methods for Financial Derivatives and Markets)

Abstract

Volatility forecasting plays a crucial role in financial markets, portfolio management, and risk control. Classical econometric models such as GARCH, ARIMA, and HAR-RV are widely used but face limitations in capturing the nonlinear and regime-dependent dynamics of financial volatility. This study compares traditional econometric models (HAR-RV, ARIMA, GARCH) with deep learning (DL) architectures (LSTM, CNN-LSTM, PatchTST-lite, and Vanilla Transformer) in forecasting realized variance (RV) for major U.S. equity indices (S&P 500, NASDAQ 100, and the Dow Jones Industrial Average) over the period 2000–2025. RV is used as the dependent variable because it is a standard model-free proxy for market volatility. Forecast accuracy is evaluated across forecast horizons of h = 1, 5, 22 days using QLIKE, RMSE, and MAE, along with Diebold–Mariano (DM) significance tests and overfitting diagnostics. Results show that Transformer-based models achieve the lowest errors and strongest generalization, particularly at short horizons and during volatile periods. Overall, the findings highlight the growing advantage of AI-driven models in delivering stable and economically meaningful volatility forecasts, supporting more effective portfolio allocation and risk management—especially in environments marked by rapid market shifts and structural breaks.
JEL Classification:
C22; C45; C58; G17; G11

1. Introduction

Volatility forecasting is a fundamental component of modern financial analysis, as it underpins risk management, asset allocation, derivative pricing, and monetary policy decisions. Accurate volatility forecasts help investors and policymakers anticipate uncertainty and make informed strategic choices.
Volatility plays a central role in financial economics because it directly shapes key mechanisms of risk and asset valuation. First, volatility determines the risk premium investors require as compensation for uncertainty, thereby influencing expected returns across asset classes. Second, precise volatility estimates are essential for widely used pricing frameworks—including CAPM, Black-Scholes, and stochastic-volatility models—where volatility enters as a core input. Third, volatility determines the efficiency of portfolio diversification and allocation through mean-variance optimization, making its forecast quality critical for constructing stable, risk-adjusted portfolios. In risk management, volatility is an explicit component of Value-at-Risk (VaR) and Expected Shortfall (ES). Forecasting errors therefore translate directly into miscalibrated capital buffers, producing either understated downside risk or overly conservative capital requirements. As an indicator of market stress, volatility also shapes investor psychology, market liquidity, and transaction costs. Rising volatility widens bid-ask spreads, increases slippage, and reduces market depth. Moreover, volatility acts as a key indicator of market regimes–distinguishing tranquil from turbulent periods–and is widely employed in Markov-switching models to identify state transitions. Persistent increases in volatility help flag crisis conditions and structural breaks, including events such as the Global Financial Crisis (GFC), COVID-19, and inflation shocks. Taken together, these properties illustrate why improvements in volatility forecasting carry both statistical and economic relevance: they enhance pricing accuracy, strengthen portfolio allocation, improve risk measurement, and provide earlier detection of regime shifts and systemic stress.
Traditionally, econometric approaches such as the Autoregressive Integrated Moving Average (ARIMA) (Box & Jenkins, 1970) and the Generalized Autoregressive Conditional Heteroscedasticity (GARCH) (Engle, 1982; Bollerslev, 1986) have been widely used, valued for their simplicity and theoretical grounding. ARIMA captures linear dependencies on past values and forecast errors, while GARCH models time-varying variance and the well-known clustering of volatility in financial markets. Extensions such as EGARCH introduce asymmetry by recognizing that negative shocks can increase volatility more strongly than positive ones (Nelson, 1991). However, their parametric and largely linear structure makes them poorly suited for the empirical properties of realized volatility, which include long memory, nonlinear adjustments, and regime-dependent behavior documented extensively in the literature (Andersen et al., 2003; O. E. Barndorff-Nielsen & Shephard, 2002; Andersen et al., 2001). By examining whether AI-based architectures respond more effectively to shifts in volatility regimes than classical econometric models, this study addresses a key economic question: to what extent can model architecture enhance the detection of structural changes, information flows, and market conditions that influence risk-related decisions? This perspective ensures that the comparison across model classes reflects not only statistical accuracy but also the mechanisms through which markets process information.
The model classes considered in this study correspond to established market mechanisms. ARIMA captures short-run autocorrelation in volatility proxies, GARCH models conditional variance and leverage effects, and HAR-RV reproduces long-memory behavior by combining daily, weekly, and monthly components. DL architectures extend these ideas by modelling nonlinear interactions and structural breaks, while Transformer-based models employ self-attention to identify relevant segments of volatility history, making them well suited to abrupt shifts and heterogeneous information flows.
Although models such as Stochastic Volatility, Realized GARCH, GARCH-MIDAS, and HAR-J are well established in the literature, they are not included here to maintain methodological consistency. These models rely on heterogeneous input structures—such as intraday realized measures, jump components, or mixed-frequency information—which are not directly comparable to the daily data framework used in this study. Moreover, the primary objective of this paper is to compare standard econometric benchmarks with deep-learning and Transformer architectures. Introducing specialized extensions such as MIDAS or Realized GARCH would substantially broaden the scope of the analysis and reduce the clarity and interpretability of the comparison. Importantly, several of the structural features targeted by these advanced econometric models are already absorbed endogenously by modern AI architectures, particularly those with attention mechanisms. For these reasons, the analysis focuses on a consistent and comparable set of models and leaves the exploration of such extensions for future research.
As a primary analytical contribution, the study applies a consistent and unified evaluation framework that implements the same forecasting, generalization, and overfitting diagnostics across classical, deep-learning, and Transformer models. To the best of our knowledge, existing studies typically evaluate these model classes separately, over different time periods, or under non-comparable criteria, which limits direct comparability. Here, a single methodological protocol is applied uniformly to all architectures, enabling a more structured and comparable assessment of their performance under different market conditions.
A further methodological contribution is the use of a multi-estimator framework for realized volatility. Instead of relying on a single proxy, the analysis evaluates all model classes across three established estimators—Close-to-Close, Parkinson, and Yang-Zhang. Based on the available literature, many empirical studies rely on only one estimator, reducing robustness and comparability. By applying an identical forecasting and validation protocol to all three RV measures, the study provides a more robust and estimator-independent assessment of model accuracy and illustrates how different architectures respond to intraday variation, overnight movements, and combined sources of volatility.
An additional contribution is the enriched economic interpretation of the forecasts. The analysis shows how models behave during calm periods—when volatility evolves smoothly—and during crisis regimes, when it becomes sharply nonlinear and shocks intensify. This contrast clarifies which model architectures deliver more timely responses, more stable risk assessments, and more accurate uncertainty estimates under different market conditions. In this way, forecasting accuracy is not evaluated in isolation but is placed within the broader decision-making context relevant for risk managers, investors, and institutional participants.
The remainder of this paper is structured as follows: Section 2 reviews the relevant literature; Section 3 presents the data and methodology; Section 4 reports the empirical results; Section 5 provides the economic interpretation of forecast results; and Section 6 concludes.

2. Literature Review

Volatility forecasting has traditionally relied on classical econometric methods such as GARCH, ARIMA, and their extensions. Reviewing both econometric and deep learning perspectives provides a fuller understanding of how different models capture volatility dynamics. This section reviews key empirical contributions in both volatility forecasting and general time-series modelling, summarised in Table 1. The empirical literature on volatility can be broadly divided into two main strands. The first encompasses forecasting-oriented studies, which evaluate statistical models such as ARIMA, GARCH, and HAR-RV, as well as more recent deep-learning architectures, with a primary focus on predictive accuracy. The second strand examines the economic nature of volatility itself, including risk premia, volatility-of-volatility, uncertainty shocks, and their macro-financial transmission mechanisms (Engle et al., 2013). Although these two lines of research are related, they address distinct questions. The present study belongs to the forecasting-oriented literature but also incorporates economic motivation from the second strand in order to strengthen the interpretation and economic relevance of the results.
The GARCH family remains central to volatility modeling and has been extensively applied in both standalone and hybrid forecasting frameworks. Extensions such as EGARCH, GJR-GARCH, and GARCH-MIDAS capture asymmetry and the influence of mixed-frequency macroeconomic factors (Ersin & Bildirici, 2023; Asgharian et al., 2013; Virk et al., 2024). Recent studies suggest that incorporating nonlinear elements through neural-network components, such as LSTM, can further enhance predictive accuracy, significantly improving long-horizon forecasts (Ersin & Bildirici, 2023). Empirical evidence generally confirms that GARCH-type models outperform linear specifications such as ARIMA when forecasting stock-return variance (Asgharian et al., 2013). Including macroeconomic variables can enhance long-horizon forecasts but may also lead to overfitting or data-mining bias under certain conditions (Virk et al., 2024). Recent findings by Bildirici and Ersin confirm that adding LSTM components to GARCH-MIDAS frameworks substantially improves long-horizon performance (Ersin & Bildirici, 2023). Although ARIMA remains effective for modelling linear dynamics, it struggles to handle abrupt structural changes (Ferreira & Medeiros, 2021). Nonlinear models such as LSTM can adapt better during such periods, though ARIMA may still outperform LSTM in certain short-term contexts (Harikumar & Muthumeenakshi, 2025).
Recurrent neural networks, especially long short-term memory (LSTM) architectures (Hochreiter & Schmidhuber, 1997; Greff et al., 2017), have shown strong predictive accuracy by learning longer-term dependencies in financial time-series data (Hochreiter & Schmidhuber, 1997; Greff et al., 2017; Z. Zhang et al., 2025). LSTM networks address the vanishing gradient problem through memory cells that preserve long-horizon dynamics, a property relevant to persistent volatility (Hochreiter & Schmidhuber, 1997; Greff et al., 2017). Building on this foundation, hybrid convolutional LSTM (CNN-LSTM) models (Z. Zhang et al., 2025; Shi et al., 2015; Borovykh et al., 2017) combine convolutional layers, which extract local features, with recurrent layers that capture temporal structure. This design allows the model to represent short-term movements alongside long-run dependencies.
Transformer-based (TRF) architectures (Vaswani et al., 2017) have recently advanced sequence modeling by replacing recurrence with attention mechanisms. This approach allows the model to analyze relationships across all time steps simultaneously, making Transformers efficient at capturing both local and global patterns in financial data. For time-series forecasting, the PatchTST framework partitions the input into overlapping patches before applying attention, enhancing computational efficiency for longer-horizon predictions (Nie et al., 2023). While machine-learning studies document notable forecasting gains, most evaluate narrow model sets, short samples, or single horizons, leaving open important questions about their economic relevance and robustness under varying market regimes (Chun et al., 2025; Souto & Moradi, 2024; Zeng et al., 2023).
Recent evidence shows that Transformer architectures can successfully forecast synthetic Ornstein–Uhlenbeck processes and daily S&P 500 dynamics (Brugiere & Turinici, 2025). Prior work suggests that Transformers often predict log-quadratic variation more accurately than daily returns (Brugiere & Turinici, 2025; Souto & Moradi, 2024). Modern Transformer variants improve scalability and long-horizon accuracy (Nie et al., 2023). In financial applications, models such as PatchTST, Informer, and Autoformer frequently outperform first-generation Transformers (Nie et al., 2023; Souto & Moradi, 2024; Zeng et al., 2023). Zeng et al. also demonstrate that hybrid CNN-Transformer architectures outperform both traditional deep learning and econometric models when applied to financial time series (Zeng et al., 2023). A Quantformer model combining sentiment analysis and investor factor construction has been shown to outperform other quantitative factor models in stock-price prediction (Z. Zhang et al., 2025).
Convolutional neural networks (CNNs) are particularly effective at capturing short-range patterns in time-series data, while Transformers are more effective at modeling long-range dependencies. Recent evidence shows that hybrid CNN-Transformer architectures can combine the advantages of local feature extraction and long-range dependency modelling, outperforming traditional deep-learning and econometric models in financial time-series forecasting (Zeng et al., 2023). More broadly, machine-learning (ML) models tend to outperform classical volatility models across horizons, though real trading performance may be affected by transaction costs and implementation frictions (Chun et al., 2025). Chun, Cho, and Ryu also find that ML-based volatility models outperform GARCH and HAR-RV benchmarks across multiple horizons, especially when used for volatility-timing strategies (Chun et al., 2025).
Recent studies on realized volatility estimation show that multi-grid techniques reduce microstructure noise (L. Zhang et al., 2005), while combining multiple estimators enhances overall forecast accuracy (Patton & Sheppard, 2015). Jump-robust measures such as power and bipower variation further enhance volatility estimation and risk measurement (Patton & Sheppard, 2015; Tauchen & Zhou, 2011), reflecting the ongoing development of more accurate and robust realized-volatility estimators. Parallel to these econometric developments, recent work has compared ML models (such as LSTM and CNN) with classical approaches for realized-volatility forecasting. However, despite substantial interest in machine-learning applications, empirical research applying Transformer-based architectures specifically to realized-variance forecasting remains limited. This study contributes to this emerging strand by evaluating Transformer, LSTM and CNN-LSTM models alongside representative classical benchmarks (ARIMA, GARCH, HAR-RV) across multiple horizons and major U.S. indices over a long sample (2000–2025).
While the literature provides important insights into volatility modelling, several gaps remain. HAR-RV and GARCH models perform well for short horizons but often weaken during periods of structural change, indicating the need for models that can capture nonlinear and regime-dependent behavior. Recent machine-learning studies report accuracy gains, yet many analyse isolated model families or focus on single indices, limiting cross-model comparability across horizons and market conditions. Although Transformer-based architectures have shown strong results in general time-series forecasting, their application to realized-variance forecasting remains limited. These gaps motivate our unified framework, which jointly evaluates classical econometric, deep-learning, and Transformer models across multiple indices and forecast horizons, enabling a consistent comparison and a clearer economic interpretation of the observed performance differences.

3. Materials and Methods

This section outlines the data sources, preprocessing procedures, and modeling design employed in the study. It describes how RV measures were constructed, how datasets were synchronized across indices, and how econometric, DL, and TRF frameworks were trained and evaluated under a unified experimental setup.

3.1. Data Collation and Preprocessing

The empirical analysis is based on daily price data for three major U.S. equity indices (the S&P 500, NASDAQ 100 and DJIA), covering the period from January 2000 to August 2025 (Table 2). This long historical window provides sufficient depth for constructing multi-horizon volatility forecasts and ensures stable estimation of realized-variance measures. It also captures a wide range of market environments, allowing models to be evaluated under both low- and high-volatility conditions.
The datasets were obtained from Investing.com (2025) (https://www.investing.com/, accessed on 5 September 2025), and include daily open, high, low, and close (OHLC) prices, ensuring full consistency across indices. The series contain 6450 daily observations per index, spanning nearly 25 years of trading data (Table 2). After applying a 120-day rolling lookback window, around 6300 supervised sequences were produced for three horizons—1, 5, and 22 days (h = 1, 5, 22), representing daily, weekly, and monthly intervals. Logarithmic returns were computed from daily closing prices to capture compounding effects and short-term dynamics.
RV was estimated using three range-based measures: Close-to-Close, Parkinson (Parkinson, 1980) and Yang–Zhang (Yang & Zhang, 2000). A logarithmic transformation of realized variance (log(RV)) was then applied to stabilize variance, improving model stability and convergence, and forms the target variable for all models.
While realized variance is theoretically defined as the sum of squared intraday returns, we rely on range-based estimators due to the absence of consistent intraday data for the entire period and across all indices. Range-based realized measures (Parkinson, 1980) have been shown to be high-frequency-efficient and less noisy than close-to-close variance estimates, offering a practical and theoretically justified proxy when only daily OHLC data are available. Consistent with the realized-volatility literature, we therefore treat our dependent variable as a range-based proxy for integrated volatility (L. Zhang et al., 2005; Patton & Sheppard, 2009; O. Barndorff-Nielsen & Shephard, 2004). RV is chosen as the dependent variable because it provides a model-free, data-driven measure of integrated volatility whose asymptotic and finite-sample properties are well established in empirical finance (Andersen et al., 2003; O. E. Barndorff-Nielsen & Shephard, 2002; Andersen et al., 2001). Following Andersen, Bollerslev, Diebold and Labys (Andersen et al., 2003; Andersen et al., 2001) and O. E. Barndorff-Nielsen and Shephard (2002), RV provides a more accurate benchmark of integrated volatility than squared returns or parametric GARCH-implied variance. Using three complementary RV definitions improves measurement robustness by capturing multiple dimensions of price variation and reducing estimator-specific bias (L. Zhang et al., 2005; Andersen et al., 2003; O. Barndorff-Nielsen & Shephard, 2004).
The choice of the 2000–2025 period is motivated by both statistical and economic considerations. Statistically, a long sample is essential for training deep learning models, and evaluating multi-horizon forecasts, and ensuring stable estimation of realized-variance measures. Economically, this period encompasses several distinct volatility regimes—from the dot-com aftermath and the Global Financial Crisis to the COVID-19 shock and the post-pandemic tightening cycle—allowing us to examine whether AI-based architectures adapt more effectively to structural changes than classical econometric models.
Data were split 80/20 into training and testing sets, with the test sample beginning in August 2020 to capture major high-volatility episodes such as the COVID-19 crisis, the 2022 inflation shock, and the 2023–2025 market adjustment.
Before constructing the target variable, the raw series were cleaned and aligned chronologically. Missing or duplicated rows were removed to ensure consistent and gap-free returns, as such issues can distort both log-return and RV calculations. Daily logarithmic returns were calculated as follows:
r_t = \ln\left( P_t / P_{t-1} \right)
where P_t is the daily closing price. In financial econometrics, realized variance (RV) is theoretically defined as the sum of squared intraday log-returns (Andersen et al., 2003; O. E. Barndorff-Nielsen & Shephard, 2002; Patton, 2011):
RV_t = \sum_{i=1}^{M_t} r_{t,i}^2
Because intraday data are unavailable for the full 2000–2025 sample, RV is approximated using daily information. The simplest approximation is the squared daily log-return (Patton, 2011):
RV_t = r_t^2
This approximation captures daily price variability and is widely used in long-horizon studies when intraday data are unavailable. Although high-frequency data can yield finer estimates, daily squared returns offer a reliable basis for long-horizon analysis. To improve measurement quality, two additional range-based estimators are employed—the Parkinson (Parkinson, 1980) measure and the Yang–Zhang (Yang & Zhang, 2000) estimator—which use daily high, low, open, and close prices to capture intraday variation when true intraday data are unavailable. These estimators are known to be high-frequency-efficient and less noisy than close-to-close volatility, making them suitable for long-sample forecasting studies (Parkinson, 1980; L. Zhang et al., 2005; O. Barndorff-Nielsen & Shephard, 2004). Variance is used instead of volatility because it is additive and provides a more stable modeling scale. Following Andersen et al. (2003) and Patton (2011), RV is theoretically defined as the sum of squared intraday returns. Since intraday observations are unavailable, we approximate this quantity using the three daily OHLC-based estimators described above. To stabilize the distribution and reduce skewness, we apply the logarithmic transformation (Corsi, 2005):
\log RV_t = \log\left( r_t^2 \right)
which becomes the dependent variable for all models:
y_t = \log RV_t
This transformation yields the log-realized variance, which normalizes scale, stabilizes variance, and produces an approximately Gaussian distribution that enhances model performance across both classical and deep-learning frameworks. Moreover, because RV_t is strictly positive, the logarithmic transformation is always well-defined, and forecast values can be mapped back to the variance scale through exponentiation (Andersen et al., 2003; Andersen et al., 2001; Corsi, 2005). After generating predictions in the log-variance domain, forecasts are converted to the RV domain using:
\widehat{RV}_{t+h} = \exp\left( \log \widehat{RV}_{t+h} \right)
where t is the current time index and t + h denotes the future point corresponding to an h-step-ahead forecast.
Forecasting log(RV) rather than raw RV is standard in modern financial econometrics due to its improved statistical properties—lower skewness, stabilized variance, and better learning behavior—which contribute to higher predictive accuracy across linear and nonlinear specifications (Andersen et al., 2003; O. E. Barndorff-Nielsen & Shephard, 2002; Corsi, 2005). The resulting forecasts remain interpretable for applications such as volatility targeting and Value-at-Risk estimation.
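To make the estimator pipeline concrete, the following sketch computes the Close-to-Close (squared log-return) and Parkinson proxies, applies the log transform, and maps forecasts back to the variance scale. This is an illustrative pure-Python sketch rather than the authors' code; the function names and the small floor `eps` (guarding log(0) on zero-return days) are our own, and the Yang–Zhang estimator, which additionally combines overnight and open-to-close components, is omitted for brevity.

```python
import math

def rv_close_to_close(closes):
    """Squared daily log-returns: RV_t = r_t^2 with r_t = ln(P_t / P_{t-1})."""
    rets = [math.log(closes[i] / closes[i - 1]) for i in range(1, len(closes))]
    return [r * r for r in rets]

def rv_parkinson(highs, lows):
    """Parkinson (1980) range-based estimator: (ln(H_t / L_t))^2 / (4 ln 2)."""
    return [math.log(h / l) ** 2 / (4.0 * math.log(2.0)) for h, l in zip(highs, lows)]

def to_log_rv(rv, eps=1e-12):
    """Modelling target y_t = log(RV_t); eps is a numerical floor against log(0)."""
    return [math.log(max(v, eps)) for v in rv]

def from_log_rv(log_rv):
    """Back-transform forecasts to the variance scale via exponentiation."""
    return [math.exp(v) for v in log_rv]
```

Because RV_t = 0 on a zero-return day, the floor before taking logs is a practical necessity in any implementation of this transform.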
To ensure comparability across model classes, all input features were standardized using training-sample statistics only. The modeling framework employed a 120-day rolling lookback window and three forecast horizons (h = 1, 5, 22), capturing both short and medium-term volatility persistence. Predicted values were transformed back from the logarithmic to the variance scale for evaluation in risk management contexts (Table 3). This multi-horizon rolling-window setup is well established in realized-variance forecasting (Taylor, 2005; Andersen et al., 2003; O. E. Barndorff-Nielsen & Shephard, 2002).
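The rolling-window setup described above can be sketched as follows. This is our illustrative reconstruction (function and parameter names are not from the paper), showing how a 120-day lookback and an h-step-ahead target produce supervised pairs from a log(RV) series.

```python
def make_sequences(y, lookback=120, horizon=1):
    """Build (input window, h-step-ahead target) pairs from a series y.

    Mirrors the paper's setup: a 120-day lookback and horizons h in {1, 5, 22}.
    """
    X, targets = [], []
    for i in range(lookback, len(y) - horizon + 1):
        X.append(y[i - lookback:i])        # past `lookback` observations
        targets.append(y[i + horizon - 1])  # value h steps ahead
    return X, targets
```

Applied to roughly 6450 observations per index, this construction yields about 6300 sequences per horizon, consistent with the counts reported in Section 3.1.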
Table 3 summarises the main model families considered in the study. The comparison highlights several methodological trade-offs: classical econometric models remain transparent and efficient but struggle with nonlinear dynamics and regime changes; DL models provide greater flexibility but require more data and offer limited interpretability; and TRF architectures deliver strong long-range modelling capacity but are still relatively new in volatility forecasting. To mitigate the black-box limitations of DL and TRF models, the study adopts a transparent and reproducible framework, combining volatility decomposition (Close-to-Close, Parkinson, Yang–Zhang), interpretable loss metrics (QLIKE, MAE, RMSE), and visual diagnostics such as loss curves and true-vs-forecast panels. This ensures that the empirical results remain robust, comparable, and economically meaningful.
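For concreteness, the three loss metrics can be computed as in the sketch below. We assume the standard Patton (2011) form of QLIKE, RV/F − log(RV/F) − 1, evaluated on the variance scale; the paper does not spell out its exact implementation, so treat this as an illustration rather than the authors' code.

```python
import math

def qlike(actual, forecast):
    """Mean QLIKE loss (Patton, 2011 form): RV/F - log(RV/F) - 1, zero at a perfect forecast."""
    terms = [a / f - math.log(a / f) - 1.0 for a, f in zip(actual, forecast)]
    return sum(terms) / len(terms)

def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
```

QLIKE is asymmetric by design: it penalizes under-prediction of variance more heavily than over-prediction, which is why it is preferred for risk-management applications.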
Figure 1 presents the structured workflow adopted in this study, integrating classical econometric methods with contemporary ML and TRF architectures. The framework ensures methodological rigor and comparability across model families. It shows how econometric and AI-based approaches complement each other in capturing volatility dynamics and translating empirical evidence into actionable insights for financial decision-making and policy applications.

3.2. Methodology

The methodology section outlines the analytical framework and modeling strategies employed in this study. It details the structure, estimation procedures, and validation methods applied to three major model groups used to forecast RV across major U.S. indices.

3.2.1. Classical Econometric Models

The ARIMA model (Box & Jenkins, 1970) is one of the most established tools in financial time-series forecasting. It combines an autoregressive (AR) term, which captures dependence on past observations, a moving average (MA) term, which reflects past forecast errors, and an integration (I) term that ensures stationarity through differencing. Following the classical Box-Jenkins (Box & Jenkins, 1970) specification, a general non-seasonal ARIMA(p, d, q) process can be written as:
\phi(L)\,(1 - L)^d\, y_t = c + \theta(L)\,\varepsilon_t
\varepsilon_t \sim \mathrm{i.i.d.}(0, \sigma^2)
where L denotes the lag operator, \phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p represents the autoregressive polynomial of order p, (1 - L)^d is the differencing operator applied d times, and \theta(L) = 1 + \theta_1 L + \cdots + \theta_q L^q denotes the moving average polynomial of order q. The innovation term \varepsilon_t is assumed to be independently and identically distributed with zero mean and variance \sigma^2.
Although ARIMA models are often applied to returns, in volatility forecasting it is more appropriate to model the logarithm of realized variance, log(RV). This transformation stabilizes variance and reduces skewness, improving linear model performance when capturing persistence in RV (Taylor, 2005; Andersen et al., 2003; O. E. Barndorff-Nielsen & Shephard, 2002; Andersen et al., 2001). The use of log(RV) is well established in the realized-variance literature and generally produces more interpretable forecasts than modeling prices or returns directly (Andersen et al., 2003; O. E. Barndorff-Nielsen & Shephard, 2002). ARIMA models provide a fundamental baseline, effectively capturing linear dependencies and short-term autocorrelation in the conditional mean. However, they assume constant variance, which is unrealistic for financial data characterized by volatility clustering. To address this, Engle (1982) introduced the ARCH model, later generalized by Bollerslev (1986) into the GARCH framework.
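As a stylized illustration of the autoregressive component, the sketch below fits an AR(1) model to a log(RV) series by least squares, an ARIMA(1,0,0) special case. The paper's actual ARIMA orders are selected separately; the helper names here are our own.

```python
def fit_ar1(y):
    """Least-squares fit of y_t = c + phi * y_{t-1} + e_t (an ARIMA(1,0,0) special case)."""
    x, z = y[:-1], y[1:]                 # lagged regressor and response
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    phi = (sum((a - mx) * (b - mz) for a, b in zip(x, z))
           / sum((a - mx) ** 2 for a in x))
    c = mz - phi * mx
    return c, phi

def ar1_forecast(c, phi, last, h):
    """Iterate the AR(1) recursion h steps ahead from the last observation."""
    y = last
    for _ in range(h):
        y = c + phi * y
    return y
```

On persistent log(RV) series the estimated phi is typically close to one, which is precisely the long-memory behavior that motivates the HAR-RV extension discussed below.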
While ARIMA models the conditional mean, GARCH models the conditional variance, capturing how volatility evolves over time. This enables GARCH to represent persistent volatility and time-varying risk. In practice, the two models are often combined—ARIMA captures mean dynamics, and GARCH models residual volatility—yielding a more complete description of financial time series. In practice, the GARCH (1,1) specification is the most widely used:
\sigma_t^2 = \omega + \alpha\,\varepsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2
Here, α measures the immediate impact of new shocks, while β reflects volatility persistence. This compact and flexible specification allows GARCH to complement ARIMA. In this study, GARCH(1,1) is estimated on daily log-returns, and its conditional variance is compared against the log(RV) targets derived from the three realized-variance estimators.
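The GARCH(1,1) recursion above can be traced directly. This is a sketch of the variance filter only, not of parameter estimation; initializing at the unconditional variance ω/(1 − α − β) is a common convention we assume here, and the parameter values in the example are illustrative.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Filter the conditional variance path sigma_t^2 = omega + alpha*eps_{t-1}^2 + beta*sigma_{t-1}^2.

    Requires alpha + beta < 1 so the unconditional variance used for
    initialization, omega / (1 - alpha - beta), is finite and positive.
    """
    sigma2 = [omega / (1.0 - alpha - beta)]  # start at the unconditional variance
    for r in returns[:-1]:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2
```

In practice omega, alpha, and beta are obtained by maximum likelihood on the daily log-returns; the filter above then yields the conditional-variance series that is compared against the log(RV) targets.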
While ARIMA and GARCH remain essential tools for modeling linear dependence and conditional heteroskedasticity (Box & Jenkins, 1970; Engle, 1982; Bollerslev, 1986), they are limited in capturing the long-memory effects and regime shifts often present in RV. To address this, Corsi (2009) introduced the HAR-RV model, which incorporates volatility components over multiple horizons.
Formally:
RV_{t+1} = \beta_0 + \beta_d RV_t^{(d)} + \beta_w RV_t^{(w)} + \beta_m RV_t^{(m)} + \epsilon_t
where RV_t^{(d)}, RV_t^{(w)}, and RV_t^{(m)} represent the realized variance computed over daily, weekly and monthly intervals, respectively.
The HAR-RV model bridges the simplicity of traditional econometrics with the ability to capture long-memory dynamics, serving as a robust benchmark before moving to nonlinear and DL models.
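The HAR-RV regressors can be built as simple rolling means of past RV, as in this sketch. Window lengths of 1, 5, and 22 trading days follow the usual daily/weekly/monthly convention of Corsi (2009); the code itself is our illustration, not the authors' implementation.

```python
def har_features(rv):
    """Daily, weekly (5-day mean) and monthly (22-day mean) RV components.

    Returns one (daily, weekly, monthly) tuple per date t >= 21, i.e. once a
    full 22-day history is available, for use as HAR-RV regressors.
    """
    feats = []
    for t in range(21, len(rv)):
        daily = rv[t]
        weekly = sum(rv[t - 4:t + 1]) / 5.0
        monthly = sum(rv[t - 21:t + 1]) / 22.0
        feats.append((daily, weekly, monthly))
    return feats
```

The coefficients beta_d, beta_w, and beta_m are then estimated by OLS of RV_{t+1} on these three components, which is what lets a purely linear model mimic long-memory behavior.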

3.2.2. Advanced Models

To complement classical econometric methods, this study employs ML and DL models designed to capture nonlinear dependencies and long-range dynamics frequently observed in RV. These approaches are well suited to volatility forecasting, where regime shifts, structural breaks, and asymmetric responses to shocks are common.
The LSTM network (Hochreiter & Schmidhuber, 1997) models persistent volatility patterns through memory cells and gating mechanisms that preserve long-range information. A hybrid CNN-LSTM architecture (Shi et al., 2015) is also implemented: convolutional layers extract short-term local variations, while LSTM layers capture slower-moving components. Both models are trained on log(RV) using the QLIKE loss, which is standard for variance forecasting. The LSTM's core update rule can be summarized as (Greff et al., 2017):
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
where $c_t$ is the memory cell, $f_t$ and $i_t$ are the forget and input gates, $\tilde{c}_t$ is the candidate state, and $\odot$ denotes element-wise (Hadamard) multiplication.
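A single LSTM step implementing this update can be sketched in NumPy. The fused gate layout and shapes below are illustrative only; deep-learning frameworks order and store the gate blocks differently:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: c_t = f_t * c_{t-1} + i_t * c~_t (elementwise).

    W, U, b stack the input, forget, candidate, and output blocks
    (shapes: W (4H, D), U (4H, H), b (4H,)), a common fused layout.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    g = np.tanh(z[2 * H:3 * H])   # candidate state c~_t
    o = sigmoid(z[3 * H:4 * H])   # output gate
    c = f * c_prev + i * g        # memory-cell update from the text
    h = o * np.tanh(c)            # hidden state passed to the next step
    return h, c

# Illustrative random weights; D = input size, H = hidden size.
rng = np.random.default_rng(1)
D, H = 3, 4
W = rng.standard_normal((4 * H, D))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_cell_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, U, b)
```

The forget gate f_t is what lets the cell retain volatility information over long spans, which is the property the text attributes to the LSTM.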
Following Y. Zhang et al. (2025), the CNN-LSTM hybrid uses RV as the target variable, with the convolutional layer extracting local temporal features. In line with Borovykh et al. (2017), the architecture first applies one-dimensional convolutional filters to capture these local patterns:
$$z_t = \mathrm{ReLU}\left(\sum_{k=0}^{K-1} W_k\, x_{t-k} + b\right)$$
where $W_k$ are learnable convolutional kernels of length $K$, which act as sliding filters moving across the input sequence to extract local temporal features; the resulting activations $z_t$ are then passed to the LSTM layer, which models long-term dependencies.
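The convolutional stage can be illustrated with a minimal causal 1-D convolution over a univariate series. This is a sketch of the operation itself, not the study's exact layer configuration:

```python
import numpy as np

def causal_conv1d(x, kernel, bias=0.0):
    """z_t = ReLU(sum_{k=0}^{K-1} W_k * x_{t-k} + b), with zero left-padding.

    Causal: z_t depends only on x_t and earlier observations.
    """
    K = len(kernel)
    x_pad = np.concatenate([np.zeros(K - 1), np.asarray(x, float)])
    # For each t, take the window [x_{t-K+1}, ..., x_t], reverse it so that
    # kernel index k multiplies x_{t-k}, then apply the dot product.
    z = np.array([kernel @ x_pad[t:t + K][::-1] for t in range(len(x))])
    return np.maximum(z + bias, 0.0)  # ReLU

identity = causal_conv1d([1.0, 2.0, 3.0], np.array([1.0, 0.0]))  # W_0 = 1
lagged = causal_conv1d([1.0, 2.0, 3.0], np.array([0.0, 1.0]))    # W_1 = 1
```

A kernel concentrated on W_0 reproduces the input, while a kernel on W_1 shifts it by one step, which shows how learned kernels encode short-lag volatility patterns.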
Transformer models (Vaswani et al., 2017) use self-attention to evaluate pairwise dependencies across the entire input window without recurrence. This enables them to capture both local and global patterns in log(RV). The study employs two compact variants: a lightweight encoder and the PatchTST-lite model (Nie et al., 2023), which partitions the input into overlapping patches to increase efficiency and improve long-horizon forecasting. Following Nie et al. (2023), we implement a lightweight version of the PatchTST architecture (PatchTST-lite) by adopting the reduced embedding size, fewer encoder blocks, and simplified attention configuration recommended in the original implementation, which preserves predictive performance while substantially lowering computational cost. Both models are trained on log(RV) using the QLIKE loss and evaluated via anchored walk-forward validation (Nie et al., 2023), in which the start of the training window remains anchored while the training sample expands and the test window moves forward in time. Given a sequence of inputs $X = (x_1, \ldots, x_T)$, the self-attention mechanism maps it into contextualized representations (Vaswani et al., 2017):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where $Q = XW_Q$, $K = XW_K$, and $V = XW_V$ are the query, key, and value matrices.
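The mechanism can be sketched directly from the formula; the code below implements single-head, unmasked attention for illustration only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Returns the contextualized representations and the attention weights.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Self-attention: Q, K, V all derived from the same input window X.
rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))  # 6 time steps, model dimension 4
out, w = scaled_dot_product_attention(X, X, X)
```

Each row of the weight matrix sums to one, so every output step is a learned convex combination of all input steps — the "pairwise dependencies across the entire window" described in the text.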
Building on the CNN-LSTM framework, which captures short-term volatility bursts and longer-term persistence, the PatchTST-lite Transformer extends this idea by dividing the input into overlapping patches processed through multi-head self-attention (Nie et al., 2023). This structure allows the model to learn both local and global patterns in log(RV). The final prediction is obtained from the encoder output:
$$\hat{y}_{t+h} = W_o\,\mathrm{TransformerEncoder}(Z) + b_o$$
where $Z = (z_1, z_2, \ldots, z_N)$ denotes the patch representations and $h \in \{1, 5, 22\}$ is the forecast horizon. By attending over $N$ patches rather than $L$ individual time steps, PatchTST-lite reduces attention complexity from $O(L^2)$ to $O(N^2)$, improving scalability for long input windows.
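The patching step itself is a simple strided slicing operation. The sketch below uses illustrative patch length and stride values, not the study's actual configuration:

```python
import numpy as np

def make_patches(x, patch_len=16, stride=8):
    """Split a series of length L into N overlapping patches (N x patch_len).

    Attention is then applied over the N patches instead of the L time
    steps, reducing its cost from O(L^2) to O(N^2).
    """
    x = np.asarray(x, float)
    starts = range(0, len(x) - patch_len + 1, stride)
    return np.stack([x[s:s + patch_len] for s in starts])

# A 64-step window with patch length 16 and stride 8 yields 7 patches.
patches = make_patches(np.arange(64.0), patch_len=16, stride=8)
```

With stride < patch_len the patches overlap, so no boundary information is lost while the attention sequence is still shortened by roughly the stride factor.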
Both Transformer variants—the lightweight encoder and PatchTST-lite—are trained under identical settings, and the anchored walk-forward procedure closely mirrors real-world forecasting, as the training sample expands over time. To strengthen inference, we also conduct regime-specific evaluation (calm vs. turbulent periods) and apply Diebold–Mariano (Diebold & Mariano, 1995) tests to assess the significance of model-performance differences.
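The anchored walk-forward scheme can be sketched as an index generator; the fold sizes in the example are illustrative:

```python
def anchored_walk_forward(n_obs, initial_train, test_size):
    """Yield (train_idx, test_idx) pairs with a fixed anchor at t = 0.

    The training window expands fold by fold while the test window rolls
    forward, so every forecast is strictly out-of-sample in time.
    """
    start = initial_train
    while start + test_size <= n_obs:
        train_idx = list(range(0, start))                 # anchored, expanding
        test_idx = list(range(start, start + test_size))  # rolling forward
        yield train_idx, test_idx
        start += test_size

# Illustrative sizes: 100 observations, 60 initial training points,
# 10-step test blocks -> 4 folds.
folds = list(anchored_walk_forward(100, 60, 10))
```

Because the anchor never moves, later folds train on strictly more history, mimicking how a forecaster's information set grows in practice.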

4. Results

The empirical results in this section compare the forecasting performance of the competing models across multiple horizons and volatility regimes. The evaluation framework integrates three complementary dimensions: forecast accuracy, statistical significance, and model reliability. Forecast accuracy is assessed using MAE, RMSE, and QLIKE, which quantify how closely model predictions track realized volatility. The DM test (Diebold & Mariano, 1995) determines whether differences in forecast errors between model pairs are statistically significant. Model reliability is examined through overfitting diagnostics, comparing in-sample and out-of-sample losses to assess generalization quality. This three-layer framework provides a transparent and coherent basis for comparing classical and data-driven models across all volatility estimators and horizons.

4.1. Descriptive Statistics and Preliminary Analysis

Before evaluating the forecasting models, we examine how RV behaves across the three indices: S&P 500, NASDAQ 100, and DJIA. Three RV estimators are used: Close-to-Close, Parkinson, and Yang–Zhang. All RV measures are analyzed in log form to stabilize variance and reduce extreme values. The descriptive analysis reveals that RV is strongly right-skewed and leptokurtic across all indices, indicating that extreme volatility spikes occur far more often than expected under a normal distribution. The mean values exceed the medians, confirming heavy right tails, while dispersion measures vary substantially over time. These features reflect volatility clustering—tranquil periods followed by abrupt spikes. Even after log-transformation, the series display heterogeneity and outliers, consistent with fat-tailed volatility distributions. When examining the time dynamics of log(RV), we observe gradually decaying autocorrelations and persistent shocks, implying that the underlying processes are mean-reverting but exhibit long memory. This pattern is more pronounced in range-based estimators, which incorporate intraday variation indirectly. This helps explain why models such as HAR-RV and GARCH(1,1) perform reliably: both depend on serial correlation.
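As a concrete illustration of the range-based family, the Parkinson estimator converts the daily high-low range into a variance measure (the Yang–Zhang estimator additionally combines overnight and open-to-close components; only Parkinson is sketched here):

```python
import numpy as np

def parkinson_rv(high, low):
    """Daily Parkinson realized variance: (ln(H/L))^2 / (4 ln 2).

    Uses the intraday high-low range, which is why range-based
    estimators incorporate intraday variation indirectly.
    """
    h = np.asarray(high, dtype=float)
    l = np.asarray(low, dtype=float)
    return np.log(h / l) ** 2 / (4.0 * np.log(2.0))

# Two illustrative trading days with modest intraday ranges.
rv = parkinson_rv([101.0, 102.0], [99.0, 100.0])
```

Taking log(rv) of the resulting series then yields the log(RV) target described in the text, with its variance-stabilizing effect.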
Formal cross-correlation matrices and unit-root tests were not included, as these diagnostics fall outside the primary objective of the study, which is to compare forecasting models rather than to analyse inter-index dependence or long-run stochastic properties. The use of log-transformed realized volatility is well established in the literature to yield stationary, mean-reverting series suitable for econometric and machine-learning forecasting frameworks. Given the stable statistical behavior of the series and the extensive empirical evidence supporting their stationarity, additional correlation and stationarity tests were deemed unnecessary for the purposes of this analysis.

4.2. Classical Model Performance

This subsection examines the empirical performance of the classical econometric models. The evaluation focuses on forecast accuracy (MAE, RMSE, QLIKE), overfitting diagnostics, and overall model behavior across horizons and volatility regimes.

4.2.1. Point Forecast Accuracy (MAE, RMSE, QLIKE)

Forecast accuracy is evaluated using MAE, RMSE, and QLIKE loss. Figure 2 presents the average performance of ARIMA(1,0,1), GARCH(1,1), and HAR-RV models across the three realized-volatility estimators (Close, Parkinson, and Yang–Zhang), aggregated over all indices and forecast horizons (h = 1, 5, 22). The results show a clear and consistent pattern: HAR-RV achieves the lowest MAE and RMSE values, confirming its strong ability to capture multi-scale volatility dynamics. By incorporating lagged RV terms, the HAR-RV adjusts more smoothly to persistent volatility patterns, leading to better short- and medium-term forecasts. GARCH(1,1) delivers intermediate results: it reduces forecast errors more effectively than ARIMA(1,0,1) but remains less accurate than HAR-RV (Appendix A, Table A1). Its QLIKE values remain low, reflecting good clustering capture but weaker amplitude precision. ARIMA(1,0,1) systematically exhibits the highest error values in all three metrics, indicating limited adaptability to heteroskedastic behavior. Among the RV estimators, the Yang–Zhang measure yields the lowest overall errors, indicating superior sensitivity to daily and overnight price variation and providing the most stable target dynamics for classical models.
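For reference, the QLIKE loss underlying these comparisons can be computed as follows. This is a standard formulation shown as an illustrative sketch, not the study's exact code:

```python
import numpy as np

def qlike(forecast_var, realized_var):
    """QLIKE loss: RV/h - ln(RV/h) - 1, which is zero iff forecast = RV.

    The loss is asymmetric: under-predicting variance is penalized more
    heavily than over-predicting it by the same factor, which is why
    QLIKE is favored for risk-management evaluation.
    """
    ratio = np.asarray(realized_var, float) / np.asarray(forecast_var, float)
    return ratio - np.log(ratio) - 1.0
```

Averaging this per-observation loss over the test window gives the values reported in the heatmaps.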
In summary, these results confirm that the strong persistence of RV is essential for reliable volatility forecasts. HAR-RV provides the most robust classical benchmark, while ARIMA remains limited by its inability to adapt to time-varying volatility.

4.2.2. Statistical Significance (DM Test)

The DM test evaluates whether two competing models generate significantly different forecast errors over the same evaluation period. A low p-value (below 0.05) indicates that the models differ significantly, while higher values suggest that their forecasts are statistically similar (Appendix A Table A2).
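A minimal sketch of the DM statistic with a rectangular long-run variance correction is shown below; the paper's exact implementation (lag truncation, small-sample adjustments) may differ:

```python
import math
import numpy as np

def diebold_mariano(loss_a, loss_b, h=1):
    """DM statistic on the loss differential d_t = L_A,t - L_B,t.

    Uses h-1 autocovariance lags in the long-run variance (appropriate
    for h-step-ahead forecasts) and a two-sided normal p-value.
    """
    d = np.asarray(loss_a, float) - np.asarray(loss_b, float)
    T = len(d)
    d_bar = d.mean()
    dc = d - d_bar
    var = np.mean(dc ** 2)
    for k in range(1, h):                       # long-run variance correction
        var += 2.0 * np.mean(dc[k:] * dc[:-k])
    dm = d_bar / math.sqrt(var / T)
    p_value = math.erfc(abs(dm) / math.sqrt(2.0))  # 2 * (1 - Phi(|dm|))
    return dm, p_value

# Illustrative check: model B's losses are systematically ~0.5 lower.
rng = np.random.default_rng(3)
base = rng.standard_normal(300) ** 2
noise = rng.standard_normal(300) * 0.1
dm, p = diebold_mariano(base + 0.5 + noise, base, h=1)
```

A positive DM value with a small p-value indicates that the first model's losses are significantly larger, matching the sign convention described for Figure 8.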
Figure 3 presents pairwise p-values for all model comparisons. At the short horizon (h = 1), almost all model comparisons show highly significant differences (p ≈ 0.000). This confirms that all three classical models generate clearly distinct short-term forecasts. The largest contrast appears between HAR-RV and GARCH(1,1), emphasizing the role of lagged RV terms in improving short-horizon prediction. At the long horizon (h = 22), statistical differences weaken. The p-value between ARIMA and HAR-RV (p ≈ 0.17) shows that their long-run forecasts are statistically similar. In contrast, GARCH(1,1) remains significantly different from both models, reflecting its stronger dependence on persistence dynamics. The average heatmap across all horizons supports these observations: HAR-RV and GARCH remain statistically distinct, whereas ARIMA and HAR-RV converge as the forecast horizon increases.
Overall, the DM test confirms that the classical models are not interchangeable. HAR-RV consistently delivers superior short-term accuracy, ARIMA(1,0,1) becomes comparable at medium and long horizons, and GARCH(1,1) captures volatility persistence but does not exceed HAR-RV in accuracy.

4.2.3. Overfitting Diagnostics for Classical Models

Model robustness is assessed through in-sample and out-of-sample QLIKE losses for ARIMA(1,0,1), GARCH(1,1), and HAR-RV (Figure 4). A Train/Test ratio close to 1 signals good generalization, while substantial deviations indicate mis-specification. The diagnostics combine out-of-sample losses, train/test ratios, rolling-window behavior, and parameter-path stability as complementary indicators of robustness. Residual portmanteau tests (e.g., Ljung–Box) are not included, as they serve a supportive rather than decisive role in forecasting evaluation.
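The Train/Test ratio diagnostic reduces to a simple per-horizon computation; the loss values in the example are illustrative:

```python
import numpy as np

def overfitting_audit(losses):
    """Per-horizon Train/Test mean-loss ratios.

    Input: {horizon: (train_losses, test_losses)}. Ratios near 1 signal
    balanced generalization; ratios well below 1 mean the test loss
    exceeds the training loss, i.e., some degree of overfitting.
    """
    return {h: float(np.mean(tr) / np.mean(te))
            for h, (tr, te) in losses.items()}

# Hypothetical QLIKE losses at two horizons.
ratios = overfitting_audit({1: ([0.40], [0.50]), 22: ([0.30], [0.50])})
```

Tracking these ratios alongside parameter paths gives the combined robustness picture used in Figures 4 and 5.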
Figure 5 reports parameter stability and QLIKE-based diagnostics for the GARCH(1,1) model applied to the S&P 500, NASDAQ 100, and DJIA. The left panels track the evolution of ω ,   α 1   and β 1 under expanding-window refits. Across all indices, the persistence term α 1 + β 1 remains below 1, confirming covariance stationarity and stable conditional-variance dynamics. The right panels display train and test QLIKE losses. Ratios between 0.74 and 0.87 remain below 1, confirming mild and well-controlled overfitting and strong generalization performance.
Overall, the overfitting diagnostics indicate that all classical models generalize well and remain stable across market conditions. Train/Test QLIKE ratios below one, together with stable parameter paths, show that these models capture persistent volatility dynamics without fitting noise. From a financial perspective, this confirms that ARIMA(1,0,1), HAR-RV, and GARCH(1,1) produce reliable volatility estimates that are robust to shifts in market regimes—an essential property for risk management, trading strategies, and forecasting applications.

4.2.4. Forecasting Results

The forecasting performance of the classical models confirms the conclusions drawn from the error and overfitting analyses. Across all indices, RV displays clear clustering and persistence patterns, which ARIMA(1,0,1) captures only partially. GARCH(1,1) adapts better to variance shifts and follows observed peaks and troughs more closely, especially during volatility spikes. However, its forecasts tend to revert too quickly toward the mean, leading to underestimation during extended high-volatility periods. In contrast, the HAR-RV model delivers the smoothest and most consistent forecasts, effectively tracing the underlying volatility dynamics across both short and medium horizons. Its multi-lag structure enables it to incorporate long-term memory effects, which improves synchronization with RV; under the Yang–Zhang estimator in particular, it provides a stable representation of long-range dependence and filters short-term noise more effectively than ARIMA and GARCH. To further illustrate these dynamics, Figure 6 presents representative forecast panels for ARIMA(1,0,1), GARCH(1,1), and HAR-RV across three major U.S. indices and three RV estimators (Close, Parkinson, Yang–Zhang) over forecast horizons of h = 1, 5, 22 days. The plots show the logarithmic variance (true vs. predicted) and demonstrate how each model captures volatility movements at different time scales.
Across all indices, HAR-RV consistently aligns most closely with realized variance. Its use of multi-period lags allows for smoother yet responsive adaptation to volatility persistence, effectively capturing both short-term spikes and gradual shifts with relatively low error dispersion. GARCH(1,1) also models volatility clustering well, though it tends to overreact in calm periods and slightly underestimate prolonged extremes during turbulent episodes. This moderate bias is particularly visible in high-volatility windows, where GARCH reverts too rapidly toward its conditional mean. ARIMA(1,0,1), by contrast, produces flatter and less adaptive forecasts, reflecting its linear structure and limited ability to capture volatility feedback effects. As a result, its predictions deviate more strongly from RV, especially for short-horizon forecasts. At longer horizons (h = 22), both HAR-RV and GARCH(1,1) maintain coherent predictive patterns, while ARIMA forecasts gradually diverge from RV, confirming the superior robustness of models explicitly built on variance dynamics. Among the RV estimators, Yang–Zhang yields the smallest and most stable forecast errors across models, suggesting that its inclusion of both overnight and intraday information improves the detection of persistent variance components. These forecasting patterns are consistent across all three indices and reinforce the advantages of RV-based models when long-memory effects are present. Combined with the overfitting diagnostics, these results motivate the transition toward AI-based architectures capable of capturing nonlinear dependencies and regime-specific volatility patterns.

4.3. Advanced Models Performance

This section examines the empirical performance of the advanced models—DL and Transformer architectures. These approaches are designed to capture nonlinearities, long-term dependencies, and complex interactions within realized variance (RV) that linear econometric models often fail to represent. The evaluation covers three main dimensions: (1) forecast accuracy across horizons (h = 1, 5, 22), (2) training stability and convergence, and (3) statistical significance of performance differences based on the DM test. All models are trained on log-realized variance and validated using an anchored walk-forward procedure, which expands the training window sequentially while keeping the initial anchor point fixed. This approach reflects realistic forecasting conditions and ensures consistent out-of-sample evaluation.

4.3.1. Forecast Accuracy for Advanced Models (MAE, RMSE, QLIKE)

Figure 7 summarizes the aggregated results across all indices and realized variance estimators. Each cell in the heatmap reports the mean error value for a given model and horizon, with darker blue shades corresponding to lower errors (i.e., stronger predictive performance). The Transformer and PatchTST-lite architectures consistently achieve the lowest error values across all metrics, with the advantage becoming particularly pronounced at longer horizons (h = 22), where recurrent models gradually lose calibration. The LSTM model performs competitively at short horizons (h = 1) but loses accuracy as the horizon extends. In contrast, CNN-LSTM exhibits the weakest calibration overall, reflected in its higher RMSE and QLIKE losses (Appendix A, Table A3). These results indicate that attention-based architectures outperform recurrent and convolutional models by better capturing multi-scale temporal dependencies and the nonlinear persistence characteristic of financial volatility.

4.3.2. Statistical Significance (Diebold–Mariano)

To verify whether the observed differences in forecasting accuracy are statistically meaningful, the DM test is applied across all horizons and RV measures. The DM statistic compares the forecast loss differentials between pairs of models: positive values indicate that the model in the row performs better, while negative values imply weaker performance. Figure 8 presents pairwise DM statistics for horizons h = 1, 5, 22, together with the aggregated mean heatmap. Each cell in the heatmap shows the mean DM value between two models. Warmer colors (red) signal statistically significant outperformance of the row model (p < 0.05), whereas cooler colors (blue) indicate the opposite. At short horizons (h = 1), differences in predictive accuracy are generally minor, suggesting that most architectures adapt similarly to near-term volatility fluctuations. However, as the horizon increases (h = 5 and h = 22), Transformer and PatchTST-lite show consistently superior and statistically significant performance relative to the recurrent models. This reflects their stronger ability to model long-term persistence and nonlinear variance dynamics. Among all models, CNN-LSTM performs the weakest, showing predominantly negative DM values relative to its counterparts.
As the forecast horizon lengthens, performance differences become clearer and more systematic. While all models perform similarly over short horizons (h = 1), the Transformer and PatchTST-lite architectures gain a clear advantage at medium and long horizons (h = 5, 22). Their ability to capture long-term dependencies and persistent volatility dynamics translates into statistically significant improvements over the recurrent models.

4.3.3. Overfitting Diagnostics for Deep Learning Models

To verify that the DL models produce reliable and stable forecasts, an overfitting audit was conducted. The purpose of this analysis is to assess whether the models learn structural patterns in volatility instead of memorizing the training data. For this reason, both the Train/Test loss ratios and the learning curves were examined across forecast horizons (h = 1, 5, 22).
Figure 9 provides a visual overview of model generalization and stability across horizons. In Panel (a), the average Train/Test loss ratios remain close to one and below the red reference line, indicating a balanced relationship between training and test losses and suggesting that the models do not overfit to a meaningful extent. As the forecast horizon increases (h = 22), the uncertainty bands widen slightly, which is expected when predicting further into the future. At the short horizon (h = 1), both training and validation losses decline quickly and converge within 8–10 epochs, with minimal difference between them. This pattern indicates stable learning dynamics and only minor overfitting, making additional regularization unnecessary. At the medium horizon (h = 5), the validation loss begins to plateau earlier than the training loss, creating a small but visible gap between the two. This suggests mild overfitting, which could be mitigated through small increases in dropout, weight decay, or early stopping. At the long horizon (h = 22), the gap between training and validation losses becomes clearer. While the training loss continues to decrease, the validation loss remains higher and more volatile, indicating partial overfitting. In this case, lower learning rates and stronger regularization, or a shorter input window may help stabilize learning.
Overall, the DL models generalize well at short horizons, show some divergence at medium horizons, and require tighter regularization for longer-term forecasts. These diagnostics confirm that the networks are generally well-calibrated and exhibit only modest overfitting as the forecasting horizon and model difficulty increase.

4.3.4. Forecasting Results

To complement the quantitative evaluation, Figure 10 illustrates the predicted and RV trajectories for representative cases across the three forecast horizons (h = 1, 5, 22). These visual comparisons provide an intuitive view of how each model tracks volatility fluctuations and responds to volatility clusters and regime shifts. All models were trained and validated using an anchored walk-forward procedure to ensure a realistic, time-consistent evaluation of out-of-sample performance. Model complexity was controlled through early stopping and out-of-sample validation to limit overfitting.
At the short horizon (Panel a, Figure 10, h = 1), both Transformer and LSTM follow the realized variance closely, capturing most short-lived volatility bursts. Their forecasts react quickly to new information and exhibit minimal delay after market shocks, reflecting strong sensitivity to high-frequency dynamics. At the medium horizon (Panel b, h = 5), forecasts become smoother and less reactive to short-term noise. The PatchTST-lite model maintains consistent alignment with the overall variance level, while recurrent architectures (LSTM and CNN-LSTM) show lagged responses and tend to underpredict during turbulent episodes. This behavior suggests that attention-based models integrate information more effectively across multiple time scales. At the long horizon (Panel c, h = 22), all models naturally show wider deviations from realized variance due to accumulated forecast uncertainty. Nevertheless, the Transformer continues to reproduce broad volatility regimes more accurately than the other architectures. It captures both the persistence of calm periods and the amplitude of volatility spikes, demonstrating robust adaptability even at longer horizons.
Taken together, the visual results confirm the quantitative findings: attention-based models such as Transformer and PatchTST-lite deliver more stable and well-calibrated volatility forecasts across horizons, whereas recurrent models perform well in short-term dynamics but gradually lose precision as the forecast horizon lengthens.

4.4. Comparative Evaluation and Statistical Significance

To complement the statistical testing, this section applies SHAP analysis to understand which factors drive model performance and forecast error variability across datasets (Figure 11). For QLIKE, the analysis reveals that RV_close is the dominant determinant of forecasting accuracy, followed by ARIMA(1,0,1) and HAR-RV, while the forecast horizon H exerts a smaller but consistent influence. The choice of index has little impact on error magnitude, indicating that differences in performance are primarily driven by model architecture and the specification of the RV measure, rather than by market-specific characteristics. For the DM statistic, the largest SHAP contributions come from ARIMA(1,0,1) and RV_close, indicating that these features explain most of the statistical differences between models. GARCH(1,1) and HAR-RV also rank highly, confirming that classical econometric frameworks continue to provide robust benchmarks for comparative evaluation.
In contrast, DL architectures exhibit smaller individual SHAP contributions, implying that their predictive strength stems from broader multi-feature interactions rather than reliance on a single dominant driver. Across both QLIKE and DM metrics, the SHAP patterns consistently show that volatility-based inputs (e.g., RV_close, Yang–Zhang) remain key predictors of accuracy, while the choice of model family determines whether these improvements become statistically significant. Short horizons (h = 1, 5) contribute more strongly to DM dominance, whereas long horizons (h = 22) tend to dilute these differences due to higher forecast uncertainty.
The SHAP bar and beeswarm plots offer complementary insights: the mean (SHAP) bars highlight the relative importance of each factor, while the beeswarm plots capture the direction and variability of their effects. Higher SHAP values indicate factors that increase forecasting errors, whereas negative values correspond to better accuracy. The close similarity between QLIKE and DM SHAP patterns confirms that the same underlying factors shape both absolute error size and relative model dominance. Overall, the findings show that model architecture—rather than index choice or horizon length—is the key determinant of forecasting quality. The SHAP results therefore reinforce the broader empirical conclusion: attention-based models, particularly PatchTST-lite, achieve the most reliable and accurate volatility forecasts by leveraging rich, multi-feature interactions rather than isolated predictors.

4.5. Subsample Robustness Analysis

The sample is divided into four economically distinct periods: pre-GFC (2000–2006), GFC (2007–2009), post-GFC (2010–2019), and the COVID-19 period (2020–2025). For each subsample, we recompute the QLIKE error for the best-performing classical model and the best DL architecture (Table 4).
Across all regimes, the ranking of models remains broadly consistent. HAR-RV delivers the lowest classical-model errors during tranquil periods, when volatility persistence dominates and shocks are moderate. GARCH(1,1) becomes the strongest classical benchmark during crisis episodes such as the GFC and COVID-19, reflecting its ability to respond rapidly to large volatility shocks. Among the advanced models, Transformer-based architectures consistently achieve the lowest QLIKE errors in stress periods, particularly when volatility shifts rapidly and exhibits nonlinear propagation. At shorter horizons in calm regimes, recurrent models such as LSTM and CNN-LSTM perform competitively, but their advantage diminishes during turbulent periods and as the forecast horizon increases.
These results indicate that the main conclusions of the study are not driven by a single historical window. Instead, the model rankings remain stable across major economic regimes, confirming that (a) classical models outperform in stable, low-volatility conditions, while (b) Transformers dominate in crisis-driven and high-uncertainty environments, where long-range dependencies and nonlinear dynamics become critical.
The variation in best-performing models across subsamples reflects differences in market regimes. During tranquil periods (Pre-GFC and Post-GFC), long-memory dynamics dominate and HAR-RV achieves the lowest QLIKE errors, whereas in crisis environments (GFC, COVID-19) GARCH(1,1) performs best because it reacts more rapidly to sudden volatility shocks. Since QLIKE penalizes volatility underestimation, regime shifts naturally lead to changes in the model ranking.

5. Economic Interpretations of Forecast Results

From the perspective of volatility forecasting and market dynamics, the comparison of predictive models reveals several economic mechanisms that shape the behavior of financial uncertainty. Volatility does not evolve smoothly over time; instead, it clusters into prolonged calm periods followed by episodes of heightened turbulence. This well-documented clustering phenomenon explains why investors cycle between increased risk-taking during tranquil phases and heightened caution when volatility rises. Because volatility directly influences risk premia, investor behavior, and the speed at which information is incorporated into asset prices, improvements in forecasting accuracy carry clear economic, rather than purely statistical, relevance.
Traditional econometric models such as GARCH(1,1) and HAR-RV remain reliable baselines for capturing persistence and volatility clustering (Ersin & Bildirici, 2023; Asgharian et al., 2013; Virk et al., 2024). They react to recent shocks and incorporate long-memory components, making them useful for understanding how markets update risk assessments.
The strong performance of Transformer-based architectures shows that volatility is driven not merely by random shocks, but by the way financial news and macroeconomic events propagate through markets—unevenly, in waves, and with varying intensity. Investor reactions are asymmetric: negative news often triggers sharp overreactions, while positive information is absorbed more gradually. Transformers capture these informational cascades, behavioral asymmetries, and long-memory effects far more effectively than classical linear models. As a result, the empirical findings relate directly to themes of market efficiency, asymmetric information transmission, and the economics of uncertainty.
Because classical models rely on fixed parametric structures, they struggle to detect abrupt regime shifts, asymmetric responses to bad news, and structural breaks that increasingly characterize modern markets. During macroeconomic shocks, geopolitical events, or panic-driven sell-offs, these models tend to react too slowly, leading to underestimated risk premia and delayed adjustments in investor positioning.
In terms of news processing, differences in forecast accuracy reveal that classical models implicitly assume smooth information arrival and linear shock propagation (Ersin & Bildirici, 2023; Asgharian et al., 2013; Virk et al., 2024), which explains their weaker performance during crisis episodes. Attention-based architectures, by contrast, extract relevant signals from irregular and clustered news flows, enabling them to adapt to state-dependent information dynamics. Their superior accuracy suggests that markets react selectively, rather than uniformly—a view aligned with behavioral finance and regime-dependent risk pricing.
The economic implications extend directly to risk management and portfolio construction. More accurate volatility forecasts improve the calibration of VaR and ES, reduce the risk of underestimating losses, and support timely adjustments in leverage, margin requirements, and exposure limits. Transformer models detect volatility spikes earlier than GARCH, LSTM, or CNN-LSTM, allowing faster deleveraging and more effective crisis responses (Nie et al., 2023; Zeng et al., 2023). In volatility-managed strategies, models that recognize regime changes early provide a notable advantage: because portfolio exposure scales inversely with expected volatility, earlier detection of spikes leads to higher risk-adjusted performance—an area where Transformers consistently outperform other frameworks.
Deep-learning models such as LSTM and CNN-LSTM capture nonlinear and non-stationary dynamics well (Harikumar & Muthumeenakshi, 2025) because they learn hidden dependencies, repeated patterns, and sudden shifts in market behavior. However, their performance depends strongly on the choice of window length (e.g., 20, 60, or 120 days). Short windows cause the model to overweight recent events and overestimate volatility, leading to excessive caution and lost return opportunities. Long windows overly smooth shocks, causing delayed reactions—especially in crises, when rapid de-risking is critical. Their sensitivity to data scaling can also distort shock magnitude and delay portfolio adjustments. Thus, despite their strong nonlinear capabilities, their forecasts may be less stable under extreme market conditions.
In contrast, Transformer models generalize far more robustly across horizons and market regimes because they do not rely on a fixed time window or on sequential data processing. Their self-attention mechanism allows them to compare all observations simultaneously and extract the most relevant signals from the entire series, whether the market is calm or highly turbulent. This enables them to capture both local shocks and global regime shifts, even when these occur abruptly.
From a financial standpoint, this capability is essential: the model automatically increases attention to accelerating price movements, rising cross-asset correlations, negative-news clusters, and growing liquidity stress. Transformers therefore detect early signs of risk-off behavior, widening risk premia, and mounting market uncertainty much faster than recurrent and classical models. This results in earlier reductions in exposure, more precise VaR and ES estimates, and more agile portfolio-management decisions during periods when accuracy is most valuable.
Among all tested models, PatchTST-lite delivers the strongest out-of-sample performance across indices and horizons. This is consistent with evidence that time-series Transformers such as Informer and Autoformer capture long-range dependencies more effectively while avoiding the gradient-decay limitations of recurrent networks (Nie et al., 2023; Zeng et al., 2023). Low QLIKE and RMSE values across the S&P 500, NASDAQ 100, and DJIA confirm the architecture's ability to model regime changes and the dynamics of uncertainty.
Finally, consistent with the literature on realized volatility and multi-estimator frameworks (L. Zhang et al., 2005; Patton & Sheppard, 2009; O. Barndorff-Nielsen & Shephard, 2004; Tauchen & Zhou, 2011), forecasting accuracy improves when diverse volatility signals are combined, such as daily returns, range-based measures, extreme price moves, volume pressure, news shocks, and shifts in correlations. HAR models achieve this through multi-scale aggregation, while attention-based models learn these heterogeneous components directly. Integrating these sources provides a more complete representation of market risk, leading to improved estimation of risk premia, more efficient exposure management, and earlier detection of emerging market stress.
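The HAR multi-scale aggregation mentioned above is easy to make concrete: realized variance is regressed on its own daily value plus weekly (5-day) and monthly (22-day) trailing averages. A minimal feature-construction sketch (function name illustrative, not from the paper's code):

```python
def har_features(rv):
    """Build HAR-RV regressors: the daily RV plus its 5-day and 22-day
    trailing means, each aligned to predict the next day's RV."""
    X, y = [], []
    for t in range(21, len(rv) - 1):          # need 22 trailing observations
        daily = rv[t]
        weekly = sum(rv[t - 4: t + 1]) / 5.0
        monthly = sum(rv[t - 21: t + 1]) / 22.0
        X.append((daily, weekly, monthly))
        y.append(rv[t + 1])                   # next-day target
    return X, y

rv = [1.0] * 30                               # flat toy series for illustration
X, y = har_features(rv)
```

The three regressors approximate the behavior of investors operating at daily, weekly, and monthly frequencies, which is exactly the heterogeneous-horizon structure that attention-based models learn without being told the scales in advance.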
In sum, the superior accuracy of attention-based architectures indicates that financial volatility is driven by long-memory dynamics, nonlinear information flows, regime shifts, and asymmetric investor reactions. Models capable of learning these mechanisms produce forecasts that more accurately reflect the economic structure of uncertainty.

6. Conclusions

This study provides a unified empirical comparison of classical econometric models, DL architectures, and modern Transformer frameworks in forecasting realized variance for major U.S. equity indices. The results show that model architecture is a central determinant of forecasting accuracy and robustness across horizons. Classical models such as GARCH(1,1) and HAR-RV remain reliable baselines under stable market conditions, where persistence and multi-scale variance dynamics dominate. However, their flexibility is limited during turbulent periods and structural breaks.
DL models enhance predictive accuracy by capturing nonlinearities and long-range dependencies in volatility dynamics, yet the shift toward attention-based architectures represents a substantive methodological transition. Lightweight Transformer variants such as PatchTST-lite demonstrate stronger generalization, greater adaptability to regime changes, and more stable performance across horizons. Their ability to learn long-memory effects, nonlinear information propagation, and heterogeneous investor reactions allows them to outperform both classical and recurrent neural models.
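The patching idea behind PatchTST-style models can be sketched directly; note the patch length and stride below are illustrative defaults, since the paper's PatchTST-lite configuration is not reproduced here.

```python
def make_patches(series, patch_len=16, stride=8):
    """PatchTST-style tokenization: slice a series into (possibly overlapping)
    patches. Each patch becomes one Transformer token, so attention runs over
    roughly len(series)/stride tokens instead of one token per time step."""
    return [series[s: s + patch_len]
            for s in range(0, len(series) - patch_len + 1, stride)]

patches = make_patches(list(range(64)), patch_len=16, stride=8)
```

Shorter token sequences reduce attention cost, and each token now carries local sub-series structure, which is one reason lightweight patch-based variants generalize well on volatility series.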
The findings carry important implications for practitioners and policymakers. More accurate volatility forecasts strengthen risk-management practices, improve exposure-scaling rules, and support more reliable calibration of VaR and ES during turbulent periods. Since volatility is a key state variable for risk premia and capital allocation, models capable of capturing regime-dependent dynamics provide earlier and more informative signals about shifting market uncertainty and investor behavior. The strong performance of Transformer architectures highlights the growing relevance of explainable AI in volatility modeling and contributes to bridging the gap between predictive accuracy and economic interpretability.
The study also identifies several limitations that open avenues for further research. First, range-based realized-volatility measures rely on daily OHLC data and do not fully capture intraday variation. Second, the empirical evaluation focuses on statistical accuracy rather than economic backtesting through volatility-managed strategies, option-hedging experiments, or extended VaR/ES validation. Third, despite their strong performance, Transformer models remain relatively opaque, motivating future work on hybrid designs that embed structural economic constraints.
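The first limitation can be illustrated with the Parkinson range estimator, one of the range-based measures used in the appendix tables: it exploits the daily high-low range but, as noted above, still misses the full intraday path. A minimal sketch:

```python
import math

def parkinson_variance(high, low):
    """Parkinson (1980) range-based daily variance estimate from high/low
    prices: ln(H/L)^2 / (4 ln 2). More efficient than squared close-to-close
    returns, but blind to intraday path details and overnight gaps."""
    return math.log(high / low) ** 2 / (4.0 * math.log(2.0))

quiet = parkinson_variance(100.5, 100.0)   # narrow range -> small variance
wild = parkinson_variance(105.0, 95.0)     # wide range -> large variance
```

Two days with identical high-low ranges but very different intraday paths receive the same estimate, which is precisely why high-frequency realized measures are flagged as a direction for future work.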
Promising directions for future research include integrating Transformer-based volatility forecasts into option-pricing and risk-management frameworks, assessing their impact on implied-volatility surfaces and dynamic-hedging accuracy, and extending model design to incorporate macro-financial variables, sentiment indicators, or high-frequency realized measures.
Switching-regime models and economically constrained Transformers also represent an important avenue, as they can detect abrupt market transitions, capture asymmetric volatility responses (leverage effects), and model time-varying risk premia. These features bring models closer to real-world financial behavior, where risk, uncertainty, and investor reactions depend on the prevailing regime rather than remaining constant.
Overall, the evidence shows that attention-based architectures represent the next generation of volatility-forecasting tools. Their combination of predictive accuracy, scalability, and economic relevance enhances the methodological toolkit and deepens our understanding of how financial markets process information and transmit uncertainty across regimes.

Author Contributions

Conceptualization, G.T.-A. and D.G.; methodology, G.T.-A.; software, G.T.-A.; validation, G.T.-A. and D.G.; formal analysis, G.T.-A. and D.G.; investigation, G.T.-A. and D.G.; resources, D.G.; data curation, G.T.-A.; writing—original draft preparation, G.T.-A. and D.G.; visualization, G.T.-A.; supervision, G.T.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the public domain at https://www.investing.com. These data were derived from the following publicly accessible resource: Investing.com (https://www.investing.com).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

All appendix tables are reproduced directly from Python-generated outputs produced within the empirical forecasting pipeline.
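For reference, the DM statistics reported in Tables A2 and A4 compare average forecast losses between model pairs. Below is a simplified version of the statistic (h = 1, squared-error loss, no HAC variance correction; the paper's exact loss and correction settings are not reproduced here), with hypothetical error series:

```python
import math
import statistics

def diebold_mariano(e1, e2):
    """Simplified Diebold-Mariano statistic on squared-error losses.
    Positive values indicate model 1 has larger average losses than model 2."""
    d = [a * a - b * b for a, b in zip(e1, e2)]   # loss differentials
    n = len(d)
    return statistics.fmean(d) / math.sqrt(statistics.pvariance(d) / n)

e1 = [1.0, 1.1, 0.9, 1.2, 1.0]   # hypothetical forecast errors, model 1
e2 = [0.5, 0.4, 0.6, 0.5, 0.5]   # hypothetical forecast errors, model 2
stat = diebold_mariano(e1, e2)
```

Swapping the two models flips the sign of the statistic, which is why the tables below report signed values: a negative entry means the first listed model has the smaller losses.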
Table A1. Forecasting Errors of Classical Volatility Models Across Indices, RV Estimators, and Forecast Horizons (MAE, RMSE, QLIKE).
Index | RV | H | Model | MAE | RMSE | QLIKE
s&p_500 | close | 1 | ARIMA(1,0,1) | 2.190466 | 2.943689 | 15.86222
s&p_500 | close | 1 | GARCH(1,1) | 2.062541 | 3.007749 | 1.586882
s&p_500 | close | 1 | HAR-RV | 1.897341 | 2.608537 | 4.457791
s&p_500 | close | 5 | ARIMA(1,0,1) | 2.191737 | 2.944996 | 15.92837
s&p_500 | close | 5 | GARCH(1,1) | 2.085445 | 3.035018 | 1.588451
s&p_500 | close | 5 | HAR-RV | 1.916942 | 2.616756 | 4.718463
s&p_500 | close | 22 | ARIMA(1,0,1) | 2.197184 | 2.950601 | 16.21369
s&p_500 | close | 22 | GARCH(1,1) | 2.171591 | 3.131785 | 1.62144
s&p_500 | close | 22 | HAR-RV | 1.950925 | 2.669916 | 5.794402
s&p_500 | parkinson | 1 | ARIMA(1,0,1) | 1.545719 | 1.891499 | 2.001075
s&p_500 | parkinson | 1 | GARCH(1,1) | 0.979665 | 1.202144 | 0.450411
s&p_500 | parkinson | 1 | HAR-RV | 0.69657 | 0.871594 | 0.481789
s&p_500 | parkinson | 5 | ARIMA(1,0,1) | 1.548011 | 1.894096 | 2.005168
s&p_500 | parkinson | 5 | GARCH(1,1) | 1.01858 | 1.243338 | 0.475387
s&p_500 | parkinson | 5 | HAR-RV | 0.78301 | 0.982462 | 0.657114
s&p_500 | parkinson | 22 | ARIMA(1,0,1) | 1.557792 | 1.905159 | 2.02269
s&p_500 | parkinson | 22 | GARCH(1,1) | 1.15604 | 1.389518 | 0.569135
s&p_500 | parkinson | 22 | HAR-RV | 1.011914 | 1.280252 | 1.568271
s&p_500 | yang_zhang | 1 | ARIMA(1,0,1) | 1.212102 | 1.539617 | 1.071757
s&p_500 | yang_zhang | 1 | GARCH(1,1) | 0.66332 | 0.797295 | 0.293285
s&p_500 | yang_zhang | 1 | HAR-RV | 0.04841 | 0.077893 | 0.003229
s&p_500 | yang_zhang | 5 | ARIMA(1,0,1) | 1.21386 | 1.54181 | 1.074058
s&p_500 | yang_zhang | 5 | GARCH(1,1) | 0.68969 | 0.816587 | 0.295144
s&p_500 | yang_zhang | 5 | HAR-RV | 0.17158 | 0.241657 | 0.032598
s&p_500 | yang_zhang | 22 | ARIMA(1,0,1) | 1.22138 | 1.551146 | 1.083891
s&p_500 | yang_zhang | 22 | GARCH(1,1) | 0.798647 | 0.910986 | 0.328554
s&p_500 | yang_zhang | 22 | HAR-RV | 0.930589 | 1.481286 | 10.57867
nasdaq_100 | close | 1 | ARIMA(1,0,1) | 2.355122 | 3.08994 | 39.16095
nasdaq_100 | close | 1 | GARCH(1,1) | 1.99676 | 2.825239 | 1.512568
nasdaq_100 | close | 1 | HAR-RV | 1.82554 | 2.439304 | 3.504109
nasdaq_100 | close | 5 | ARIMA(1,0,1) | 2.357519 | 3.092874 | 39.623
nasdaq_100 | close | 5 | GARCH(1,1) | 2.011634 | 2.844247 | 1.513021
nasdaq_100 | close | 5 | HAR-RV | 1.859593 | 2.482687 | 4.316211
nasdaq_100 | close | 22 | ARIMA(1,0,1) | 2.367773 | 3.105451 | 41.65694
nasdaq_100 | close | 22 | GARCH(1,1) | 2.071724 | 2.915277 | 1.529352
nasdaq_100 | close | 22 | HAR-RV | 1.982175 | 2.617589 | 4.969706
nasdaq_100 | parkinson | 1 | ARIMA(1,0,1) | 1.834097 | 2.323519 | 3.073861
nasdaq_100 | parkinson | 1 | GARCH(1,1) | 1.000879 | 1.206354 | 0.46129
nasdaq_100 | parkinson | 1 | HAR-RV | 0.675013 | 0.845099 | 0.425372
nasdaq_100 | parkinson | 5 | ARIMA(1,0,1) | 1.83781 | 2.327707 | 3.09232
nasdaq_100 | parkinson | 5 | GARCH(1,1) | 1.027914 | 1.234052 | 0.478126
nasdaq_100 | parkinson | 5 | HAR-RV | 0.827037 | 1.054998 | 0.615849
nasdaq_100 | parkinson | 22 | ARIMA(1,0,1) | 1.853629 | 2.345531 | 3.172278
nasdaq_100 | parkinson | 22 | GARCH(1,1) | 1.127312 | 1.337739 | 0.543594
nasdaq_100 | parkinson | 22 | HAR-RV | 1.046467 | 1.402988 | 0.991795
nasdaq_100 | yang_zhang | 1 | ARIMA(1,0,1) | 1.692844 | 2.352314 | 2.111175
nasdaq_100 | yang_zhang | 1 | GARCH(1,1) | 0.507771 | 0.653675 | 0.254583
nasdaq_100 | yang_zhang | 1 | HAR-RV | 0.049286 | 0.083513 | 0.003853
nasdaq_100 | yang_zhang | 5 | ARIMA(1,0,1) | 1.696449 | 2.356699 | 2.121979
nasdaq_100 | yang_zhang | 5 | GARCH(1,1) | 0.518426 | 0.660913 | 0.250222
nasdaq_100 | yang_zhang | 5 | HAR-RV | 0.181447 | 0.258763 | 0.037915
nasdaq_100 | yang_zhang | 22 | ARIMA(1,0,1) | 1.711911 | 2.375321 | 2.168591
nasdaq_100 | yang_zhang | 22 | GARCH(1,1) | 0.575185 | 0.707292 | 0.25044
nasdaq_100 | yang_zhang | 22 | HAR-RV | 0.767306 | 1.143081 | 0.550826
dow_jones | close | 1 | ARIMA(1,0,1) | 2.21004 | 2.958364 | 8.852452
dow_jones | close | 1 | GARCH(1,1) | 2.034265 | 2.888398 | 1.557542
dow_jones | close | 1 | HAR-RV | 1.873253 | 2.485833 | 4.554812
dow_jones | close | 5 | ARIMA(1,0,1) | 2.211575 | 2.960264 | 8.872977
dow_jones | close | 5 | GARCH(1,1) | 2.057825 | 2.916608 | 1.559809
dow_jones | close | 5 | HAR-RV | 1.917561 | 2.53866 | 5.925871
dow_jones | close | 22 | ARIMA(1,0,1) | 2.218175 | 2.968396 | 8.961539
dow_jones | close | 22 | GARCH(1,1) | 2.146747 | 3.016324 | 1.595059
dow_jones | close | 22 | HAR-RV | 1.899516 | 2.525767 | 5.345172
dow_jones | parkinson | 1 | ARIMA(1,0,1) | 1.587541 | 1.961046 | 1.521659
dow_jones | parkinson | 1 | GARCH(1,1) | 0.914155 | 1.195255 | 0.41325
dow_jones | parkinson | 1 | HAR-RV | 0.698655 | 0.951657 | 0.505822
dow_jones | parkinson | 5 | ARIMA(1,0,1) | 1.590467 | 1.964217 | 1.523238
dow_jones | parkinson | 5 | GARCH(1,1) | 0.951811 | 1.232999 | 0.436347
dow_jones | parkinson | 5 | HAR-RV | 0.764694 | 1.037907 | 0.682624
dow_jones | parkinson | 22 | ARIMA(1,0,1) | 1.602964 | 1.97774 | 1.530081
dow_jones | parkinson | 22 | GARCH(1,1) | 1.08666 | 1.369178 | 0.524821
dow_jones | parkinson | 22 | HAR-RV | 0.94467 | 1.25347 | 1.37604
dow_jones | yang_zhang | 1 | ARIMA(1,0,1) | 1.040494 | 1.282162 | 1.104241
dow_jones | yang_zhang | 1 | GARCH(1,1) | 0.622354 | 0.764667 | 0.28261
dow_jones | yang_zhang | 1 | HAR-RV | 0.047167 | 0.078626 | 0.003472
dow_jones | yang_zhang | 5 | ARIMA(1,0,1) | 1.041462 | 1.283341 | 1.106915
dow_jones | yang_zhang | 5 | GARCH(1,1) | 0.649438 | 0.783229 | 0.284522
dow_jones | yang_zhang | 5 | HAR-RV | 0.170321 | 0.243659 | 0.034213
dow_jones | yang_zhang | 22 | ARIMA(1,0,1) | 1.045579 | 1.288369 | 1.118378
dow_jones | yang_zhang | 22 | GARCH(1,1) | 0.760427 | 0.876905 | 0.317268
dow_jones | yang_zhang | 22 | HAR-RV | 0.697468 | 1.008185 | 1.51359
Table A2. Diebold–Mariano Test Results.
Index | RV | H | Model A | Model B | DM_Stat | p_Value
s&p_500 | close | 1 | ARIMA(1,0,1) | GARCH(1,1) | 11.95388 | 0
s&p_500 | close | 1 | ARIMA(1,0,1) | HAR-RV | 9.803763 | 0
s&p_500 | close | 1 | HAR-RV | GARCH(1,1) | 21.0731 | 0
s&p_500 | close | 5 | ARIMA(1,0,1) | GARCH(1,1) | 8.099955 | 4.44 × 10^-16
s&p_500 | close | 5 | ARIMA(1,0,1) | HAR-RV | 6.452851 | 1.10 × 10^-10
s&p_500 | close | 5 | HAR-RV | GARCH(1,1) | 14.78607 | 0
s&p_500 | close | 22 | ARIMA(1,0,1) | GARCH(1,1) | 4.3154571 | 1.59 × 10^-5
s&p_500 | close | 22 | ARIMA(1,0,1) | HAR-RV | 3.197967 | 0.001384
s&p_500 | close | 22 | HAR-RV | GARCH(1,1) | 6.057798 | 1.38 × 10^-9
s&p_500 | parkinson | 1 | ARIMA(1,0,1) | GARCH(1,1) | 14.14948 | 0
s&p_500 | parkinson | 1 | ARIMA(1,0,1) | HAR-RV | 14.07068 | 0
s&p_500 | parkinson | 1 | HAR-RV | GARCH(1,1) | 1.857147 | 0.06329
s&p_500 | parkinson | 5 | ARIMA(1,0,1) | GARCH(1,1) | 7.581582 | 3.42 × 10^-14
s&p_500 | parkinson | 5 | ARIMA(1,0,1) | HAR-RV | 6.74267 | 0.56 × 10^-11
s&p_500 | parkinson | 5 | HAR-RV | GARCH(1,1) | 4.786655 | 1.70 × 10^-6
s&p_500 | parkinson | 22 | ARIMA(1,0,1) | GARCH(1,1) | 3.836474 | 0.000125
s&p_500 | parkinson | 22 | ARIMA(1,0,1) | HAR-RV | 1.077854 | 0.281099
s&p_500 | parkinson | 22 | HAR-RV | GARCH(1,1) | 4.0776984 | 4.55 × 10^-5
s&p_500 | yang_zhang | 1 | ARIMA(1,0,1) | GARCH(1,1) | 19.56839 | 0
s&p_500 | yang_zhang | 1 | ARIMA(1,0,1) | HAR-RV | 27.12783 | 0
s&p_500 | yang_zhang | 1 | HAR-RV | GARCH(1,1) | −27.6376 | 0
s&p_500 | yang_zhang | 5 | ARIMA(1,0,1) | GARCH(1,1) | 8.862698 | 0
s&p_500 | yang_zhang | 5 | ARIMA(1,0,1) | HAR-RV | 11.86448 | 0
s&p_500 | yang_zhang | 5 | HAR-RV | GARCH(1,1) | −13.7568 | 0
s&p_500 | yang_zhang | 22 | ARIMA(1,0,1) | GARCH(1,1) | 4.318028 | 1.57 × 10^-5
s&p_500 | yang_zhang | 22 | ARIMA(1,0,1) | HAR-RV | −1.66134 | 0.096644
s&p_500 | yang_zhang | 22 | HAR-RV | GARCH(1,1) | 1.795406 | 0.072589
nasdaq_100 | close | 1 | ARIMA(1,0,1) | GARCH(1,1) | 11.72362 | 0
nasdaq_100 | close | 1 | ARIMA(1,0,1) | HAR-RV | 11.16776 | 0
nasdaq_100 | close | 1 | HAR-RV | GARCH(1,1) | 21.51898 | 0
nasdaq_100 | close | 5 | ARIMA(1,0,1) | GARCH(1,1) | 8.095809 | 4.44 × 10^-16
nasdaq_100 | close | 5 | ARIMA(1,0,1) | HAR-RV | 7.490017 | 6.88 × 10^-14
nasdaq_100 | close | 5 | HAR-RV | GARCH(1,1) | 5.670492 | 1.42 × 10^-8
nasdaq_100 | close | 22 | ARIMA(1,0,1) | GARCH(1,1) | 4.394012 | 1.11 × 10^-5
nasdaq_100 | close | 22 | ARIMA(1,0,1) | HAR-RV | 4.065009 | 4.80 × 10^-5
nasdaq_100 | close | 22 | HAR-RV | GARCH(1,1) | 6.726306 | 1.74 × 10^-11
nasdaq_100 | parkinson | 1 | ARIMA(1,0,1) | GARCH(1,1) | 15.02516 | 0
nasdaq_100 | parkinson | 1 | ARIMA(1,0,1) | HAR-RV | 15.64822 | 0
nasdaq_100 | parkinson | 1 | HAR-RV | GARCH(1,1) | −2.28995 | 0.022024
nasdaq_100 | parkinson | 5 | ARIMA(1,0,1) | GARCH(1,1) | 8.794855 | 0
nasdaq_100 | parkinson | 5 | ARIMA(1,0,1) | HAR-RV | 8.528795 | 0
nasdaq_100 | parkinson | 5 | HAR-RV | GARCH(1,1) | 4.159174 | 3.19 × 10^-5
nasdaq_100 | parkinson | 22 | ARIMA(1,0,1) | GARCH(1,1) | 4.720948 | 2.35 × 10^-6
nasdaq_100 | parkinson | 22 | ARIMA(1,0,1) | HAR-RV | 4.045961 | 5.21 × 10^-5
nasdaq_100 | parkinson | 22 | HAR-RV | GARCH(1,1) | 4.419376 | 9.90 × 10^-6
nasdaq_100 | yang_zhang | 1 | ARIMA(1,0,1) | GARCH(1,1) | 26.16152 | 0
nasdaq_100 | yang_zhang | 1 | ARIMA(1,0,1) | HAR-RV | 29.00147 | 0
nasdaq_100 | yang_zhang | 1 | HAR-RV | GARCH(1,1) | −21.0011 | 0
nasdaq_100 | yang_zhang | 5 | ARIMA(1,0,1) | GARCH(1,1) | 11.78916 | 0
nasdaq_100 | yang_zhang | 5 | ARIMA(1,0,1) | HAR-RV | 12.84313 | 0
nasdaq_100 | yang_zhang | 5 | HAR-RV | GARCH(1,1) | −9.30105 | 0
nasdaq_100 | yang_zhang | 22 | ARIMA(1,0,1) | GARCH(1,1) | 5.840253 | 5.21 × 10^-9
nasdaq_100 | yang_zhang | 22 | ARIMA(1,0,1) | HAR-RV | 5.004617 | 5.60 × 10^-7
nasdaq_100 | yang_zhang | 22 | HAR-RV | GARCH(1,1) | 6.228023 | 4.71 × 10^-10
dow_jones | close | 1 | ARIMA(1,0,1) | GARCH(1,1) | 10.67274 | 0
dow_jones | close | 1 | ARIMA(1,0,1) | HAR-RV | 6.429368 | 1.28 × 10^-10
dow_jones | close | 1 | HAR-RV | GARCH(1,1) | 21.93289 | 0
dow_jones | close | 5 | ARIMA(1,0,1) | GARCH(1,1) | 7.241718 | 4.43 × 10^-13
dow_jones | close | 5 | ARIMA(1,0,1) | HAR-RV | 2.83222 | 0.004623
dow_jones | close | 5 | HAR-RV | GARCH(1,1) | 11.67495 | 0
dow_jones | close | 22 | ARIMA(1,0,1) | GARCH(1,1) | 3.860148 | 0.000113
dow_jones | close | 22 | ARIMA(1,0,1) | HAR-RV | 1.91295 | 0.055754
dow_jones | close | 22 | HAR-RV | GARCH(1,1) | 4.412203 | 1.02 × 10^-5
dow_jones | parkinson | 1 | ARIMA(1,0,1) | GARCH(1,1) | 16.45357 | 0
dow_jones | parkinson | 1 | ARIMA(1,0,1) | HAR-RV | 15.16031 | 0
dow_jones | parkinson | 1 | HAR-RV | GARCH(1,1) | 4.832531 | 1.35 × 10^-6
dow_jones | parkinson | 5 | ARIMA(1,0,1) | GARCH(1,1) | 8.837937 | 0
dow_jones | parkinson | 5 | ARIMA(1,0,1) | HAR-RV | 6.730461 | 1.69 × 10^-11
dow_jones | parkinson | 5 | HAR-RV | GARCH(1,1) | 5.459573 | 4.77 × 10^-8
dow_jones | parkinson | 22 | ARIMA(1,0,1) | GARCH(1,1) | 4.542445 | 5.56 × 10^-6
dow_jones | parkinson | 22 | ARIMA(1,0,1) | HAR-RV | 0.555469 | 0.578574
dow_jones | parkinson | 22 | HAR-RV | GARCH(1,1) | 4.153913 | 3.27 × 10^-5
dow_jones | yang_zhang | 1 | ARIMA(1,0,1) | GARCH(1,1) | 16.25125 | 0
dow_jones | yang_zhang | 1 | ARIMA(1,0,1) | HAR-RV | 21.52219 | 0
dow_jones | yang_zhang | 1 | HAR-RV | GARCH(1,1) | −24.5077 | 0
dow_jones | yang_zhang | 5 | ARIMA(1,0,1) | GARCH(1,1) | 7.324583 | 2.40 × 10^-13
dow_jones | yang_zhang | 5 | ARIMA(1,0,1) | HAR-RV | 9.423073 | 0
dow_jones | yang_zhang | 5 | HAR-RV | GARCH(1,1) | −11.9976 | 0
dow_jones | yang_zhang | 22 | ARIMA(1,0,1) | GARCH(1,1) | 3.551459 | 0.000383
dow_jones | yang_zhang | 22 | ARIMA(1,0,1) | HAR-RV | −0.66664 | 0.505002
dow_jones | yang_zhang | 22 | HAR-RV | GARCH(1,1) | 2.185329 | 0.028865
Table A3. Forecast Error Values for Advanced Volatility Models (LSTM, CNN-LSTM, Transformer, PatchTST-lite) Across RV Estimators and Forecast Horizons (h = 1, 5, 22).
Index | RV | H | Model | MAE | RMSE | QLIKE
s&p_500 | close | 1 | LSTM | 1.863586 | 2.433881 | 4.608724
s&p_500 | close | 1 | CNN-LSTM | 1.878374 | 2.430279 | 4.927867
s&p_500 | close | 1 | Transformer | 1.81857 | 2.419457 | 3.866168
s&p_500 | close | 1 | PatchTST-lite | 1.845279 | 2.437099 | 4.244226
s&p_500 | close | 5 | LSTM | 1.906283 | 2.451163 | 5.255926
s&p_500 | close | 5 | CNN-LSTM | 1.898324 | 2.449351 | 5.276246
s&p_500 | close | 5 | Transformer | 1.910918 | 2.458969 | 5.579458
s&p_500 | close | 5 | PatchTST-lite | 1.870784 | 2.463561 | 4.807492
s&p_500 | close | 22 | LSTM | 1.888447 | 2.459761 | 4.807224
s&p_500 | close | 22 | CNN-LSTM | 1.896499 | 2.468959 | 4.7817
s&p_500 | close | 22 | Transformer | 1.892453 | 2.47475 | 4.881677
s&p_500 | close | 22 | PatchTST-lite | 1.901906 | 2.478622 | 5.532154
s&p_500 | parkinson | 1 | LSTM | 0.717675 | 0.909185 | 0.57689
s&p_500 | parkinson | 1 | CNN-LSTM | 0.709363 | 0.899914 | 0.573365
s&p_500 | parkinson | 1 | Transformer | 0.707107 | 0.897727 | 0.593991
s&p_500 | parkinson | 1 | PatchTST-lite | 0.716675 | 0.909292 | 0.610773
s&p_500 | parkinson | 5 | LSTM | 0.791176 | 1.001354 | 0.767843
s&p_500 | parkinson | 5 | CNN-LSTM | 0.802061 | 1.017191 | 0.884593
s&p_500 | parkinson | 5 | Transformer | 0.78312 | 0.992067 | 0.805681
s&p_500 | parkinson | 5 | PatchTST-lite | 0.790207 | 0.997929 | 0.785396
s&p_500 | parkinson | 22 | LSTM | 0.849273 | 1.061342 | 0.755378
s&p_500 | parkinson | 22 | CNN-LSTM | 0.86087 | 1.076602 | 0.839096
s&p_500 | parkinson | 22 | Transformer | 0.853919 | 1.073778 | 0.858656
s&p_500 | parkinson | 22 | PatchTST-lite | 0.846725 | 1.067773 | 0.989817
s&p_500 | yang_zhang | 1 | LSTM | 0.107 | 0.1561 | 0.013388
s&p_500 | yang_zhang | 1 | CNN-LSTM | 0.091686 | 0.13737 | 0.010077
s&p_500 | yang_zhang | 1 | Transformer | 0.095119 | 0.143745 | 0.011135
s&p_500 | yang_zhang | 1 | PatchTST-lite | 0.097809 | 0.146895 | 0.011331
s&p_500 | yang_zhang | 5 | LSTM | 0.213754 | 0.300934 | 0.057485
s&p_500 | yang_zhang | 5 | CNN-LSTM | 0.203555 | 0.28591 | 0.051547
s&p_500 | yang_zhang | 5 | Transformer | 0.221165 | 0.303101 | 0.05141
s&p_500 | yang_zhang | 5 | PatchTST-lite | 0.226451 | 0.308445 | 0.053766
s&p_500 | yang_zhang | 22 | LSTM | 0.504765 | 0.666804 | 0.311311
s&p_500 | yang_zhang | 22 | CNN-LSTM | 0.499061 | 0.658474 | 0.297989
s&p_500 | yang_zhang | 22 | Transformer | 0.500213 | 0.659152 | 0.280434
s&p_500 | yang_zhang | 22 | PatchTST-lite | 0.51695 | 0.693152 | 0.323325
nasdaq_100 | close | 1 | LSTM | 1.846054 | 2.377758 | 4.0135
nasdaq_100 | close | 1 | CNN-LSTM | 1.857555 | 2.382425 | 4.35926
nasdaq_100 | close | 1 | Transformer | 1.875776 | 2.380452 | 4.597229
nasdaq_100 | close | 1 | PatchTST-lite | 1.851063 | 2.373445 | 4.46793
nasdaq_100 | close | 5 | LSTM | 1.868726 | 2.387098 | 4.695791
nasdaq_100 | close | 5 | CNN-LSTM | 1.839794 | 2.394148 | 3.875657
nasdaq_100 | close | 5 | Transformer | 1.818934 | 2.375125 | 3.565255
nasdaq_100 | close | 5 | PatchTST-lite | 1.822905 | 2.38162 | 3.837673
nasdaq_100 | close | 22 | LSTM | 1.91853 | 2.417123 | 5.293182
nasdaq_100 | close | 22 | CNN-LSTM | 1.921711 | 2.423189 | 5.222672
nasdaq_100 | close | 22 | Transformer | 1.883382 | 2.416631 | 4.564707
nasdaq_100 | close | 22 | PatchTST-lite | 1.869928 | 2.408149 | 4.328153
nasdaq_100 | parkinson | 1 | LSTM | 0.656474 | 0.836234 | 0.48399
nasdaq_100 | parkinson | 1 | CNN-LSTM | 0.672187 | 0.85334 | 0.519447
nasdaq_100 | parkinson | 1 | Transformer | 0.656146 | 0.835782 | 0.495191
nasdaq_100 | parkinson | 1 | PatchTST-lite | 0.651614 | 0.834936 | 0.506241
nasdaq_100 | parkinson | 5 | LSTM | 0.71256 | 0.907154 | 0.619259
nasdaq_100 | parkinson | 5 | CNN-LSTM | 0.720531 | 0.917135 | 0.66056
nasdaq_100 | parkinson | 5 | Transformer | 0.721002 | 0.911427 | 0.577625
nasdaq_100 | parkinson | 5 | PatchTST-lite | 0.721852 | 0.921964 | 0.697614
nasdaq_100 | parkinson | 22 | LSTM | 0.758366 | 0.964748 | 0.703118
nasdaq_100 | parkinson | 22 | CNN-LSTM | 0.771627 | 0.982671 | 0.721489
nasdaq_100 | parkinson | 22 | Transformer | 0.789826 | 1.006427 | 0.827204
nasdaq_100 | parkinson | 22 | PatchTST-lite | 0.775539 | 0.978418 | 0.641463
nasdaq_100 | yang_zhang | 1 | LSTM | 0.095747 | 0.141525 | 0.010995
nasdaq_100 | yang_zhang | 1 | CNN-LSTM | 0.090325 | 0.135626 | 0.00953
nasdaq_100 | yang_zhang | 1 | Transformer | 0.090233 | 0.133595 | 0.009185
nasdaq_100 | yang_zhang | 1 | PatchTST-lite | 0.084479 | 0.126826 | 0.00867
nasdaq_100 | yang_zhang | 5 | LSTM | 0.201309 | 0.283641 | 0.049458
nasdaq_100 | yang_zhang | 5 | CNN-LSTM | 0.196577 | 0.281939 | 0.049162
nasdaq_100 | yang_zhang | 5 | Transformer | 0.196317 | 0.276675 | 0.043372
nasdaq_100 | yang_zhang | 5 | PatchTST-lite | 0.205044 | 0.280398 | 0.044174
nasdaq_100 | yang_zhang | 22 | LSTM | 0.468609 | 0.629263 | 0.281611
nasdaq_100 | yang_zhang | 22 | CNN-LSTM | 0.498324 | 0.656997 | 0.290477
nasdaq_100 | yang_zhang | 22 | Transformer | 0.478947 | 0.629071 | 0.262585
nasdaq_100 | yang_zhang | 22 | PatchTST-lite | 0.500796 | 0.671027 | 0.334713
dow_jones | close | 1 | LSTM | 1.847802 | 2.443665 | 4.426976
dow_jones | close | 1 | CNN-LSTM | 1.861948 | 2.444142 | 4.7542
dow_jones | close | 1 | Transformer | 1.83436 | 2.431458 | 4.090214
dow_jones | close | 1 | PatchTST-lite | 1.838722 | 2.442273 | 4.400191
dow_jones | close | 5 | LSTM | 1.854053 | 2.45159 | 4.648342
dow_jones | close | 5 | CNN-LSTM | 1.840471 | 2.445538 | 4.327708
dow_jones | close | 5 | Transformer | 1.847194 | 2.434596 | 4.547064
dow_jones | close | 5 | PatchTST-lite | 1.860548 | 2.453651 | 5.127373
dow_jones | close | 22 | LSTM | 1.903721 | 2.4594 | 5.574053
dow_jones | close | 22 | CNN-LSTM | 1.865801 | 2.448277 | 4.67001
dow_jones | close | 22 | Transformer | 1.837964 | 2.443301 | 4.284449
dow_jones | close | 22 | PatchTST-lite | 1.846041 | 2.449109 | 4.300733
dow_jones | parkinson | 1 | LSTM | 0.66404 | 0.837485 | 0.481757
dow_jones | parkinson | 1 | CNN-LSTM | 0.666295 | 0.839295 | 0.468118
dow_jones | parkinson | 1 | Transformer | 0.66793 | 0.843683 | 0.510677
dow_jones | parkinson | 1 | PatchTST-lite | 0.674127 | 0.846206 | 0.460885
dow_jones | parkinson | 5 | LSTM | 0.714868 | 0.905726 | 0.609055
dow_jones | parkinson | 5 | CNN-LSTM | 0.720986 | 0.911346 | 0.629025
dow_jones | parkinson | 5 | Transformer | 0.719854 | 0.905898 | 0.592983
dow_jones | parkinson | 5 | PatchTST-lite | 0.724323 | 0.912757 | 0.635256
dow_jones | parkinson | 22 | LSTM | 0.765173 | 0.960267 | 0.612206
dow_jones | parkinson | 22 | CNN-LSTM | 0.781241 | 0.983521 | 0.658715
dow_jones | parkinson | 22 | Transformer | 0.76393 | 0.971118 | 0.835195
dow_jones | parkinson | 22 | PatchTST-lite | 0.76366 | 0.968694 | 0.647512
dow_jones | yang_zhang | 1 | LSTM | 0.085828 | 0.125242 | 0.008233
dow_jones | yang_zhang | 1 | CNN-LSTM | 0.085122 | 0.128343 | 0.008456
dow_jones | yang_zhang | 1 | Transformer | 0.092888 | 0.13516 | 0.009149
dow_jones | yang_zhang | 1 | PatchTST-lite | 0.085793 | 0.126192 | 0.008315
dow_jones | yang_zhang | 5 | LSTM | 0.183967 | 0.254589 | 0.038657
dow_jones | yang_zhang | 5 | CNN-LSTM | 0.189308 | 0.265428 | 0.043682
dow_jones | yang_zhang | 5 | Transformer | 0.194017 | 0.275715 | 0.042837
dow_jones | yang_zhang | 5 | PatchTST-lite | 0.19478 | 0.279234 | 0.046453
dow_jones | yang_zhang | 22 | LSTM | 0.455768 | 0.604879 | 0.226007
dow_jones | yang_zhang | 22 | CNN-LSTM | 0.455406 | 0.602958 | 0.230627
dow_jones | yang_zhang | 22 | Transformer | 0.456496 | 0.601709 | 0.223307
dow_jones | yang_zhang | 22 | PatchTST-lite | 0.464264 | 0.620634 | 0.264777
Table A4. Diebold–Mariano Test Results for Advanced Forecasting Models.
Index | RV | H | Model1 | Model2 | DM_Stat | p_Value
s&p_500 | close | 1 | LSTM | CNN-LSTM | −4.05481 | 5.02 × 10^-5
s&p_500 | close | 1 | LSTM | Transformer | 6.614859 | 3.72 × 10^-11
s&p_500 | close | 1 | LSTM | PatchTST-lite | 2.267691 | 0.023348
s&p_500 | close | 1 | CNN-LSTM | Transformer | 7.853962 | 4.00 × 10^-15
s&p_500 | close | 1 | CNN-LSTM | PatchTST-lite | 4.36369 | 1.28 × 10^-5
s&p_500 | close | 1 | Transformer | PatchTST-lite | −2.49469 | 0.012607
s&p_500 | close | 5 | LSTM | CNN-LSTM | −0.26253 | 0.792914
s&p_500 | close | 5 | LSTM | Transformer | −1.4326 | 0.151972
s&p_500 | close | 5 | LSTM | PatchTST-lite | 2.133038 | 0.032922
s&p_500 | close | 5 | CNN-LSTM | Transformer | −1.2994 | 0.193806
s&p_500 | close | 5 | CNN-LSTM | PatchTST-lite | 2.131552 | 0.033044
s&p_500 | close | 5 | Transformer | PatchTST-lite | 6.246839 | 4.19 × 10^-10
s&p_500 | close | 22 | LSTM | CNN-LSTM | 0.197285 | 0.843604
s&p_500 | close | 22 | LSTM | Transformer | −0.47491 | 0.634848
s&p_500 | close | 22 | LSTM | PatchTST-lite | −1.46417 | 0.143146
s&p_500 | close | 22 | CNN-LSTM | Transformer | −0.40941 | 0.682241
s&p_500 | close | 22 | CNN-LSTM | PatchTST-lite | −1.22987 | 0.218746
s&p_500 | close | 22 | Transformer | PatchTST-lite | −1.48687 | 0.13705
s&p_500 | parkinson | 1 | LSTM | CNN-LSTM | 0.855671 | 0.39218
s&p_500 | parkinson | 1 | LSTM | Transformer | −1.46178 | 0.143801
s&p_500 | parkinson | 1 | LSTM | PatchTST-lite | −3.69061 | 0.000224
s&p_500 | parkinson | 1 | CNN-LSTM | Transformer | −2.02616 | 0.042749
s&p_500 | parkinson | 1 | CNN-LSTM | PatchTST-lite | −4.11184 | 3.93 × 10^-5
s&p_500 | parkinson | 1 | Transformer | PatchTST-lite | −1.90779 | 0.056418
s&p_500 | parkinson | 5 | LSTM | CNN-LSTM | −4.75083 | 2.03 × 10^-6
s&p_500 | parkinson | 5 | LSTM | Transformer | −2.01915 | 0.043472
s&p_500 | parkinson | 5 | LSTM | PatchTST-lite | −1.02773 | 0.304077
s&p_500 | parkinson | 5 | CNN-LSTM | Transformer | 2.838139 | 0.004538
s&p_500 | parkinson | 5 | CNN-LSTM | PatchTST-lite | 2.969584 | 0.002982
s&p_500 | parkinson | 5 | Transformer | PatchTST-lite | 1.224402 | 0.220801
s&p_500 | parkinson | 22 | LSTM | CNN-LSTM | −3.26076 | 0.001111
s&p_500 | parkinson | 22 | LSTM | Transformer | −1.85124 | 0.064134
s&p_500 | parkinson | 22 | LSTM | PatchTST-lite | −1.96586 | 0.049315
s&p_500 | parkinson | 22 | CNN-LSTM | Transformer | −0.27871 | 0.780467
s&p_500 | parkinson | 22 | CNN-LSTM | PatchTST-lite | −1.1249 | 0.260633
s&p_500 | parkinson | 22 | Transformer | PatchTST-lite | −1.70589 | 0.088029
s&p_500 | yang_zhang | 1 | LSTM | CNN-LSTM | 6.645132 | 3.03 × 10^-11
s&p_500 | yang_zhang | 1 | LSTM | Transformer | 4.87451 | 1.09 × 10^-6
s&p_500 | yang_zhang | 1 | LSTM | PatchTST-lite | 4.128009 | 3.66 × 10^-5
s&p_500 | yang_zhang | 1 | CNN-LSTM | Transformer | −4.96016 | 7.04 × 10^-7
s&p_500 | yang_zhang | 1 | CNN-LSTM | PatchTST-lite | −4.48589 | 7.26 × 10^-6
s&p_500 | yang_zhang | 1 | Transformer | PatchTST-lite | −0.78763 | 0.430911
s&p_500 | yang_zhang | 5 | LSTM | CNN-LSTM | 3.053755 | 0.00226
s&p_500 | yang_zhang | 5 | LSTM | Transformer | 0.935302 | 0.349633
s&p_500 | yang_zhang | 5 | LSTM | PatchTST-lite | 0.641425 | 0.521246
s&p_500 | yang_zhang | 5 | CNN-LSTM | Transformer | 0.025789 | 0.979426
s&p_500 | yang_zhang | 5 | CNN-LSTM | PatchTST-lite | −0.46598 | 0.64123
s&p_500 | yang_zhang | 5 | Transformer | PatchTST-lite | −1.16236 | 0.24509
s&p_500 | yang_zhang | 22 | LSTM | CNN-LSTM | 1.701804 | 0.088792
s&p_500 | yang_zhang | 22 | LSTM | Transformer | 1.773549 | 0.076138
s&p_500 | yang_zhang | 22 | LSTM | PatchTST-lite | −0.54716 | 0.584271
s&p_500 | yang_zhang | 22 | CNN-LSTM | Transformer | 1.011579 | 0.31174
s&p_500 | yang_zhang | 22 | CNN-LSTM | PatchTST-lite | −1.21484 | 0.224426
s&p_500 | yang_zhang | 22 | Transformer | PatchTST-lite | −2.5074 | 0.012162
nasdaq_100 | close | 1 | LSTM | CNN-LSTM | −2.90702 | 0.003649
nasdaq_100 | close | 1 | LSTM | Transformer | −6.05367 | 1.42 × 10^-9
nasdaq_100 | close | 1 | LSTM | PatchTST-lite | −3.57979 | 0.000344
nasdaq_100 | close | 1 | CNN-LSTM | Transformer | −1.3599 | 0.173862
nasdaq_100 | close | 1 | CNN-LSTM | PatchTST-lite | −0.59893 | 0.54922
nasdaq_100 | close | 1 | Transformer | PatchTST-lite | 0.991723 | 0.321333
nasdaq_100 | close | 5 | LSTM | CNN-LSTM | 4.444677 | 8.80 × 10^-6
nasdaq_100 | close | 5 | LSTM | Transformer | 6.772077 | 1.27 × 10^-11
nasdaq_100 | close | 5 | LSTM | PatchTST-lite | 7.034635 | 2.00 × 10^-12
nasdaq_100 | close | 5 | CNN-LSTM | Transformer | 3.247749 | 0.001163
nasdaq_100 | close | 5 | CNN-LSTM | PatchTST-lite | 0.254486 | 0.79912
nasdaq_100 | close | 5 | Transformer | PatchTST-lite | −2.50227 | 0.01234
nasdaq_100 | close | 22 | LSTM | CNN-LSTM | 0.751886 | 0.45212
nasdaq_100 | close | 22 | LSTM | Transformer | 4.855544 | 1.2 × 10^-6
nasdaq_100 | close | 22 | LSTM | PatchTST-lite | 5.676986 | 1.37 × 10^-8
nasdaq_100 | close | 22 | CNN-LSTM | Transformer | 3.625774 | 0.000288
nasdaq_100 | close | 22 | CNN-LSTM | PatchTST-lite | 4.26292 | 2.02 × 10^-5
nasdaq_100 | close | 22 | Transformer | PatchTST-lite | 2.224686 | 0.026102
nasdaq_100 | parkinson | 1 | LSTM | CNN-LSTM | −6.68295 | 2.34 × 10^-11
nasdaq_100 | parkinson | 1 | LSTM | Transformer | −1.75878 | 0.078615
nasdaq_100 | parkinson | 1 | LSTM | PatchTST-lite | −4.12464 | 3.71 × 10^-5
nasdaq_100 | parkinson | 1 | CNN-LSTM | Transformer | 2.736359 | 0.006212
nasdaq_100 | parkinson | 1 | CNN-LSTM | PatchTST-lite | 1.609056 | 0.107604
nasdaq_100 | parkinson | 1 | Transformer | PatchTST-lite | −1.63946 | 0.101117
nasdaq_100 | parkinson | 5 | LSTM | CNN-LSTM | −2.88732 | 0.003885
nasdaq_100 | parkinson | 5 | LSTM | Transformer | 2.775087 | 0.005519
nasdaq_100 | parkinson | 5 | LSTM | PatchTST-lite | −3.86848 | 0.00011
nasdaq_100 | parkinson | 5 | CNN-LSTM | Transformer | 3.458512 | 0.000543
nasdaq_100 | parkinson | 5 | CNN-LSTM | PatchTST-lite | −2.22442 | 0.02612
nasdaq_100 | parkinson | 5 | Transformer | PatchTST-lite | −4.08435 | 4.42 × 10^-5
nasdaq_100 | parkinson | 22 | LSTM | CNN-LSTM | −0.45979 | 0.645669
nasdaq_100 | parkinson | 22 | LSTM | Transformer | −3.76484 | 0.000167
nasdaq_100 | parkinson | 22 | LSTM | PatchTST-lite | 2.589052 | 0.009624
nasdaq_100 | parkinson | 22 | CNN-LSTM | Transformer | −1.7318 | 0.08331
nasdaq_100 | parkinson | 22 | CNN-LSTM | PatchTST-lite | 1.489599 | 0.13633
nasdaq_100 | parkinson | 22 | Transformer | PatchTST-lite | 5.309007 | 1.10 × 10^-7
nasdaq_100 | yang_zhang | 1 | LSTM | CNN-LSTM | 3.224419 | 0.001262
nasdaq_100 | yang_zhang | 1 | LSTM | Transformer | 3.479625 | 0.000502
nasdaq_100 | yang_zhang | 1 | LSTM | PatchTST-lite | 5.313043 | 1.08 × 10^-7
nasdaq_100 | yang_zhang | 1 | CNN-LSTM | Transformer | 1.813045 | 0.069825
nasdaq_100 | yang_zhang | 1 | CNN-LSTM | PatchTST-lite | 2.946305 | 0.003216
nasdaq_100 | yang_zhang | 1 | Transformer | PatchTST-lite | 2.026179 | 0.042746
nasdaq_100 | yang_zhang | 5 | LSTM | CNN-LSTM | 0.271497 | 0.786009
nasdaq_100 | yang_zhang | 5 | LSTM | Transformer | 1.685478 | 0.091896
nasdaq_100 | yang_zhang | 5 | LSTM | PatchTST-lite | 1.625618 | 0.104031
nasdaq_100 | yang_zhang | 5 | CNN-LSTM | Transformer | 1.678537 | 0.093242
nasdaq_100 | yang_zhang | 5 | CNN-LSTM | PatchTST-lite | 1.689702 | 0.091085
nasdaq_100 | yang_zhang | 5 | Transformer | PatchTST-lite | −0.49622 | 0.61974
nasdaq_100 | yang_zhang | 22 | LSTM | CNN-LSTM | −0.6793 | 0.49695
nasdaq_100 | yang_zhang | 22 | LSTM | Transformer | 1.328308 | 0.184076
nasdaq_100 | yang_zhang | 22 | LSTM | PatchTST-lite | −1.98504 | 0.04714
nasdaq_100 | yang_zhang | 22 | CNN-LSTM | Transformer | 1.576398 | 0.114934
nasdaq_100 | yang_zhang | 22 | CNN-LSTM | PatchTST-lite | −1.36517 | 0.172199
nasdaq_100 | yang_zhang | 22 | Transformer | PatchTST-lite | −2.80679 | 0.005004
dow_jones | close | 1 | LSTM | CNN-LSTM | −2.49616 | 0.012555
dow_jones | close | 1 | LSTM | Transformer | 1.802849 | 0.071412
dow_jones | close | 1 | LSTM | PatchTST-lite | 0.281314 | 0.77847
dow_jones | close | 1 | CNN-LSTM | Transformer | 6.027225 | 1.67 × 10^-9
dow_jones | close | 1 | CNN-LSTM | PatchTST-lite | 2.26839 | 0.023305
dow_jones | close | 1 | Transformer | PatchTST-lite | −1.50088 | 0.133387
dow_jones | close | 5 | LSTM | CNN-LSTM | 1.349998 | 0.177017
dow_jones | close | 5 | LSTM | Transformer | 0.628707 | 0.529541
dow_jones | close | 5 | LSTM | PatchTST-lite | −2.80439 | 0.005041
dow_jones | close | 5 | CNN-LSTM | Transformer | −1.67281 | 0.094365
dow_jones | close | 5 | CNN-LSTM | PatchTST-lite | −2.24626 | 0.024687
dow_jones | close | 5 | Transformer | PatchTST-lite | −2.17685 | 0.029492
dow_jones | close | 22 | LSTM | CNN-LSTM | 6.345317 | 2.22 × 10^-10
dow_jones | close | 22 | LSTM | Transformer | 8.178957 | 4.44 × 10^-16
dow_jones | close | 22 | LSTM | PatchTST-lite | 7.197178 | 6.15 × 10^-13
dow_jones | close | 22 | CNN-LSTM | Transformer | 2.343471 | 0.019105
dow_jones | close | 22 | CNN-LSTM | PatchTST-lite | 2.358735 | 0.018337
dow_jones | close | 22 | Transformer | PatchTST-lite | −0.16868 | 0.86605
dow_jones | parkinson | 1 | LSTM | CNN-LSTM | 4.47613 | 7.60 × 10^-6
dow_jones | parkinson | 1 | LSTM | Transformer | −4.05704 | 4.97 × 10^-5
dow_jones | parkinson | 1 | LSTM | PatchTST-lite | 2.775005 | 0.00552
dow_jones | parkinson | 1 | CNN-LSTM | Transformer | −5.7337 | 9.83 × 10^-9
dow_jones | parkinson | 1 | CNN-LSTM | PatchTST-lite | 0.97873 | 0.327713
dow_jones | parkinson | 1 | Transformer | PatchTST-lite | 7.044831 | 1.86 × 10^-12
dow_jones | parkinson | 5 | LSTM | CNN-LSTM | −1.40625 | 0.159649
dow_jones | parkinson | 5 | LSTM | Transformer | 1.218164 | 0.223162
dow_jones | parkinson | 5 | LSTM | PatchTST-lite | −1.33656 | 0.181365
dow_jones | parkinson | 5 | CNN-LSTM | Transformer | 1.754444 | 0.079355
dow_jones | parkinson | 5 | CNN-LSTM | PatchTST-lite | −0.60848 | 0.542869
dow_jones | parkinson | 5 | Transformer | PatchTST-lite | −1.97576 | 0.048182
dow_jones | parkinson | 22 | LSTM | CNN-LSTM | −1.94776 | 0.051444
dow_jones | parkinson | 22 | LSTM | Transformer | −2.20136 | 0.027711
dow_jones | parkinson | 22 | LSTM | PatchTST-lite | −1.07218 | 0.283641
dow_jones | parkinson | 22 | CNN-LSTM | Transformer | −1.4778 | 0.139462
dow_jones | parkinson | 22 | CNN-LSTM | PatchTST-lite | 0.216919 | 0.828272
dow_jones | parkinson | 22 | Transformer | PatchTST-lite | 2.620095 | 0.008791
dow_jones | yang_zhang | 1 | LSTM | CNN-LSTM | −0.96838 | 0.332854
dow_jones | yang_zhang | 1 | LSTM | Transformer | −2.77878 | 0.005456
dow_jones | yang_zhang | 1 | LSTM | PatchTST-lite | −0.34766 | 0.728096
dow_jones | yang_zhang | 1 | CNN-LSTM | Transformer | −3.55754 | 0.000374
dow_jones | yang_zhang | 1 | CNN-LSTM | PatchTST-lite | 0.728885 | 0.466072
dow_jones | yang_zhang | 1 | Transformer | PatchTST-lite | 3.162585 | 0.001564
dow_jones | yang_zhang | 5 | LSTM | CNN-LSTM | −2.59626 | 0.009425
dow_jones | yang_zhang | 5 | LSTM | Transformer | −1.15066 | 0.249871
dow_jones | yang_zhang | 5 | LSTM | PatchTST-lite | −3.27493 | 0.001057
dow_jones | yang_zhang | 5 | CNN-LSTM | Transformer | 0.168915 | 0.865864
dow_jones | yang_zhang | 5 | CNN-LSTM | PatchTST-lite | −0.82035 | 0.412014
dow_jones | yang_zhang | 5 | Transformer | PatchTST-lite | −1.61533 | 0.106239
dow_jones | yang_zhang | 22 | LSTM | CNN-LSTM | −1.39544 | 0.162883
dow_jones | yang_zhang | 22 | LSTM | Transformer | 0.255507 | 0.798332
dow_jones | yang_zhang | 22 | LSTM | PatchTST-lite | −1.66104 | 0.096706
dow_jones | yang_zhang | 22 | CNN-LSTM | Transformer | 0.696543 | 0.486089
dow_jones | yang_zhang | 22 | CNN-LSTM | PatchTST-lite | −1.50736 | 0.131719
dow_jones | yang_zhang | 22 | Transformer | PatchTST-lite | −2.21077 | 0.027052

References

  1. Andersen, T. G., Bollerslev, T., Diebold, F. X., & Ebens, H. (2001). The distribution of realized stock return volatility. Journal of Financial Economics, 61(1), 43–76. [Google Scholar] [CrossRef]
  2. Andersen, T. G., Bollerslev, T., Diebold, F. X., & Labys, P. (2003). Modeling and forecasting realized volatility. Econometrica, 71(2), 579–625. [Google Scholar] [CrossRef]
  3. Asgharian, H., Hou, A. J., & Javed, F. (2013). The importance of macroeconomic variables in forecasting stock return variance: A GARCH-MIDAS approach. Journal of Forecasting, 32(7), 600–612. [Google Scholar] [CrossRef]
  4. Barndorff-Nielsen, O., & Shephard, N. (2004). Power and Bipower variation with stochastic volatility and jumps. Journal of Financial Econometrics, 2(1), 1–37. [Google Scholar] [CrossRef]
  5. Barndorff-Nielsen, O. E., & Shephard, N. (2002). Econometric analysis of realized volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society, Series B, 64(2), 253–280. Available online: https://www.jstor.org/stable/3088799 (accessed on 25 September 2025). [CrossRef]
  6. Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3), 307–327. [Google Scholar] [CrossRef]
  7. Borovykh, A., Bohte, S., & Oosterlee, C. W. (2017). Conditional time series forecasting with convolutional neural networks. arXiv, arXiv:1703.04691. [Google Scholar] [CrossRef]
  8. Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. Holden-Day. [Google Scholar]
  9. Brugiere, P., & Turinici, G. (2025). Transformer for time series: An application to the S&P500. In K. Arai (Ed.), Advances in information and communication (Vol. 1285). Springer. [Google Scholar] [CrossRef]
  10. Chun, D., Cho, H., & Ryu, D. (2025). Volatility forecasting and volatility-timing strategies: A machine learning approach. Research in International Business and Finance, 75, 102723. [Google Scholar] [CrossRef]
  11. Corsi, F. (2005). Measuring and modelling realized volatility: From tick-by-tick to long memory [Ph.D. thesis, Faculty of Economics, University of Lugano]. Available online: https://susi.usi.ch/usi/documents/317904 (accessed on 20 September 2025).
  12. Corsi, F. (2009). A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics, 7(2), 174–196. [Google Scholar] [CrossRef]
  13. Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263. [Google Scholar] [CrossRef]
  14. Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 987–1007. [Google Scholar] [CrossRef]
  15. Engle, R. F., Ghysels, E., & Sohn, B. (2013). Stock market volatility and macroeconomic fundamentals. Review of Economics and Statistics, 95(3), 776–797. Available online: http://www.jstor.org/stable/43554794 (accessed on 20 September 2025). [CrossRef]
  16. Ersin, Ö. Ö., & Bildirici, M. (2023). Financial volatility modeling with the GARCH-MIDAS-LSTM approach: The effects of economic expectations, geopolitical risks and industrial production during COVID-19. Mathematics, 11(8), 1785. [Google Scholar] [CrossRef]
  17. Ferreira, I. H., & Medeiros, M. C. (2021). Modeling and forecasting intraday market returns: A machine learning approach. arXiv, arXiv:2112.15108. [Google Scholar] [CrossRef]
  18. Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232. [Google Scholar] [CrossRef]
  19. Harikumar, Y., & Muthumeenakshi, M. (2025). An innovative study on stock price prediction for investment decision through ARIMA and LSTM with recurrent neural network. New Mathematics and Natural Computation, 21(3), 763–783. [Google Scholar] [CrossRef]
  20. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  21. Investing.com. (2025). Historical data for S&P 500, NASDAQ 100, and Dow Jones industrial average. Available online: https://www.investing.com/ (accessed on 5 September 2025).
  22. Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns: A new approach. Econometrica, 59(2), 347–370. [Google Scholar] [CrossRef]
  23. Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2023). A time series is worth 64 words: Long-term forecasting with transformers. arXiv, arXiv:2211.14730. [Google Scholar] [CrossRef]
  24. Parkinson, M. (1980). The extreme value method for estimating the variance of the rate of return. Journal of Business, 53(1), 61–65. Available online: https://www.jstor.org/stable/2352357 (accessed on 15 September 2025). [CrossRef]
  25. Patton, A. J. (2011). Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics, 160(1), 246–256. [Google Scholar] [CrossRef]
  26. Patton, A. J., & Sheppard, K. (2009). Optimal combinations of realised volatility estimators. International Journal of Forecasting, 25(2), 218–238. Available online: https://www.sciencedirect.com/science/article/pii/S0169207009000107#sec2 (accessed on 15 September 2025). [CrossRef]
  27. Patton, A. J., & Sheppard, K. (2015). Good volatility, bad volatility: Signed jumps and the persistence of volatility. Review of Economics and Statistics, 97(3), 683–697. [Google Scholar] [CrossRef]
  28. Shi, X. J., Chen, Z. R., Wang, H., Yeung, D. Y., Wong, W. K., & Woo, W. C. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28, 802–810. [Google Scholar] [CrossRef]
  29. Souto, H. G., & Moradi, A. (2024). Can transformers transform financial forecasting? China Finance Review International. ahead-of-print. [Google Scholar] [CrossRef]
  30. Tauchen, G., & Zhou, H. (2011). Realized jumps on financial markets and predicting credit spreads (FEDS Working Paper 2006-35). SSRN. [Google Scholar] [CrossRef]
  31. Taylor, S. J. (2005). Asset price dynamics, volatility, and prediction. Princeton University Press. [Google Scholar]
  32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention Is All You Need. arXiv, arXiv:1706.03762. [Google Scholar] [CrossRef]
  33. Virk, N., Javed, F., Awartani, B., & Hyde, S. (2024). A reality check on the GARCH-MIDAS volatility models. European Journal of Finance, 30(6), 575–596. [Google Scholar] [CrossRef]
  34. Yang, D., & Zhang, Q. (2000). Drift-independent volatility estimation based on high, low, open, and close prices. Journal of Business, 73(3), 477–491. [Google Scholar] [CrossRef]
  35. Zeng, Z., Kaur, R., Siddagangappa, S., Rahimi, S., Balch, T., & Veloso, M. (2023). Financial time series forecasting using CNN and transformer. arXiv, arXiv:2304.04912. [Google Scholar] [CrossRef]
  36. Zhang, L., Mykland, P. A., & Aït-Sahalia, Y. (2005). A tale of two time scales: Determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association, 100(472), 1394–1411. Available online: https://www.jstor.org/stable/27590680 (accessed on 20 September 2025).
  37. Zhang, Y., Zhang, T., & Hu, J. (2025). Forecasting stock market volatility using CNN-BiLSTM-attention model with mixed-frequency data. Mathematics, 13(11), 1889. [Google Scholar] [CrossRef]
  38. Zhang, Z., Chen, B., Zhu, S., & Langrené, N. (2025). Quantformer: From attention to profit with a quantitative transformer trading strategy. arXiv, arXiv:2404.00424. [Google Scholar] [CrossRef]
Figure 1. Framework of the comparative volatility forecasting architecture.
Figure 2. Forecast accuracy of classical volatility models based on mean errors (MAE, RMSE, QLIKE). Mean forecasting errors of ARIMA(1,0,1), GARCH(1,1), and HAR-RV models across RV estimators (Close, Parkinson, Yang–Zhang), averaged across indices and forecast horizons (h = 1, 5, 22). Lower bars indicate higher predictive accuracy. HAR-RV consistently achieves the lowest MAE and RMSE values, followed by GARCH(1,1), while ARIMA(1,0,1) performs weakest across all metrics. A full comparison of MAE, RMSE, and QLIKE for all classical models and all RV estimators is reported in Appendix A Table A1. Figures produced in Python 3.10 using Matplotlib 3.10.7.
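The three error metrics compared in Figure 2 can be written down compactly. The following is a minimal sketch (not the paper's exact code) of MAE, RMSE, and the QLIKE loss in the Patton (2011) form RV/F − log(RV/F) − 1, which is zero when the forecast F equals realized variance and penalizes under-prediction more heavily than over-prediction:

```python
import numpy as np

def qlike(rv, forecast):
    """QLIKE loss (Patton 2011 form): rv/f - log(rv/f) - 1.
    Zero at a perfect forecast; asymmetric, punishing under-prediction more."""
    r = np.asarray(rv, dtype=float) / np.asarray(forecast, dtype=float)
    return float(np.mean(r - np.log(r) - 1.0))

def rmse(rv, forecast):
    """Root mean squared error between realized variance and forecast."""
    return float(np.sqrt(np.mean((np.asarray(rv) - np.asarray(forecast)) ** 2)))

def mae(rv, forecast):
    """Mean absolute error between realized variance and forecast."""
    return float(np.mean(np.abs(np.asarray(rv) - np.asarray(forecast))))

rv = np.array([1.0, 2.0, 1.5])
perfect = qlike(rv, rv)              # 0.0 by construction
under = qlike(rv, 0.5 * rv)          # under-prediction: larger penalty
over = qlike(rv, 2.0 * rv)           # over-prediction: smaller penalty
```

The asymmetry is why QLIKE is preferred for volatility evaluation: under-forecasting variance is the economically costlier mistake in risk management.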
Figure 3. Diebold–Mariano p-values across forecast horizons for classical volatility models (ARIMA(1,0,1), HAR-RV, and GARCH(1,1)). Darker colors correspond to smaller p-values (approaching p ≈ 0), indicating stronger statistical significance, as shown by the continuous color scale. The detailed DM results for the classical models are reported in Appendix A Table A2. Figures produced in Python 3.10 using Matplotlib (version 3.10.7) and Seaborn (version 0.13.2) for uniform visualization.
Figure 4. Overfitting diagnostics of classical volatility models (ARIMA, HAR-RV). Panel (a) shows Train/Validation ratios across horizons (h = {1, 5, 22}), with boxplots centered near the 1.0 benchmark. Panel (b) tracks mean ratio trends; the red dashed line denotes the benchmark value of 1.0, indicating the ideal train/validation ratio. Panel (c) provides a heatmap of average values in the 0.17–0.28 range, indicating stable and conservative fits. Figures produced in Python 3.10 using Matplotlib (version 3.10.7) and Seaborn (version 0.13.2).
Figure 5. Overfitting diagnostics for the classical GARCH(1,1) model across U.S. equity indices (S&P 500, NASDAQ 100, DJIA). Each row corresponds to one index. The left panels display the evolution of estimated GARCH parameters (ω, α₁, β₁) under expanding-window re-estimation, capturing parameter stability over time. The red dashed horizontal line denotes the persistence boundary α₁ + β₁ = 1. The right panels compare Train and Test QLIKE losses, providing evidence on out-of-sample generalization performance. All indices satisfy α₁ + β₁ < 1 and exhibit stable generalization. Figures produced in Python 3.10 using Matplotlib (version 3.10.7).
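The persistence boundary in Figure 5 can be illustrated with a short simulation of the GARCH(1,1) recursion. The parameter values below (ω = 0.05, α₁ = 0.08, β₁ = 0.90, persistence 0.98) are hypothetical, chosen only to show why α₁ + β₁ < 1 matters: the unconditional variance ω/(1 − α₁ − β₁) is finite only inside the boundary.

```python
import numpy as np

def simulate_garch11(omega, alpha, beta, n, seed=42):
    """Simulate a GARCH(1,1) process:
    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}.
    Requires alpha + beta < 1 for covariance stationarity."""
    assert alpha + beta < 1, "persistence must stay below 1"
    rng = np.random.default_rng(seed)
    uncond = omega / (1.0 - alpha - beta)   # unconditional variance
    sigma2 = np.empty(n)
    eps = np.empty(n)
    sigma2[0] = uncond                      # start at the long-run level
    eps[0] = rng.normal(0.0, np.sqrt(sigma2[0]))
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = rng.normal(0.0, np.sqrt(sigma2[t]))
    return eps, sigma2

eps, sigma2 = simulate_garch11(omega=0.05, alpha=0.08, beta=0.90, n=20000)
# With these parameters the long-run variance is 0.05 / (1 - 0.98) = 2.5,
# and the simulated conditional variance fluctuates around that level.
```

As α₁ + β₁ approaches 1, volatility shocks decay ever more slowly, which is the regime the re-estimated parameter paths in Figure 5 are monitored against.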
Figure 6. Representative forecast panels from classical volatility models (ARIMA(1,0,1), GARCH(1,1), and HAR-RV) across selected U.S. indices. The blue curve denotes the realized (true) variance and serves as the benchmark series. HAR-RV exhibits the closest alignment with realized variance, followed by GARCH(1,1), while ARIMA(1,0,1) demonstrates weaker responsiveness to volatility dynamics. Numerical forecasting results corresponding to these panels are reported in Appendix A Table A1. Figures produced in Python 3.10 using Matplotlib (version 3.10.7).
Figure 7. Forecast accuracy of advanced models across horizons (h = 1, 5, 22). Transformer and PatchTST-lite architectures show the strongest overall accuracy and stability, while CNN-LSTM remains the least robust across horizons. All numerical forecast error values (MAE, RMSE, QLIKE) are reported in Appendix A Table A3, where results are provided for each index, RV estimator, and forecast horizon. Figures produced in Python 3.10 using Matplotlib (version 3.10.7) and Seaborn (version 0.13.2).
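PatchTST-style models such as the PatchTST-lite evaluated in Figure 7 first tokenize the look-back window into patches before applying attention (Nie et al., 2023). The numpy sketch below shows only that patching step; the patch_len and stride values are hypothetical, as the exact PatchTST-lite configuration is not restated here.

```python
import numpy as np

def patchify(series, patch_len=16, stride=8):
    """Split a 1-D look-back window into overlapping patches -- the
    tokenization step behind PatchTST-style transformer models."""
    series = np.asarray(series, dtype=float)
    n_patches = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len]
                     for i in range(n_patches)])

window = np.arange(120.0)      # 120-day lookback, as used in this study
patches = patchify(window)     # shape: (14, 16) for these patch settings
```

Each patch then becomes one attention token, so a 120-step window is compressed from 120 tokens to 14, which is the source of the efficiency gains attributed to patching.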
Figure 8. Pairwise DM p-values for advanced models across forecast horizons. Panels (a–c) correspond to h = 1, h = 22, and the average across horizons, respectively. Rows and columns denote the compared models, and each cell reports the corresponding DM p-value, with color intensity reflecting the magnitude of the p-value as indicated by the color scale. All numerical DM statistics and associated p-values underlying Figure 8 are reported in Appendix A Table A4. Figures produced in Python 3.10 using Seaborn (version 0.13.2) and Matplotlib (version 3.10.7).
Figure 9. Overfitting diagnostics across forecast horizons for deep volatility models. Panel (a) displays the mean Train/Test loss ratio with uncertainty bands, while Panels (bd) show representative learning curves for short (h = 1), medium (h = 5), and long (h = 22) horizons. Panels produced in Python 3.10 using Matplotlib (version 3.10.7) and Seaborn (version 0.13.2).
Figure 10. Forecasted log-variance dynamics across indices and horizons. The figure shows the predicted log-variance trajectories (log σ²) produced by different DL models (LSTM, CNN-LSTM, Transformer, and PatchTST-lite) compared with RV. Across all panels, the models capture the main cyclical patterns and volatility clusters, especially during major market events (e.g., 2020–2022). At the 1-day horizon (Panel (a)), forecasts closely track realized variance with minimal smoothing, while at 5- and 22-day horizons (Panels (b,c)), the predictions become smoother due to horizon-driven aggregation effects. All numerical forecast error values (MAE, RMSE, QLIKE) are reported in Appendix A Table A3, where results are provided for each index, RV estimator, and forecast horizon. Figures produced in Python 3.10 using Matplotlib (version 3.10.7).
Figure 11. SHAP-based explainability for volatility forecasting errors. Panels (A,B) show QLIKE: mean(SHAP) ranking and beeswarm summary. Panels (C,D) show DM: mean(SHAP) ranking and beeswarm summary. Across both metrics, RV_close emerges as the most influential factor, followed by ARIMA(1,0,1) and HAR-RV, while the horizon variable H has a smaller but steady effect. Index variables make negligible contributions. Warmer tones correspond to higher feature values; positive SHAP values indicate higher forecast error, whereas negative values reflect improvements in accuracy. Figures produced in Python 3.10 using the SHAP library (version 0.49.1) and Matplotlib (version 3.10.7).
Table 1. Summary of Related Studies on Volatility and Time-Series Forecasting Models.
Authors (Year) | Models | Main Findings
I. Forecasting-Oriented Studies
Ersin and Bildirici (2023) | GARCH-MIDAS-LSTM | Hybrid GARCH–LSTM improves long-horizon forecasts; nonlinear terms enhance accuracy
Asgharian et al. (2013) | GARCH-MIDAS | GARCH outperforms ARIMA; macro factors improve long-term prediction
Virk et al. (2024) | GARCH-MIDAS | Macro variables help long-run forecasts but may induce overfitting
Harikumar and Muthumeenakshi (2025) | ARIMA vs. LSTM | ARIMA may outperform LSTM short-term; trade-off between linearity and adaptability
Brugiere and Turinici (2025) | Transformer | Attention captures both short- and long-term patterns; superior to classical models
Souto and Moradi (2024) | Transformer variants (Informer, Autoformer, PatchTST) | Improved scalability and long-horizon accuracy
Z. Zhang et al. (2025) | Quantformer (Transformer + sentiment) | Integrates sentiment and investor behavior; best predictive accuracy
Zeng et al. (2023) | CNN–Transformer | Combines local and global patterns; outperforms ARIMA and DeepAR
Chun et al. (2025) | ML (RF, LSTM, GARCH, HAR-RV) | ML models outperform classical ones; strong volatility-timing ability
II. Economic Volatility and Realized-Volatility Measurement Studies
L. Zhang et al. (2005) | Multi-grid RV | Multi-grid RV reduces microstructure noise
Patton and Sheppard (2009) | Combined RV estimators | Averaging across estimators improves forecast accuracy
O. Barndorff-Nielsen and Shephard (2004) | Power and bipower variation | Jump-robust RV estimators
Tauchen and Zhou (2011) | Bipower variation | Detects volatility jumps; improves credit spread forecasts
Source: Author’s compilation.
Table 2. Data Collection and Preprocessing.
Preprocessing Step | Description of Procedure
1. Data source | Daily open-high-low-close (OHLC) prices for the S&P 500, NASDAQ 100, and DJIA were retrieved from Investing.com (2025) (https://www.investing.com, accessed on 5 September 2025) for the period 2000–2025 at a 1-day frequency.
2. Log-return calculation | Logarithmic returns computed from daily closing prices to capture continuous-compounding effects.
3. Realized variance (RV) | Constructed using three estimators: Close-to-Close, Parkinson (range-based), and Yang-Zhang (volatility with overnight adjustment).
4. Log transformation | Applied to RV, i.e., log(RV), to stabilize variance, mitigate skewness, and enhance model convergence.
5. Target variable | The dependent variable is log(RV_t) for all models and forecast horizons.
6. Forecast horizon | Three horizons: h = 1, 5, 22 days (representing daily, weekly, and monthly volatility forecasts).
7. Sample | 6450 daily observations per index, yielding ≈6330 supervised sequences using a 120-day lookback window.
Source: Author’s compilation.
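Steps 2–3 of Table 2 can be sketched directly from the cited definitions (Parkinson, 1980; Yang & Zhang, 2000). The code below is an illustrative implementation, not the paper's pipeline; the synthetic OHLC path and all numeric values are for demonstration only.

```python
import numpy as np

def close_to_close(close):
    """Sample variance of daily log returns (Close-to-Close RV proxy)."""
    r = np.diff(np.log(np.asarray(close, dtype=float)))
    return float(np.var(r, ddof=1))

def parkinson(high, low):
    """Parkinson (1980) range-based daily variance estimator."""
    hl = np.log(np.asarray(high, dtype=float) / np.asarray(low, dtype=float))
    return float(np.mean(hl ** 2) / (4.0 * np.log(2.0)))

def yang_zhang(o, h, l, c):
    """Yang-Zhang (2000) drift-independent estimator: overnight variance
    plus weighted open-to-close and Rogers-Satchell components."""
    o, h, l, c = (np.asarray(x, dtype=float) for x in (o, h, l, c))
    m = len(c) - 1                              # number of return observations
    overnight = np.log(o[1:] / c[:-1])          # close-to-open (gap) returns
    open_close = np.log(c[1:] / o[1:])          # open-to-close returns
    rs = np.mean(np.log(h[1:] / o[1:]) * np.log(h[1:] / c[1:])
                 + np.log(l[1:] / o[1:]) * np.log(l[1:] / c[1:]))
    k = 0.34 / (1.34 + (m + 1) / (m - 1))
    return float(np.var(overnight, ddof=1)
                 + k * np.var(open_close, ddof=1) + (1 - k) * rs)

# Synthetic OHLC path (no overnight gaps), purely for illustration
rng = np.random.default_rng(1)
r = rng.normal(0.0, 0.01, 2000)                 # daily log returns, var 1e-4
close = 100.0 * np.exp(np.cumsum(r))
open_ = np.concatenate(([100.0], close[:-1]))   # open at prior close
high = np.maximum(open_, close) * 1.001
low = np.minimum(open_, close) * 0.999

rv_cc = close_to_close(close)   # near the simulated daily variance of 1e-4
rv_pk = parkinson(high, low)
rv_yz = yang_zhang(open_, high, low, close)
```

In the study these daily estimates feed the log transformation of step 4, log(RV), before any model sees them.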
Table 3. Comparative Overview of Volatility Forecasting Models.
Model Group | Key Representative Models | Main Strengths | Main Limitations
Classical Econometric Models | ARIMA, GARCH, HAR-RV | Simple, interpretable, and theoretically grounded; capture volatility clustering and persistence with limited data. | Assume linearity; weak for nonlinear, asymmetric, or regime-shifting dynamics; limited long-memory adaptation.
DL Models | LSTM, CNN-LSTM | Capture nonlinear and long-term dependencies (LSTM) and localized patterns (CNN); flexible for multivariate inputs. | Require large datasets; risk of overfitting; high computational cost; less interpretable.
Transformer (TRF) Models | Transformer, PatchTST-lite | Model long-range dependencies efficiently; scalable, robust, and state-of-the-art for time-series forecasting. | Still emerging in volatility research; require tuning and substantial computation; limited interpretability (black-box).
Source: Author’s compilation.
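The HAR-RV specification listed in Table 3 regresses next-period RV on daily, weekly (5-day), and monthly (22-day) average RV (Corsi, 2009). The sketch below fits that regression by OLS on a toy persistent series — illustrative only, not the paper's estimation code.

```python
import numpy as np

def har_design(rv):
    """Build the HAR-RV design matrix: constant, daily RV, weekly (5-day)
    average, and monthly (22-day) average, each predicting next-day RV."""
    rv = np.asarray(rv, dtype=float)
    X, y = [], []
    for t in range(21, len(rv) - 1):
        X.append([1.0, rv[t], rv[t - 4:t + 1].mean(), rv[t - 21:t + 1].mean()])
        y.append(rv[t + 1])
    return np.array(X), np.array(y)

# Persistent toy "realized variance" series (AR(1)-style, illustrative only)
rng = np.random.default_rng(7)
rv = np.empty(1000)
rv[0] = 1.0
for t in range(1, 1000):
    rv[t] = 0.1 + 0.9 * rv[t - 1] + rng.normal(0.0, 0.05)

X, y = har_design(rv)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # [const, b_daily, b_weekly, b_monthly]
pred = X @ beta
```

The three averaging windows give HAR-RV its approximate long-memory behavior at essentially the cost of a linear regression, which is why it remains the strongest classical baseline in this study.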
Table 4. Subsample robustness: best model by regime (QLIKE loss).
Subsample | Best Classical (QLIKE) | Best DL (QLIKE)
Pre-GFC (2000–2006) | HAR-RV (low error) | LSTM/CNN-LSTM
GFC (2007–2009) | GARCH(1,1) | Transformer
Post-GFC (2010–2019) | HAR-RV | Transformer
COVID-19 and post-COVID-19 (2020–2025) | GARCH(1,1) | Transformer
The subsample rankings in Table 4 are derived directly from the QLIKE losses generated in Python for each model and each historical regime. For every subsample window (Pre-GFC, GFC, Post-GFC, COVID-19), the model with the lowest average out-of-sample QLIKE value was selected as the best-performing specification. Source: Author’s compilation.
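The selection rule described above — picking the specification with the lowest average out-of-sample QLIKE per regime window — reduces to a one-line comparison. The loss values below are hypothetical placeholders; the actual numbers come from the study's Python pipeline.

```python
def best_model(losses_by_model):
    """Return the model with the lowest average out-of-sample QLIKE,
    mirroring the per-regime selection rule behind Table 4."""
    return min(losses_by_model,
               key=lambda m: sum(losses_by_model[m]) / len(losses_by_model[m]))

# Hypothetical per-window QLIKE losses for one regime (illustrative only)
regime_losses = {
    "HAR-RV": [0.21, 0.25, 0.23],
    "GARCH(1,1)": [0.18, 0.22, 0.20],
    "ARIMA(1,0,1)": [0.30, 0.34, 0.31],
}
winner = best_model(regime_losses)  # "GARCH(1,1)" for these toy numbers
```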

Taneva-Angelova, G.; Granchev, D. Deep Learning and Transformer Architectures for Volatility Forecasting: Evidence from U.S. Equity Indices. J. Risk Financial Manag. 2025, 18, 685. https://doi.org/10.3390/jrfm18120685
