Article

Public-Data Causal Multiscale Wavelet Spillover Learning for Stock Index Volatility Forecasting and Risk Early Warning

1
Sino-European School of Technology, Shanghai University, Shanghai 200444, China
2
SHU-UTS SILC Business School, Shanghai University, Jiading District, Shanghai 201800, China
*
Author to whom correspondence should be addressed.
Risks 2026, 14(6), 129; https://doi.org/10.3390/risks14060129
Submission received: 23 April 2026 / Revised: 26 May 2026 / Accepted: 28 May 2026 / Published: 4 June 2026

Abstract

Accurate volatility forecasting and timely risk early warning are foundational requirements of financial risk management: Value-at-Risk estimates, portfolio risk limits, derivative hedging ratios, and stress-test scenario calibrations all depend on forward-looking volatility signals that remain reliable when market conditions depart from average. This paper develops a public-data causal multiscale wavelet spillover learning (CMWSL) framework that jointly addresses stock-index volatility forecasting and high-volatility early warning under strict walk-forward evaluation. CMWSL integrates three components: a heterogeneous autoregressive (HAR) persistence block as the dominant linear baseline, causal stationary wavelet transform (SWT) summaries that encode within-index multiscale market dynamics, and a cross-index spillover layer that tests whether medium- and long-scale wavelet energy from peer indices carries incremental risk-relevant information. The empirical analysis covers the S&P 500, Nasdaq-100, and Dow Jones Industrial Average over a 2513-step out-of-sample evaluation period from 2016 to 2025, with forecast horizons $h \in \{1, 5, 10\}$ and OHLC-based volatility targets. All preprocessing, wavelet decomposition, calibration rules, and warning thresholds are re-estimated inside each rolling training window to eliminate look-ahead bias. HAR remains the strongest average model in the main Rogers–Satchell specification, confirming that daily index volatility risk is highly persistence-driven. The multiscale extension delivers statistically significant improvements at longer horizons, in richer public macro-financial information environments, and under the Parkinson target. Clark–West tests detect significant spillover gains in five of nine index–horizon cells ($\mathrm{CW} = 4.83$, $p < 0.001$ for the S&P 500 at $h = 10$).
Critically, tail-conditioned and rolling-window diagnostics show that multiscale and cross-index gains concentrate in upper-volatility regimes and synchronized stress episodes—precisely the conditions in which risk management decisions are most consequential. For market-risk early warning, a logistic classifier built on the same causal feature pipeline delivers the most stable precision–recall performance across all settings, providing an interpretable and operationally auditable alert mechanism suitable for practical risk monitoring.

1. Introduction

Volatility forecasting is a foundational input to financial risk management. Forward-looking volatility estimates are directly embedded in Value-at-Risk calculations, portfolio risk-limit setting, derivative pricing and dynamic hedging, margin and collateral protocols, and stress-test scenario calibration. The operational challenge is that daily stock-index volatility is generated by several concurrent mechanisms: strong short-run persistence, medium-horizon adjustment to macro-financial conditions, and episodic synchronized stress propagation across markets. A practical risk-monitoring system must therefore do more than fit average conditions. It must remain strictly causal to preserve the integrity of real-time risk assessments, beat credible persistence-based benchmarks to justify adoption costs, and continue to generate reliable warning signals precisely when market conditions become most unstable—and when the consequences of a missed or false alert are most severe (Corsi 2008; Song et al. 2022; Zeng et al. 2024).
This requirement immediately creates a tension. On the one hand, parsimonious linear models such as HAR remain surprisingly strong in daily volatility forecasting, especially at the one-day horizon (Branco et al. 2024; Corsi 2008). On the other hand, a purely linear persistence view is too narrow to describe the full structure of market risk. Volatility clustering is not generated at a single frequency, implied-volatility measures and technical indicators interact nonlinearly, and stress episodes often unfold through cross-market transmission rather than through one index in isolation. The practical question is therefore not whether nonlinear models can be made more complex than HAR. It is whether a carefully designed multiscale representation can extract additional predictive information without sacrificing causality, interpretability, or reproducibility.
Three literature streams motivate this question. First, decomposition-based studies show that volatility prediction can benefit from multiscale analysis when signals evolve at heterogeneous frequencies (Ma and Zhang 2025; Souropanis and Vivian 2023; Su et al. 2025; Zhuo and Morimoto 2024). Second, machine-learning studies show that nonlinear learners can exploit richer public information sets, including implied volatility, technical signals, and macro-financial factors (Chun et al. 2025; Díaz et al. 2024; Zhang et al. 2023). Third, spillover studies document that market stress propagates across indices, although that propagation is usually modelled in the time domain rather than as explicit frequency-specific transmission (Son et al. 2023). The literature on volatility warning and extreme-risk identification adds a fourth strand by showing that forecasting and alert generation should be connected more tightly than they often are in practice (Purnell et al. 2024; Ran et al. 2024; Ren et al. 2024).
Despite this progress, four gaps remain. First, many studies benchmark against weak baselines, which makes it difficult to determine whether the proposed method adds value beyond the persistence structure already captured by HAR. Second, wavelet features are usually extracted within one index only; relatively little evidence is available on whether peer-index wavelet components carry incremental predictive content at specific frequencies. Third, the regression task of volatility forecasting and the classification task of risk warning are commonly treated as separate pipelines even though they rely on the same chronology and largely the same public information; this separation can create internally inconsistent risk systems in which the forecast and the alert do not operate on the same information clock. Fourth, reproducibility is often limited by proprietary data, ambiguous release timing for predictors, or full-sample preprocessing that quietly introduces look-ahead bias.
This paper addresses those gaps by proposing a public-data causal multiscale wavelet spillover learning (CMWSL) framework for stock-index volatility forecasting and risk early warning. CMWSL is built around three linked components. The first is a HAR persistence block that preserves the strongest linear benchmark in the literature. The second is a causal stationary wavelet transform (SWT) representation that summarizes within-index short-, medium-, and longer-run dynamics using rolling scale-wise statistics. The third is a spillover layer that appends medium- and long-scale wavelet energy from peer indices in order to test whether cross-index information enters through specific frequency bands rather than only through contemporaneous correlations.
The central research question is deliberately narrow and testable: when do multiscale features and peer-index wavelet spillovers add value beyond persistence in stock-index volatility forecasting? The paper does not seek to prove blanket dominance of a new model class. Instead, it asks whether the incremental value of multiscale learning is conditional on forecast horizon, target construction, information richness, and market state. This framing is more appropriate for risk management than a simple average-score horse race, because the operational value of a model often appears precisely when markets depart from their typical regime.
The article formalizes that question through the following hypothesis:
Cross-Index Wavelet Spillover Hypothesis. Medium- and long-scale wavelet detail components (levels d2 and d3) of peer equity indices contain incremental predictive information for a target index’s future average volatility, beyond the information contained in the target index’s own persistence terms and within-index multiscale representation. This incremental value is state-dependent and horizon-dependent, and can be identified through nested out-of-sample forecast comparison.
To test this hypothesis, the empirical design uses a three-step controlled ablation: HAR as the persistence benchmark, wavelet_lightgbm as the within-index multiscale extension, and spillover_lightgbm as the same model further augmented with peer-index d2/d3 wavelet energy features. The study relies only on public daily data for the S&P 500, Nasdaq-100, and Dow Jones Industrial Average, and evaluates all models under a 2016–2025 walk-forward protocol with strict causal feature construction. In addition to the regression task, the same pipeline is used to generate high-volatility warning labels and probability-based alerts, allowing forecast quality and warning quality to be studied within one coherent framework (Purnell et al. 2024; Ran et al. 2024; Ren et al. 2024).

Related Literature and Research Positioning

The paper sits at the intersection of four literature streams. First, persistence-based volatility models establish that any serious forecasting framework must compete against parsimonious linear benchmarks rather than circumvent them (Branco et al. 2024; Corsi 2008; Song et al. 2022). Second, wavelet and decomposition studies show that scale-specific summaries can improve volatility prediction when the data-generating process contains heterogeneous temporal frequencies (Ma and Zhang 2025; Souropanis and Vivian 2023; Su et al. 2025; Zhuo and Morimoto 2024), but most of this work remains within-market. Third, machine-learning and graph-based studies model cross-market dependence and nonlinear interaction structures (Chun et al. 2025; Díaz et al. 2024; Son et al. 2023; Song et al. 2023), though mainly in the time domain rather than through explicit frequency-specific spillover channels. Fourth, early-warning studies translate continuous volatility and stress signals into actionable risk states (Purnell et al. 2024; Ran et al. 2024; Ren et al. 2024), but often as a downstream add-on rather than as part of the main forecasting design. Recent review evidence suggests that multi-frequency and cross-market synthesis remains a live research gap (Leushuis and Petkov 2026). Table 1 positions the present study relative to these strands.
Relative to this literature, the contribution of the paper is fourfold.
  • It proposes an integrated causal pipeline for market risk management that links volatility forecasting and high-volatility early warning within a single public-data, walk-forward framework—eliminating the information gap that typically separates risk quantification from risk alert generation.
  • It formalizes cross-index wavelet spillover as a nested incremental-information problem, enabling a traceable assessment of which frequency-specific cross-market channels contribute to portfolio risk exposure beyond local persistence.
  • It evaluates model performance through risk-management-relevant diagnostics—tail-conditioned, market-state-conditioned, rolling-window, and stress-event-based decompositions—that reveal exactly when multiscale representation delivers actionable improvements over linear benchmarks.
  • It delivers the empirical study as reproducible risk-forecasting research built entirely on public market and macro-financial data, with deterministic scripts and reviewer-ready experiment artifacts consistent with a transparent, auditable risk-model evaluation standard.
These contributions imply a deliberately disciplined empirical claim. The article does not argue that nonlinear multiscale learning should replace HAR everywhere. Instead, it argues that persistence remains the correct baseline at short horizons, while causal wavelet representation and peer-index spillover provide useful incremental information when the forecasting problem becomes less myopic, the information environment becomes richer, and market stress becomes more synchronized. That is precisely the regime in which practical risk monitoring is most demanding—and the setting to which this paper’s contributions are directly addressed.
The remainder of the paper is organized as follows. Section 2 describes the market sample, target construction, and predictor blocks. Section 3 presents the CMWSL framework. Section 4 details the walk-forward evaluation protocol and ablation design. Section 5 reports the empirical evidence. Section 6 interprets the findings, discusses risk management implications, and notes limitations; Section 7 concludes.

2. Data

2.1. Market Sample and Public Data Design

The empirical design is built around one principle: every variable used in the forecasting and warning system must be publicly accessible, chronologically alignable, and reproducible without proprietary terminals or vendor-specific feeds. This requirement is not cosmetic. In volatility forecasting, reported gains can be materially affected by hidden differences in data timing, revision handling, and sample construction. A public-data design therefore sharpens both the scientific and practical value of the study.
The market sample contains the S&P 500, Nasdaq-100, and Dow Jones Industrial Average. These three indices provide a compact but informative cross-section for the problem studied here. The S&P 500 serves as the broad U.S. market benchmark; the Nasdaq-100 represents a technology- and growth-concentrated segment with distinct volatility dynamics; and the Dow Jones Industrial Average provides a price-weighted blue-chip benchmark. Together they form a natural setting for testing both within-index multiscale dynamics and cross-index spillover. Daily OHLC prices are collected from Stooq. Public macro-financial predictors, implied-volatility proxies, rates, term spreads, financial conditions, and recession indicators are collected from FRED. ETF-based trading-volume proxies are drawn from SPY, QQQ, and DIA. The raw-data snapshot used by the experimental configuration is fixed to 17 March 2026 so that the reported results correspond to a stable and reproducible input state rather than a moving download target.
Table 2 summarizes the public data blocks used in the forecasting and warning pipeline.
The primary sample spans 25 February 2005 to 31 December 2025. This starting point balances three requirements: long enough history for rolling estimation and multiscale decomposition, comparable availability across the three indices, and consistent access to ETF-based liquidity proxies. The out-of-sample forecasting period begins on 4 January 2016, leaving a sufficiently long pre-2016 history for training and feature construction. All predictors are aligned to index trading days, and non-trading days are excluded from the panel. Daily and weekly macro-financial series are forward-filled only after they become historically observable. Monthly variables with ambiguous release timing are excluded from the main benchmark specification to prevent timing disputes in a causal design.

2.2. Volatility Target Construction

The target of interest is the future average stock-index volatility constructed from daily OHLC data. The main specification uses the Rogers–Satchell estimator because it retains range information while remaining more robust than close-to-close volatility under nonzero drift (Rogers and Satchell 1991). For each forecast horizon $h \in \{1, 5, 10\}$, the target is defined as the average volatility proxy over the next $h$ trading days. This choice aligns the regression task with forward-looking risk monitoring rather than same-day measurement.
The target is modeled on the $\log(1 + y)$ scale with a small positive constant for numerical stability and then transformed back to the original scale for evaluation. This preserves positivity while limiting the leverage of exceptionally large observations. The main target is complemented by a robustness specification based on the Parkinson estimator (Parkinson 1980). Using both Rogers–Satchell and Parkinson is substantively useful. If the proposed framework improves only under one narrowly chosen volatility proxy, its contribution would be weak. If its relative advantage persists when the target emphasizes high–low range information differently, that is stronger evidence that the gain comes from representation quality rather than from target-specific tuning. The classical OHLC estimators of Parkinson, Garman–Klass, Rogers–Satchell, and Yang–Zhang provide the statistical background for this design (Andersen et al. 2003; Garman and Klass 1980; Parkinson 1980; Rogers and Satchell 1991; Yang and Zhang 2000).

2.3. Predictor Architecture

The predictor set is organized into four blocks so that the empirical design can separate persistence, raw market information, public exogenous risk information, and multiscale representation.
The first block contains persistence features. These include the current volatility proxy and HAR-style aggregates over the last 1, 5, and 22 trading days. This block is essential because short-run volatility forecasting is dominated by persistence, and any richer model must prove that it adds value beyond this structure rather than simply re-encoding it.
The second block contains raw market-transition and technical signals. These include close-to-close returns, intraday returns, overnight gaps, RSI, MACD, ATR, and short-window stress summaries computed over 2-, 3-, and 5-day windows. This block is designed to capture local market transitions that may not be fully represented by the volatility target alone.
The third block contains public exogenous risk variables. These include implied-volatility series, interest-rate and spread variables, financial-conditions indicators, and a recession indicator. In the expanded-data robustness setting, this block is enlarged with credit-spread, policy-uncertainty, and volatility-term-structure variables to test whether the proposed framework benefits from a richer risk-information environment rather than only from a carefully limited predictor set.
The fourth block contains causal multiscale wavelet summaries. These are extracted from the target volatility series, absolute returns, and the index-specific implied-volatility proxy. This block is where the CMWSL framework departs most clearly from standard tabular forecasting. Instead of treating raw lags as the only temporal representation, it summarizes the recent scale-wise behaviour of volatility-related signals through causal wavelet coefficients and energy statistics. The modular block design is deliberate: it allows the empirical analysis to distinguish baseline persistence, conventional public information, and the incremental value of multiscale representation.
Two additional design choices deserve emphasis. First, the main specification uses only variables whose timing can be handled cleanly in a daily walk-forward setting; this reduces the scope for hidden look-ahead effects. Second, the robustness extensions are intentionally challenging: a broader public-information panel and an alternative OHLC target both raise the standard for claiming that multiscale learning provides genuine incremental value.

3. Methodology

This section presents the proposed causal multiscale wavelet spillover learning (CMWSL) framework. CMWSL is designed around a simple but demanding principle: any improvement over a strong volatility benchmark should be attributable to a clearly defined source of information rather than to opaque modelling complexity. For that reason, the framework is organized as a layered system. A persistence layer preserves the HAR benchmark structure. A multiscale layer extracts causal wavelet summaries from volatility-related input series. A spillover layer adds peer-index frequency-domain information. Finally, a warning layer translates the same causal feature space into high-volatility event probabilities. The full system is therefore designed for attribution as much as for prediction. To keep the exposition readable for finance-oriented readers, the main text emphasizes the economic role of each layer, while implementation-specific detail is summarized after the notation table and in Appendix A.

3.1. Problem Setup

CMWSL addresses two linked tasks: stock-index volatility forecasting as a regression problem and risk early warning as a classification problem. Let $y_{i,t}$ denote the observed daily volatility proxy for index $i$ on day $t$. For a forecast horizon $h$, the forward-looking target is defined as
$$y_{i,t}^{(h)} = \frac{1}{h} \sum_{j=1}^{h} y_{i,t+j},$$
so that the model predicts future average volatility rather than an in-sample quantity. This choice makes the regression task directly relevant to market-risk monitoring, where the object of interest is the volatility expected over the next few trading days rather than the volatility realized today.
Table 3 summarizes the main notation used in the paper.
For estimation, the target is transformed as
$$\bar{y}_{i,t}^{(h)} = \log\left(1 + y_{i,t}^{(h)} + \varepsilon\right),$$
where $\varepsilon > 0$ is a small numerical constant. The back-transformed forecast is
$$\hat{y}_{i,t}^{(h)} = \exp\left(\hat{\bar{y}}_{i,t}^{(h)}\right) - 1 - \varepsilon.$$
The transformation preserves positivity and reduces the influence of extreme observations, which is useful when tree-based models are optimized under rolling samples containing both tranquil and crisis periods.
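The target construction and its log-scale round trip can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names and the value of `EPS` are assumptions, since the text only states that the constant is small and positive.

```python
import numpy as np

EPS = 1e-8  # assumed value; the text only says epsilon is small and positive


def forward_average_target(y, h):
    """Mean of y[t+1], ..., y[t+h] for each t; NaN where the future window is incomplete."""
    y = np.asarray(y, dtype=float)
    out = np.full(y.shape, np.nan)
    csum = np.concatenate(([0.0], np.cumsum(y)))
    for t in range(len(y) - h):
        out[t] = (csum[t + h + 1] - csum[t + 1]) / h
    return out


def to_log_scale(target, eps=EPS):
    """Estimation-scale transform log(1 + y + eps)."""
    return np.log(1.0 + target + eps)


def from_log_scale(pred, eps=EPS):
    """Inverse transform exp(pred) - 1 - eps, returning to the volatility scale."""
    return np.exp(pred) - 1.0 - eps


y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
target = forward_average_target(y, h=2)  # last two entries are NaN: no full future window
```

The trailing NaN entries matter in a walk-forward setting: rows whose future window is not yet complete at the forecast origin cannot be used for training.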
The main empirical target is the Rogers–Satchell volatility proxy,
$$y_{i,t}^{RS} = \log\frac{H_{i,t}}{C_{i,t}} \, \log\frac{H_{i,t}}{O_{i,t}} + \log\frac{L_{i,t}}{C_{i,t}} \, \log\frac{L_{i,t}}{O_{i,t}},$$
where $O_{i,t}$, $H_{i,t}$, $L_{i,t}$, and $C_{i,t}$ denote open, high, low, and close prices, respectively. The robustness target is the Parkinson estimator,
$$y_{i,t}^{P} = \frac{1}{4 \log 2} \left(\log\frac{H_{i,t}}{L_{i,t}}\right)^{2}.$$
Both targets exploit OHLC information more efficiently than pure close-to-close variance, while remaining compatible with a positive-target forecasting framework.
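As a concrete reading of the two OHLC proxies, the sketch below computes both from price arrays. The function names are illustrative, and the formulas are the standard variance-scale forms of the Rogers–Satchell and Parkinson estimators; whether the paper applies a further square root is not stated here, so none is applied.

```python
import numpy as np


def rogers_satchell(o, h, l, c):
    """Rogers-Satchell daily proxy: log(H/C) log(H/O) + log(L/C) log(L/O)."""
    o, h, l, c = (np.asarray(a, dtype=float) for a in (o, h, l, c))
    return np.log(h / c) * np.log(h / o) + np.log(l / c) * np.log(l / o)


def parkinson(h, l):
    """Parkinson daily proxy: log(H/L)^2 / (4 log 2)."""
    h, l = np.asarray(h, dtype=float), np.asarray(l, dtype=float)
    return np.log(h / l) ** 2 / (4.0 * np.log(2.0))


# One illustrative bar (made-up prices): open 100, high 110, low 95, close 105.
rs = rogers_satchell([100.0], [110.0], [95.0], [105.0])
pk = parkinson([110.0], [95.0])
```

Both quantities are nonnegative whenever the high and low bracket the open and close, which is what makes them compatible with a positive-target framework.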

3.2. Within-Index Causal Multiscale Representation

The key modelling premise is that stock-index volatility is generated by heterogeneous components evolving at different temporal scales. A single lag structure is therefore unlikely to capture the whole signal. CMWSL addresses this by applying a causal non-decimated stationary wavelet transform (SWT) to selected input series (Percival and Walden 2000). The SWT (also known as the à trous or undecimated wavelet transform) yields translation-invariant coefficient sequences $\{d_t^{(j)}\}$ and $\{a_t^{(J)}\}$ at each scale $j$. Crucially, the causal implementation adopted here computes all coefficients using only observations $\{s_\tau : \tau \le t\}$, achieved by applying one-sided filter banks so that no future information enters any coefficient. The non-decimated structure preserves the original sampling rate at every scale, allowing direct alignment between wavelet features and future volatility targets without interpolation or re-indexing. This is a necessary property for a risk model that must be operationally deployable in real time.
The implementation uses a Symlet-4 basis with $J = 3$ decomposition levels. Wavelet features are computed only after a minimum history of 128 observations, using at most the most recent $N = 256$ observations at each forecast origin. Symlet-4 is chosen because it offers near-symmetry, limited phase distortion, and a support length short enough for causal rolling application while retaining enough vanishing moments to capture smooth local structure (Percival and Walden 2000; Souropanis and Vivian 2023). Three decomposition levels are used because they correspond approximately to 2-, 4-, and 8-day scales, which map naturally onto the daily, weekly, and biweekly persistence structure underlying HAR-type volatility dynamics (Andersen et al. 2003; Corsi 2008).
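To make the causality requirement concrete, the following sketch implements a one-sided à trous decomposition. It is not the paper's implementation: for brevity it swaps in a two-tap Haar filter instead of Symlet-4, and the clamped left boundary is an assumption. What it does illustrate is the property the text relies on, namely that every coefficient at time t depends only on observations up to t.

```python
import numpy as np


def causal_atrous_haar(s, levels=3):
    """One-sided a trous decomposition with a Haar smoother (illustrative stand-in for Symlet-4).

    Returns (approximation at the coarsest level, [d1, ..., dJ]).
    Each output at index t uses only inputs at indices <= t.
    """
    s = np.asarray(s, dtype=float)
    idx = np.arange(len(s))
    approx = s.copy()
    details = []
    for j in range(1, levels + 1):
        lag = 2 ** (j - 1)                           # filter dilation at level j
        lagged = approx[np.maximum(idx - lag, 0)]    # clamp at the left edge (assumed boundary rule)
        smoother = 0.5 * (approx + lagged)           # backward-looking average
        details.append(approx - smoother)            # detail = what the smoother removed
        approx = smoother
    return approx, details


rng = np.random.default_rng(0)
s = rng.normal(size=64)
a, d = causal_atrous_haar(s, levels=3)

# Perturbing observations from index 40 onward must leave earlier coefficients unchanged.
s_future = s.copy()
s_future[40:] += 10.0
a2, d2 = causal_atrous_haar(s_future, levels=3)
```

Because each detail series is defined as the difference between successive smooths, the approximation and details add back to the input exactly, and the series keep the original sampling rate at every level.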
For a univariate input series $s_t$, the multi-resolution representation is
$$s_t = a_t^{(J)} + \sum_{j=1}^{J} d_t^{(j)},$$
where $a_t^{(J)}$ is the smooth approximation at level $J$ and $d_t^{(j)}$ is the detail component at scale $j$. CMWSL does not feed raw wavelet coefficients directly into the learner. Instead, it constructs a compact summary representation from the recent history of each coefficient sequence. For scale $j$ and rolling window $w$, the main summary statistics are
$$\mu_{t,w}^{(j)} = \frac{1}{w} \sum_{\ell=0}^{w-1} d_{t-\ell}^{(j)}, \qquad \sigma_{t,w}^{(j)} = \sqrt{\frac{1}{w} \sum_{\ell=0}^{w-1} \left(d_{t-\ell}^{(j)} - \mu_{t,w}^{(j)}\right)^{2}},$$
and
$$E_{t,w}^{(j)} = \frac{1}{w} \sum_{\ell=0}^{w-1} \left(d_{t-\ell}^{(j)}\right)^{2}.$$
These statistics summarize local level, dispersion, and energy at each scale. In practice, this is preferable to using raw coefficients because it produces a stable, low-dimensional, and interpretable feature block suitable for rolling estimation.
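A minimal sketch of the three rolling summaries for one detail series follows. The function name and the toy input are illustrative, and the dispersion is computed in its population form (dividing by the window length), which is an assumption about the normalization.

```python
import numpy as np


def rolling_scale_summaries(d, w):
    """Trailing mean, dispersion, and energy of a detail series over the last w points.

    Entries before a full window is available are NaN.
    """
    d = np.asarray(d, dtype=float)
    mean = np.full(len(d), np.nan)
    std = np.full(len(d), np.nan)
    energy = np.full(len(d), np.nan)
    for t in range(w - 1, len(d)):
        window = d[t - w + 1 : t + 1]       # observations t-w+1, ..., t only
        mean[t] = window.mean()
        std[t] = window.std()               # population form, divides by w
        energy[t] = np.mean(window ** 2)
    return mean, std, energy


mu, sigma, energy = rolling_scale_summaries([1.0, -1.0, 1.0, -1.0], w=2)
```

With this normalization the three statistics are linked by energy = dispersion² + mean², so energy carries the combined level and variability of a scale in a single number.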
Let the full predictor vector for index i at time t be
$$\mathbf{x}_{i,t} = \left[\mathbf{x}_{i,t}^{HAR}, \; \mathbf{x}_{i,t}^{tech}, \; \mathbf{x}_{i,t}^{macro}, \; \mathbf{x}_{i,t}^{wav}\right],$$
where $\mathbf{x}_{i,t}^{HAR}$ contains persistence terms, $\mathbf{x}_{i,t}^{tech}$ contains return-transition and technical-indicator features, $\mathbf{x}_{i,t}^{macro}$ contains public exogenous risk variables, and $\mathbf{x}_{i,t}^{wav}$ contains the SWT summary block. This flat, interpretable feature vector is the core within-index feature space used by wavelet_lightgbm.

3.3. Cross-Index Wavelet Spillover Features

The within-index wavelet representation is informative, but it treats each index as if its volatility dynamics were self-contained. That assumption is often too strong in market-risk applications, especially during system-wide stress. CMWSL therefore extends the within-index representation with a cross-index spillover layer built from peer-market wavelet energy.
Let $\mathcal{I}$ be the set of indices under study, $\mathcal{I} = \{\mathrm{DJIA}, \mathrm{NDQ}, \mathrm{SPX}\}$ in the main experiment. For each source index $k \neq i$, let $d_{k,t}^{(j)}$ denote the scale-$j$ wavelet detail coefficient extracted from source $k$'s time series. The spillover feature passed to the model for target index $i$ is defined as
$$\phi_{k \to i,\, t,\, w}^{(j)} = E_{k,t,w}^{(j)} = \frac{1}{w} \sum_{\ell=0}^{w-1} \left(d_{k,t-\ell}^{(j)}\right)^{2},$$
where $j \in \{2, 3\}$ and $w$ belongs to the same set of rolling windows used for within-index wavelet summaries. The focus on levels $j = 2$ and $j = 3$ is deliberate. These correspond roughly to weekly and biweekly-to-monthly fluctuations, the scales at which slower cross-market transmission is more plausible than at the highest-frequency noise level (Son et al. 2023; Souropanis and Vivian 2023).
The spillover block is constructed from three source series for each peer index: the realized volatility proxy $y_{k,t}^{RS}$, the absolute close-to-close return $|r_{k,t}^{cc}|$, and the index-specific implied-volatility proxy. After strict causal alignment, meaning that only information observed by time $t$ is used to forecast $y_{i,t}^{(h)}$, the empirical implementation appends 126 distinct spillover columns to the panel. An active-column filter is applied in each rolling training window, retaining only columns with sufficient non-missing coverage. This avoids self-reference artifacts and prevents the effective predictor dimension from being inflated by structurally empty variables.
The extended predictor vector for the spillover model is therefore
$$\mathbf{x}_{i,t}^{spill} = \left[\mathbf{x}_{i,t}^{HAR}, \; \mathbf{x}_{i,t}^{tech}, \; \mathbf{x}_{i,t}^{macro}, \; \mathbf{x}_{i,t}^{wav}, \; \boldsymbol{\phi}_{k \to i,\, t}\right],$$
where $\boldsymbol{\phi}_{k \to i,\, t}$ collects all active peer-index spillover features. Here and throughout, bold symbols denote vectors. The critical identification feature of the empirical design is that spillover_lightgbm is otherwise identical to wavelet_lightgbm; any difference in out-of-sample performance is therefore attributable to the added spillover block rather than to a different learner or different hyperparameters.
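The assembly of the spillover block can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single source series per peer rather than the three described above, and the column naming scheme, the window set `(5, 22)`, and the coverage threshold `0.9` are all hypothetical choices, not values taken from the paper.

```python
import numpy as np


def rolling_energy(d, w):
    """Trailing mean of squared coefficients over the last w observations."""
    d = np.asarray(d, dtype=float)
    out = np.full(len(d), np.nan)
    for t in range(w - 1, len(d)):
        out[t] = np.mean(d[t - w + 1 : t + 1] ** 2)
    return out


def spillover_block(details, target, levels=(2, 3), windows=(5, 22)):
    """Peer-index energy columns for one target index.

    details[k][j] holds the level-j detail series of source index k.
    """
    cols = {}
    for k in sorted(details):
        if k == target:  # peers only; the target's own scales live in the within-index block
            continue
        for j in levels:
            for w in windows:
                cols[f"spill_{k}_d{j}_w{w}"] = rolling_energy(details[k][j], w)
    return cols


def active_columns(cols, train_rows, min_coverage=0.9):
    """Names of columns with enough non-missing values inside the training window."""
    return [name for name, v in cols.items()
            if np.mean(~np.isnan(v[train_rows])) >= min_coverage]


rng = np.random.default_rng(1)
details = {k: {j: rng.normal(size=100) for j in (1, 2, 3)} for k in ("DJIA", "NDQ", "SPX")}
cols = spillover_block(details, target="SPX")
active = active_columns(cols, train_rows=slice(30, 100))
```

Excluding the target from its own spillover block keeps the nested comparison clean: the only difference between the two LightGBM variants is information that originates in peer indices.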

3.4. Forecasting Layer

The forecasting layer combines a strong persistence benchmark with nonlinear interaction learning. The HAR block is defined as
$$\mathbf{x}_{i,t}^{HAR} = \left[y_{i,t}, \; \frac{1}{5} \sum_{\ell=0}^{4} y_{i,t-\ell}, \; \frac{1}{22} \sum_{\ell=0}^{21} y_{i,t-\ell}\right],$$
which preserves the benchmark view that volatility persistence is heterogeneous across daily, weekly, and monthly horizons (Corsi 2008). Short-run transition features are defined from OHLC prices as
$$r_{i,t}^{cc} = \log C_{i,t} - \log C_{i,t-1}, \qquad r_{i,t}^{oc} = \log C_{i,t} - \log O_{i,t},$$
$$g_{i,t}^{co} = \log O_{i,t} - \log C_{i,t-1}.$$
These terms capture close-to-close movement, intraday dynamics, and overnight gaps, all of which are relevant for daily index-volatility formation.
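The persistence block and the three transition features translate directly into array operations. The sketch below is illustrative only; the function names and the toy prices are assumptions.

```python
import numpy as np


def har_block(y):
    """Daily value plus trailing 5-day and 22-day means; rows 0..20 are NaN."""
    y = np.asarray(y, dtype=float)
    out = np.full((len(y), 3), np.nan)
    for t in range(21, len(y)):
        out[t] = (y[t], y[t - 4 : t + 1].mean(), y[t - 21 : t + 1].mean())
    return out


def transition_features(open_, close):
    """Close-to-close return, open-to-close return, and overnight gap in logs."""
    log_o = np.log(np.asarray(open_, dtype=float))
    log_c = np.log(np.asarray(close, dtype=float))
    r_cc = np.full(len(log_c), np.nan)
    g_co = np.full(len(log_c), np.nan)
    r_cc[1:] = log_c[1:] - log_c[:-1]   # close-to-close movement
    g_co[1:] = log_o[1:] - log_c[:-1]   # overnight gap
    r_oc = log_c - log_o                # intraday movement
    return r_cc, r_oc, g_co


har = har_block(np.arange(30.0))
r_cc, r_oc, g_co = transition_features([100.0, 102.0, 101.0], [101.0, 103.0, 100.0])
```

By construction the close-to-close return is the sum of the overnight gap and the intraday return, so the three features split each daily move into its overnight and trading-session parts.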
At the forecasting stage, a horizon-specific LightGBM model is fitted on the log-transformed training target over the rolling window and used to generate a one-step-ahead out-of-sample forecast. The spillover variant uses the same procedure with the spillover-augmented feature vector.
LightGBM (Ke et al. 2017) is used as the main nonlinear learner for three reasons. First, the rolling protocol requires hundreds of repeated refits, so computationally light models are preferred to heavyweight sequence models. Second, gradient-boosted trees handle mixed feature blocks and nonlinear interactions without requiring feature rescaling assumptions. Third, split-based feature attribution and SHAP analysis remain straightforward, which is important because the contribution of the paper depends on identifying what kind of information drives forecast gains rather than only on reporting a rank order of models.
Because volatility targets are strictly nonnegative and QLIKE is sensitive to unrealistically small predictions, the framework applies a horizon-specific floor calibration:
$$\tilde{y}_{i,t}^{(h)} = \max\left(\hat{y}_{i,t}^{(h)}, \; q_{\alpha,h}^{train}\right),$$
where $q_{\alpha,h}^{train}$ is a low empirical quantile estimated from the current training window. In the implementation, the floor quantile is 0.05 for $h = 1$ and 0.01 for $h = 5$ and $h = 10$. This calibration improves numerical stability without using any future information.
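The floor calibration amounts to one quantile and one element-wise maximum. In the sketch below the quantile levels per horizon come from the text, while the function name, the use of NumPy's default linear-interpolation quantile, and the toy numbers are assumptions.

```python
import numpy as np

FLOOR_QUANTILE = {1: 0.05, 5: 0.01, 10: 0.01}  # horizon -> floor quantile, as stated in the text


def apply_floor(forecast, train_targets, h):
    """Clip forecasts from below at a low quantile of the current training targets."""
    floor = np.quantile(np.asarray(train_targets, dtype=float), FLOOR_QUANTILE[h])
    return np.maximum(np.asarray(forecast, dtype=float), floor)


train_targets = np.arange(1.0, 102.0)          # 1, 2, ..., 101 (toy training window)
floored_h1 = apply_floor([0.5, 10.0], train_targets, h=1)
floored_h5 = apply_floor([0.5, 10.0], train_targets, h=5)
```

Because the floor is taken from the training window only, it moves with the rolling sample and never sees the evaluation period.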
Figure 1 summarizes the complete data-to-feature-to-prediction pipeline. An expanded module-level diagram is provided in the appendix for readers who want a more granular view of the persistence, raw-risk, and multiscale branches.

3.5. Warning Layer

The risk-warning task converts future volatility realizations into an operational high-risk label. For each rolling training window, the warning threshold $\tau_{i,t}^{(h)}$ is defined as the 90th percentile of in-window future volatility targets. The binary event label is
$$z_{i,t}^{(h)} = \mathbb{I}\left(y_{i,t}^{(h)} > \tau_{i,t}^{(h)}\right).$$
This design makes the event definition adaptive to level differences across indices and to changing market regimes.
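The adaptive labelling rule can be illustrated as follows. The 90th-percentile level is from the text; the function name and the toy window are assumptions.

```python
import numpy as np


def warning_labels(train_targets, level=0.90):
    """Threshold at the in-window quantile and flag targets strictly above it."""
    train_targets = np.asarray(train_targets, dtype=float)
    tau = np.quantile(train_targets, level)
    return tau, (train_targets > tau).astype(int)


tau, z = warning_labels(np.arange(1.0, 101.0))   # toy window with targets 1, ..., 100
```

Recomputing the threshold in every window means that roughly one training observation in ten is labelled as an event regardless of the index or the prevailing volatility level.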
The main warning specification uses a logistic model estimated on current market-risk features:
$$\mathbf{x}_{i,t}^{warn} = \left[\mathbf{x}_{i,t}^{raw}, \; \mathbf{s}_{i,t}^{stress}\right],$$
where $\mathbf{s}_{i,t}^{stress}$ collects short-window stress summaries derived from the public predictor set. The probability of a future high-volatility event is modeled as
$$\Pr\left(z_{i,t}^{(h)} = 1 \mid \mathbf{x}_{i,t}^{warn}\right) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \boldsymbol{\beta}^{\top} \mathbf{x}_{i,t}^{warn}\right)\right)}.$$
The warning layer is estimated on the same rolling training window and uses the same causally aligned public information set as the forecasting layer, ensuring that the two tasks remain operationally consistent rather than being calibrated on different information regimes. The final warning decision is then
$$\hat{z}_{i,t}^{(h)} = \mathbb{I}\left(\hat{p}_{i,t}^{(h)} \ge c_{i,h}\right),$$
with the threshold chosen by
$$c_{i,h} = \arg\max_{c \in [0,1]} F_{\beta}\left(y^{train}, \; \mathbb{I}\left(\hat{p}^{train} \ge c\right)\right), \qquad \beta = 2.$$
Using β = 2 reflects the asymmetry of practical warning problems, where missed high-volatility events are usually more costly than additional false alarms.
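A minimal sketch of the threshold search follows. It assumes a uniform grid over [0, 1] and resolves ties in favor of the smallest threshold; neither choice is specified in the text, and the function names are hypothetical.

```python
# Illustrative sketch: choose the alert threshold that maximizes F-beta on the
# training window. Grid resolution and tie-breaking are assumptions.


def f_beta(labels, alerts, beta=2.0):
    """F-beta from binary labels and binary alerts; returns 0.0 if undefined."""
    tp = sum(1 for z, a in zip(labels, alerts) if z and a)
    fp = sum(1 for z, a in zip(labels, alerts) if not z and a)
    fn = sum(1 for z, a in zip(labels, alerts) if z and not a)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def select_threshold(train_labels, train_probs, beta=2.0, grid_size=101):
    """Return (threshold, score) maximizing F-beta over a uniform grid on [0, 1]."""
    best_c, best_score = 0.0, -1.0
    for k in range(grid_size):
        c = k / (grid_size - 1)
        alerts = [p >= c for p in train_probs]
        score = f_beta(train_labels, alerts, beta)
        if score > best_score:  # strict inequality keeps the smallest maximizer
            best_c, best_score = c, score
    return best_c, best_score
```

With β = 2, a missed event (false negative) enters the denominator with four times the weight of a false alarm, which is the asymmetry the text describes.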
Finally, the main regression loss used throughout the paper is QLIKE,
$$L_{\mathrm{QLIKE}}(y, \hat{y}) = \log(\hat{y}) + \frac{y}{\hat{y}},$$
which is standard for volatility forecasting because it rewards both accuracy and calibration on the positive target scale (Patton 2011). This is particularly appropriate here because the empirical objective is not merely to reduce squared error, but to produce stable positive volatility forecasts that remain usable inside a market-risk monitoring system.
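The asymmetry of this loss, and hence the motivation for a forecast floor, is easy to see numerically. The sketch below evaluates the formula as written for a hypothetical daily variance of 1e-4; the numbers are illustrative and not taken from the paper.

```python
import math


def qlike(y, y_hat):
    """QLIKE loss log(y_hat) + y / y_hat for a strictly positive forecast."""
    if y_hat <= 0:
        raise ValueError("QLIKE requires a strictly positive forecast")
    return math.log(y_hat) + y / y_hat


realized = 1e-4  # hypothetical daily variance
for forecast in (1e-4, 1e-2, 1e-6):
    # A forecast 100x too small is penalized far more than one 100x too large.
    print(f"forecast={forecast:.0e}  qlike={qlike(realized, forecast):.3f}")
```

At this target scale a well-calibrated forecast gives a loss near log(1e-4) + 1, roughly −8.2, so negative QLIKE values are the natural range for daily variance targets, and lower values indicate better forecasts.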
Table 4 summarizes the leakage-free rolling implementation of CMWSL.

4. Experimental Design

The experimental design is structured to answer three empirical questions. First, how difficult is the forecasting problem once HAR is treated as a serious baseline rather than as a placeholder comparator? Second, does the proposed multiscale representation add value uniformly, or only under specific horizons, targets, and market states? Third, does the spillover layer contribute incremental predictive information beyond within-index wavelet features? Every component of the evaluation protocol is chosen to answer these questions under a strict no-leakage standard.

4.1. Walk-Forward Protocol

All experiments follow a rolling walk-forward protocol. Each model is estimated on a fixed window of 1260 trading days (approximately five calendar years). After generating the out-of-sample forecast, the evaluation origin advances by one trading day. The evaluation period runs from 4 January 2016 to 31 December 2025, yielding 2513 out-of-sample steps in the main experiment. To reduce unnecessary computational noise while preserving chronology, model refitting is scheduled every five trading days within the rolling process.
Figure 2 illustrates the rolling evaluation structure. The causality requirement applies to the full pipeline, not only to the final model fit. Feature standardization, clipping rules, wavelet decomposition, threshold estimation, calibration floors, and warning thresholds are all computed strictly within the current training window. No full-sample preprocessing is used at any stage. This point is crucial because modest-looking violations of chronology, especially in feature construction or normalization, can materially exaggerate out-of-sample performance in financial machine-learning studies.
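The rolling schedule described above can be sketched as a generator of index ranges. The window length and refit interval are those stated in the text; the function name and the convention that the training range is half-open and ends at the forecast origin are assumptions.

```python
# Illustrative sketch of the walk-forward schedule: a fixed 1260-day training
# window that advances one day per step, with model refits every five steps.


def walk_forward_steps(n_obs, window=1260, refit_every=5):
    """Yield (train_start, train_end, refit) for each out-of-sample step.

    Training rows are the half-open range [train_start, train_end); the
    forecast origin is index train_end. `refit` marks scheduled refit steps.
    """
    for step, origin in enumerate(range(window, n_obs)):
        yield origin - window, origin, step % refit_every == 0


# All preprocessing (scaling, clipping, wavelet summaries, floors, warning
# thresholds) would be recomputed from rows in [train_start, train_end) only.
for train_start, train_end, refit in walk_forward_steps(1265):
    print(train_start, train_end, refit)
```

One detail the sketch leaves out: for h > 1, a training row whose h-day target is not yet fully realized at the forecast origin would also have to be excluded to keep the window causal. How the paper handles this is not stated here.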

4.2. Forecast Horizons, Benchmarks, and Evaluation Metrics

The main forecast horizons are h = 1, h = 5, and h = 10, corresponding to one trading day, one trading week, and approximately two trading weeks. These horizons separate the short-run persistence problem from the medium-horizon risk-monitoring problem. If a proposed method cannot compete at h = 1, that is informative. If it becomes more competitive at h = 5 and h = 10, that suggests it is capturing slower-moving structure that HAR alone cannot fully absorb.
The benchmark set is intentionally demanding. It includes naive historical-volatility baselines, HAR, raw LightGBM, and wavelet-enhanced LightGBM, as well as the spillover-augmented extension. Additional machine-learning and deep-learning baselines are included so that any performance gain can be attributed to multiscale feature design rather than to model capacity alone. The deep-learning benchmark set is intentionally conservative rather than exhaustive: its role is to test whether recurrent nonlinearity over standard volatility inputs overturns the feature-engineering hypothesis under a strict rolling-refit budget, not to claim a full sweep over every modern sequential architecture such as Transformer or Temporal Fusion Transformer variants.
The primary regression metric is QLIKE because it is widely used in volatility forecasting and is sensitive to both prediction error and scale miscalibration for positive targets (Patton 2011). RMSE, MAE, and out-of-sample $R^2$ are reported as supporting metrics. The out-of-sample coefficient of determination is defined as
$$R_{\mathrm{OOS}}^{2} = 1 - \frac{\sum_t (y_t - \hat{y}_t)^2}{\sum_t (y_t - \hat{y}_t^{\mathrm{bench}})^2}.$$
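As a small worked illustration of this definition, the sketch below computes the out-of-sample R² against an arbitrary benchmark forecast; the function name and the toy numbers are hypothetical.

```python
# Illustrative sketch of the out-of-sample R^2 relative to a benchmark forecast.


def r2_oos(y, y_hat, y_bench):
    """1 - SSE(model) / SSE(benchmark); positive when the model beats the benchmark."""
    sse_model = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sse_bench = sum((a - b) ** 2 for a, b in zip(y, y_bench))
    return 1.0 - sse_model / sse_bench


# Toy example: the model halves the benchmark's squared error.
print(r2_oos([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [2.0, 2.0, 2.0]))
```

The statistic is zero when the model matches the benchmark's squared error and negative when it does worse, so its sign depends entirely on which benchmark is used.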
For the warning task, PR-AUC is the primary metric because the high-volatility class is intentionally rare under the rolling 90th-percentile rule. ROC-AUC, Brier score, precision, and recall are reported as complementary measures. The Brier score is
$$\mathrm{Brier} = \frac{1}{T}\sum_{t=1}^{T} (p_t - z_t)^2,$$
where $p_t$ is the predicted warning probability and $z_t$ is the realized event label. The threshold-dependent warning quantities are
$$\mathrm{Precision}(c) = \frac{TP(c)}{TP(c) + FP(c)}, \qquad \mathrm{Recall}(c) = \frac{TP(c)}{TP(c) + FN(c)},$$
and
$$F_{\beta}(c) = \frac{(1+\beta^2)\,\mathrm{Precision}(c)\,\mathrm{Recall}(c)}{\beta^2\,\mathrm{Precision}(c) + \mathrm{Recall}(c)}.$$
The operating threshold for the logistic warning model is chosen by maximizing $F_2$, reflecting the fact that missed high-volatility episodes are typically more costly than additional false alarms in applied risk surveillance (Saito and Rehmsmeier 2015).
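A useful reference level for the Brier score is the value attained by an uninformative constant forecast. The short derivation below assumes that the realized out-of-sample event frequency equals some rate $\pi$ and that the forecast is $p_t = \pi$ for all $t$. The value $\pi = 0.10$ is illustrative, motivated by the 90th-percentile rule; the realized out-of-sample frequency need not equal it exactly.

```latex
\begin{aligned}
\mathrm{Brier}_{\mathrm{const}}
  &= \pi\,(\pi - 1)^2 + (1 - \pi)\,(\pi - 0)^2 \\
  &= \pi(1-\pi)\bigl[(1-\pi) + \pi\bigr] \\
  &= \pi(1-\pi), \\
\pi = 0.10 \;&\Rightarrow\; \mathrm{Brier}_{\mathrm{const}} = 0.10 \times 0.90 = 0.09 .
\end{aligned}
```

Under these assumptions, a warning model adds probabilistic value only to the extent that its Brier score falls below roughly 0.09.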

4.3. Three-Step Ablation Design

The core methodological claim of the paper is evaluated through a three-step ablation ladder:
  • Step 1: HAR → Wavelet-LGB
    This comparison tests whether within-index multiscale representation adds predictive value beyond a strong persistence benchmark. It asks whether scale-specific information from the target index’s own history contains a useful signal beyond the standard 1-, 5-, and 22-day persistence aggregates.
  • Step 2: Wavelet-LGB → Spillover-LGB
    This is the central test of the Cross-Index Wavelet Spillover Hypothesis. The only difference between the two models is the addition of the spillover block ϕ k i , t . Any improvement observed here is therefore attributable specifically to peer-index frequency-domain information.
  • Full gap: HAR → Spillover-LGB
    This comparison summarizes the cumulative effect of moving from a pure persistence benchmark to the full CMWSL representation.
The value of this ladder is that it prevents the paper from making vague claims about hybrid modelling. Each modelling increment has an identifiable information interpretation: persistence, within-index multiscale representation, and cross-index spillover.

4.4. Statistical Inference

Each index–horizon cell is evaluated with explicit out-of-sample forecast comparison tests. The Diebold–Mariano (DM) test is used for non-nested comparisons (Diebold and Mariano 1995). The Clark–West (CW) test is used as the primary inferential tool when the larger model nests the smaller one (Clark and West 2007). In the present design, Step 1 is treated with DM because the comparison is conceptually between different representation families, while Step 2 is explicitly nested and therefore primarily evaluated with CW. Reporting both tests increases transparency, but CW is the correct test for the spillover claim because it adjusts for the parameter-estimation noise introduced by the extra regressors.
The DM test is implemented as a two-tailed Newey–West HAC procedure with lag $T^{1/3}$. The CW test is implemented one-tailed with the larger model as model b, matching the directional hypothesis that spillover information should improve, rather than simply alter, predictive performance. The nine index–horizon cells are reported separately. No multiple-testing correction is imposed because the spillover analysis is not presented as an exploratory search over arbitrary cells; it is a pre-specified directional test with full cell-level disclosure in Section 5.
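The two statistics can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes a Bartlett-kernel Newey–West variance with the lag taken as the integer part of the cube root of T, the squared-error form of the Clark–West adjustment, and standard-normal p-values. All function names are hypothetical.

```python
import math


def floor_cuberoot(n):
    """Largest integer k with k**3 <= n (exact integer arithmetic)."""
    k = 0
    while (k + 1) ** 3 <= n:
        k += 1
    return k


def hac_variance(d, lag):
    """Newey-West long-run variance of d using Bartlett weights."""
    n = len(d)
    mean = sum(d) / n
    e = [x - mean for x in d]
    var = sum(x * x for x in e) / n
    for k in range(1, lag + 1):
        gamma = sum(e[t] * e[t - k] for t in range(k, n)) / n
        var += 2.0 * (1.0 - k / (lag + 1)) * gamma
    return var


def dm_statistic(loss_a, loss_b):
    """Diebold-Mariano t-statistic; positive when model a has the higher loss."""
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    return (sum(d) / n) / math.sqrt(hac_variance(d, floor_cuberoot(n)) / n)


def cw_statistic(y, f_small, f_large):
    """Clark-West t-statistic (squared-error form) for nested forecasts."""
    adj = [(yt - s) ** 2 - ((yt - l) ** 2 - (s - l) ** 2)
           for yt, s, l in zip(y, f_small, f_large)]
    n = len(adj)
    mean = sum(adj) / n
    var = sum((x - mean) ** 2 for x in adj) / (n - 1)
    return mean / math.sqrt(var / n)


def two_sided_p(t):
    """Two-sided standard-normal p-value."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def one_sided_p(t):
    """Upper-tail standard-normal p-value."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))
```

For T = 2513 the cube-root rule gives a lag of 13. Under the normal approximation, a one-sided statistic of 1.75 corresponds to a p-value of about 0.040.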

4.5. Robustness Strategy

Two robustness designs are used to test whether the main findings survive beyond one carefully selected benchmark environment. The first augments the predictor set with a broader block of public risk indicators—credit spreads, policy uncertainty variables, and term-structure signals—asking whether multiscale representation becomes more useful in a richer, stress-aligned environment. The second substitutes the Parkinson estimator for the Rogers–Satchell target, asking whether model rankings change when the volatility proxy draws more heavily on OHLC range information.
Together, these designs provide a stronger standard than a single-sample comparison. A method that performs well only under one volatility proxy and one feature set is difficult to generalize. A method whose relative advantage becomes visible precisely when the information environment is richer, or when the target contains stronger OHLC range information, offers a more credible contribution to risk forecasting.

5. Results

The results are organized around the central empirical question of the paper: under what conditions does CMWSL add predictive value beyond strong persistence benchmarks? Read in that way, the evidence is coherent across all experiments. Three patterns recur. First, one-day index volatility is strongly persistence-driven, and HAR remains the benchmark that must be taken seriously. Second, the value of multiscale representation rises as the task becomes less myopic, the information set becomes richer, or the volatility target contains more informative OHLC structure. Third, cross-index spillover gains are not uniform; they are concentrated in broad-market stress states and in the upper part of the volatility distribution. The section therefore starts with conditional diagnostics, then turns to average performance, robustness designs, risk-warning evidence, the spillover ablation, and interpretability checks.

5.1. Conditional Evidence Before Average Rankings

Figure 3 shows why the paper should not be read as a simple mean-rank comparison. Figure 3a indicates that the multiscale model does not improve monotonically over HAR as realized volatility rises. Instead, its advantage is strongest in the calmest deciles and again in the most turbulent decile, while HAR remains strongest in the middle range, where volatility is persistent but not yet severely stressed. This U-shaped pattern is economically interpretable. When the market is in a routine regime, short-horizon persistence remains hard to beat. When the market becomes extremely calm or clearly stressed, however, richer scale-specific information becomes more useful. Figure 3b reveals a parallel result for spillover. The gain from adding peer-index wavelet features is concentrated in the upper volatility deciles, especially at h = 1 and h = 5, which is consistent with a state-dependent transmission mechanism rather than a uniform cross-market effect.
Figure 4 shows that the same pattern persists in calendar time. In the pooled 126-day rolling comparison, Wavelet-LGB stays below HAR in only 8.1% of the h = 1 windows, but this share rises to 50.7% at h = 5 and 59.0% at h = 10. The horizon effect is therefore not an artefact of full-sample averaging; it becomes more stable as the target moves away from the one-day persistence problem. Figure 4b adds the spillover perspective: Spillover-LGB stays below the within-index wavelet model in 66.8%, 55.7%, and 50.0% of the broad-market rolling windows at h = 1, h = 5, and h = 10, respectively, with the clearest advantages clustered around the 2020 and 2022 stress windows. These diagnostics justify the central framing of the paper: the value of CMWSL is conditional, and those conditions align with economically meaningful market states.
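One way to compute such a rolling share is sketched below. It assumes that "stays below" means a lower mean loss over each 126-day window and that windows advance one day at a time; the paper's exact construction may differ, and the function name is hypothetical.

```python
# Illustrative sketch: share of rolling windows in which a model's mean loss
# is strictly below the benchmark's. Window semantics are an assumption.


def share_of_windows_below(loss_model, loss_bench, window=126):
    """Fraction of rolling windows where the model's summed loss is lower."""
    diffs = [m - b for m, b in zip(loss_model, loss_bench)]
    n_windows = len(diffs) - window + 1
    if n_windows <= 0:
        raise ValueError("series shorter than the rolling window")
    running = sum(diffs[:window])
    wins = 1 if running < 0 else 0
    for t in range(window, len(diffs)):
        running += diffs[t] - diffs[t - window]  # slide the window by one day
        if running < 0:
            wins += 1
    return wins / n_windows
```

Comparing window sums is equivalent to comparing window means because the window length is fixed, so the sketch avoids a division at every step.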

5.2. Main Forecasting Results: Strong Benchmark, Narrow Margins

Table 5, Figure 5, and Figure 6 establish the baseline difficulty of the forecasting problem. In the main Rogers–Satchell specification, HAR remains the best average model on QLIKE (lower is better), with an average value of −8.789, compared with −8.724 for HV-22, −8.681 for raw LightGBM, and −8.649 for wavelet_lightgbm. This is a substantive result rather than a negative one. It confirms that daily stock-index volatility remains dominated by persistence, and that any paper claiming strong performance gains should be judged against that fact.
At the cell level, however, the picture is more nuanced. HAR wins five of the nine index–horizon contests, while LightGBM and wavelet_lightgbm each win two. The multiscale model is strongest for DJIA at h = 10 and Nasdaq-100 at h = 10, while raw LightGBM wins Nasdaq-100 at h = 5 and S&P 500 at h = 10. Figure 6 is especially informative because it shows that several medium- and longer-horizon contests are narrow even when HAR remains the formal winner. The main sample should therefore be interpreted as showing two things simultaneously: short-horizon persistence is extremely hard to beat, but multiscale representation becomes meaningfully competitive once the horizon lengthens.
The scale of the unconditional gains should also be interpreted carefully. In the main Rogers–Satchell setting, the full-sample average QLIKE difference between HAR and wavelet_lightgbm is modest, so the paper does not claim large economic gains from unconditional model replacement. Instead, the practical value of CMWSL lies in its conditional performance: it becomes relatively more useful in the longer-horizon and higher-stress settings where market-risk decisions depend less on next-day persistence and more on the build-up, transmission, and persistence of stress.

5.3. Richer Public Information Strengthens the Multiscale Design

The expanded public-information setting provides stronger support for CMWSL. After adding public credit-spread, policy-uncertainty, and volatility-term-structure variables, the average QLIKE of wavelet_lightgbm improves to −9.570, much closer to HV-22 (−9.618) and HAR (−9.595). In this richer environment, the multiscale model wins four of the nine index–horizon cells, including DJIA at h = 10, Nasdaq-100 at h = 10, and S&P 500 at both h = 5 and h = 10. This shift is central to the paper’s interpretation. It suggests that the wavelet layer is not merely fitting noise in the main sample; it becomes more useful precisely when the public information set contains broader market-risk content that benefits from multiscale organization.
Table 6 summarizes the average rankings across the main and robustness settings. Figure 7 and Figure 8 reinforce this interpretation. The comparison against HAR improves not only on average, but also in the cell-by-cell pattern. Appendix diagnostic figures show the same effect in time-series traces: isolated spikes still often favor HAR, but the wavelet-enhanced model tracks stress build-up and normalization more smoothly at h = 5 and h = 10. Additional appendix diagnostics further indicate that the strongest associations with future volatility lie in longer-scale detail components and that macro-financial information becomes the most useful feature block once the public-information panel is expanded.

5.4. Alternative Target and Risk-Warning Evidence

The Parkinson robustness experiment sharpens the same message. Under this alternative OHLC target, wavelet_lightgbm achieves the best average QLIKE at −9.314, slightly outperforming LightGBM (−9.309) and HAR (−9.302). The raw boosting model still wins five of the nine cells, so the result should not be overstated. But the average lead of the wavelet model is important because it indicates that CMWSL is more useful when the target itself contains richer range information than in the baseline Rogers–Satchell setting.
Table 7 reports the main risk-warning results under the Rogers–Satchell target.
For the risk early-warning task, the clearest evidence comes from the interpretable logistic classifier estimated on raw market-risk features. In the main specification, logistic_raw reaches an average PR-AUC of 0.528, compared with 0.478 for the naive threshold rule, and also achieves a higher average ROC-AUC. In the expanded public-information setting, its PR-AUC rises further to 0.593, again the strongest result among the warning models. Under the Parkinson target, logistic_raw remains the top model on PR-AUC at 0.559. Figure 9 makes this ranking visible in precision–recall space, while Figure 10 shows that elevated warning scores cluster around realized high-volatility episodes rather than appearing uniformly through time.
Table 8 shows that warning quality should also be interpreted as a timing problem. Using a five-day pre-event detection window, logistic_raw detects a larger share of events than the threshold rule at h = 1 and h = 10 and also provides longer median lead times. The price is a higher alert burden, measured by false-alarm days per event. At h = 5, the threshold rule remains more economical and slightly stronger in terms of hit rate. The practical implication is that the logistic warning model is more aggressive and earlier, whereas the threshold rule is sparser and more conservative. An appendix diagnostic figure visualizes this operating frontier at the index–horizon level.
These timing results make the economic significance of the warning layer more concrete. At h = 1, the hit rate increases from 0.613 to 0.821 and the median lead time extends from 4.0 to 5.0 days when moving from the threshold rule to logistic_raw. At h = 10, the hit rate increases from 0.556 to 0.602 and the median lead again reaches 5.0 days. These gains are not free because false alarms increase materially, but they are economically meaningful in settings where missing a high-volatility episode is more costly than responding to an additional alert.
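The timing quantities can be made concrete with a small sketch. The definitions used here are assumptions, since the text does not spell them out: an event counts as detected if any alert falls in the five trading days before it, lead time is measured from the earliest such alert, and a false-alarm day is an alert day that is neither an event day nor inside any pre-event window.

```python
from statistics import median


def timing_metrics(events, alerts, lookback=5):
    """Hit rate, median lead, and false-alarm days per event (assumed definitions).

    `events` and `alerts` are equal-length boolean sequences indexed by trading day.
    """
    event_days = [t for t, e in enumerate(events) if e]
    leads, covered = [], set()
    for t in event_days:
        window = range(max(0, t - lookback), t)  # the days before the event
        covered.update(window)
        hits = [s for s in window if alerts[s]]
        if hits:
            leads.append(t - hits[0])  # lead from the earliest alert in the window
    n_events = len(event_days)
    false_days = sum(1 for s, a in enumerate(alerts)
                     if a and s not in covered and not events[s])
    nan = float("nan")
    return {
        "hit_rate": len(leads) / n_events if n_events else nan,
        "median_lead": median(leads) if leads else nan,
        "false_alarms_per_event": false_days / n_events if n_events else nan,
    }
```

Under these definitions the lead time is capped at the lookback length, which would be consistent with a reported median lead of 5.0 days under a five-day window.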
Additional DM and CW benchmark-comparison tests are reported in the appendix. The overall message remains disciplined: there is no blanket nonlinear dominance, but there is clear evidence that incremental predictive content appears when the horizon lengthens, the information set broadens, or the market enters more stressed states.

5.5. Cross-Index Spillover Ablation Results

Table 9 reports the full three-step ablation evidence over the 2016–2025 out-of-sample evaluation. This is the most direct test of the paper’s novelty claim because it asks whether peer-index frequency-domain information adds value beyond within-index multiscale representation. An appendix figure reports the corresponding QLIKE model rankings as grouped bars.
Figure 11 summarizes the significance pattern for the ablation tests, and Figure 12 reports the corresponding gain ladder by index and horizon.

5.5.1. Step 1: HAR Versus Within-Index Wavelet Representation

The first ablation step shows that within-index wavelet features add value selectively rather than uniformly. For DJIA, HAR significantly outperforms Wavelet-LGB at h = 1 (DM t = 3.67, p < 0.001) and h = 5 (p = 0.019), confirming that short-run persistence dominates for this broad, price-weighted index. The relative advantage narrows at h = 10, where Wavelet-LGB posts a slightly better QLIKE, though not significantly so. For Nasdaq-100, HAR dominates strongly at h = 1 (DM t = 4.88, p < 0.001), but the relation reverses at h = 10 (p = 0.001), indicating richer long-horizon multiscale structure in the technology-heavy index. For S&P 500, HAR wins significantly at h = 1, while the medium- and long-horizon cells are statistically inconclusive. Step 1 therefore supports the first part of the paper’s argument: wavelet representation is most useful when the problem moves away from the one-day persistence regime.

5.5.2. Step 2: Within-Index Wavelet Versus Spillover-Augmented Wavelet

The second ablation step is the core test of the Cross-Index Wavelet Spillover Hypothesis. The two-sided DM test does not reach p < 0.05 in any of the nine cells, which is not surprising in a nested setting where the larger model pays an estimation-noise penalty. The CW test, which is specifically designed for nested comparisons, paints a more informative picture.
Five of the nine cells are significant at p < 0.05 under CW. DJIA shows significant gains at h = 1 (CW t = 1.75, p = 0.040) and h = 5 (CW t = 2.61, p = 0.005). S&P 500 shows significant gains at h = 1 (p = 0.019), h = 5 (p = 0.026), and most strongly at h = 10 (CW t = 4.83, p < 0.001). The S&P 500 h = 10 cell is the clearest positive result in the article: peer-index d2/d3 wavelet energy provides information about future broad-market volatility that is not already contained in the target index’s own persistence and multiscale history.
Nasdaq-100 behaves differently. None of its three cells is significant under CW, and the h = 5 cell even shows a negative CW statistic. This absence of spillover response is informative in its own right. It suggests that the volatility dynamics of a technology-concentrated index occupy a different frequency-domain regime from those of the broad-market indices used as peers. The same weekly-to-biweekly peer-market channels that help forecast DJIA and S&P 500 do not transfer cleanly to Nasdaq-100.
The regime-conditioned evidence in the appendix sharpens this interpretation. For the pooled DJIA and S&P 500 sample, spillover gains are strongest in recession and high-VIX windows at h = 1 and h = 5. Combined with Figure 3 and Figure 4, the result supports a clear interpretation: peer-index wavelet information is most useful when broad-market stress is elevated, synchronized, and still unfolding.

5.5.3. Why DM and CW Differ

The divergence between DM and CW is not contradictory. DM tests realized loss differences directly and does not correct for the attenuation induced by estimating additional regressors in the larger model. CW adds the usual nested-model adjustment term $(\hat{y}_{1,t} - \hat{y}_{2,t})^2$, which raises power when the larger model contains real but noisy incremental information (Clark and West 2007). In the present setting, the pattern of positive QLIKE gains combined with CW significance indicates that the spillover features contain useful predictive signal, even though that signal is modest and partially offset by estimation noise in finite rolling samples.

5.5.4. Takeaway from the Ablation

The spillover ladder establishes three regularities. HAR remains the dominant short-horizon benchmark. Within-index wavelet representation adds value selectively, mainly at longer horizons. Cross-index d2/d3 spillover features provide statistically significant incremental information for DJIA and S&P 500 at multiple horizons, with the S&P 500 h = 10 cell providing the strongest evidence. This is the most novel result of the paper because it isolates a specific type of cross-market information that matters precisely where time-domain persistence becomes less sufficient.

5.6. Interpretability and Deep-Learning Benchmarking

To examine whether the gains of CMWSL are driven by economically meaningful inputs rather than by opaque tree interactions, we compute SHAP values (Lundberg and Lee 2017) for a full-window wavelet_lightgbm model fitted at the start of the test period for the S&P 500 index at both h = 1 and h = 5.
Figure 13 documents the horizon-dependent shift in feature importance. At h = 1, HAR persistence lags and short-run technical signals such as MACD and ATR dominate the ranking, consistent with a regime where next-day volatility is primarily determined by the recent level and trajectory of the series. At h = 5, wavelet energy variables at the d2 and d3 scales of the VIX-related and Rogers–Satchell input series move to the top of the ranking, while the relative importance of raw persistence lags declines. This transition mirrors the pattern documented in Figure 3 and Figure 4, and directly supports the paper’s central interpretation that medium-scale frequency-domain structure becomes decision-relevant precisely as the forecast horizon extends into the range most useful for portfolio risk monitoring.
Figure 14 shows the distribution of SHAP contributions for the S&P 500 at h = 5.

5.7. LSTM Baseline Comparison

To test whether the performance of wavelet_lightgbm reflects feature quality rather than only model flexibility, we include a compact LSTM baseline evaluated under the same 1260-day rolling protocol. The LSTM is trained on the three HAR inputs, uses 16 hidden units with chronological early stopping, and targets the same log1p-transformed realized variance as the tree-based models. This HAR-LSTM specification asks whether a neural sequence model can extract gains from the standard persistence feature set without the wavelet layer.
Table 10 shows that the LSTM is broadly comparable to HAR at h = 1 but remains behind wavelet_lightgbm at h ≥ 5, where the multiscale feature advantage is most pronounced. This benchmark should be interpreted as a controlled comparison rather than as a complete deep-learning horse race. Its role is to show that recurrent nonlinearity over the standard HAR feature set does not by itself overturn the main finding. The paper therefore does not argue that tree boosting dominates every modern sequential architecture; it argues that the incremental value of CMWSL comes primarily from causal multiscale and spillover-aware feature design.

6. Discussion

The results support a clear but disciplined interpretation. CMWSL does not overturn the well-known fact that daily stock-index volatility is strongly persistence-driven. HAR remains the right benchmark, and in the main Rogers–Satchell specification, it remains the strongest average model. That finding should not be minimized. It is precisely because the benchmark is strong that the conditional gains documented in this paper matter. A credible contribution in this literature is not universal dominance over HAR, but a precise account of when richer representation begins to add value.
The first substantive implication concerns the forecast horizon. The results consistently show that the relative value of multiscale representation rises as the task moves from one-day forecasting toward five- and ten-day horizons. This pattern has a natural economic interpretation. At the shortest horizon, the next day’s volatility is heavily determined by immediate persistence and local market continuation. At longer horizons, however, medium-scale adjustment processes, information diffusion, and broader market-risk conditions have more time to matter. The wavelet layer is useful precisely because it organizes those medium-scale dynamics without discarding the strong persistence structure already captured by HAR.
The second implication concerns information richness. CMWSL performs more strongly when the predictor set is expanded with public credit-spread, policy-uncertainty, and volatility-term-structure variables, and when the target is changed from Rogers–Satchell to Parkinson volatility. These are not incidental robustness wins. Together they suggest that the multiscale layer becomes more valuable when the forecasting environment contains more heterogeneous information or when the target places more weight on informative OHLC range variation. In other words, the wavelet design is not acting as a generic complexity booster; it is acting as an information organizer.
The third implication concerns the nature of cross-market dependence. The spillover results indicate that peer-market information is not uniformly useful. Its contribution is concentrated in broad-market indices and in stressed states, and it appears most clearly through the d2 and d3 wavelet bands. This point matters because it refines the usual spillover narrative. Existing work often treats cross-market dependence as a broad time-domain phenomenon. The present evidence suggests a more specific view: a meaningful part of risk transmission across indices occurs through medium- and long-scale channels associated with weekly to monthly reallocation, gradual risk repricing, and synchronized stress adjustment. That interpretation is especially compelling for DJIA and S&P 500, and much less so for the technology-concentrated Nasdaq-100.

6.1. Economic Interpretation of Frequency-Dependent Spillovers

The results support a deeper financial-economics interpretation than a purely technical reading would suggest. The concentration of predictive spillover at the d2 and d3 scales indicates that stock-index integration is not expressed most clearly through next-day volatility contagion, but through slower-moving channels that unfold over several trading days. Plausible mechanisms include institutional portfolio rebalancing, volatility-targeting and risk-parity deleveraging, dealer hedging of index derivatives, ETF and index-arbitrage activity, margin-related liquidity adjustment, and delayed assimilation of macro-financial information. All of these processes operate on a slower clock than overnight return continuation, which helps explain why cross-index information enters most clearly through medium- and long-scale wavelet energy rather than through the highest-frequency band (Son et al. 2023; Souropanis and Vivian 2023; Zeng et al. 2024).
This interpretation also has implications for the theory of financial market integration. For the broad U.S. indices, the evidence suggests that integration is strongest not merely as contemporaneous co-movement, but as shared exposure to medium-horizon risk repricing. During stressed periods, investors, dealers, and leveraged funds rebalance common market exposures across broad equity benchmarks over several days rather than instantaneously. That mechanism is consistent with the stronger spillover response found for DJIA and S&P 500. By contrast, the weaker spillover response of Nasdaq-100 indicates partial segmentation: a technology-concentrated index remains more exposed to sector-specific earnings narratives, duration-sensitive valuation repricing, and innovation-cycle news than to the broad-market integration channel that dominates at d2/d3 scales.
Put differently, CMWSL does not merely reveal that peer markets matter. It suggests that the economically relevant transmission mechanism is frequency-dependent. The broad-market integration process becomes visible when volatility is allowed to propagate through medium-horizon scales, which is precisely why the spillover layer adds most value when market stress is synchronized rather than idiosyncratic.
The divergence between DM and CW in the spillover comparison fits this interpretation. DM is conservative in nested settings because the larger model pays an estimation-noise penalty that is not explicitly corrected. CW is designed for exactly this situation. The fact that CW detects significant incremental spillover content where positive QLIKE gains are also observed implies that the peer-index wavelet block contains real predictive information, but that the signal is moderate rather than overwhelming. This combination strengthens rather than weakens the paper’s claims: the evidence is robust enough to survive a correctly specified nested-model test, while remaining within the range that is realistic for financial time-series data of this kind.
The warning results lead to a related practical conclusion. The best warning model in the paper is not the most elaborate one. A simpler logistic classifier built on the same causal feature pipeline delivers the most stable precision–recall performance and the clearest operating trade-off between earlier detection and alert burden. This is important for market-risk practice. Warning systems are often judged not only by discrimination, but also by whether they can be audited, calibrated, and implemented by human decision makers. The logistic model is therefore a feature rather than a limitation of the paper’s design.
The economic significance of the forecasting gains should also be stated with care. Some unconditional QLIKE differences are numerically small and should not be presented as large economic gains. In the main sample, the average gap between HAR and wavelet_lightgbm is modest; even in the Parkinson robustness setting, the average lead of the wavelet model over HAR remains narrow. For that reason, the practical significance of CMWSL should not be framed as universal replacement of the benchmark. Its value is conditional: it appears in the states in which risk management decisions are most consequential, namely medium-horizon forecasting under elevated and synchronized stress, together with earlier warning detection at the cost of a transparent increase in alert burden.

6.2. Implications for Financial Risk Management Practice

The results carry several direct implications for market risk management practice, beyond their contribution to the academic forecasting literature.
First, the state-dependent pattern of CMWSL gains has immediate relevance for risk model governance. The finding that multiscale and cross-index improvements concentrate in upper-volatility regimes and synchronized stress windows implies that practitioners can adopt CMWSL as a selective overlay: using the HAR baseline under normal market conditions and switching to the full multiscale pipeline when macro-financial indicators or rolling volatility diagnostics signal an elevated-stress environment. This regime-conditional deployment is consistent with existing internal model validation frameworks that require documented conditions under which supplementary models become operative.
Second, the risk early warning classifier delivers an operationally tractable alert mechanism. The precision–recall trade-off documented across settings allows a risk manager or chief risk officer to calibrate the alert threshold to match the institution’s tolerance for missed warnings versus false positives. Unlike black-box neural network warning systems, the logistic classifier built on CMWSL features is directly interpretable: feature coefficients indicate which wavelet bands and macro-financial variables are currently driving the alert probability, which supports both model validation and regulatory communication. This property is especially relevant under risk management frameworks that require human-intelligible model outputs.
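The threshold calibration described above can be sketched as a search for the cutoff that maximizes an F-beta score on predicted alert probabilities. This is a minimal illustration, assuming the search runs on training-window data only; `fbeta` and `best_threshold` are hypothetical helper names, and the probabilities and labels are made up.

```python
def fbeta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(probs, labels, beta=2.0):
    """Return the (cutoff, score) pair that maximizes F-beta.

    An alert is raised when the predicted probability is >= cutoff.
    Candidate cutoffs are the distinct predicted probabilities.
    """
    best_cut, best_score = None, -1.0
    for cut in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= cut and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= cut and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < cut and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = fbeta(precision, recall, beta)
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


# Illustrative inputs only.
probs = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
cut, score = best_threshold(probs, labels, beta=2.0)
```

Raising `beta` moves the chosen cutoff toward fewer missed events at the price of more alerts, which is the tolerance trade-off a risk manager would set.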
Third, the public-data design strengthens the case for deployment. All predictors—including the Rogers–Satchell and Parkinson volatility targets, wavelet features, and macro-financial enrichment variables—are derived from publicly available daily market and FRED data. This means that the pipeline can be reproduced, back-tested, and independently verified without access to proprietary intraday feeds. For institutions subject to model risk management guidelines (such as the Basel Committee’s principles for sound model risk management, or the U.S. SR 11-7 guidance), a transparent, reproducible, public-data pipeline with documented walk-forward evaluation records is directly aligned with validation requirements.
Fourth, the spillover evidence has portfolio-level implications. The finding that S&P 500 volatility benefits most from DJIA and Nasdaq wavelet spillover at the ten-day horizon suggests that cross-index risk transmission at medium frequencies is a material input to multi-asset portfolio risk estimation. Portfolio risk managers monitoring equity correlations may find that adding medium-scale wavelet energy from peer indices to their risk factor models improves variance forecast accuracy during stress windows—which is precisely when diversification benefits are most likely to break down and portfolio Value-at-Risk estimates are most likely to underestimate actual exposure.

6.3. What the Paper Contributes to the Literature

Taken together, the findings support three contributions to the volatility-forecasting and risk-monitoring literature.
First, the paper contributes a stricter evaluation standard. By taking HAR seriously, using public data only, and enforcing chronology through the entire feature pipeline, it reduces the scope for inflated claims. This matters because many apparent machine-learning improvements in financial forecasting disappear once the benchmark and evaluation protocol become more demanding.
Second, the paper contributes a more specific understanding of multiscale information. The wavelet block is not helpful everywhere, but it becomes valuable in exactly the settings where a risk manager would expect linear persistence to become less sufficient: longer horizons, richer information environments, and stressed states. That conditional structure is a more useful contribution than a marginal full-sample average improvement would have been.
Third, the paper provides direct evidence on cross-index spillover at specific wavelet frequencies. The S&P 500 $h = 10$ cell, and the broader DJIA and S&P 500 CW pattern, confirm that peer-index d2/d3 wavelet energy adds predictive content beyond the target index’s own history. This is the study’s strongest novel result: it isolates a specific transmission channel rather than merely asserting that peer markets matter.

6.4. Limitations and Next Steps

The study has several limitations. The analysis is restricted to three major U.S. equity indices, so the spillover findings should not be over-generalized to other markets or asset classes. The public-data design improves reproducibility, but it excludes proprietary intraday information that might sharpen very short-horizon forecasting or clarify cross-market lead–lag effects. Some extended macro variables are not treated with full vintage-level release timing in the main benchmark, so the expanded-data setting should be interpreted as a demanding robustness design rather than as the cleanest causal baseline. In addition, the robustness exercises do not yet use the full 2016–2025 evaluation length of the main specification.
These limitations point directly to future work. The most important extension is to test the same spillover design on international index panels and cross-asset systems, where asynchronous trading hours and stronger lead–lag channels could make frequency-specific spillover even more informative. A second extension is to lengthen the robustness windows so that the richer-information and Parkinson-target experiments match the full main evaluation period. A third is to complement the current cell-by-cell DM and CW analysis with model-confidence-set procedures. Finally, broader sequential benchmark sets, including Transformer or Temporal Fusion Transformer variants, could be evaluated, but only if they are held to the same standards of chronology, calibration, rolling-refit fairness, and public-data reproducibility used in the present paper.

7. Conclusions

This paper presents CMWSL, a public-data causal multiscale wavelet spillover learning framework for stock-index volatility forecasting and risk early warning. The framework was designed to answer a narrow but important question: when does a causal multiscale representation add useful predictive information beyond strong persistence benchmarks? The answer emerging from the evidence is clear and disciplined.
First, daily index volatility remains strongly persistence-driven. HAR is not a weak comparator but the correct baseline, and it remains the strongest average model in the main Rogers–Satchell specification. Second, within-index multiscale wavelet representation becomes more valuable as the forecasting problem becomes less myopic. The gains are more visible at $h = 5$ and $h = 10$, in richer public-information environments, and under the Parkinson target. Third, the spillover extension provides genuine incremental information for broad-market indices. Clark–West tests detect significant gains in five of nine index–horizon cells, with the strongest evidence for the S&P 500 at $h = 10$ (CW $= 4.83$, $p < 0.001$). Tail-conditioned and rolling-window diagnostics further show that these gains are concentrated in upper-volatility regimes and synchronized stress windows rather than in average conditions. Economically, this pattern is consistent with a market-integration channel that operates primarily through multi-day common-risk repricing rather than through uniform next-day contagion.
For financial risk management practice, these results carry a clear operational message. Under normal market conditions, HAR-class persistence models remain the most reliable and interpretable baseline for daily volatility risk measurement. The CMWSL extension is most valuable as a medium-horizon risk-monitoring overlay—activated by elevated macro-financial stress signals, widening credit spreads, or synchronized cross-index volatility clustering—where its frequency-specific spillover and multiscale features capture risk dynamics that a purely linear model misses. The integrated early-warning classifier provides a directly auditable, threshold-calibrated alert mechanism that connects the forecasting pipeline to actionable risk management decisions. The paper therefore does not claim large unconditional economic gains from replacing the baseline; it claims conditional economic value in the states and horizons in which risk oversight is most difficult and most consequential.
Several extensions are natural. International index panels, cross-asset systems, longer robustness windows, and formal model-confidence-set analysis would all strengthen external validity. More specialized sequence models may also be worth testing if they are evaluated under the same chronology and reproducibility constraints. For the present paper, however, the main conclusion is already well supported: causal multiscale wavelet spillover learning offers a useful, transparent, and reproducible extension to persistence-based market risk monitoring, and its value is most visible—and most consequential for risk management practice—when cross-market stress becomes more synchronized, more state-dependent, and more challenging for linear risk models to track in real time.

Author Contributions

Conceptualization, H.L., Y.S. and A.J.; methodology, H.L.; software, H.L.; validation, H.L. and Y.S.; formal analysis, H.L.; investigation, H.L.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.L., Y.S. and A.J.; visualization, H.L.; supervision, A.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw market data used in this study are publicly accessible from Stooq (https://stooq.com, accessed on 17 March 2026) and the Federal Reserve Economic Data (FRED) platform (https://fred.stlouisfed.org, accessed on 17 March 2026). The repository snapshot used to generate this manuscript contains a structured reviewer reproducibility package with dependency specifications, experiment configuration files, data-construction scripts, evaluation launchers, merged prediction outputs, and figure-generation scripts sufficient to regenerate the reported tables and figures. These materials are organized in the project repository and can be shared directly with editors and reviewers during evaluation.

Acknowledgments

The authors thank colleagues and reviewers for their constructive suggestions during the development of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

ATR: Average true range
CMWSL: Causal multiscale wavelet spillover learning
DM: Diebold–Mariano
EGARCH: Exponential generalized autoregressive conditional heteroskedasticity
ETF: Exchange-traded fund
FRED: Federal Reserve Economic Data
GARCH: Generalized autoregressive conditional heteroskedasticity
HAR: Heterogeneous autoregressive
OHLC: Open–high–low–close
PR-AUC: Area under the precision–recall curve
QLIKE: Quasi-likelihood loss
RS: Rogers–Satchell
SWT: Stationary wavelet transform

Appendix A. Supplementary Experimental Notes

This appendix collects implementation details that support reproducibility but would otherwise interrupt the flow of the main text.

Appendix A.1. Feature Blocks and Practical Implementation

The final predictor design is organized into persistence, technical, macro-financial, and multiscale blocks. The persistence block contains current volatility, 5-day average volatility, and 22-day average volatility. The technical block contains close-to-close returns, intraday returns, overnight gaps, ATR, RSI, MACD, and short-window stress summaries. The macro-financial block contains index-specific implied-volatility proxies together with rates, term spreads, financial conditions, recession information, and, in the expanded setting, credit spreads, policy uncertainty, and volatility-term-structure variables. The multiscale block contains causal wavelet summaries extracted from the target volatility proxy, absolute returns, and implied-volatility series.
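The persistence block can be sketched in a few lines. The function below builds the three HAR regressors (current value, 5-day average, 22-day average) from a daily volatility series using only observations up to and including the forecast origin; `har_features` is a hypothetical name and the input series is illustrative.

```python
def har_features(vol, t):
    """Return (daily, weekly, monthly) HAR regressors at index t.

    vol  list of daily volatility values in chronological order
    t    index of the forecast origin; only vol[:t + 1] is used
    """
    if t < 21:
        raise ValueError("need at least 22 observations up to t")
    daily = vol[t]
    weekly = sum(vol[t - 4 : t + 1]) / 5      # last 5 trading days
    monthly = sum(vol[t - 21 : t + 1]) / 22   # last 22 trading days
    return daily, weekly, monthly


# Illustrative series 1.0, 2.0, ..., 30.0.
vol = [float(i) for i in range(1, 31)]
daily, weekly, monthly = har_features(vol, 21)
```

Because every slice ends at `t`, the features never touch observations after the forecast origin.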
Table A1 lists the fixed implementation choices used in the reproducible pipeline.
Table A1. Key implementation choices used in the final reproducible pipeline. Each setting is fixed before the final out-of-sample evaluation.

| Component | Final Choice | Practical Rationale |
|---|---|---|
| Core sample | SPX, NDQ, and DJI daily panels | Balances market representativeness, data length, and replication cost |
| Main volatility proxy | Rogers–Satchell | Uses OHLC information while remaining stable under a public daily-data design |
| Robustness target | Parkinson | Tests whether conclusions depend on the main OHLC proxy |
| Wavelet basis and depth | Symlet-4, $J = 3$ | Produces a compact short/medium/long decomposition without over-expanding the feature space |
| Wavelet extraction window | Minimum 128 and maximum 256 observations | Preserves causal computation while avoiding excessive edge instability |
| Rolling estimation window | 1260 trading days | Approximates a five-year learning window for stable walk-forward estimation |
| Refit frequency | Every 5 trading days | Controls runtime while keeping model updates frequent enough for daily deployment |
| Warning threshold | Rolling 90th percentile of future volatility | Adapts to cross-index scale differences and regime shifts |
| Decision threshold | $F_2$-optimal cutoff | Prioritizes recall because missed high-risk episodes are more costly than false alarms |
| Primary evaluation metric | QLIKE | Appropriate for positive volatility forecasts and imperfect volatility proxies |
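For readers who want to reproduce the two OHLC targets named in Table A1, the standard single-day forms of the Rogers–Satchell and Parkinson estimators are sketched below. Both return a daily variance estimate; whether the pipeline works with the variance, its square root, or a multi-day average is not stated in this appendix, so that choice is left open here. The function names are hypothetical.

```python
import math


def rogers_satchell(o, h, l, c):
    """Rogers-Satchell (1991) daily variance estimate from one OHLC bar."""
    return (
        math.log(h / c) * math.log(h / o)
        + math.log(l / c) * math.log(l / o)
    )


def parkinson(h, l):
    """Parkinson (1980) daily variance estimate from the high-low range."""
    return math.log(h / l) ** 2 / (4.0 * math.log(2.0))


# Illustrative bar: open 100, high 102, low 99, close 101.
rs_var = rogers_satchell(100.0, 102.0, 99.0, 101.0)
pk_var = parkinson(102.0, 99.0)
```

A bar with no intraday range (open = high = low = close) gives zero under both estimators, which is one reason a small floor is usually applied before taking logs or ratios of the proxy.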

Appendix A.2. Rolling Evaluation Details

Every rolling split follows the same sequence: construct all features from the trailing estimation window, estimate the horizon-specific regression model, apply the QLIKE-oriented calibration floor, compute the rolling high-risk threshold, estimate the warning model, and then score the next out-of-sample observation. No preprocessing step is fitted on the full sample. This includes scaling, clipping, wavelet extraction, calibration-floor estimation, and warning-threshold selection.
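The sequence above can be outlined as a generic walk-forward loop. This is a simplified sketch rather than the project code: `fit` and `predict` are caller-supplied placeholders standing in for the full feature, calibration, and warning steps, and the defaults mirror the window length and refit frequency listed in Table A1.

```python
def walk_forward(series, fit, predict, window=1260, refit_every=5, horizon=1):
    """Roll a fixed-length training window forward one observation at a time.

    fit(train) -> model          estimated on the trailing window only
    predict(model, train) -> f   forecast made at the end of that window
    Returns a list of (target_index, forecast) pairs.
    """
    forecasts = []
    model = None
    origins = range(window, len(series) - horizon + 1)
    for step, origin in enumerate(origins):
        train = series[origin - window : origin]  # nothing at or after origin
        if model is None or step % refit_every == 0:
            model = fit(train)                    # periodic refit
        target_index = origin + horizon - 1
        forecasts.append((target_index, predict(model, train)))
    return forecasts


# Toy run: the "model" is the training-window mean.
series = [float(i) for i in range(20)]
out = walk_forward(
    series,
    fit=lambda train: sum(train) / len(train),
    predict=lambda model, train: model,
    window=10,
    refit_every=5,
    horizon=1,
)
```

In a full implementation, scaling, clipping, wavelet extraction, the calibration floor, and the warning threshold would all be computed inside `fit` from `train`, so that no step sees the evaluation observation.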

Appendix A.3. Notation, Metrics, and Testing Notes

The notation table in the main text is intentionally compact. In implementation terms, the most important distinction is between raw predictors, multiscale summaries, and post-calibration forecasts. Raw predictors are observable market and macro-financial variables. Multiscale summaries are deterministic transformations of causal windows of these variables. Post-calibration forecasts are the quantities actually sent to the QLIKE evaluator and the warning layer. This distinction matters because only the final step changes the loss geometry without changing the underlying training target.
The evaluation metrics also answer different questions. QLIKE emphasizes scale-sensitive volatility calibration, RMSE and MAE measure point error magnitude, PR-AUC measures event ranking under class imbalance, and Brier score evaluates probability calibration. DM tests speak to realized loss differences, whereas CW tests speak to incremental predictive content in nested settings. The article reports both because neither on its own is sufficient to characterize the contribution of a public-data hybrid model.
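Two of these metrics are easy to state in code. The sketch below uses one common normalization of QLIKE, $\sigma^2/\hat{\sigma}^2 - \ln(\sigma^2/\hat{\sigma}^2) - 1$, which may differ by a constant or by scaling from the form used in the main text, together with the Brier score. The function names and the `floor` value are assumptions made for illustration.

```python
import math


def qlike(proxy, forecast, floor=1e-12):
    """QLIKE loss for one observation (zero when forecast equals proxy).

    proxy     realized variance proxy
    forecast  variance forecast
    floor     small positive bound to keep the ratio and log defined
    """
    ratio = max(proxy, floor) / max(forecast, floor)
    return ratio - math.log(ratio) - 1.0


def brier(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


under = qlike(2.0, 1.0)  # forecast below the proxy
over = qlike(1.0, 2.0)   # forecast above the proxy
b = brier([0.9, 0.1], [1, 0])
```

The asymmetry is visible directly: under-predicting variance by a factor of two costs more than over-predicting it by the same factor.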

Appendix A.4. Additional Robustness Interpretation

The main robustness exercises serve different purposes. The expanded public risk-data experiment tests whether the proposed multiscale design benefits from richer external predictors that remain publicly accessible. The Parkinson experiment tests whether the main conclusion depends on a specific OHLC volatility proxy. Together, these extensions indicate that the proposed framework should be interpreted as a public-data integration pipeline rather than a single narrowly optimized model.

Appendix A.5. Additional Architecture and Diagnostic Figures

Figure A1 provides the module-level expansion of the main architecture used in Section 3. Figures A2–A4 report secondary diagnostics that support, but do not define, the main computational claims in Section 5. Figures A5 and A6 provide additional warning-frontier and spillover-ablation views.
Figure A1. Module-level expansion of the CMWSL pipeline architecture. The figure details the four sequential layers described in Section 3: the HAR persistence block, the causal stationary-wavelet-transform multiscale layer with scale-wise summary statistics, the cross-index spillover layer, and the horizon-specific forecast and warning heads. The mapping from raw features to final predictions is fully causal and deterministic given the rolling window.
Figure A2. Representative out-of-sample forecast traces in the main and expanded-data specifications. The wavelet-enhanced model follows the build-up and decay of stress episodes more smoothly at $h = 5$ and $h = 10$, whereas isolated daily spikes still tend to favour the parsimonious HAR benchmark.
Figure A3. Multiscale signal map in the expanded public risk-data panel. Each cell reports the median absolute Spearman association between a wavelet feature group and future volatility.
Figure A4. Surrogate leave-one-block-out contribution map in the expanded public risk-data panel. Each cell reports the change in out-of-sample $R^2$ from removing one feature block from a fixed train/test surrogate regression on block embeddings.
Figure A5. Event-based warning frontier at the index–horizon level. The horizontal axis reports false-alarm days per event on a log scale, the vertical axis reports the five-day event hit rate, and marker size scales with the mean lead time. Grey line segments connect the two warning rules within the same index–horizon cell.
Figure A6. QLIKE comparison across indices and forecast horizons for the three models in the spillover ablation. Asterisks above the Spillover-LGB bar indicate Clark–West significance at the corresponding cell.

Appendix A.6. Additional Statistical Tables

Table A2 reports the additional forecast-comparison tests, and Table A3 reports regime-conditioned spillover gains.
Table A2. Forecast comparison tests against HAR in the main specification.

| Index | Horizon | Model | DM Stat | DM p-Value | CW Stat | CW p-Value |
|---|---|---|---|---|---|---|
| DJIA | 1 | LightGBM | 3.572 | 0.0004 | 2.060 | 0.0197 |
| DJIA | 1 | Wavelet-LightGBM | 3.344 | 0.0008 | 2.440 | 0.0073 |
| DJIA | 5 | LightGBM | 1.925 | 0.0542 | 2.333 | 0.0098 |
| DJIA | 5 | Wavelet-LightGBM | 1.458 | 0.1448 | 2.503 | 0.0062 |
| DJIA | 10 | LightGBM | −0.228 | 0.8194 | 2.901 | 0.0019 |
| DJIA | 10 | Wavelet-LightGBM | −1.158 | 0.2467 | 2.983 | 0.0014 |
| Nasdaq-100 | 1 | LightGBM | 2.927 | 0.0034 | 2.935 | 0.0017 |
| Nasdaq-100 | 1 | Wavelet-LightGBM | 3.967 | 0.0001 | 3.542 | 0.0002 |
| Nasdaq-100 | 5 | LightGBM | −0.990 | 0.3224 | 4.516 | 0.0000 |
| Nasdaq-100 | 5 | Wavelet-LightGBM | −0.353 | 0.7239 | 4.927 | 0.0000 |
| Nasdaq-100 | 10 | LightGBM | −1.080 | 0.2801 | 4.682 | 0.0000 |
| Nasdaq-100 | 10 | Wavelet-LightGBM | −2.145 | 0.0320 | 4.821 | 0.0000 |
| S&P 500 | 1 | LightGBM | 2.639 | 0.0083 | 2.641 | 0.0041 |
| S&P 500 | 1 | Wavelet-LightGBM | 2.614 | 0.0090 | 2.669 | 0.0038 |
| S&P 500 | 5 | LightGBM | 1.004 | 0.3155 | 2.489 | 0.0064 |
| S&P 500 | 5 | Wavelet-LightGBM | 0.900 | 0.3683 | 2.678 | 0.0037 |
| S&P 500 | 10 | LightGBM | −0.180 | 0.8572 | 2.784 | 0.0027 |
| S&P 500 | 10 | Wavelet-LightGBM | 0.202 | 0.8398 | 2.736 | 0.0031 |
Table A3. Broad-market regime-conditioned spillover gains. Entries report ΔQLIKE = Spillover-LGB minus Wavelet-LGB for DJIA and S&P 500 pooled observations. Negative values indicate that cross-index spillover features improve accuracy. The largest improvement within each horizon is the most negative entry in its column.

| Regime | $h = 1$ | $h = 5$ | $h = 10$ |
|---|---|---|---|
| Bottom 20% VIX | −0.042 | 0.009 | −0.004 |
| COVID 2020–2021 | −0.377 | 0.122 | 0.011 |
| Expansion | −0.054 | −0.016 | −0.006 |
| Post-2022 | −0.028 | −0.109 | −0.002 |
| Recession | −0.674 | −0.361 | 0.145 |
| Top 20% VIX | −0.069 | −0.045 | −0.002 |

Appendix A.7. Project Assets

The core manuscript results are generated from three merged result packages: the main Rogers–Satchell specification, the expanded public risk-data specification, and the Parkinson specification. The manuscript figures and tables are produced from these merged packages through deterministic scripts stored in the project repository. This setup is intended to make the final article easy to audit, rerun, and extend.

Figure 1. Overall architecture of the proposed CMWSL stock-index volatility forecasting and risk-warning framework.
Figure 2. Rolling walk-forward evaluation protocol. The blue bar denotes the full daily market sample, the brown bar marks the design and model-freeze period, and the gold bar marks the rolling out-of-sample evaluation period. Each step advances the forecast origin by one trading day. The training window (1260 days, shaded) rolls forward to generate the out-of-sample forecast sequence spanning 2016–2025. All feature construction, model fitting, warning-threshold calibration, and evaluation are performed strictly within the boundary of each training window, ensuring full causal integrity.
Figure 3. Quantile-conditioned ΔQLIKE surfaces. Panel (a) reports Wavelet-LightGBM minus HAR averaged across all indices in the full 2016–2025 package. Panel (b) reports Spillover-LGB minus Wavelet-LGB for the pooled DJIA and S&P 500 spillover package. Negative values indicate that the more advanced model improves accuracy.
Figure 4. Time-varying gain paths based on 126-day rolling average ΔQLIKE. Panel (a) reports Wavelet-LGB minus HAR averaged across all indices in the full 2016–2025 package. Panel (b) reports Spillover-LGB minus Wavelet-LGB for the pooled DJIA and S&P 500 spillover package. Negative values indicate that the more advanced model improves accuracy. Shaded regions denote pooled top-decile VIX stress windows.
Figure 5. QLIKE across indices and forecast horizons in the main Rogers–Satchell specification. Lower values indicate better performance.
Figure 6. Winner-and-margin map for the main Rogers–Satchell specification. Each tile reports the winner and its QLIKE margin over the runner-up. Only three model colors appear in the tiles because HV-22 does not win any index–horizon cell in this specification.
Figure 7. Average QLIKE across the main and robustness settings. Lower values indicate better performance.
Figure 8. Cell-level QLIKE differences relative to HAR. Negative values indicate that the nonlinear model outperforms HAR.
Figure 9. Precision–recall comparison for the warning task using the S&P 500 at $h = 1$. The black dotted horizontal line marks the no-skill baseline implied by the event prevalence.
Figure 10. Event-aligned warning diagnostics for the S&P 500 during the 2020 stress period.
Figure 11. Statistical evidence heatmap for each ablation step. (Left): Diebold–Mariano two-tailed test for Step 1 (HAR vs. Wavelet-LGB). (Right): Clark–West one-tailed test for Step 2 (Wavelet-LGB vs. Spillover-LGB). Teal tones indicate that the advanced model wins at the stated significance level; navy tones indicate the simpler model wins. Colour intensity scales with $1 - p$. Asterisks denote statistical significance: *** $p < 0.001$, ** $p < 0.01$, * $p < 0.05$, and . $p < 0.10$.
Figure 12. QLIKE gain ladder per index. Each panel shows the incremental QLIKE gain (positive = improvement over baseline) for Step 1 (amber bars when DM-significant, light grey otherwise) and Step 2 (teal bars when CW-significant, light grey otherwise). Significance codes: *** $p < 0.001$, ** $p < 0.01$, * $p < 0.05$, . $p < 0.10$.
Figure 13. Horizon-dependent SHAP feature attribution (top-12 mean absolute values) for wavelet_lightgbm (S&P 500). Navy bars denote wavelet features; grey bars denote non-wavelet features. The shift from $h = 1$ to $h = 5$ confirms that medium-scale frequency-domain information adds value selectively as the forecast horizon lengthens, consistent with the quantile-conditioned and rolling-window evidence in Figure 3 and Figure 4. (a) $h = 1$: persistence and short-run signals dominate; wavelet features appear further down. (b) $h = 5$: wavelet energy at d2/d3 scales rises sharply in the ranking.
Figure 14. SHAP beeswarm for wavelet_lightgbm (S&P 500, $h = 5$). Each point is one observation; colour encodes the feature value. SWT energy features at the d2/d3 decomposition levels rank among the most influential predictors, and high energy values consistently produce positive SHAP contributions, confirming that medium-scale frequency-domain volatility drives upward forecast revisions.
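To make the d2/d3 energy features in Figures 13 and 14 concrete, the following is a minimal sketch of a causal, non-decimated Haar decomposition with a rolling energy summary. It is an illustration under stated assumptions, not the authors' implementation: the Haar filter, the left-edge padding rule, the 22-day energy window, and the function names `causal_haar_swt` and `local_energy` are all hypothetical choices made for this example.

```python
import numpy as np

def causal_haar_swt(x, levels=3):
    """Causal a trous Haar transform (assumed filter; the paper's may differ).

    Each level averages the current value with the value 2**(j-1) steps in
    the past, so coefficient t depends only on observations up to t.
    """
    approx = np.asarray(x, dtype=float).copy()
    details = []
    for j in range(1, levels + 1):
        lag = 2 ** (j - 1)
        # Pad the left edge with the first value so no future data is used.
        shifted = np.concatenate([np.full(lag, approx[0]), approx[:-lag]])
        smooth = 0.5 * (approx + shifted)
        details.append(approx - smooth)  # detail d_j
        approx = smooth                  # approximation a_j
    return details, approx

def local_energy(d, window=22):
    """Trailing mean of squared coefficients; NaN until the window fills."""
    sq = np.asarray(d, dtype=float) ** 2
    c = np.concatenate([[0.0], np.cumsum(sq)])
    out = np.full(sq.size, np.nan)
    out[window - 1:] = (c[window:] - c[:-window]) / window
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    series = rng.normal(size=300)
    details, approx = causal_haar_swt(series, levels=3)
    d2_energy = local_energy(details[1])  # medium-scale energy feature
    print(d2_energy[-1])
```

Because every coefficient uses only current and past observations, recomputing the transform inside each rolling training window cannot introduce look-ahead through the decomposition itself.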
Table 1. Positioning of the present study relative to representative prior works. ✓ = present; ∘ = partial; — = absent. Within-λ: within-index wavelet decomposition; Cross-λ: cross-index wavelet spillover features; Warning: integrated early-warning layer; Public: all data from public sources; Walk-fwd: rolling walk-forward evaluation; DL base.: deep-learning model included as baseline.
Study | Within-λ | Cross-λ | Warning | Public | Walk-Fwd | DL Base.
Corsi (2008)
Souropanis and Vivian (2023)
Zhuo and Morimoto (2024)
Díaz et al. (2024)
Zhang et al. (2023)
Son et al. (2023)
Song et al. (2023)
Ran et al. (2024)
This study
Table 2. Public data blocks used in the baseline and robustness specifications. The table is organized by predictor block, source, and empirical role in the forecasting pipeline.
Block | Variables/Symbols | Source | Main Role
Equity indices | S&P 500, Nasdaq-100, DJIA | Stooq | Daily OHLC prices and volatility targets
Liquidity proxies | SPY, QQQ, DIA | Stooq | ETF-based volume and trading-activity proxies
Implied volatility | VIXCLS, VXNCLS, VXDCLS | FRED (CBOE series) | Index-specific forward-looking risk proxies
Rates and conditions | DFF, DGS10, T10Y3M, NFCI, USRECD | FRED | Macro-financial and regime controls
Expanded public risk data | DGS2, BAMLH0A0HYM2, BAMLC0A0CM, USEPUINDXD, RVXCLS, VXVCLS | FRED | Credit, policy-uncertainty, and volatility-term-structure robustness block
Table 3. Core notation used in the methodological formulation.
Symbol | Category | Meaning
$i$ | Cross-sectional index | Market index identifier (SPX, NDQ, or DJI)
$t$ | Time index | Forecast origin on trading day $t$
$h$ | Forecast horizon | Prediction horizon in trading days (1, 5, or 10)
$y_{i,t}^{(h)}$ | Forecast target | Future average volatility over the next $h$ trading days
$\bar{y}_{i,t}^{(h)}$ | Transformed target | Log-transformed target used for model fitting
$x_{i,t}^{\mathrm{HAR}}$ | Feature block | Daily, weekly, and monthly persistence features
$x_{i,t}^{\mathrm{raw}}$ | Feature block | Technical, implied-volatility, and macro-financial predictors
$x_{i,t}^{\mathrm{wav}}$ | Feature block | Causal wavelet summaries from volatility and risk series
$\tilde{y}_{i,t}^{(h)}$ | Calibrated forecast | Post-floor volatility prediction used for evaluation and warning linkage
$\tau_{i,t}^{(h)}$ | Warning threshold | Rolling in-window 90th percentile for defining high-risk states
$\hat{p}_{i,t}^{(h)}$ | Warning score | Predicted high-risk probability from the logistic warning layer
Table 4. Leakage-free pseudo-code of the rolling multiscale forecasting and warning procedure.
Step | Operation
1 | For each index $i$ and horizon $h$, define a rolling training window of 1260 trading days and a one-step-ahead out-of-sample forecast origin.
2 | Construct the OHLC-based volatility target $y_{i,t}^{(h)}$, persistence terms, technical indicators, implied-volatility proxies, and macro-financial predictors using only data available up to time $t$.
3 | Apply the causal non-decimated wavelet transform to selected high-value series and compute multiscale summaries such as recent coefficients, short-window means, standard deviations, and local energy terms.
4 | Form the feature vector $x_{i,t} = [x_{i,t}^{\mathrm{HAR}}, x_{i,t}^{\mathrm{raw}}, x_{i,t}^{\mathrm{wav}}]$ and estimate the horizon-specific boosting model $f_h(\cdot)$ on the transformed training target.
5 | Generate the raw forecast $\hat{y}_{i,t}^{(h)}$ and apply the horizon-specific lower-tail calibration safeguard to obtain the final regression forecast $\tilde{y}_{i,t}^{(h)}$.
6 | Compute the rolling high-risk threshold $\tau_{i,t}^{(h)}$, estimate the logistic warning model on the training window, and choose the decision threshold that maximizes the training-window $F_\beta$ score with $\beta = 2$.
7 | Output the calibrated volatility forecast, the warning probability, and the warning label for the current out-of-sample date, then roll the window forward and repeat.
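The rolling logic of Steps 1, 2, 5, and 6 can be sketched in a few lines. This is a deliberately reduced illustration, assuming a linear HAR-style regression on log volatility in place of the boosting model, a training-quantile floor as a stand-in for the lower-tail calibration safeguard, and a simple forecast-above-threshold rule in place of the logistic warning layer. The names `har_features`, `walk_forward`, `floor_q`, and `warn_q` are hypothetical.

```python
import numpy as np

def har_features(log_v, t):
    """Intercept plus daily, 5-day, and 22-day means, using data up to t only."""
    return np.array([1.0, log_v[t],
                     log_v[t - 4:t + 1].mean(),
                     log_v[t - 21:t + 1].mean()])

def walk_forward(v, h=5, window=1260, floor_q=0.01, warn_q=0.90):
    """Rolling forecast of the mean of v over the next h days.

    floor_q is an assumed stand-in for the paper's lower-tail safeguard;
    warn_q mirrors the in-window 90th-percentile warning threshold.
    """
    v = np.asarray(v, dtype=float)
    log_v = np.log(v)
    results = []
    for t in range(window + 21 + h, len(v) - h):
        # Training origins s satisfy s + h <= t, so every training target
        # is fully observed at the forecast origin t.
        origins = np.arange(t - h - window + 1, t - h + 1)
        X = np.array([har_features(log_v, s) for s in origins])
        y = np.array([v[s + 1:s + h + 1].mean() for s in origins])
        beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
        raw = float(np.exp(har_features(log_v, t) @ beta))
        forecast = max(raw, float(np.quantile(y, floor_q)))  # lower-tail floor
        tau = float(np.quantile(y, warn_q))                  # in-window threshold
        results.append((t, forecast, tau, forecast > tau))
    return results

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vol = np.exp(rng.normal(-9.0, 0.5, size=400))
    for row in walk_forward(vol, h=5, window=250)[:3]:
        print(row)
```

The key design point is the offset between the last training origin and the forecast origin: with an $h$-day target, the most recent $h$ origins have targets that are not yet observable and must be excluded from the fit.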
Table 5. Main regression QLIKE across indices and forecast horizons under the Rogers–Satchell target. Best, second-best, and third-best values within each index–horizon cell are marked by gold shading with bold type, peach shading with underline, and blue-gray shading, respectively.
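Index | Model | $h = 1$ | $h = 5$ | $h = 10$
S&P 500 | HAR | −9.0316 | −8.9372 | −8.8682
S&P 500 | HV-22 | −8.9405 | −8.8601 | −8.7836
S&P 500 | LightGBM | −8.7386 | −8.9022 | −8.8749
S&P 500 | Wavelet-LightGBM | −8.7327 | −8.8937 | −8.8603
Nasdaq-100 | HAR | −8.5635 | −8.4940 | −8.4316
Nasdaq-100 | HV-22 | −8.5049 | −8.4472 | −8.3873
Nasdaq-100 | LightGBM | −8.4476 | −8.5125 | −8.4621
Nasdaq-100 | Wavelet-LightGBM | −8.3827 | −8.5026 | −8.4923
DJIA | HAR | −9.0021 | −8.9139 | −8.8548
DJIA | HV-22 | −8.9398 | −8.8619 | −8.7869
DJIA | LightGBM | −8.5186 | −8.8111 | −8.8636
DJIA | Wavelet-LightGBM | −8.2222 | −8.8649 | −8.8934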
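For readers who want to reproduce the flavour of these numbers, the sketch below gives the standard Rogers–Satchell and Parkinson range estimators together with one common QLIKE form. The estimator formulas are the textbook definitions; the specific QLIKE form $\log \hat{\sigma}^2 + \sigma^2 / \hat{\sigma}^2$ is an assumption here, chosen because it produces strongly negative values for small daily variances, and the exact loss used in the paper may differ. All function names are hypothetical.

```python
import math

def rogers_satchell_var(o, h, l, c):
    """Rogers-Satchell variance estimate from one open/high/low/close bar."""
    return (math.log(h / c) * math.log(h / o)
            + math.log(l / c) * math.log(l / o))

def parkinson_var(h, l):
    """Parkinson variance estimate from the high-low range."""
    return math.log(h / l) ** 2 / (4.0 * math.log(2.0))

def qlike(actual_var, forecast_var):
    """Assumed QLIKE form: log(forecast) + actual / forecast (lower is better)."""
    return math.log(forecast_var) + actual_var / forecast_var

def mean_qlike(actuals, forecasts):
    """Average QLIKE over paired realizations and forecasts."""
    losses = [qlike(a, f) for a, f in zip(actuals, forecasts)]
    return sum(losses) / len(losses)

if __name__ == "__main__":
    rs = rogers_satchell_var(100.0, 101.2, 99.1, 100.6)
    pk = parkinson_var(101.2, 99.1)
    print(rs, pk, qlike(rs, pk))
```

Under this form the loss is minimized when the forecast equals the realized variance, and over-prediction is penalized less severely than under-prediction of the same proportional size.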
Table 6. Average QLIKE across the main and robustness settings. Ranking is applied within each row: gold shading with bold type marks the best model, peach shading with underline marks the second-best model, and blue-gray shading marks the third-best model.
Setting | HAR | LightGBM | Wavelet-LightGBM
Main RS | −8.7885 | −8.6812 | −8.6494
Expanded Data | −9.5949 | −9.5641 | −9.5704
Parkinson | −9.3019 | −9.3091 | −9.3143
Table 7. Main warning results under the Rogers–Satchell target. PR-AUC is maximized and Brier score is minimized within each index–horizon cell. Gold shading with bold type marks the best value, and peach shading with underline marks the second-best value.
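Index | Model | PR-AUC ($h = 1$) | Brier ($h = 1$) | PR-AUC ($h = 5$) | Brier ($h = 5$) | PR-AUC ($h = 10$) | Brier ($h = 10$)
S&P 500 | Naive Threshold | 0.4895 | 0.0948 | 0.5041 | 0.0985 | 0.4474 | 0.1041
S&P 500 | Logistic-Raw | 0.5275 | 0.1327 | 0.5048 | 0.1271 | 0.5530 | 0.1143
Nasdaq-100 | Naive Threshold | 0.4652 | 0.1072 | 0.5141 | 0.1090 | 0.4496 | 0.1148
Nasdaq-100 | Logistic-Raw | 0.5496 | 0.1418 | 0.5960 | 0.1258 | 0.4852 | 0.1399
DJIA | Naive Threshold | 0.4733 | 0.0954 | 0.4882 | 0.0972 | 0.4712 | 0.0998
DJIA | Logistic-Raw | 0.4900 | 0.1402 | 0.5782 | 0.1069 | 0.4672 | 0.1112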
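The warning metrics above pair a ranking measure (PR-AUC) with a calibration measure (Brier score), while the decision threshold is tuned on the training window by maximizing $F_\beta$ with $\beta = 2$. A minimal sketch of the threshold search and the Brier score follows; the candidate grid and the function names `f_beta`, `best_threshold`, and `brier` are assumptions for illustration, not the authors' code.

```python
import numpy as np

def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score from binary labels; beta > 1 weights recall over precision."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return 0.0 if denom == 0 else float((1 + beta ** 2) * tp / denom)

def best_threshold(y_true, prob, beta=2.0, grid=None):
    """Pick the cutoff on a (hypothetical) grid that maximizes training F-beta."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid)
    prob = np.asarray(prob)
    scores = [f_beta(y_true, (prob >= c).astype(int), beta) for c in grid]
    return float(grid[int(np.argmax(scores))])

def brier(y_true, prob):
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((np.asarray(prob) - np.asarray(y_true)) ** 2))

if __name__ == "__main__":
    labels = np.array([0, 0, 1, 1, 0, 1])
    scores = np.array([0.12, 0.42, 0.37, 0.82, 0.22, 0.64])
    cut = best_threshold(labels, scores)
    print(cut, f_beta(labels, (scores >= cut).astype(int)), brier(labels, scores))
```

With $\beta = 2$ a missed event costs more than a false alarm, so the selected cutoff tends to sit low. That is one plausible reading of why a model can improve PR-AUC while posting a worse Brier score and more false-alarm days.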
Table 8. Event-based warning timing summary using a five-day pre-event detection window. Hit rate and median lead are maximized; false-alarm days per event are minimized within each horizon. Gold shading with bold type marks the best value within each horizon and metric.
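Horizon | Model | Hit Rate | Median Lead | FA/Event | Events
$h = 1$ | Logistic-Raw | 0.821 | 5.000 | 5.149 | 460
$h = 1$ | Naive-Threshold | 0.613 | 4.000 | 1.005 | 460
$h = 5$ | Logistic-Raw | 0.597 | 5.000 | 16.558 | 134
$h = 5$ | Naive-Threshold | 0.649 | 3.833 | 3.641 | 134
$h = 10$ | Logistic-Raw | 0.602 | 5.000 | 37.330 | 81
$h = 10$ | Naive-Threshold | 0.556 | 4.333 | 6.463 | 81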
Table 9. Three-step ablation test results. Columns report mean QLIKE for each model, the Diebold–Mariano statistic and p-value for Step 1 (HAR vs. Wavelet-LGB) and Step 2 (Wavelet-LGB vs. Spillover-LGB), and the Clark–West statistic and p-value for the nested Step-2 comparison. DM tests are two-tailed with Newey–West HAC standard errors; the CW test is one-tailed ($H_0$: spillover carries no incremental information). Significance codes: *** $p < 0.001$, ** $p < 0.01$, * $p < 0.05$, . $p < 0.10$. QLIKE convention: lower values indicate better forecasting accuracy.
Index | $h$ | QLIKE HAR | QLIKE Wav-LGB | QLIKE Spill-LGB | Step 1 DM $t$ | Step 1 DM $p$ | Step 2 DM $t$ | Step 2 DM $p$ | Step 2 CW $t$ (nested) | Step 2 CW $p$
DJIA | 1 | −9.0021 | −8.1817 | −8.2476 | 3.666 | 0.0002 *** | 0.317 | 0.7515 | 1.749 | 0.0401 *
DJIA | 5 | −8.9139 | −8.6564 | −8.7303 | 2.356 | 0.0185 * | 1.427 | 0.1536 | 2.613 | 0.0045 **
DJIA | 10 | −8.8548 | −8.9037 | −8.8932 | 1.522 | 0.1280 | 0.812 | 0.4166 | 1.394 | 0.0816 .
Nasdaq-100 | 1 | −8.5635 | −8.2771 | −8.2892 | 4.881 | <0.0001 *** | 0.207 | 0.8363 | 0.293 | 0.3847
Nasdaq-100 | 5 | −8.4940 | −8.5154 | −8.5027 | 1.308 | 0.1909 | 0.983 | 0.3256 | 1.521 | 0.9358
Nasdaq-100 | 10 | −8.4316 | −8.5149 | −8.5103 | 3.417 | 0.0006 *** | 0.625 | 0.5318 | 0.977 | 0.1644
S&P 500 | 1 | −9.0316 | −8.8211 | −8.8841 | 2.637 | 0.0084 ** | 1.353 | 0.1761 | 2.077 | 0.0189 *
S&P 500 | 5 | −8.9372 | −8.8741 | −8.8444 | 1.046 | 0.2957 | 0.548 | 0.5838 | 1.945 | 0.0259 *
S&P 500 | 10 | −8.8682 | −8.8970 | −8.9141 | 0.772 | 0.4402 | 1.199 | 0.2307 | 4.827 | <0.0001 ***
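The nested comparison in the last two columns can be illustrated with the textbook Clark–West adjusted-MSPE statistic. The sketch below is an assumption-laden illustration rather than the authors' procedure: it uses squared-error loss, a Bartlett-kernel Newey–West variance with $h - 1$ lags, and a standard-normal one-sided p-value. The names `newey_west_se`, `one_sided_p`, and `clark_west` are hypothetical.

```python
import math
import numpy as np

def newey_west_se(x, lags):
    """HAC standard error of the sample mean using Bartlett weights."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    lrv = d @ d / n
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1.0)
        lrv += 2.0 * weight * (d[k:] @ d[:-k]) / n
    return math.sqrt(lrv / n)

def one_sided_p(t_stat):
    """Upper-tail standard-normal p-value, P(Z > t)."""
    return 0.5 * math.erfc(t_stat / math.sqrt(2.0))

def clark_west(y, f_small, f_large, h=1):
    """Clark-West test that the larger nested model adds predictive content.

    The adjustment term (f_small - f_large)**2 removes the noise that the
    extra parameters inject into the larger model's squared errors.
    """
    y, f_small, f_large = (np.asarray(a, dtype=float)
                           for a in (y, f_small, f_large))
    adj = (y - f_small) ** 2 - ((y - f_large) ** 2 - (f_small - f_large) ** 2)
    t_stat = adj.mean() / newey_west_se(adj, lags=h - 1)
    return t_stat, one_sided_p(t_stat)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.normal(size=500)
    target = signal + rng.normal(size=500)
    print(clark_west(target, np.zeros(500), signal, h=1))
    print(one_sided_p(2.613))  # compare with the tabulated one-tailed p-values
```

Because the test is one-tailed, a large positive statistic can be significant even when the two models' mean losses are close, which is why the CW and DM columns can disagree for the same cell.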
Table 10. QLIKE comparison of benchmark models and the HAR-LSTM deep-learning baseline under the same 1260-day rolling walk-forward protocol. The LSTM is trained on the three HAR features (1-, 5-, and 22-day realized variance) with 16 hidden units and chronological early stopping. Best values per index–horizon cell are marked by gold shading with bold type; second-best values are marked by peach shading with underline. Italics identify model names.
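Index | Model | $h = 1$ | $h = 5$ | $h = 10$
S&P 500 | HAR | 0.5204 | 0.3168 | 0.3085
S&P 500 | HV-22 | 0.6122 | 0.3939 | 0.3932
S&P 500 | LightGBM | 0.8149 | 0.3518 | 0.3019
S&P 500 | wavelet_lightgbm | 0.8210 | 0.3603 | 0.3165
S&P 500 | LSTM | 0.6072 | 0.3946 | 0.3510
Nasdaq-100 | HAR | 0.4798 | 0.2763 | 0.2733
Nasdaq-100 | HV-22 | 0.5386 | 0.3231 | 0.3176
Nasdaq-100 | LightGBM | 0.5955 | 0.2578 | 0.2428
Nasdaq-100 | wavelet_lightgbm | 0.6607 | 0.2677 | 0.2126
Nasdaq-100 | LSTM | 0.5176 | 0.3097 | 0.2910
DJIA | HAR | 0.5169 | 0.3144 | 0.3048
DJIA | HV-22 | 0.5795 | 0.3665 | 0.3727
DJIA | LightGBM | 1.0023 | 0.4173 | 0.2960
DJIA | wavelet_lightgbm | 1.3000 | 0.3634 | 0.2662
DJIA | LSTM | 0.5957 | 0.3653 | 0.3430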
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, H.; Shen, Y.; Jiang, A. Public-Data Causal Multiscale Wavelet Spillover Learning for Stock Index Volatility Forecasting and Risk Early Warning. Risks 2026, 14, 129. https://doi.org/10.3390/risks14060129
