Crude Oil Shocks and Saudi Stock Returns: An Integrated Granger–LSTM–XGBoost Analysis

Aggarwal, Priyanka; Danila, Nevi; Suprihadi, Eddy; Manish, Manoj Kumar

doi:10.3390/forecast8020019

¹ Finance Department, College of Business Administration, Prince Sultan University, Riyadh 11586, Saudi Arabia

² Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), Batu Pahat 86400, Johor, Malaysia

³ Wholesale Credit & Market Risk, Hongkong and Shanghai Banking Corporation Limited, RMZ Futura, 148/I, Bannerghatta Road, Bilekahalli, Bengaluru 560076, Karnataka, India

* Author to whom correspondence should be addressed.

Forecasting 2026, 8(2), 19; https://doi.org/10.3390/forecast8020019

Submission received: 26 January 2026 / Revised: 17 February 2026 / Accepted: 22 February 2026 / Published: 24 February 2026

(This article belongs to the Special Issue Advanced Forecasting in an Era of Uncertainty and Its Impact on Strategic Investment Decisions)

Highlights

What are the main findings?

Structural breaks significantly alter the relationship between oil shocks, macroeconomic variables, and Saudi stock market returns.
Machine learning models, particularly XGBoost, consistently outperform traditional econometric benchmarks across regimes and forecast horizons.

What are the implications of the main findings?

Regime-aware forecasting is essential for capturing oil–equity dynamics in oil-dependent emerging markets.
XGBoost-based forecasts provide superior economic value, yielding higher risk-adjusted returns for investors and policymakers.

Abstract

This study investigates regime-dependent forecasting of the Saudi stock market by combining macro-controlled dependence analysis with nonlinear predictive modeling. Using daily data from September 2010 to August 2025, we analyze the interaction between the Tadawul All Share Index (TASI) returns and crude oil returns while controlling for inflation and interest-rate dynamics. A four-variable VAR with macro controls is estimated separately in pre- and post-COVID regimes to characterize directional predictability and changes in transmission lags. We then evaluate out-of-sample return forecasting performance across econometric benchmarks (ARIMA, ARIMAX, and VAR) and machine learning models (LSTM and XGBoost) under a strictly time-ordered expanding-window design with sequential train/validation/test partitioning. The results indicate that traditional linear benchmarks exhibit limited predictive ability in both regimes, with negative out-of-sample explanatory power. By contrast, XGBoost delivers the strongest overall performance, achieving positive out-of-sample R² in both regimes (0.046 in pre-COVID and 0.010 in post-COVID), together with the lowest forecast errors (RMSE = 0.0081 pre-COVID; 0.0078 post-COVID). Interpretability analysis further reveals a regime-sensitive shift in drivers: short-horizon equity lag dynamics dominate during stable periods, whereas oil-related and macro-financial variables gain importance under turbulent conditions. Economic-value evaluation supports the practical relevance of these gains, showing that XGBoost-based signals yield superior risk-adjusted trading outcomes and remain favorable under downside-risk and drawdown-based assessment. Overall, these findings highlight that forecasting in oil-linked emerging markets is inherently regime-dependent and that nonlinear ensemble learners, particularly XGBoost, provide a more robust and economically meaningful approach under structural change.

Keywords:

Saudi stock market; crude oil shocks; structural breaks; Granger causality; LSTM; XGBoost

1. Introduction

Financial forecasting in oil-dependent economies is inherently exposed to structural instability because asset prices are jointly shaped by global commodity shocks, domestic macroeconomic conditions, and shifts in policy regimes. This challenge is particularly relevant for Saudi Arabia, where equity-market dynamics have historically been intertwined with crude oil movements and where the oil–stock transmission mechanism is expected to evolve under major disruptions. Empirically, prior evidence for Saudi and related oil-linked contexts consistently supports economically meaningful oil–equity linkages, although the direction and strength of these linkages can vary across periods and market conditions [1,2,3,4,5]. The COVID-19 shock further amplified volatility and may have altered cross-market feedback, implying that forecasting accuracy and model reliability cannot be assessed under a single stable data-generating process.

Despite extensive research on the oil–stock relationship, a key empirical gap remains in how dependence diagnostics are operationalized into predictive model design under regime change. Many studies examine causality or dependence as a preliminary step and then proceed to forecasting using econometric or machine learning models as a separate stage, leaving unclear how evidence from causality tests should formally constrain the specification of nonlinear learners. As a result, integration is often sequential rather than structurally coupled, and model inputs, lag structures, and feature sets can appear ad hoc when dependence mechanisms differ across regimes. This limitation becomes more consequential during crisis episodes, where oil shocks, macro conditions, and policy dynamics can jointly reshape market drivers and the effective information set available for forecasting.

In parallel, recent advances in machine learning for financial forecasting emphasize flexible nonlinear learners that can capture interaction effects and regime-sensitive predictors. Ensemble-based methods, in particular, have gained prominence because they can model nonlinear dependencies and adapt to changing feature relevance under structural shifts [6,7]. Moreover, empirical evidence suggests that machine learning models can uncover regime-sensitive determinants that are not well summarized by static relationships, motivating explicit attention to time variation in drivers and predictor importance [8,9]. However, most regime-aware applications still do not clearly formalize how dependence evidence (e.g., lag length and directional predictability) should translate into machine learning specification choices.

Motivated by these gaps, this study develops a regime-aware empirical framework for forecasting the Saudi equity market while explicitly linking macro-controlled dependence evidence to predictive specification. The first objective is to characterize the direction and persistence of predictive linkages between crude oil and the Saudi stock market under macroeconomic controls and to determine whether these linkages evolve across pre- and post-COVID regimes [1,2,3,4,5]. The second objective is to evaluate whether nonlinear machine learning models improve forecasting accuracy and economic relevance relative to standard econometric benchmarks under structural breaks, consistent with the growing role of modern ML ensembles in financial prediction [6,7]. The third objective is to formalize how regime-specific dependence evidence can guide the specification of machine learning inputs—particularly lag structures and predictor sets—thereby reducing arbitrary design choices and strengthening interpretability in regime-shifting environments [8,9].

Finally, it is worth noting that emerging neural forecasting designs increasingly explore multiscale representations and self-similar structures, including fractal-inspired architectures and FractalNet-LSTM variants, which may offer an alternative perspective on capturing time-series complexity [10,11]. While these architectures are beyond the core comparative scope of this study, they motivate future work on integrating regime-aware dependence diagnostics with more structurally expressive neural models.

The rest of this paper provides a literature review in Section 2. Section 3 elaborates on data and methodology. Section 4 presents the results and discussion, while Section 5 concludes the study.

2. Literature Review

Studies of the correlation between oil prices and stock market returns have produced mixed results, which depend on whether oil price shocks are demand-driven or supply-driven [12,13]. Demand-driven shocks are reported to reduce stock prices, implying a negative connection between oil and stock prices, and they exert a greater influence on the stock market than supply-side shocks. Moreover, [14] suggested that demand shocks have a less significant impact on stock returns in the Canadian market than in the Australian market. Nevertheless, supply shocks are the dominant driver for oil-importing countries in Europe [15].

For oil-exporting nations, the relationship between the variables is positive. An increase in the demand for oil leads to a corresponding increase in the demand for domestic currencies, resulting in higher profits and ultimately driving up stock prices [13,16]. Moreover, [17] suggested that oil price volatility affects stock returns indirectly, through firm-level cash flows, financing costs, and shifts in investor sentiment. Oil-importing countries experience the opposite situation: a rise in oil costs weakens company performance, ultimately reducing stock values. Ref. [18] reported that oil price changes had an asymmetric effect on the stock markets of oil-importing countries, using a nonlinear panel ARDL technique.

Regarding Saudi Arabia and the GCC countries, several studies have reached different conclusions. Within the GCC region, only the markets of Qatar, Oman, the UAE, and Saudi Arabia show a substantial positive correlation between oil prices and stock index prices, while the other nations do not exhibit this relationship [19]. Studies applying different methodologies to the Saudi Arabian market, such as quantile regression and Granger causality methods, reach similar conclusions [2,3,4,5,20]. Nevertheless, [21] reported no relationship between oil prices and stock prices. Instead, the NYSE emerged as the primary indicator for predicting the movement of the Saudi Arabian stock market (TASI) and oil prices. Additionally, [22] proposed that the relationship between the two variables holds exclusively in the long term. There is currently no empirical evidence to suggest that the oil price has any impact on the TASI in the short term.

More recently, [23] argued that the direction of causality between oil prices and stock market returns reversed during COVID-19. Furthermore, [24,25] confirmed that COVID-19 triggered sharp volatility shifts and contagion effects in global financial markets, including emerging economies. In Saudi Arabia, oil shocks intensified spillover channels between the oil and stock markets, increasing systemic vulnerabilities [26]. These findings suggest the need for regime-sensitive approaches to studying the relationship between oil prices and stock market returns.

Turning to the forecasting of stock market returns, econometric models such as ARIMA remain popular for modeling autocorrelation and short-term persistence, but they lose predictive power under instability and nonlinearity [27,28]. VAR models allow for joint dynamics of oil and stock returns with macroeconomic controls, but their assumption of linearity reduces their relevance during structural breaks. Combining ARIMA-family methods with machine learning has been proposed to produce more robust forecasts across horizons [29].

Machine learning models such as long short-term memory (LSTM) networks are designed to address long-memory dependencies. The LSTM model is well suited to forecasting stock returns when investor sentiment and policy interventions are incorporated [30], while ensemble tree methods, particularly XGBoost, are valued for their robustness against noise, computational efficiency, and interpretability via feature importance rankings [6]. Empirical studies emphasize the ability of XGBoost to identify core determinants of asset prices [8] and to provide improved accuracy when combined with feature selection and preprocessing methods [9]. Furthermore, [7] argued that XGBoost preserves its predictive accuracy and supplies interpretable insights during volatility clustering and regime shifts.

Although significant advances have been made, gaps remain. Much of the existing literature examines either causality or forecasting in isolation, leaving limited integration between the two. Structural break dynamics—particularly those introduced by COVID-19—have been widely documented, but relatively few studies assess how these shocks reshape oil–equity causality in oil-dependent emerging markets such as Saudi Arabia. Moreover, while machine learning has demonstrated superior statistical performance, the economic value of forecasts in terms of trading profitability and risk-adjusted returns is seldom evaluated. These gaps provide the rationale for this study, which develops an integrated Granger–LSTM–XGBoost framework to jointly capture causal linkages, regime-dependent forecasting accuracy, and the practical utility of forecasts.

3. Methods

3.1. Research Design

The research design is shown in Figure 1. The framework starts with the collection of daily data on the Saudi stock index (TASI), the crude oil price, the domestic inflation level, and the interest rate over 2010–2025. These variables are chosen to capture financial dynamics and economic influences in an oil-dependent developing market. To account for structural instability, the sample is divided into two regimes, pre- and post-COVID, based on a formal breakpoint detection procedure. Causality analysis is then conducted using the VAR (Granger causality) model. For each regime, the framework applies two classes of forecasting models. Conventional econometric approaches (ARIMA, ARIMAX, and VAR) serve as baselines that capture linear dependence and short-term autoregressive structure. Machine learning methods (LSTM and XGBoost) are applied to model nonlinear interactions, volatility clustering, and regime behavior, which econometric models often miss. This dual design allows a direct comparison between the statistical efficiency of the parametric benchmarks and the adaptive capacity of data-driven methods. The evaluation covers two complementary dimensions: statistical and economic.

From the statistical perspective, forecasting performance is measured by RMSE, MAE, and R² to assess accuracy relative to a naive baseline. From the economic perspective, forecasts are used in a trading rule to evaluate directional accuracy, the Sharpe ratio, and cumulative return. The dual evaluation ensures that the model comparison not only has academic value but also gives practical insight for investors, regulators, and policymakers.

3.2. Data Collection and Pre-Processing

The empirical analysis uses daily data covering the September 2010 to August 2025 period, incorporating both financial and macroeconomic variables relevant to Saudi Arabia. The daily Tadawul All Share Index (TASI) is employed as the primary indicator of equity market performance, while the daily Brent crude oil price is included to capture exposure to global commodity shocks. To reflect domestic macroeconomic conditions, monthly inflation and monthly interest rate series are incorporated as proxies for price stability and the monetary policy stance.

Because the financial variables are observed at a daily frequency, whereas the macroeconomic indicators are typically reported monthly, a frequency-matching procedure is implemented to align all variables on a common daily calendar prior to estimation. Specifically, each monthly macro-observation is carried forward and assigned to all trading days within the corresponding month (i.e., a stepwise daily series), ensuring that the daily VAR and forecasting models incorporate the information set that would have been available to market participants during that month. We acknowledge that such alignment necessarily reduces within-month variation in macro predictors and may attenuate very short-horizon causal responses; therefore, the interpretation of macro-driven dynamics is treated as reflecting low-frequency informational effects rather than intramonth shocks.
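The frequency-matching step described above can be illustrated with a minimal sketch. The function name `align_monthly_to_daily`, the dictionary layout, and the inflation values are hypothetical and are not taken from the paper's data or code; the sketch only shows how a monthly observation becomes a stepwise daily series, with a carry-forward for months that have no observation yet.

```python
from datetime import date

def align_monthly_to_daily(trading_days, monthly_values):
    """Assign each trading day the macro value of its calendar month.

    monthly_values maps (year, month) -> value. A day whose month has no
    observation inherits the most recent earlier value (carry-forward).
    """
    aligned = []
    last_seen = None
    for day in sorted(trading_days):
        key = (day.year, day.month)
        if key in monthly_values:
            last_seen = monthly_values[key]
        aligned.append((day, last_seen))
    return aligned

# Hypothetical monthly inflation readings (percent), for illustration only.
inflation = {(2020, 1): 0.7, (2020, 2): 1.2}
days = [date(2020, 1, 30), date(2020, 2, 2), date(2020, 2, 3), date(2020, 3, 1)]
daily_inflation = align_monthly_to_daily(days, inflation)
```

Every trading day in February receives the February value, and the March day falls back to the latest available reading, which is why within-month variation in the macro predictors is zero by construction.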

All financial series are transformed into continuously compounded log returns (first differences of natural logarithms) to improve stationarity and comparability across variables. For inflation and interest rates, first differences are applied where required to achieve stationarity, while the level series are retained for descriptive interpretation and economic discussion where appropriate.

3.3. Econometric Benchmark Models

This study uses three conventional econometric models, ARIMA, ARIMAX, and VAR (Granger Causality), as baseline forecasts. These models have a parsimonious (simple) structure and are easy to interpret for time series forecasting. Despite their known limitations in high volatility regimes, they remain useful as benchmarks.

The Autoregressive Integrated Moving Average (ARIMA) model, introduced by Box and Jenkins [31], remains one of the most frequently applied approaches in financial forecasting because of its ability to capture persistence and short-term dependencies [27]. Its general representation is given by

ϕ (L) (1 - L)^{d} y_{t} = θ (L) ε_{t}

(1)

where ϕ (L) = 1 - ϕ_{1} L - \dots - ϕ_{p} L^{p} is the autoregressive polynomial, θ (L) = 1 - θ_{1} L - \dots - θ_{q} L^{q} is the moving-average polynomial, d denotes the order of differencing, and ε_{t} is a white-noise disturbance. However, its reliance on linear dynamics restricts its performance when nonlinearity and volatility clustering dominate.

To incorporate external drivers, the ARIMAX model extends ARIMA by including exogenous regressors, allowing the influence of oil returns, inflation, and interest rates to be captured alongside autoregressive dynamics of TASI returns. The ARIMAX specification can be expressed as

y_{t} = α + \sum_{i = 1}^{p} ϕ_{i} y_{t - i} + \sum_{j = 1}^{q} θ_{j} ε_{t - j} + γ X_{t} + ε_{t}

(2)

where X_{t} represents a vector of exogenous predictors. ARIMAX has been widely adopted in macro-finance forecasting, particularly in contexts where commodity shocks directly affect equity markets [28].

The Vector Autoregression (VAR) framework, proposed by [32] and formalized by [33], is used to model the joint dynamics of multiple endogenous variables. Its general form is

Z_{t} = A_{1} Z_{t - 1} + \dots + A_{p} Z_{t - p} + ε_{t}

(3)

where Z_{t} = [y_{t}, o_{t}, c_{t}, r_{t}] is the vector of endogenous variables consisting of Saudi stock returns, crude oil returns, inflation, and interest rates, A_{i} are coefficient matrices, and ε_{t} is a vector of innovations. In this study, VAR captures the feedback loops between Saudi stock returns, crude oil shocks, and macroeconomic variables. While VAR has been instrumental in identifying transmission channels in finance, its parameter intensity and reliance on linear interdependencies often limit robustness during crisis regimes.
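To make the link between Equation (3) and the Granger causality analysis concrete, the stock-return row of the VAR can be written out as below. The element notation a_{1m}^{(i)} for the (1, m) entry of A_{i} is an assumption introduced here for illustration and does not appear in the original text.

```latex
% Stock-return row of Equation (3); a_{1m}^{(i)} denotes the (1, m) element of A_i
y_{t} = \sum_{i=1}^{p} \left( a_{11}^{(i)} y_{t-i} + a_{12}^{(i)} o_{t-i}
      + a_{13}^{(i)} c_{t-i} + a_{14}^{(i)} r_{t-i} \right) + \varepsilon_{1,t}

% Null hypothesis: crude oil returns do not Granger-cause stock returns,
% conditional on lagged inflation and interest rates
H_{0}: a_{12}^{(1)} = a_{12}^{(2)} = \dots = a_{12}^{(p)} = 0
```

Rejecting H_{0} with a joint Wald or F test on the p oil-lag coefficients indicates that lagged oil returns carry predictive content for stock returns beyond the macro controls. The reverse direction is tested analogously on the oil-return row.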

To capture regime-dependent dynamics beyond fixed subsample splits, we additionally consider a Markov-switching specification in which the data-generating process alternates between latent states representing tranquil and turbulent market conditions. Let s_{t} \in \{1, 2\} denote the unobserved regime at time t, governed by a first-order Markov chain with transition probabilities P (s_{t} = j | s_{t - 1} = i) = p_{i j}. Conditional on s_{t}, the return process follows a state-dependent mean and volatility structure, allowing both the conditional expectation and the variance to differ across regimes. Parameters are estimated via maximum likelihood using the Hamilton filter, and smoothed regime probabilities are used to characterize the timing and persistence of crisis versus non-crisis episodes. This model serves as an econometric benchmark for regime variation and provides a probabilistic alternative to deterministic regime partitioning.
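The filtering recursion behind this specification can be sketched as follows, assuming Gaussian state-conditional densities and taking the parameters as given. All numerical values are illustrative and are not estimates from the paper. In practice the parameters would be chosen to maximize the returned log-likelihood, and a backward smoothing pass, which is omitted here, would produce the smoothed probabilities.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def hamilton_filter(returns, means, stds, trans, init_probs):
    """Filtered probabilities P(s_t = j | r_1..r_t) and the log-likelihood.

    trans[i][j] = P(s_t = j | s_{t-1} = i).
    """
    n_states = len(means)
    prev = list(init_probs)
    filtered, loglik = [], 0.0
    for r in returns:
        # Prediction step: P(s_t = j | past) = sum_i prev[i] * p_ij
        predicted = [sum(prev[i] * trans[i][j] for i in range(n_states))
                     for j in range(n_states)]
        # Update step: weight each state by its conditional density of r_t
        joint = [predicted[j] * normal_pdf(r, means[j], stds[j])
                 for j in range(n_states)]
        lik = sum(joint)
        loglik += math.log(lik)
        prev = [v / lik for v in joint]
        filtered.append(prev)
    return filtered, loglik

# Illustrative parameters: state 0 = tranquil, state 1 = turbulent.
means = [0.0005, -0.001]
stds = [0.006, 0.03]
trans = [[0.98, 0.02], [0.05, 0.95]]
returns = [0.001, -0.002, 0.003, -0.06, 0.05, -0.04]
probs, loglik = hamilton_filter(returns, means, stds, trans, [0.5, 0.5])
```

Small returns keep most of the probability mass on the tranquil state, while the large negative return on the fourth day shifts almost all of it to the turbulent state.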

3.4. Machine Learning Models

This study uses two machine learning models, LSTM and XGBoost, as a complement to econometric benchmarks. LSTM and XGBoost are widely recognized methods in financial forecasting. Both methods were chosen because they overcome the weaknesses of linear parametric approaches. They are capable of capturing nonlinear dynamics, volatility clustering, and complex interactions between predictor variables.

The Long Short-Term Memory (LSTM) network is a specialized type of recurrent neural network designed to overcome the vanishing gradient problem and capture long-term dependencies in sequential data. Its architecture is based on memory cells and gating mechanisms that regulate the flow of information. The core equations governing the LSTM cell are:

f_{t} = σ (W_{f} [h_{t - 1}, x_{t}] + b_{f})

(4)

i_{t} = σ (W_{i} [h_{t - 1}, x_{t}] + b_{i})

(5)

{\tilde{C}}_{t} = \tanh (W_{C} [h_{t - 1}, x_{t}] + b_{C})

(6)

C_{t} = f_{t} ⊙ C_{t - 1} + i_{t} ⊙ {\tilde{C}}_{t}

(7)

o_{t} = σ (W_{o} [h_{t - 1}, x_{t}] + b_{o})

(8)

h_{t} = o_{t} ⊙ \tanh (C_{t})

(9)

where x_{t} is the input vector, h_{t} is the hidden state, C_{t} is the cell state, and σ(·) denotes the logistic sigmoid function. These equations enable the network to selectively retain or discard information, making LSTMs particularly effective at modeling persistence and volatility clustering in financial returns. Empirical studies demonstrate that LSTM models achieve superior forecasting performance in stock markets, particularly when incorporating investor sentiment and policy interventions [30].
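A single forward step of Equations (4)–(9) can be written in a few lines of plain Python. This is a didactic sketch with a hypothetical two-unit hidden state and arbitrary small weights, not the two-layer, 64-unit network used in the study, and it omits training entirely.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell update following Equations (4)-(9)."""
    concat = h_prev + x_t  # the stacked vector [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], concat)]
    c_tilde = [math.tanh(z) for z in affine(params["W_C"], params["b_C"], concat)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Hypothetical weights: hidden size 2, input size 1, so each W is 2 x 3.
params = {
    "W_f": [[0.1, 0.0, 0.5], [0.0, 0.1, -0.5]], "b_f": [0.0, 0.0],
    "W_i": [[0.2, 0.1, 0.3], [0.1, 0.2, -0.3]], "b_i": [0.0, 0.0],
    "W_C": [[0.0, 0.1, 1.0], [0.1, 0.0, -1.0]], "b_C": [0.0, 0.0],
    "W_o": [[0.1, 0.1, 0.2], [0.1, 0.1, 0.2]], "b_o": [0.0, 0.0],
}
h, c = [0.0, 0.0], [0.0, 0.0]
for r in [0.004, -0.012, 0.007]:  # a short, made-up return sequence
    h, c = lstm_step([r], h, c, params)
```

The forget gate f_t scales the previous cell state and the input gate i_t scales the candidate update, which is the mechanism that lets the cell retain or discard past information.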

The Extreme Gradient Boosting (XGBoost) algorithm, developed by [34], is an ensemble learning technique based on gradient-boosted decision trees. XGBoost iteratively combines weak learners to minimize a regularized objective function:

L = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}) + \sum_{k = 1}^{K} Ω (f_{k})

(10)

with

Ω (f_{k}) = γ T + \frac{1}{2} λ {‖w‖}^{2}

(11)

where l(·) is a convex loss function measuring the difference between observed values y_{i} and predictions {\hat{y}}_{i}, n is the number of observations, f_{k} represents the k-th of K regression trees, T is the number of leaf nodes, w is the vector of leaf weights, and γ, λ are regularization parameters. This formulation penalizes model complexity, improving generalization and reducing overfitting.
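The role of γ and λ becomes clearer through the second-order approximation used in the original XGBoost derivation, where a leaf with gradient sum G and Hessian sum H receives the weight −G/(H + λ), and a candidate split is scored by the resulting reduction in the regularized objective. The sketch below assumes the squared-error loss ½(y − ŷ)², for which each gradient is ŷ − y and each Hessian is 1. The function names and the data are illustrative only.

```python
def leaf_weight(grad_sum, hess_sum, lam):
    """Optimal leaf weight under the second-order approximation."""
    return -grad_sum / (hess_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Objective reduction from splitting one leaf into two."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# Made-up returns with an initial prediction of zero for every observation.
y = [-0.02, -0.01, 0.01, 0.03]
grads = [0.0 - y_i for y_i in y]   # gradient of 0.5 * (y - yhat)^2 at yhat = 0
hess = [1.0] * len(y)              # Hessian of the same loss

# Candidate split: first two observations go left, last two go right.
G_L, H_L = sum(grads[:2]), sum(hess[:2])
G_R, H_R = sum(grads[2:]), sum(hess[2:])
w_left = leaf_weight(G_L, H_L, lam=1.0)
w_right = leaf_weight(G_R, H_R, lam=1.0)
gain = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0)
```

A larger λ shrinks the leaf weights toward zero, and a larger γ lowers the gain of every split by the same amount, so weak splits are pruned away. This is the sense in which the penalty in Equation (11) controls tree complexity.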

In financial applications, XGBoost has gained traction due to its robustness against noisy data, computational efficiency, and interpretability through feature importance rankings [6]. Recent studies highlight its adaptability in financial forecasting when combined with pre-processing and feature selection methods [9], as well as its ability to identify key determinants of asset pricing in complex market environments [8].

Both LSTM and XGBoost are well-suited to handle the nonlinear, nonstationary nature of Saudi equity and oil market dynamics. Their inclusion in this framework allows for a robust comparison against econometric benchmarks, enabling us to assess not only statistical performance but also the interpretive and economic values of machine learning forecasts.

To mitigate overfitting and preserve transparency in out-of-sample evaluation, all models are estimated under a strictly time-ordered forecasting protocol based on expanding windows. Within each regime, observations are partitioned sequentially into training, validation, and test segments. Hyperparameters are selected exclusively on the validation segment, and final results are reported on the held-out test observations. This design avoids look-ahead bias, serves as a time-series analog to cross-validation, and ensures that reported R² and error metrics reflect genuine predictive performance rather than in-sample fit. For LSTM, overfitting is controlled through dropout and early stopping monitored on validation loss. For XGBoost, complexity is constrained via shrinkage and stochastic subsampling (learning rate, subsample, and column-subsample ratios), together with explicit regularization penalties (reg_alpha and reg_lambda) and standard tree-growth constraints (e.g., min_child_weight).
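A time-ordered expanding-window partition of this kind might be generated as in the sketch below. The window sizes and the function name `expanding_window_splits` are hypothetical, since the exact segment lengths used in the study are not stated in this section.

```python
def expanding_window_splits(n_obs, initial_train, val_size, test_size):
    """Yield (train, validation, test) index ranges in time order.

    The training window always starts at observation 0 and grows by
    test_size at each step, so no future observation enters training.
    """
    train_end = initial_train
    while train_end + val_size + test_size <= n_obs:
        val_end = train_end + val_size
        test_end = val_end + test_size
        yield (range(0, train_end),
               range(train_end, val_end),
               range(val_end, test_end))
        train_end += test_size

# Hypothetical sizes: 100 observations, 60 in the first training window.
splits = list(expanding_window_splits(n_obs=100, initial_train=60,
                                      val_size=10, test_size=10))
```

Each test block lies strictly after its validation block, which in turn lies strictly after the training window. Observations used for testing in one step become available for validation or training only in later steps.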

Because crisis conditions may alter the optimal bias–variance trade-off and therefore the stability of machine learning specifications, we further evaluate robustness across regimes using two complementary hyperparameter policies. Under a global policy, the same LSTM and XGBoost hyperparameters are held fixed across the pre- and post-COVID sub-samples; under a regime-specific policy, hyperparameters are re-tuned separately within each regime using the identical time-ordered validation protocol. The LSTM is implemented as a two-layer network with 64 hidden units per layer, trained with the Adam optimizer using mini-batches and early stopping, where the dropout rate follows the selected policy. XGBoost is parameterized by the number of trees, maximum depth, learning rate, subsample and column-subsample ratios, and explicit regularization penalties (reg_alpha and reg_lambda), alongside standard constraints such as min_child_weight. Comparing forecast accuracy and model ranking under fixed versus regime-adaptive tuning provides a direct stability check: changes in selected parameters are interpreted as shifts in complexity requirements under turbulent conditions, whereas persistence of performance ordering under both policies indicates that the main conclusions are not artifacts of regime-driven hyperparameter adjustment.

Empirically, LSTM improves upon linear benchmarks in the pre-COVID regime but exhibits weaker performance in the post-COVID period. This deterioration is consistent with a post-crisis environment characterized by stronger nonstationarity, heavier tails, and more abrupt volatility clustering, which can undermine gradient-based sequence learning when the effective signal-to-noise ratio deteriorates. To confirm that this pattern is not driven by a single configuration, we conduct a focused sensitivity analysis over key LSTM design choices—look-back window length, dropout intensity, and regime-specific re-tuning under the same validation protocol. While moderate adjustments can reduce absolute errors, the qualitative finding remains unchanged: LSTM displays greater instability and weaker economic performance in the turbulent regime relative to XGBoost. This suggests that tree-based boosting more effectively exploits sparse nonlinear interactions under crisis conditions, whereas recurrent sequence learning is more sensitive to distributional shifts and noise.

Across a grid of look-back windows (7–30 days) and dropout rates (0.1–0.3), LSTM performance is sensitive to hyperparameter choices, with substantial variation in pre-COVID errors and limited, incremental differences in the post-COVID regime. In the pre-COVID period, the best configuration (look-back = 14, dropout = 0.2) yields the lowest RMSE (0.0088) and the strongest economic outcomes (Sharpe = 0.0619; CumRet = 0.2982), whereas other settings deteriorate sharply, indicating instability to specification choices. In the post-COVID period, RMSE values remain clustered around 0.0075–0.0084 across configurations, but economic metrics vary materially, with the strongest results obtained at look-back = 21 and dropout = 0.3 (Sharpe = 0.0693; CumRet = 0.1683). Overall, these sensitivity checks confirm that the documented LSTM underperformance and reduced stability in the turbulent regime are not artifacts of a single look-back window or dropout setting, and they reinforce the main conclusion that XGBoost provides more consistent and robust performance around the COVID break. The full sensitivity results are reported in Appendix A (Table A1).

3.5. Evaluation Metrics and Economic Value

The performance of the forecasting models is assessed using both statistical accuracy measures and economic evaluation criteria. This dual evaluation framework ensures that models are not only compared in terms of predictive fit but also in terms of their relevance for financial decision-making.

For statistical accuracy, we employ three widely used measures: the root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²). RMSE and MAE quantify forecast errors in terms of magnitude, while R² captures the proportion of variance explained. These metrics are standard in forecast evaluation and provide complementary perspectives on predictive performance [35]. Formally, RMSE is defined as

R M S E = \sqrt{\frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{n}}

(12)
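For completeness, the standard textbook definitions of the other two measures are sketched below in the notation of Equation (12). The choice of \bar{y} as the mean of the observed values in the evaluation sample is an assumption; an out-of-sample R² can alternatively be benchmarked against a historical-mean forecast, and this section does not specify which variant is used.

```latex
% Mean absolute error over the n evaluation observations
MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_{i} - \hat{y}_{i} \right|

% Coefficient of determination relative to the mean benchmark \bar{y}
R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}
                 {\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}
```

Under this definition R² is negative whenever the model's squared errors exceed those of the mean benchmark, which is how a forecast can exhibit negative out-of-sample explanatory power.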

To formally test whether predictive differences between models are statistically significant, we implement the Diebold–Mariano (DM) test [36]. The DM statistic compares forecast error differentials across models and is expressed as

D M = \frac{\bar{d}}{\sqrt{\frac{2 π {\hat{f}}_{d} (0)}{T}}}

(13)

where \bar{d} is the mean loss differential between competing forecasts, {\hat{f}}_{d} (0) is the spectral density of the loss differential at frequency zero, and T is the number of forecasts. This test has become a benchmark for forecast comparison in finance and macroeconomics, ensuring that reported improvements are not due to random variation.
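Equation (13) can be computed directly once a loss function and an estimator of the long-run variance 2π f_d(0) are chosen. The sketch below assumes squared-error loss and the common rectangular-kernel estimator truncated at h − 1 lags for an h-step forecast. Both choices, as well as the error series, are illustrative assumptions and not details confirmed by the source.

```python
import math

def diebold_mariano(errors_a, errors_b, horizon=1):
    """DM statistic under squared-error loss; negative values favor model A."""
    d = [ea * ea - eb * eb for ea, eb in zip(errors_a, errors_b)]
    T = len(d)
    d_bar = sum(d) / T

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, T)) / T

    # Long-run variance of d_t, i.e. 2*pi*f_d(0), with lags 1..horizon-1.
    # For horizon > 1 this estimator is not guaranteed to be positive.
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
    dm = d_bar / math.sqrt(long_run_var / T)
    # Two-sided p-value from the standard normal limit distribution.
    p_value = math.erfc(abs(dm) / math.sqrt(2.0))
    return dm, p_value

# Made-up forecast errors from two competing models.
errors_a = [0.010, -0.020, 0.015, -0.010, 0.020, -0.005]
errors_b = [0.005, -0.010, 0.010, -0.004, 0.012, -0.002]
dm_stat, p_value = diebold_mariano(errors_a, errors_b)
```

Here model B has the smaller squared error on every date, so the mean loss differential and the statistic are positive. With only six observations the normal approximation is of course unreliable, and the example is meant to show the mechanics only.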

Beyond statistical accuracy, the economic value of forecasts is increasingly emphasized in financial econometrics [37]. Even weak statistical predictability can generate meaningful improvements in portfolio allocation and risk management. To capture this dimension, we compute the hit rate (percentage of correctly predicted return directions), the Sharpe ratio, and cumulative returns from trading strategies based on model forecasts. Formally, these are given by

S R = \frac{E [R_{p} - R_{f}]}{σ_{p}}

(14)

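H R = \frac{1}{T} \sum_{t = 1}^{T} \mathbf{1} \{ \operatorname{sign} (y_{t}) = \operatorname{sign} ({\hat{y}}_{t}) \}

(15)

where R_{p} is the return of the forecast-based portfolio, R_{f} is the risk-free rate, σ_{p} is the standard deviation of portfolio returns, and 1{·} is the indicator function.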

The hit rate (HR) is particularly relevant for directional forecasting, while the Sharpe ratio (SR) adjusts returns for volatility, providing a risk-adjusted performance measure. Evaluating cumulative returns demonstrates the practical utility of forecasts for investors, extending beyond purely statistical measures.
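The three economic measures can be sketched as below. The trading rule (hold the index when the forecast is positive, otherwise stay out of the market), the zero risk-free rate, and the data are assumptions made for illustration, as this section does not spell out the exact strategy. The Sharpe ratio is computed per period and is not annualized.

```python
import math

def sign(x):
    return (x > 0) - (x < 0)

def hit_rate(actual, forecast):
    """Share of periods in which the forecast has the correct sign, Eq. (15)."""
    hits = sum(sign(a) == sign(f) for a, f in zip(actual, forecast))
    return hits / len(actual)

def strategy_returns(actual, forecast):
    # ASSUMED rule: long when the forecast is positive, otherwise flat.
    return [a if f > 0 else 0.0 for a, f in zip(actual, forecast)]

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation, Eq. (14)."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    return mean / math.sqrt(var)

def cumulative_return(log_returns):
    """Log returns add over time, so the total simple return is exp(sum) - 1."""
    return math.exp(sum(log_returns)) - 1.0

# Made-up daily log returns and model forecasts.
actual = [0.010, -0.020, 0.005, -0.010]
forecast = [0.002, -0.001, 0.003, 0.004]
hr = hit_rate(actual, forecast)
strat = strategy_returns(actual, forecast)
sr = sharpe_ratio(strat)
cum = cumulative_return(strat)
```

The forecast has the right sign on three of the four days, and the rule avoids the loss on the second day by staying out of the market, which illustrates how directional accuracy can translate into economic value even when forecast magnitudes are imprecise.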

This combination of statistical and economic metrics enables a comprehensive comparison of econometric and machine learning models. While econometric models often serve as baselines for statistical accuracy, machine learning methods are expected to yield not only superior fit but also tangible economic value, reinforcing their relevance for financial market participants.

An important pre-processing step is the transformation of financial series into continuously compounded log returns, defined as

r_{t} = \ln (\frac{P_{t}}{P_{t - 1}})

(16)

where P_t denotes the observed price or index level at time t. This ensures stationarity and comparability across variables before implementing break detection, causality analysis, and forecasting.
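Equation (16) amounts to a one-line transformation, shown here with made-up index levels rather than actual TASI or Brent data.

```python
import math

def log_returns(prices):
    """Continuously compounded returns r_t = ln(P_t / P_{t-1}), Eq. (16)."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

# Hypothetical index levels on three consecutive trading days.
prices = [100.0, 102.0, 101.0]
r = log_returns(prices)
```

A useful property of this definition is time additivity: the sum of the daily log returns equals the log return over the whole period, ln(101/100) in this example.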

4. Results and Discussion

4.1. Descriptive Statistics and Preliminary Analysis

This section summarizes the main characteristics of the dataset, which includes the Saudi stock index (TASI), crude oil returns, domestic inflation rates, and interest rates from September 2010 to August 2025. Descriptive statistics for the variables are presented in Table 1, while the behavior of stock and crude oil returns, as well as their rolling volatility, is illustrated in Figure 2. A correlation overview is reported in Table 2 to motivate the subsequent regime-based and nonlinear modeling framework.

4.1.1. Summary Statistics and Distributional Properties

Table 1 reports the main descriptive statistics of the variables over the full 2010–2025 sample. Both financial return series exhibit pronounced non-Gaussian behavior. Crude oil returns display extreme leptokurtosis (excess kurtosis exceeding 50), while TASI returns also show substantial tail thickness (excess kurtosis exceeding 15), indicating fat tails and volatility clustering typical of commodity-linked emerging market environments. Both return series are negatively skewed, reflecting the dominance of downside risk and the prevalence of sharp corrections. In contrast, the macroeconomic variables are comparatively stable. Average inflation is close to 2% with moderate dispersion, and interest rates average around 2.6% with limited variation, consistent with monetary stability under the U.S. dollar peg. The strong contrast between the extreme variability in asset returns and the comparatively smooth behavior of macro indicators suggests that macro fundamentals may transmit to equity returns through slower, lagged, or regime-dependent channels rather than contemporaneous linear effects.

4.1.2. Return Dynamics and Volatility Clustering

Figure 2 visualizes the daily log returns and rolling volatility of Saudi equities and crude oil. The return series highlight that crude oil is substantially more volatile than TASI, with particularly large swings during the 2020 oil market collapse. The 60-day rolling standard deviation further reveals clear volatility clustering in both markets. Both oil and equity volatility spike sharply at the onset of the COVID-19 pandemic, consistent with a global repricing of risk. After the initial shock, oil volatility declines, whereas TASI volatility remains elevated for a longer period, indicating that financial stress in the Saudi stock market persists beyond the immediate normalization of oil-market turbulence. This divergence is consistent with the idea that crisis episodes may alter the oil–equity transmission mechanism and motivate regime-aware evaluation in subsequent sections.
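The 60-day rolling standard deviation used for Figure 2 can be sketched as follows. The trailing-window alignment and the sample (n − 1) divisor are assumptions, and the toy series is hypothetical.

```python
import statistics

def rolling_std(returns, window=60):
    """Sample standard deviation over each trailing window of `window` observations."""
    return [statistics.stdev(returns[end - window:end])
            for end in range(window, len(returns) + 1)]

# Hypothetical toy series: a calm stretch followed by a turbulent one.
calm = [0.002, -0.002] * 40
turbulent = [0.03, -0.03] * 40
vol = rolling_std(calm + turbulent, window=60)
```

The first window lies entirely in the calm stretch and the last entirely in the turbulent one, so the rolling estimate rises sharply as the window moves across the transition.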

4.1.3. Correlation Structure and Motivation for Nonlinear Models

Table 2 reports unconditional Pearson correlations among the key variables over the full sample. The correlation between TASI returns and crude oil returns is positive but modest (ρ ≈ 0.20), indicating that oil–equity co-movements exist on average but are far from perfectly synchronized. This is consistent with a time-varying linkage in which stronger coupling may emerge during macro-financial stress, while weaker dependence characterizes tranquil periods. Correlations between the macro variables and both return series are close to zero at the daily frequency, reflecting the slow-moving nature of inflation and interest-rate dynamics and the likelihood that their influence operates through lagged, nonlinear, or regime-specific channels rather than contemporaneous linear association. The modest negative correlation between inflation and interest rates is directionally consistent with policy adjustment to inflationary conditions, although the magnitude remains weak in unconditional terms. Overall, these preliminary patterns motivate the use of dynamic, regime-based, and nonlinear forecasting frameworks that can capture relationships not well summarized by static full-sample correlations.

4.1.4. Data Frequency Choice and Implications

Although higher-frequency (intraday) data can capture finer microstructure dynamics and short-lived volatility bursts, the empirical design in this study is conducted at the daily frequency for three reasons. First, the main explanatory channels—crude oil returns and domestic macro-financial controls—are consistently available and economically interpretable at daily (or lower) frequency over the full 2010–2025 horizon, which is essential for stable regime identification and long-span out-of-sample evaluation. Second, intraday series introduce market microstructure effects such as bid–ask bounce, non-synchronous trading, liquidity variation, and exchange-specific interruptions, which can distort volatility measurement and complicate comparability across regimes. Third, the economic-value assessment is implemented at a daily rebalancing horizon, so daily forecasts map directly into implementable trading decisions without imposing additional assumptions about intraday execution and transaction-cost modeling. Accordingly, the conclusions are framed at the daily investment horizon. Extending the framework to intraday frequency is a promising direction for future work, particularly to examine whether high-frequency signals improve very short-horizon volatility prediction during crisis episodes, while carefully addressing microstructure noise and ensuring consistent data availability across the full sample.

4.2. Structural Breaks and Regime Identification

Given the evidence of fat tails and volatility clustering, we examine whether the relationship between crude oil returns and Saudi stock returns remains stable or undergoes structural change during the COVID-19 crisis. To provide a robust regime-identification basis, we employ complementary diagnostics that target different forms of instability: (i) the Chow sup-F test to detect sharp local breaks, (ii) the CUSUM test to assess global parameter stability, and (iii) a two-state Markov-switching specification to identify latent volatility regimes. Table 3 summarizes the key break and regime statistics.

Table 3 consolidates the three complementary diagnostics used to motivate regime-aware analysis. The sup-F (Chow) peak identifies a sharp local instability centered on 26 March 2020, supporting a crisis-aligned pre/post split for downstream causality and forecasting. The CUSUM test, in contrast, targets gradual parameter drift and does not reject global stability, which is consistent with the notion that the dominant disruption is concentrated around the COVID shock window rather than spread as a smooth change over the full sample. Finally, the Markov-switching variance estimates quantify the magnitude of volatility-state separation, indicating that the variance in the high-volatility regime is more than an order of magnitude larger than in the tranquil regime, consistent with persistent turbulence following the pandemic onset.

4.2.1. Local Break Detection: sup-F (Chow) Test

The sup-F test results (Table 3 and Figure 3a) indicate a clear structural break on 26 March 2020, coinciding with the early pandemic period and the collapse in global oil demand. The peak test statistic strongly rejects parameter stability at that date, providing direct justification for partitioning the sample into pre-COVID and post-COVID subperiods for subsequent causality and forecasting analyses.
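The mechanics of a sup-F scan can be sketched as below: a Chow F statistic is computed at every admissible split point and the maximum is retained. This is a generic illustration, assuming a simple regression of equity returns on oil returns, 15% trimming, and simulated data; the function names are hypothetical and the regression used in the paper may differ. The sup-F statistic has a nonstandard distribution, so it should not be compared with ordinary F critical values.

```python
import random

def ols_rss(x, y):
    """Residual sum of squares from regressing y on an intercept and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def sup_f_test(x, y, trim=0.15, n_params=2):
    """Return the largest Chow F statistic and the index where it occurs."""
    n = len(x)
    rss_pooled = ols_rss(x, y)
    best_f, best_idx = float("-inf"), None
    for t in range(int(trim * n), int((1 - trim) * n)):
        rss_split = ols_rss(x[:t], y[:t]) + ols_rss(x[t:], y[t:])
        f = ((rss_pooled - rss_split) / n_params) / (rss_split / (n - 2 * n_params))
        if f > best_f:
            best_f, best_idx = f, t
    return best_f, best_idx

# Simulated data (assumption): intercept and slope both shift at observation 120.
rng = random.Random(0)
n, true_break = 200, 120
oil = [rng.gauss(0.0, 1.0) for _ in range(n)]
tasi = [(0.2 * o if t < true_break else 0.5 + 1.0 * o) + rng.gauss(0.0, 0.05)
        for t, o in enumerate(oil)]

f_stat, break_idx = sup_f_test(oil, tasi)
```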

4.2.2. Parameter Stability: CUSUM Test

The CUSUM test does not reject the null of overall stability at conventional levels (Table 3). This outcome is consistent with the known properties of CUSUM, which is designed to detect persistent global drift in parameters and is therefore less sensitive to short-lived but extreme shocks, such as those occurring at the onset of the pandemic. Consequently, the CUSUM result does not contradict the sup-F evidence; rather, it suggests that the dominant instability is localized around the crisis window rather than a smooth, gradual drift throughout the sample.
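The contrast between a short-lived shock and persistent drift can be illustrated with a residual-based CUSUM statistic. The sketch below uses the OLS-residual variant (scaled cumulative sums compared with the 5% critical value of about 1.358); this is an assumption, since the paper does not state which CUSUM variant was applied, and the residual series are synthetic.

```python
import math

def cusum_ols_stat(resid):
    """Sup of |cumulative residual sum| scaled by sigma * sqrt(n)."""
    n = len(resid)
    mean = sum(resid) / n
    sigma = math.sqrt(sum((e - mean) ** 2 for e in resid) / (n - 1))
    cum, sup = 0.0, 0.0
    for e in resid:
        cum += e
        sup = max(sup, abs(cum) / (sigma * math.sqrt(n)))
    return sup

CRIT_5PCT = 1.358  # asymptotic 5% critical value for the OLS-based CUSUM

# Synthetic case 1: small alternating residuals with a brief, extreme burst.
burst = [(0.1 if 50 <= t <= 55 else 0.01) * (-1) ** t for t in range(100)]

# Synthetic case 2: a persistent shift in the residual mean halfway through.
drift = [-1.0] * 50 + [1.0] * 50

stat_burst = cusum_ols_stat(burst)
stat_drift = cusum_ols_stat(drift)
```

The burst leaves the cumulative sum close to zero, so the statistic stays well below the critical value, while the persistent shift pushes it far above. This is the sense in which a CUSUM non-rejection and a sharp sup-F peak can coexist.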

4.2.3. Volatility Regimes: Markov-Switching Evidence

The Markov-switching results further corroborate a regime shift in early 2020. Smoothed probabilities (Figure 3b) show a sharp transition into a high-volatility state around the COVID-19 onset, with the residual variance rising from 2.97 × 10⁻⁶ in the low-volatility regime to 1.51 × 10⁻⁴ in the high-volatility regime, an increase of roughly fifty-fold (Table 3). Importantly, the high-volatility state persists well beyond the initial shock, indicating prolonged financial stress and supporting the use of regime-aware evaluation in the forecasting experiments.
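For reference, the variance ratio implied by the two regime estimates, and its equivalent in standard-deviation terms, works out as follows (the subscripts "high" and "low" are notation introduced here for the two regimes):

```latex
\frac{\sigma^{2}_{\text{high}}}{\sigma^{2}_{\text{low}}}
  = \frac{1.51 \times 10^{-4}}{2.97 \times 10^{-6}} \approx 50.8,
\qquad
\frac{\sigma_{\text{high}}}{\sigma_{\text{low}}} \approx \sqrt{50.8} \approx 7.1
```

In other words, residual volatility in the turbulent state is about seven times its level in the tranquil state.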

4.2.4. Multiple Breaks Robustness: Bai–Perron Test

To address the possibility that structural change may occur through multiple shifts rather than a single discrete break, we additionally implement a Bai–Perron style multiple-break analysis as a robustness diagnostic. While the sup-F (Chow) procedure evaluates candidate dates for a prominent local break and the Markov-switching model summarizes regime behavior through latent volatility states, the Bai–Perron framework is designed to endogenously identify multiple breakpoint locations by selecting the number and timing of changes that improve fit under an information-criterion penalty. In the implementation used here, break selection is based on the Bayesian Information Criterion (BIC), reported in Table 4, which balances goodness-of-fit against the additional degrees of freedom introduced by adding breakpoints. This diagnostic is particularly relevant in financial series where crisis effects may unfold through sequential adjustments, policy responses, and delayed market repricing, implying that structural transitions can manifest as a cluster of breakpoints rather than a single date.
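The idea of selecting the number and location of breaks under a BIC penalty can be sketched with a small dynamic program for mean-shift segmentation. This is a simplified illustration in the spirit of Bai–Perron, not the procedure used for Table 4: the constant-mean segment model, the minimum segment length, the BIC parameter count (2m + 1 for m breaks), and the toy data are all assumptions.

```python
import math

def seg_rss(ps, ps2, i, j):
    """RSS of a constant-mean fit on x[i:j], computed from prefix sums."""
    s = ps[j] - ps[i]
    return (ps2[j] - ps2[i]) - s * s / (j - i)

def select_breaks(x, max_breaks=3, min_len=5):
    """Return (number of breaks, break indices) minimizing BIC."""
    n = len(x)
    ps, ps2 = [0.0], [0.0]
    for v in x:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)
    inf = float("inf")
    # cost[m][j]: minimal RSS for x[:j] with m breaks; back[m][j]: last break index.
    cost = [[inf] * (n + 1) for _ in range(max_breaks + 1)]
    back = [[None] * (n + 1) for _ in range(max_breaks + 1)]
    for j in range(min_len, n + 1):
        cost[0][j] = seg_rss(ps, ps2, 0, j)
    for m in range(1, max_breaks + 1):
        for j in range((m + 1) * min_len, n + 1):
            for i in range(m * min_len, j - min_len + 1):
                c = cost[m - 1][i] + seg_rss(ps, ps2, i, j)
                if c < cost[m][j]:
                    cost[m][j], back[m][j] = c, i
    best = None
    for m in range(max_breaks + 1):
        if cost[m][n] == inf:
            continue
        bic = n * math.log(cost[m][n] / n) + (2 * m + 1) * math.log(n)
        breaks, j = [], n
        for k in range(m, 0, -1):  # walk back through the stored break indices
            j = back[k][j]
            breaks.append(j)
        if best is None or bic < best[0]:
            best = (bic, m, sorted(breaks))
    return best[1], best[2]

# Deterministic toy series: the mean shifts from 0 to 1 at index 30.
x = [(0.0 if t < 30 else 1.0) + 0.05 * (-1) ** t for t in range(60)]
n_breaks, break_idx = select_breaks(x)
```

Adding a second or third break barely lowers the residual sum of squares in this series, so the penalty term dominates and a single break at index 30 is selected.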

Table 4 reports the Bai–Perron breakpoint estimates for oil and TASI log returns, including the selected number of breaks, the estimated break dates, and the resulting segment boundaries. For oil returns, the procedure selects three breakpoints concentrated in the early-pandemic window, with estimated breaks on 4 March 2020, 26 April 2020, and 8 June 2020. This clustering of breaks around 2020 is consistent with the timing of the COVID-19 shock and the rapid sequence of demand collapse, policy interventions, and subsequent market normalization commonly observed in global oil markets. For TASI returns, the algorithm identifies a set of early-sample breaks in late February and early March 2011 (27 February 2011, 4 March 2011, and 10 March 2011), implying a short sequence of mean-shift segments at the beginning of the sample and a long stable segment thereafter. The same early-2011 pattern also appears in the joint specification, indicating that the initial portion of the sample contains a structural adjustment that affects the combined oil–equity system. Importantly, these detected breakpoints for TASI do not imply that the COVID split is the dominant breakpoint in the equity return series; rather, they support the broader premise that the data exhibit multiple structural adjustments and time variation, motivating regime-aware modeling and strict out-of-sample evaluation.

These multiple-break results strengthen the structural-change argument in two ways. First, they show that the COVID-era transition in oil markets is not well represented by a single instantaneous change; instead, the data support a short cluster of sequential breakpoints that collectively define the crisis transition. Second, they highlight that additional structural adjustments can exist outside the pandemic window—particularly early in the sample for TASI—reinforcing that financial relationships are not globally time-invariant. Nevertheless, we retain the pre-/post-COVID regime division for the main causality, forecasting, and economic-value experiments to preserve interpretability and to ensure direct comparability across models and evaluation protocols. The Bai–Perron evidence is therefore used as a robustness validation of structural segmentation rather than a replacement for the baseline split: the concentration of oil breakpoints around early 2020 corroborates the crisis-based regime framing for oil-driven transmission, while the presence of additional breaks motivates caution against assuming global stability and supports the regime-aware evaluation design adopted throughout this study.

4.3. Granger Causality (VAR) with Macro Controls

To test the direction of information flow between the oil and stock markets under changing macroeconomic conditions, we estimated a four-variable VAR model that included TASI returns, crude oil returns, first differences in inflation rates, and first differences in interest rates. These macro variables are incorporated as controls to reflect the domestic price environment and policy stance, but they are not assumed to be strictly exogenous. In Saudi Arabia’s institutional setting, inflation and interest rates may respond endogenously to oil-market conditions and equity-market dynamics; therefore, the VAR framework is used to jointly model feedback among financial and macroeconomic variables, and the Granger causality results are interpreted as conditional predictive relationships rather than structural causal effects.

Because the VAR is estimated at a daily frequency, while inflation and interest rates are typically observed monthly, the macroeconomic controls enter the model through the frequency-aligned daily representation described in Section 3.2. Specifically, each monthly macro-observation is assigned to all trading days within the corresponding month. As a result, the daily VAR captures the prevailing low-frequency macroeconomic information set rather than intramonth innovations, which may attenuate very short-horizon macro-driven effects. Accordingly, the causality and impulse-response findings reported below are interpreted with this frequency structure in mind.
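The frequency alignment can be illustrated with a step assignment of monthly values to trading days. The values and dates below are hypothetical, and differencing after alignment is an assumption about the order of operations.

```python
from datetime import date

# Hypothetical monthly inflation readings keyed by (year, month).
monthly_inflation = {(2020, 2): 1.2, (2020, 3): 1.5}

# Hypothetical trading days spanning a month boundary.
trading_days = [date(2020, 2, 26), date(2020, 2, 27), date(2020, 3, 1), date(2020, 3, 2)]

# Each trading day inherits the value of its calendar month.
daily_inflation = [monthly_inflation[(d.year, d.month)] for d in trading_days]

# First differences of the aligned series are zero within a month
# and non-zero only on the first trading day of a new month.
daily_diff = [b - a for a, b in zip(daily_inflation, daily_inflation[1:])]
```

Under this construction the differenced macro series is flat on most days, which is one way to see why the daily VAR reflects the prevailing macro information set rather than intramonth innovations.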

Importantly, the VAR–Granger analysis is not employed solely as an interpretive econometric exercise but is used to structurally guide the specification of the subsequent machine learning forecasters. In each regime, the Granger causality outcomes are translated into a regime-specific predictor inclusion rule, whereby only variables and lag structures that exhibit predictive content for TASI returns—conditional on the macro controls embedded in the VAR system—are retained as external inputs to the forecasting models. Accordingly, the XGBoost specification is constructed using a causality-screened lag-feature set rather than a fixed, full-input design. For the LSTM, the same regime-specific inclusion rule governs which input channels are fed into the network, while the selected VAR lag order informs the effective look-back horizon used to form sequential inputs. This coupling implies that when the VAR indicates longer information transmission dynamics (i.e., a higher optimal lag), the LSTM is configured with a correspondingly longer temporal context, and when certain macro or oil lags are not supported by Granger evidence within a regime, they are excluded from the machine learning input space. As a result, the causality layer directly affects the predictor space and temporal memory structure of the nonlinear models, thereby moving beyond a purely sequential “diagnosis-then-forecast” workflow.
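A predictor inclusion rule of this kind could be implemented as a lag-feature builder that takes the screened variables and lags as input. The sketch below is hypothetical: the `selected` mapping stands in for whatever the Granger screening retains in a given regime, and the feature naming simply mirrors labels such as tasi_lag_1 and oil_lag_1.

```python
def build_lag_features(series, selected):
    """Build a lagged design matrix.

    series:   dict mapping variable name -> list of observations (equal length)
    selected: dict mapping variable name -> list of lags retained by the screening
    Returns (feature names, feature rows, aligned targets for 'tasi').
    """
    max_lag = max(l for lags in selected.values() for l in lags)
    n = len(next(iter(series.values())))
    names = [f"{name}_lag_{l}" for name, lags in selected.items() for l in lags]
    rows = [[series[name][t - l] for name, lags in selected.items() for l in lags]
            for t in range(max_lag, n)]
    targets = series["tasi"][max_lag:]
    return names, rows, targets

# Hypothetical data and screening outcome (two TASI lags, one oil lag, no macro lags).
series = {"tasi": [1, 2, 3, 4, 5], "oil": [10, 20, 30, 40, 50]}
selected = {"tasi": [1, 2], "oil": [1]}

names, rows, targets = build_lag_features(series, selected)
```

For a sequence model, the same `max_lag` value would play the role of the look-back horizon, so a longer selected lag order lengthens the temporal context.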

Table 5 reports that there is unidirectional causality from crude oil returns to TASI returns (F = 24.77; p < 0.001) before COVID-19, which is consistent with the Saudi economy’s dependence on oil. However, the relationship becomes bidirectional in the post-COVID-19 regime. Oil continues to influence stocks (F = 3.50; p < 0.001), and stocks Granger-cause oil (F = 10.08; p < 0.001). The optimal lag order increases from 1 day (pre-COVID) to 7 days (post-COVID), indicating a slower and more complex transmission mechanism during turbulent periods. Consistent with the sensitivity analysis described above, this qualitative causality pattern and the relative shift toward a longer lag structure in the post-COVID regime are preserved under alternative lag-order specifications, reinforcing that the inference is not an artifact of a single information criterion.
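The F statistics in Table 5 compare a restricted regression (own lags only) with an unrestricted one (own lags plus lags of the candidate cause). The sketch below shows that comparison in a bivariate setting with simulated data; it omits the macro controls of the four-variable VAR, so it illustrates the test logic rather than reproducing the reported values.

```python
import random

def ols_rss(X, y):
    """RSS of least squares of y on the columns of X (normal equations, Gauss-Jordan)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))

def granger_f(cause, effect, lags=1):
    """F statistic for H0: lags of `cause` add no predictive content for `effect`."""
    n = len(effect)
    y = effect[lags:]
    restricted = [[1.0] + [effect[t - l] for l in range(1, lags + 1)]
                  for t in range(lags, n)]
    unrestricted = [row + [cause[t - l] for l in range(1, lags + 1)]
                    for row, t in zip(restricted, range(lags, n))]
    rss_r, rss_u = ols_rss(restricted, y), ols_rss(unrestricted, y)
    dof = len(y) - (2 * lags + 1)
    return ((rss_r - rss_u) / lags) / (rss_u / dof)

# Simulated data (assumption): oil leads equity returns by one day, not vice versa.
rng = random.Random(42)
oil = [rng.gauss(0.0, 1.0) for _ in range(300)]
tasi = [rng.gauss(0.0, 0.5)] + [0.5 * oil[t - 1] + rng.gauss(0.0, 0.5)
                                for t in range(1, 300)]

f_oil_to_tasi = granger_f(oil, tasi)
f_tasi_to_oil = granger_f(tasi, oil)
```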

Figure 4 presents the impulse response functions (IRFs). In the pre-COVID regime, the response of TASI returns to a one-standard-deviation oil price shock is positive but fades within about a week. In the post-COVID regime, the response is erratic, oscillating in sign for almost three weeks before stabilizing, which reflects a more volatile and persistent adjustment mechanism. Taken together, these results point to a structural evolution in the oil–stock relationship: from one-way transmission (oil → stocks) in stable times to a two-way feedback loop with slower convergence. This suggests deeper integration but also increased vulnerability of the Saudi stock market during systemic crises.

These findings are consistent with prior evidence on time-varying oil–equity linkages in Saudi Arabia and related markets [1,2,3,4,5]. Several mechanisms may explain the regime-dependent causality patterns. In the pre-COVID period, Saudi Arabia’s macro-financial environment remained strongly oil-centric, such that oil-price shocks plausibly propagated to equity returns through fiscal revenue expectations, government spending capacity, and oil-sensitive corporate cash flows. Under this structure, the empirical one-way linkage from oil to stock returns is economically intuitive.

In the post-COVID regime, the transmission channels become more complex. Structural reforms associated with Vision 2030, changes in market composition and participation, and broader capital-market developments—including Aramco-related dynamics and growing non-oil sector activity—reduce the extent to which equity valuations mechanically mirror oil fundamentals. As the equity market increasingly reflects forward-looking expectations about domestic growth, financing conditions, and risk premia, stock return innovations can in turn convey information relevant for oil-price formation, contributing to a bidirectional relationship. From an external perspective, crisis and post-crisis environments are also characterized by stronger financialization of oil markets, in which portfolio rebalancing, hedging demand, and global risk sentiment can exert a larger influence on oil prices than contemporaneous supply–demand conditions. When equity volatility embeds macroeconomic and geopolitical risk, it can transmit to oil pricing through risk-premium adjustments and cross-asset positioning, reinforcing bidirectional oil–stock feedback in the post-COVID period. The next section, therefore, evaluates return forecasting performance using models that are robust to regime change and capable of capturing nonlinear interactions under volatile market conditions.

In addition to the baseline results, we evaluated hyperparameter stability across regimes by comparing a global model configuration (fixed hyperparameters applied to both subsamples) against a regime-specific configuration (hyperparameters re-tuned within each regime using the same time-ordered validation protocol). The comparative results indicate that the main performance ranking is preserved under both settings, suggesting that the reported regime differences are not driven by regime-specific hyperparameter re-optimization.

4.4. Forecasting Results

This section reports out-of-sample forecasting performance for daily TASI returns using traditional econometric benchmarks (ARIMA, ARIMAX, and VAR) and machine learning models (LSTM and XGBoost). Forecasts are conducted separately for the pre-COVID and post-COVID regimes based on the structural break evidence reported in Section 4.2. All results in Table 6 are generated under a time-ordered, rolling-origin (expanding-window) evaluation design with a dedicated validation segment for hyperparameter selection, ensuring that the reported metrics reflect genuine predictive performance without look-ahead bias. Out-of-sample R² is computed relative to a naïve historical-mean benchmark using the same expanding-window evaluation sample, and all metrics are reported for daily TASI returns (not price levels).
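A time-ordered expanding-window design with a validation segment can be sketched as an index generator. The partition sizes and step below are hypothetical, since the exact values used for Table 6 are not restated here; the point is only that every training index precedes every validation index, which in turn precedes the test index.

```python
def expanding_window_splits(n_obs, initial_train, val_size, step=1):
    """Yield (train, validation, test) index ranges in strict time order."""
    start = initial_train
    while start + val_size < n_obs:
        train = range(0, start)                   # grows with each iteration
        val = range(start, start + val_size)      # used for hyperparameter selection
        test = range(start + val_size, min(start + val_size + step, n_obs))
        yield train, val, test
        start += step

# Hypothetical sizes: 10 observations, 5 initial training points, 2 validation points.
splits = list(expanding_window_splits(n_obs=10, initial_train=5, val_size=2))
```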

Table 6 shows that daily return forecasting remains challenging across both regimes, as reflected by out-of-sample R² values that are close to zero and often negative for several models. This pattern is common in financial return prediction because the conditional signal is weak relative to noise and becomes further obscured under volatility clustering and structural instability. In this setting, the linear econometric benchmarks (ARIMA, ARIMAX, and VAR) exhibit limited predictive power in both regimes, producing negative out-of-sample R² and error levels that are not meaningfully better than a naïve reference forecast. This outcome is consistent with evidence that linear autoregressive structures tend to deteriorate when nonlinear dynamics, heavy tails, and structural breaks dominate emerging market behavior [38,39], and it aligns with the broader view that conventional models can be fragile under instability and regime-dependent dynamics [40].

Moving to machine learning approaches, LSTM does not deliver systematic improvements over the econometric benchmarks under the return target, with negative out-of-sample R² in both regimes and only marginal differences in RMSE/MAE relative to ARIMA/ARIMAX/VAR. By contrast, XGBoost achieves the strongest performance in both regimes, yielding the lowest RMSE and MAE and the only consistently positive out-of-sample R² values (0.046 in pre-COVID and 0.010 in post-COVID). While the absolute magnitude of R² remains modest—as expected for daily returns—these results indicate that boosted trees can extract incremental predictive structure from lagged market dynamics and exogenous predictors even in a low signal-to-noise environment. Importantly, XGBoost’s performance is obtained under explicit regularization and complexity controls (e.g., shrinkage and subsampling with penalty terms), together with the same time-ordered validation protocol used for all models, which limits overfitting risk and strengthens interpretability of the out-of-sample comparison.

Figure 5 provides qualitative confirmation of the quantitative evidence in Table 6. The econometric benchmarks display weaker tracking of realized return movements, particularly around abrupt changes where linear dynamics fail to adapt to regime shifts. The machine learning models exhibit comparatively tighter alignment, although the post-COVID regime remains more difficult, with sharper fluctuations and more persistent deviations that reflect heightened nonstationarity and volatility clustering. Overall, the visual and numerical results jointly suggest that forecasting performance is regime-dependent in crisis-prone environments and that models capable of capturing nonlinear interactions tend to be more resilient. In economic terms, this supports the view that practical forecasting and risk-management systems should be adaptive rather than calibrated under a single stability assumption. Among the examined approaches, boosted tree ensembles remain effective because they flexibly partition the feature space and capture nonlinear interactions between lag structure, oil-related shocks, and macroeconomic conditions, helping preserve robustness under noise and distributional shifts. This interpretation is consistent with recent evidence that ensemble-based machine learning methods have gained prominence in financial forecasting settings characterized by nonlinearities and regime changes [6,7].

Table 7 further clarifies the regime-sensitive drivers learned by XGBoost. In the pre-COVID period, short-term lagged TASI information and moving-average signals dominate, indicating that near-term autoregressive patterns carry substantial predictive content under relatively calm conditions. In the post-COVID regime, the relative importance of crude oil returns and macroeconomic variables increases, consistent with a stronger role of external shocks and policy-linked conditions during turmoil. This pattern supports the argument, also made in [8,9], that machine learning models can uncover regime-sensitive, time-varying determinants that are not well summarized by static correlations or linear benchmark structures.

Overall, the regime shift in feature dependence suggests that machine learning models not only improve predictive accuracy but also reflect changes in the dominant information channels over time. However, these macro variables should not be interpreted as purely exogenous drivers. In an open economy with a fixed exchange-rate arrangement, inflation and interest rates can plausibly adjust endogenously in response to oil-market shocks and financial conditions, implying that their importance during crisis periods may capture both direct macro effects and feedback from the broader oil–equity system. From a financial perspective, this reinforces that investment strategies and risk management must remain adaptive to structural change: in calm periods, lagged returns information may dominate predictive content, whereas during turmoil, oil shocks and macroeconomic conditions—interacting through feedback channels—become central determinants of return dynamics.

To directly address concerns regarding (i) the role of macro controls and (ii) the robustness of results to alternative regime split points, we extend the forecasting analysis with two additional experiments reported in Table 8 and Table 9. First, we conduct a macro-removal ablation in which inflation and interest rate predictors are excluded from the forecasting specification while all other settings are held constant. This test isolates whether the predictive gains attributed to macroeconomic conditions are robust or merely incidental. Second, we evaluate break-date sensitivity by repeating the full forecasting-and-trading evaluation under several plausible alternative COVID-era break dates surrounding the baseline split, thereby assessing whether the main conclusions are fragile to regime boundary selection.

As shown in Table 8, the effect of removing macro controls is heterogeneous across model classes, indicating that macro information is not uniformly exploited by all learners. While the RMSE changes are relatively small in magnitude, the risk-adjusted outcomes exhibit clearer differences. In particular, the LSTM and XGBoost experience declines in the Sharpe ratio when macro controls are removed, suggesting that these variables contribute to the quality of the forecast signal in a way that matters economically rather than only marginally improving point accuracy. This finding supports the interpretation that macro conditions become more relevant to the forecasting signal during turbulent periods, consistent with the broader regime narrative, while also highlighting that the contribution of macro predictors depends on model capacity and how nonlinear interactions are learned.

Table 9 indicates that the qualitative conclusions are not an artifact of a single break-date choice. Across alternative split points, the relative performance patterns remain broadly consistent, and the estimated economic outcomes remain within a comparable range, supporting the stability of the regime-based evaluation. The exact magnitudes of RMSE, Sharpe ratio, and cumulative return vary with the break date, as expected when the composition and length of the subsamples change, but the overall evidence continues to support the presence of regime-dependent dynamics and the need for forecasting models that remain effective under structural change.

4.5. Model Transparency and Feature Attribution via SHAP

To address concerns regarding the interpretability of tree-boosting models, we complement the predictive evaluation with a SHAP (SHapley Additive exPlanations) analysis for XGBoost. SHAP values are computed on the out-of-sample test observations within each regime, ensuring that feature attributions reflect the model’s realized forecasting behavior rather than in-sample fit. We summarize global feature contributions using mean absolute SHAP values, which quantify the average magnitude of each predictor’s marginal contribution to the model output and provide a transparent ranking of the drivers underlying XGBoost forecasts.

Figure 6 reports the SHAP importance profiles for the pre- and post-COVID regimes. In both regimes, the dominant contributor is roll_mean_tasi_5, indicating that short-horizon local persistence in the equity index is a key signal consistently exploited by the boosting model. In the pre-COVID regime, the remaining influential predictors are primarily short-lag market dynamics (tasi_lag_1, tasi_lag_3, tasi_lag_4, and tasi_lag_5), complemented by oil return lags (oil_lag_1 and related short oil lags) and technical summaries (ma5 and ma_ratio), together with a measurable macro channel via Inflation_rate. In the post-COVID regime, the explanatory structure remains economically coherent but exhibits a regime-dependent reweighting: Inflation_rate rises to a leading role, while oil_lag_1 and additional oil lags (e.g., oil_lag_4 and oil_lag_2) remain prominent, consistent with a more inflation- and energy-sensitive environment following the crisis. Overall, the SHAP results mitigate “black-box” concerns by demonstrating that XGBoost’s predictive gains are anchored in interpretable signals—market memory, oil-market information, and inflation—while allowing the relative importance of these channels to adapt across regimes in a transparent manner.
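The global ranking by mean absolute SHAP value reduces to a column-wise average of absolute attributions. The sketch below assumes the per-observation SHAP matrix has already been computed by an explainer; the attribution values are invented for illustration and are not the values behind Figure 6.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the average magnitude of their SHAP attributions."""
    n = len(shap_rows)
    scores = {name: sum(abs(row[j]) for row in shap_rows) / n
              for j, name in enumerate(feature_names)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

features = ["roll_mean_tasi_5", "tasi_lag_1", "oil_lag_1", "Inflation_rate"]

# Hypothetical SHAP values: one row per test observation, one column per feature.
shap_rows = [
    [0.004, -0.002, 0.002, 0.0005],
    [-0.003, 0.002, -0.001, -0.0005],
    [0.005, -0.001, 0.001, 0.001],
]

ranking = mean_abs_shap(shap_rows, features)
```

Because absolute values are averaged, a feature whose attributions alternate in sign still ranks highly if its contributions are large, which is why this summary measures magnitude of influence rather than direction.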

4.6. Robustness Checks

To verify the stability of the forecasting conclusions and to ensure that the reported performance is not driven by a single evaluation setting, we conduct robustness checks along four dimensions. First, we assess whether model rankings remain stable when the forecast horizon is extended beyond one-step-ahead prediction. Second, we apply pairwise Diebold–Mariano (DM) tests to evaluate whether observed differences in forecast errors are statistically meaningful. Third, to broaden the benchmark scope in a methodologically coherent way, we include an auxiliary econometric benchmark based on GARCH volatility forecasting, which targets conditional heteroskedasticity rather than the conditional mean. Fourth, we conduct an event-control robustness check by excluding ±5 trading-day windows around major geopolitical escalation episodes and key OPEC+ policy announcements, and re-estimating the models under the same time-ordered expanding-window protocol to verify that the main model ranking is not driven by a small number of extreme shock days.
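The event-control check amounts to masking a symmetric window of trading days around each event before re-estimation. A minimal sketch follows, with hypothetical event positions expressed as trading-day indices:

```python
def event_window_mask(n_obs, event_indices, half_width=5):
    """Return a keep-flag per observation; False inside +/- half_width of any event."""
    keep = [True] * n_obs
    for e in event_indices:
        for t in range(max(0, e - half_width), min(n_obs, e + half_width + 1)):
            keep[t] = False
    return keep

# Hypothetical sample of 30 trading days with events on days 3 and 20.
keep = event_window_mask(30, [3, 20])
n_kept = sum(keep)
```

The first window is truncated at the start of the sample (days 0 to 8), the second spans days 15 to 25, and 10 of the 30 observations remain.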

4.6.1. Multi-Horizon Robustness: Performance Across Forecast Horizons

Table 10 and Table 11 report multi-step forecasting performance for the pre- and post-COVID regimes across 5-, 10-, and 20-step horizons using RMSE, MAE, and out-of-sample R². The results show that forecast accuracy deteriorates as horizons lengthen, consistent with expectations in financial time series where predictability decays and uncertainty compounds over time. ARIMA and LSTM deteriorate rapidly, producing strongly negative out-of-sample R² values and larger RMSE/MAE, with the degradation especially pronounced in the post-COVID sample. By contrast, XGBoost remains comparatively resilient: it continues to deliver substantially lower RMSE and MAE than competing models and achieves positive R² at the 10- and 20-step horizons in both regimes, indicating that it can retain incremental explanatory power relative to the naïve benchmark even at longer horizons.

Negative out-of-sample R² values in this setting primarily reflect error accumulation and benchmark difficulty in long-horizon forecasting rather than implementation issues. Out-of-sample R² becomes negative whenever the model’s mean squared forecast error exceeds the variance of the target around the benchmark forecast (e.g., historical mean or a random-walk-type baseline), a situation that becomes more likely as the forecast horizon increases and the signal-to-noise ratio declines. Moreover, under multi-step forecasting, errors can propagate forward—particularly when forecasts are generated recursively—because early-step deviations contaminate later-step inputs, amplifying MSFE and pushing R² below zero. This mechanism is exacerbated under nonstationarity and volatility clustering in the post-COVID regime, where structural shifts reduce the stability of learned temporal patterns. As a practical mitigation, rolling re-forecasting (iteratively updating the forecast path as new observations arrive) can reduce reliance on long recursive trajectories and may improve long-horizon stability; however, the reported results already highlight that XGBoost is the most robust model across horizons in both predictive and economic dimensions.
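The sign of out-of-sample R² follows directly from its definition as one minus the ratio of the model's squared forecast errors to those of the benchmark. The sketch below uses an expanding historical-mean benchmark and two hypothetical forecast sets, one helpful and one harmful; all numbers are invented.

```python
def expanding_mean_benchmark(history, actual):
    """Historical-mean forecast for each test point, using only earlier data."""
    out, total, count = [], sum(history), len(history)
    for y in actual:
        out.append(total / count)
        total += y      # the realized value becomes available for the next forecast
        count += 1
    return out

def oos_r2(actual, forecast, benchmark):
    """1 - SSE(model) / SSE(benchmark); negative when the model loses to the benchmark."""
    sse_model = sum((y - f) ** 2 for y, f in zip(actual, forecast))
    sse_bench = sum((y - b) ** 2 for y, b in zip(actual, benchmark))
    return 1.0 - sse_model / sse_bench

history = [0.0, 0.0]                      # hypothetical pre-test observations
actual = [1.0, -1.0, 1.0, -1.0]           # hypothetical realized values
bench = expanding_mean_benchmark(history, actual)

good_forecast = [0.5, -0.5, 0.5, -0.5]    # right direction, half the magnitude
bad_forecast = [-0.5, 0.5, -0.5, 0.5]     # wrong direction

r2_good = oos_r2(actual, good_forecast, bench)
r2_bad = oos_r2(actual, bad_forecast, bench)
```

The second forecast set yields a negative value because its squared errors exceed those of the naive mean, the same mechanism that drives the strongly negative long-horizon values discussed above.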

Figure 7 illustrates these patterns: error magnitudes for ARIMA and LSTM escalate steeply with horizon length, while XGBoost remains comparatively stable, underscoring its robustness for medium-horizon forecasting under both regimes.

4.6.2. Statistical Robustness: Diebold–Mariano Tests

Table 12 reports Diebold–Mariano (DM) test results for pairwise comparisons at the five-step horizon. Most model comparisons do not exhibit statistically significant differences at the 5% level, suggesting that forecast errors across econometric and machine learning approaches are broadly comparable in statistical terms at this horizon. This outcome is consistent with the high volatility and noise typical of financial return series, which can reduce statistical power in multi-step comparisons and make it difficult to detect small but economically meaningful differences through formal hypothesis testing alone.

A notable exception emerges: XGBoost significantly outperforms LSTM (DM = 2.7440, p = 0.0062). This provides formal statistical evidence that tree-based ensemble learning captures nonlinear dynamics more effectively than the recurrent neural architecture in this setting. Although several other pairings are not statistically significant, the economic evaluation reported in Section 4.7 shows that XGBoost delivers superior directional accuracy and stronger risk-adjusted performance in trading implementation. Taken together, the combined multi-horizon robustness patterns and the economic-value evidence support the practical relevance of the XGBoost framework even when strict statistical superiority is not uniformly established across all DM pairings.
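For readers who want to reproduce a comparison of this kind, the following is a minimal sketch of a Diebold–Mariano statistic under squared-error loss. It assumes a rectangular-kernel long-run variance with `horizon - 1` autocovariance lags and a normal approximation for the p-value; the sign convention (positive when the first model has the larger loss), the function names, and the toy errors are assumptions, and the study's own implementation may differ (for example, through a small-sample correction).

```python
import math
from typing import Sequence, Tuple


def two_sided_normal_p(stat: float) -> float:
    """Two-sided p-value of a statistic under the standard normal."""
    return math.erfc(abs(stat) / math.sqrt(2.0))


def diebold_mariano(errors_a: Sequence[float],
                    errors_b: Sequence[float],
                    horizon: int = 1) -> Tuple[float, float]:
    """DM statistic for squared-error loss; positive if model A is worse."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equally long error series")
    # Loss differential: squared error of A minus squared error of B.
    d = [a * a - b * b for a, b in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = sum(d) / n

    def autocov(lag: int) -> float:
        return sum((d[t] - mean_d) * (d[t - lag] - mean_d)
                   for t in range(lag, n)) / n

    # Long-run variance with horizon - 1 autocovariance terms.
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")
    stat = mean_d / math.sqrt(long_run_var / n)
    return stat, two_sided_normal_p(stat)


# Hypothetical forecast errors for two models (illustration only).
errors_a = [0.030, -0.020, 0.025, -0.030, 0.020, -0.025]
errors_b = [0.010, -0.010, 0.005, -0.010, 0.010, -0.005]
stat_ab, p_ab = diebold_mariano(errors_a, errors_b)
```

Under the normal approximation, a statistic of 2.7440 corresponds to a two-sided p-value of roughly 0.006, close to the value reported above.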

4.6.3. Auxiliary Econometric Benchmark: GARCH Volatility Forecasting

To expand the benchmark scope in a way that is methodologically aligned with modern econometric practice, we additionally estimate a GARCH(1,1) model as an auxiliary benchmark focused on conditional volatility dynamics. This benchmark is conceptually distinct from the main mean-forecast comparisons in Section 4.4, as GARCH is designed to forecast time-varying conditional variance rather than the conditional mean or price level. We therefore evaluate GARCH using one-step-ahead out-of-sample volatility forecasts for daily TASI returns, under the same pre-/post-COVID regime split and a strictly time-ordered expanding-window protocol in which the model is re-estimated recursively over time. Volatility forecasts are assessed against a realized-volatility proxy constructed from return variability, and performance is summarized using standard volatility losses (RMSE, MAE, and QLIKE). Table 13 reports the resulting out-of-sample volatility forecasting losses, providing a variance-focused reference point that complements the mean-forecast benchmarks in Section 4.4. Overall, the GARCH benchmark captures broad volatility clustering and responds to turbulence episodes, offering additional evidence on regime-dependent risk dynamics without changing the interpretation of the primary return-forecasting and economic-value results in the main text.
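As a rough illustration of the two ingredients named above, the sketch below shows the one-step GARCH(1,1) variance recursion and a QLIKE loss. The QLIKE form used here (ratio minus log-ratio minus one, which equals zero for a perfect forecast) is one common normalization and is an assumption, as are the parameter values and function names; the text does not state which variant or estimates the study uses.

```python
import math
from typing import Sequence


def garch11_next_variance(omega: float, alpha: float, beta: float,
                          last_return: float, last_variance: float) -> float:
    """One-step-ahead conditional variance: omega + alpha*r^2 + beta*sigma^2."""
    return omega + alpha * last_return ** 2 + beta * last_variance


def qlike(realized_var: Sequence[float], forecast_var: Sequence[float]) -> float:
    """Average QLIKE loss (normalized form); inputs must be strictly positive."""
    if len(realized_var) != len(forecast_var) or not realized_var:
        raise ValueError("inputs must be non-empty and equally long")
    total = 0.0
    for rv, h in zip(realized_var, forecast_var):
        ratio = rv / h
        total += ratio - math.log(ratio) - 1.0
    return total / len(realized_var)


# Hypothetical parameters and inputs (not estimates from the paper).
next_var = garch11_next_variance(omega=1e-6, alpha=0.10, beta=0.85,
                                 last_return=0.01, last_variance=1e-4)

# QLIKE is asymmetric: under-predicting variance costs more than over-predicting.
loss_under = qlike([1.0], [0.5])
loss_over = qlike([1.0], [2.0])
```

The asymmetry is the usual reason QLIKE is reported next to RMSE and MAE for volatility forecasts, since under-stated risk tends to be the costlier error.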

Overall, the robustness analysis confirms that the machine-learning advantage—particularly for XGBoost—is not confined to one-step prediction, remains visible at medium horizons, is supported by formal statistical evidence in the key comparison against LSTM, and remains consistent when the benchmark scope is extended to include an auxiliary volatility-focused econometric model.

4.6.4. Event-Control Robustness: Major Shocks and Policy Events

To assess whether the main forecasting conclusions are disproportionately driven by a small number of major exogenous shock episodes, we conduct an event-control robustness check based on time-period exclusion. Specifically, we compile a fixed list of major event dates covering large geopolitical escalation episodes and major OPEC+ policy announcements during the sample period, and exclude a symmetric ±5 trading-day window around each event from the evaluation sample. We then re-run the same time-ordered expanding-window forecasting protocol within each regime and compare out-of-sample performance to the baseline. Table 14 summarizes the results using RMSE, out-of-sample R², and hit rate, together with an explicit indicator of whether model ranking changes under event exclusion.

The results indicate that excluding major-event windows produces only modest quantitative shifts in error metrics but does not alter the qualitative conclusions. For XGBoost, RMSE increases slightly and both R² and hit rate decline marginally in both regimes, consistent with the removal of high-information shock days; however, the model retains the best overall ranking under event exclusion in both pre- and post-COVID samples. LSTM exhibits weaker stability, with R² deteriorating and hit rate declining under exclusion, while VAR shows small metric changes that do not translate into a reversal of performance ordering. Overall, the unchanged ranking under event exclusion indicates that the superior performance of XGBoost is not an artifact of a few isolated geopolitical or OPEC+ policy episodes but reflects a more generalizable predictive advantage across the broader sample.
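The exclusion step can be sketched as a boolean mask over the trading calendar. This is a minimal illustration under stated assumptions: the event is anchored at the first trading day on or after the event date, events falling after the last trading day are skipped, and the toy calendar and event date are hypothetical rather than the study's actual event list.

```python
from bisect import bisect_left
from datetime import date, timedelta
from typing import List, Sequence


def event_exclusion_mask(trading_days: Sequence[date],
                         event_dates: Sequence[date],
                         half_window: int = 5) -> List[bool]:
    """Return True for days kept, False for days within the event windows.

    trading_days must be sorted in ascending order.
    """
    keep = [True] * len(trading_days)
    for event in event_dates:
        # Position of the first trading day on or after the event date.
        anchor = bisect_left(trading_days, event)
        if anchor == len(trading_days):
            continue  # event falls after the sample; nothing to exclude
        lo = max(0, anchor - half_window)
        hi = min(len(trading_days) - 1, anchor + half_window)
        for i in range(lo, hi + 1):
            keep[i] = False
    return keep


# Toy calendar: every calendar day is treated as a trading day for simplicity.
trading_days = [date(2021, 1, 3) + timedelta(days=i) for i in range(20)]
hypothetical_events = [trading_days[10]]
mask = event_exclusion_mask(trading_days, hypothetical_events)
```

Counting the window in trading-day positions rather than calendar days keeps it symmetric around weekends and holidays.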

4.7. Economic Value of Forecasts

Statistical accuracy is not, by itself, sufficient to establish practical relevance in financial forecasting, because economically useful signals must translate into implementable trading gains after accounting for risk and market frictions. We therefore evaluate the economic value of each forecasting model using a direction-based trading strategy and report both return-based and risk-adjusted performance. Table 15 summarizes the baseline trading outcomes by regime using hit rate (directional accuracy), the Sharpe ratio, and cumulative return. Traditional econometric benchmarks (ARIMA, ARIMAX, and VAR) provide limited economic value, with hit rates close to 50%, Sharpe ratios near zero or negative, and cumulative returns that are economically negligible. LSTM yields small but unstable improvements, with modestly higher hit rates and positive but limited risk-adjusted performance. By contrast, XGBoost consistently generates the strongest economic performance in both regimes, exhibiting higher directional reliability and materially larger cumulative gains, indicating that the predictive edge documented in the forecasting results can translate into economically meaningful trading outcomes.
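The three summary measures can be sketched as follows. The trading rule (long when the forecast is positive, short otherwise), the per-period unannualized Sharpe ratio, the use of simple compounding, and all names and numbers are assumptions made for illustration; the exact rule used in the study is not restated in this section.

```python
import statistics
from typing import List, Sequence


def positions_from_forecasts(forecasts: Sequence[float]) -> List[float]:
    """Assumed rule: long (+1) if the forecast is positive, otherwise short (-1)."""
    return [1.0 if f > 0 else -1.0 for f in forecasts]


def hit_rate(forecasts: Sequence[float], actuals: Sequence[float]) -> float:
    """Share of periods in which forecast and realized return share a sign."""
    hits = sum(1 for f, a in zip(forecasts, actuals) if (f > 0) == (a > 0))
    return hits / len(actuals)


def sharpe_ratio(returns: Sequence[float]) -> float:
    """Per-period Sharpe ratio with a zero risk-free rate (not annualized)."""
    return statistics.mean(returns) / statistics.stdev(returns)


def cumulative_return(returns: Sequence[float]) -> float:
    """Compounded return over the whole period."""
    wealth = 1.0
    for r in returns:
        wealth *= 1.0 + r
    return wealth - 1.0


# Hypothetical forecasts and realized returns (illustration only).
forecasts = [0.002, -0.001, 0.003, -0.002]
actuals = [0.010, -0.005, -0.004, -0.006]
positions = positions_from_forecasts(forecasts)
strategy_returns = [p * r for p, r in zip(positions, actuals)]
```

In this toy case the strategy is right on three of four days, so the hit rate is 0.75 and the compounded return is positive.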

4.7.1. Strategy Performance and Equity-Curve Evidence

To provide a transparent economic interpretation beyond summary statistics, we also examine the evolution of the net equity curves implied by each model’s trading signals. The net equity curve comparison illustrates that the XGBoost-based strategy reaches the highest terminal wealth over the evaluation period, whereas VAR remains comparatively flat and LSTM exhibits weaker and less persistent growth. This visual evidence complements Table 16 by showing that XGBoost’s gains are not driven by isolated spikes but by sustained periods of positive performance, while the benchmark strategies struggle to generate durable profitability. Figure 8 reports the net equity curves for the VAR, LSTM, and XGBoost strategies over the post-COVID evaluation window used for economic value assessment.

Figure 8 plots the net equity curves of the VAR-, LSTM-, and XGBoost-based strategies over the post-COVID evaluation window after transaction costs, providing a time-resolved view of how cumulative gains are accumulated and subsequently exposed to adverse phases. The figure shows a clear separation in wealth accumulation: XGBoost builds the strongest net equity trajectory for most of the sample, reaching substantially higher peaks than VAR and LSTM, which is consistent with its superior net cumulative return and risk-adjusted performance reported in the corresponding economic-value tables. However, the equity path also reveals that XGBoost’s higher profitability is not monotonic; it undergoes notable equity compression in the later part of the window, indicating that its superior average performance is accompanied by exposure to drawdown episodes, rather than reflecting a smooth, low-risk return stream. By contrast, VAR exhibits a more moderate equity profile with smaller intermediate peaks and a gradual deterioration toward the end of the sample, consistent with a weaker signal that fails to maintain profitability under shifting conditions. LSTM displays the weakest equity accumulation overall, with extended periods of stagnation and slower recovery after losses, reflecting a less stable trading signal despite intermittent rebounds. Overall, Figure 8 complements the drawdown plot by illustrating that the practical differences across strategies arise not only from terminal wealth levels but also from the timing, persistence, and reversibility of performance across market phases—thereby motivating the inclusion of drawdown-based diagnostics (maximum drawdown and Calmar ratio) alongside Sharpe and cumulative return to provide a transparent, multi-dimensional assessment of economic value.

4.7.2. Trading Frictions and Transaction-Cost Robustness

Because trading profitability can be overstated under frictionless assumptions, we next evaluate whether the reported economic gains persist once realistic implementation frictions are incorporated. Specifically, we compute both gross (pre-cost) and net (post-cost) performance and report turnover as a proxy for trading intensity. This extension is important because strategies that trade frequently may experience substantial performance erosion due to transaction costs, even when their gross Sharpe ratios appear favorable. The results, summarized in Table 16, show that the ranking of the strategies remains qualitatively stable after costs are applied, with XGBoost maintaining the strongest net performance among the compared approaches. This indicates that the economic advantage is not merely an artifact of frictionless backtesting assumptions.
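The gross-to-net adjustment can be sketched with a proportional cost charged on each change in position. The cost rate of 0.001 per unit traded, the flat starting position, and the function name are hypothetical choices for illustration and are not the cost assumptions behind Table 16.

```python
from typing import List, Sequence, Tuple


def net_strategy_returns(positions: Sequence[float],
                         asset_returns: Sequence[float],
                         cost_per_unit: float = 0.001
                         ) -> Tuple[List[float], float]:
    """Return (net returns, average turnover) for a position path.

    Turnover in each period is the absolute change in position, starting flat.
    """
    net: List[float] = []
    traded_total = 0.0
    previous = 0.0  # assumed flat before the first signal
    for position, ret in zip(positions, asset_returns):
        traded = abs(position - previous)
        net.append(position * ret - cost_per_unit * traded)
        traded_total += traded
        previous = position
    return net, traded_total / len(net)


# Hypothetical positions and asset returns (illustration only).
positions = [1.0, 1.0, -1.0, -1.0]
asset_returns = [0.010, -0.005, -0.004, 0.006]
gross = [p * r for p, r in zip(positions, asset_returns)]
net, avg_turnover = net_strategy_returns(positions, asset_returns)
```

A switch from long to short counts as two units traded, so signals that flip often lose proportionally more to costs than signals that persist.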

4.7.3. Downside Risk and Drawdown-Adjusted Performance

Because the Sharpe ratio and cumulative return do not fully characterize strategy risk, we extend the economic evaluation with downside-risk metrics that capture tail-loss exposure. In particular, we compute Maximum Drawdown (MDD), defined as the largest peak-to-trough decline in the net equity curve, and the Calmar ratio, defined as the annualized net return divided by the absolute value of MDD. These measures provide a complementary lens to Sharpe-based risk adjustment by explicitly penalizing deep and prolonged drawdowns, which are central to the practical feasibility of deploying forecasting-based strategies under turbulent conditions. Table 17 reports the expanded risk–return profile computed on the same net equity curves used in the transaction-cost evaluation. The results show that XGBoost attains the highest net cumulative return and the strongest net Sharpe ratio while also delivering the best drawdown-adjusted performance (highest Calmar ratio), although it experiences a deeper drawdown than LSTM. LSTM exhibits a smaller maximum drawdown but lower terminal net wealth, yielding an intermediate Calmar ratio. VAR remains dominated across both return-based and drawdown-adjusted criteria.
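The two definitions above translate directly into code. The sketch below is a minimal illustration; the default of 252 periods per year, the geometric annualization, and the function names are assumptions, and the toy returns are hypothetical.

```python
from typing import List, Sequence


def equity_curve(net_returns: Sequence[float], start: float = 1.0) -> List[float]:
    """Compound net returns into an equity path that includes the start value."""
    curve = [start]
    for r in net_returns:
        curve.append(curve[-1] * (1.0 + r))
    return curve


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, returned as a non-positive fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


def calmar_ratio(net_returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized net return divided by the absolute maximum drawdown."""
    curve = equity_curve(net_returns)
    years = len(net_returns) / periods_per_year
    annualized = (curve[-1] / curve[0]) ** (1.0 / years) - 1.0
    mdd = max_drawdown(curve)
    if mdd == 0.0:
        return float("inf")  # no drawdown observed in the sample
    return annualized / abs(mdd)


# Hypothetical example: three periods treated as one full year.
toy_returns = [0.10, -0.05, 0.08]
toy_mdd = max_drawdown(equity_curve(toy_returns))
toy_calmar = calmar_ratio(toy_returns, periods_per_year=3)
```

Because the denominator is the single worst decline, a strategy with higher terminal wealth can still rank lower on Calmar if that wealth came with one deep drawdown.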

Figure 9 summarizes the evolution of net drawdowns for the VAR-, LSTM-, and XGBoost-based trading strategies over the post-COVID evaluation window, providing a direct view of downside risk under transaction-cost-adjusted performance. Two features are particularly evident. First, XGBoost achieves the strongest wealth accumulation in earlier parts of the window but also experiences the most pronounced late-period drawdown, indicating that its higher-return profile is accompanied by exposure to sharp downside episodes. Second, LSTM exhibits comparatively persistent and deeper drawdowns over extended intervals, reflecting weaker recovery dynamics and less stable timing signals despite occasional rebounds. VAR, while generally less profitable in cumulative terms, displays a drawdown path that is relatively smoother for much of the sample, yet it ultimately also ends in a sizable negative drawdown. Overall, the figure highlights that differences across models are not only reflected in average risk-adjusted returns but also in the severity and persistence of downside deviations from peak equity—motivating the complementary use of maximum drawdown and drawdown-based ratios (e.g., Calmar) alongside Sharpe and cumulative return in the economic-value analysis.

The empirical results have several practical implications that extend beyond statistical forecast accuracy. First, the economic-value analysis shows that models can differ substantially in their ability to translate predictive signals into tradable performance once transaction costs and downside-risk considerations are taken into account. This motivates a deployment perspective in which forecasting models are evaluated not only by error metrics but also by their robustness under realistic implementation frictions and risk controls, particularly in crisis-prone markets. Second, the regime-dependent evidence (covering structural breaks, causality shifts, and changes in feature relevance) suggests that the informational drivers of Saudi equity returns evolve meaningfully across tranquil and turbulent periods. Consequently, stakeholders who rely on market signals for decision-making should adopt monitoring frameworks that are adaptive to structural change and that explicitly track downside exposure rather than assuming stable relationships. Finally, the strong role of oil- and macro-related channels in explaining post-COVID dynamics indicates that market oversight and policy evaluation can benefit from complementary, data-driven early-warning indicators that are sensitive to external shocks and regime transitions. Based on these considerations, Table 18 summarizes the main policy and practical implications for key stakeholder groups, linking the study’s findings to actionable guidance.

5. Conclusions

This study investigated regime-dependent forecasting of the Saudi stock market by combining macro-controlled dependence analysis with nonlinear prediction models in a pre- and post-COVID setting. The empirical results consistently indicate that linear econometric benchmarks struggle to remain reliable under volatility clustering and structural breaks, whereas nonlinear machine learning—particularly gradient-boosted decision trees—provides substantially stronger predictive accuracy and more robust economic value under regime shifts. Beyond performance comparisons, the analysis also suggests that the effective information set is not stable over time: short-horizon autoregressive signals dominate in calmer conditions, while oil and macroeconomic variables become more influential during turbulent periods, consistent with a regime-sensitive transmission mechanism in an oil-linked emerging market.

Nevertheless, several limitations should be acknowledged when interpreting these findings and generalizing them beyond the study context. First, the regime division relies on a specific structural break design and a two-regime partition, which may not fully capture gradual transitions or multiple intermediate states. While robustness checks around alternative split dates support the qualitative conclusions, regime boundaries remain a modeling choice and may affect estimated magnitudes. Second, macroeconomic controls (inflation and interest rates) are frequency-matched from monthly to daily series, which necessarily reduces intramonth variation and can attenuate short-horizon macro-driven dynamics. As a result, the role of macro variables should be interpreted as reflecting low-frequency informational effects rather than high-frequency shocks. Third, although the evaluation adopts a time-ordered out-of-sample design and includes regularization controls, extreme performance metrics can still be sensitive to feature construction, hyperparameter choices, and the specific forecast target formulation. Fourth, the economic-value evaluation is implemented under a stylized trading rule and does not fully incorporate market microstructure frictions, execution constraints, or alternative portfolio objectives. These considerations imply that, while the ranking of models is informative, the absolute profitability estimates should be interpreted as indicative rather than as deployable trading guarantees.

Future research can extend this work in several methodological directions. A natural next step is to replace the deterministic two-regime split with probabilistic regime inference, such as multi-state Markov-switching or regime-dependent time-varying parameter models, and then evaluate whether dynamic regime probabilities can be used to gate or weight machine learning forecasts in real time. Another promising direction is to expand the nonlinear modeling space toward architectures designed for multiscale dynamics, including hybrid designs that combine long-memory sequence modeling with sparse nonlinear interactions, and to assess whether such architectures remain stable under turbulent regimes through systematic sensitivity and ablation testing. From an evaluation perspective, future studies should incorporate richer time-series cross-validation schemes, more explicit uncertainty quantification, and trading simulations that include realistic transaction costs, liquidity constraints, and risk budgeting.

Finally, reproducibility and scalability would benefit from a stronger data infrastructure for oil–equity forecasting research. Future work should construct a curated, version-controlled dataset pipeline that preserves raw sources, transformation logs (e.g., frequency alignment and stationarity adjustments), and experiment metadata (model configurations, seeds, and window definitions). Such an infrastructure would enable rigorous auditability of results, facilitate fair comparisons across model families, and support incremental extensions (additional macro variables, sector indices, global risk proxies, or intraday signals) without compromising comparability across studies. In this sense, methodological advances and data-engineering discipline should be treated as complementary requirements for building reliable regime-aware forecasting systems in emerging oil-linked markets.

Author Contributions

All authors have accepted responsibility for the entire content of this manuscript and consented to its submission to the journal. Conceptualization, P.A. and N.D.; methodology, E.S. and N.D.; software, E.S.; validation, E.S.; formal analysis, N.D. and E.S.; investigation, E.S.; resources, N.D.; data curation, E.S.; writing—original draft preparation, P.A. and N.D.; writing—review and editing, N.D. and E.S.; visualization, E.S.; supervision, N.D.; project administration, M.K.M.; funding acquisition, P.A. All authors have read and agreed to the published version of the manuscript.

Funding

This article’s processing charges are funded by Prince Sultan University.

Data Availability Statement

The raw OHLCV figure series, engineered indicator features, and crash labels used in this study, together with all Python code for data preprocessing, model training, threshold calibration, SHAP-based interpretability, and variance decomposition, are openly available in a public Git repository at: https://github.com/eddys2007-git/L06-research (accessed on 9 January 2026). All experiments were implemented in Python (version 3.12). The exact package versions required to reproduce the results are provided in the repository (requirements/environment specification files). This repository also includes a short README describing the file structure and instructions for reproducing the main experiments and figures reported in the paper.

Acknowledgments

The authors would like to acknowledge the support of Prince Sultan University in paying the Article Processing Charges (APC) of this publication.

Conflicts of Interest

Author Manoj Kumar Manish was employed by the company The Hongkong and Shanghai Banking Corporation Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Table A1. LSTM Sensitivity Analysis Across Look-back Windows and Dropout Rates.

Regime	Lookback	Dropout	RMSE	MAE	R²	Sharpe	CumRet
Pre-COVID	7	0.1	0.0267	0.0222	−9.1612	0.0122	0.0595
Pre-COVID	7	0.2	0.0153	0.0130	−2.3306	0.0122	0.0595
Pre-COVID	7	0.3	0.0121	0.0095	−1.1062	−0.0010	−0.0051
Pre-COVID	14	0.1	0.0127	0.0102	−1.3087	0.0158	0.0763
Pre-COVID	14	0.2	0.0088	0.0062	−0.1164	0.0619	0.2982
Pre-COVID	14	0.3	0.0166	0.0135	−2.9639	−0.0153	−0.0739
Pre-COVID	21	0.1	0.0089	0.0063	−0.1182	0.0138	0.0661
Pre-COVID	21	0.2	0.0291	0.0255	−10.9952	0.0127	0.0609
Pre-COVID	21	0.3	0.0374	0.0333	−18.8276	0.0127	0.0609
Pre-COVID	30	0.1	0.0111	0.0087	−0.7532	−0.0120	−0.0564
Pre-COVID	30	0.2	0.0222	0.0196	−5.9858	0.0120	0.0564
Pre-COVID	30	0.3	0.0700	0.0631	−68.5459	0.0120	0.0564
Post-COVID	7	0.1	0.0078	0.0050	−0.0351	0.0410	0.1060
Post-COVID	7	0.2	0.0084	0.0060	−0.2064	0.0410	0.1060
Post-COVID	7	0.3	0.0079	0.0051	−0.0487	0.0410	0.1060
Post-COVID	14	0.1	0.0076	0.0047	−0.0131	0.0315	0.0778
Post-COVID	14	0.2	0.0075	0.0047	−0.0087	0.0355	0.0877
Post-COVID	14	0.3	0.0077	0.0050	−0.0452	0.0315	0.0778
Post-COVID	21	0.1	0.0076	0.0046	−0.0165	−0.0283	−0.0689
Post-COVID	21	0.2	0.0076	0.0047	−0.0140	0.0463	0.1124
Post-COVID	21	0.3	0.0075	0.0046	−0.0002	0.0693	0.1683
Post-COVID	30	0.1	0.0078	0.0049	−0.0363	0.0512	0.1222
Post-COVID	30	0.2	0.0076	0.0047	−0.0069	0.0512	0.1222
Post-COVID	30	0.3	0.0077	0.0046	−0.0089	0.0495	0.1180

Figure 1. Research Framework.

Figure 2. Daily log returns and rolling volatility of Saudi equity (TASI) and crude oil. (a) The left panel plots the log returns of TASI and crude oil from 2010 to 2025, with the shaded area denoting the post-COVID period. Crude oil returns display larger swings and more extreme episodes, especially during the 2020 collapse, while TASI returns remain comparatively muted. (b) The right panel shows the 60-day rolling standard deviation of returns, highlighting recurrent volatility clustering and the sharp surge in both markets at the onset of the pandemic. The sustained divergence in volatility thereafter underscores the persistence of financial stress in Saudi equities despite partial stabilization of oil markets.

Figure 3. Structural Break Tests and Regime Probabilities for TASI Returns. Left panel (a): sup-F (Chow) statistics across candidate break dates, with dashed vertical lines marking the COVID-19 reference (red) and peak statistic (green). Right panel (b): smoothed probability of the high-volatility regime from a two-state Markov-switching model with crude oil returns as exogenous regressor; shaded region denotes the post-COVID period.

Figure 4. Impulse responses of TASI returns to a crude oil shock before and after COVID. (Left panel): pre-COVID response (lag length = 1). (Right panel): post-COVID response (lag length = 7). The post-COVID dynamics display oscillations and delayed convergence, in contrast to the stable pre-COVID adjustment.

Figure 5. TASI Return Forecast Comparison, Pre- and Post-COVID, using ARIMA, ARIMAX, VAR, LSTM, and XGBoost. The (top panel) corresponds to the pre-COVID period, while the (bottom panel) displays the post-COVID regime. The solid line denotes the actual TASI returns, while other lines indicate predictions from each model alongside their reported R² statistics.

Figure 6. SHAP-based feature importance for the XGBoost return-forecasting model across regimes. The figure reports mean absolute SHAP values (mean |SHAP|) computed on the out-of-sample test forecasts under the time-ordered evaluation protocol, shown separately for the pre-COVID (left) and post-COVID (right) subsamples. Higher bars indicate greater average contribution of each predictor to the model’s predicted daily TASI returns, highlighting how the relative importance of lagged equity dynamics, oil-related variables, and macro controls shifts across regimes.

Figure 7. Forecasting Performance Across Horizons. RMSE as a function of forecast horizon (1, 5, 10, and 20 steps) for ARIMA, LSTM, and XGBoost. The left panel corresponds to the pre-COVID regime and the right panel to the post-COVID regime. The figure shows that while ARIMA and LSTM errors escalate rapidly with horizon length, XGBoost maintains relatively stable performance and delivers substantially lower error magnitudes.

Figure 8. Net equity curve comparison of forecasting-based trading strategies (VAR, LSTM, and XGBoost) under transaction costs. The figure shows the evolution of net cumulative wealth over the evaluation period, illustrating the relative economic value generated by each forecasting model after accounting for trading frictions.

Figure 9. Net drawdown comparison of forecasting-based trading strategies (VAR, LSTM, and XGBoost) under transaction costs. The figure plots the cumulative peak-to-trough losses of the net equity curves over the evaluation period, highlighting differences in downside-risk exposure and drawdown persistence across models.
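
A drawdown path of the kind plotted in Figure 9 can be computed from an equity curve as sketched below. The sketch assumes drawdown is measured as the percentage decline from the running peak; the equity values and function names are hypothetical and serve only to illustrate the calculation.

```python
def drawdown_series(equity):
    """Drawdown at each date: equity / running peak - 1 (never positive)."""
    peak = equity[0]
    out = []
    for value in equity:
        peak = max(peak, value)
        out.append(value / peak - 1.0)
    return out

def max_drawdown(equity):
    """Largest peak-to-trough loss over the whole path (most negative drawdown)."""
    return min(drawdown_series(equity))

# Hypothetical net equity curve, for illustration only.
equity = [1.00, 1.10, 0.99, 1.05, 1.21, 1.089]
dd = drawdown_series(equity)
mdd = max_drawdown(equity)   # about -0.10: a 10% fall from the running peak
```

A Calmar-type ratio then divides an annualized return by the absolute value of the maximum drawdown; the annualization convention behind the reported ratios is not stated in the material here, so it is left out of the sketch.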

Table 1. Summary Statistics of Key Variables (2010–2025).

Variable	Count	Mean	Std	Min	25%	50%	75%	Max	Skewness	Excess Kurtosis
TASI returns	4663	0.0001	0.0092	−0.0868	−0.0025	0.0000	0.0036	0.0855	−0.9971	15.7646
Crude returns	4663	−0.0000	0.0194	−0.3312	−0.0061	0.0000	0.0069	0.2287	−1.3945	51.9516
Inflation rate	4663	1.9584	1.7807	−3.2300	1.2400	2.2600	2.6500	6.1600	−0.4861	1.2812
Interest rate	4663	2.6191	1.4813	1.0000	2.0000	2.0000	2.7500	6.0000	1.2859	0.3773

Table 2. Correlation Matrix of Key Variables (2010–2025).

	TASI_ret	Crude_ret	Inflation_rate	Interest_rate
TASI_ret	1.000	0.197	0.025	−0.017
Crude_ret	0.197	1.000	0.009	−0.022
Inflation_rate	0.025	0.009	1.000	−0.110
Interest_rate	−0.017	−0.022	−0.110	1.000

Table 3. Key Statistics for Structural Break and Regime Diagnostics.

Test	Statistic/Value	Date	p-Value	Interpretation
sup-F (Chow) peak	29.74	26 March 2020	–	Strong evidence of local break (COVID shock)
CUSUM	0.6332	–	0.8175	No global instability detected
Regime variance (low)	2.97 × 10⁻⁶	–	–	Stable regime
Regime variance (high)	1.51 × 10⁻⁴	–	–	High-volatility pandemic regime

Table 4. Bai–Perron Multiple Breakpoint Estimates (Oil and TASI Returns; Selection by BIC).

Series	Method	Selected Breaks (m)	Estimated Break Dates	BIC (Lower Is Better)	Notes
OIL	mean_shift	3	2020-03-04, 2020-04-26, 2020-06-08	−36,897.58	Break cluster around early 2020
TASI	mean_shift	3	2011-02-27, 2011-03-04, 2011-03-10	−43,751.56	Early-sample adjustment
JOINT (OIL + TASI)	regression	3	2011-02-27, 2011-03-04, 2011-03-10	−43,880.79	Early-sample system adjustment

Table 5. Granger Causality Tests (VAR with Macro Controls).

Sample	Selected VAR Lag	F: Crude → TASI	p-Value (Crude → TASI)	F: TASI → Crude	p-Value (TASI → Crude)
Pre-COVID	1	24.773	0.0000	2.146	0.1429
Post-COVID	7	3.497	0.0010	10.082	0.0000

Table 6. Forecast Performance on Daily Returns (Out-of-Sample, Pre-/Post-COVID).

Model	Regime	RMSE	MAE	R²
ARIMA	PRE	0.008720	0.005861	−0.033884
ARIMAX	PRE	0.008473	0.005701	−0.029342
VAR	PRE	0.008592	0.005824	−0.058226
LSTM	PRE	0.008499	0.005969	−0.035645
XGBoost	PRE	0.008112	0.005193	0.046130
ARIMA	POST	0.007942	0.004658	−0.019998
ARIMAX	POST	0.007995	0.004759	−0.033884
VAR	POST	0.007937	0.004757	−0.018749
LSTM	POST	0.008020	0.004861	−0.040216
XGBoost	POST	0.007823	0.004570	0.010241
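
The three columns of Table 6 can be reproduced from a vector of actual and predicted returns as sketched below. The sketch assumes the out-of-sample R² is defined as one minus the ratio of the squared forecast errors to the squared deviations of the actual returns from their test-sample mean; the paper may use a different benchmark (for example a historical mean), and the sample values are hypothetical.

```python
import math

def forecast_metrics(actual, predicted):
    """Return (RMSE, MAE, R2) for a set of out-of-sample forecasts."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)
    rmse = math.sqrt(sse / n)
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - sse / sst   # assumption: benchmark is the test-sample mean
    return rmse, mae, r2

# Hypothetical daily returns and a naive zero forecast, for illustration only.
actual = [0.010, -0.020, 0.005, 0.000]
zero_forecast = [0.0, 0.0, 0.0, 0.0]
rmse, mae, r2 = forecast_metrics(actual, zero_forecast)
```

Under this definition R² is negative whenever the forecast has a larger squared error than the mean benchmark, which is how the linear models in Table 6 can post errors close to those of XGBoost and still show negative explanatory power.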

Table 7. XGBoost Feature Importance in Forecasting TASI Returns.

Rank	Pre-COVID (Importance %)	Post-COVID (Importance %)
1	Lagged TASI returns (short-term lags)—34.2%	Lagged TASI returns (short-term lags)—29.5%
2	Rolling 7-day moving average—21.8%	Crude oil returns (lagged)—23.4%
3	Crude oil returns (lagged)—18.5%	Rolling 30-day moving average—17.1%
4	Rolling 30-day moving average—15.6%	Inflation rate changes—12.8%
5	Inflation rate changes—6.4%	Interest rate changes—10.2%
6	Interest rate changes—3.5%	Other minor predictors—7.0%

Table 8. Macro-Removal Test (Ablation of Inflation and Interest Rate Controls).

Model	RMSE (Full)	RMSE (No Macro)	ΔRMSE	Sharpe (Full)	Sharpe (No Macro)	ΔSharpe
ARIMAX	0.007166	0.007151	−0.000014	0.092982	0.145974	0.052992
LSTM	0.007897	0.007355	−0.000542	0.023450	0.000638	−0.022812
VAR	0.007259	0.007240	−0.000019	0.030732	0.114725	0.083993
XGB	0.007342	0.007384	0.000042	0.090796	0.043889	−0.046907

Table 9. Break-Date Sensitivity (Robustness to Alternative Regime Split Dates).

Break Date	RMSE_LSTM	RMSE_VAR	RMSE_XGB	Sharpe_LSTM	Sharpe_VAR	Sharpe_XGB	CumRet_LSTM	CumRet_VAR	CumRet_XGB
2020-01-31	0.009759	0.008972	0.009422	0.008257	0.059414	0.045773	0.126927	0.911757	0.702733
2020-02-15	0.009168	0.008987	0.009329	−0.024727	0.061711	0.084880	−0.377891	0.941583	1.292421
2020-03-01	0.009037	0.008978	0.009364	0.044374	0.062353	0.055809	0.671808	0.943110	0.839689
2020-03-16	0.008728	0.008091	0.008424	0.021508	0.064679	0.062403	0.295465	0.886892	0.844453

Table 10. Forecast Performance Across Horizons (Pre-COVID).

Model	Horizon	RMSE	MAE	R²
ARIMA	1-step	0.008720	0.005861	−0.033884
ARIMA	5-step	0.009150	0.006250	−0.045200
ARIMA	10-step	0.009680	0.006710	−0.058900
ARIMA	20-step	0.010450	0.007380	−0.078500
ARIMAX	1-step	0.008473	0.005701	−0.029342
ARIMAX	5-step	0.008890	0.006080	−0.038700
ARIMAX	10-step	0.009410	0.006540	−0.049800
ARIMAX	20-step	0.010180	0.007210	−0.066300
VAR	1-step	0.008592	0.005824	−0.058226
VAR	5-step	0.009120	0.006310	−0.072500
VAR	10-step	0.009890	0.006980	−0.091200
VAR	20-step	0.010870	0.007820	−0.118000
LSTM	1-step	0.008499	0.005969	−0.035645
LSTM	5-step	0.009340	0.006650	−0.089400
LSTM	10-step	0.010560	0.007710	−0.156000
LSTM	20-step	0.012230	0.009080	−0.245000
XGBoost	1-step	0.008112	0.005193	0.046130
XGBoost	5-step	0.008450	0.005620	0.021500
XGBoost	10-step	0.008980	0.006170	−0.008400
XGBoost	20-step	0.009670	0.006850	−0.032100

Table 11. Forecast Performance Across Horizons (Post-COVID).

Model	Horizon	RMSE	MAE	R²
ARIMA	1-step	0.007942	0.004658	−0.019998
ARIMA	5-step	0.008450	0.005120	−0.035200
ARIMA	10-step	0.009180	0.005780	−0.058900
ARIMA	20-step	0.010350	0.006810	−0.095400
ARIMAX	1-step	0.007995	0.004759	−0.033884
ARIMAX	5-step	0.008510	0.005230	−0.048500
ARIMAX	10-step	0.009260	0.005910	−0.072300
ARIMAX	20-step	0.010480	0.006970	−0.112000
VAR	1-step	0.007937	0.004757	−0.018749
VAR	5-step	0.008480	0.005200	−0.042100
VAR	10-step	0.009220	0.005860	−0.068800
VAR	20-step	0.010420	0.006920	−0.108000
LSTM	1-step	0.008020	0.004861	−0.040216
LSTM	5-step	0.008690	0.005380	−0.078500
LSTM	10-step	0.009670	0.006240	−0.135000
LSTM	20-step	0.011350	0.007680	−0.218000
XGBoost	1-step	0.007823	0.004570	0.010241
XGBoost	5-step	0.008240	0.004980	−0.008500
XGBoost	10-step	0.008890	0.005610	−0.032100
XGBoost	20-step	0.009780	0.006420	−0.068400

Table 12. Diebold–Mariano Test Results (5-step Horizon, Pre- and Post-COVID).

Model 1	Model 2	DM Statistic	p-Value	Significance	Interpretation
ARIMA	ARIMAX	−0.2883	0.7733	No	No significant difference
ARIMA	VAR	0.0532	0.9576	No	No significant difference
ARIMA	LSTM	−1.4993	0.1351	No	No significant difference
ARIMA	XGBoost	0.7545	0.4513	No	No significant difference
ARIMAX	VAR	0.2898	0.7722	No	No significant difference
ARIMAX	LSTM	−0.1210	0.9038	No	No significant difference
ARIMAX	XGBoost	0.7775	0.4376	No	No significant difference
VAR	LSTM	−0.7547	0.4512	No	No significant difference
VAR	XGBoost	0.5248	0.6002	No	No significant difference
LSTM	XGBoost	2.7440	0.0062	Yes	XGBoost significantly outperforms LSTM
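
A Diebold–Mariano statistic of the kind reported in Table 12 can be sketched as follows. The sketch makes several assumptions that are not confirmed by the table: squared-error loss, a loss differential defined as Model 1 minus Model 2 (so a positive statistic favors Model 2, matching the sign of the LSTM versus XGBoost row), a rectangular-kernel long-run variance with h − 1 autocovariances, and a two-sided normal p-value. The forecast errors are hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def diebold_mariano(errors_1, errors_2, horizon=1):
    """DM statistic on squared-error loss; positive values favor model 2."""
    d = [e1 ** 2 - e2 ** 2 for e1, e2 in zip(errors_1, errors_2)]
    n = len(d)
    mean_d = sum(d) / n

    def autocov(lag):
        return sum((d[t] - mean_d) * (d[t - lag] - mean_d) for t in range(lag, n)) / n

    # Long-run variance: lag 0 plus twice the first (horizon - 1) autocovariances.
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0.0:
        raise ValueError("non-positive long-run variance; use a different kernel")
    stat = mean_d / math.sqrt(long_run_var / n)
    return stat, two_sided_p(stat)

# Hypothetical forecast errors: model 1 is consistently less accurate.
errors_1 = [0.010, -0.012, 0.008, -0.011, 0.009, -0.010, 0.012, -0.009]
errors_2 = [0.005, -0.006, 0.004, -0.005, 0.006, -0.004, 0.005, -0.006]
stat, p_value = diebold_mariano(errors_1, errors_2, horizon=1)
```

For a 5-step comparison the call would use `horizon=5`, which adds four autocovariance terms to the variance estimate. The rectangular kernel can yield a non-positive variance in small samples, which is why the sketch raises an error rather than returning a statistic.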

Table 13. GARCH(1,1) Volatility Forecasting Benchmark (Out-of-Sample, Expanding Window).

Regime	RMSE	MAE	QLIKE	N
Pre-COVID	10.552241	0.268312	−7.010195	2452
Post-COVID	2.232167	0.065143	−8.670707	1211
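
The QLIKE column in Table 13 can be read with the help of a small sketch. The exact loss specification is not given in the table, so the form below, the average of ln(h_t) + RV_t / h_t with h_t the forecast variance and RV_t a realized-variance proxy such as the squared return, is an assumption; it is one common variant. The numerical inputs are hypothetical.

```python
import math

def qlike(realized_var, forecast_var):
    """Average QLIKE loss: mean of ln(h_t) + RV_t / h_t (lower is better)."""
    terms = [math.log(h) + rv / h for rv, h in zip(realized_var, forecast_var)]
    return sum(terms) / len(terms)

# Hypothetical daily variances of order 1e-4 (about 1% daily volatility).
realized = [1.0e-4, 1.0e-4, 1.0e-4]
accurate = qlike(realized, [1.0e-4, 1.0e-4, 1.0e-4])
too_high = qlike(realized, [2.0e-4, 2.0e-4, 2.0e-4])
too_low = qlike(realized, [0.5e-4, 0.5e-4, 0.5e-4])
```

Under this form the loss is minimized when the forecast variance equals the realized proxy, and under-prediction is penalized more heavily than over-prediction of the same proportional size. Because daily variances lie far below one, the logarithmic term dominates and the loss is negative, in line with the sign of the values in Table 13.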

Table 14. Event-Control Robustness (Baseline vs. Event-Window Exclusion).

Regime	Model	RMSE (Base)	RMSE (Excl)	ΔRMSE	R² (Base)	R² (Excl)	HitRate (Base)	HitRate (Excl)	Ranking Changed?
Post-COVID	XGB	0.006200	0.006820	+0.000620	0.1970	0.1130	0.523	0.501	False
Pre-COVID	XGB	0.005890	0.006450	+0.000560	0.2650	0.1950	0.545	0.522	False
Post-COVID	LSTM	0.006950	0.007300	+0.000350	−0.0070	−0.1080	0.490	0.473	False
Pre-COVID	LSTM	0.006600	0.006990	+0.000390	0.0780	−0.0340	0.512	0.495	False
Post-COVID	VAR	0.007770	0.007247	−0.000522	−0.0061	−0.0544	0.380	0.415	False
Pre-COVID	VAR	0.008484	0.008484	0.000000	−0.0585	−0.0585	0.417	0.417	False

Table 15. Economic Value of Forecasts: Trading Strategy Performance.

Model	Regime	Hit Rate (%)	Sharpe Ratio	Cumulative Return (%)
ARIMA	Pre-COVID	49.2	−0.05	−1.3
ARIMAX	Pre-COVID	50.1	0.02	0.8
VAR	Pre-COVID	49.7	−0.03	−0.6
LSTM	Pre-COVID	52.6	0.18	4.5
XGBoost	Pre-COVID	57.3	0.46	12.7
ARIMA	Post-COVID	48.5	−0.09	−2.1
ARIMAX	Post-COVID	50.8	0.03	1.2
VAR	Post-COVID	49.0	−0.06	−1.4
LSTM	Post-COVID	53.4	0.21	5.2
XGBoost	Post-COVID	55.9	0.39	10.8

Table 16. Trading Cost Robustness (Gross vs. Net Performance).

Model	Gross Sharpe	Net Sharpe	Turnover	Net CumRet
VAR	0.030732	0.006490	0.703863	0.044117
LSTM	0.014997	0.013898	0.032189	0.094094
XGB	0.059300	0.034417	0.719656	0.233330

Table 17. Multi-Dimensional Risk–Return Profile (Net Performance).

Model	Net Cumulative Return	Net Sharpe Ratio	Maximum Drawdown (MDD)	Calmar Ratio
VAR	0.019659	0.102975	−0.247139	0.021333
LSTM	0.117059	0.316620	−0.227159	0.133611
XGBoost	0.231860	0.546052	−0.257983	0.224829

Table 18. Policy and Practical Implications of Forecasting Results.

Stakeholder	Implications from Findings	Practical Actions
Investors & Portfolio Managers	XGBoost-based forecasts exhibit higher directional reliability and stronger risk-adjusted performance, and remain favorable when evaluated using drawdown-based metrics.	Incorporate ML-driven signals into tactical allocation and timing overlays, and complement deployment with explicit drawdown monitoring and risk limits.
Regulators & Risk Supervisors	Conventional econometric benchmarks generate limited economic value, whereas ML-based signals better reflect oil- and macro-driven shifts that influence market risk.	Use ML-based indicators to support stress testing, scenario analysis, and systemic risk surveillance, including downside-risk and drawdown diagnostics.
Policymakers	Post-COVID market dynamics show greater sensitivity to inflation and oil-related channels, implying that macro and commodity shocks have stronger transmission into financial conditions.	Integrate ML-enabled monitoring into policy evaluation and market oversight frameworks, particularly during periods of elevated inflation and commodity-policy uncertainty.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Aggarwal, P.; Danila, N.; Suprihadi, E.; Manish, M.K. Crude Oil Shocks and Saudi Stock Returns: An Integrated Granger–LSTM–XGBoost Analysis. Forecasting 2026, 8, 19. https://doi.org/10.3390/forecast8020019
