1. Introduction
Natural rubber is a critical industrial input and a strategic agricultural commodity that plays a central role in global manufacturing supply chains; it is widely used in tires, medical gloves, engineering components, and a broad range of technical rubber products. Global production is highly concentrated in Southeast Asia, particularly Thailand, Indonesia, and Malaysia, which, together, account for the majority of the global supply [
1,
2]. Thailand has consistently remained the world’s leading exporter of natural rubber: in 2025, it produced approximately 4.79 million tonnes (32.18% of global production), with about 85.79% of output exported to international markets [
3]. Price volatility in natural rubber markets therefore carries direct consequences for the income of millions of smallholder farmers and for the cost structure of downstream industries that rely heavily on rubber-based inputs.
Natural rubber prices are jointly driven by agricultural seasonality, macroeconomic conditions, exchange rate dynamics, and speculative activity in futures markets, producing dynamics that are simultaneously non-linear, non-stationary, and subject to structural breaks [
4,
5,
6]. Ribbed Smoked Sheet No. 3 (RSS3) serves as the benchmark grade for international transactions, with export prices typically quoted on a Free on Board (FOB) basis at major Thai ports. Cross-market price co-movement among SHFE, SGX, and JPX (TOCOM) futures provides additional predictive content beyond domestic spot prices alone, motivating the inclusion of exchange-traded settlement prices as model inputs [
7]. This informational structure is asymmetric: Sang and Ma [
8] demonstrate that SHFE Granger-causes TOCOM but not vice versa, while Kepulaje et al. [
9] (pp. 159–161) show that SGX RSS3 Rubber Futures (SRU) and SGX TSR20 Rubber Futures (STF) predict regional spot prices more consistently than TOCOM/JPX Rubber Futures (JRU) do—implying that the futures exchanges operate as a layered rather than flat information system, with the informational weight of each venue depending on whether one is studying financial benchmarking or physical export pricing.
Commodity price forecasting has long relied on econometric models such as ARIMA, VAR, and GARCH, which provide well-established frameworks for capturing temporal dependencies and volatility structures [
10,
11]. In the literature on natural rubber, this lineage is exemplified by Zahari et al. [
12], who apply the Box–Jenkins procedure to Malaysian SMR20 and identify ARIMA(1,1,0) as the best-fitting model, and by Khin and Thambiah [
13], who show that a simultaneous supply–demand–price system outperforms univariate ARIMA across RMSE, MAE, and information criteria—demonstrating that purely univariate extrapolation is insufficient once market structure is made explicit. The linearity assumptions underlying these methods, however, limit their capacity on series with structural breaks and high non-stationarity, characteristics that are particularly pronounced in energy and agricultural commodity prices [
14,
15]. Subsequent studies introduced machine learning methods including support vector machines (SVMs), multi-layer perceptrons (MLPs), random forests, extreme gradient boosting (XGBoost), and back-propagation neural networks [
16,
17,
18], alongside hybrid statistical–machine learning frameworks such as ARIMA–SVM [
19,
20]. These methods address non-linearity but remain limited in capturing long-range dependencies, motivating the adoption of deep learning–based alternatives.
Deep recurrent architectures, particularly Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Bidirectional LSTM (BiLSTM) networks, have since become standard for sequential modelling and have been reported to outperform classical benchmarks in energy demand prediction, financial market forecasting, and environmental time-series modelling [
21,
22,
23]. BiLSTM extends conventional LSTM by processing sequences simultaneously in forward and backward directions, integrating contextual information from both past and future states within the observation window; this yields richer temporal representations in environments characterised by strong seasonality, non-linear interactions, and abrupt fluctuations [
24]. For natural rubber specifically, Phoksawat et al. [
25] demonstrate that a multi-layer LSTM achieves 95.88% accuracy on Thai RSS3 prices, while Eng and Khalid [
26] confirm that LSTM outperforms ARIMA on Malaysian SMR20 across MAE, RMSE, and MAPE—with both studies noting that further gains require the explicit incorporation of exogenous drivers such as exchange rates, energy prices, and futures markets. Hybrid extensions of BiLSTM include CNN–BiLSTM architectures that combine local feature extraction with bidirectional temporal learning [
27,
28], attention mechanisms that emphasise the most informative time steps [
29], and meta-heuristic or gradient-boosting combinations that capture residual non-linear patterns. These hybrids remain limited, however, in their ability to separate and individually model components operating across distinctly different temporal scales—a constraint that is particularly consequential for commodity series with co-existing trend, cyclical, and noise dynamics.
A parallel research direction addresses this limitation by incorporating signal decomposition—Empirical Mode Decomposition (EMD), Variational Mode Decomposition (VMD), Wavelet Transform (WT), or Seasonal-Trend Decomposition using LOESS (STL)—as a preprocessing stage, separating the original series into components representing distinct frequency structures and thereby reducing non-stationarity [
7,
29,
30]. Decomposition-based hybrid frameworks (EMD–LSTM, VMD–LSTM, VMD–GRU, VMD–TCN–LSTM, and STL–LSTM) have been shown to improve forecasting accuracy across coal, solar, wind, and electricity-load applications [
29], with VMD-based models outperforming both single-model and earlier decomposition-based alternatives [
31,
32]. Combining VMD with attention-based sequence architectures yields complementary improvements: VMD decomposes complex series into more learnable components, while attention enhances the selection of informative features and time steps [
33] (pp. 922–923); recent FX-forecasting evidence further indicates that VMD-enhanced bidirectional recurrent architectures can materially improve out-of-sample performance on low-frequency financial series [
34]. Within natural rubber, VMD extracts economically meaningful multi-scale components, and bidirectional recurrent architectures improve both fitting and directional prediction; importantly, the predictive value of individual modes depends on their time-scale correspondence with the forecast target [
7] (p. 6). More recently, Transformer-based architectures (Informer, Autoformer, FEDformer, PatchTST, and RevIN) have introduced sparse attention, decomposition-based encoders, frequency-enhanced representations, patch-level embedding, and distribution-shift normalisation, substantially improving long-sequence forecasting [
33,
35,
36,
37]. A parallel line of work questions whether this architectural complexity is necessary at horizons of operational interest: Zeng et al. [
38] introduce DLinear, a simple linear decomposition baseline reported to match or outperform Informer, Autoformer, and FEDformer on standard benchmarks, and subsequent work—SOFTS [
39], MoFo [
40], and patch-based hierarchical-attention variants such as PHAT [
41]—continues this trajectory toward parsimonious alternatives. The contribution of the present work operates at the level of preprocessing and evaluation methodology and is therefore composable with these alternative downstream architectures; a systematic backbone comparison is identified as a natural follow-up in
Section 5.
A common limitation of existing decomposition-based pipelines is that signal separation is treated purely as a preprocessing step to improve input quality, after which each intrinsic mode function is forecast independently and the predictions summed. This per-IMF-and-sum design is prone to a forecast-output underdispersion problem on stationary (differenced) commodity series: under symmetric loss and limited predictive information, regression models shrink predictions toward the conditional mean, reducing MAE and RMSE while suppressing the variance of the predicted series. The mechanism is conceptually related to but distinct from the recursive training-data collapse discussed by Shumailov et al. [
42] (pp. 755–757); the shared element is the systematic erosion of output variance under compound or underdetermined estimation, while the present application refers strictly to forecast-output dispersion on differenced commodity series rather than to recursive training-data degradation. Rajpal et al. [
43] (pp. 13–15) document the directly analogous output-level pathology in financial forecasting, showing that symmetric loss objectives push models toward degenerate, one-sided prediction regimes whose apparent performance masks near-zero predictive content. In the context of natural rubber, Fakthong et al. [
44] further argue that existing studies are heavily oriented toward short-term fluctuation tracking and that limited attention has been paid to architectures that preserve multi-scale information.
This study, using a 24-feature economic input set for the period 2018–2026, examines whether decomposition-based deep learning forecasts of daily RSS3 FOB price changes can appear directionally accurate while failing to preserve the dispersion of the target series. Three contributions follow. First, the study shows that the conventional per-IMF forecasting-and-summation pipeline can induce variance collapse on differenced rubber prices (per-IMF ablation: StdR ≈ 0.20) and that appending VMD components as auxiliary features within a single BiLSTM encoder preserves amplitude fidelity more effectively. Second, the study establishes a variance-sensitive evaluation protocol in which directional accuracy is interpreted jointly with Pearson correlation, the Standard Deviation Ratio (StdR), and class-conditional recall; the protocol’s diagnostic power is demonstrated across three empirically distinct mechanisms by which variance collapse arises—feature deficiency (Stage 1 vs. Stage 2 cross-stage analysis), compound estimation error (per-IMF pipeline ablation), and underparameterised recurrent learning (a Vanilla LSTM baseline that attains directional accuracy statistically indistinguishable from that of the deployed model yet exhibits StdR
). Third, the cross-stage feature-availability analysis serves as a design check for the Stage 3 evaluation protocol. The theoretical basis for this metric combination is established by Taylor [
45] (pp. 7183–7184), who demonstrates that correlation and the Standard Deviation Ratio are complementary rather than substitutable descriptors of forecast skill: correlation measures co-movement in shape, while StdR measures amplitude fidelity, and a reduction in RMS error cannot be taken as evidence of improved skill if forecast variance has been suppressed in the process. This evaluation logic aligns with a broader methodological literature: Bürgi [
46] shows that DA is not self-sufficient without knowledge of magnitude sensitivity and user objective; McCarthy and Snudden [
47] (pp. 1–7) demonstrate that temporal aggregation can mechanically inflate success ratios beyond 0.5 even for a random walk, generating pseudo-skill where no genuine predictability exists; and Costantini and Kunst [
48] show that embedding DA into model-selection criteria seldom yields robust gains and can worsen MSE performance—together implying that DA should function as a supplementary diagnostic rather than a primary evaluation objective [
49].
The remainder of the paper is organised as follows.
Section 2 describes the data, preprocessing pipeline, decomposition procedure, model architecture, and evaluation protocol.
Section 3 reports the empirical results.
Section 4 discusses the economic interpretation and practical implications.
Section 5 concludes with limitations and directions for future research.
3. Results
This section reports the empirical results of the proposed VMD-Augmented BiLSTM framework. All primary evaluations are conducted on the Stage 3 out-of-sample test set (
effective observations after deduplication, 18 September 2025–27 February 2026).
Section 3.1 outlines the evaluation framework;
Section 3.2 presents the VMD signal decomposition;
Section 3.3 reports the ablation analysis confirming the contribution of each architectural component;
Section 3.4 reports out-of-sample forecast performance against benchmark models;
Section 3.5 presents directional bias diagnostics; and
Section 3.6 reports multi-step forecast skill across horizons.
3.1. Evaluation Framework
The primary evaluation is conducted in differenced space using five metrics. Directional Accuracy (DA) measures the proportion of days on which the predicted sign of the price change matches the realised sign; days on which the realised price change is zero are excluded from the denominator. Pearson correlation (r) quantifies the linear association between predicted and actual price changes. MAE and RMSE measure the mean absolute and root-mean-squared prediction error in Baht/kg/day. The Standard Deviation Ratio (StdR) is the ratio of the standard deviation of the predicted price changes to the standard deviation of the realised price changes.
StdR equals 1.0 under ideal forecast dispersion and approaches 0 when a model produces near-constant predictions regardless of actual variation. A value below 0.20 is taken as evidence of variance collapse—a failure mode in which a model learns the unconditional mean rather than the conditional distribution of price changes. This convergence toward the unconditional mean is a forecast-output underdispersion phenomenon, conceptually parallel to the broader distributional-concentration failures discussed in the multi-stage estimation literature [
42] (pp. 755–757), and directly analogous to the one-sided prediction regimes identified by Rajpal et al. [
43] (pp. 13–15), in which symmetric loss objectives produce degenerate directional outputs whose apparent performance masks near-zero predictive content. Rajpal et al. [
43] further show that this degenerate regime produces severe imbalance between recall and specificity—a pathology that high DA alone cannot detect, reinforcing the need for confusion-matrix and variance-ratio diagnostics alongside directional hit-rate statistics. StdR is the most important diagnostic for differenced commodity price models because high DA can be achieved trivially by predicting the majority class, whereas reproducing the variance of the target series requires predictive information beyond the unconditional sign distribution.
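The five differenced-space metrics can be sketched in a few lines of NumPy. This is a minimal illustration rather than the study's evaluation code: the function name `differenced_metrics` and the toy arrays are hypothetical, StdR is assumed to be the predicted-to-realised standard deviation ratio, and a zero forecast on a non-zero day is assumed to count as a directional miss.

```python
import numpy as np

def differenced_metrics(actual, predicted):
    """Five differenced-space metrics for one-step price-change forecasts."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual

    # Directional accuracy: days with a zero realised change are dropped
    # from the denominator (assumed reading of the exclusion rule).
    nonzero = actual != 0
    da = float(np.mean(np.sign(predicted[nonzero]) == np.sign(actual[nonzero])))

    # Pearson r is undefined for a constant series; report NaN in that case.
    if np.std(predicted) == 0 or np.std(actual) == 0:
        r = float("nan")
    else:
        r = float(np.corrcoef(actual, predicted)[0, 1])

    return {
        "DA": da,
        "r": r,
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        # StdR: predicted dispersion relative to realised dispersion.
        "StdR": float(np.std(predicted) / np.std(actual)),
    }

# Toy example: every sign is right, but amplitudes are shrunk to one-tenth.
actual = np.array([0.5, -0.3, 0.0, 0.8, -0.6])
collapsed = 0.1 * actual
m = differenced_metrics(actual, collapsed)
```

In this toy case DA and r are both perfect while StdR is 0.1, which is the pattern the variance-collapse threshold is designed to flag: sign and shape agreement alone say nothing about amplitude.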
Price-level metrics—specifically
and MAPE—are included in
Section 3.4 for completeness only. These statistics are inflated by the autoregressive anchor effect: because the reconstructed level
inherits the previous day’s price regardless of forecast quality,
and MAPE at the level will appear favourable even when the differenced-space forecast is uninformative. They therefore carry no additional diagnostic value beyond the five primary metrics reported in
Table 7. The VMD decomposition of the Stage 3 differenced price series, obtained under the expanding-window procedure with
, is shown in
Figure 4.
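The autoregressive anchor effect on price-level metrics can be illustrated with synthetic data. The sketch below assumes a level reconstruction of the form "previous day's price plus predicted change"; the starting level of 70 Baht/kg, the change volatility of 0.8, and the 175-day length are illustrative values, not figures taken from the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)
changes = rng.normal(0.0, 0.8, size=175)      # synthetic daily price changes
prices = 70.0 + np.cumsum(changes)            # synthetic price level

# Previous day's realised price, used as the anchor for each reconstruction.
prev_price = np.concatenate(([70.0], prices[:-1]))

zero_forecast = np.zeros_like(changes)        # a forecast with no information
level_forecast = prev_price + zero_forecast   # re-anchored one-step level

level_mape = 100.0 * np.mean(np.abs(prices - level_forecast) / prices)
level_mae = np.mean(np.abs(prices - level_forecast))
diff_mae = np.mean(np.abs(changes - zero_forecast))
```

Because the anchor cancels, the level-space absolute error is identical to the differenced-space absolute error, while the percentage error is small simply because daily changes are small relative to the price level. An uninformative forecast therefore looks accurate in level terms.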
3.2. VMD of the RSS3 Price Series
Variational Mode Decomposition was applied to the differenced RSS3 FOBm1 series (
) over the Stage 3 training and validation partitions (
observations). The decomposition with
modes (see
Appendix C) achieved a reconstruction RMSE of 0.048 normalised units, representing 3.2% of the series standard deviation, confirming that the six components collectively reconstruct the original differenced series with minimal residual error.
IMF
1 accounts for 74.7% of total signal energy and exhibits the lowest centre frequency (
), consistent with the long-run structural trends documented in the literature on natural rubber, including fluctuations in global automobile production and plantation supply cycles [
7]. The two intermediate-frequency components (IMF
2–IMF
3) collectively capture 17.1% of energy and correspond to temporal scales of one to four weeks, consistent with the inventory adjustment cycles and seasonal tapping rhythms documented in rubber export data from Thailand, Indonesia, and Malaysia [
7,
30]. The structural dominance of IMF
1 is further consistent with the simultaneous supply–demand framework of Arunwarakorn et al. [
50] (pp. 8–9), who show that equilibrium rubber prices are anchored by slow-moving fundamentals—plantation area, synthetic rubber prices, GDP growth, and crude oil—operating over multi-year cycles rather than at the weekly or daily frequencies captured by IMF
2–IMF
6. The mid- to high-frequency residuals (IMF
4–IMF
6) account for 8.3% of energy and reflect intra-day speculative activity and short-term information shocks operating at sub-weekly scales. Full VMD decomposition results, including centre frequencies and energy shares for each mode, are summarised in
Table 9.
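Energy shares and reconstruction error of the kind reported above can be computed from any array of modes. The following sketch uses three synthetic sinusoids as stand-ins for VMD output, since the decomposition itself is not reproduced here; the function name `mode_summary` is hypothetical, and the energy share is assumed to be each mode's sum of squares divided by the total across modes.

```python
import numpy as np

def mode_summary(signal, modes):
    """Energy share per mode and RMSE of the summed-mode reconstruction."""
    signal = np.asarray(signal, dtype=float)
    modes = np.asarray(modes, dtype=float)        # shape (K, T)
    energy = np.sum(modes ** 2, axis=1)           # per-mode sum of squares
    shares = energy / energy.sum()
    residual = signal - modes.sum(axis=0)
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    # Also return RMSE relative to the dispersion of the original signal.
    return shares, rmse, rmse / float(np.std(signal))

# Synthetic stand-ins for decomposed modes: sinusoids of decreasing amplitude.
t = np.arange(512)
modes = np.stack([
    3.0 * np.sin(2 * np.pi * 4 * t / 512),
    1.0 * np.sin(2 * np.pi * 32 * t / 512),
    0.5 * np.sin(2 * np.pi * 128 * t / 512),
])
signal = modes.sum(axis=0)
shares, rmse, rel_rmse = mode_summary(signal, modes)
```

With real VMD output the reconstruction residual would be non-zero, and the third return value corresponds to the "percentage of the series standard deviation" figure quoted for the reconstruction RMSE.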
The top panel shows the original differenced series, while the remaining panels present IMF1–IMF6 with their centre frequencies and energy shares. IMF1 dominates the signal (74.7% energy), capturing the low-frequency trend, whereas IMF4–IMF6 mainly represent mid- to high-frequency noise.
3.3. Ablation Study
To verify the contribution of each architectural component, an ablation analysis was conducted on the deduplicated, leakage-free Stage 3 test partition (
). Four configurations were compared: (i) the full VMD-Augmented BiLSTM–Transformer hybrid control, (ii) the BiLSTM-only model (Transformer pathway removed), (iii) the Transformer-only model (BiLSTM encoder removed), and (iv) the conventional per-IMF decomposition–forecast pipeline. The results are summarised in
Table 10.
The ablation results identify the bidirectional LSTM pathway as the principal architectural driver of accuracy. The BiLSTM-only variant achieves Pearson’s , StdR = 1.029, and DA = 83.4%—matching or exceeding the performance of the full hybrid across all primary metrics at seed 42. The Transformer-only variant attains lower correlation () and comparable variance dispersion (StdR = 0.680), indicating that self-attention alone does not capture multi-scale temporal structure as effectively as bidirectional recurrence does. The full hybrid configuration (, StdR = 0.697 at seed 42) does not improve upon BiLSTM-only in this setting.
Multi-seed sensitivity analysis ( seeds: 42, 123, 2024, 7, 999) confirms this pattern: BiLSTM-only attains versus the full hybrid’s , with no overlap between the two per-seed distributions of Pearson correlation. The BiLSTM-only configuration is therefore adopted as the primary model and reported as VMD-Augmented BiLSTM in subsequent sections.
The per-IMF conventional pipeline shows the largest performance gap. Five separate models, each forecasting one IMF and summed at inference time, attain
with StdR = 0.200—at the variance-collapse threshold, where forecasts become nearly constant in dispersion. This finding directly motivates the VMD-as-features design adopted throughout this study: by exposing all IMFs jointly to a single forward pass, the encoder learns cross-scale interactions that the additive ensemble cannot recover. This empirical pattern—in which forecast variance shrinks when finite-sample approximation error compounds across successive estimation stages—is conceptually parallel to the underdispersion outcomes discussed in the literature on multi-stage estimation [
42] (p. 757), though the present result is established here as an empirical finding on differenced commodity-price forecasts rather than derived from a theorem.
The ablation patterns shown in
Figure 5 support the following observations. The BiLSTM-only configuration achieves the strongest balance across the three primary metrics. The full hybrid (BiLSTM + Transformer) adds 84% more parameters without improving any of the three metrics, indicating that the Transformer pathway is redundant when VMD features are already provided. The Transformer-only variant captures less correlation than the BiLSTM-only variant, indicating that bidirectional recurrence is the principal predictive driver. The per-IMF conventional pipeline exhibits variance collapse (StdR
), motivating the VMD-as-features design adopted in the proposed framework.
3.4. Out-of-Sample Forecast Performance
Now that the contribution of each component has been confirmed through ablation, this section evaluates the proposed VMD-Augmented BiLSTM model against four external benchmarks on the deduplicated, leakage-free test partition ().
The Naive No-Change model (a predicted change of zero on every day) serves as a lower bound; any useful forecasting model must exceed it in directional accuracy. The Naive Random Walk (the previous day’s realised change carried forward as the forecast) tests whether recent momentum has predictive value. ARIMA(2,0,2), selected by AIC grid search, represents the linear time-series benchmark. Vanilla LSTM (unidirectional, two-layer, same 24 economic features as input, no VMD) provides an additional ablation of the VMD-as-features contribution against a standard deep learning baseline.
The proposed VMD-Augmented BiLSTM achieves the strongest balance across primary metrics on the deduplicated leakage-free test partition. Across five random seeds, the model attains Pearson’s
and StdR
, improving correlation by approximately
over ARIMA(2,0,2). Crucially, the variance ratio (StdR ≈ 1.09) sits close to unity, indicating that predicted amplitudes are well calibrated to the realised magnitude distribution—a property that neither ARIMA (StdR = 0.368) nor Vanilla LSTM (StdR =
) achieves. The Vanilla LSTM result is particularly informative: despite attaining directional accuracy statistically indistinguishable from the proposed model (
vs.
), its predictions exhibit pronounced variance collapse, with predicted dispersion at only one-fifth that of the realised series. The Vanilla LSTM attains lower MAE and RMSE than the proposed model precisely because its forecasts are variance-collapsed toward the conditional mean: when the predicted series has near-zero dispersion, the squared and absolute deviations from the realised series shrink mechanically, mimicking apparent accuracy without producing usable directional or magnitude information. Lower error alone is therefore not evidence of superior forecast usefulness; the joint diagnostic
must accompany level metrics for a faithful assessment of forecast skill. The mechanisms generating this divergence are analysed in
Section 3.5.
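The mechanical link between squared-error minimisation and reduced dispersion can be seen in the simplest possible setting. The sketch below is not the study's model: it fits an ordinary least-squares line to a weakly predictable synthetic target (the coefficient 0.3 and sample size 500 are arbitrary choices) and compares the fitted forecast with a version rescaled to match the target's standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)          # weakly predictable synthetic target

# Least-squares fit with intercept: the squared-error-optimal linear forecast.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

r = np.corrcoef(y, y_hat)[0, 1]
std_ratio = np.std(y_hat) / np.std(y)     # StdR of the fitted forecast

# Rescale the forecast so that its dispersion matches the target's.
rescaled = y.mean() + (y_hat - y_hat.mean()) / std_ratio
rmse_fit = np.sqrt(np.mean((y - y_hat) ** 2))
rmse_rescaled = np.sqrt(np.mean((y - rescaled) ** 2))
```

For an in-sample least-squares fit the standard deviation ratio equals the correlation, so low predictive information implies a low StdR, and restoring full dispersion necessarily raises RMSE. This is a linear illustration of why lower error and lower dispersion can go together; it does not by itself explain the behaviour of any particular recurrent network.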
This margin is consistent with the broader rubber forecasting literature showing that purely univariate linear models are structurally insufficient once commodity-specific drivers are made explicit: Khin and Thambiah [
13] demonstrate that a simultaneous supply–demand system outperforms ARIMA across all accuracy criteria, while Eng and Khalid [
26] confirm that LSTM dominates ARIMA on Malaysian SMR20—establishing that the gain over ARIMA reported here reflects a systematic pattern rather than a sample-specific outcome. The directional accuracy of 82.5% ± 1.8% (across five seeds) substantially exceeds that of the random baseline (50%) on the deduplicated test partition, consistent with the predictive content carried by the 24 economic features and the VMD-augmented frequency components.
The model’s predictive skill is consistent with the layered cross-market structure documented by Kepulaje et al. [
9] (pp. 159–161) and Pinitjitsamut [
52]: because SHFE signals are incorporated into FOB Bangkok with a lag and pass-through remains incomplete and asymmetric in the short run, one-day-ahead predictability beyond the autoregressive baseline is structurally limited—making a Pearson correlation of
across five random seeds on the differenced series a substantively strong result rather than an expected one. The VMD-Augmented BiLSTM successfully tracks the realised RSS3 price trajectory across the test period, capturing both directional movements and amplitude variation (
Figure 6).
The figure presents three views of the seed-42 test realisation. The top panel shows the rolling one-step price-level reconstruction,
, against the realised level series; the close fit reflects the autoregressive anchor effect (
Section 3.1) rather than underlying differenced-space skill. The middle panel makes this transparent: the cumulative-prediction path—predicted increments accumulated without re-anchoring—drifts substantially over the test window, demonstrating why performance must be assessed in the differenced space. The bottom panel reports the daily first-differenced predictions
versus realised
(Baht/kg/day)—the primary evaluation space—with predictions tracking actuals in both sign and amplitude, consistent with the multi-seed StdR of
in
Table 11.
The StdR of 1.091 (mean across five seeds) is the highest StdR across all evaluated models and sits close to the ideal value of 1.0, indicating that the VMD-Augmented BiLSTM produces forecasts whose dispersion is well calibrated to the actual first-differenced series (
Figure 7). By contrast, ARIMA (StdR = 0.368) substantially underdisperses, while Vanilla LSTM (StdR =
across five seeds) exhibits variance collapse at the diagnostic threshold of 0.20 introduced in
Section 3.1, a regime in which the model has converged to near-flat forecasts that reproduce the empirical sign distribution of the test set without preserving its magnitude. This combination of high Pearson correlation, near-unity StdR, and reproducibility across five seeds on the 175-observation test set provides the strongest evidence in this study of forecasts whose dispersion and co-movement structure are jointly consistent with the realised series, rather than artefacts of mean reversion or sign-dominant predictions.
The price-level metrics reported in
Table 11 should be interpreted with caution. The Naive No-Change model, which predicts zero price change on every day, achieves low level-space error despite having zero predictive content (DA
, Corr
). This confirms that price-level metrics alone cannot distinguish informative forecasts from a trivial baseline, as discussed in
Section 3.1.
This is precisely the evaluation failure described by Taylor [
45] (pp. 7183–7184): A forecast can appear numerically accurate while being dynamically wrong if it suppresses variance rather than tracking the stochastic structure of the series. Ampountolas [
67] (pp. 2, 14–18) illustrates the same gap in commodity forecasting practice—standard MAE, MSE, and RMSE comparisons identify the model with smallest average error but cannot reveal whether the winning model reproduces distributional scale, making StdR and Pearson’s
r indispensable complements rather than optional additions.
3.5. Directional Bias Analysis
Now that the VMD-Augmented BiLSTM has been established to achieve the best balance between correlation and variance fidelity, this section examines whether high directional accuracy in baseline models reflects predictive content beyond the unconditional sign distribution or merely reflects distributional bias. The analysis combines class-conditional recall with the variance-fidelity diagnostic on the
test set (deduplicated) in
Table 12; full confusion matrices for visual inspection are presented in
Figure 8.
The deduplicated test set contains 78 up days and 73 down days (non-zero observations), close to balanced—in contrast to the unbalanced 191 vs. 46 split that arose when SGX multi-contract records double-counted up days in the pre-deduplication version. The Naive Random Walk baseline attains DA = 59.6% by following the previous day’s direction with no recall asymmetry; ARIMA(2,0,2) attains DA = 56.3% with up recall 61.5% and down recall 50.7%, the latter being at the random-baseline threshold. The Vanilla LSTM baseline is reported separately under the multi-seed protocol in
Table 13, where its variance-collapse pattern is documented and analysed.
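Class-conditional recall can be computed directly from the signs of realised and predicted changes. In the sketch below the function name `directional_recall` is hypothetical, and the sign pattern is constructed for illustration: the counts of 48 correct up calls and 37 correct down calls are back-calculated from the percentages quoted above for ARIMA(2,0,2) and are an assumption, not values reported in the study.

```python
import numpy as np

def directional_recall(actual, predicted):
    """DA, up recall, and down recall on days with a non-zero realised change."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    keep = actual != 0                      # zero-change days are excluded
    a, p = actual[keep], predicted[keep]

    # A zero forecast matches neither class, so it counts as a miss.
    da = float(np.mean(np.sign(p) == np.sign(a)))
    up_recall = float(np.mean(p[a > 0] > 0))
    down_recall = float(np.mean(p[a < 0] < 0))
    return da, up_recall, down_recall

# Illustrative pattern: 78 up days (48 called correctly), 73 down days (37 called correctly).
actual = np.concatenate([np.ones(78), -np.ones(73)])
predicted = np.concatenate([np.ones(48), -np.ones(30), -np.ones(37), np.ones(36)])
da, up_rec, down_rec = directional_recall(actual, predicted)
```

Under these assumed counts the three statistics round to 56.3%, 61.5%, and 50.7%, which shows how a modest overall DA can conceal a down recall that is no better than a coin flip.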
The VMD-Augmented BiLSTM (mean across seeds) attains DA
with down recall
, indicating that its high directional accuracy reflects balanced sign tracking across both up and down days rather than a positive-bias artefact. Combined with the near-unity StdR, this confirms that the proposed model identifies directional turning points without sacrificing amplitude fidelity (
Figure 9). Collectively, the confusion matrix diagnostics (
Figure 8) and StdR comparison show that high directional accuracy must be interpreted jointly with variance fidelity and class-specific recall. This finding corroborates the work of Bürgi [
46] (pp. 7909–7912), who demonstrates that DA is not a self-sufficient evaluation concept without knowledge of magnitude sensitivity and class composition, and McCarthy and Snudden [
47] (pp. 1–7), who show that success ratios can exceed 0.5 mechanically under class imbalance. Costantini et al. [
49] further show that DA becomes materially informative only when paired with magnitude-sensitive directional value measures that weight correct sign calls by the size of the realised move rather than treating all correct predictions as equivalent.
A direct empirical illustration of this pathology is obtained by re-evaluating the Vanilla LSTM baseline under the same five-seed protocol applied to the primary model. Across seeds
, the architecture (a unidirectional LSTM with hidden dimension 128, two layers, and dropout 0.20, receiving the 24 economic features without the VMD modes) attains directional accuracy of
, Pearson correlation
, and StdR
(
Table 13). The zero standard deviation of DA across five independently initialised runs, in combination with the near-identical Pearson
r and StdR values, indicates that the model has converged in every run to a forecast whose dispersion is sufficient only to identify the empirical sign distribution of the test sample, not to reproduce the magnitude of realised price changes. The variance ratio of
sits at the diagnostic threshold of
introduced in
Section 3.1 and is consistent with the forecast-output underdispersion regime that motivates the variance-collapse diagnostic in this study, paralleling the broader literature on distributional-concentration failures in underdetermined estimation [
42] (p. 757). By contrast, the proposed VMD-Augmented BiLSTM attains
and StdR
on the same test partition, doubling the Pearson correlation and recovering near-ideal variance fidelity at a statistically indistinguishable directional accuracy (
vs.
; difference within one standard deviation of the proposed model). This pairing of metrics provides the empirical anchor for the methodological argument advanced throughout this section: A model that achieves
directional accuracy with
and StdR
cannot be distinguished from a magnitude-preserving forecast on the DA criterion alone but is unambiguously identified as variance-collapsed under the joint
diagnostic. The implication generalises beyond the present sample: Any reporting practice that relies on directional accuracy without complementary variance and correlation diagnostics will systematically fail to detect this class of degenerate forecast, which can otherwise pass as a competitive deep learning baseline.
3.6. Multi-Step Forecast Skill Degradation
Forecast skill degradation across horizons is assessed using the direct multi-step strategy: A separate model is trained for each target horizon
, with all other architecture and hyperparameter settings held constant. Both the dual-pathway BiLSTM–Transformer hybrid and the BiLSTM-only deployed configuration were trained side by side under the same data pipeline as
Table 11 (seed 42), enabling a direct apples-to-apples comparison of how each architecture’s skill degrades as the horizon is extended. The full results are reported in
Table 14 and visualised in
Figure 10.
The direct strategy is preferred over recursive multi-step forecasting because it does not feed earlier forecasts back in as inputs and therefore avoids the associated error accumulation [65]. The cost is that H independent models must be trained, one per horizon h, each producing a horizon-specific forecast.
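The direct strategy can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the helper names `make_direct_dataset` and `fit_direct_models`, the look-back length, and the horizon list are hypothetical, and `fit_mean_model` is a deliberately trivial placeholder learner standing in for the BiLSTM.

```python
def make_direct_dataset(series, lookback, horizon):
    """Pair each look-back window with the value `horizon` steps after its last point."""
    X, y = [], []
    for end in range(lookback, len(series) - horizon + 1):
        X.append(series[end - lookback:end])      # inputs: observations end-lookback .. end-1
        y.append(series[end + horizon - 1])       # target: observation (end-1) + horizon
    return X, y


def fit_direct_models(series, lookback, horizons, fit):
    """Train one independent model per horizon using the same `fit` routine."""
    return {h: fit(*make_direct_dataset(series, lookback, h)) for h in horizons}


def fit_mean_model(X, y):
    """Placeholder learner (not the paper's model): always predicts the training-target mean."""
    avg = sum(y) / len(y)
    return lambda window: avg


# Illustrative usage on a toy series with hypothetical horizons.
series = [float(v) for v in range(10)]
models = fit_direct_models(series, lookback=3, horizons=[1, 2, 3], fit=fit_mean_model)
forecasts = {h: model(series[-3:]) for h, model in models.items()}
print(forecasts)
```

Because each horizon has its own target vector, no model ever consumes another model's forecast, which is the property that rules out recursive error propagation.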
At the deployment horizon
, the BiLSTM-only configuration attains
, DA
, and StdR
—statistics consistent with the multi-seed estimate reported in
Table 11 (
). The dual-pathway hybrid trained under identical conditions attains
, DA
, StdR
, confirming the pathway-contribution finding of
Section 3.3 that the Transformer pathway does not contribute additional predictive value in this context. The Pearson-correlation gap between the two configurations is
at
and widens to
at
and
at
, indicating that the parsimony advantage of BiLSTM-only is not specific to the one-day horizon but persists across the short-horizon regime where forecast skill is operationally meaningful.
At the three-day horizon, BiLSTM-only attains , DA , and StdR —the strongest non-trivial multi-step result reported in this study. The non-monotone variation between and is shallow under BiLSTM-only (DA: ; r: ), consistent with the multi-scale frequency structure of the differenced series: the three-day horizon better aligns with the low-frequency trend components dominant in IMF1, while the two-day horizon retains exposure to higher-frequency noise that is partially attenuated by the BiLSTM–temporal attention pathway.
Beyond
, performance deteriorates sharply under both architectures. Under BiLSTM-only, correlation drops to
at
and to
at
, while StdR collapses below the diagnostic threshold of 0.20 by
(StdR
). Horizons
should be regarded as outside the model’s reliable forecasting range under either architecture. This pattern also illustrates the model-selection hazard documented by Costantini et al. [
49] and Costantini and Kunst [
48]: relying on DA as a selection criterion at longer horizons would favour models whose nominal accuracy exceeds that of short-horizon counterparts despite zero predictive content by every variance-sensitive measure. The emphasis on short lead times is consistent with prior VMD-based forecasting studies, where predictive gains from decomposition are strongest at near-term horizons and tend to weaken as the forecast horizon extends [
66].
These single-seed multi-horizon results provide preliminary corroboration of the pathway-contribution finding (
Section 3.3): the BiLSTM-only configuration outperforms the dual-pathway hybrid on Pearson correlation at every horizon and on directional accuracy at every horizon
. The BiLSTM-only configuration provides reliable forecasting skill at horizons of one, two, and three trading days, with diminished but non-trivial skill at
. Beyond a two-week horizon, predictive value collapses, consistent with the fundamental limit of high-frequency commodity price forecasting established in the prior literature. A multi-seed extension of the multi-step protocol is identified as an immediate priority for future work to characterise initialisation sensitivity at each horizon; the qualitative pattern reported here is consistent across seeds at
where multi-seed evidence is already available (
Table 11).
4. Discussion
The VMD results provide structural insight into the multi-scale price dynamics of the natural rubber market, and their economic interpretation reinforces the theoretical rationale for the proposed architectural design. The dominant trend component (IMF
1, 74.7% of total signal energy under expanding-window VMD with
) captures slow-moving, long-run structural forces—principally global tyre demand driven by automobile production; plantation area adjustment in Thailand, Indonesia, and Malaysia; and the substitution relationship between natural and synthetic rubber. The pronounced concentration of signal energy in this single low-frequency mode confirms that RSS3 price dynamics are fundamentally anchored by supply–demand fundamentals operating over multi-year cycles, consistent with the commodity economics literature and with prior VMD-based evidence from natural rubber futures markets showing that the dominant mode reflects broader structural market conditions and longer-horizon price behaviour [
7].
The intermediate-frequency components (IMF
2–IMF
3, 17.1% combined energy) are consistent with medium-term market cycles arising from seasonal tapping patterns, inventory accumulation and drawdown, and semi-annual export policy adjustments in major producing countries. These components exhibit a degree of regularity that makes them particularly amenable to modelling by the BiLSTM encoder, whose recurrent structure is well suited to capturing sequentially persistent, periodic fluctuations. This interpretation aligns with financial and energy-price forecasting evidence demonstrating that decomposition-based models improve predictive accuracy precisely by separating lower-frequency structural movements—which possess greater regularity—from higher-frequency disturbance components before sequence learning is applied [
24,
30].
The high-frequency residuals (IMF
4–IMF
6, 8.3% combined energy under
) capture speculative trading noise and short-horizon information shocks, including geopolitical disruptions, weather-related supply shocks, and market dislocations. Although the Transformer pathway was evaluated as a control architecture for capturing irregular high-frequency deviations, the ablation results indicate that it did not improve performance over the BiLSTM-only model in this dataset. This suggests that the dominant predictive structure in RSS3 price changes is more effectively captured by recurrent sequential learning with VMD-augmented inputs than by adding a self-attention pathway. This interpretation is supported by prior rubber-futures evidence indicating that high-frequency IMFs carry information relevant to near-term directional prediction, whereas low-frequency modes dominate longer-horizon behaviour [
7], and by energy-price forecasting research establishing that model performance improves when architectural design reflects the heterogeneous information content of frequency components [
30].
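The energy shares quoted for the three frequency tiers can be reproduced from any set of decomposed modes with a short calculation. The sketch below assumes that a mode's energy is the sum of its squared values and that its share is this energy divided by the total across modes; the study's exact definition is not restated in this section, so this is an assumption, and the function name `energy_shares` and the toy modes are hypothetical.

```python
def energy_shares(modes):
    """Fraction of total signal energy (sum of squares) carried by each mode."""
    energies = [sum(v * v for v in mode) for mode in modes]
    total = sum(energies)
    if total == 0:
        raise ValueError("all modes are identically zero")
    return [e / total for e in energies]


# Toy modes (illustrative only): a flat trend plus two oscillating components.
toy_modes = [
    [3.0, 3.0, 3.0, 3.0],
    [1.0, -1.0, 1.0, -1.0],
    [0.5, -0.5, 0.5, -0.5],
]
for k, share in enumerate(energy_shares(toy_modes), start=1):
    print(f"IMF{k}: {share:.1%}")
```

Under an expanding-window decomposition the shares would be recomputed on each window, so a single reported percentage should be read as a summary over windows rather than a fixed property of the series.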
These decomposition-level findings translate directly into forecasting outcomes. The VMD-Augmented BiLSTM achieves Pearson’s
on the deduplicated test set across five seeds, with DA
and balanced down recall
, outperforming ARIMA(2,0,2) by approximately 0.67 in correlation. In the context of statistical evaluation, the model correctly classifies the sign of approximately four out of five non-zero daily RSS3 FOB movements, with predicted amplitudes calibrated to the realised magnitude distribution (StdR
). This combination of correlation, directional accuracy, and variance fidelity, reproducible across random seeds, characterises the model’s contribution beyond what any single conventional metric would suggest. These results are consistent with the broader hybrid-forecasting literature demonstrating that VMD-enhanced architectures improve predictive performance by reducing the non-stationarity burden on downstream sequence learners, with gains being most pronounced when decomposition is integrated within a unified framework capable of exploiting both local sequential dependencies and global temporal structure [
55]. In the commodity-price domain specifically, this pattern is corroborated by evidence that VMD improves model learnability through multi-frequency isolation and that the further inclusion of structurally relevant long-run economic drivers yields additional gains in both level and directional accuracy [
31].
5. Conclusions
The central finding of this study is that directional accuracy alone is insufficient for evaluating differenced commodity-price forecasts. In the RSS3 FOB application, a Vanilla LSTM baseline attains directional accuracy of across five seeds—statistically indistinguishable from that of the deployed VMD-Augmented BiLSTM ()—yet its forecasts collapse toward near-zero dispersion (StdR versus ). The variance-sensitive evaluation protocol developed in this study, in which directional accuracy is interpreted jointly with Pearson correlation, the Standard Deviation Ratio, and class-conditional recall, identifies this degeneracy directly: A model with DA, , and StdR cannot be distinguished from a magnitude-preserving forecast on the DA criterion alone but is unambiguously identified as variance-collapsed under the joint diagnostic. The protocol’s diagnostic power is demonstrated across three empirically distinct mechanisms by which variance collapse arises in this setting—feature deficiency (Stage 1 and Stage 2 cross-stage analysis: StdR and ), compound estimation error (per-IMF conventional pipeline: StdR ), and underparameterised recurrent learning (Vanilla LSTM: StdR )—supporting its generality beyond the present sample.
Within this evaluation framework, the VMD-Augmented BiLSTM forecasting design preserves both directional information and amplitude fidelity. The key design choice—appending VMD components directly as input features rather than ensembling per-IMF forecasts—preserves multi-scale frequency information within a single forward pass while avoiding the variance collapse characteristic of conventional decomposition–forecast pipelines, with VMD fitted using an expanding-window procedure to eliminate information leakage. On a 175-observation, deduplicated, leakage-free held-out test set, the deployed model attained Pearson correlation of
, directional accuracy of
, and an StdR of
across five random seeds, outperforming ARIMA(2,0,2), Naive Random Walk, and Vanilla LSTM baselines on the co-primary Pearson correlation and variance-fidelity (StdR) metrics. The Vanilla LSTM baseline records lower nominal MAE and RMSE than the deployed model, but its forecasts are variance-collapsed toward the conditional mean; under variance collapse, lower error reflects mechanical contraction of the predicted dispersion rather than improved forecast usefulness, and the co-primary
diagnostic adopted here identifies this pattern. Ablation analysis identified the bidirectional LSTM pathway as the principal source of forecasting accuracy; the additional Transformer self-attention pathway did not provide a consistent gain in this context. A single-seed multi-horizon evaluation provides preliminary evidence that the BiLSTM-only configuration outperforms the dual-pathway hybrid across horizons: at the three-day horizon, the BiLSTM-only configuration attains
and DA
, while the dual-pathway hybrid trained under identical conditions attains
and DA
; predictive skill collapses beyond a two-week horizon under both architectures. A multi-seed extension of the multi-step protocol is identified as an immediate priority for future work. The full benchmark evidence motivates the variance-sensitive co-primary diagnostics adopted throughout this study, consistent with the methodological consensus that DA alone is not a sufficient evaluation framework [
46] and becomes materially informative only when paired with magnitude-sensitive directional value measures [
49].
The VMD further revealed the multi-scale structure underlying RSS3 price dynamics. The dominant trend component (IMF
1, 74.7% of signal energy) reflects long-run supply–demand fundamentals, while intermediate components (IMF
2–IMF
3, 17.1%) correspond to inventory and export cycles, and high-frequency residuals (IMF
4–IMF
6, 8.3%) capture speculative noise and short-horizon information shocks. The primary forecasting gains were attributable to the lower-frequency components, which provided stable sequential context to the BiLSTM encoder—consistent with prior rubber-futures evidence that decomposition improves performance when retained modes are matched to the temporal scale of the prediction target [
7]. The precise interaction mechanisms between IMF characteristics and the BiLSTM encoder remain an open question warranting systematic ablation across decomposition modes in future work.
The Vanilla LSTM benchmark reported in
Section 3.5 provides direct empirical support for the variance-sensitive evaluation framework adopted in this study. The zero standard deviation of its directional accuracy across five seeds is itself diagnostic: when an architecture converges to the same near-flat forecast regardless of initialisation, it has learned the unconditional sign distribution of the target rather than its conditional dynamics. Under any reporting practice that relies on DA alone, the Vanilla LSTM would be presented as a competitive baseline whose simplicity favours adoption on the grounds of parsimony; the variance-sensitive co-primary metrics adopted here partition the two models unambiguously and identify the VMD-augmented feature representation, rather than the recurrent architecture alone, as the contribution responsible for magnitude fidelity. This finding generalises the methodological argument of Bürgi [
46], McCarthy and Snudden [
47], and Costantini et al. [
49] from theoretical critique to empirical demonstration on commodity-price data.
Five limitations constrain the current framework. First, although the final dataset contains complete coverage of all 24 input features, the analysis remains conditional on the harmonisation of heterogeneous market calendars and contract-specific reporting conventions across exchanges. Second, although VMD is fitted in an expanding-window manner that eliminates leakage, the bandwidth penalty
and the number of modes
K are not adaptively re-tuned as the window expands. Under structural market transitions, fully adaptive decomposition may yield further gains at substantially higher computational cost. Third, the analysis covers a single commodity (natural rubber, RSS3 FOB), a single market structure (Thai physical price), and one specific test window; external validity across commodity classes and market structures remains to be tested. Fourth,
Appendix G reports approximate analytical confidence intervals (Fisher’s
z for Pearson’s
r and the delta method for StdR) alongside a paired non-parametric bootstrap with
resamples for the co-primary diagnostics at the deployment horizon. The bootstrap intervals for the paired differences
and
between the proposed model and the Vanilla LSTM baseline exclude zero at the 95% level, providing a non-parametric inferential complement to the descriptive multi-seed evidence. Diebold–Mariano and Giacomini–White tests under a model-confidence-set adjustment across all baselines and all horizons remain a planned extension; their omission here reflects the methodological argument advanced in
Section 3.5 that DM tests on MAE/RMSE losses are not informative when the dominant divergence between architectures is forecast-output underdispersion rather than mean squared error. Fifth, the comparison of the proposed VMD-as-features design is conducted against ARIMA, Naive baselines, Vanilla LSTM, and within-family architectural ablations (per-IMF pipeline, Transformer-only, and dual-pathway hybrid); a head-to-head benchmark against recent Transformer-based and MLP-based long-horizon forecasters—including Informer, Autoformer, FEDformer, PatchTST, DLinear, SOFTS, MoFo, and PHAT—is not undertaken here. Because the leakage-free VMD-as-features design and the variance-sensitive evaluation protocol are architecture-agnostic, a systematic benchmark in which each of these models is coupled to the same VMD-augmented feature representation and evaluated under the same variance-sensitive co-primary diagnostic constitutes the natural next step and is identified as a priority for follow-up work.
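The two interval constructions mentioned in the fourth limitation can be sketched as follows. This is a minimal illustration rather than the procedure of Appendix G: the Fisher interval uses the textbook standard error 1/sqrt(n - 3), which assumes independent observations; the bootstrap resamples test rows independently and therefore ignores serial dependence; the resample count `n_boot=2000` is a hypothetical default; and the names `fisher_z_interval`, `stdr_gap`, and `paired_bootstrap_interval` are introduced here for illustration only.

```python
import math
import random
from statistics import pstdev


def fisher_z_interval(r, n, z_crit=1.959964):
    """Approximate 95% interval for Pearson's r via Fisher's z-transform."""
    if n <= 3 or not -1 < r < 1:
        raise ValueError("need n > 3 and |r| < 1")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


def stdr_gap(rows):
    """StdR of model A minus StdR of model B on rows of (actual, pred_a, pred_b)."""
    sd_actual = pstdev([row[0] for row in rows])
    if sd_actual == 0:
        return 0.0  # degenerate resample with constant targets
    sd_a = pstdev([row[1] for row in rows])
    sd_b = pstdev([row[2] for row in rows])
    return (sd_a - sd_b) / sd_actual


def paired_bootstrap_interval(stat_diff, rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for a paired difference; rows are resampled jointly."""
    rng = random.Random(seed)
    n = len(rows)
    draws = sorted(
        stat_diff([rows[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lower = draws[int((alpha / 2) * n_boot)]
    upper = draws[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Illustrative usage with synthetic rows; r = 0.5 is a made-up input.
print(fisher_z_interval(0.5, 175))
actual = [math.sin(0.7 * i) * (1 + 0.1 * i) for i in range(30)]
rows = [(a, 0.9 * a, 0.05 * a) for a in actual]
print(paired_bootstrap_interval(stdr_gap, rows, n_boot=500))
```

Resampling whole rows keeps the two models' forecasts paired with the same realised value, so the interval describes the difference between models rather than each model separately.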
Policy Implications
The findings carry direct implications for price risk management along the Thai rubber supply chain. In a statistical evaluation setting, the proposed model correctly classifies the sign of daily RSS3 FOB price movements in approximately four out of five non-zero test observations, with calibrated amplitude predictions (StdR ≈ 1.09); this provides a defensible quantitative input for hedging decisions and physical inventory scheduling. This statistical performance should not, however, be read as direct trading guidance: realised economic value depends on transaction costs, hedging-instrument availability, market liquidity, and risk-management overlay, none of which is modelled here. More broadly, the demonstration that variance collapse can arise across multiple independent mechanisms in conventional decomposition pipelines applied to differenced commodity series has direct relevance for the design of early-warning systems used by national rubber price stabilisation programmes, where near-zero forecast variance can render model signals operationally uninformative.
Concretely, the following operational deployment criteria are proposed for forecasting systems submitted for adoption by market regulators and stabilisation agencies. Each retraining cycle should report (i) the directional accuracy DA on the non-zero subset, (ii) the Pearson correlation (
r), (iii) the Standard Deviation Ratio (StdR), (iv) the class-conditional up recall and down recall, and (v) a forecast-dispersion plot comparing the empirical distributions of predicted and realised differenced series. A model should not be adopted on the basis of DA alone: a configuration that records DA above
while StdR falls below
produces forecasts that are shrunk toward the unconditional mean and is at most weakly informative about magnitude; such a configuration should be flagged for review rather than deployed. This is not merely a matter of reporting convention: Costantini and Kunst [
48] demonstrate that selecting forecasting models on the basis of DA alone seldom yields robust gains and can worsen performance on magnitude-sensitive criteria—implying that operational deployment decisions grounded solely in hit-rate statistics risk systematically selecting the wrong model. The proposed evaluation framework is immediately adoptable without additional data requirements and is applicable across agricultural commodity markets more broadly.
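Items (iv) and the review rule in the deployment criteria above can be expressed as a short check. The sketch below is illustrative only: the function names are hypothetical, recall is computed on the non-zero realised changes, and the two thresholds are required arguments with no defaults, so the values passed in the usage lines are placeholders chosen for the example rather than recommended settings.

```python
def class_conditional_recall(actual, predicted):
    """Up recall and down recall on the non-zero realised changes."""
    ups = [p for a, p in zip(actual, predicted) if a > 0]
    downs = [p for a, p in zip(actual, predicted) if a < 0]
    up_recall = sum(p > 0 for p in ups) / len(ups) if ups else float("nan")
    down_recall = sum(p < 0 for p in downs) / len(downs) if downs else float("nan")
    return up_recall, down_recall


def needs_review(da, stdr, da_floor, stdr_floor):
    """Flag a configuration whose hit rate looks competitive but whose dispersion has collapsed."""
    return da > da_floor and stdr < stdr_floor


# Toy example: a forecast that always predicts a small rise.
actual = [1.0, -1.0, 2.0, -2.0, 0.0, 3.0]
always_up = [0.1] * len(actual)
print(class_conditional_recall(actual, always_up))

# Placeholder thresholds, for illustration only.
print(needs_review(da=0.8, stdr=0.10, da_floor=0.7, stdr_floor=0.2))
print(needs_review(da=0.8, stdr=1.09, da_floor=0.7, stdr_floor=0.2))
```

A forecast that always predicts the same sign scores perfect recall on one class and zero on the other, which is why the two recalls are reported separately instead of being folded into a single hit rate.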
As natural rubber markets become increasingly integrated with global financial systems and sensitive to climate-induced supply disruptions, the capacity to generate reliable, variance-faithful short-horizon price forecasts will grow in operational importance; the present framework offers a methodologically transparent and empirically validated foundation on which such systems can be built and extended.