In this section, we present the empirical evaluation of the proposed forecasting framework. We begin by detailing the dataset and experimental setup, including data partitioning and model implementation parameters. We then report stationarity diagnostics for the raw price and transformed series. Building on this foundation, we examine the effectiveness of the delta-targeting strategy across both the recurrent and the recent adapted neural baselines. We then benchmark all evaluated models against classical statistical baselines across three tactical horizons and report residual autocorrelation diagnostics to assess forecast reliability beyond point-error metrics. Finally, we analyze the interpretability components of the Hybrid model.
4.1. Dataset and Experimental Setup
The empirical analysis is based on a proprietary daily scrap steel price series provided by an industrial partner. To preserve commercial confidentiality, the company identity and market-specific contractual details are not disclosed; however, the series is internally consistent over time and corresponds to a single scrap steel price stream used for operational decision support in the partner’s procurement setting. The cleaned dataset spans the period from 10 May 2016 to 21 February 2025 and contains 3210 daily observations. After preprocessing, the final series contains no missing dates and no duplicate timestamps, yielding one observation for each calendar day in the study period. Observations are available for all seven days of the week, consistent with the continuous pricing practice of the partner organization. Over the observation period, prices range from 207 to 658 Turkish Lira (TL)/ton, with a mean of 352.3 TL/ton and a standard deviation of 82.6 TL/ton, indicating substantial variation over time, including the COVID-19 period (2020) and the subsequent recovery phase (2021–2022).
For neural models, we adopt a chronological split after multi-horizon target alignment. Since the largest forecasting horizon is h = 7, the construction of supervised targets at this horizon reduces the effective sample size from 3210 raw daily observations to 3203 supervised samples. The first 80% of these supervised samples (2562 observations; May 2016–May 2023) forms the training-full partition, and the remaining 20% (641 observations; May 2023–February 2025) is reserved for out-of-sample testing. Within the training-full partition, the earliest 85% (2177 observations) is used for training and the most recent 15% (385 observations) for validation, enabling early stopping and model selection under strict temporal causality.
The stationarity diagnostics reported in Section 4.2 are computed separately on the first 80% of the original chronological raw price series before multi-horizon target alignment. This raw diagnostic segment contains 2568 daily price observations. The difference between 2568 and 2562 arises from the order of operations: stationarity testing operates on the raw price-level series before target construction, whereas neural-model training uses the supervised partition obtained after excluding the final seven observations, which cannot serve as forecast anchors, and then applying the chronological 80/20 split with integer indexing. In all cases, the held-out test period is excluded from preprocessing, diagnostics, and model selection. To prevent look-ahead bias, all preprocessing steps used for neural models, including MinMax scaling, are fitted exclusively on the neural training subset and then applied unchanged to the validation and test subsets. Classical baselines (SARIMA and ETS), by contrast, are evaluated under a rolling-origin refit protocol on the same test period. Across all experiments, methods are evaluated on the same set of test anchor times to ensure strict comparability.
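The partition sizes quoted above can be reproduced with a few lines of integer arithmetic. The following is a minimal sketch, not the authors' code: the function name and the use of floor division for "integer indexing" are assumptions, chosen because they reproduce the reported counts.

```python
def chronological_split(n_raw, max_horizon, train_full_pct=80, train_pct=85):
    """Partition sizes for a chronological split after target alignment.

    Assumption: percentages are applied with floor (integer) division.
    """
    # The last `max_horizon` observations have no complete target vector.
    n_supervised = n_raw - max_horizon
    n_train_full = n_supervised * train_full_pct // 100
    n_test = n_supervised - n_train_full
    n_train = n_train_full * train_pct // 100
    n_val = n_train_full - n_train
    return {
        "supervised": n_supervised,
        "train_full": n_train_full,
        "test": n_test,
        "train": n_train,
        "val": n_val,
    }


sizes = chronological_split(n_raw=3210, max_horizon=7)
# Stationarity tests use 80% of the raw series, before target alignment.
raw_diagnostic_size = 3210 * 80 // 100
print(sizes, raw_diagnostic_size)
```

Under these assumptions the supervised split gives 2562 training-full samples, while the raw diagnostic segment gives 2568, which accounts for the six-observation difference discussed above.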
To ensure a consistent comparison across model families, all neural models process a fixed-length sliding window of past observations and use a direct multi-horizon output configuration covering all three forecasting horizons. Each daily input vector contains features derived strictly from past information: logarithmic return, 7-day rolling volatility, the first difference of a shifted EWMA with span 14, and sine/cosine encodings of the day of week. The logarithmic return captures short-term momentum, the 7-day rolling volatility reflects local variability, the EWMA slope provides a smooth trend indicator, and the calendar encodings represent weekly seasonality in a continuous form. All features are computed from lagged information with appropriate shifting so that no future observation enters the feature construction process.
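As an illustration of the five input features, the sketch below builds one feature row per day from a price list and a day-of-week list. It is a sketch under stated assumptions rather than the authors' pipeline: the function name, the one-day shift applied to the EWMA, the recursive EWMA form with smoothing factor 2 / (span + 1), and the use of the sample standard deviation of log returns as "volatility" are all assumptions.

```python
import math
from statistics import stdev


def build_features(prices, weekdays, vol_window=7, ewma_span=14):
    """Return rows of (log_return, rolling_vol, ewma_slope, dow_sin, dow_cos).

    A row at day t uses prices up to and including day t only.
    """
    n = len(prices)
    # Log return at t compares day t with day t-1 (undefined at t = 0).
    log_ret = [None] + [math.log(prices[t] / prices[t - 1]) for t in range(1, n)]

    # Recursive EWMA (assumed form), smoothing factor from the span.
    alpha = 2.0 / (ewma_span + 1)
    ewma = [prices[0]]
    for t in range(1, n):
        ewma.append(alpha * prices[t] + (1 - alpha) * ewma[-1])

    rows = []
    for t in range(vol_window, n):
        # Volatility over the most recent `vol_window` log returns.
        vol = stdev(log_ret[t - vol_window + 1 : t + 1])
        # First difference of the EWMA shifted by one day (assumed shift).
        slope = ewma[t - 1] - ewma[t - 2]
        angle = 2 * math.pi * weekdays[t] / 7
        rows.append((log_ret[t], vol, slope, math.sin(angle), math.cos(angle)))
    return rows


# Toy input: a flat price series over ten consecutive days.
rows = build_features([100.0] * 10, [d % 7 for d in range(10)])
print(len(rows), rows[0])
```

The first `vol_window` days are dropped because the rolling volatility is not yet defined there; a flat price series yields zero return, zero volatility, and (up to rounding) zero slope.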
For the delta-targeting formulations, including the proposed Hybrid model, the target at each horizon is the change in price between the forecast anchor and the target date, and the final forecast is reconstructed by adding the predicted change back to the anchor price. This formulation shifts the learning objective from absolute price levels to short-horizon price changes, which is particularly relevant given the non-stationary behavior of the raw series documented in Section 4.2.
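The delta-targeting idea can be sketched in a few lines. This is an illustrative reading, assuming the target is the plain price difference between the anchor day and the day h steps ahead; the helper names are hypothetical, and the horizon tuple `(1, 3, 7)` is a placeholder (only the largest horizon, 7, can be inferred from the sample counts in Section 4.1).

```python
def make_delta_targets(prices, horizons):
    """One target vector per anchor t: prices[t + h] - prices[t] for each h."""
    max_h = max(horizons)
    # Anchors too close to the end of the series have no complete target.
    n_anchors = len(prices) - max_h
    return [[prices[t + h] - prices[t] for h in horizons] for t in range(n_anchors)]


def reconstruct_prices(anchor_price, predicted_deltas):
    """Map predicted changes back to price levels by adding the anchor price."""
    return [anchor_price + d for d in predicted_deltas]


HORIZONS = (1, 3, 7)  # placeholder horizons; the middle value is an assumption
prices = [float(p) for p in range(300, 310)]  # toy series, not the study data
targets = make_delta_targets(prices, HORIZONS)
recovered = reconstruct_prices(prices[0], targets[0])
print(targets[0], recovered)
```

With a perfect delta prediction, reconstruction returns the observed future prices exactly, so any forecast error in price space equals the error in delta space.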
Regarding model architectures, the recurrent deep learning baselines (RNN, GRU, and LSTM) consist of two stacked layers with 64 units each and a dropout rate of 0.2.
For the recent adapted neural baselines, DLinear and N-BEATS follow the architectures described in Section 3.4 under the same optimization settings. The C-KAN-style baseline uses a convolutional front-end with 32 channels and kernel size 5, followed by stacked KAN-style layers with hidden dimension 128 and Gaussian RBF activations. The proposed Hybrid model employs a deeper non-linear pathway with Bidirectional GRU (BiGRU) layers of 128 and 64 units, respectively, together with L2 regularization and a dropout rate of 0.3. All neural architectures are optimized using Adam with Huber loss and a batch size of 32. Training uses early stopping and learning-rate reduction on plateau, with patience values in the range of 12–15 epochs. To reduce the sensitivity of neural results to random initialization, all reported neural metrics are aggregated over 10 independent runs using fixed seeds.
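For readers unfamiliar with the Huber loss, the sketch below shows its standard piecewise form: quadratic for small residuals and linear for large ones. The threshold value `delta=1.0` is an assumption made for illustration and is not taken from the experimental configuration.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss for a single residual (delta is an assumed value)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a            # quadratic region: behaves like squared error
    return delta * (a - 0.5 * delta)  # linear region: limits the impact of outliers


small = huber(0.5)
large = huber(3.0)
print(small, large)
```

The linear region is what makes the loss less sensitive than squared error to occasional large price jumps.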
4.2. Stationarity Diagnostics
Prior to model estimation, the stationarity properties of the price series were examined exclusively on the training partition (2568 daily observations; May 2016–May 2023) in order to prevent any leakage from the held-out test period. We apply two complementary tests: the Augmented Dickey–Fuller (ADF) test, whose null hypothesis states that the series contains a unit root (i.e., is non-stationary), and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test, whose null hypothesis states that the series is stationary. Using both tests jointly reduces the risk of drawing misleading conclusions from either test alone. The combined use of ADF and KPSS tests as a preprocessing diagnostic step is consistent with recent practice in time-series forecasting evaluation, where non-stationarity and temporal distributional changes are known to affect model training, validation design, and generalization behavior, including in recurrent neural forecasting models [36, 37]. The joint interpretation of ADF and KPSS results also provides a compact diagnostic framework for characterizing a series as non-stationary, stationary, or potentially trend-stationary depending on the combination of unit-root and stationarity test outcomes. In the present case, the raw price series is consistently classified as non-stationary, whereas all delta-transformed targets are classified as stationary under both tests. Results are reported in Table 2.
The raw price series fails to reject the ADF unit-root null hypothesis, while the KPSS stationarity null is strongly rejected (test statistics and p-values are given in Table 2). The two tests therefore agree in characterizing the raw price series as non-stationary. In contrast, all three delta series and the log-return series reject the ADF null at conventional significance levels while failing to reject the KPSS null, indicating stationarity under both criteria. These results provide statistical support for the delta-targeting formulation adopted in this study: by training models on short-horizon price changes rather than absolute price levels, the learning objective is defined on a statistically more stable target space, which can improve optimization behavior and reduce the risk of persistence-dominated predictions. It should be noted, however, that stationarity of the target series does not guarantee well-behaved residuals or eliminate all sources of forecast difficulty; rather, it removes one structural obstacle that is known to affect model training under non-stationary conditions.
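The joint reading of the two tests can be written as a small decision rule. The sketch below follows a commonly used convention for combining ADF and KPSS outcomes; the function name, the 0.05 threshold, and the example p-values are assumptions made for illustration and are not the values reported in Table 2.

```python
def classify_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF (null: unit root) and KPSS (null: stationary) outcomes."""
    adf_rejects_unit_root = adf_pvalue < alpha
    kpss_rejects_stationarity = kpss_pvalue < alpha

    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        return "non-stationary"                  # both tests point to a unit root
    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "stationary"                      # both tests point to stationarity
    if not adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "possibly trend-stationary"       # tests disagree
    return "possibly difference-stationary"      # tests disagree


# Made-up p-values that mimic the qualitative pattern described above.
raw_label = classify_stationarity(adf_pvalue=0.40, kpss_pvalue=0.01)
delta_label = classify_stationarity(adf_pvalue=0.001, kpss_pvalue=0.10)
print(raw_label, delta_label)
```

In practice the p-values would come from a statistics library; the rule itself only encodes how the two opposite null hypotheses are combined.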
4.3. Effect of Delta-Targeting
To assess the practical effect of the delta-targeting formulation, we compare six neural models under two training objectives: (i) direct regression on raw price levels and (ii) regression on short-horizon price changes. The evaluated models include three recurrent architectures (RNN, GRU, and LSTM) and three recent adapted neural baselines (DLinear, N-BEATS, and C-KAN). All configurations share the same input window, direct multi-horizon output setup, and the leakage-free evaluation protocol defined in Section 4.1. Performance is summarized in Table 3, Table 4, Table 5, Table 6 and Table 7 in terms of RMSE, MAPE (%), R², RAMP, and normalized HD.
Table 3 shows a clear and consistent effect of target formulation across all six neural models. For the recurrent architectures, delta-targeted training reduces RMSE at all horizons, with the largest improvement observed for LSTM. The same trend is visible for RNN and GRU, indicating that recurrent architectures benefit systematically from learning in delta space rather than directly in the raw price domain. The recent neural forecasting baselines exhibit an even stronger dependence on target formulation: under raw-price training, DLinear, N-BEATS, and C-KAN all produce very large RMSE values, whereas their delta-targeted variants recover to a competitive range.
The same pattern appears in relative error terms. As shown in Table 4, delta-targeting consistently yields lower MAPE across all recurrent and recent adapted neural architectures. Among the recurrent models, the delta-based RNN achieves the lowest weekly-horizon MAPE. Among the recent adapted neural forecasting baselines, the shift from raw-price to delta-targeted training is especially pronounced: C-KAN, DLinear, and N-BEATS all show large reductions. These results suggest that the delta formulation substantially improves relative-error behavior, particularly for architectures that otherwise struggle with raw-price targets.
The goodness-of-fit results in Table 5 reinforce the same conclusion. For recurrent models, raw-price training already leads to a clearer loss of fit, whereas delta-targeted training preserves a higher R² level. For the recent adapted neural forecasting baselines, the contrast is even sharper: raw-price training yields negative R² values across all horizons, whereas delta-targeting shifts all three models into a positive and competitive range. This indicates that, in the present multivariate short-horizon setting, the target formulation is at least as important as the architectural family itself.
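The three point-error metrics, and the reason a coefficient of determination can be negative, can be illustrated with textbook definitions. The sketch below assumes the standard formulas (R² as one minus the ratio of residual to total sum of squares); the toy numbers are invented and are not drawn from the tables.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    return 100.0 * sum(abs(a - b) / abs(a) for a, b in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


y = [100.0, 102.0, 104.0, 106.0]           # toy observed prices
close_forecast = [101.0, 101.0, 105.0, 105.0]
offset_forecast = [110.0] * 4              # flat forecast at the wrong level

print(rmse(y, close_forecast), r2(y, close_forecast))
print(rmse(y, offset_forecast), r2(y, offset_forecast))
```

A forecast that sits at the wrong level accumulates more squared error than simply predicting the sample mean, which drives R² below zero; this is the kind of behavior that negative values under raw-price training point to.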
The improvement under delta-targeting also extends to the complementary trajectory-level diagnostics reported in Table 6 and Table 7. In terms of the normalized HD, raw-price training generally produces larger distances between the predicted and observed trajectories, indicating weaker geometric alignment in the time–value plane. This effect is particularly visible for the adapted neural baselines, where raw-price training leads to unstable or poorly aligned forecast paths, whereas delta-targeted training yields more comparable trajectory shapes. Thus, the HD results support the main point-error findings by showing that delta-targeting improves not only average error levels but also the overall shape similarity between forecasts and observations.
The RAMP results require more careful interpretation because RAMP measures local change-tracking rather than absolute level accuracy. For DLinear and N-BEATS, delta-targeting substantially reduces RAMP across all horizons, which is consistent with their large improvements in RMSE, MAPE, and R². However, for some models, a lower RAMP value under raw-price training does not necessarily imply a better forecast. This is most evident for the raw-price C-KAN model, which obtains the lowest RAMP values across all horizons. Despite these low RAMP scores, the same model shows substantially degraded point-forecast accuracy, with negative R² values across all horizons (see Table 5). This discrepancy suggests that raw-price C-KAN produces relatively smooth or persistence-dominated trajectories with limited local variation. Such forecasts may reduce the difference between successive predicted changes and successive observed changes, thereby lowering RAMP, while still failing to match the actual price level. Therefore, RAMP and HD are interpreted here as complementary diagnostics rather than standalone model-selection criteria.
Taken together, the results across both recurrent and recent adapted neural forecasting baselines indicate that delta-targeting is a critical design choice in this multivariate short-horizon forecasting problem. For recurrent models, it consistently reduces point error and improves fit. For the recent adapted neural forecasting baselines, it appears necessary for competitive performance: without it, the combination of multivariate input and non-stationary raw-price targets leads to severely degraded results regardless of architectural complexity. All subsequent neural comparisons in this paper therefore use delta-targeted training.
4.4. Overall Benchmarking Across Horizons
We benchmark the proposed Hybrid Wide & Deep model against four classical statistical baselines (Naive, S-Naive, SARIMA, and ETS), three delta-targeted recurrent deep learning models (Δ-RNN, Δ-GRU, and Δ-LSTM), and three delta-targeted recent adapted neural forecasting baselines (Δ-DLinear, Δ-N-BEATS, and Δ-C-KAN). Figure 1, Figure 2 and Figure 3 summarize performance across the three horizons using RMSE, MAPE, and R², respectively.
Figure 1 presents the RMSE comparison across all evaluated methods. At the shortest horizon, the Naive baseline (RMSE 2.335) remains highly competitive, with delta-targeted recurrent models (e.g., Δ-RNN: 2.271) and the recent adapted neural forecasting baselines (N-BEATS: 2.327, C-KAN: 2.338, DLinear: 2.465) achieving comparable results. The proposed Hybrid model (2.314) also remains within this competitive range. At the weekly horizon, a clearer stratification emerges: SARIMA (8.234) and ETS (8.161) perform worse than the Naive baseline (7.555), while all delta-targeted neural models maintain relative stability. The Δ-RNN achieves the lowest RMSE (7.289), followed by Δ-GRU (7.372). Among the recent adapted neural forecasting baselines, Δ-DLinear (7.472) is competitive, while Δ-N-BEATS (7.592) and Δ-C-KAN (7.691) trail slightly. The Hybrid model (7.471) remains within the competitive range without uniformly minimizing point error.
As illustrated in Figure 2, the relative-error results reinforce the main pattern observed in the RMSE analysis. At the intermediate horizon, the Naive baseline remains the strongest benchmark, while the delta-targeted neural models occupy a narrow but slightly higher MAPE range. Among these, Δ-C-KAN attains the lowest MAPE, followed by Δ-RNN, the Hybrid model, and Δ-N-BEATS. In contrast, the classical statistical baselines SARIMA and ETS remain less competitive at this horizon. At the extended horizon, a clearer separation emerges. The classical statistical models deteriorate, whereas the best neural models remain in a lower MAPE range. The lowest mean MAPE is achieved by Δ-RNN, marginally outperforming the Naive baseline. The proposed Hybrid model remains within the competitive range, although it does not minimize point error relative to the best-performing recurrent configuration. Among the recent adapted neural forecasting baselines, Δ-C-KAN and Δ-DLinear are close to the Hybrid model, while Δ-N-BEATS trails slightly.
The goodness-of-fit analysis shown in Figure 3 is consistent with the patterns observed in the error-based metrics. At the shortest horizon, most evaluated models already achieve high explanatory power, reflecting strong short-term persistence in the series and leaving limited room for improvement over simple carry-forward forecasting. In this setting, the gains achieved by delta-targeted neural models over the Naive baseline remain relatively modest. A clearer separation emerges at the weekly horizon. The classical statistical baselines SARIMA and ETS fall below the Naive benchmark, whereas the delta-targeted neural models remain in a higher range. The highest R² is obtained by Δ-RNN, followed by Δ-GRU. The proposed Hybrid model remains within the competitive range and exceeds both the Naive and the classical statistical baselines, although it does not attain the highest explanatory power among all evaluated neural configurations. Among the recent adapted neural forecasting baselines, Δ-DLinear, Δ-N-BEATS, and Δ-C-KAN also remain competitive at this horizon. Overall, the R² results indicate that the relative advantage of neural models becomes more apparent as the forecasting horizon increases, while the Hybrid model should be interpreted as competitive rather than uniformly best.
Table 8 reports the RAMP scores across all evaluated methods. At the shorter of the two horizons highlighted here, the Seasonal Naive baseline attains the lowest RAMP, followed closely by Δ-C-KAN and the Naive baseline. The delta-targeted recurrent models occupy a similar range, whereas SARIMA and Δ-DLinear exhibit comparatively higher values at this horizon. At the longer horizon, RAMP values increase for most models, consistent with the greater difficulty of tracking short-term price movements at longer horizons. The lowest RAMP is obtained by the Naive and Seasonal Naive baselines, which share the same value, followed by Δ-C-KAN. Among the remaining neural models, Δ-LSTM and Δ-GRU remain relatively competitive, whereas the Hybrid model and Δ-DLinear yield higher values. It should be noted that low RAMP values do not necessarily indicate superior overall forecast quality; as discussed in Section 4.3, models that produce near-constant or persistence-dominated forecasts can appear favorable under RAMP while performing less strongly on point-error metrics. RAMP should therefore be interpreted alongside RMSE, MAPE, and R² rather than in isolation.
Table 9 reports the normalized HD across all evaluated methods. In contrast to RAMP, HD provides a trajectory-level view by quantifying the geometric similarity between the predicted and observed series in the time–value plane. At the shortest horizon, the Naive baseline yields the lowest HD, reflecting the strong short-term persistence of the series. Among the neural models, Δ-C-KAN attains the lowest HD, followed by Δ-GRU and Δ-LSTM, which share the same value, while the Hybrid model remains within a similar range. At the two longer horizons, the relative differences become more pronounced. Δ-C-KAN again attains the lowest HD, falling substantially below the classical statistical baselines SARIMA and ETS and also below the recurrent delta-targeted models, while the Hybrid model, Δ-DLinear, and the delta-targeted recurrent models occupy a moderately higher but still competitive range relative to the classical baselines. Overall, the HD results suggest that the neural models, and especially Δ-C-KAN, preserve the global shape of the target trajectory more effectively than the statistical baselines as the forecasting horizon increases. As with RAMP, however, HD should be interpreted as a complementary diagnostic rather than as a standalone selection criterion. It is worth noting that Δ-C-KAN's strong HD performance across all horizons coexists with elevated residual autocorrelation at the one-step horizon (see Section 4.5). This indicates that geometric trajectory similarity and residual independence reflect different aspects of forecast quality and may lead to different model rankings.
4.5. Residual Diagnostics and Forecast Reliability
To assess forecast reliability beyond point-error metrics, we apply the Ljung–Box Q-test at lag 10 to the residuals of all evaluated models. Although Ljung–Box testing originates in statistical time-series analysis, tests for residual autocorrelation have also been discussed in neural time-series modeling as tools for detecting remaining serial dependence in model errors, particularly when such dependence may arise from omitted variables, measurement errors, model misspecification, or insufficient temporal feature representation [38, 39]. Accordingly, in this study, the Ljung–Box Q-test is used as a complementary residual-dependence diagnostic rather than as a point-accuracy metric. The null hypothesis (H0) states that the residuals exhibit no serial autocorrelation up to lag 10. Rejection of H0 at the chosen significance level indicates that predictable temporal structure remains in the errors, suggesting that the model has not fully captured the dynamics of the series. For deterministic baselines, the test statistic and p-value are reported directly; for neural models, we report the Ljung–Box statistic together with the number of runs in which H0 is rejected out of 10 independent runs. Results are summarized in Table 10.
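The Ljung–Box statistic and the per-run rejection count can be sketched directly from the textbook definition, Q = n(n + 2) Σ ρ_k² / (n − k) summed over lags k = 1 to 10. The sketch below is illustrative: the helper names are hypothetical, the 5% significance level is an assumption, and the constant 18.307 is the corresponding chi-square critical value for 10 degrees of freedom. The residual series are synthetic.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean_x = sum(x) / n
    denom = sum((v - mean_x) ** 2 for v in x)
    num = sum((x[t] - mean_x) * (x[t - lag] - mean_x) for t in range(lag, n))
    return num / denom


def ljung_box_q(residuals, max_lag=10):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


CHI2_CRIT_5PCT_DF10 = 18.307  # assumed 5% level, 10 degrees of freedom


def count_rejections(residual_runs, max_lag=10, critical=CHI2_CRIT_5PCT_DF10):
    """Number of runs whose Q statistic exceeds the critical value."""
    return sum(ljung_box_q(r, max_lag) > critical for r in residual_runs)


# Synthetic, strongly autocorrelated residuals (alternating sign).
alternating = [1.0, -1.0] * 50
q_value = ljung_box_q(alternating)
n_rejected = count_rejections([alternating, alternating])
print(q_value, n_rejected)
```

Reporting the rejection count across seeds, rather than a single p-value, reflects the fact that each neural run produces its own residual series.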
The results reveal a clear pattern: meaningful differentiation across models is observed only at h = 1, while at the two longer horizons virtually all models exhibit significant residual autocorrelation. At the one-step horizon, the delta-targeted recurrent models (Δ-RNN, Δ-GRU, and Δ-LSTM) produce the cleanest residuals, with H0 not rejected in any of the 10 independent runs (Q statistics of 4.24, 6.08, and 8.29, respectively). The Hybrid model (3/10 rejections) and the recent adapted neural forecasting baselines Δ-DLinear (2/10) and Δ-N-BEATS (1/10) also remain broadly consistent with the no-autocorrelation null at h = 1. In contrast, the classical persistence-based baselines (Naive and Seasonal Naive) and Δ-C-KAN exhibit strongly significant autocorrelation at h = 1, suggesting that these models leave substantial predictable structure unexploited in their one-step residuals.
At the two longer horizons, all evaluated models reject H0 across all runs, with Q statistics increasing substantially with the forecast horizon. This pattern is consistent with the difficulty of fully capturing multi-step dependence in financial price series under the present feature set and direct multi-horizon forecasting setup. The presence of residual autocorrelation at longer horizons does not necessarily invalidate the point-error results reported in previous sections; rather, it indicates that additional explanatory variables or alternative modeling strategies may be required to further reduce unexplained temporal dependence at extended forecast horizons. The residual diagnostics are therefore particularly informative for model comparison at the one-step horizon. The superior residual behavior of the delta-targeted recurrent models at this horizon, combined with their competitive point-error performance, suggests that delta-targeting not only improves predictive accuracy but may also yield more weakly autocorrelated one-step residuals in this setting. Δ-C-KAN's failure to pass the Ljung–Box diagnostic at h = 1 despite its strong HD performance further underscores that different evaluation criteria capture distinct aspects of forecast quality.
4.6. Interpretability and Diagnostic Analysis of the Hybrid Model
This subsection examines the Hybrid model from an interpretability and diagnostic perspective at the most challenging tactical horizon (h = 7). Rather than treating these analyses as causal explanations of model behavior, we use them as complementary diagnostic tools to characterize how the model distributes emphasis across time steps, input features, and internal pathways. In particular, we analyze (i) across-run forecast stability, (ii) the decomposition of the Wide and Deep components, (iii) temporal attention profiles, and (iv) permutation-based feature sensitivity. These diagnostics are used to contextualize the comparative forecasting results reported in the previous subsections and to provide a more transparent view of the model's internal behavior under the adopted experimental setting.
Figure 4 provides a qualitative view of the Hybrid model's forecast behavior at the most challenging horizon (h = 7). The shaded region represents the 5–95% variability band across 10 independent runs, while the central line shows the mean prediction. The relatively narrow band suggests limited run-to-run variation under different random initializations. At the same time, the figure indicates that the model broadly follows the observed price trajectory, but some changes appear to be tracked with delay, which is consistent with persistence-sensitive forecasting behavior. This behavior is also consistent with the endogenous-only feature set used in this study, under which abrupt externally driven changes cannot be anticipated directly. Accordingly, this visualization should be interpreted as a descriptive stability and alignment diagnostic rather than as evidence, on its own, that the Hybrid model provides actionable advantages over simpler baselines. That question is addressed more appropriately through the comparative benchmark results reported in the previous subsections.
Figure 5 presents the scaled-space decomposition of the Hybrid model's internal pathways at h = 7. The visualization provides a diagnostic view of how the Wide and Deep components contribute to the final prediction across the test period. Across independent training runs, the Wide component (dashed blue line) exhibits more rapidly varying behavior, while the Deep component (solid orange line) follows a smoother and more oscillatory pattern. This contrast is consistent with a functional separation in which the Wide pathway captures more local linear adjustments and the Deep pathway contributes a smoother correction term. The total prediction (solid green line) remains relatively stable across runs, as indicated by the narrow ±1 standard deviation band. These patterns should be interpreted as model-internal diagnostics rather than as causal evidence that specific market mechanisms have been isolated. In particular, the oscillatory behavior of the Deep component is consistent with sensitivity to recurring temporal structure, but it does not by itself establish that the model has identified a unique weekly causal process.
Figure 6 summarizes the global attention profile of the Deep pathway, averaged across the test set and multiple training seeds. Rather than exhibiting a purely monotonic decay toward recent lags, the attention weights display a recurring pattern with peaks that are broadly consistent with 7-day spacing. This suggests that the model assigns relatively greater emphasis to certain lags associated with weekly temporal structure. However, attention-weight distributions can vary across independent training runs because they are affected by random initialization, optimization dynamics, and equivalent internal representations. The shaded standard deviation bands therefore provide a useful indication of across-run variability, but the resulting attention profile should be interpreted primarily as a qualitative, run-aggregated diagnostic summary rather than as a stable explanatory attribution. These results should not be interpreted as causal evidence that the attention mechanism has isolated a unique seasonal process. In this sense, the attention profile provides supporting diagnostic evidence that weekly lag structure is relevant to the Deep pathway under the present experimental setting.
To examine the Hybrid model's feature sensitivity at the most challenging horizon (h = 7), we use permutation feature importance (PFI) as a diagnostic tool. The input space contains five predictors: two cyclical calendar encodings, the sine and cosine of the day of week, and three technical variables, namely the logarithmic return, rolling volatility, and the EWMA slope. As shown in Figure 7a, permuting the day-of-week encodings leads to larger increases in RMSE than permuting the technical indicators, suggesting that the model is relatively more sensitive to weekly calendar information under this perturbation scheme. This result should be interpreted as a feature-sensitivity diagnostic rather than as causal evidence that the model's forecasts are driven exclusively by weekly operational cycles.
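The permutation scheme can be sketched generically: shuffle one feature column at a time, re-evaluate the fitted model, and record the increase in RMSE. The code below is a toy illustration with a hypothetical stand-in predictor and invented data, not the Hybrid model; how the permutation is applied to windowed inputs in the study is not specified here, so a flat feature table is assumed.

```python
import math
import random


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean RMSE increase per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [column[i]] + row[j + 1:] for i, row in enumerate(X)]
            increases.append(rmse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    # Hypothetical stand-in model that depends only on the first feature.
    return 2.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]  # invented feature table
y = [toy_predict(row) for row in X]
importances = permutation_importance(toy_predict, X, y)
print(importances)
```

A feature the model ignores produces no RMSE increase when shuffled, whereas a feature the model relies on does; the same logic underlies the comparison between calendar encodings and technical indicators.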
A complementary descriptive view is provided by the polar plot in Figure 7b, which displays the Mean Absolute Error (MAE) across the days of the week. The plotted differences are visually discernible but numerically small, as also indicated by the in-figure MAE range annotation. The radial and color scales should therefore be read in light of this narrow numerical range. For this reason, the day-specific differences should be interpreted cautiously and as descriptive rather than statistically or practically conclusive. Taken together, the PFI and polar analyses suggest that weekly structure is relevant to the Hybrid model at h = 7, but they do not by themselves establish a causal mechanism or practical superiority over alternative models.