Article

A Delta-Targeted Hybrid Deep Learning Architecture for Short-Term Scrap Steel Price Forecasting: A Comparative Study

1 Department of Smart Systems Engineering, Izmir Bakircay University, Izmir 35665, Türkiye
2 Department of Management Information Systems, Fenerbahce University, Istanbul 34758, Türkiye
3 Cyberspace Research and Application Center, Fenerbahce University, Istanbul 34758, Türkiye
4 Department of Data Science and Analytics, Istanbul Topkapi University, Istanbul 34662, Türkiye
5 Department of Computer Engineering, Izmir Bakircay University, Izmir 35665, Türkiye
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(10), 4981; https://doi.org/10.3390/app16104981
Submission received: 26 March 2026 / Revised: 5 May 2026 / Accepted: 7 May 2026 / Published: 16 May 2026

Abstract

Forecasting scrap steel prices is crucial for the economic sustainability of recycling operations, yet it remains challenging due to inherent volatility and non-stationary behavior. In this study, we develop and evaluate a delta-targeted Hybrid forecasting pipeline for short horizons of 1, 3, and 7 days. We benchmark classical baselines (Naive, Seasonal Autoregressive Integrated Moving Average (SARIMA), and Exponential Smoothing (ETS)) against recurrent deep learning models (Simple Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM)) and recent neural forecasting baselines, including Decomposition-Linear (DLinear), Convolutional Kolmogorov–Arnold Network (C-KAN), and Neural Basis Expansion Analysis for Time Series (N-BEATS), using real-world daily scrap steel price data. The results indicate that delta-targeting generally yields more stable predictive performance than direct raw-price forecasting as the prediction horizon increases. For example, at the 7-day horizon, the predictive fit improves from $R^2 \approx 0.87$ for raw-price LSTM to $R^2 \approx 0.90$ for delta-trained recurrent models. At the same horizon, a delta-based RNN achieves the lowest Mean Absolute Percentage Error (MAPE) among the evaluated models (approximately 1.39%), while the proposed Hybrid model remains competitive across all tested horizons and maintains a goodness-of-fit of $R^2 \approx 0.90$ without uniformly minimizing point error relative to the best-performing recurrent baseline. Attention profiling and permutation-based feature importance analyses indicate that the model places relatively higher weight on calendar-related inputs, consistent with the presence of weekly patterns in the data; these results should be interpreted as sensitivity diagnostics rather than causal evidence.
Overall, the findings suggest that delta-transformed targets provide a more suitable prediction space than raw-price targets for short-horizon scrap steel forecasting, while the Hybrid design offers a balanced combination of predictive performance and diagnostic interpretability for operational decision support.

1. Introduction

The accelerating transition toward circular-economy practices has fundamentally reshaped the role of steel recycling within global supply chains, elevating secondary raw materials from supplementary inputs to critical industrial assets. Recent work has accordingly examined scrap steel not merely as a low-cost substitute, but as a strategically important input whose price behavior deserves dedicated forecasting attention [1]. Related studies have also shown that scrap markets display structural features that differ from those of primary commodity markets, including stronger sensitivity to localized market conditions and irregular supply-demand adjustments [2]. In contrast to primary raw materials such as iron ore or coking coal, which are often governed by long-term supply contracts, scrap steel prices are shaped by more fragmented procurement channels and more reactive trading conditions [3]. International assessments further indicate that these fluctuations are amplified by global demand shocks, logistics constraints, and rapid changes in inventory cycles [4]. For key stakeholders such as recycling yards, Electric Arc Furnace (EAF) steel producers, and procurement teams, this volatility introduces significant financial risk. Short-horizon price uncertainty directly impacts day-to-week operational decisions, including the timing of bulk purchases, the optimization of inventory turnover ratios, and the management of exposure to spot-market movements.
From a methodological perspective, scrap steel price series pose a particular challenge because of their non-stationary behavior and the coexistence of multiple signal components, including strong linear persistence, calendar-related seasonality, and irregular shocks. In industrial practice, classical time-series models such as Autoregressive Integrated Moving Average (ARIMA/SARIMA) and Holt–Winters Exponential Smoothing (ETS) remain widely used. These models are often preferred because their linear structure is transparent, mathematically tractable, and easy to update as new observations become available. However, their reliance on historical autocorrelation can limit their effectiveness during periods of market turbulence. Conversely, deep learning has introduced flexible architectures such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs), which can model non-linear temporal dependencies over historical windows. However, when applied directly to raw financial time series, these models may still face practical difficulties related to non-stationarity, target instability, and limited interpretability. Furthermore, the relative opacity of these architectures may limit their adoption in risk-sensitive operational settings, where forecasting systems are often more useful when they provide not only competitive predictions but also interpretable diagnostic signals. Accordingly, there is a practical need for forecasting frameworks that combine the flexibility of deep learning with the stability and transparency required for operational adoption. In this work, we develop a short-horizon forecasting pipeline for $H \in \{1, 3, 7\}$ days and evaluate a hybrid model trained with a delta-targeting formulation and augmented with temporal attention. The main contributions of this study are as follows:
  • We provide a systematic comparison of raw-price and delta-targeted forecasting formulations for short horizons ($H \in \{1, 3, 7\}$) on real-world daily scrap steel price data, thereby quantifying the effect of target design on short-horizon predictive behavior.
  • We propose a hybrid Wide & Deep forecasting model with temporal attention, designed to combine linear persistence-oriented structure with non-linear representation learning while offering model-level diagnostic interpretability through attention profiling and permutation-based feature analysis.
  • We evaluate the proposed framework against classical baselines (Naive, Seasonal Naive, SARIMA, and ETS), recurrent neural models (RNN, GRU, and LSTM), and recent neural forecasting baselines (Decomposition-Linear (DLinear), Convolutional Kolmogorov–Arnold Network (C-KAN), and Neural Basis Expansion Analysis for Time Series (N-BEATS)), under time-consistent, leakage-free preprocessing and horizon-specific evaluation settings.
The remainder of this paper is organized as follows. Section 2 reviews the relevant literature on commodity price forecasting and the application of Hybrid neural architectures in time-series analysis. Section 3 details the proposed methodology, including the delta-targeting strategy and the formulation of the proposed Hybrid model. Section 4 presents the experimental setup and a comparative analysis of the empirical results. Section 5 discusses the findings, highlighting practical implications and limitations. Finally, Section 6 summarizes the study’s conclusions.

2. Literature Review

Price forecasting for industrial commodities, including scrap steel and related steel inputs, has received sustained attention because of its relevance to procurement, inventory planning, and circular-economy operations [5]. Existing studies may be broadly grouped into three methodological streams: (i) statistical and econometric models, (ii) machine learning and deep learning approaches, and (iii) Hybrid or interpretability-oriented frameworks that aim to balance predictive performance with model transparency.
Early work commonly relied on linear time-series formulations. Generalized Autoregressive Conditional Heteroskedasticity (GARCH) was introduced to model time-varying volatility in financial and commodity series [6]. The Box–Jenkins tradition established ARIMA as a standard framework for modeling linear dependence, trend, and seasonality in temporally ordered data [7]. Exponential smoothing and related state-space formulations have remained important operational baselines because of their simplicity and strong short-horizon performance under persistence-dominated dynamics [8]. Although such models can remain competitive when the dominant structure is linear, they are often less flexible under abrupt regime shifts, non-linear interactions, or rapidly changing market conditions.
To capture such non-linearities, later studies explored data-driven methods. Support Vector Regression, Random Forests, and gradient-boosting variants have been reported as competitive alternatives in steel- and commodity-related prediction tasks [9]. Neural-network approaches have also been applied to scrap steel forecasting in regional market studies. For example, one line of work examined short-horizon dynamics in Northeast China [10], while related studies extended the analysis to North China [11], South China [12], East China [13], nationwide Chinese markets [14], and Southwest China [15]. These studies collectively suggest that recurrent architectures can capture useful local temporal structure, but they also highlight the sensitivity of price forecasting performance to target design, market scope, and data preprocessing choices.
Recent forecasting research has further expanded toward state-of-the-art sequence models. Transformer-based architectures have received considerable attention because self-attention can model long-range temporal dependencies and support multi-horizon reasoning; examples include the original Transformer formulation [16] and later forecasting-oriented architectures such as the Temporal Fusion Transformer [17]. In parallel, recent neural baselines such as linear decomposition-based models and basis-expansion architectures have been proposed as lightweight alternatives for multi-step forecasting. Another important direction concerns probabilistic forecasting, where the goal is not only to predict a point estimate but also to quantify predictive uncertainty through intervals or full predictive distributions. Such approaches are particularly relevant in volatile commodity settings, where operational decisions may depend as much on risk bounds as on central forecasts. For instance, probabilistic recurrent models such as DeepAR estimate the conditional predictive distribution rather than only a point forecast, illustrating the practical value of uncertainty-aware forecasting in operational decision settings [18]. A related line of research emphasizes the role of exogenous economic variables, such as macroeconomic indicators, logistics disruptions, energy costs, or market-specific shocks, which may help explain abrupt movements that cannot be inferred from endogenous price history alone. Recent work has also explored Transformer-based forecasting with explicit exogenous-variable integration; for example, TimeXer is designed to incorporate external information to improve forecasting of endogenous target variables [19].
Alongside Transformer-based approaches, a recent line of work has questioned whether architectural complexity is necessary for competitive time-series forecasting. Zeng et al. [20] demonstrate that simple linear models—specifically DLinear and NLinear, which decompose the input into trend and seasonal components before applying a linear mapping—can match or outperform Transformer-based models on standard benchmarks. Oreshkin et al. [21] propose N-BEATS, a deep neural architecture based on backward and forward residual links and a very deep stack of fully connected layers with basis expansion, which achieves strong performance on univariate forecasting tasks without requiring domain-specific feature engineering. More recently, Liu et al. [22] introduce Kolmogorov–Arnold Networks (KANs), which replace fixed activation functions with learnable univariate functions on edges, offering a different inductive bias from standard multilayer perceptrons. These developments motivate the inclusion of DLinear, N-BEATS, and a KAN-based architecture as adapted neural baselines in the present study.
Despite this progress, end-to-end neural models trained directly on raw, non-stationary price levels can face practical challenges. Reported issues include unstable optimization, sensitivity to initialization, and persistence-dominated behavior that can resemble copying the most recent observed value when the target is dominated by level shifts [23]. Similar concerns arise when recurrent architectures are applied to complex sequential tasks without sufficient target stabilization or time-consistent evaluation [24]. These observations motivate differencing-based target formulations, referred to here as delta-targeting, in which the model predicts short-horizon price changes rather than absolute levels. The rationale is not merely empirical convenience: by reducing the influence of level non-stationarity on the learning objective, delta-targeting can produce a better-behaved target space for short-horizon forecasting. In this study, we focus on absolute short-horizon differences rather than returns or log-returns because the operational problem is framed in the original price domain and because stakeholders typically interpret day-to-week procurement decisions in terms of absolute price movement rather than proportional change. Returns and log-returns remain useful for diagnostics and feature construction, but absolute deltas offer a more direct mapping back to decision-relevant price levels.
Another critical issue in short-term forecasting is the multi-step prediction strategy. Models are commonly implemented using either a recursive strategy, which can accumulate error across forecast steps, or a direct multi-horizon strategy. Direct strategies, particularly multi-output architectures, optimize forecasts for several horizons simultaneously and can mitigate error propagation in volatile series [25]. Rigorous evaluation further requires time-ordered train–test splits and fitting all preprocessing steps, such as normalization, on the training partition only in order to avoid leakage and overly optimistic estimates of predictive accuracy [26]. This requirement is especially important for price series, where global extrema may occur in the holdout period and would otherwise leak future information into training through scaling.
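As an illustration of the leakage concern raised above, the following minimal sketch splits a series in time order and estimates the scaling statistics on the training block only. The split fractions, the choice of standardization, and the helper names are assumptions made for this example and are not taken from the study.

```python
import numpy as np


def time_ordered_split(series, train_frac=0.7, val_frac=0.15):
    """Split a 1-D array into contiguous train/val/test blocks, oldest first."""
    n = len(series)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return series[:n_train], series[n_train:n_train + n_val], series[n_train + n_val:]


def fit_standardizer(train):
    """Estimate mean and standard deviation from the training block only."""
    mu = float(np.mean(train))
    sigma = float(np.std(train))
    if sigma == 0.0:
        sigma = 1.0  # guard against a constant training block
    return mu, sigma


def apply_standardizer(x, mu, sigma):
    """Apply previously fitted statistics; nothing is re-estimated here."""
    return (x - mu) / sigma


# Toy upward-trending series: the global maximum lies in the holdout period.
prices = np.linspace(100.0, 200.0, 100)
train, val, test = time_ordered_split(prices)
mu, sigma = fit_standardizer(train)          # statistics come from train only
train_z = apply_standardizer(train, mu, sigma)
val_z = apply_standardizer(val, mu, sigma)
test_z = apply_standardizer(test, mu, sigma)
```

Because the holdout values never enter `fit_standardizer`, the scaled test block can fall outside the range seen in training, which is the behavior a leakage-free pipeline should exhibit.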
A parallel recent trend concerns Hybrid architectures that combine explicit linear structure with non-linear residual learning. The Wide & Deep framework was originally introduced in recommender systems to jointly model memorization-oriented linear effects and generalized non-linear representations [27]. Related ideas have since been adapted to predictive tasks beyond recommendation. In addition, attention mechanisms offer a structured diagnostic signal by assigning different weights to historical time steps [28]. In recurrent settings, such weighting can be interpreted as a temporal emphasis mechanism over hidden-state sequences rather than as causal evidence. Attention-augmented forecasting models have therefore become attractive when prediction accuracy must be considered together with model auditing and interpretability [29]. However, fewer studies explicitly combine a Wide & Deep decomposition, attention-based diagnostics, leakage-free evaluation, and short-horizon scrap steel forecasting in a single comparative framework.
While the literature on commodity and scrap-related forecasting is substantial, fewer studies simultaneously (i) focus on short tactical horizons (1, 3, and 7 days) using real-world daily scrap price data; (ii) systematically compare raw-price and delta-targeted training objectives under a controlled leakage-free setting; (iii) benchmark against both classical baselines and recent neural baselines within a unified experimental framework; and (iv) provide interpretable diagnostics, including attention profiling and permutation feature importance, that can support auditing and stakeholder communication. Accordingly, we investigate a delta-targeted Hybrid Wide & Deep architecture with temporal attention and evaluate it against a comprehensive set of classical, recurrent, and recent neural baselines using consistent preprocessing, leakage-free scaling, and an evaluation framework that extends standard point-error metrics with residual autocorrelation diagnostics, ramping accuracy scores, and shape-based distance measures.

3. Methodology

This section details the forecasting pipeline and the proposed Hybrid neural architecture, specifically designed to balance linear regularities with non-linear corrections under short-horizon tactical forecasting. The mathematical notation used throughout this section is summarized in Table 1.

3.1. Problem Definition

Let $\{P_t\}_{t=1}^{T}$ denote the daily scrap price series. The primary objective is to forecast future prices over short tactical horizons $h \in \mathcal{H} = \{1, 3, 7\}$. Direct modeling of raw price levels often proves unstable in commodity markets, where series frequently exhibit non-stationary behavior and abrupt regime shifts [8,30].
To stabilize the learning process, we adopt a delta-targeting strategy, training the models to predict price changes relative to the current observation. For each horizon $h$, the target variable is defined as
$$\Delta_h(t) = P_{t+h} - P_t. \tag{1}$$
Given a predicted delta $\hat{\Delta}_h(t)$, the final forecast is reconstructed in the original price space as
$$\hat{P}_{t+h} = P_t + \hat{\Delta}_h(t). \tag{2}$$
In our primary experiments, these horizons are learned jointly using a direct multi-horizon (multi-output) formulation, where a single model concurrently outputs the predictions $(\hat{\Delta}_1(t), \hat{\Delta}_3(t), \hat{\Delta}_7(t))$ [25]. Valid training samples are constructed for time steps $W \le t \le T - \max(\mathcal{H})$, where $W$ is the input window length, ensuring that both the input window and the target horizon fall within the observed data range.
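The target construction and the valid-anchor range above can be sketched as follows. This is a minimal illustration on synthetic data: the function names, the window length of 14, and the random placeholder features are assumptions for the example only. Indices are 0-based, so the 1-based range $W \le t \le T - \max(\mathcal{H})$ becomes `window - 1` to `T - h_max - 1`.

```python
import numpy as np

HORIZONS = (1, 3, 7)  # forecast horizons in days, as in the text


def make_delta_samples(prices, features, window):
    """Build input windows and multi-horizon delta targets.

    prices   : shape (T,), price at each 0-based day index
    features : shape (T, F), one feature vector per day
    window   : W, number of past steps in each input window
    """
    T = len(prices)
    h_max = max(HORIZONS)
    # First anchor has a full window behind it; last anchor leaves room for h_max.
    anchors = np.arange(window - 1, T - h_max)
    X = np.stack([features[t - window + 1 : t + 1] for t in anchors])
    # Column j holds Delta_h(t) = P_{t+h} - P_t for the j-th horizon.
    deltas = np.stack([prices[anchors + h] - prices[anchors] for h in HORIZONS], axis=1)
    return X, deltas, anchors


def reconstruct_prices(prices, anchors, delta_hat):
    """Map predicted deltas back to price space: P_t + Delta_hat_h(t)."""
    return prices[anchors][:, None] + delta_hat


rng = np.random.default_rng(0)
prices = 300.0 + np.cumsum(rng.normal(0.0, 2.0, size=60))  # synthetic prices
features = rng.normal(size=(60, 5))                         # placeholder features
X, deltas, anchors = make_delta_samples(prices, features, window=14)
recon = reconstruct_prices(prices, anchors, deltas)          # true deltas recover true prices
```

Feeding the true deltas into `reconstruct_prices` returns the observed future prices, which is a convenient sanity check on the indexing before any model is trained.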

3.2. Traditional Statistical Baselines

To establish strong reference points, we evaluate widely used statistical baselines that primarily model linear dependence, trend, and calendar-driven seasonality. All baselines are assessed under the same anchor-based (rolling-origin) protocol described in Section 3.1.
  • Naive: The Naive (persistence) forecaster assumes that the best short-term prediction is the most recent observed value. Despite its simplicity, it is a widely used reference in time-series forecasting because it provides a strong lower bound in series with high inertia or slow dynamics [8].
    We adopt the persistence forecast as a benchmark for all horizons $h \in \mathcal{H}$:
    $$\hat{P}_{t+h \mid t} = P_t.$$
    It quantifies how much predictive gain is achieved beyond last-observation carry-forward behavior under short tactical horizons.
  • Seasonal Naive: Seasonal Naive (S-Naive) forecasting extends persistence by assuming that recurring calendar regularities repeat from one seasonal cycle to the next. It is commonly used when weekly, monthly, or annual seasonality is expected and serves as a transparent, interpretable seasonal baseline [8,31].
    Given the weekly periodicity ($s = 7$), we define the seasonal Naive forecast as
    $$\hat{P}_{t+h \mid t} = P_{t+h-s},$$
    where the constraint $t + h - s \le t$ ensures that only information available up to the anchor time $t$ is used for all $h \in \mathcal{H}$.
  • SARIMA: SARIMA models represent a widely adopted econometric family for series exhibiting linear dependence, trend, and seasonality. SARIMA combines non-seasonal and seasonal autoregressive and moving-average dynamics with differencing operators to handle non-stationarity and seasonal unit roots [7,32].
    In this work, we employ a seasonal specification with weekly periodicity and evaluate it using a rolling-origin scheme. Parameterization is fixed during test-time evaluation; the state is updated sequentially as new observations become available to maintain temporal causality. The resulting forecasts are produced for the same test anchors and horizons as the other baselines.
  • ETS: Exponential smoothing methods model time series through evolving latent components such as level, trend, and seasonality, updated with exponentially decaying weights. Holt–Winters variants are frequently used for operational forecasting because they are computationally lightweight and interpretable in terms of component updates [33,34]. In this study, we adopt an additive Holt–Winters formulation with weekly seasonality. Evaluation follows the same rolling-origin principle as above: forecasts at each anchor are formed using information available up to that time point. When incremental state updates are supported, we use sequential updating; otherwise, the model is re-estimated on an expanding window, preserving strict time ordering.
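The two simplest baselines in the list above can be written directly from their definitions. The sketch below is illustrative only: the function names and the toy weekly series are assumptions, and the SARIMA and ETS baselines are not reproduced here because their exact specification is not given in this section.

```python
import numpy as np


def naive_forecast(prices, anchors, horizons=(1, 3, 7)):
    """Persistence: every horizon receives the price observed at the anchor."""
    return np.repeat(prices[anchors][:, None], len(horizons), axis=1)


def seasonal_naive_forecast(prices, anchors, horizons=(1, 3, 7), season=7):
    """Seasonal naive: the forecast for t+h is the value observed at t+h-season."""
    if max(horizons) > season:
        raise ValueError("horizon exceeds season length; forecast would use future data")
    if int(np.min(anchors)) + min(horizons) - season < 0:
        raise ValueError("anchors need at least one full season of history")
    return np.stack([prices[anchors + h - season] for h in horizons], axis=1)


# Toy series with an exact weekly pattern, so the seasonal naive forecast is perfect.
week = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0])
prices = np.tile(week, 5)
anchors = np.array([7, 10, 20])
naive = naive_forecast(prices, anchors)
snaive = seasonal_naive_forecast(prices, anchors)
actual = np.stack([prices[anchors + h] for h in (1, 3, 7)], axis=1)
```

The explicit checks enforce the causality constraint $t + h - s \le t$ and avoid silent wrap-around from negative array indices.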

3.3. Deep Learning Models

To complement linear statistical baselines, we evaluate three recurrent neural architectures capable of modeling non-linear temporal dependencies and sequential dynamics.
  • Recurrent Neural Network: The fundamental RNN extends standard feedforward networks by maintaining a hidden state $h_t$ that acts as a memory of previous inputs. At each step, the state is updated via $h_t = \sigma(W_h x_t + U_h h_{t-1} + b)$, allowing the network to capture short-term temporal correlations. It serves as a lightweight benchmark for quantifying the value of gated memory mechanisms, especially when the forecasting horizon is short and the dominant dynamics are local in time [35].
  • Long Short-Term Memory: Designed to mitigate the vanishing gradient limitation, LSTM networks introduce a cell state $c_t$ regulated by three gating mechanisms: the forget gate (discards irrelevant history), the input gate (updates the cell with new information), and the output gate (determines the hidden state) [23]. This architecture is widely used for financial time series, where separating signal from transient noise is critical.
  • Gated Recurrent Unit: The GRU is a streamlined variant of the LSTM that merges the cell state and hidden state, utilizing only two gates (update and reset) [24]. GRUs often achieve comparable performance to LSTMs with lower computational complexity, making them an attractive candidate for operational forecasting systems with resource constraints.
A key methodological choice is whether the network is trained on raw price levels or on transformed targets that reduce non-stationarity. To isolate the effect of target design, we consider two conceptually distinct learning objectives for each recurrent architecture: (i) raw-price forecasting, where the network directly predicts future price levels $\hat{P}_{t+h}$; and (ii) delta-targeted forecasting, where the network predicts price changes $\hat{\Delta}_h(t)$ and the price forecast is reconstructed as $\hat{P}_{t+h} = P_t + \hat{\Delta}_h(t)$ (Equation (2)). This formulation emphasizes short-horizon increments rather than absolute levels, which can mitigate scale drift and improve stability under regime changes.
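For concreteness, the simple RNN state update given in the first bullet above can be unrolled over one input window as in the following sketch. The choice of $\tanh$ for the activation $\sigma$, the zero initial state, and all dimensions are assumptions for illustration; the gated LSTM and GRU updates are not shown.

```python
import numpy as np


def rnn_forward(X, W_h, U_h, b):
    """Unroll h_t = tanh(W_h x_t + U_h h_{t-1} + b) over a window X of shape (W, F)."""
    h = np.zeros(U_h.shape[0])          # assumed zero initial hidden state
    states = []
    for x_t in X:                       # iterate from the oldest to the newest step
        h = np.tanh(W_h @ x_t + U_h @ h + b)
        states.append(h)
    return np.stack(states)             # shape (W, M): one hidden vector per step


rng = np.random.default_rng(1)
W, F, M = 14, 5, 8                      # illustrative window, feature, and state sizes
X = rng.normal(size=(W, F))
W_h = 0.1 * rng.normal(size=(M, F))
U_h = 0.1 * rng.normal(size=(M, M))
b = np.zeros(M)
states = rnn_forward(X, W_h, U_h, b)
```

In a delta-targeted variant, a linear head applied to the last row of `states` would output the three deltas, and the price forecast would then follow from Equation (2).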

3.4. Recent Neural Forecasting Baselines

To position the proposed Hybrid model within the current forecasting landscape, we include three recent adapted neural forecasting baselines: DLinear, N-BEATS, and a C-KAN-style architecture. All three are evaluated under the same multivariate preprocessing pipeline and short-horizon forecasting framework as the recurrent models described in Section 3.3. It is important to note that these models are used here in an adapted form: the original architectures were developed primarily for univariate or canonical single-feature settings, whereas our implementation takes a multivariate flattened input window $X_t \in \mathbb{R}^{W \times F}$, as described in Section 3.5. Accordingly, these models are referred to as adapted neural baselines throughout.
  • DLinear: DLinear decomposes the input sequence into a trend component and a seasonal (residual) component using a moving-average filter, and then applies a separate linear mapping to each component. Formally, given the input window $X_t$, the trend component $T_t$ is extracted via a moving-average operation with kernel size $k$, and the seasonal component is defined as $S_t = X_t - T_t$. The output is then computed as
    $$\hat{\Delta}_{\mathrm{DLinear}}(t) = W_T\,\mathrm{vec}(T_t) + W_S\,\mathrm{vec}(S_t) + b,$$
    where $\mathrm{vec}(\cdot)$ denotes vectorization of the input window and $W_T, W_S \in \mathbb{R}^{D \times WF}$ are learnable weight matrices. Despite its simplicity, DLinear has been reported to match or outperform more complex Transformer-based models on standard forecasting benchmarks, making it a strong lightweight baseline [20].
  • N-BEATS: N-BEATS is a deep neural architecture based on backward and forward residual connections arranged in a stack of fully connected blocks. Each block produces a backcast $\hat{x}$ (an approximation of the input) and a forecast $\hat{\Delta}$. The residual input $x - \hat{x}$ is passed to the next block, allowing the network to progressively decompose the input signal across layers. In the generic (non-interpretable) configuration used here, basis expansion coefficients are learned without domain-specific constraints. N-BEATS has shown strong performance on univariate forecasting tasks without requiring hand-crafted features [21]; in the present study, it is adapted to the multivariate short-horizon setting.
  • C-KAN-style baseline: To broaden the comparison further, we include a convolutional KAN-style neural baseline adapted for time-series forecasting. Unlike standard multilayer perceptrons, which rely on fixed activation functions at the node level, KAN formulations place learnable univariate functions on edges [22]. In the present implementation, a convolutional front-end first extracts local temporal features from the input window, after which the resulting representation is passed through stacked KAN-style layers that apply Gaussian radial basis function (RBF) activations alongside a residual linear connection. This design introduces a different inductive bias from standard recurrent or purely linear architectures and is included here as an exploratory adapted neural baseline rather than as a canonical reproduction of a specific C-KAN implementation.
For all three baselines, the input features, preprocessing pipeline, train–validation–test splits, and evaluation protocol are identical to those used for the recurrent models. To enable a consistent comparison across model families, both raw-price and delta-targeted variants are evaluated.
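The DLinear mapping described above can be sketched in a few lines of NumPy. This is an illustrative forward pass only: the edge-replication padding, the kernel size of 5, the row-major flattening used for $\mathrm{vec}(\cdot)$, and the random weights are assumptions, and no training loop is shown.

```python
import numpy as np


def moving_average_trend(X, kernel):
    """Per-feature moving average along time, with edge replication so length is kept."""
    pad_left = (kernel - 1) // 2
    pad_right = kernel - 1 - pad_left
    padded = np.concatenate(
        [np.repeat(X[:1], pad_left, axis=0), X, np.repeat(X[-1:], pad_right, axis=0)],
        axis=0,
    )
    weights = np.ones(kernel) / kernel
    cols = [np.convolve(padded[:, f], weights, mode="valid") for f in range(X.shape[1])]
    return np.stack(cols, axis=1)        # same shape as X


def dlinear_forward(X, W_T, W_S, b, kernel=5):
    """Delta_hat = W_T vec(T_t) + W_S vec(S_t) + b with S_t = X_t - T_t."""
    trend = moving_average_trend(X, kernel)
    seasonal = X - trend
    # ravel() flattens row by row; any fixed ordering works if used consistently.
    out = W_T @ trend.ravel() + W_S @ seasonal.ravel() + b
    return out, trend, seasonal


rng = np.random.default_rng(2)
W, F, D = 14, 5, 3
X = rng.normal(size=(W, F))
W_T = 0.1 * rng.normal(size=(D, W * F))
W_S = 0.1 * rng.normal(size=(D, W * F))
b = np.zeros(D)
delta_hat, trend, seasonal = dlinear_forward(X, W_T, W_S, b)
```

For a constant input window the seasonal component vanishes, so the output reduces to the trend branch alone, which makes a simple correctness check for the decomposition.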

3.5. Proposed Hybrid Model

To reconcile the transparency of linear forecasting with the representational flexibility of deep learning, we propose a Hybrid Wide & Deep architecture augmented with a temporal attention mechanism. The model follows a dual-pathway design: (i) a linear Wide pathway captures dominant linear regularities from the most recent observation, and (ii) a non-linear Deep pathway learns residual temporal structure from a fixed historical window. The final forecast is obtained by additive fusion of these two components, yielding an explicit decomposition between linear inertia and non-linear adjustment [27].
Let $\mathcal{H} = \{1, 3, 7\}$ denote the set of forecast horizons and let $D = |\mathcal{H}| = 3$ be the output dimension. For each anchor time $t$, the model receives a sliding window of length $W$ composed of feature vectors
$$X_t = [x_{t-W+1}, x_{t-W+2}, \ldots, x_t] \in \mathbb{R}^{W \times F},$$
where $x_\tau \in \mathbb{R}^{F}$ is the feature vector at time $\tau$ and $F$ denotes the number of input features. In the present study, $F = 5$, corresponding to log-return, rolling volatility, exponentially weighted moving average (EWMA) slope, and the two calendar encodings $(\mathrm{DoW}_{\sin}, \mathrm{DoW}_{\cos})$. Under the delta-targeting formulation introduced in Section 3.1, the network produces a direct multi-horizon output in delta space
$$\hat{\Delta}(t) = \big(\hat{\Delta}_1(t), \hat{\Delta}_3(t), \hat{\Delta}_7(t)\big) \in \mathbb{R}^{D}.$$
These horizon-specific delta forecasts are mapped back to the original price domain through
$$\hat{P}_{t+h} = P_t + \hat{\Delta}_h(t), \quad h \in \mathcal{H}.$$
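The five input features named above are not defined in detail in this section, so the following sketch shows only one plausible construction. The trailing volatility window of 7 days, the EWMA smoothing factor of 0.3, the definition of the slope as the first difference of the EWMA, and the function name are all assumptions for illustration. Every feature is computed causally, using only values up to the current day.

```python
import numpy as np


def build_features(prices, day_of_week, vol_window=7, ewma_alpha=0.3):
    """Return an array of shape (T, 5): log-return, rolling volatility,
    EWMA slope, and sine/cosine day-of-week encodings (hypothetical definitions)."""
    prices = np.asarray(prices, dtype=float)
    T = len(prices)

    log_ret = np.zeros(T)
    log_ret[1:] = np.diff(np.log(prices))            # no return on the first day

    vol = np.zeros(T)
    for t in range(T):
        start = max(0, t - vol_window + 1)
        vol[t] = np.std(log_ret[start : t + 1])      # trailing window only

    ewma = np.empty(T)
    ewma[0] = prices[0]
    for t in range(1, T):
        ewma[t] = ewma_alpha * prices[t] + (1.0 - ewma_alpha) * ewma[t - 1]
    ewma_slope = np.zeros(T)
    ewma_slope[1:] = np.diff(ewma)

    angle = 2.0 * np.pi * np.asarray(day_of_week) / 7.0
    return np.column_stack([log_ret, vol, ewma_slope, np.sin(angle), np.cos(angle)])


rng = np.random.default_rng(3)
prices = 300.0 + np.cumsum(rng.normal(0.0, 2.0, size=28))  # synthetic prices
day_of_week = np.arange(28) % 7                             # 0..6, repeating weekly
feats = build_features(prices, day_of_week)
```

The sine and cosine pair places the seven weekdays on a circle, so that the last and first day of the week are encoded as neighbors rather than as the most distant values.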
The Wide pathway is defined as a linear map from the most recent feature vector $x_t$ to the $D$-dimensional delta output:
$$\hat{\Delta}_{\mathrm{wide}}(t) = W_w x_t + b_w,$$
where $W_w \in \mathbb{R}^{D \times F}$ and $b_w \in \mathbb{R}^{D}$. This component provides an explicit linear baseline embedded within the overall architecture, capturing persistence-like effects and other approximately linear regularities.
The Deep pathway processes the full input window $X_t$. Let
$$\mathbf{H}_t = [h_{t-W+1}, h_{t-W+2}, \ldots, h_t] \in \mathbb{R}^{W \times M}$$
denote the sequence of hidden representations produced by the recurrent encoder, where $h_\tau \in \mathbb{R}^{M}$ and $M$ is the hidden-state dimension. In our implementation, the encoder is constructed from stacked bidirectional gated recurrent layers, although the formulation below is general and applies to any recurrent encoder that produces one hidden vector per time step.
Temporal attention is then applied over the hidden-state sequence. For each relative position $k \in \{1, \ldots, W\}$, an unnormalized attention score is computed as
$$e_{t,k} = v_a^{\top} \tanh\!\left(W_a h_{t-W+k} + b_a\right),$$
where $W_a \in \mathbb{R}^{R \times M}$, $b_a \in \mathbb{R}^{R}$, $v_a \in \mathbb{R}^{R}$, and $R$ is the attention projection dimension. The normalized attention weights are obtained by a softmax transformation
$$\alpha_{t,k} = \frac{\exp(e_{t,k})}{\sum_{j=1}^{W} \exp(e_{t,j})}, \quad k = 1, \ldots, W,$$
which ensures $\alpha_{t,k} \ge 0$ and $\sum_{k=1}^{W} \alpha_{t,k} = 1$. The resulting context vector is
$$c_t = \sum_{k=1}^{W} \alpha_{t,k}\, h_{t-W+k} \in \mathbb{R}^{M}.$$
The context vector is mapped to horizon-specific residual deltas through a linear output head
$$\hat{\Delta}_{\mathrm{deep}}(t) = W_d c_t + b_d,$$
where $W_d \in \mathbb{R}^{D \times M}$ and $b_d \in \mathbb{R}^{D}$. The final Hybrid prediction is then obtained by additive fusion:
$$\hat{\Delta}(t) = \hat{\Delta}_{\mathrm{wide}}(t) + \hat{\Delta}_{\mathrm{deep}}(t).$$
This decomposition encourages the Deep pathway to focus on residual structure not captured by the linear Wide component, while preserving an explicit linear contribution in the final forecast.
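The forward pass of the Wide and Deep pathways can be sketched in NumPy as follows. This is not the authors' implementation: the recurrent encoder is replaced by a random placeholder matrix of hidden states, all parameters are random, and the sizes other than $F = 5$ and $D = 3$, the dictionary keys, and the anchor price of 300 are assumptions for illustration.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)                   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def hybrid_forward(x_t, H_t, p):
    """x_t: (F,) latest features; H_t: (W, M) encoder states; p: parameter dict."""
    delta_wide = p["W_w"] @ x_t + p["b_w"]                      # linear Wide pathway
    # Score for each step k: v_a^T tanh(W_a h_k + b_a), computed for all rows at once.
    scores = np.tanh(H_t @ p["W_a"].T + p["b_a"]) @ p["v_a"]    # shape (W,)
    alpha = softmax(scores)                                      # attention weights
    context = alpha @ H_t                                        # weighted sum, shape (M,)
    delta_deep = p["W_d"] @ context + p["b_d"]                   # Deep pathway output
    return delta_wide + delta_deep, alpha, delta_wide, delta_deep


rng = np.random.default_rng(0)
W, F, M, R, D = 14, 5, 16, 8, 3
params = {
    "W_w": rng.normal(size=(D, F)), "b_w": np.zeros(D),
    "W_a": rng.normal(size=(R, M)), "b_a": np.zeros(R), "v_a": rng.normal(size=R),
    "W_d": rng.normal(size=(D, M)), "b_d": np.zeros(D),
}
X_t = rng.normal(size=(W, F))
H_t = rng.normal(size=(W, M))           # stand-in for the recurrent encoder output
delta_hat, alpha, delta_wide, delta_deep = hybrid_forward(X_t[-1], H_t, params)
P_t = 300.0                             # hypothetical anchor price
price_hat = P_t + delta_hat             # forecasts for h = 1, 3, 7
```

Returning `alpha`, `delta_wide`, and `delta_deep` alongside the fused output keeps the two diagnostic signals, attention profile and linear contribution, available for aggregation over test anchors.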
It is important to emphasize that the attention mechanism used here is not a full Transformer module; rather, it is a temporal weighting mechanism applied to recurrent hidden representations [17,28]. In the present study, attention weights are used as model-internal diagnostic signals for examining temporal emphasis patterns across the input window. Accordingly, they are interpreted as indicators of relative temporal reliance, not as causal evidence about the underlying market process.
Beyond point forecasts, the model therefore exposes two interpretable internal components that can be examined qualitatively: (i) the linear contribution of the Wide pathway and (ii) the attention-weight profiles over the W-step input history. Aggregating these signals over test anchors provides a structured summary of how the model distributes emphasis across recent lags and how much of the final prediction is attributable to linear versus non-linear components.

3.6. Performance Metrics

We report Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and the Coefficient of Determination ($R^2$) as primary evaluation metrics. In addition, we include three complementary criteria: the Ramping Accuracy Metric for Power (RAMP) score, which measures ramping accuracy; the normalized Hausdorff Distance (HD), which quantifies shape-based similarity between the predicted and observed series; and the Ljung–Box Q-test, which assesses whether residual autocorrelation remains in the forecast errors. For $n$ test samples, let $y_i$ denote the actual price and $\hat{y}_i$ the predicted price at step $i$, with $\bar{y}$ representing the mean of the observed values. The metrics are defined as follows:
  • RMSE: The square root of the mean squared prediction error, which coincides with the standard deviation of the errors when they have zero mean. It penalizes larger errors more heavily than smaller ones:
    $$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.$$
  • MAPE: Expresses the average absolute error as a percentage of the actual values, providing a scale-independent and readily interpretable measure:
    $$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|.$$
  • $R^2$: Quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables, indicating the goodness of fit:
    $$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$
  • RAMP Score: The RAMP score evaluates how well the model captures short-term changes in the series by comparing first differences of the predicted and observed sequences. Lower values indicate better agreement in local directional movement:
    $$\mathrm{RAMP} = \frac{1}{n-1} \sum_{i=1}^{n-1} \left| (\hat{y}_{i+1} - \hat{y}_i) - (y_{i+1} - y_i) \right|.$$
  • HD: To assess shape similarity between the predicted and observed trajectories, we compute a normalized symmetric HD. First, each series is represented as a set of points in the two-dimensional time–value plane:
    $$A = \{ (t_i, y_i^{*}) \}_{i=1}^{n}, \qquad B = \{ (t_i, \hat{y}_i^{*}) \}_{i=1}^{n},$$
    where $t_i \in [0, 1]$ denotes the normalized time index, and $y_i^{*}$ and $\hat{y}_i^{*}$ are the normalized observed and predicted values, respectively. The directed HD from $A$ to $B$ is
    $$d_H(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert_2.$$
    The symmetric HD is then defined as
    $$\mathrm{HD} = \max \{ d_H(A, B), \; d_H(B, A) \}.$$
    Lower HD values indicate stronger geometric similarity between the predicted and actual trajectories.
  • Ljung–Box Q-test: To evaluate forecast reliability beyond point accuracy, we examine whether residuals exhibit serial autocorrelation. Let $\hat{\rho}_k$ denote the sample autocorrelation of the residual series at lag $k$. For a fixed lag order $L$, the Ljung–Box statistic is
    $$Q(L) = n(n+2) \sum_{k=1}^{L} \frac{\hat{\rho}_k^{2}}{n-k}.$$
    Under the null hypothesis that the residuals are independently distributed up to lag $L$, $Q(L)$ approximately follows a chi-square distribution with $L$ degrees of freedom. A statistically significant result indicates remaining autocorrelation in the residuals and therefore suggests that temporal dependence has not been fully captured by the model. In our experiments, we report the test at $L = 10$. For neural models evaluated over multiple independent runs, we report the median p-value across runs together with the number of runs in which $H_0$ is rejected at the $\alpha = 0.05$ level (reject count out of 10). For deterministic baselines, the Ljung–Box statistic and its corresponding p-value are reported directly.
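The definitions above can be illustrated with a minimal, standard-library Python sketch. It is not the implementation used in the study: the function names are hypothetical, the absolute value in RAMP follows the "lower is better" reading of the definition, and the min–max normalization of values inside `hausdorff` (using the observed range) is an assumption, since the normalization scheme is not specified here.

```python
import math

def rmse(y, yhat):
    # Square root of the mean squared error.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mape(y, yhat):
    # Mean absolute percentage error, in percent (requires y_i != 0).
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, yhat))

def r2(y, yhat):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def ramp(y, yhat):
    # Mean absolute gap between predicted and observed first differences.
    n = len(y)
    gaps = (abs((yhat[i + 1] - yhat[i]) - (y[i + 1] - y[i])) for i in range(n - 1))
    return sum(gaps) / (n - 1)

def _directed_hd(a_pts, b_pts):
    # max over a in A of the distance from a to its nearest point in B.
    return max(min(math.dist(a, b) for b in b_pts) for a in a_pts)

def hausdorff(y, yhat):
    # Symmetric Hausdorff distance in the normalized time-value plane.
    # ASSUMPTION: both series are scaled with the observed min/max.
    n = len(y)
    lo, hi = min(y), max(y)
    t = [i / (n - 1) for i in range(n)]
    A = [(t[i], (y[i] - lo) / (hi - lo)) for i in range(n)]
    B = [(t[i], (yhat[i] - lo) / (hi - lo)) for i in range(n)]
    return max(_directed_hd(A, B), _directed_hd(B, A))

def ljung_box_q(resid, L=10):
    # Q(L) = n (n + 2) * sum_{k=1..L} rho_k^2 / (n - k).
    n = len(resid)
    mean = sum(resid) / n
    c0 = sum((e - mean) ** 2 for e in resid)
    total = 0.0
    for k in range(1, L + 1):
        ck = sum((resid[i] - mean) * (resid[i - k] - mean) for i in range(k, n))
        total += (ck / c0) ** 2 / (n - k)
    return n * (n + 2) * total

# Toy values for illustration only (not data from the study).
y_obs = [100.0, 102.0, 101.0, 105.0]
y_pred = [101.0, 101.0, 102.0, 104.0]
print(rmse(y_obs, y_pred), mape(y_obs, y_pred), r2(y_obs, y_pred), ramp(y_obs, y_pred))
```

To turn $Q(10)$ into a decision at $\alpha = 0.05$, the statistic can be compared with the 0.95 quantile of the chi-square distribution with 10 degrees of freedom (about 18.31); a p-value would require a chi-square survival function, which is left out of this sketch.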

4. Results

In this section, we present the empirical evaluation of the proposed forecasting framework. We begin by detailing the dataset and experimental setup, including data partitioning and model implementation parameters. We then report stationarity diagnostics for the raw price and transformed series. Building on this, we examine the effectiveness of the delta-targeting strategy across both the recurrent and the recently adapted neural baselines. Subsequently, we benchmark all evaluated models against classical statistical baselines across three tactical horizons and report residual autocorrelation diagnostics to assess forecast reliability beyond point-error metrics. Finally, we analyze the interpretability components of the Hybrid model.

4.1. Dataset and Experimental Setup

The empirical analysis is based on a proprietary daily scrap steel price series provided by an industrial partner. To preserve commercial confidentiality, the company identity and market-specific contractual details are not disclosed; however, the series is internally consistent over time and corresponds to a single scrap steel price stream used for operational decision support in the partner’s procurement setting. The cleaned dataset spans the period from 10 May 2016 to 21 February 2025 and contains 3210 daily observations. After preprocessing, the final series contains no missing dates and no duplicate timestamps, yielding one observation for each calendar day in the study period. Observations are available for all seven days of the week, consistent with the continuous pricing practice of the partner organization. Over the observation period, prices range from 207 to 658 Turkish Lira (TL)/ton, with a mean of 352.3 TL/ton and a standard deviation of 82.6 TL/ton, indicating substantial variation over time, including the COVID-19 period (2020) and the subsequent recovery phase (2021–2022).
For neural models, we adopt a chronological split after multi-horizon target alignment. Since the largest forecasting horizon is $H_{\max} = 7$, the construction of supervised targets for $h \in \{1, 3, 7\}$ reduces the effective sample size from 3210 raw daily observations to 3203 supervised samples. The first 80% of these supervised samples (2562 observations; May 2016–May 2023) forms the training-full partition, and the remaining 20% (641 observations; May 2023–February 2025) is reserved for out-of-sample testing. Within the training-full partition, the earliest 85% (2177 observations) is used for training and the most recent 15% (385 observations) for validation, enabling early stopping and model selection under strict temporal causality.
The stationarity diagnostics reported in Section 4.2 are computed separately on the first 80% of the original chronological raw price series before multi-horizon target alignment. This raw diagnostic segment contains 2568 daily price observations. The difference between 2568 and 2562 arises from the order of operations: stationarity testing operates on the raw price-level series before target construction, whereas neural-model training uses the supervised partition obtained after excluding the final $H_{\max} = 7$ observations that cannot serve as forecast anchors and then applying the chronological 80/20 split with integer indexing. In all cases, the held-out test period is excluded from preprocessing, diagnostics, and model selection. To prevent look-ahead bias, all preprocessing steps used for neural models, including MinMax scaling to $[-1, 1]$, are fitted exclusively on the neural training subset and then applied unchanged to the validation and test subsets. Classical baselines (SARIMA and ETS), by contrast, are evaluated under a rolling-origin refit protocol on the same test period. Across all experiments, methods are evaluated on the same set of test anchor times to ensure strict comparability.
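The partition sizes quoted above follow from integer arithmetic on the sample counts. The short sketch below reproduces them and adds a leakage-free scaler fitted on training values only; the variable names and the `fit_minmax` helper are illustrative assumptions, not code from the study.

```python
N_RAW = 3210   # raw daily observations
H_MAX = 7      # largest forecasting horizon

# Supervised samples: the last H_MAX days cannot serve as forecast anchors.
n_supervised = N_RAW - H_MAX

# Chronological 80/20 split with integer (floor) indexing.
n_train_full = n_supervised * 80 // 100
n_test = n_supervised - n_train_full

# Within the training-full partition: earliest 85% train, latest 15% validation.
n_train = n_train_full * 85 // 100
n_val = n_train_full - n_train

# Raw segment used for the stationarity diagnostics (first 80% of the raw series).
n_raw_diag = N_RAW * 80 // 100

def fit_minmax(train_values):
    # Fit the [-1, 1] scaling on training data only (hypothetical helper);
    # the returned function is applied unchanged to validation and test data.
    lo, hi = min(train_values), max(train_values)
    return lambda x: 2.0 * (x - lo) / (hi - lo) - 1.0

print(n_supervised, n_train_full, n_test, n_train, n_val, n_raw_diag)
```

Because the scaler is fitted on the training subset alone, validation or test values outside the training range map outside $[-1, 1]$; this is the expected behavior under a leakage-free protocol.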
To ensure a consistent comparison across model families, all neural models process a sliding window of length $W = 120$ and use a direct multi-horizon output configuration for horizons $h \in \{1, 3, 7\}$. Each daily input vector contains $F = 5$ features derived strictly from past information only: logarithmic return, 7-day rolling volatility, the first difference of a shifted EWMA with span 14, and sine/cosine encodings of the day of week. The logarithmic return captures short-term momentum, the 7-day rolling volatility reflects local variability, the EWMA slope provides a smooth trend indicator, and the calendar encodings represent weekly seasonality in a continuous form. All features are computed from lagged information with appropriate shifting so that no future observation enters the feature construction process.
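As a rough illustration of how such a five-feature vector could be assembled, the sketch below uses only the Python standard library. Several details are assumptions rather than the study's exact recipe: volatility is taken as the sample standard deviation of the last seven log returns, the EWMA uses the recursive form with $\alpha = 2/(\text{span}+1)$ and a one-day shift, and `build_features` is a hypothetical name.

```python
import math
from datetime import date, timedelta
from statistics import stdev

def build_features(prices, start_day, vol_window=7, span=14):
    """Return one (log_ret, vol, ewma_slope, dow_sin, dow_cos) row per day t,
    using only prices observed up to and including day t."""
    alpha = 2.0 / (span + 1)
    # Recursive EWMA (ASSUMPTION: unadjusted form, seeded with the first price).
    ewma = [prices[0]]
    for p in prices[1:]:
        ewma.append(ewma[-1] + alpha * (p - ewma[-1]))
    # Log return of day t relative to day t-1 (undefined for t = 0).
    log_ret = [None] + [math.log(prices[t] / prices[t - 1])
                        for t in range(1, len(prices))]
    rows = []
    for t in range(vol_window, len(prices)):
        window = log_ret[t - vol_window + 1 : t + 1]   # last 7 returns
        dow = (start_day + timedelta(days=t)).weekday()
        angle = 2.0 * math.pi * dow / 7.0
        rows.append((
            log_ret[t],
            stdev(window),               # 7-day rolling volatility
            ewma[t - 1] - ewma[t - 2],   # first difference of the shifted EWMA
            math.sin(angle),             # cyclical day-of-week encoding
            math.cos(angle),
        ))
    return rows

# Toy usage with a flat price path (illustrative values only).
rows = build_features([100.0] * 10, date(2016, 5, 10))
print(len(rows), rows[0])
```

The sine/cosine pair places the seven weekdays on a circle, so that the last and first day of the week are neighbors rather than six units apart as they would be under an integer encoding.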
For the delta-targeting formulations, including the proposed Hybrid model, targets are defined as
$$\Delta_h(t) = P_{t+h} - P_t,$$
and final forecasts are reconstructed through
$$\hat{P}_{t+h} = P_t + \hat{\Delta}_h(t).$$
This formulation shifts the learning objective from absolute price levels to short-horizon price changes, which is particularly relevant given the non-stationary behavior of the raw series documented in Section 4.2.
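The target construction and the reconstruction step can be sketched in a few lines of Python. The function names are hypothetical; the anchor range mirrors the alignment described in Section 4.1, where the last $H_{\max}$ days cannot serve as anchors.

```python
def make_delta_targets(prices, horizons=(1, 3, 7)):
    # Delta_h(t) = P_{t+h} - P_t for every anchor t that has all horizons available.
    h_max = max(horizons)
    anchors = range(len(prices) - h_max)
    return {h: [prices[t + h] - prices[t] for t in anchors] for h in horizons}

def reconstruct(anchor_price, delta_hat):
    # P_hat_{t+h} = P_t + Delta_hat_h(t)
    return anchor_price + delta_hat

# Toy usage: a series rising by 2 per day (illustrative values only).
toy_prices = [float(2 * i) for i in range(10)]
targets = make_delta_targets(toy_prices)
print(targets[1], targets[7])
print(reconstruct(350.0, -4.5))
```

Under this formulation, the Naive carry-forward forecast corresponds to predicting a zero delta at every horizon, so a delta-trained model only needs to learn deviations from persistence.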
Regarding model architectures, the recurrent deep learning baselines (RNN, GRU, and LSTM) consist of two stacked layers with 64 units each and a dropout rate of 0.2.
For the recent adapted neural baselines, DLinear and N-BEATS follow the architectures described in Section 3.4 under the same optimization settings. The C-KAN-style baseline uses a convolutional front-end with 32 channels and kernel size 5, followed by stacked KAN-style layers with hidden dimension 128 and Gaussian RBF activations. The proposed Hybrid model employs a deeper non-linear pathway with Bidirectional GRU (BiGRU) layers of 128 and 64 units, respectively, together with $10^{-4}$ L2 regularization and a dropout rate of 0.3. All neural architectures are optimized using Adam ($\mathrm{lr} = 10^{-3}$) with Huber loss ($\delta = 1.0$) and a batch size of 32. Training uses early stopping and learning-rate reduction on plateau, with patience values in the range of 12–15 epochs. To reduce the sensitivity of neural results to random initialization, all reported neural metrics are aggregated over 10 independent runs using fixed seeds $\{11, \ldots, 111\}$.
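For reference, the Huber loss with threshold $\delta$ is quadratic for small residuals and linear for large ones, which limits the influence of occasional large price jumps on the gradient. The sketch below shows the standard per-sample definition in plain Python; it illustrates the loss function only and is not the training code of the study.

```python
def huber(residual, delta=1.0):
    # Standard Huber loss for a single residual r = y - y_hat:
    #   0.5 * r^2                     if |r| <= delta
    #   delta * (|r| - 0.5 * delta)   otherwise
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

# Small residuals are penalized quadratically, large ones linearly.
print(huber(0.5), huber(3.0))
```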

4.2. Stationarity Diagnostics

Prior to model estimation, the stationarity properties of the price series were examined exclusively on the training partition ($n = 2568$ daily observations; May 2016–May 2023) in order to prevent any leakage from the held-out test period. We apply two complementary tests: the Augmented Dickey–Fuller (ADF) test, whose null hypothesis states that the series contains a unit root (i.e., is non-stationary), and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test, whose null hypothesis states that the series is stationary. Using both tests jointly reduces the risk of drawing misleading conclusions from either test alone. The combined use of ADF and KPSS tests as a preprocessing diagnostic step is consistent with recent practice in time-series forecasting evaluation, where non-stationarity and temporal distributional changes are known to affect model training, validation design, and generalization behavior, including in recurrent neural forecasting models [36,37]. The joint interpretation of ADF and KPSS results also provides a compact diagnostic framework for characterizing a series as non-stationary, stationary, or potentially trend-stationary depending on the combination of unit-root and stationarity test outcomes. In the present case, the raw price series is consistently classified as non-stationary, whereas all delta-transformed targets are classified as stationary under both tests. Results are reported in Table 2.
For the raw price series, the ADF test fails to reject the unit-root null hypothesis ($p = 0.145 > 0.05$), while the KPSS stationarity null is strongly rejected (stat $= 3.986$, $p < 0.01$). The two tests therefore agree in characterizing the raw price series as non-stationary. In contrast, for all three delta series ($\Delta$ at $H = 1, 3, 7$) and the log-return series, the ADF null is rejected at conventional significance levels ($p < 0.001$) while the KPSS null is not rejected ($p > 0.10$), indicating stationarity under both criteria. These results provide statistical support for the delta-targeting formulation adopted in this study: by training models on short-horizon price changes rather than absolute price levels, the learning objective is defined on a statistically more stable target space, which can improve optimization behavior and reduce the risk of persistence-dominated predictions. It should be noted, however, that stationarity of the target series does not guarantee well-behaved residuals or eliminate all sources of forecast difficulty; rather, it removes one structural obstacle that is known to affect model training under non-stationary conditions.
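The joint reading of the two tests can be written as a small decision rule. The sketch below is a hypothetical helper operating on p-values that have already been computed; the labels for the two mixed cases follow a common convention and should be treated as an assumption rather than as the classification scheme of the study.

```python
def classify_stationarity(adf_p, kpss_p, alpha=0.05):
    # ADF null: unit root (non-stationary). KPSS null: stationary.
    adf_rejects = adf_p < alpha     # evidence against a unit root
    kpss_rejects = kpss_p < alpha   # evidence against stationarity
    if not adf_rejects and kpss_rejects:
        return "non-stationary"
    if adf_rejects and not kpss_rejects:
        return "stationary"
    if not adf_rejects and not kpss_rejects:
        # ASSUMPTION: conventional label when neither null is rejected.
        return "potentially trend-stationary"
    # ASSUMPTION: conventional label when both nulls are rejected.
    return "potentially difference-stationary"

# Raw price: ADF p = 0.145, KPSS p < 0.01 (0.01 used as an upper bound).
print(classify_stationarity(0.145, 0.01))
# Delta series: ADF p < 0.001, KPSS p > 0.10 (bounds used as stand-ins).
print(classify_stationarity(0.001, 0.10))
```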

4.3. Effect of Delta-Targeting

To assess the practical effect of the delta-targeting formulation, we compare six neural models under two training objectives: (i) direct regression on raw price levels and (ii) regression on short-horizon price changes. The evaluated models include three recurrent architectures (RNN, GRU, and LSTM) and three recent adapted neural baselines (DLinear, N-BEATS, and C-KAN). All configurations share the same input window ($W = 120$), direct multi-horizon output setup, and the leakage-free evaluation protocol defined in Section 4.1. Performance is summarized in Tables 3–7 in terms of RMSE, MAPE (%), $R^2$, RAMP, and normalized HD.
Table 3 shows a clear and consistent effect of target formulation across all six neural models. For the recurrent architectures, delta-targeted training reduces RMSE at all horizons, with the largest improvement observed for LSTM at $H = 7$, where RMSE decreases from $8.485 \pm 0.129$ to $7.470 \pm 0.076$. The same trend is visible for RNN and GRU, indicating that recurrent architectures benefit systematically from learning in delta space rather than directly in the raw price domain. The recent neural forecasting baselines exhibit an even stronger dependence on target formulation: under raw-price training, DLinear, N-BEATS, and C-KAN all produce very large RMSE values, whereas their delta-targeted variants recover to a competitive range.
The same pattern appears in relative error terms. As shown in Table 4, delta-targeting consistently yields lower MAPE across all recurrent and recent adapted neural architectures. Among the recurrent models, the delta-based RNN achieves the lowest weekly-horizon MAPE ($1.393 \pm 0.014\%$). Among the recent adapted neural forecasting baselines, the shift from raw-price to delta-targeted training is especially pronounced: for example, C-KAN improves from $5.961 \pm 1.946\%$ to $1.460 \pm 0.029\%$ at $H = 7$, while DLinear and N-BEATS show similarly large reductions. These results suggest that the delta formulation substantially improves relative-error behavior, particularly for architectures that otherwise struggle with raw-price targets.
The goodness-of-fit results in Table 5 reinforce the same conclusion. For recurrent models, raw-price training leads to a clear loss of fit at $H = 7$, where $R^2$ falls to the 0.870–0.894 range, while delta-targeted training preserves a higher fit level (0.898–0.903). For the recent adapted neural forecasting baselines, the contrast is even sharper: raw-price training yields negative $R^2$ values across all horizons, whereas delta-targeting shifts all three models into a positive and competitive range. This indicates that, in the present multivariate short-horizon setting, the target formulation is at least as important as the architectural family itself.
The improvement under delta-targeting also extends to the complementary trajectory-level diagnostics reported in Table 6 and Table 7. In terms of the normalized HD, raw-price training generally produces larger distances between the predicted and observed trajectories, indicating weaker geometric alignment in the time–value plane. This effect is particularly visible for the adapted neural baselines, where raw-price training leads to unstable or poorly aligned forecast paths, whereas delta-targeted training yields more comparable trajectory shapes. Thus, the HD results support the main point-error findings by showing that delta-targeting improves not only average error levels but also the overall shape similarity between forecasts and observations.
The RAMP results require more careful interpretation because RAMP measures local change-tracking rather than absolute level accuracy. For DLinear and N-BEATS, delta-targeting substantially reduces RAMP across all horizons, which is consistent with their large improvements in RMSE, MAPE, and $R^2$. However, for some models, a lower RAMP value under raw-price training does not necessarily imply a better forecast. This is most evident for the raw-price C-KAN model, which obtains the lowest RAMP values across all horizons ($0.880 \pm 0.047$, $0.874 \pm 0.030$, and $0.873 \pm 0.024$ for $H = 1$, $H = 3$, and $H = 7$, respectively). Despite these low RAMP scores, the same model shows substantially degraded point-forecast accuracy, with negative $R^2$ values across all horizons ($R^2 = -0.476$, $R^2 = -0.529$, and $R^2 = -0.423$ for $H = 1$, $H = 3$, and $H = 7$, respectively; see Table 5). This discrepancy suggests that raw-price C-KAN produces relatively smooth or persistence-dominated trajectories with limited local variation. Such forecasts may reduce the difference between successive predicted changes and successive observed changes, thereby lowering RAMP, while still failing to match the actual price level. Therefore, RAMP and HD are interpreted here as complementary diagnostics rather than standalone model-selection criteria.
Taken together, the results across both recurrent and recent adapted neural forecasting baselines indicate that delta-targeting is a critical design choice in this multivariate short-horizon forecasting problem. For recurrent models, it consistently reduces point error and improves fit. For the recent adapted neural forecasting baselines, it appears necessary for competitive performance: without it, the combination of multivariate input and non-stationary raw-price targets leads to severely degraded results regardless of architectural complexity. All subsequent neural comparisons in this paper therefore use delta-targeted training.

4.4. Overall Benchmarking Across Horizons

We benchmark the proposed Hybrid Wide & Deep model against four classical statistical baselines (Naive, S-Naive, SARIMA, and ETS), three delta-targeted recurrent deep learning models (Δ-RNN, Δ-GRU, and Δ-LSTM), and three delta-targeted recent adapted neural forecasting baselines (Δ-DLinear, Δ-N-BEATS, and Δ-C-KAN). Figure 1, Figure 2 and Figure 3 summarize performance across horizons $h \in \{1, 3, 7\}$ using RMSE, MAPE, and $R^2$, respectively.
Figure 1 presents the RMSE comparison across all evaluated methods. At $H = 1$, the Naive baseline (RMSE 2.335) remains highly competitive, with delta-targeted recurrent models (e.g., Δ-RNN: 2.271) and the recent adapted neural forecasting baselines (N-BEATS: 2.327, C-KAN: 2.338, DLinear: 2.465) achieving comparable or slightly improved results. The proposed Hybrid model (2.314) also remains within this competitive range. At $H = 7$, a clearer stratification emerges: SARIMA (8.234) and ETS (8.161) perform worse than the Naive baseline (7.555), while all delta-targeted neural models maintain relative stability. The Δ-RNN achieves the lowest RMSE (7.289), followed by Δ-GRU (7.372). Among the recent adapted neural forecasting baselines, Δ-DLinear (7.472) is competitive, while Δ-N-BEATS (7.592) and Δ-C-KAN (7.691) trail slightly. The Hybrid model (7.471) remains within the competitive range without uniformly minimizing point error.
As illustrated in Figure 2, the relative-error results reinforce the main pattern observed in the RMSE analysis. At the intermediate horizon ($H = 3$), the Naive baseline remains the strongest benchmark (MAPE = 0.651%), while the delta-targeted neural models occupy a narrow but slightly higher range. Among these, Δ-C-KAN attains the lowest MAPE (0.683%), followed by Δ-RNN (0.729%), the Hybrid model (0.739%), and Δ-N-BEATS (0.757%). In contrast, the classical statistical baselines SARIMA (0.855%) and ETS (0.821%) remain less competitive at this horizon. At the extended horizon ($H = 7$), a clearer separation emerges. The classical statistical models deteriorate to MAPE values above 1.64%, whereas the best neural models remain in the 1.39–1.48% range. The lowest mean MAPE is achieved by Δ-RNN (1.393%), marginally outperforming the Naive baseline (1.414%). The proposed Hybrid model (1.475%) remains within the competitive range, although it does not minimize point error relative to the best-performing recurrent configuration. Among the recent adapted neural forecasting baselines, Δ-C-KAN (1.460%) and Δ-DLinear (1.470%) are close to the Hybrid model, while Δ-N-BEATS trails slightly (1.515%).
The goodness-of-fit analysis shown in Figure 3 is consistent with the patterns observed in the error-based metrics. At the shortest horizon ($H = 1$), most evaluated models already achieve high explanatory power ($R^2 \approx 0.99$), reflecting strong short-term persistence in the series and leaving limited room for improvement over simple carry-forward forecasting. In this setting, the gains achieved by delta-targeted neural models over the Naive baseline remain relatively modest. A clearer separation emerges at the weekly horizon ($H = 7$). The classical statistical baselines SARIMA ($R^2 = 0.860$) and ETS ($R^2 = 0.862$) fall below the Naive benchmark ($R^2 = 0.882$), whereas the delta-targeted neural models remain in a higher range. The highest $R^2$ is obtained by Δ-RNN (0.903), followed by Δ-GRU (0.901). The proposed Hybrid model (0.898) remains within the competitive range and exceeds both the Naive and the classical statistical baselines, although it does not attain the highest explanatory power among all evaluated neural configurations. Among the recent adapted neural forecasting baselines, Δ-DLinear (0.898), Δ-N-BEATS (0.895), and Δ-C-KAN (0.892) also remain competitive at this horizon. Overall, the $R^2$ results indicate that the relative advantage of neural models becomes more apparent as the forecasting horizon increases, while the Hybrid model should be interpreted as competitive rather than uniformly best.
Table 8 reports the RAMP scores across all evaluated methods. At $H = 1$, the Seasonal Naive baseline attains the lowest RAMP (1.444), followed closely by Δ-C-KAN (1.491) and the Naive baseline (1.502). The delta-targeted recurrent models occupy a similar range, whereas SARIMA (1.976) and Δ-DLinear (2.074) exhibit comparatively higher values at this horizon. At $H = 7$, RAMP values increase for most models, consistent with the greater difficulty of tracking short-term price movements at longer horizons. The lowest RAMP is obtained by the Naive and Seasonal Naive baselines (both 1.428), followed by Δ-C-KAN (1.473). Among the remaining neural models, Δ-LSTM (1.889) and Δ-GRU (1.990) remain relatively competitive, whereas the Hybrid model (2.180) and Δ-DLinear (2.634) yield higher values. It should be noted that low RAMP values do not necessarily indicate superior overall forecast quality; as discussed in Section 4.3, models that produce near-constant or persistence-dominated forecasts can appear favorable under RAMP while performing less strongly on point-error metrics. RAMP should therefore be interpreted alongside RMSE, MAPE, and $R^2$ rather than in isolation.
Table 9 reports the normalized HD across all evaluated methods. In contrast to RAMP, HD provides a trajectory-level view by quantifying the geometric similarity between the predicted and observed series in the time–value plane. At $H = 1$, the Naive baseline yields the lowest HD (0.0016), reflecting the strong short-term persistence of the series. Among the neural models, Δ-C-KAN attains the lowest HD (0.016), followed by Δ-GRU and Δ-LSTM (both 0.022), while the Hybrid model (0.025) remains within a similar range. At longer horizons, the relative differences become more pronounced. At $H = 3$, Δ-C-KAN again attains the lowest HD (0.015), substantially below the classical statistical baselines SARIMA (0.0826) and ETS (0.0809), and also below the recurrent delta-targeted models. A similar pattern is observed at $H = 7$, where Δ-C-KAN (0.020) remains the strongest method under HD, while the Hybrid model (0.109), Δ-DLinear (0.11), and the delta-targeted recurrent models occupy a moderately higher but still competitive range relative to the classical baselines. Overall, the HD results suggest that the neural models, and especially Δ-C-KAN, preserve the global shape of the target trajectory more effectively than the statistical baselines as the forecasting horizon increases. As with RAMP, however, HD should be interpreted as a complementary diagnostic rather than as a standalone selection criterion. It is worth noting that Δ-C-KAN’s strong HD performance across all horizons coexists with elevated residual autocorrelation at $H = 1$ (see Section 4.5). This indicates that geometric trajectory similarity and residual independence reflect different aspects of forecast quality and may lead to different model rankings.

4.5. Residual Diagnostics and Forecast Reliability

To assess forecast reliability beyond point-error metrics, we apply the Ljung–Box Q-test at lag $L = 10$ to the residuals of all evaluated models. Although Ljung–Box testing originates in statistical time-series analysis, tests for residual autocorrelation have also been discussed in neural time-series modeling as tools for detecting remaining serial dependence in model errors, particularly when such dependence may arise from omitted variables, measurement errors, model misspecification, or insufficient temporal feature representation [38,39]. Accordingly, in this study, the Ljung–Box Q-test is used as a complementary residual-dependence diagnostic rather than as a point-accuracy metric. The null hypothesis ($H_0$) states that the residuals exhibit no serial autocorrelation up to lag 10. Rejection of $H_0$ at the $\alpha = 0.05$ level indicates that predictable temporal structure remains in the errors, suggesting that the model has not fully captured the dynamics of the series. For deterministic baselines, the test statistic and p-value are reported directly; for neural models, we report the Ljung–Box statistic, together with the number of runs in which $H_0$ is rejected out of 10 independent runs. Results are summarized in Table 10.
The results reveal a clear pattern: meaningful differentiation across models is observed only at $H = 1$, while at $H = 3$ and $H = 7$, virtually all models exhibit significant residual autocorrelation. At the one-step horizon, the delta-targeted recurrent models (Δ-RNN, Δ-GRU, and Δ-LSTM) produce the cleanest residuals, with $H_0$ not rejected in any of the 10 independent runs (Q statistics of 4.24, 6.08, and 8.29, respectively). The Hybrid model ($Q = 12.15$, 3/10 rejections) and the recent adapted neural forecasting baselines Δ-DLinear ($Q = 13.93$, 2/10) and Δ-N-BEATS ($Q = 12.75$, 1/10) also remain broadly consistent with the no-autocorrelation null at $H = 1$. In contrast, the classical persistence-based baselines (Naive and Seasonal Naive) and Δ-C-KAN exhibit strongly significant autocorrelation at $H = 1$, suggesting that these models leave substantial predictable structure unexploited in their one-step residuals.
At $H = 3$ and $H = 7$, all evaluated models reject $H_0$ across all runs, with Q statistics increasing substantially with the forecast horizon. This pattern is consistent with the difficulty of fully capturing multi-step dependence in financial price series under the present feature set and direct multi-horizon forecasting setup. The presence of residual autocorrelation at longer horizons does not necessarily invalidate the point-error results reported in previous sections; rather, it indicates that additional explanatory variables or alternative modeling strategies may be required to further reduce unexplained temporal dependence at extended forecast horizons. The $H = 1$ residual diagnostics are therefore particularly informative for model comparison. The superior residual behavior of the delta-targeted recurrent models at this horizon, combined with their competitive point-error performance, suggests that delta-targeting not only improves predictive accuracy but may also yield more weakly autocorrelated one-step residuals in this setting. Δ-C-KAN’s failure to pass the Ljung–Box diagnostic at $H = 1$ despite its strong HD performance further underscores that different evaluation criteria capture distinct aspects of forecast quality.

4.6. Interpretability and Diagnostic Analysis of the Hybrid Model

This subsection examines the Hybrid model from an interpretability and diagnostic perspective at the most challenging tactical horizon ($H = 7$). Rather than treating these analyses as causal explanations of model behavior, we use them as complementary diagnostic tools to characterize how the model distributes emphasis across time steps, input features, and internal pathways. In particular, we analyze (i) across-run forecast stability, (ii) the decomposition of the Wide and Deep components, (iii) temporal attention profiles, and (iv) permutation-based feature sensitivity. These diagnostics are used to contextualize the comparative forecasting results reported in the previous subsections and to provide a more transparent view of the model’s internal behavior under the adopted experimental setting.
Figure 4 provides a qualitative view of the Hybrid model’s forecast behavior at the most challenging horizon ($H = 7$). The shaded region represents the 5–95% variability band across 10 independent runs, while the central line shows the mean prediction. The relatively narrow band suggests limited run-to-run variation under different random initializations. At the same time, the figure indicates that the model broadly follows the observed price trajectory, but some changes appear to be tracked with delay, which is consistent with a persistence-sensitive forecasting behavior. This behavior is also consistent with the endogenous-only feature set used in this study, under which abrupt externally driven changes cannot be anticipated directly. Accordingly, this visualization should be interpreted as a descriptive stability and alignment diagnostic rather than as evidence, on its own, that the Hybrid model provides actionable advantages over simpler baselines. That question is addressed more appropriately through the comparative benchmark results reported in the previous subsection.
Figure 5 presents the scaled-space decomposition of the Hybrid model’s internal pathways at $H = 7$. The visualization provides a diagnostic view of how the Wide and Deep components contribute to the final prediction across the test period. Across independent training runs, the Wide component (dashed blue line) exhibits more rapidly varying behavior, while the Deep component (solid orange line) follows a smoother and more oscillatory pattern. This contrast is consistent with a functional separation in which the Wide pathway captures more local linear adjustments and the Deep pathway contributes a smoother correction term. The total prediction (solid green line) remains relatively stable across runs, as indicated by the narrow $\pm 1$ standard deviation band. These patterns should be interpreted as model-internal diagnostics rather than as causal evidence that specific market mechanisms have been isolated. In particular, the oscillatory behavior of the Deep component is consistent with sensitivity to recurring temporal structure, but it does not by itself establish that the model has identified a unique weekly causal process.
Figure 6 summarizes the global attention profile of the Deep pathway, averaged across the test set and multiple training seeds. Rather than concentrating monotonically on the most recent lags, the attention weights display a recurring pattern with peaks that are broadly consistent with 7-day spacing. This suggests that the model assigns relatively greater emphasis to certain lags associated with weekly temporal structure. However, attention-weight distributions can vary across independent training runs because they are affected by random initialization, optimization dynamics, and equivalent internal representations. The shaded standard deviation bands therefore provide a useful indication of across-run variability, but the resulting attention profile should be interpreted primarily as a qualitative, run-aggregated diagnostic summary rather than as a stable explanatory attribution. These results should not be interpreted as causal evidence that the attention mechanism has isolated a unique seasonal process. In this sense, the attention profile provides supporting diagnostic evidence that weekly lag structure is relevant to the Deep pathway under the present experimental setting.
To examine the Hybrid model’s feature sensitivity at the most challenging horizon ($H = 7$), we use permutation feature importance as a diagnostic tool. The input space contains five predictors: two cyclical calendar encodings, $\mathrm{DoW}_{\sin}$ and $\mathrm{DoW}_{\cos}$, and three technical variables, namely LogRet, rolling volatility, and EWMA slope. As shown in Figure 7a, permuting the day-of-week encodings leads to larger increases in RMSE than permuting the technical indicators, suggesting that the model is relatively more sensitive to weekly calendar information under this perturbation scheme. This result should be interpreted as a feature-sensitivity diagnostic rather than as causal evidence that the model’s forecasts are driven exclusively by weekly operational cycles.
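The permutation scheme can be sketched generically as follows. This is a simplified illustration, not the study's procedure: it treats each sample as a flat feature row rather than a $W \times F$ window, `predict` stands for any fitted model, and the toy model and data are invented for the example.

```python
import math
import random

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in RMSE when one feature column is shuffled across samples."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = rmse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores

# Toy example: the "model" depends only on feature 0, so feature 1 scores zero.
X_toy = [[float(i), float((7 * i) % 20)] for i in range(20)]
y_toy = [3.0 * row[0] for row in X_toy]
scores = permutation_importance(lambda row: 3.0 * row[0], X_toy, y_toy)
print(scores)
```

A larger score means the model's error is more sensitive to that input under shuffling; as noted above, this measures sensitivity of the fitted model and does not by itself identify what drives the underlying price process.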
A complementary descriptive view is provided by the polar plot in Figure 7b, which displays the Mean Absolute Error (MAE) across the days of the week. The plotted differences are visually discernible but numerically small, as also indicated by the in-figure MAE range annotation. The radial and color scales should therefore be read in light of this narrow numerical range, and the day-specific differences should be treated as descriptive rather than statistically or practically conclusive. Taken together, the permutation feature importance (PFI) and polar analyses suggest that weekly structure is relevant to the Hybrid model at H = 7, but they do not by themselves establish a causal mechanism or practical superiority over alternative models.

5. Discussion

This study evaluated short-horizon scrap steel price forecasting (1, 3, and 7 days ahead) on real-world daily data. Classical baselines were assessed under a rolling-origin refit protocol, while neural models were trained once on a chronological training partition and evaluated on a fixed held-out test block under leakage-free preprocessing. This distinction matters for interpretation: the neural results reflect generalization from a single training run rather than adaptive refitting, and direct comparisons between the two evaluation regimes should be made with this difference in mind.
Across all methods, accuracy declines as the horizon increases from H = 1 to H = 7, which is expected for volatile, non-stationary series. A key structural reason is that the feature set used in this study is entirely endogenous: it consists of past-price-derived transformations and calendar encodings, with no exogenous covariates. This means that all evaluated models, including the Hybrid, can only respond to patterns already present in the price history. Abrupt price movements driven by external factors, such as energy cost shocks, logistics disruptions, or macroeconomic announcements, are not represented in the inputs and therefore cannot be anticipated. The delay-like behavior visible in Figure 4 is a natural consequence of this design rather than a modeling failure per se: the model broadly tracks the trajectory, but reacts to sharp changes with some lag. This reflects the inherent limit of endogenous-only forecasting under external shocks, and is consistent with the persistence-dominated structure of the residuals documented in Section 4.5. A key operational observation is the strength of the Naive baseline at short horizons: at H = 1, persistence already explains most day-ahead variation (Figure 3). Consequently, the incremental gains of complex models at H = 1 are limited, and deployment should be justified by the operational value of marginal improvements rather than by average metrics alone.
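The strength of persistence at short horizons can be illustrated with a minimal sketch on a synthetic random walk, which stands in for a persistence-dominated series and is not the study's data. The helper name `naive_metrics` is hypothetical; it scores the Naive forecast, in which the price at t + h is predicted by the price at t, using RMSE, MAPE, and R².

```python
import math
import random


def naive_metrics(prices, h):
    """RMSE, MAPE (%), and R^2 of the persistence forecast P_hat[t+h] = P[t]."""
    actual = prices[h:]
    forecast = prices[:-h]
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return rmse, mape, 1.0 - ss_res / ss_tot


# Synthetic random-walk series (illustrative only, not the study's data).
rng = random.Random(0)
series = [500.0]
for _ in range(999):
    series.append(series[-1] + rng.gauss(0.0, 2.0))

# One (rmse, mape, r2) tuple per horizon.
results = {h: naive_metrics(series, h) for h in (1, 3, 7)}
```

On such a series the Naive error grows with the horizon while R² stays high at H = 1, which is why any model's day-ahead gain over persistence should be judged against this reference rather than in isolation.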
At longer horizons, differences between model families become more informative. The stationarity diagnostics reported in Section 4.2 establish that the raw price series is non-stationary under both ADF and KPSS criteria, which provides a structural motivation for the observed pattern: models trained directly on absolute price levels must implicitly learn to track a shifting mean, while delta-targeted models operate on a stationary target space. This distinction helps explain why raw-price training leads to a sharper decline in R² at H = 7 for all architectures, and why the recent adapted neural forecasting baselines deteriorate substantially under raw-price training. The gain from delta-targeting is therefore not only empirical but also consistent with the stationarity diagnostics reported for the present dataset [8,26]. At H = 7, classical statistical baselines (SARIMA, ETS) deteriorate and often underperform the persistence benchmark, whereas delta-targeted deep models remain competitive with Naive and can slightly surpass it. This matters in practice because many procurement and inventory decisions follow weekly cycles; therefore, forecasting quality at H = 7 directly affects planning.
The most consistent empirical finding is that training deep learning models in delta space reduces horizon-driven degradation. For all three recurrent architectures, delta-targeting yields lower RMSE and MAPE at H = 7 and higher R² than raw-price training (Tables 3–5). The effect is most pronounced for LSTM, where raw-price training shows the largest drop in fit at H = 7 (R² of approximately 0.870), while the delta-trained variant recovers a substantial portion of that loss (R² of approximately 0.898). Standard deviations also generally decrease under delta-targeting, indicating improved convergence stability across runs. However, delta-targeting is not a universal remedy: it cannot remove uncertainty from exogenous shocks not present in the inputs, and it yields only modest gains when persistence dominates, as at H = 1. Furthermore, the Ljung–Box diagnostics in Section 4.5 indicate that even the best delta-targeted models retain significant residual autocorrelation at H = 3 and H = 7. This suggests that a meaningful portion of the predictable variance at longer horizons remains unexplained by the current feature set, and that incorporating additional temporal structure or exogenous signals could further improve reliability.
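The delta formulation discussed above can be sketched in a few lines, following the definition of the price-change target in Table 1 (the price at t + h minus the price at t). The function names `make_delta_targets` and `reconstruct_price` and the demo prices are hypothetical and serve only as an illustration; the study's preprocessing (scaling, windowing, feature construction) is not reproduced here.

```python
def make_delta_targets(prices, horizons=(1, 3, 7)):
    """Return {h: [(t, P[t+h] - P[t]), ...]} for every anchor t with a valid target."""
    targets = {}
    for h in horizons:
        targets[h] = [(t, prices[t + h] - prices[t]) for t in range(len(prices) - h)]
    return targets


def reconstruct_price(prices, t, delta_hat):
    """Map a predicted change back to price space: P_hat[t+h] = P[t] + delta_hat."""
    return prices[t] + delta_hat


# Made-up daily prices for illustration.
prices_demo = [100.0, 101.0, 103.0, 102.0, 104.0, 107.0, 106.0, 108.0, 110.0]
deltas = make_delta_targets(prices_demo)

# A zero predicted change reproduces the Naive (persistence) forecast.
naive_forecast = reconstruct_price(prices_demo, t=1, delta_hat=0.0)
```

Because a zero predicted change coincides with the Naive forecast, a delta-targeted model only has to learn the correction on top of persistence, which is one way to read its smaller horizon-driven degradation.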
The proposed Hybrid (Wide & Deep) model is competitive but not uniformly best on point-error metrics. In the overall benchmarking, the delta-based RNN attains the lowest mean MAPE at H = 7 among all evaluated models (approximately 1.39%), while the Hybrid model remains close but slightly higher. This outcome is consistent with the architectural trade-off inherent in the Hybrid design: the attention mechanism and the dual-pathway structure introduce additional learnable parameters and optimization variance, which may not translate into point-error gains over a compact recurrent baseline on this dataset. Nevertheless, the Hybrid model maintains a comparable goodness-of-fit level across all tested horizons and generally outperforms traditional statistical baselines. Its practical value stems from (i) explicit decomposition of linear and non-linear contributions and (ii) model-internal diagnostic signals. The additive fusion of Wide and Deep pathways yields a transparent separation between persistence-like effects and residual corrections. The forecast variability band at H = 7 (Figure 4) indicates stable behavior across independent initializations, which is important for operational settings where models are retrained periodically.
At the same time, the Hybrid architecture entails additional modeling choices, which can increase tuning effort relative to simpler recurrent baselines. The attention profile (Figure 6) and PFI results (Figure 7a) suggest that the model is relatively more sensitive to calendar-related inputs, which is consistent with the weekly structure present in the data. However, these diagnostics should be interpreted as sensitivity signals rather than causal attributions: they indicate which inputs are most influential under perturbation, not whether those inputs causally determine market prices. The polar sensitivity plot (Figure 7b) further illustrates that errors are not uniform across weekdays, which is consistent with day-dependent predictability, but the differences are numerically small and should not be over-interpreted.
The residual diagnostics reported in Section 4.5 provide additional context for interpreting these results. At H = 1, the delta-targeted recurrent models produce the best-behaved residuals under the Ljung–Box test. This suggests that, at the one-step horizon, these models leave less exploitable temporal structure in the residuals than the competing alternatives in this study. At H = 3 and H = 7, however, all models exhibit significant residual autocorrelation, indicating that temporal dependence remains in the forecast errors even for the best-performing configurations. This remaining dependence should not be attributed solely to missing exogenous variables. It may also reflect unmodeled regime shifts, residual weekly or calendar-related seasonal effects not fully captured by the current feature set, or persistence patterns that become more difficult to resolve as the forecast horizon increases. Missing exogenous covariates nevertheless remain one plausible contributor, given the endogenous-only feature set: without such covariates, the models cannot account for external drivers that introduce predictable structure at multi-step horizons.
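For readers who wish to reproduce this type of residual check, the Ljung–Box statistic Q(L) listed in Table 1 can be computed as in the sketch below. It assumes the standard form of the statistic, uses synthetic residuals rather than the study's forecast errors, and compares Q only against the 95% chi-square critical value for 10 degrees of freedom instead of computing a p-value. The function names are hypothetical.

```python
import random


def sample_autocorr(x, k):
    """Sample autocorrelation of x at lag k (mean-centered, biased normalization)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q(L) = n (n + 2) * sum_{k=1..L} rho_k^2 / (n - k)."""
    n = len(residuals)
    return n * (n + 2) * sum(
        sample_autocorr(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


CHI2_95_DF10 = 18.307  # 95% critical value of the chi-square distribution, 10 d.o.f.

rng = random.Random(1)

# Synthetic white-noise residuals: no temporal dependence by construction.
white = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Synthetic AR(1) residuals with strong persistence (coefficient 0.8).
persistent = [0.0]
for _ in range(499):
    persistent.append(0.8 * persistent[-1] + rng.gauss(0.0, 1.0))

q_white = ljung_box_q(white, 10)
q_persistent = ljung_box_q(persistent, 10)
```

A value of Q far above the critical value, as for the persistent series here, corresponds to the significant residual autocorrelation reported at the longer horizons.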
Related approaches in the forecasting literature, such as Transformer-based models with explicit exogenous-variable integration [19] or probabilistic frameworks such as DeepAR [18], suggest that these limitations may be partially addressed by richer input representations or by uncertainty-aware output formulations.
From a deployment perspective, the Naive baseline should remain a mandatory reference in performance monitoring. Delta-targeted recurrent models are attractive when the primary objective is point accuracy at H = 7 , whereas the Hybrid model is preferable when diagnosability and stakeholder-facing justification are required alongside competitive accuracy. Regarding the limitation of endogenous-only inputs, two directions may be partially explored within the present modeling framework, even before incorporating additional data sources. First, scenario-based projections could complement point forecasts by conditioning on assumed external shock magnitudes, providing scenario-conditioned forecast ranges under different market conditions. Second, conformal prediction or quantile regression could be layered on top of the existing point-forecast models to produce calibrated uncertainty intervals, supporting risk-aware procurement decisions without requiring full probabilistic modeling. These extensions are identified as priorities for future work.
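One way to layer intervals on top of an existing point forecaster is split conformal prediction, sketched below. This is an illustration under stated assumptions, not a method evaluated in the study: the function name and all numbers are made up, and the coverage guarantee of the basic method relies on exchangeability, which time series generally violate, so the resulting interval should be read as approximate.

```python
import math


def split_conformal_halfwidth(calib_actual, calib_pred, alpha=0.1):
    """Half-width q such that [pred - q, pred + q] targets 1 - alpha coverage.

    Uses absolute residuals on a held-out calibration set and the
    ceil((n + 1) * (1 - alpha))-th smallest score, as in split conformal prediction.
    """
    scores = sorted(abs(a - p) for a, p in zip(calib_actual, calib_pred))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[rank - 1]


# Hypothetical calibration pairs from any point forecaster (values are made up).
calib_actual = [410.0, 415.0, 409.0, 420.0, 418.0, 425.0, 430.0, 428.0, 433.0, 431.0,
                436.0, 440.0, 438.0, 445.0, 441.0, 450.0, 447.0, 452.0, 455.0]
calib_pred = [412.0, 413.0, 412.0, 416.0, 419.0, 421.0, 427.0, 430.0, 430.0, 432.0,
              433.0, 436.0, 440.0, 441.0, 444.0, 445.0, 449.0, 450.0, 452.0]

q = split_conformal_halfwidth(calib_actual, calib_pred, alpha=0.1)
new_point_forecast = 458.0
interval = (new_point_forecast - q, new_point_forecast + q)
```

The appeal of this construction for the present setting is that it wraps any of the evaluated point models without retraining; horizon-specific calibration sets would be a natural refinement.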

6. Conclusions

This study examined short-horizon scrap steel price forecasting at 1-, 3-, and 7-day horizons using real-world daily data from an industrial setting. The empirical comparison covered classical statistical baselines, recurrent deep learning models, and recent adapted neural forecasting baselines evaluated under both raw-price and delta-targeted formulations, together with a delta-targeted Hybrid Wide & Deep architecture augmented with temporal attention and diagnostic interpretability tools. Methodologically, the study contributes a comparative framework that combines leakage-free preprocessing, stationarity diagnostics, target-formulation analysis, horizon-specific evaluation, residual diagnostics, and model-internal interpretability analysis within a single forecasting pipeline.
The main empirical finding is that delta-targeting provides a consistently more effective learning objective than direct raw-price forecasting for this non-stationary series, especially at the weekly horizon. Across recurrent architectures, predicting short-horizon price changes improves error metrics and preserves goodness-of-fit more effectively than learning directly on absolute price levels. In the overall benchmark, delta-targeted recurrent models remain competitive with persistence-based baselines and outperform the classical statistical models at H = 7. The proposed Hybrid model is competitive but not uniformly best in point-error terms; rather, its contribution lies in offering a balanced combination of forecasting performance, explicit linear/non-linear decomposition, and diagnostic transparency. In this sense, this study suggests that, for volatile commodity series, target formulation may be as important as architectural complexity.
At the same time, several limitations should be acknowledged. First, the feature set is entirely endogenous, consisting of price-derived transformations and calendar encodings, which limits the ability of all evaluated models to anticipate abrupt externally driven market movements. This limitation is also reflected in the residual diagnostics: at longer horizons, significant residual autocorrelation remains even for the strongest-performing models, indicating that part of the predictable temporal structure is still not captured, potentially due to missing exogenous drivers, unmodeled regime shifts, or residual calendar-related seasonal effects. Second, the interpretability analyses reported here are diagnostic rather than causal; they help characterize temporal emphasis and feature sensitivity but do not establish causal market mechanisms.
These limitations point to several concrete directions for future work. The most immediate extension is to incorporate exogenous covariates such as energy prices, exchange rates, macroeconomic indicators, logistics signals, and holiday effects in order to better capture externally induced price variation. A second direction is to extend the framework toward uncertainty-aware forecasting through quantile regression, conformal prediction, or broader probabilistic forecasting formulations, which may provide more useful support for risk-sensitive procurement decisions than point estimates alone. Finally, future studies could examine longer forecast horizons, richer multi-step evaluation protocols, and broader market settings to test the generality of the present findings beyond the single-series industrial case considered here.

Author Contributions

Conceptualization, N.S.C. and O.U.; methodology, N.S.C., M.K. and Y.D.; software, N.S.C. and Y.A.; validation, M.K., Y.D. and Y.A.; formal analysis, N.S.C. and O.U.; investigation, N.S.C., M.K. and Y.D.; resources, O.U.; data curation, N.S.C.; writing—original draft preparation, N.S.C.; writing—review and editing, M.K., Y.D., Y.A. and O.U.; visualization, N.S.C. and Y.A.; supervision, O.U.; project administration, O.U. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Istanbul Topkapi University.

Data Availability Statement

The data supporting the findings of this study were obtained from a private industrial partner and are subject to commercial confidentiality agreements. Due to these legal and ethical restrictions, the raw dataset is not publicly available. However, processed or anonymized samples of the data may be available from the corresponding author upon reasonable request.

Acknowledgments

Y. Aygul thanks TUBITAK for their scholarship support under the BIDEB 2211-A program. During the preparation of this manuscript, the authors used ChatGPT 5.2 (OpenAI) for assistance with academic writing and language editing (e.g., drafting and refining section text, improving clarity and consistency, and polishing grammar). The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alexopoulos, K.; Catti, P.; Kanellopoulos, G.; Nikolakis, N.; Blatsiotis, A.; Christodoulopoulos, K.; Kaimenopoulos, A.; Ziata, E. Deep learning for estimating the fill-level of industrial waste containers of metal scrap: A case study of a copper tube plant. Appl. Sci. 2023, 13, 2575.
  2. Huang, B.; Liu, J.; Zhang, Q.; Liu, K.; Li, K.; Liao, X. Identification and classification of aluminum scrap grades based on the Resnet18 model. Appl. Sci. 2022, 12, 11133.
  3. Xiarchos, I.M.; Fletcher, J.J. Price and volatility transmission between primary and scrap metal markets. Resour. Conserv. Recycl. 2009, 53, 664–673.
  4. OECD. Unlocking Potential in the Global Scrap Steel Market: Opportunities and Challenges; OECD Science, Technology and Industry Policy Papers, No. 170; OECD Publishing: Paris, France, 2024.
  5. Watari, T.; Nansai, K.; Nakajima, K. Review of critical metal dynamics to 2050 for 48 elements. Resour. Conserv. Recycl. 2020, 155, 104669.
  6. Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. J. Econom. 1986, 31, 307–327.
  7. Box, G.E.P.; Jenkins, G.M. Time Series Analysis: Forecasting and Control; Holden-Day: San Francisco, CA, USA, 1970.
  8. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice, 2nd ed.; OTexts: Melbourne, Australia, 2018.
  9. Xu, X.; Zhang, Y. Steel price index forecasting through neural networks: The composite index, long products, flat products, and rolled products. Miner. Econ. 2023, 36, 563–582.
  10. Jin, B.; Xu, X. Machine Learning-Based Scrap Steel Price Forecasting for the Northeast Chinese Market. Int. J. Empir. Econ. 2024, 3, 2450011.
  11. Jin, B.; Xu, X. Price predictions of scrap steel for north China via machine learning. J. Chin. Econ. Bus. Stud. 2025, 1–19.
  12. Jin, B.; Xu, X. Predicting scrap steel prices through machine learning for South China. Mater. Circ. Econ. 2025, 7, 2.
  13. Jin, B.; Xu, X. Machine learning scrap steel price forecasts for the regional east Chinese market. J. Model. Manag. 2025, 20, 2086–2113.
  14. Xu, X.; Zhang, Y. Scrap steel price forecasting with neural networks for east, north, south, central, northeast, and southwest China and at the national level. Ironmak. Steelmak. 2023, 50, 1683–1697.
  15. Jin, B.; Xu, X. Scrap steel price predictions for southwest China via machine learning. Innov. Emerg. Technol. 2025, 12, 2550002.
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Available online: https://arxiv.org/abs/1706.03762 (accessed on 10 February 2026).
  17. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. Int. J. Forecast. 2021, 37, 1748–1764.
  18. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191.
  19. Wang, Y.; Wu, H.; Dong, J.; Liu, Y.; Qiu, Y.; Zhang, H.; Wang, J.; Long, M. TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables. arXiv 2024, arXiv:2402.19072.
  20. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are Transformers Effective for Time Series Forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11121–11128.
  21. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
  22. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756.
  23. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  24. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the EMNLP 2014, Doha, Qatar, 25–29 October 2014; pp. 1724–1734.
  25. Ben Taieb, S.; Bontempi, G.; Atiya, A.F.; Sorjamaa, A. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Syst. Appl. 2012, 39, 7067–7083.
  26. Bergmeir, C.; Benitez, J.M. On the use of cross-validation for time series predictor evaluation. Inf. Sci. 2012, 191, 192–213.
  27. Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Dean, J. Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016), Boston, MA, USA, 15 September 2016; Available online: https://arxiv.org/abs/1606.07792 (accessed on 10 February 2026).
  28. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473.
  29. Zhou, J. Predicting stock price by using attention-based hybrid LSTM model. Asian J. Basic Sci. Res. 2024, 6, 145–158.
  30. Tsay, R.S. Analysis of Financial Time Series, 3rd ed.; Wiley: Hoboken, NJ, USA, 2014.
  31. Makridakis, S.; Wheelwright, S.C.; Hyndman, R.J. Forecasting: Methods and Applications, 3rd ed.; Wiley: Hoboken, NJ, USA, 2008.
  32. Hamilton, J.D. Time Series Analysis; Princeton University Press: Princeton, NJ, USA, 1994.
  33. Hyndman, R.J.; Koehler, A.B.; Ord, J.K.; Snyder, R.D. Forecasting with Exponential Smoothing: The State Space Approach; Springer: Berlin/Heidelberg, Germany, 2008.
  34. Chatfield, C. The Holt-Winters Forecasting Procedure. Appl. Stat. 1978, 27, 264–279.
  35. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166.
  36. Hewamalage, H.; Bergmeir, C.; Bandara, K. Recurrent neural networks for time series forecasting: Current status and future directions. Int. J. Forecast. 2021, 37, 388–427.
  37. Cerqueira, V.; Torgo, L.; Mozetič, I. Evaluating time series forecasting models: An empirical study on performance estimation methods. Mach. Learn. 2020, 109, 1997–2028.
  38. Petropoulos, F.; Apiletti, D.; Assimakopoulos, V.; Babai, M.Z.; Barrow, D.K.; Taieb, S.B.; Bergmeir, C.; Bessa, R.J.; Bijak, J.; Boylan, J.E.; et al. Forecasting: Theory and practice. Int. J. Forecast. 2022, 38, 705–871.
  39. Sun, F.-K.; Lang, C.I.; Boning, D.S. Adjusting for Autocorrelated Errors in Neural Networks for Time Series. Adv. Neural Inf. Process. Syst. 2021, 34, 3077–3088.
Figure 1. RMSE (price space) across horizons for all evaluated methods (neural: mean of 10 runs).
Figure 2. MAPE (%) across horizons for all evaluated methods (neural: mean of 10 runs).
Figure 3. R² across horizons for all evaluated methods (neural: mean of 10 runs).
Figure 4. Forecast with across-run variability band for the Hybrid model at H = 7. The shaded band represents the 5–95% variability range across 10 independent runs.
Figure 5. Scaled-space decomposition of the Hybrid model at H = 7. The dotted horizontal line denotes the zero-reference level; shaded bands denote ±1 standard deviation across 10 runs.
Figure 6. Global attention profile of the Deep pathway at H = 7, shown as the mean attention weights across the test set and independent runs, with shaded bands indicating across-run variability.
Figure 7. Diagnostic analysis of the Hybrid model at H = 7. (a) Permutation feature importance indicates greater sensitivity to day-of-week encodings than to the technical indicators under the adopted perturbation scheme. (b) The polar plot shows the distribution of Mean Absolute Error across the days of the week; differences are small in magnitude and should be interpreted cautiously. The radial scale, color scale, and in-figure MAE range annotation in panel (b) emphasize that the apparent day-of-week differences occur within a narrow numerical MAE range.
Table 1. Summary of mathematical notation used in this paper.
| Symbol | Description |
|---|---|
| $P_t$ | Observed scrap steel price at time $t$ |
| $\hat{P}_{t+h}$ | Forecasted price at horizon $h$ |
| $\Delta_h(t)$ | Price change target, defined as $P_{t+h} - P_t$ |
| $\hat{\Delta}_h(t)$ | Predicted price change at horizon $h$ |
| $\mathcal{H}$ | Set of forecast horizons, $\{1, 3, 7\}$ |
| $H$ | Forecast horizon |
| $W$ | Sliding-window length |
| $F$ | Number of input features |
| $D$ | Output dimension, equal to $\lvert \mathcal{H} \rvert$ |
| $x_t$ | Input feature vector at time $t$ |
| $X_t$ | Input window matrix at anchor time $t$ |
| $h_t$ | Hidden representation at time $t$ |
| $H_t$ | Sequence of hidden representations over the window |
| $\alpha_{t,k}$ | Attention weight assigned to lag position $k$ |
| $c_t$ | Attention-weighted context vector |
| $\hat{\Delta}_{\mathrm{wide}}(t)$ | Wide-pathway prediction |
| $\hat{\Delta}_{\mathrm{deep}}(t)$ | Deep-pathway prediction |
| $\hat{\Delta}(t)$ | Final Hybrid delta prediction |
| $Q(L)$ | Ljung–Box statistic at lag order $L$ |
| $\hat{\rho}_k$ | Sample autocorrelation at lag $k$ |
| $n$ | Number of test observations |
| $y_i$ | Observed value for metric computation |
| $\hat{y}_i$ | Predicted value for metric computation |
| $\bar{y}$ | Mean of observed values |
Table 2. Stationarity test results for the raw price series and derived target series, computed on the first 80% of the original chronological raw price series before multi-horizon target alignment. The raw diagnostic segment contains 2568 daily observations; derived series have horizon-dependent valid lengths because differencing reduces the number of usable observations. ADF: Augmented Dickey–Fuller test (H0: unit root present); KPSS: Kwiatkowski–Phillips–Schmidt–Shin test (H0: series is stationary). For KPSS, p < 0.01 indicates that the reported value is at the lower boundary of the look-up table (actual p is smaller), whereas p > 0.10 indicates that the reported value is at the upper boundary (actual p is larger).
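| Series | ADF Stat | ADF p | KPSS Stat | KPSS p | Result |
|---|---|---|---|---|---|
| Raw price ($P_t$) | −2.389 | 0.145 | 3.986 | <0.01 | Non-stationary |
| $\Delta_{H=1}$ | −8.934 | <0.001 | 0.054 | >0.10 | Stationary |
| $\Delta_{H=3}$ | −7.895 | <0.001 | 0.051 | >0.10 | Stationary |
| $\Delta_{H=7}$ | −8.258 | <0.001 | 0.049 | >0.10 | Stationary |
| Log-return | −9.538 | <0.001 | 0.058 | >0.10 | Stationary |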
Table 3. RMSE comparison (mean ± std over 10 runs). For each horizon H, we compare raw-price vs. delta-targeted training across all evaluated neural models.
| Model | H = 1 (Raw) | H = 1 (Delta) | H = 3 (Raw) | H = 3 (Delta) | H = 7 (Raw) | H = 7 (Delta) |
|---|---|---|---|---|---|---|
| RNN | 2.732 ± 0.142 | 2.271 ± 0.010 | 4.595 ± 0.076 | 4.258 ± 0.019 | 7.690 ± 0.075 | 7.289 ± 0.038 |
| GRU | 2.692 ± 0.087 | 2.266 ± 0.005 | 4.540 ± 0.048 | 4.268 ± 0.024 | 7.686 ± 0.035 | 7.372 ± 0.077 |
| LSTM | 3.671 ± 0.177 | 2.287 ± 0.019 | 5.350 ± 0.166 | 4.339 ± 0.085 | 8.485 ± 0.129 | 7.470 ± 0.076 |
| DLinear | 38.124 ± 8.405 | 2.465 ± 0.100 | 39.892 ± 8.669 | 4.425 ± 0.041 | 39.426 ± 6.547 | 7.472 ± 0.295 |
| N-BEATS | 34.111 ± 9.277 | 2.327 ± 0.061 | 35.344 ± 9.807 | 4.382 ± 0.087 | 34.635 ± 9.917 | 7.592 ± 0.313 |
| C-KAN | 27.794 ± 5.894 | 2.338 ± 0.038 | 28.017 ± 7.472 | 4.424 ± 0.029 | 27.220 ± 6.700 | 7.691 ± 0.051 |
Table 4. MAPE (%) comparison (mean ± std over 10 runs). For each horizon H, we compare raw-price vs. delta-targeted training across all evaluated neural models.
| Model | H = 1 (Raw) | H = 1 (Delta) | H = 3 (Raw) | H = 3 (Delta) | H = 7 (Raw) | H = 7 (Delta) |
|---|---|---|---|---|---|---|
| RNN | 0.459 ± 0.040 | 0.303 ± 0.010 | 0.804 ± 0.029 | 0.729 ± 0.014 | 1.501 ± 0.031 | 1.393 ± 0.014 |
| GRU | 0.446 ± 0.034 | 0.302 ± 0.006 | 0.786 ± 0.018 | 0.746 ± 0.015 | 1.505 ± 0.018 | 1.420 ± 0.030 |
| LSTM | 0.652 ± 0.043 | 0.315 ± 0.014 | 0.976 ± 0.040 | 0.765 ± 0.021 | 1.681 ± 0.032 | 1.452 ± 0.028 |
| DLinear | 8.458 ± 2.236 | 0.395 ± 0.042 | 8.866 ± 2.202 | 0.803 ± 0.012 | 8.773 ± 1.713 | 1.470 ± 0.080 |
| N-BEATS | 7.529 ± 2.261 | 0.318 ± 0.038 | 7.801 ± 2.345 | 0.757 ± 0.024 | 7.645 ± 2.368 | 1.515 ± 0.096 |
| C-KAN | 6.025 ± 1.798 | 0.296 ± 0.047 | 6.136 ± 2.084 | 0.683 ± 0.034 | 5.961 ± 1.946 | 1.460 ± 0.029 |
Table 5. R² comparison (mean ± std over 10 runs). For each horizon H, we compare raw-price vs. delta-targeted training across all evaluated neural models.
| Model | H = 1 (Raw) | H = 1 (Delta) | H = 3 (Raw) | H = 3 (Delta) | H = 7 (Raw) | H = 7 (Delta) |
|---|---|---|---|---|---|---|
| RNN | 0.986 ± 0.001 | 0.991 ± 0.000 | 0.962 ± 0.001 | 0.967 ± 0.000 | 0.893 ± 0.002 | 0.903 ± 0.001 |
| GRU | 0.987 ± 0.001 | 0.991 ± 0.000 | 0.963 ± 0.001 | 0.967 ± 0.000 | 0.894 ± 0.001 | 0.901 ± 0.002 |
| LSTM | 0.975 ± 0.002 | 0.990 ± 0.000 | 0.948 ± 0.003 | 0.966 ± 0.001 | 0.870 ± 0.004 | 0.898 ± 0.002 |
| DLinear | −1.786 ± 1.284 | 0.989 ± 0.001 | −2.037 ± 1.153 | 0.964 ± 0.001 | −1.901 ± 0.995 | 0.898 ± 0.008 |
| N-BEATS | −1.279 ± 1.143 | 0.990 ± 0.001 | −1.446 ± 1.220 | 0.965 ± 0.001 | −1.346 ± 1.273 | 0.895 ± 0.009 |
| C-KAN | −0.476 ± 0.695 | 0.990 ± 0.000 | −0.529 ± 1.002 | 0.964 ± 0.000 | −0.423 ± 0.823 | 0.892 ± 0.001 |
Table 6. RAMP score comparison for raw-price vs. delta-targeted neural models (mean ± std over 10 runs). Lower values indicate better local change-tracking accuracy.
| Model | H = 1 (Raw) | H = 1 (Delta) | H = 3 (Raw) | H = 3 (Delta) | H = 7 (Raw) | H = 7 (Delta) |
|---|---|---|---|---|---|---|
| RNN | 1.819 ± 0.067 | 1.652 ± 0.021 | 1.956 ± 0.072 | 1.887 ± 0.031 | 1.980 ± 0.060 | 2.095 ± 0.035 |
| GRU | 1.558 ± 0.009 | 1.593 ± 0.016 | 1.698 ± 0.018 | 1.811 ± 0.015 | 1.741 ± 0.033 | 1.990 ± 0.025 |
| LSTM | 1.635 ± 0.008 | 1.575 ± 0.009 | 1.732 ± 0.014 | 1.788 ± 0.018 | 1.876 ± 0.013 | 1.889 ± 0.031 |
| DLinear | 11.477 ± 5.044 | 2.074 ± 0.041 | 12.139 ± 5.435 | 2.294 ± 0.060 | 11.376 ± 4.873 | 2.634 ± 0.216 |
| N-BEATS | 9.143 ± 2.629 | 1.666 ± 0.041 | 9.703 ± 3.645 | 2.005 ± 0.147 | 9.447 ± 3.074 | 2.233 ± 0.395 |
| C-KAN | 0.880 ± 0.047 | 1.491 ± 0.001 | 0.874 ± 0.030 | 1.553 ± 0.001 | 0.873 ± 0.024 | 1.473 ± 0.002 |
Table 7. Normalized HD comparison for raw-price vs. delta-targeted neural models (mean ± std over 10 runs). Lower values indicate stronger geometric similarity between predicted and observed trajectories.

| Model | H = 1 (Raw) | H = 1 (Delta) | H = 3 (Raw) | H = 3 (Delta) | H = 7 (Raw) | H = 7 (Delta) |
|---|---|---|---|---|---|---|
| RNN | 0.077 ± 0.015 | 0.024 ± 0.003 | 0.074 ± 0.013 | 0.058 ± 0.003 | 0.077 ± 0.016 | 0.107 ± 0.006 |
| GRU | 0.085 ± 0.011 | 0.022 ± 0.002 | 0.076 ± 0.008 | 0.054 ± 0.003 | 0.094 ± 0.007 | 0.108 ± 0.007 |
| LSTM | 0.089 ± 0.005 | 0.022 ± 0.002 | 0.100 ± 0.007 | 0.059 ± 0.004 | 0.123 ± 0.008 | 0.103 ± 0.009 |
| DLinear | 0.47 ± 0.06 | 0.04 ± 0.00 | 0.47 ± 0.07 | 0.06 ± 0.00 | 0.47 ± 0.05 | 0.11 ± 0.01 |
| N-BEATS | 0.42 ± 0.11 | 0.03 ± 0.01 | 0.42 ± 0.09 | 0.07 ± 0.01 | 0.41 ± 0.09 | 0.10 ± 0.01 |
| C-KAN | 0.636 ± 0.103 | 0.016 ± 0.004 | 0.637 ± 0.114 | 0.015 ± 0.003 | 0.606 ± 0.113 | 0.020 ± 0.006 |
Table 8. RAMP comparison across all evaluated methods. Lower values indicate better local change-tracking accuracy. Classical baselines are deterministic, whereas neural models are reported as mean ± standard deviation over 10 runs.

| Model | H = 1 | H = 3 | H = 7 |
|---|---|---|---|
| Naive | 1.5016 | 1.5402 | 1.4282 |
| Seasonal Naive | 1.4436 | 1.4334 | 1.4282 |
| SARIMA | 1.9760 | 2.2901 | 2.3402 |
| ETS | 1.6835 | 1.8808 | 2.0300 |
| Δ-RNN | 1.652 ± 0.021 | 1.887 ± 0.031 | 2.095 ± 0.035 |
| Δ-GRU | 1.593 ± 0.016 | 1.811 ± 0.015 | 1.990 ± 0.025 |
| Δ-LSTM | 1.575 ± 0.009 | 1.788 ± 0.018 | 1.889 ± 0.031 |
| Δ-DLinear | 2.074 ± 0.041 | 2.294 ± 0.060 | 2.634 ± 0.216 |
| Δ-N-BEATS | 1.666 ± 0.041 | 2.005 ± 0.147 | 2.233 ± 0.395 |
| Δ-C-KAN | 1.491 ± 0.001 | 1.553 ± 0.001 | 1.473 ± 0.002 |
| Hybrid | 1.719 ± 0.126 | 1.914 ± 0.100 | 2.180 ± 0.120 |
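The Naive and Seasonal Naive rows report the same value at H = 7. A possible explanation, sketched below, is that the two baselines coincide when the forecast horizon equals the seasonal period. The sketch uses the textbook definitions of both baselines and assumes a weekly period of 7 observations. That period is an illustrative assumption and is not stated in the table, and the function names are hypothetical.

```python
def naive_forecast(history, horizon):
    """Naive baseline: repeat the last observed value at every horizon."""
    return history[-1]


def seasonal_naive_forecast(history, horizon, season=7):
    """Seasonal naive baseline: reuse the most recent observation that
    falls at the same position in the seasonal cycle as the target step.

    Assumes season=7 (weekly cycle) and len(history) >= season.
    """
    k = (horizon - 1) // season  # number of full cycles spanned by the horizon
    return history[len(history) - 1 + horizon - season * (k + 1)]


prices = list(range(1, 15))  # toy series 1..14, not real scrap price data

h1 = seasonal_naive_forecast(prices, 1)  # value observed 6 steps before the last
h3 = seasonal_naive_forecast(prices, 3)  # value observed 4 steps before the last
h7 = seasonal_naive_forecast(prices, 7)  # the last value itself
```

Under this assumption, the seasonal naive forecast for H = 7 is the last observed value, which is exactly the naive forecast, so any metric computed on the two sets of forecasts is identical.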
Table 9. Normalized HD comparison across all evaluated methods. Lower values indicate stronger geometric similarity between predicted and observed trajectories. Classical baselines are deterministic, whereas neural models are reported as mean ± standard deviation over 10 runs.

| Model | H = 1 | H = 3 | H = 7 |
|---|---|---|---|
| Naive | 0.0016 | 0.0686 | 0.0628 |
| Seasonal Naive | 0.0110 | 0.0686 | 0.0628 |
| SARIMA | 0.0416 | 0.0826 | 0.1711 |
| ETS | 0.0285 | 0.0809 | 0.1560 |
| Δ-RNN | 0.024 ± 0.003 | 0.058 ± 0.003 | 0.107 ± 0.006 |
| Δ-GRU | 0.022 ± 0.002 | 0.054 ± 0.003 | 0.108 ± 0.007 |
| Δ-LSTM | 0.022 ± 0.002 | 0.059 ± 0.004 | 0.103 ± 0.009 |
| Δ-DLinear | 0.040 ± 0.003 | 0.060 ± 0.003 | 0.110 ± 0.010 |
| Δ-N-BEATS | 0.030 ± 0.010 | 0.070 ± 0.010 | 0.100 ± 0.010 |
| Δ-C-KAN | 0.016 ± 0.004 | 0.015 ± 0.003 | 0.020 ± 0.006 |
| Hybrid | 0.025 ± 0.003 | 0.055 ± 0.006 | 0.109 ± 0.010 |
Table 10. Ljung–Box Q(10) residual autocorrelation diagnostics across all evaluated methods. For deterministic baselines, the test statistic (Q) and p-value are reported directly. For neural models, the Ljung–Box statistic and the rejection count out of 10 independent runs are reported. H₀: no residual autocorrelation up to lag 10.

| Model | H = 1: Q(10) | H = 1: Decision | H = 3: Q(10) | H = 3: Decision | H = 7: Q(10) | H = 7: Decision |
|---|---|---|---|---|---|---|
| Naive | 42.67 | Reject | 641.04 | Reject | 1936.74 | Reject |
| Seasonal Naive | 1948.42 | Reject | 1949.15 | Reject | 1936.74 | Reject |
| SARIMA | 11.16 | Do not reject | 356.59 | Reject | 1160.93 | Reject |
| ETS | 12.55 | Do not reject | 394.41 | Reject | 1305.23 | Reject |
| Δ-RNN | 4.24 | 0/10 reject | 321.72 | 10/10 reject | 958.13 | 10/10 reject |
| Δ-GRU | 6.08 | 0/10 reject | 344.27 | 10/10 reject | 978.29 | 10/10 reject |
| Δ-LSTM | 8.29 | 0/10 reject | 368.50 | 10/10 reject | 1080.07 | 10/10 reject |
| Δ-DLinear | 13.93 | 2/10 reject | 313.67 | 10/10 reject | 820.24 | 10/10 reject |
| Δ-N-BEATS | 12.75 | 1/10 reject | 386.06 | 10/10 reject | 1157.76 | 10/10 reject |
| Δ-C-KAN | 35.11 | 10/10 reject | 539.67 | 10/10 reject | 1567.05 | 10/10 reject |
| Hybrid | 12.15 | 3/10 reject | 317.67 | 10/10 reject | 986.96 | 10/10 reject |
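As an illustration of the diagnostic in Table 10, the sketch below computes the Ljung–Box statistic Q(h) = n(n + 2) Σ ρ_k² / (n − k), summed over lags k = 1, ..., h, and counts rejections across runs. It is a minimal sketch and not the authors' implementation. It assumes a 5% significance level and a chi-square reference distribution with 10 degrees of freedom (critical value 18.307), without any adjustment for fitted model parameters. The function names are hypothetical.

```python
# 95th percentile of the chi-square distribution with 10 degrees of freedom.
# Only valid for max_lag=10 at the assumed 5% significance level.
CHI2_CRIT_95_DF10 = 18.307


def ljung_box_q(residuals, max_lag=10):
    """Ljung-Box Q statistic for residual autocorrelation up to max_lag."""
    n = len(residuals)
    m = sum(residuals) / n
    centered = [r - m for r in residuals]
    denom = sum(c * c for c in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k.
        num = sum(centered[t] * centered[t - k] for t in range(k, n))
        rho_k = num / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


def rejects_h0(residuals, critical_value=CHI2_CRIT_95_DF10):
    """True if H0 (no autocorrelation up to lag 10) is rejected."""
    return ljung_box_q(residuals, max_lag=10) > critical_value


def rejection_count(runs):
    """Number of runs whose residuals reject H0, as in 'k/10 reject'."""
    return sum(1 for residuals in runs if rejects_h0(residuals))


# Toy residuals with a strong trend, hence strong autocorrelation.
trending = [float(i) for i in range(100)]
q_small = ljung_box_q([1.0, -1.0, 1.0, -1.0], max_lag=1)
```

Under these assumptions, Q values below 18.307 correspond to "Do not reject", which is consistent with the H = 1 entries for SARIMA and ETS, while the much larger values at H = 3 and H = 7 correspond to rejection.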
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cifci, N.S.; Karatay, M.; Demirel, Y.; Aygul, Y.; Ugurlu, O. A Delta-Targeted Hybrid Deep Learning Architecture for Short-Term Scrap Steel Price Forecasting: A Comparative Study. Appl. Sci. 2026, 16, 4981. https://doi.org/10.3390/app16104981
