1. Introduction and Background
Forecasting has been used for centuries to support decisions in everyday life, such as planning agriculture, managing trade, or predicting weather conditions. Early forecasting relied on observation and experience, and later developed into statistical models that describe trends, cycles, and uncertainty. Major milestones included the formalization of probability theory and the introduction of time series models such as autoregressive [1] and moving average processes [2], which enabled uncertainty to be quantified and future values to be estimated in a systematic manner. These advances established forecasting as a scientific discipline and laid the foundation for modern predictive methods used across many domains.
Artificial intelligence emerged as a much later development, driven by the ambition to construct systems capable of learning from data rather than relying solely on predefined rules. Early approaches focused on symbolic reasoning systems, while subsequent advances introduced machine learning algorithms such as decision trees and neural networks that adapt their behavior through experience. With the availability of large datasets and increased computational capacity, these models have become effective tools for identifying complex patterns and nonlinear relationships. As a result, artificial intelligence has gradually become a natural extension of forecasting, offering new possibilities for prediction while also introducing new sources of uncertainty and methodological challenges.
1.1. Forecasting Importance in Decision-Making Systems
Forecasts are not merely descriptive summaries of historical patterns; in most operational settings, they function as inputs to decisions. In supply chains, production planning, procurement, and inventory control, forecasts directly influence order quantities, capacity plans, replenishment schedules, and service-level commitments. Consequently, forecast errors propagate through planning systems and can amplify upstream decisions, increasing costs or deteriorating customer service. A classical illustration of such amplification is the bullwhip effect: when demand information is transmitted through orders, variability and distortion tend to grow along the supply chain, affecting inventory and production decisions of upstream members [3]. This systemic role implies that the value of forecasting cannot be assessed only by statistical fit, but by its impact on downstream decisions and operational performance.
From an inventory perspective, forecast uncertainty is a primary driver of safety stock and service levels. When demand is forecasted rather than assumed known, decision rules need to explicitly account for forecast error; otherwise, inventory buffers can be systematically under- or over-dimensioned. Research on safety-stock calculation under forecasted demand highlights that ignoring the structure of forecast errors (e.g., correlation over time, differences between forecast interval and lead time) can lead to biased service-level performance and suboptimal inventory investments [4]. This connection between forecasting and inventory performance is especially salient in weekly operational planning, where short cycles leave limited time for corrective actions and where forecast errors quickly translate into material shortages, expedited shipments, or excess stock.
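For intuition, the common textbook approximation (not the specific formulation analyzed in [4]) sizes safety stock from the standard deviation of forecast errors, scaled to the lead time under an independence assumption; a minimal sketch, with illustrative figures:

```python
from math import sqrt

def safety_stock(z, sigma_err, lead_time_weeks, interval_weeks=1):
    """Textbook safety-stock approximation (illustrative, not from [4]).

    Scales the per-interval forecast-error standard deviation to the lead
    time, assuming errors are independent across weeks -- the very
    assumption whose violation the cited research warns about.
    """
    return z * sigma_err * sqrt(lead_time_weeks / interval_weeks)

# z = 1.65 targets roughly a 95% cycle service level under normal errors
buffer = safety_stock(z=1.65, sigma_err=40.0, lead_time_weeks=4)  # ~132 units
```

If forecast errors are positively correlated over the lead time, this formula understates the required buffer, which is one concrete mechanism behind the biased service-level performance noted above.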
In practice, the economic consequences of forecast errors are often asymmetric. Over-forecasting may inflate holding costs and obsolescence risk, whereas under-forecasting can trigger stockouts, lost sales, or costly expediting. Because these costs differ by item criticality, lead time, and substitution possibilities, the same numerical error can imply very different operational outcomes. This asymmetry motivates forecasting processes that are tailored to decision contexts (e.g., procurement vs. finished goods planning) and underscores that improving accuracy is not an abstract academic objective, but a means to reduce operational risk. Practitioner-oriented frameworks in demand planning emphasize that forecast accuracy targets and evaluation must be aligned to the horizon and use case, rather than treated as universal benchmarks [5].
At the same time, the forecasting landscape has been transformed by the widespread adoption of machine learning and deep learning methods, raising expectations of “automatic” accuracy gains. Large-scale empirical evidence, however, suggests that progress is nuanced: competitions and benchmarks demonstrate that performance depends strongly on data characteristics and that robust baselines and model combinations often remain hard to beat across heterogeneous time series collections [6]. This is consistent with the broader forecasting literature, which stresses that practical gains come as much from process quality (appropriate evaluation design, leakage-free validation, and careful metric selection) as from model class [7]. In other words, even sophisticated models can appear successful under weak evaluation protocols, while well-designed evaluations frequently reveal that accuracy improvements are conditional rather than universal.
Organizational practice further reinforces why forecasting “importance” should be discussed beyond model mechanics. In many firms, forecasts are frequently adjusted judgmentally, and these adjustments can help or harm depending on incentives, information flows, and governance. Empirical and field-based work on forecasting in organizations shows that judgmental interventions are common and can become counterproductive when they systematically override statistical structure without accountability or feedback loops [8]. These findings highlight that forecasting quality is shaped not only by algorithms but also by how forecasts are produced, challenged, and used in decision-making routines.
Taken together, the literature frames forecasting as a system-critical capability: it affects operational decisions, buffers risk through inventory and capacity, and interacts with organizational behaviors and evaluation practices. Therefore, studies that compare forecasting approaches under controlled and consistent conditions contribute not only to methodological understanding but also to operational decision support, especially when the objective is to identify when improvements are feasible and when uncertainty is structurally irreducible. Introductory forecasting texts similarly emphasize that the central goal is “sensible use” of forecasting methods in context, rather than blind escalation of model complexity [9]. This perspective motivates empirical evaluations that focus on the decision relevance of accuracy and on the dependency of performance on data structure and operational constraints.
The increasing availability of machine learning tools has led to a rapid adoption of LSTM-based forecasting models in operational environments. However, when such models are applied without explicit assessment of data–model compatibility, substantial practical risks emerge. Deep learning models may converge numerically and produce seemingly smooth forecasts even when the underlying time series lacks a stable temporal structure. In such cases, the model effectively learns a regression-to-the-mean representation rather than meaningful signal dynamics.
From a decision-making perspective, this can create a false sense of reliability. Practitioners may interpret stable-looking forecasts as evidence of predictive power, while in reality, the model merely suppresses volatility. In inventory and procurement systems, such misaligned forecasts can lead to systematic underestimation of peaks, distorted safety stock calculations, increased stockout risk, or unnecessary capital tied up in excess inventory. Therefore, assessing whether data structure justifies the use of advanced learning-based models is not merely a methodological concern but a managerial necessity.
1.2. Overview of Forecasting Methodologies
Statistical forecasting methods form the classical foundation of time series analysis and are built on explicit probabilistic assumptions about how data evolve over time. These models typically decompose an observed series into components such as trend, seasonality, and random noise, allowing each element to be analyzed and modeled separately. Autoregressive and moving average models estimate future values as linear combinations of past observations and past errors, while integrated variants address non-stationarity through differencing [10]. Exponential smoothing methods assign decaying weights to historical data, making them effective for capturing slowly changing patterns with limited computational effort [11]. State space models further generalize these ideas by representing time series dynamics through latent variables and observation equations, enabling recursive estimation and uncertainty quantification [12]. Recent studies have reaffirmed the relevance of these approaches in modern applications, demonstrating their robustness in industrial forecasting and resource management tasks [13]. Extended seasonal models such as TBATS and advanced ETS variants have been shown to handle multiple seasonal patterns more effectively [14], while hybrid statistical formulations incorporating Kalman filtering have improved adaptability to evolving systems [15]. Comparative reviews published in recent years have emphasized that, despite advances in data-driven methods, classical statistical models remain competitive baselines due to their transparency and stability [16]. However, their effectiveness depends strongly on stable data-generating mechanisms and may decline when relationships are highly nonlinear or subject to abrupt structural changes.
Heuristic forecasting methods are designed to provide practical and easily interpretable predictions without relying on strict statistical assumptions. These approaches typically use simple rules derived from empirical observations, such as moving averages, naïve persistence models, or manually adjusted forecasts based on domain expertise [17]. Their primary advantage lies in robustness and transparency, as the logic behind the prediction is straightforward and can be readily communicated to decision makers. Recent studies have shown that heuristic models remain valuable as benchmark references, particularly for evaluating more complex forecasting systems [18]. Heuristic methods are often applied in situations where data quality is poor, historical depth is limited, or rapid operational decisions are required [19]. However, their simplicity restricts their ability to capture complex temporal structures, interactions, or long-term dependencies, which limits their effectiveness for long-horizon or highly dynamic forecasting tasks [20].
Artificial intelligence-based forecasting methods range from shallow learning models to deep neural architectures. Shallow learning approaches, such as feedforward neural networks trained via backpropagation, rely on limited network depth and externally defined lag structures. In contrast, deep learning models introduce hierarchical representations and recurrent memory mechanisms that allow the direct modeling of sequential dependencies. In recent forecasting research, deep learning architectures have become dominant, reflecting the growing emphasis on modeling complex temporal dynamics.
These methods aim to learn predictive relationships directly from data through adaptive algorithms. Techniques such as feedforward neural networks, support vector machines, and recurrent architectures, including LSTM models, can model nonlinear patterns and temporal dependencies without predefined functional forms [21,22]. Recent comparative studies have shown that recurrent and hybrid deep learning architectures often outperform classical statistical models on complex financial and industrial time series, particularly when nonlinear dynamics dominate [23]. Transformer-based and LSTM–Transformer hybrid models have further demonstrated improved robustness and long-horizon accuracy on large-scale forecasting benchmarks [24]. Empirical applications in demand forecasting and supply chain management have confirmed that AI-driven models can achieve higher predictive accuracy than traditional baselines when sufficient data are available [25]. At the same time, these methods introduce challenges related to limited interpretability, sensitivity to noise, and strong dependence on training data quality [26]. As a result, artificial intelligence models are often most effective when combined with careful preprocessing and complementary baseline methods to ensure stability and reliable evaluation [27].
1.3. Research Gaps in Practical Forecasting Applications
Based on the reviewed methods and practical considerations, several research gaps can be identified. First, despite the widespread use of simple forecasting techniques such as moving averages, exponential smoothing, and naïve persistence in procurement practice, limited empirical evidence exists on how much forecast accuracy is gained when these methods are replaced by LSTM-based models under identical data and evaluation settings. Second, there is insufficient analysis of whether the added complexity of LSTM models translates into consistent accuracy improvements for weekly operational data that are typically handled in spreadsheet-based workflows. Third, the behavior of LSTM forecasts relative to simple baselines remains underexplored, particularly in cases where demand patterns are noisy or weakly structured. Addressing these gaps is essential for understanding when advanced forecasting models provide tangible benefits over established practical approaches.
Beyond methodological comparison, insufficient attention has been given to the operational consequences of misapplying complex forecasting models. While LSTM architectures are often promoted as universally superior, empirical evidence rarely examines how performance deteriorates when data lack sufficient structure. Without such understanding, organizations may adopt computationally intensive models that increase system complexity without improving decision quality.
To synthesize the key insights from the reviewed literature, Table 1 summarizes the role of forecasting in decision-making systems, the main categories of forecasting methodologies, and the identified research gaps. This structured overview highlights how data characteristics, rather than model complexity alone, determine forecasting performance and motivates the empirical analysis presented in the subsequent sections.
This study contributes to the literature in several ways. First, it provides a controlled, structure-dependent comparison of classical forecasting methods (Naïve, MA, SES, Holt, ARIMA) and a uniformly configured LSTM on the same five real weekly time series under identical training and evaluation conditions, ensuring that performance differences are attributable to data characteristics rather than experimental design. Second, it demonstrates that forecasting accuracy is driven primarily by periodicity, volatility, and noise level rather than model complexity alone, consistent with bias–variance and signal-to-noise considerations. Third, it introduces a deliberately adapted LSTM framework for low-data environments, incorporating a causal moving-average input, omission of dropout due to constrained model capacity, and batch size of one to introduce implicit regularization through stochastic gradient updates; these design decisions were motivated by stability and generalization requirements rather than arbitrary modification. Finally, the study validates the operational feasibility of LSTM-based forecasting in a real corporate weekly procurement context, emphasizing practical applicability over architectural novelty.
To respond to these issues, the remainder of the article is structured as follows. Section 2 introduces the forecasting problem and describes the underlying data, including their temporal resolution and basic characteristics. Section 3 presents the simple forecasting methods commonly applied in practice and evaluates their forecast accuracy on the given data. Section 4 introduces the adapted LSTM model and applies it to the same database under consistent experimental conditions. Finally, Section 5 compares the results across all methods, with particular emphasis on differences in forecast accuracy between the simple baselines and the LSTM approach.
In addition to methodological comparison, this study aims to clarify the boundary conditions under which LSTM models provide operational value. By explicitly linking data structure, forecast behavior, and decision relevance, the analysis contributes to a more informed and risk-aware application of advanced forecasting methods in practical planning environments.
2. Dataset Collection and Commonly Used Statistical Forecasting Methods
In this study, forecasting is applied to support planning and decision-making in a supply chain context, where reliable demand and material consumption estimates are essential for production, inventory, and procurement activities. To enable this analysis, historical operational data must first be collected and structured in a form suitable for both traditional and learning-based forecasting methods.
2.1. Dataset Description and Forecasting Context
The dataset used in this research was queried from the SAP (SAP S/4HANA 2023 FPS03) database of a real manufacturing company, with all sensitive product identifiers removed to protect confidentiality. It contains weekly demand histories for five distinct materials (Items A, B, C, D, E). The main objective of analyzing this dataset was to determine how accurately a uniformly constructed LSTM model and other statistical models can estimate future values across time series with differing structural properties and noise levels.
To enable both spreadsheet-based training and consistent benchmarking, the first 100 weeks of data were used for model training, and the subsequent 10 weeks served as the evaluation period. Although 10 weeks may appear short in absolute terms, such a horizon is meaningful in operational planning contexts where weekly forecasts directly influence decision-making.
In practical supply chain and procurement settings, typical forecast accuracy expectations vary by time horizon and application. For finished goods forecasting, realistic accuracy ranges are [5]:
1 week: 90–98% (used for operational planning and daily production)
1 month: 80–90% (short-term production and inventory planning)
3 months: 65–80% (capacity planning and tactical scheduling)
6 months: 55–70% (Sales & Operations Planning level)
1 year: 40–60% (strategic direction rather than detailed prediction)
For raw material procurement and supply planning, typical accuracy ranges tend to be higher due to aggregated order patterns and lower volatility [5]:
1 week: 95–99% (driven by specific production requirements)
1 month: 85–95% (order quantities and replenishment planning)
3 months: 75–90% (framework agreements/medium-term commitments)
6 months: 65–80% (capacity reservations)
1 year: 55–70% (volume forecasting at category level, not individually)
Accuracy is influenced by many factors. Elements that typically degrade performance include promotional activities, seasonal peaks, new product introductions, customer campaigns, and long supplier lead times. Conversely, forecast quality improves with stable demand, fixed ordering agreements, product family or category level aggregation, and rolling forecast practices.
Figure 1 illustrates the raw weekly time series of Items A through E and already reveals substantial differences in their behavior. Item A exhibits a highly regular and repeating pattern with low noise, making it visually easy to predict. Item B also shows a clearly recurring structure, although it contains intermittent zero values and sharp peaks. Despite these discontinuities, its overall behavior remains deterministic and appears well suited for forecasting.
Item C is characterized by extreme variability and irregular spikes, indicating a high noise level and weak structural regularity. Item D shows moderate variability with only occasional repetition, suggesting limited and unstable predictability. Item E is dominated by strong and erratic fluctuations with no visible repeating structure, and therefore represents a highly noise-driven and poorly predictable series.
Differences in absolute magnitudes across items should not influence the interpretation of their temporal behavior. For this reason, Figure 2 presents a normalized representation of all series. Each item was scaled relative to its own maximum value, while the minimum was consistently set to zero, even in cases where the observed minimum was higher. This choice reflects a realistic production scenario, where a given item may have weeks with no production or usage at all. In addition, this normalization step was required for the LSTM training process, as learning-based models are sensitive to scale differences and require comparable value ranges to ensure stable and effective optimization.
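The scaling just described amounts to dividing each series by its own peak while anchoring the minimum at zero. A minimal sketch (function and variable names are illustrative, not taken from the study's implementation):

```python
def scale_to_unit(series, peak=None):
    """Scale a series to [0, 1] with the minimum fixed at zero.

    The divisor is the observed peak (or a supplied one), so a week with
    zero usage maps to 0 and the peak week maps to 1, regardless of the
    series' actual minimum.
    """
    peak = max(series) if peak is None else peak
    return [x / peak for x in series]

weekly_usage = [0, 40, 120, 80, 200, 160]   # illustrative weekly quantities
scaled = scale_to_unit(weekly_usage)
# scaled[0] == 0.0 and scaled[4] == 1.0
```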
The statistical analyses of the data are presented in Table 2. Based on the regression results obtained from the Excel Data Analysis ToolPak (M365), the following observations may be made.
Across all five items, substantial differences in scale and variability are observed. The coefficient of variation indicates moderate to high relative volatility for each series. Items A and B exhibit particularly high dispersion relative to their means (CV = 0.71 and 0.99, respectively), while Items C, D, and E show moderate variability (CV between 0.39 and 0.60). This confirms that the datasets are characterized by non-negligible dispersion, which may influence forecasting performance.
With respect to linear trend diagnostics, it is observed that none of the estimated slope coefficients are statistically significant at the conventional 5% level. For all items, the p-values associated with the slope parameter substantially exceed 0.05 (ranging from 0.4395 to 0.9767). Furthermore, the 95% confidence intervals for all slope estimates include zero, indicating that the presence of a deterministic linear trend cannot be supported statistically. Although positive slopes are estimated for Items A, B, D, and E and a negative slope for Item C, these effects are small in magnitude and statistically indistinguishable from zero.
The explanatory power of the linear models is negligible. The R2 values are extremely low for all items (between 0.0000 and 0.0061), and the adjusted R2 values are negative in each case. This implies that the linear time index explains virtually none of the variance in the observed demand. The correlation coefficients (R) are also close to zero, further confirming the absence of a meaningful linear association between time and demand.
Taken together, these results indicate that the examined time series do not exhibit statistically significant linear trends over the analyzed horizon. Instead, variability appears to be driven primarily by irregular fluctuations rather than systematic upward or downward movement. Consequently, the forecasting behavior observed in subsequent analyses cannot be attributed to strong deterministic trend components but must instead be interpreted in light of volatility, periodicity, and structural characteristics of the individual series.
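The slope and R² diagnostics reported above can be reproduced in outline with an ordinary least squares fit on a time index. The sketch below uses synthetic noise in place of the confidential demand data and numpy in place of the ToolPak:

```python
import numpy as np

def linear_trend_diagnostics(y):
    """OLS regression of demand on a time index; returns (slope, R^2)."""
    t = np.arange(len(y), dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    fitted = slope * t + intercept
    ss_res = float(np.sum((y - fitted) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return slope, 1.0 - ss_res / ss_tot

# A trendless noisy series (illustrative stand-in for the weekly demand)
# should show a near-zero slope and negligible R^2, as observed in Table 2.
rng = np.random.default_rng(0)
noise = rng.normal(100.0, 20.0, 110)
slope, r2 = linear_trend_diagnostics(noise)
```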
2.2. Commonly Used Statistical Forecasting Methods
In operational forecasting practice, a small set of simple methods continues to dominate daily decision-making, despite the availability of more advanced techniques. These approaches are still widely applied because they are easy to implement, well understood, and deliver reliable performance across a broad range of use cases. Methods such as moving averages and exponential smoothing, including Holt’s formulation, have remained in use for decades, as they provide stable and consistent forecasts regardless of the specific operational context. Their robustness and transparency make them trusted tools in practice and natural reference points for evaluating more complex forecasting models [2,11].
The naïve persistence method assumes that the future value of a time series equals its most recent observed value. Formally, the forecast at time $t$ is given by
$$\hat{y}_{t+1} = y_t.$$
This method contains no parameters and no learning mechanism. It is widely used because it represents the simplest possible forecasting rule and often reflects implicit decision-making in practice. Its importance lies in its role as a benchmark, since any advanced model is expected to outperform it in forecast accuracy to justify additional complexity [2].
The moving average method estimates future demand as the arithmetic mean of the most recent $k$ observations. The forecast is defined as
$$\hat{y}_{t+1} = \frac{1}{k} \sum_{i=0}^{k-1} y_{t-i}.$$
This approach smooths short-term fluctuations and reduces random noise. It is frequently used in operational planning because it is easy to compute, transparent, and effective when demand exhibits no strong trend or seasonality.
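The naïve persistence and moving-average rules are simple enough to state directly in code; the following sketch uses illustrative demand values:

```python
def naive_forecast(history):
    """Naive persistence: the forecast equals the most recent observation."""
    return history[-1]

def moving_average_forecast(history, k):
    """Forecast as the arithmetic mean of the last k observations."""
    window = history[-k:]
    return sum(window) / len(window)

demand = [120, 135, 128, 140, 150]        # illustrative weekly demand
naive_forecast(demand)                    # -> 150
moving_average_forecast(demand, 3)        # mean of the last three observations
```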
The simple exponential smoothing (SES) method improves upon moving averages by assigning exponentially decreasing weights to past observations [11]. The forecast is updated recursively as
$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\,\hat{y}_t,$$
where $\alpha \in (0, 1)$ is the smoothing parameter. Recent observations receive higher importance, making the method more responsive to changes. SES is commonly applied to stable time series without pronounced trends and is widely supported in spreadsheet-based tools.
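The recursive SES update can be sketched as follows; initializing the level with the first observation is one common convention, not necessarily the one used by spreadsheet tools:

```python
def ses_forecast(history, alpha):
    """Simple exponential smoothing; returns the one-step-ahead forecast.

    The level is seeded with the first observation (a common convention)
    and then updated recursively over the remaining history.
    """
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

demand = [100, 110, 105, 120]    # illustrative weekly demand
ses_forecast(demand, alpha=0.3)
```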
The Holt exponential smoothing method extends SES by explicitly modeling a trend component [11]. It is defined by
$$\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1}),$$
$$b_t = \beta(\ell_t - \ell_{t-1}) + (1 - \beta)\, b_{t-1},$$
$$\hat{y}_{t+h} = \ell_t + h\, b_t,$$
where $\ell_t$ denotes the level, $b_t$ the trend, $\alpha$ and $\beta$ are smoothing parameters, and $h$ denotes the forecast horizon, i.e., the number of time steps ahead for which the prediction is made. Holt’s method is used when demand shows a consistent upward or downward movement. It remains popular in practice because it balances simplicity with the ability to capture systematic trends.
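A minimal sketch of Holt's recursions, with a simple trend initialization chosen purely for illustration:

```python
def holt_forecast(history, alpha, beta, h):
    """Holt's linear method: smoothed level and trend, forecast = level + h*trend.

    Initialization (level = first observation, trend = first difference)
    is one simple convention among several used in practice.
    """
    level = history[0]
    trend = history[1] - history[0]
    for y in history[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + h * trend

demand = [100, 104, 109, 113, 118]    # steadily trending illustrative series
holt_forecast(demand, alpha=0.5, beta=0.3, h=2)
```

For this upward-trending input, the two-step-ahead forecast exceeds the last observation, which is exactly the behavior SES and the moving average cannot reproduce.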
Together, these methods represent the most widely adopted forecasting techniques in operational environments. They are used not because they are optimal in all cases, but because they provide stable, interpretable, and computationally efficient baselines against which more complex models, such as LSTM networks, can be evaluated.
2.3. ARIMA Forecasting Method
In addition to classical statistical forecasting techniques, a widely established statistical method was applied to support data analysis and to provide a structured baseline for comparison. Among these methods, the Autoregressive Integrated Moving Average model, commonly referred to as ARIMA, was selected due to its long-standing theoretical foundation and its extensive use in time series forecasting. ARIMA models have been thoroughly studied and continue to serve as a standard reference in both academic research and practical applications [2].
The ARIMA framework describes a univariate time series through three components. The autoregressive component captures linear dependence between current and past observations. The moving average component models the influence of past forecast errors on current values. The integrated component addresses nonstationarity by applying differencing to the original series. Together, these elements enable ARIMA models to represent a wide range of temporal behaviors under relatively simple assumptions.
An ARIMA model is denoted by the parameter triplet (p, d, q), whose interpretation is summarized in Table 3. Here, p represents the order of the autoregressive component, d denotes the degree of differencing applied to achieve stationarity, and q specifies the order of the moving average component. Model identification typically begins with an examination of the time series properties, including trend presence, variance stability, and autocorrelation structure. Stationarity is a critical requirement for ARIMA modeling, and differencing is applied when trends or slow level shifts are detected.
In this study, ARIMA models were used primarily for exploratory data analysis and as a classical forecasting benchmark. Before model estimation, each time series was inspected visually and through summary statistics to assess its structural characteristics. Stationarity was formally evaluated using the Augmented Dickey–Fuller (ADF) test at a 5% significance level. If non-stationarity was detected, first- or second-order differencing was applied (d ∈ {0, 1, 2}). The autoregressive and moving average orders were determined through a grid search over p ∈ {0, 1, 2, 3} and q ∈ {0, 1, 2, 3}, resulting in up to 48 candidate ARIMA (p, d, q) specifications per series. Parameters were estimated using maximum likelihood, and the Akaike Information Criterion (AIC) was applied to select the optimal specification, ensuring a balance between goodness-of-fit and model parsimony. Only convergent models satisfying stationarity and invertibility conditions were retained.
Residual diagnostics were performed on the selected models to verify adequacy. The Ljung–Box test (lag = 10) was applied to confirm the absence of significant residual autocorrelation, and residual ACF plots were visually inspected to ensure that remaining linear dependence was minimal. Forecasts were generated analytically for a 10-week horizon using the fitted ARIMA models without recursive re-estimation. The same chronological train–test split and evaluation protocol was applied as in the LSTM experiments to ensure a fair and transparent benchmark comparison.
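The selection mechanics (candidate grid, fit, information-criterion comparison) can be illustrated with a simplified pure-autoregressive family estimated by OLS. This is a sketch of the grid-search-plus-AIC logic only, not the maximum-likelihood ARIMA estimation used in the study:

```python
import numpy as np

def fit_ar(y, p):
    """OLS fit of an AR(p) model with intercept; returns its AIC.

    A simplified stand-in for full ARIMA estimation, used here only to
    illustrate order selection by information criterion.
    """
    y = np.asarray(y, dtype=float)
    n = len(y) - p
    target = y[p:]
    X = np.ones((n, p + 1))
    for lag in range(1, p + 1):
        X[:, lag] = y[p - lag : len(y) - lag]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    sigma2 = float(np.mean(resid ** 2))
    k = p + 2                      # intercept, p lag coefficients, variance
    return n * np.log(sigma2) + 2 * k

# Synthetic series with true AR(2) dynamics: the grid search should
# recover an order close to 2 via the lowest AIC.
rng = np.random.default_rng(1)
y = np.zeros(200)
for t in range(2, 200):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

best_p = min(range(4), key=lambda p: fit_ar(y, p))
```

The same pattern, extended over (p, d, q) with differencing and MA terms, yields the up to 48 candidate specifications described above.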
Within this study, ARIMA modeling served two main purposes. First, it provided an interpretable statistical description of the temporal dynamics of each item, allowing differences in stability and persistence to be identified through model order and fit quality. Series with regular behavior were typically well represented by low-order ARIMA models, while highly volatile items showed weaker performance.
Second, ARIMA forecasts were used as a conservative reference baseline. Due to their linear assumptions and limited flexibility, ARIMA models offer a meaningful benchmark against which the performance of more complex forecasting approaches can be evaluated. Performance gains beyond this baseline can therefore be attributed to modeling capabilities not captured by classical linear structure.
Overall, ARIMA models were found to perform reliably on stable series and short horizons, while their limitations became apparent for irregular and highly noisy data. Nevertheless, their transparency and well-understood properties make them a valuable analytical tool and an essential point of comparison in forecasting studies.
3. LSTM-Based Forecasting and Methodological Adaptations
In this section, the methodology applied for LSTM-based time series forecasting is described. The objective was to construct a unified modeling framework that can be applied consistently across multiple items and temporal resolutions, while enabling multi-step prediction and clear interpretation of model behavior. A Long Short-Term Memory network was selected due to its ability to capture temporal dependencies in sequential data and to retain relevant information over longer horizons [23,24].
The analysis was conducted on multiple weekly time series for five items denoted as A through E. All series were preprocessed individually to ensure numerical stability and comparability across items. Each time series was normalized to the interval [0, 1] using a normalization function, where the minimum value was fixed at zero, and the maximum was defined by the observed peak value. This normalization reflects realistic production conditions, where zero usage may occur at any time, and was also required to ensure stable and efficient LSTM training, particularly for items with large amplitude variation.
Although a standard LSTM architecture was used as the foundation, several deliberate methodological adaptations were introduced in order to align the model with the properties of the available data and the requirements of the forecasting task. These adaptations should not be interpreted as minor implementation details, but as explicit modeling and training decisions motivated by limited data availability, heterogeneous noise characteristics, and the need for stable prediction [23].
A single-layer LSTM architecture with 32 memory units was applied (see Figure 3), followed by a fully connected output layer whose dimensionality matched the selected forecast horizon. The objective was not to maximize model complexity, but to evaluate whether a compact and uniformly specified architecture can reliably learn different temporal structures across items with varying levels of noise and regularity.
For the forecasting process, training samples were generated using a sliding window approach. In the weekly setting, a window of 12 past observations was used to predict the subsequent 12 weeks. This configuration was selected after repeated experimental calibration, as it consistently provided the most stable validation performance across the examined items. Notably, this result was somewhat unexpected, since the moving-average optimization did not identify 12 weeks as an optimal smoothing horizon for any item, indicating that the effective memory length of the LSTM does not necessarily coincide with classical statistical parameters. It should also be emphasized that the optimal window size is data-dependent and may differ for other datasets.
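The sliding-window sample generation described above can be sketched as follows (a minimal illustrative implementation, not the study's exact code):

```python
def make_windows(series, n_in=12, n_out=12):
    """Build (input, target) pairs via a sliding window.

    Each sample uses n_in past observations to predict the next
    n_out values, matching the 12-in / 12-out weekly configuration.
    """
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in:i + n_in + n_out])
    return X, y
```

The window advances one step at a time, so consecutive samples overlap heavily, which is the usual convention for small time series datasets.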
Due to the sequential nature of time series data, all samples were split strictly in chronological order rather than randomly. The first 60 percent of the observations were used for training, the subsequent 20 percent for validation, and the remaining 20 percent for testing. This separation ensured that model evaluation reflected realistic forecasting conditions and prevented information leakage from future observations.
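The strictly chronological 60/20/20 partition can be expressed as a simple index-based split (an illustrative sketch under the proportions stated above):

```python
def chrono_split(samples, train=0.6, val=0.2):
    """Split samples strictly in time order.

    The first 60% of samples form the training set, the next 20% the
    validation set, and the remainder the test set. No shuffling is
    applied, so no future information leaks into earlier partitions.
    """
    n = len(samples)
    i_train = int(n * train)
    i_val = int(n * (train + val))
    return samples[:i_train], samples[i_train:i_val], samples[i_val:]
```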
Modifications and Adaptations of the LSTM Method
Although this configuration follows a standard LSTM modeling framework, several deliberate methodological adaptations were introduced to better align the model with the characteristics of the data and the requirements of the forecasting task.
The first and most substantial modification is the introduction of a moving average representation as a second input dimension alongside the raw time series (Xt = [Xt^raw, Xt^MA]). This modification represents a data representation strategy rather than a change to the LSTM cell itself. The moving average was computed using a strictly causal rolling window, ensuring that only past information was used and preventing any form of information leakage. By explicitly providing a smoothed trend component, the network was relieved from learning basic low-frequency structure directly from noisy observations. This is particularly important in small datasets with large outliers, where jointly learning trend and volatility can lead to unstable parameter updates and poor generalization.
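A causal moving-average feature of this kind can be sketched in pure Python as follows (illustrative; the window length is a free parameter):

```python
def causal_moving_average(series, window):
    """Trailing moving average using only current and past values.

    At the start of the series, the window shrinks rather than
    looking ahead, so no future information leaks into the feature.
    """
    out = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        segment = series[start:i + 1]
        out.append(sum(segment) / len(segment))
    return out

def stack_features(series, window):
    """Pair each raw value with its causal moving average: [x_t, MA_t]."""
    ma = causal_moving_average(series, window)
    return [[x, m] for x, m in zip(series, ma)]
```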
Second, the Dropout layer was intentionally omitted from the network. In recurrent architectures, dropout acts as an explicit regularization mechanism by randomly suppressing activations during training, which can be beneficial in large or highly expressive models. In the present setting, however, the model capacity was intentionally kept low and no systematic divergence between training and validation error was observed under time-ordered validation. Under these conditions, additional regularization would unnecessarily constrain the representational capacity of the network. Preserving full information flow through the LSTM cells was therefore preferred, allowing the model to converge more accurately to subtle temporal dependencies present in structured items.
Third, a batch size of one was applied during training, deviating from common mini batch conventions. This choice results in highly noisy gradient updates, as each parameter update is based on a single sequence. From an optimization perspective, this stochasticity introduces implicit regularization, which can improve generalization when learning from small datasets. At the same time, such training dynamics can be unstable, particularly in recurrent networks. To ensure convergence, careful learning rate selection and early stopping were employed. This configuration allowed the model to explore flatter regions of the loss landscape without relying on explicit regularization layers.
Finally, a direct multi-step forecasting strategy was adopted, in which the output layer generates the full forecast horizon in a single forward pass. This approach differs fundamentally from recursive one step forecasting, where predictions are iteratively fed back into the model. Recursive strategies are known to suffer from error accumulation, as small inaccuracies propagate and amplify over time. By predicting all future steps simultaneously, this source of drift was reduced. While direct multi output forecasting introduces the challenge of jointly optimizing multiple horizons within a single loss function, it provides a more controlled and interpretable evaluation of long-range predictive behavior.
Taken together, these methodological adaptations form a coherent and purpose-driven LSTM framework tailored to the structure and constraints of the studied data. They directly influence learning dynamics, stability, and forecast behavior, and they provide a transparent basis for the comparative analysis presented in the subsequent sections.
4. Training the LSTM Model
In this chapter, the training procedure of the LSTM model is described in detail. The objective of the training phase was to ensure stable convergence while allowing the model to fully exploit its capacity to learn temporal patterns from the available data. Emphasis was placed on robustness and generalization rather than architectural complexity.
For this study, a single-layer LSTM architecture with 32 memory units was applied, followed by a fully connected Dense output layer whose dimensionality matched the selected forecast horizon. This compact architecture was chosen deliberately, as the goal was not to maximize depth or parameter count, but to evaluate how effectively a simple and uniform LSTM configuration can learn different temporal structures across items. The model was optimized using the Adam optimizer (with a 0.001 learning rate, β1 = 0.9, β2 = 0.999, ε = 1 × 10^−8) and trained with mean squared error as the loss function, which is well suited for continuous forecasting tasks.
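Under these specifications, the architecture can be sketched in Keras as follows. This is a minimal reconstruction from the description above, not the authors' exact code; in particular, N_FEATURES assumes the two-channel input (raw value plus causal moving average) described earlier:

```python
import numpy as np
import tensorflow as tf

WINDOW = 12      # past observations per input sample
HORIZON = 12     # forecast steps produced in one forward pass
N_FEATURES = 2   # raw value plus its causal moving average (assumption)

def build_model():
    """Compact single-layer LSTM with a direct multi-step output head."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(WINDOW, N_FEATURES)),
        tf.keras.layers.LSTM(32),        # 32 memory units, no dropout layer
        tf.keras.layers.Dense(HORIZON),  # all 12 future steps at once
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8),
        loss="mse",  # mean squared error for continuous targets
    )
    return model
```

The Dense head emitting all twelve steps in one pass implements the direct multi-step strategy, avoiding the error accumulation of recursive one-step forecasting.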
Training was performed using an early stopping strategy in order to prevent unnecessary epochs and reduce the risk of overfitting. Validation loss was monitored throughout training, and training was terminated when no improvement was observed for 20 consecutive epochs. The model parameters corresponding to the lowest validation error were automatically restored. A maximum of 800 epochs was allowed, although early stopping typically resulted in significantly shorter effective training durations.
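This early stopping policy maps directly onto the standard Keras callback (variable names in the commented fit call are placeholders, not the study's identifiers):

```python
import tensorflow as tf

# Early stopping as described: monitor validation loss, stop after 20
# epochs without improvement, and restore the best-performing weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=20,
    restore_best_weights=True,
)

# Illustrative fit call with the stated epoch cap and batch size:
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=800, batch_size=1, callbacks=[early_stop])
```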
A batch size of one was used during training. This choice ensured that parameter updates were performed after each individual sequence, preserving the temporal structure of the data and introducing beneficial stochasticity into the optimization process. Model performance during training was evaluated using a chronologically separated validation set, ensuring that only past information was used to predict future observations.
In the following subsections, the trained model is evaluated separately for each item. Results are presented with particular focus on learning dynamics, forecast behavior, and error characteristics. Item A serves as an initial reference case, as it represents a highly structured and minimally noisy time series. For this item, training curves converge rapidly, and both weekly and monthly forecasts closely follow the observed values. The model achieved R2 values close to 1.00, indicating very high explanatory power, and the prediction errors remain small and uniformly distributed across the forecast horizon. This behavior illustrates an ideal scenario in which the LSTM model can fully exploit repeating temporal patterns and produce stable forecasts.
4.1. Items A and B Forecasting: Easily Learnable Time Series
The visual results further confirm the strong learning behavior observed for Item A. As shown in
Figure 4, the loss curves display a smooth and monotonic decrease for both training and validation error, with the two curves closely aligned throughout the training process. This indicates stable convergence and the absence of overfitting, as no divergence between training and validation loss is observed. The rapid reduction in mean squared error in the early epochs suggests that the dominant temporal structure is learned efficiently, while the gradual flattening at later epochs reflects fine scale adjustment rather than unstable optimization.
The prediction error heatmap in Figure 5 provides a compact and informative representation of forecast errors across both time and horizon steps. Each row corresponds to a test sequence, while each column represents a forecast step within the horizon. For Item A, the heatmap is dominated by near zero values with alternating but balanced positive and negative errors. This pattern indicates that errors are small, unbiased, and consistent across horizons. The absence of systematic drift or horizon dependent degradation demonstrates that the direct strategy is effective for this structured series.
The direct comparison of weekly observations and forecasts of item quantity shown in Figure 6 illustrates that the predicted values closely follow the cyclical pattern of the original series. The forecasted segment continues the established periodic structure without phase shift or amplitude distortion. This alignment confirms that the model has learned the underlying generative pattern rather than merely extrapolating recent values. Together, these visualizations provide consistent evidence that Item A represents an ideal scenario for LSTM-based forecasting.
Item A represents an ideal case for LSTM-based forecasting, as the time series exhibits a stable and near deterministic repeating structure. Under such conditions, temporal dependencies are consistently preserved, allowing the model to learn and reproduce the underlying pattern with high accuracy. It is observed that reliable forecasts can be achieved even when only a limited portion of the historical data is available for training. This result confirms that, when strong regularity and low noise are present, the LSTM model is able to generalize effectively and capture the essential dynamics of the series.
Item B also exhibits a recurring temporal structure, although it is characterized by frequent alternation between zero values and sharp peaks. Despite these abrupt transitions, the repetition pattern remains sufficiently consistent for the model to learn. The results show that peak amplitudes are approximated with high accuracy, indicating that the dominant signal strength is well captured. Forecasts on the monthly scale closely approximate the observed values, while weekly predictions occasionally display a slight phase shift relative to the observed series. Throughout training, convergence remains stable and no signs of overfitting are observed.
Figure 7 confirms that the model can handle discontinuous yet deterministic patterns effectively.
4.2. Items C, D and E Forecasting: Low Predictability and Weakly Structured Time Series
Item C represents a high-amplitude and weakly structured time series, with weekly values ranging from very small magnitudes to extreme peaks close to one thousand. Such behavior introduces severe instability and strong outliers, which significantly complicate the learning process. As shown in Figure 8, the model responds by applying pronounced smoothing, resulting in forecasts that capture only the general direction of the series while failing to reconstruct individual peaks. Even after temporal aggregation, performance remains poor, as the underlying irregularity persists despite reduced noise. This behavior is reflected in consistently negative R2 values, indicating that predictions are inferior to simple baseline estimates. Overall, no stable or repeatable pattern can be identified that would allow reliable learning, and the results demonstrate that the model is not suitable for forecasting data dominated by extreme variability and weak structural regularity.
Item D exhibits a moderate level of noise combined with occasional periodic behavior. While the series is less erratic than Item C, the underlying structure remains weak and inconsistent, which limits the ability of a generic LSTM model to generalize effectively. As illustrated in Figure 9, the results show systematic behavior, as peaks are consistently underestimated while troughs tend to be overestimated. This indicates a regression toward the mean rather than precise pattern reproduction. Forecasts at the monthly level appear excessively smoothed, suggesting that aggregation further suppresses already weak structural signals. The coefficient of determination fluctuates around zero, confirming that predictive performance remains close to that of a naïve baseline. Overall, the model achieves only limited accuracy for Item D, as the temporal pattern is not sufficiently strong or stable to support a reliable learned representation.
Item E represents the most challenging case among all analyzed series due to its extremely erratic and highly volatile behavior. Weekly values fluctuate roughly between 300 and 1400, while monthly aggregates reach ranges of approximately 800 to 3800, leaving little consistent structure for the model to exploit. As shown in Figure 10, the model fails to reproduce extreme values and instead collapses toward overly smoothed estimates. Prediction errors remain large and unstable, and forecast quality deteriorates further as peaks dominate the series. Monthly level forecasts appear even more distorted, as aggregation amplifies the influence of extreme observations rather than stabilizing the pattern. This behavior is reflected in drastically negative R2 values, reaching as low as minus seven, indicating performance far below simple baseline expectations. Overall, Item E demonstrates that when a time series is dominated by irregular spikes and lacks repeatable structure, reliable forecasting is not achievable with the applied modeling approach.
This chapter shows that the performance of the applied LSTM model is strongly driven by the internal characteristics of the time series, including noise intensity, regularity, and the presence of stable temporal patterns. Although an identical model architecture and training procedure were used for all items, markedly different learning behaviors and outcomes were observed.
Table 3 summarizes the training results across all items and provides a concise overview of how the LSTM model responds to varying data structures.
Table 4 reveals a clear contrast between structured and unstructured series. Items A and B achieve R2 values approaching unity and very low error levels, although training times are longer due to the presence of strong and learnable temporal structure. In contrast, Items C and E converge quickly but yield poor predictive performance, as reflected by extremely high error values and negative R2 scores. Item D shows intermediate behavior, with modest improvement over baseline and limited explanatory power. These results emphasize that training time alone does not indicate model success, and that meaningful learning occurs only when sufficient structure is present in the data.
All experiments were implemented in Python 3.12 using standard scientific computing libraries. Model development and training were carried out using TensorFlow 2.20.0. The experiments were executed in a CPU-based environment, and consistent results were ensured through fixed random seeds. The hardware comprised an AMD Ryzen 7 5700X (8-core, 4.65 GHz) CPU, 32 GB of DDR4 RAM, and an NVIDIA GeForce RTX 4060 Ti GPU (8 GB VRAM). To support reproducibility and transparency, the complete source code and the datasets used in this study are publicly available at:
https://github.com/Strigula93/LSTM_PO_FC, accessed on 1 January 2026.
5. Comparative Analysis of Forecasting Models
In this chapter, the previously presented modified LSTM-based forecasting model is systematically compared with classical statistical forecasting methods, including ARIMA. The comparison is conducted under identical data and evaluation conditions in order to assess relative performance and to examine how model behavior differs across time series with varying structural properties.
In Table 5, the original observed values from week 101 to week 110 are presented for all items. These data points serve as the evaluation window and are used as the target values for forecasting with the alternative statistical methods. Forecast performance is assessed by examining how closely each method reproduces the observed values over this period.
The first method considered is the simplest baseline, the naïve forecasting approach, which is presented in Table 6. Using this forecasting method, the following predicted values were obtained.
Table 7 presents the results obtained using the moving average method. The moving average window length was determined using Excel’s moving average function, which selects the number of past observations (reported as MA n.) that minimizes the overall deviation from the actual values. This procedure identifies the optimal averaging horizon by evaluating how different window sizes affect the total forecasting error relative to the original data.
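The spreadsheet-based window calibration can be replicated programmatically. The sketch below is an illustrative analogue of that procedure, using total absolute deviation as the error criterion (an assumption; the exact deviation measure used in Excel is not specified above):

```python
def moving_average_forecast(series, n):
    """One-step-ahead forecast: average of the previous n observations."""
    return [sum(series[i - n:i]) / n for i in range(n, len(series))]

def best_window(series, candidates=range(2, 13)):
    """Pick the window length minimising total absolute deviation,
    mirroring the spreadsheet-based calibration described above."""
    def total_error(n):
        preds = moving_average_forecast(series, n)
        actual = series[n:]
        return sum(abs(a - p) for a, p in zip(actual, preds))
    return min(candidates, key=total_error)
```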
Table 8 presents the forecasting results obtained using simple exponential smoothing. The smoothing parameter alpha was optimized separately for each item using the built-in Excel Solver, with the objective of minimizing the overall forecasting error relative to the observed values.
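Simple exponential smoothing and the Solver-style alpha calibration can be sketched as follows (a grid search stands in for Excel Solver here, which is an assumption about the optimization mechanics, and the error criterion is one-step squared error):

```python
def ses_forecast(series, alpha):
    """Simple exponential smoothing: level = alpha*x + (1-alpha)*level.

    Returns the final level, which serves as the flat forecast for
    all future steps.
    """
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def fit_alpha(series, grid=None):
    """Grid-search alpha minimising one-step squared error (Solver analogue)."""
    grid = grid or [i / 100 for i in range(1, 100)]
    def sse(alpha):
        level, err = series[0], 0.0
        for x in series[1:]:
            err += (x - level) ** 2
            level = alpha * x + (1 - alpha) * level
        return err
    return min(grid, key=sse)
```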
Table 9 presents the forecasting results obtained using the Holt exponential smoothing method. Both the alpha and beta parameters were optimized using the Excel Solver in order to minimize forecasting error. As can be observed from the table and will be further confirmed by the graphical results, this method provides the strongest performance among the simple statistical forecasting approaches considered.
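Holt's method extends simple exponential smoothing with an explicit trend component, which is what allows it to track drifting series. A minimal sketch of the standard recursions (alpha and beta would be fitted per item, as described above):

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear trend method.

    level_t = alpha*x_t + (1-alpha)*(level_{t-1} + trend_{t-1})
    trend_t = beta*(level_t - level_{t-1}) + (1-beta)*trend_{t-1}
    Forecast h steps ahead: level_T + h * trend_T.
    """
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]
```

On a perfectly linear series the recursions lock onto the slope, so the forecast simply continues the line.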
Table 10 presents the forecasting results generated using the ARIMA method, implemented through an existing Python library (Python 3.14.3). The selected (p, d, q) parameter values for each item are reported at the end of the corresponding rows in the table.
Table 11 presents the forecasting results obtained using the previously described LSTM model, evaluated over the same weekly horizon as the statistical methods to ensure a consistent and fair comparison.
For comparison purposes, as previously described, the evaluation is based on the sum of deviations between the normalized actual values from Table 4 and the corresponding forecasts over the ten predicted weeks. This aggregated measure reflects how closely each method reproduces the observed behavior across items. Smaller values indicate better forecasting performance. The resulting comparison values are reported in Table 12.
Table 12 presents a compact and consistent overview of forecast performance across all five items based on the aggregated normalized error over the ten-week evaluation horizon, where smaller values indicate more accurate forecasts. This metric can be interpreted in practical terms by noting that the evaluation spans ten weeks. By scaling the aggregated error accordingly, an approximate percentage of the average weekly deviation can be obtained, allowing the results to be expressed in an intuitive and operationally meaningful form.
For instance, in the case of Item A, the LSTM model yields an aggregated error of 0.3070, which corresponds to an average weekly deviation of roughly 3 percent. This implies that the predicted values remain very close to the actual observations on a week-by-week basis. In contrast, the statistical methods produce aggregated errors in the range of 2 to 4 for the same item, corresponding to average weekly deviations of approximately 20 to 40 percent, which indicates a substantially lower level of precision.
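The conversion behind this interpretation is a simple rescaling: because the series are normalized to [0, 1], dividing the aggregated error by the ten evaluation weeks yields an average weekly deviation as a fraction of the peak value. A one-line sketch (function name is illustrative):

```python
def avg_weekly_deviation_pct(aggregated_error, weeks=10):
    """Convert an aggregated normalized error over `weeks` weeks into
    an average weekly deviation, expressed as a percentage of the
    item's peak value (series are scaled to [0, 1])."""
    return 100 * aggregated_error / weeks
```

For Item A's LSTM result, 0.3070 over ten weeks gives roughly 3.07 percent per week.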
For Items A and B, the modified LSTM model clearly outperforms all statistical alternatives by a wide margin. The markedly lower error values demonstrate that the strong periodic structure present in these series is effectively learned and reproduced by the model. Traditional methods, while internally consistent, are unable to achieve comparable accuracy for these items. This performance difference is further illustrated in Figure 11 for Item A, where the alignment between predicted and observed values is visually apparent.
For Items C, D, and E, a different behavior is observed in Table 12 and Figure 12. These series are characterized by weaker structure and higher noise levels, and the modified LSTM does not achieve the lowest error. Instead, the Holt exponential smoothing method yields the best performance across all three items. This suggests that for irregular or moderately structured series, simpler statistical models with explicit trend handling can be more robust than learning-based approaches.
Across all items, the naïve method consistently produces the highest error values, confirming its role as a baseline rather than a competitive forecasting solution. Moving average and simple exponential smoothing methods generally improve upon the naïve baseline but remain inferior to Holt or LSTM, depending on the data structure.
ARIMA is the most complex statistical method included in the comparison, yet its performance does not consistently exceed that of simpler approaches and is often matched or surpassed by Holt exponential smoothing. This confirms that higher model complexity alone does not guarantee improved accuracy. It is noteworthy that the LSTM performs best on items where the ARIMA differencing parameter d equals zero or one, indicating stable or only mildly nonstationary behavior, suggesting that learning-based models benefit most when temporal structure is largely preserved.
Overall, the table demonstrates that no single forecasting method is optimal for all item types. The results highlight a clear dependence of model performance on data characteristics, with learning-based approaches excelling on highly structured series and classical statistical methods proving more reliable in noisy or weakly structured scenarios.
6. Discussion
The results of this study demonstrate that forecasting effectiveness is strongly determined by the structural properties of the underlying time series. The applied LSTM framework performs exceptionally well when clear periodicity, low noise, and stable temporal patterns are present, confirming that learning-based models can generalize repeating dynamics even from relatively limited historical data. In contrast, for weakly structured or highly volatile series dominated by outliers, LSTM struggles to form a stable representation, and simpler statistical methods, particularly Holt exponential smoothing, remain more robust and reliable. This pattern is well aligned with findings from large-scale forecasting benchmarks and comparative studies, which consistently report that the relative advantage of learning-based models depends strongly on the presence of stable and repeatable temporal structure. Evidence from forecasting competitions and industrial case studies indicates that neural network models tend to outperform classical approaches primarily when seasonality, trend persistence, or deterministic repetition is clearly embedded in the data, while simpler exponential smoothing techniques often remain competitive or superior for noisy and weakly structured series. The present results therefore reinforce the view that forecasting performance is driven more by data characteristics than by model class alone.
Several limitations should be acknowledged. The dataset is limited in size and scope, and a single unified LSTM configuration was applied without extensive hyperparameter optimization or deeper architectures. In addition, the forecasting horizon is relatively long for unstructured series, which amplifies uncertainty and degrades accuracy. External explanatory variables were also not incorporated, although they may significantly influence real-world behavior. From a learning-theoretical perspective, the observed behavior of the LSTM model on highly volatile series can be interpreted as a regression-to-the-mean effect. When trained using a squared-error loss function on data with a low signal-to-noise ratio, neural networks tend to suppress extreme fluctuations and converge toward smoothed representations of the conditional mean. This effect is further amplified in small-sample settings, where limited historical depth restricts the model’s ability to distinguish between persistent structure and stochastic variation. As a consequence, the LSTM forecasts systematically underestimate peaks and overestimate troughs, leading to poor explanatory power despite stable numerical convergence.
These findings also have direct implications for operational decision support. In planning environments such as procurement, production scheduling, and inventory control, forecast robustness and stability can be more valuable than marginal improvements in point accuracy. While learning-based models may offer superior performance for highly structured items, their sensitivity to noise can introduce additional uncertainty in downstream decisions. In contrast, classical methods such as Holt exponential smoothing, despite their simplicity, often produce smoother and more predictable forecasts that can lead to more stable inventory and capacity decisions when demand patterns are irregular. Based on these findings, several practical improvements are suggested. Forecast performance for noisy series could benefit from stronger preprocessing, including noise filtering, outlier handling, or smoothed representations. More expressive architectures, such as CNN–LSTM hybrids, attention-based models, or Transformer variants tailored for time series, may further enhance learning. Shorter forecasting horizons would likely improve accuracy for irregular data, and item-specific model configurations could better capture heterogeneous demand patterns.
The comparative analysis suggests that no single forecasting approach can be considered universally optimal. Instead, effective forecasting requires aligning model choice and configuration with observable data properties such as regularity, volatility, and noise intensity. Learning-based models provide substantial benefits when these properties support pattern generalization, while classical statistical methods remain indispensable baselines for poorly structured and highly uncertain series. Overall, the study demonstrates that LSTM models are highly effective for time series with predictable structure, but their performance is inherently limited when noise and irregular fluctuations dominate. These results highlight the importance of aligning model choice and configuration with measurable data characteristics rather than relying on model complexity alone.
7. Conclusions
The quantitative evaluation confirms that forecasting methods differ substantially in how accurately they reproduce observed weekly values, and these differences can be interpreted consistently using relative deviations. For time series with strong regularity, clear periodic behavior, and low noise, the LSTM model achieves near zero mean squared error and coefficients of determination close to one. In practical terms, this corresponds to an average weekly deviation of only a few percent, even when results are aggregated over the full ten-week evaluation horizon. These results confirm that forecasting accuracy gains achieved by learning-based models are not uniform across all demand patterns. Instead, they emerge primarily when the underlying time series exhibits sufficient regularity, persistence, or periodic repetition to support generalization. In such cases, the LSTM model is capable of learning stable temporal dependencies and delivering highly accurate forecasts even under limited data availability. This finding supports the growing consensus in the forecasting literature that model effectiveness is conditional rather than universal.
For classical statistical methods, deviations are generally larger. Simple baselines such as naïve and moving average forecasting typically result in average weekly deviations in the range of 10 to 20 percent, which accumulate proportionally across the evaluated weeks. Simple exponential smoothing reduces these deviations moderately, while Holt exponential smoothing achieves the lowest errors among the statistical approaches, often reaching single-digit percentage deviations per week for moderately structured series. The consistently strong performance of Holt exponential smoothing across moderately structured and noisy series highlights the continued relevance of classical statistical forecasting methods in operational contexts. Despite their simplicity, these approaches provide stable and interpretable forecasts that are less sensitive to extreme observations and random fluctuations. This robustness makes them particularly suitable for short-term planning tasks, where forecast stability can be as important as point accuracy.
For highly irregular and noise-dominated series, all methods exhibit substantial discrepancies. In these cases, Holt exponential smoothing consistently limits the deviation more effectively than ARIMA or the LSTM model. ARIMA, despite its higher complexity, does not provide a systematic accuracy advantage over simpler statistical techniques. The results further indicate that increased model complexity does not automatically translate into improved forecasting performance. Although ARIMA models incorporate a richer linear structure than exponential smoothing, their accuracy advantages remain limited in highly volatile settings. This observation reinforces the principle that forecasting models should be selected based on data properties and decision requirements rather than theoretical sophistication alone.
From a practical perspective, the findings suggest several actionable guidelines. Learning-based models such as LSTM are most appropriate for items with strong and repeatable temporal structure, where their ability to capture nonlinear dependencies provides a clear advantage. For items dominated by noise or irregular fluctuations, simpler statistical methods remain preferable due to their robustness and transparency. Consequently, forecasting systems should incorporate data-driven criteria for model selection rather than relying on a single method across all items. Overall, the results show that learning-based models deliver substantially lower total error when a strong and stable temporal structure is present. Within the examined dataset, the LSTM model performs excellently on two items and very well on two additional items out of five, demonstrating that it can achieve low percentage deviations in most cases when sufficient regularity exists. At the same time, the findings reinforce that classical statistical methods remain more suitable for series dominated by noise or weak structure.