Article

A Hybrid Prediction Model Using Statistical Forecasters and Deep Neural Networks

by Renan Otvin Klehm 1, Wemerson Delcio Parreira 2,*, Rudimar Luís Scaranto Dazzi 1, Anita Maria da Rocha Fernandes 1, David Cruz García 3 and Gabriel Villarrubia González 3
1 Polytechnic School, University of Vale do Itajaí (UNIVALI), Uruguai St. n.458, Itajai 88302-901, SC, Brazil
2 Faculty of Electrical Engineering, Polytechnic School, Pontifical Catholic University of Campinas, Professor Doutor Euryclides de Jesus Zerbini St. n.1516, Campinas 13087-571, SP, Brazil
3 Expert Systems and Applications Lab, Faculty of Science, University of Salamanca, Plaza de los Caídos s/n, 37008 Salamanca, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12393; https://doi.org/10.3390/app152312393
Submission received: 10 September 2025 / Revised: 26 October 2025 / Accepted: 20 November 2025 / Published: 21 November 2025

Abstract

The ability to accurately predict future time series behavior in multiple steps, known as multi-horizon forecasting, is a vital aspect in various industries, including retail sales, energy consumption, server load, healthcare, weather, and others. We have proposed, in this paper, the use of statistical forecasters as covariates in a Deep Neural Network (DNN) model and evaluated its impact on forecast metrics. Our analysis covered four diverse datasets: M5, Stallion, Stock Market, and Synthetic. The results demonstrated that the inclusion of statistical predictors in the DNN model led to varying degrees of improvement in forecast performance, depending on the dataset and the chosen evaluation metric. In general, our findings suggest that incorporating statistical prediction as a covariate can be a valuable approach to improving multi-horizon prediction, especially in scenarios with data scarcity and intermittency. The hybrid model achieved consistent improvements, particularly on Symmetric Mean Absolute Percentage Error (SMAPE) across datasets, with statistically significant gains on synthetic and stock market series. Specifically, SMAPE was reduced by approximately 33% on synthetic and stock market datasets, by 15–20% on Stallion, and by around 6% on M5. These results confirm that integrating statistical forecasts as covariates can substantially enhance predictive accuracy, especially for volatile or synthetic series.

1. Introduction

Time series forecasting is a critical task across multiple domains, including finance, retail, energy, and healthcare, where accurate predictions enable informed decision-making, strategic planning, and operational optimization [1,2]. Despite its relevance, forecasting real-world time series remains highly challenging due to intermittency, irregular patterns, and limited historical data [3,4,5]. These difficulties make traditional forecasting techniques prone to inaccuracies in practical applications.
Classical statistical models, such as Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ETS), have been extensively used for decades [6,7,8]. They remain attractive because of their interpretability, efficiency, and ability to model seasonality and trend in structured datasets. Recent advances, including automated procedures like AutoARIMA and AutoETS, further reinforce their applicability [7,8]. However, these models typically operate on one series at a time and often fail to capture the nonlinear dependencies present in complex or noisy data [9].
In contrast, machine learning and deep neural network (DNN) approaches—such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and transformer-based architectures—have demonstrated strong performance in capturing intricate temporal dependencies and feature interactions [9,10,11,12,13,14,15,16]. These models can learn representations directly from raw data without explicit feature engineering [17,18,19]. Nevertheless, they are computationally intensive, data-hungry, and susceptible to overfitting, especially in scenarios with scarce or intermittent data [20,21,22].
The limitations of both paradigms have motivated the development of hybrid and ensemble methods that combine statistical and deep learning approaches. Evidence from large-scale forecasting competitions, such as the M4 and M5, has shown that hybrid solutions frequently outperform single-model methods [23,24]. These results indicate that statistical models provide robustness and interpretability, while neural networks contribute to learning complex and nonlinear dynamics, making them complementary [25,26].
Motivated by these findings, this paper proposes a hybrid forecasting framework that incorporates statistical forecasters as covariates in a deep neural network. The novelty lies in a multibranch architecture that learns a weighted average of the original series and its statistical forecasts, allowing the network to dynamically emphasize the most informative components. The alternative hypothesis ($H_a$) is that including statistical covariates improves forecasting accuracy compared to using a pure DNN, while the null hypothesis ($H_0$) states that no significant improvement is achieved.
The contributions of this study are as follows:
i.
We introduce a hybrid deep neural network that integrates ARIMA, ETS, and Linear Regression as covariates alongside the original time series.
ii.
We evaluate the model on four datasets (M5, Stallion, Stock Market, and Synthetic), each representing different forecasting challenges such as intermittency, volatility, and nonlinearity.
iii.
We perform statistical validation using paired t-tests to assess the significance of improvements across multiple error metrics (Mean Absolute Error–MAE, Mean Squared Error–MSE, and Symmetric Mean Absolute Percentage Error–SMAPE).
iv.
We analyze the implications of incorporating statistical covariates, highlighting their role in improving relative accuracy, robustness, and stability of forecasts.
The paper is structured into five sections. Section 2 reviews previous research on statistical and neural forecasting approaches. Section 3 presents the preprocessing pipeline and the proposed hybrid model. Section 4 discusses the experimental setup and results, followed by Section 5, which concludes the study and outlines future research directions.

2. Overview of Statistical Models and Related Works

In this section, we provide a comprehensive overview of previous work on time series forecasting and how it has influenced the development of this study.

2.1. Statistical Models

Statistical forecasting methods have been widely used for decades and remain a fundamental approach in time series prediction. Classical models such as ARIMA [7,27] and ETS [28] are among the most frequently applied techniques due to their interpretability, computational efficiency, and ability to capture trend and seasonality in structured data. In addition, specialized approaches such as Croston’s method [29] and its variants have been extensively employed for intermittent demand forecasting [4,30].
Recent developments include automated frameworks such as AutoARIMA and AutoETS [31], as well as scalable open-source libraries such as statsforecast and Prophet [32], which allow practitioners to apply statistical forecasting across large collections of series with minimal manual tuning. These tools have kept statistical methods relevant in both academic research and industrial applications, particularly in retail, energy, and finance.
Despite their advantages, statistical models often struggle with highly nonlinear or noisy series, and they generally operate on one series at a time, which limits scalability in big-data scenarios. Nevertheless, they remain essential in forecasting competitions such as M3, M4, and M5 [33,34,35], where statistical methods frequently appear in top-performing hybrid and ensemble models.
In the context of this study, statistical forecasters such as ARIMA, ETS, and Linear Regression are not only considered as baselines but are also integrated as covariates into the deep learning model. This design leverages their complementary strengths, combining the interpretability and robustness of statistical approaches with the representation power of neural networks.

2.2. Deep Neural Network Models

Deep Neural Network models have gained significant popularity in time series forecasting due to their ability to capture complex patterns and relationships within data [9]. Unlike statistical models, DNN models can effectively handle noisy, nonlinear, and nonstationary datasets [14]. Various types of DNN architecture have been applied to time-series forecasting, including RNNs, LSTM networks [15], CNNs, and models based on transformers [16].
In particular, RNNs are well-suited for sequential data and have been widely used in forecasting tasks [36]. As observed by [37], LSTMs tend to perform better compared to other DNN architectures, mainly because they can learn time-dependent features without suffering from the vanishing gradient problem, unlike vanilla RNNs.
One of the main advantages of DNN models is their ability to learn and train on different time series [38]. This makes them suitable for scenarios where a single model is needed, which can predict a wide variety of time series [9]. Despite their advantages, DNN models can be computationally intensive and require a large dataset to achieve optimal performance [39,40,41]. Another problem is overfitting, where the model memorizes the training data and performs poorly on unseen data [42].
DNN models have been applied in medicine [43], particularly to medical time series [44], as well as in violence recognition [45], power systems [20,46,47], and industrial applications [48,49]. The use of hybrid methods for time series forecasting is also increasing [23], given their potential to denoise high-frequency components [25] and to combine the advantages of several techniques [26]; ensemble learning methods stand out in this regard [50].
Beyond the classical architectures such as RNNs, LSTMs, and CNNs, recent years have witnessed the development of a new generation of state-of-the-art models for time series forecasting. The Temporal Fusion Transformer (TFT) introduced by [10] combines sequence-to-sequence attention with interpretable variable selection, achieving competitive performance in multi-horizon forecasting across diverse domains. Similarly, N-BEATS [51] and its successor N-HiTS [52] represent deep fully connected architectures specifically designed for univariate and multivariate time series forecasting, providing both accuracy and interpretability in large-scale benchmark competitions such as M4 and M5. Another influential contribution is DeepAR [9], a probabilistic autoregressive recurrent model that enables scalable forecasting across large collections of related time series.
In addition to purely neural approaches, hybrid and modular models such as Prophet [32] have gained popularity in practice due to their ability to decompose seasonality, trend, and holidays in a flexible and interpretable framework, offering an alternative for business and industrial applications where transparency is crucial. These developments demonstrate the breadth of available architectures, highlighting that the forecasting landscape extends far beyond traditional CNNs, RNNs, and LSTMs. In this context, our proposed hybrid model contributes by bridging statistical forecasting methods and deep neural networks, aiming to combine interpretability and robustness with the representational capacity of modern deep architectures.

2.3. Dropout

To address the problem of overfitting, most DNN models apply a regularization technique called Dropout, which randomly sets input elements to zero with a specified probability ($\lambda$) [53]. The remaining elements are multiplied by $1/(1-\lambda)$ so that the expected sum of the elements is preserved. The primary purpose of the dropout layer is to prevent overfitting and improve the generalizability of the model, in other words, to improve the accuracy of the model when forecasting unseen data [54]. This works because randomly zeroing inputs prevents the model from relying too heavily on any specific input element.
As discussed in [55], most DNN models use dropout layers only during the training phase and leave them inactive during inference. This reduces the probability of overfitting and improves generalization while keeping inference deterministic. However, another way of using dropout is to place it after each layer of the DNN and leave it active during inference. This technique is often referred to as Monte Carlo Dropout. As a result, after running the model forward several times, we get a distribution of predictions centered on the model’s mean forecast. This is due to the random nature of dropout [55].
Figure 1 shows that the one-step-ahead prediction obtained from the stacked LSTM model with interleaved dropout layers follows an approximately Gaussian distribution. This result was obtained from 1000 stochastic forward passes using the same input sequence, allowing the estimation of the model’s predictive uncertainty and providing insight into its confidence in the generated forecasts.
The spread of this Gaussian distribution could be interpreted as the uncertainty of the model [55]. In other words, it would represent how confident the model is about its prediction. We have incorporated this technique into the proposed model, not only because it significantly improves the generalization ability of the model [54], but also because it makes it easier to interpret the predictions.
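The mechanism described in this subsection can be illustrated with a minimal NumPy sketch. It is not the stacked LSTM used in the paper: a toy linear map stands in for the network, and the input, weights, dropout rate, and function names are all illustrative assumptions. It only shows the rescaling by $1/(1-\lambda)$ and how repeated stochastic passes yield a mean forecast and a spread.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero each element with probability `rate`, then
    rescale the survivors by 1 / (1 - rate) so the expected value is unchanged."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

def mc_dropout_predict(x, weights, rate, n_passes, rng):
    """Run `n_passes` stochastic forward passes with dropout left active."""
    samples = np.array([dropout(x, rate, rng) @ weights for _ in range(n_passes)])
    return samples.mean(), samples.std()

rng = np.random.default_rng(0)
x = np.ones(64)                   # toy input window (hypothetical)
weights = np.full(64, 1.0 / 64)   # toy linear "model" standing in for the LSTM stack
mean_pred, spread = mc_dropout_predict(x, weights, rate=0.2, n_passes=1000, rng=rng)
deterministic_pred = x @ weights  # prediction with dropout switched off
```

Under these assumptions, `mean_pred` stays close to `deterministic_pred`, while `spread` plays the role of the uncertainty estimate discussed above.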

2.4. Hybrid Models

Hybrid models occupy an intermediate position between the two preceding classes [56]. Here, the model uses both neural networks and statistical forecasters, usually in a meta-ensemble model. The concept is that ensemble methods can mitigate the weaknesses of each individual forecaster and therefore improve the overall accuracy. This technique proved effective in the M4 competition, where the most successful submissions involved some kind of hybrid model [24].
In addition to ensembling methods, statistical forecasters can also be used to perform data augmentation or even as direct input to the DNN [57]. As stated in the initial hypothesis, the idea is that the modeled behavior of the time series that each statistical forecaster creates may contain useful information that the DNN can learn, and therefore generate more accurate predictions [58]. Section 3.2 provides comprehensive details on the architecture of the proposed model.

3. Methodology

In this section, we will discuss our methodology for testing the hypothesis and provide a detailed overview of the DNN used and each of its layers.

3.1. Data Preprocessing

The data preparation pipeline involves two main steps: generating statistical forecasts to be used as covariates and normalizing the data to ensure stable model training. The models receive a lookback window of historical data and are tasked to predict a future forecast horizon.
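As a rough illustration of the lookback/horizon setup, the sketch below slices a series into input windows and multi-step targets. The function name and the lookback of 5 are hypothetical; only the 12-step horizon comes from the experiments described in this paper, and the actual Preprocessor() implementation may differ.

```python
def make_windows(series, lookback, horizon):
    """Slice a series into (lookback window, future horizon) pairs."""
    inputs, targets = [], []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        split = start + lookback
        inputs.append(series[start:split])            # what the model sees
        targets.append(series[split:split + horizon])  # what it must predict
    return inputs, targets

series = list(range(20))  # toy series 0, 1, ..., 19
inputs, targets = make_windows(series, lookback=5, horizon=12)
```

Each target holds all 12 future values at once, which is what makes the task multi-horizon rather than single-step.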

3.1.1. Data Normalization

Before being fed into the neural network, all input features (the original time series and the statistical covariates) are normalized. Normalization is a critical step that scales the data to a consistent range, helping to stabilize the training process and improve model convergence. For each time series, the minimum and maximum values are computed from the training portion of the data and then applied to scale the entire series. The training portion corresponds to the subset of data defined by the train_split ratio, which divides the available samples into training and testing subsets to prevent data leakage and ensure proper model validation.
We applied Min–Max scaling to transform all data to a range of [0, 1]. For a given time series $X$, the scaled value $X'_t$ is calculated as:
$$X'_t = \frac{X_t - \min(X_{\text{train}})}{\max(X_{\text{train}}) - \min(X_{\text{train}})}.$$
This normalization ensures that all features contribute proportionally during training and prevents features with larger magnitudes from dominating the optimization process.
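The scaling step can be sketched in a few lines of plain Python. The helper names and the toy values are assumptions for illustration; the key point is that the minimum and maximum are taken from the training portion only and then reused for the whole series.

```python
def fit_min_max(train_values):
    """Compute the scaling statistics from the training portion only."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi):
    """Apply (x - min_train) / (max_train - min_train) to every value."""
    span = hi - lo
    if span == 0:
        # Constant training series: mapping everything to 0.0 is an assumption.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

series = [10.0, 20.0, 15.0, 30.0, 40.0, 25.0]  # toy data
train_split = 0.5                              # hypothetical ratio
n_train = int(len(series) * train_split)
lo, hi = fit_min_max(series[:n_train])
scaled = min_max_scale(series, lo, hi)
```

Because the statistics come from the training portion alone, later values that exceed the training maximum are mapped above 1 (here 30.0 becomes 2.0), which is the price of avoiding leakage from the test portion.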

3.1.2. Generation of Statistical Covariates

The first step in our proposed model is to generate statistical predictions that will be used as covariates. This process is managed by a custom-implemented ‘Preprocessor()’ class, which takes several arguments to configure the feature generation: a ‘forecast_horizon’ (how many steps ahead to predict), a ‘season_length’ (e.g., 12 for monthly data), the ‘date_freq’ of the data, the ‘train_split’ ratio, and a list of ‘models’ to use for generating covariates.
In this research, the following statistical forecasters were used as covariates: ARIMA [7], ETS [28], and Linear Regression [59]. These models were implemented in Python 3.10 using the statsforecast [60] and scikit-learn [59] libraries, selected for their broad acceptance and reliability in time series forecasting.
Each statistical forecaster (ARIMA, ETS, and Linear Regression) was independently trained on the training portion of each time series. Using the fitted parameters, the models produced h-step-ahead forecasts according to the predefined forecast_horizon parameter in the custom Preprocessor() class. In all experiments, we adopted a forecast horizon of 12 time steps, meaning that each model—both statistical and neural—was trained to predict the next twelve future values of the series simultaneously. These forecasts were stored as additional features temporally aligned with the original time series, forming multivariate input tensors that integrate both raw and statistically derived information. This configuration explicitly defines a multi-horizon forecasting setting, allowing the proposed hybrid model to evaluate performance across multiple future periods rather than a single-step prediction. The entire procedure was automated by the custom Preprocessor() class, ensuring reproducibility across datasets. The hyperparameters of ARIMA and ETS were automatically selected using the auto_arima() and auto_ets() functions from the statsforecast library, while Linear Regression coefficients were estimated by Ordinary Least Squares (OLS) using scikit-learn. In this way, the statistical forecasters were consistently obtained and incorporated as covariates for subsequent DNN training.

3.1.3. Autoregressive Integrated Moving Average

The ARIMA method is a widely used time series forecasting technique [61]. It combines three components to model the temporal structure in the data: Autoregression (AR), Integration (I), and Moving Average (MA). A non-seasonal ARIMA model is generally denoted ARIMA($p$, $d$, $q$), where $p$ is the order of the autoregressive part, $d$ is the degree of first differencing involved, and $q$ is the order of the moving-average part. The model for a differenced series $Y_t = (1 - B)^d X_t$, where $B$ is the backshift operator, is given by:
$$\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) Y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t$$
where $X_t$ is the original series, $c$ is a constant, $\phi_i$ are the autoregressive coefficients, $\theta_j$ are the moving average coefficients, and $\varepsilon_t$ is white noise. The values of $p$, $d$, and $q$ are automatically determined using the auto_arima() function available in the Python statsforecast library, which provides a flexible and reliable approach for time series forecasting.
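As a worked illustration of the general form (an example chosen here, not an order selected in the experiments), consider ARIMA(1, 1, 1). Substituting $p = d = q = 1$ and expanding the backshift operator gives an explicit recursion in the original series:

```latex
\begin{align}
(1 - \phi_1 B)\, Y_t &= c + (1 + \theta_1 B)\, \varepsilon_t,
  \qquad Y_t = (1 - B) X_t = X_t - X_{t-1} \\
X_t - X_{t-1} - \phi_1 \left( X_{t-1} - X_{t-2} \right)
  &= c + \varepsilon_t + \theta_1 \varepsilon_{t-1} \\
X_t &= c + (1 + \phi_1) X_{t-1} - \phi_1 X_{t-2}
  + \varepsilon_t + \theta_1 \varepsilon_{t-1}
\end{align}
```

The last line shows that the value at time $t$ depends on the two previous observations and on the current and previous noise terms.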

3.1.4. Exponential Smoothing

The ETS method is a widely used time series forecasting technique that applies a weighted average of past observations to predict future values [8]. It is based on the assumption that recent observations are more important in forecasting than older ones. The ETS method includes three main components: level, trend, and seasonality.
The level component represents the current estimated level of the time series and is denoted by $L_t$. It is updated on the basis of a weighted average of the previous level and the recent observation. The equation for updating the level is:
$$L_t = \alpha X_t + (1 - \alpha)(L_{t-1} + T_{t-1})$$
where $X_t$ represents the current observation at time $t$, $T_{t-1}$ is the estimated trend at time $t-1$, and $\alpha$ is the smoothing parameter.
The trend component represents the estimated trend of the time series and is denoted by $T_t$. It is updated on the basis of a weighted average of the previous trend and the difference between the current level and the previous level. The equation for updating the trend is:
$$T_t = \beta (L_t - L_{t-1}) + (1 - \beta) T_{t-1}$$
where $\beta$ is the smoothing parameter for the trend component.
The seasonality component represents the seasonal pattern in the time series and is denoted by $S_t$. It is updated on the basis of a weighted average of the previous seasonal component and the recent observation. The equation for updating the seasonality is:
$$S_t = \gamma \frac{X_t}{L_t} + (1 - \gamma) S_{t-m}$$
where $m$ is the length of the seasonal cycle and $\gamma$ is the smoothing parameter for the seasonality component.
The forecasted value for the next time period is calculated by combining the level, trend, and seasonality components:
$$\hat{X}_{t+1} = (L_t + T_t)\, S_{t-m+1}.$$
The ETS model captures level, trend, and seasonal patterns that may not be effectively learned by the neural network, especially in datasets with irregular periodicity or limited length. Therefore, the forecasts generated by ETS are used as additional covariates in the hybrid model. These statistically derived signals provide interpretable temporal structure that complements the data-driven features extracted by the deep neural network.
The ETS method can handle different variations and combinations of the level, trend, and seasonality components, such as additive or multiplicative models. The model parameters are also selected automatically with the statsforecast library. In summary, the ETS method provides a flexible and intuitive approach for time series forecasting by dynamically adjusting the weights of past observations based on their recency and importance.
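The update equations of this subsection can be traced by hand with a short sketch. It follows the level, trend, seasonality, and forecast formulas exactly as written above; the function name, the smoothing values, and the starting state are illustrative assumptions, and the statsforecast implementation used in the experiments selects its own model form and parameters.

```python
def ets_step(x_t, level, trend, seasonals, alpha, beta, gamma):
    """One update of level, trend, and seasonality, plus the one-step forecast.

    `seasonals` holds the last m seasonal factors, oldest first,
    so seasonals[0] is S_{t-m} before the update.
    """
    new_level = alpha * x_t + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    new_season = gamma * x_t / new_level + (1 - gamma) * seasonals[0]
    seasonals = seasonals[1:] + [new_season]  # now holds S_{t-m+1}, ..., S_t
    forecast = (new_level + new_trend) * seasonals[0]
    return new_level, new_trend, seasonals, forecast

# Toy state: flat series at 10 with neutral seasonal factors (m = 2).
level, trend, seasonals, forecast = ets_step(
    x_t=10.0, level=10.0, trend=0.0, seasonals=[1.0, 1.0],
    alpha=0.5, beta=0.5, gamma=0.5,
)
```

With a flat series and neutral seasonal factors, the state is unchanged and the forecast equals the level, which is a quick sanity check on the recursion.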

3.1.5. Linear Regression

Linear regression is a widely used supervised learning algorithm for modeling the relationship between a dependent variable and one or more independent variables [59]. It fits a linear equation to the given data by minimizing the sum of squared residuals.
This model assumes a linear relationship between the dependent variable y and the independent variables X, and it can be represented as:
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon$$
where $(X_1, X_2, \ldots, X_n)$ are the independent variables, $(\beta_0, \beta_1, \beta_2, \ldots, \beta_n)$ are the coefficients to be estimated, $y$ is the dependent variable, and $\varepsilon$ is the error term.
Linear regression, in turn, models short-term linear dependencies in the recent time steps, providing a complementary perspective to the nonlinear mappings learned by the DNN. For each dataset, one-step-ahead forecasts from ETS and Linear Regression are computed using the same input window as the neural network. These predictions are concatenated with the raw input features and fed into the hybrid architecture as additional channels (statistical covariates). This integration allows the neural network to learn residual nonlinearities beyond those explained by the statistical models.
In this research, we applied the LinearRegression class from the scikit-learn library to fit the linear regression model, estimating the $\beta$ coefficients with the OLS method, which minimizes the residual sum of squares between the observed and predicted values.
Once the model is fitted, it can be used to make predictions for new input data by simply computing the dot product between the input variables and the estimated coefficients. Linear regression is widely used for various applications, including predictive modeling, trend analysis, and relationship exploration.
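The fit-then-dot-product procedure can be sketched with NumPy's least-squares solver. The regressors used in the paper are not specified here, so this sketch assumes a single regressor, the time index; the function name and toy series are likewise illustrative, and NumPy stands in for the scikit-learn class named above.

```python
import numpy as np

def linear_trend_forecast(train_values, horizon):
    """Fit y = b0 + b1 * t by OLS on the training portion, then extrapolate."""
    train_values = np.asarray(train_values, dtype=float)
    t_train = np.arange(len(train_values))
    design = np.column_stack([np.ones(len(t_train)), t_train])
    beta, *_ = np.linalg.lstsq(design, train_values, rcond=None)
    t_future = np.arange(len(train_values), len(train_values) + horizon)
    future_design = np.column_stack([np.ones(horizon), t_future])
    return future_design @ beta  # dot product of inputs and coefficients

train = 2.0 * np.arange(24) + 1.0  # toy, perfectly linear series
forecast = linear_trend_forecast(train, horizon=12)
```

On a perfectly linear toy series the 12-step forecast simply continues the line, which makes the behavior of the covariate easy to verify.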
By combining the forecasts of statistical models (ETS and Linear Regression) with deep neural representations, the proposed hybrid framework leverages complementary information from both paradigms. This design supports the central hypothesis that integrating statistical forecasters as covariates enhances predictive accuracy and robustness across heterogeneous time series datasets.

3.2. Model Architecture

The proposed model consists of several input branches, where each branch receives a different representation of the input time series. Specifically, we use four branches: one for the original (normalized) time series and one for each of the three statistical forecasts (ARIMA, ETS, and Linear Regression). Each branch has a single fully connected layer, referred to as a dense layer in this paper. Then each branch’s output is multiplied by a trainable scalar parameter called weight, and the results are averaged, essentially performing a weighted average of the branches. By applying this approach, the model can learn which input representation is the most influential for prediction [62]. The analysis of these learned weights can provide insights into which forecasters the model finds most useful, an area we suggest for future research.
After all the branches have been consolidated into a single tensor, the resulting tensor is forwarded to a stack of convolution and max-pooling layers, followed by an LSTM layer. An overview of the model is represented in Figure 2 and the detailed formulae of all the layers are explained in the following sections.
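The branch-merging step can be sketched in NumPy as follows. The shapes, the random initialization, and all names are assumptions made for illustration; in the actual model the per-branch scalars are trainable parameters learned jointly with the rest of the network, and the merged tensor then feeds the convolution, pooling, and LSTM stack.

```python
import numpy as np

def dense_relu(x, kernel, bias):
    """Fully connected layer on the last axis followed by ReLU."""
    return np.maximum(0.0, x @ kernel + bias)

def weighted_branch_average(branch_inputs, kernels, biases, branch_weights):
    """Scale each branch output by its scalar weight, then average the branches."""
    outputs = [
        w * dense_relu(x, k, b)
        for x, k, b, w in zip(branch_inputs, kernels, biases, branch_weights)
    ]
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
batch_size, lookback, hidden_size = 2, 24, 8  # hypothetical sizes
names = ["series", "arima", "ets", "linear_regression"]
branch_inputs = [rng.random((batch_size, lookback, 1)) for _ in names]
kernels = [rng.normal(size=(1, hidden_size)) for _ in names]
biases = [np.zeros(hidden_size) for _ in names]
branch_weights = np.ones(len(names))          # stand-in for the trainable scalars
merged = weighted_branch_average(branch_inputs, kernels, biases, branch_weights)
```

Setting a branch weight to zero removes that branch's contribution, which is why the learned scalars can indicate how much the network relies on each forecaster.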

3.2.1. Dense Layer

As implemented in the Keras Python package, dense layers are conventional fully connected neural network layers. They consist of a set of weights $W \in \mathbb{R}^{a \times \text{hidden\_size}}$ and biases $B \in \mathbb{R}^{\text{hidden\_size}}$, where $a$ denotes the input size and hidden_size represents the output size [18]. The forward pass of a Dense layer can be described as follows:
$$\text{Dense}(X) = \sigma(XW + B)$$
where $X \in \mathbb{R}^{\xi \times a}$, with $\xi$ representing any dimension, and $\sigma$ represents an activation function. In this model, we use the Rectified Linear Unit (ReLU) activation function [63], which is defined as $\sigma(x) = \max(0, x)$. The primary role of an activation function is to introduce non-linearity between the input and output of the Dense layer.
The proposed model uses Dense layers in three different parts of its architecture. The first is at the root of each branch, as described in Section 3.2. The last dimension of each input $X$ is 1. Since the Dense layer operates on the last dimension, the output of this layer will have shape batch_size × t × hidden_size.

3.2.2. Convolutional Layer

Convolutional layers are the fundamental building block of convolutional neural networks (ConvNets), widely used in image and video analysis tasks. A convolutional layer applies a small filter to the input data, sliding it over to compute the dot product between the data and the filter weights. The filter applies the same weights to every position in the input data, resulting in a new sequence. In this sequence, each element represents a weighted combination of a local region from the input data [64].
Essentially, by learning essential features with the filter, the model is no longer bound by spatial dependency. In other words, the model can locate important features anywhere in the input data [64]. This property is particularly advantageous in image analysis, where it is crucial to recognize objects like water bottles regardless of their position within the image. However, it is also beneficial in time series forecasting, as it enables the recognition of recurring patterns, such as seasonal patterns, at any point along the temporal axis.
In image analysis tasks, the convolutional layer typically operates in two dimensions, representing the width and height of the image [65]. In video analysis tasks, the convolutional layer commonly operates in three dimensions, including an additional dimension for the temporal axis. However, a one-dimensional convolution (Conv1D) is used for time series data with only one axis [66].
Given an input $X \in \mathbb{R}^{t \times n}$ to a Conv1D layer, where $t$ is the length of the time series and $n$ is the number of channels in the input data, and a filter of size kernel_size, the Conv1D layer computes the dot product of the filter weights $W \in \mathbb{R}^{n \times \text{kernel\_size}}$ and the input $X$ at each position, adds a bias term $B \in \mathbb{R}^{n}$, and applies an activation function $\sigma$ to produce the result, as shown in the following equation:
$$\text{Conv1D}(X)_i = \sigma\left(\sum_{j=0}^{\text{kernel\_size}-1} W_j X_{i+j} + B\right),$$
where $i$ ranges from 0 to $t - \text{kernel\_size}$. Additionally, the Conv1D layer has two more parameters: padding and stride. The padding parameter pads the beginning and end of the input sequence, whereas the stride parameter controls the increment in the filter position during the sliding process [64].
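The sliding dot product in the equation above can be written out directly. This is a simplified sketch with a single filter, stride 1, no padding, a scalar bias, and ReLU as the activation; the function name and toy values are assumptions, and a framework layer such as the Keras one would apply many filters at once.

```python
import numpy as np

def conv1d_valid(x, kernel, bias=0.0):
    """Single-filter 1D convolution with stride 1 and no padding.

    x: array of shape (t, n); kernel: array of shape (kernel_size, n).
    """
    kernel_size = kernel.shape[0]
    n_positions = x.shape[0] - kernel_size + 1  # i = 0, ..., t - kernel_size
    out = np.empty(n_positions)
    for i in range(n_positions):
        out[i] = np.sum(kernel * x[i:i + kernel_size]) + bias
    return np.maximum(0.0, out)  # ReLU activation

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # t = 4 time steps, n = 1 channel
kernel = np.array([[1.0], [1.0]])           # kernel_size = 2
features = conv1d_valid(x, kernel)          # sums of adjacent pairs: 3, 5, 7
```

The same two weights are reused at every position, which is what lets the filter respond to a pattern wherever it occurs along the time axis.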
Typically, after a convolutional layer, there is a pooling layer. Its primary purpose is to downsample the feature maps by retaining only the most critical information while discarding the rest [64]. The approach to performing the pooling operation can differ between models. However, recent studies have shown a preference for MaxPooling over other pooling methods due to its superior results and more stable training process [67]. MaxPooling performs a sliding operation of a fixed-size window on the input feature map, retaining the maximum value within each window while discarding all other values.
By retaining only the maximum value, MaxPooling ensures the preservation of the most vital feature within each window. This process effectively reduces the dimensionality of the feature maps and improves the computational efficiency of the ConvNet by reducing the number of parameters that need to be learned in subsequent layers.
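The pooling operation described above is small enough to sketch in plain Python. The function name and the default of non-overlapping windows (stride equal to the window size) are assumptions for illustration.

```python
def max_pool_1d(values, pool_size, stride=None):
    """Keep the maximum of each window; windows do not overlap by default."""
    stride = pool_size if stride is None else stride
    return [
        max(values[start:start + pool_size])
        for start in range(0, len(values) - pool_size + 1, stride)
    ]

pooled = max_pool_1d([3.0, 5.0, 7.0, 2.0, 9.0, 4.0], pool_size=2)
```

Six input values become three, so every layer after the pooling step processes a sequence half as long.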

3.2.3. Long Short-Term Memory

The LSTM is an RNN architecture widely used in sequence-to-sequence tasks [11]. Unlike traditional RNNs, LSTMs have a mechanism for handling the vanishing gradient problem, which is the tendency of gradients to shrink exponentially as they are backpropagated through many layers or, in recurrent networks, many time steps. Because of this, LSTMs are a popular choice for time series forecasting, as they can capture long-term dependencies and patterns in sequential data [36].
A traditional LSTM cell, as shown in Figure 3, has three inputs: the Current Input $X_t \in \mathbb{R}$, the Previous Hidden State $H_{t-1} \in \mathbb{R}^{1 \times h}$, where $h$ is the number of units in the hidden state, and the Previous Cell State $C_{t-1} \in \mathbb{R}^{1 \times h}$. The hidden and cell states are updated at each time step, allowing the LSTM to remember information from the previous steps [11].
In addition to the inputs, the LSTM cell incorporates four gates. The first gate is the Forget Gate $f_t$, which is described by the equation:
$$f_t = \sigma(W_f \times (H_{t-1} + X_t) + B_f)$$
Here, $\sigma$ denotes the Sigmoid activation function. The weight matrix of the Forget Gate is represented by $W_f \in \mathbb{R}^{h \times 1}$, and the bias is denoted by $B_f \in \mathbb{R}^{h}$. As implied by its name, the Forget Gate controls which information should be retained (passed to the cell state) and which should be forgotten. This gate accomplishes this by learning the optimal weight matrix $W_f$. The resulting matrix $f_t \in \mathbb{R}^{h \times h}$ is then multiplied by the Previous Cell State $C_{t-1}$. In cases with no previous cell state (i.e., the first time step), it is assumed that $C_{t-1} = 0$. The next gate is the Input Gate $i_t$, which is defined as follows:
$$i_t = \sigma(W_i \times (H_{t-1} + X_t) + B_i)$$
where $W_i \in \mathbb{R}^{h \times 1}$ is the weight matrix of the Input Gate, and $B_i \in \mathbb{R}^{h}$ is the bias, resulting in $i_t \in \mathbb{R}^{h \times h}$. Parallel to the Input Gate is the Cell Gate $\tilde{C}_t$, defined by:
$$\tilde{C}_t = \tanh(W_c \times (C_{t-1} + X_t) + B_c)$$
Unlike the previous gates, $W_c$ is a scalar, $W_c \in \mathbb{R}^{1 \times 1}$, as is the bias $B_c \in \mathbb{R}^{1}$; the resulting Cell Gate is $\tilde{C}_t \in \mathbb{R}^{1 \times h}$. The Input Gate $i_t$ is then multiplied by the Cell Gate $\tilde{C}_t$, and the product is added to the product of the Previous Cell State $C_{t-1}$ and the Forget Gate $f_t$, resulting in the Cell State $C_t \in \mathbb{R}^{1 \times h}$, as follows:
$$C_t = C_{t-1} \times f_t + i_t \times \tilde{C}_t.$$
The Input Gate $i_t$ and Cell Gate $\tilde{C}_t$ control which part of the input $X_t$ should be retained in the Cell State and which part should be ignored. Lastly, the Output Gate $O_t$ can be calculated as:
$$O_t = \sigma(W_o \times (H_{t-1} + X_t) + B_o)$$
where $W_o \in \mathbb{R}^{h \times 1}$ is the weight matrix of the Output Gate, and $B_o \in \mathbb{R}^{h}$ is the bias, resulting in $O_t \in \mathbb{R}^{h \times h}$. The function of the Output Gate $O_t$ is to control what goes to the next Hidden State and, simultaneously, what the output of the current cell should be [15]. Then, the Hidden State $H_t$ can be calculated as:
$$H_t = O_t \times \tanh(C_t).$$
In practice, stacked LSTM cells form an LSTM layer. This set of equations for the states and gates of an LSTM cell controls the flow of information within the cell and allows the LSTM to decide which information to store and which to discard. The number of LSTM cells in an LSTM layer and the number of LSTM layers in a network can be tuned as hyperparameters to optimize performance for a given task [11].
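As a concrete reading of the gate equations in this subsection, the following minimal Python sketch performs one LSTM cell step. It assumes an elementwise interpretation of the products: each hidden unit has its own scalar weight and bias for the Forget, Input, and Output Gates, while the Cell Gate shares a single scalar weight and bias, as stated above. The function name `lstm_cell_step` and the layout of the `params` dictionary are hypothetical and serve only as illustration; they are not the implementation used in the experiments.

```python
import math


def sigmoid(z):
    """Logistic function used by the forget, input, and output gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One step of an LSTM cell, read elementwise from the gate equations.

    x_t    : scalar current input X_t
    h_prev : list of h floats, previous hidden state H_{t-1}
    c_prev : list of h floats, previous cell state C_{t-1}
    params : dict with per-unit lists "W_f", "B_f", "W_i", "B_i", "W_o", "B_o"
             and scalars "W_c", "B_c" (hypothetical layout for illustration)
    """
    h_t, c_t = [], []
    for j in range(len(h_prev)):
        s = h_prev[j] + x_t  # (H_{t-1} + X_t) for unit j
        f = sigmoid(params["W_f"][j] * s + params["B_f"][j])  # forget gate
        i = sigmoid(params["W_i"][j] * s + params["B_i"][j])  # input gate
        o = sigmoid(params["W_o"][j] * s + params["B_o"][j])  # output gate
        # Cell gate: previous cell state plus input, with scalar W_c and B_c.
        c_tilde = math.tanh(params["W_c"] * (c_prev[j] + x_t) + params["B_c"])
        c = c_prev[j] * f + i * c_tilde  # new cell state
        c_t.append(c)
        h_t.append(o * math.tanh(c))  # new hidden state
    return h_t, c_t


# Usage: two hidden units, zero initial states (first time step).
params = {
    "W_f": [0.5, -0.5], "B_f": [0.0, 0.0],
    "W_i": [0.5, -0.5], "B_i": [0.0, 0.0],
    "W_o": [0.5, -0.5], "B_o": [0.0, 0.0],
    "W_c": 1.0, "B_c": 0.0,
}
h1, c1 = lstm_cell_step(1.0, [0.0, 0.0], [0.0, 0.0], params)
```

Because the Output Gate lies in (0, 1) and the hyperbolic tangent lies in (-1, 1), every entry of the new hidden state stays within (-1, 1) regardless of the parameter values.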

3.3. Training Details and Hyperparameter Selection

The models were trained using the Adam optimizer with a learning rate of 0.001. The batch size was 64, and training ran for 100 epochs with early stopping based on validation loss, using a patience of 10 epochs. Hyperparameters were tuned through an exploratory grid search guided by empirical validation on the training set and informed by prior studies on LSTM-based forecasting models. The search varied learning rate, batch size, dropout rate, layer size, and model depth to balance predictive accuracy and computational cost. The grid search was performed for Model A (without covariates), and the same values were adopted for Model B. The best hyperparameters were rounded to conventional values (e.g., 963 dense units rounded to 1024). The selected configuration was:
i. Dense layer: 1024 units;
ii. Conv1D: 32 filters, kernel size 5;
iii. MaxPooling: pool size 4;
iv. Stacks: 5;
v. LSTM: 2048 units.
This process ensured that the final configuration resulted from systematic tuning rather than arbitrary selection. The models were implemented using TensorFlow and Keras.
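The early-stopping rule described above can be illustrated without any deep learning framework. The sketch below assumes that a patience of 10 epochs means training halts once the validation loss has failed to improve on its best value for 10 consecutive epochs. The function name `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience=10, max_epochs=100):
    """Return (last_epoch_run, best_epoch), both 1-indexed.

    Training stops when the validation loss has not improved on its best
    value for `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return last_epoch, best_epoch


# Hypothetical validation curve: improves for 20 epochs, then plateaus.
losses = [1.0 / e for e in range(1, 21)] + [0.06] * 80
last_epoch, best_epoch = stopping_epoch(losses)
```

With this hypothetical curve, the best validation loss occurs at epoch 20 and training stops at epoch 30, well before the 100-epoch budget.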

3.4. Metrics

In this section, we discuss the metrics used to evaluate the performance of our predictive model. We employ three commonly used metrics: MAE, MSE, and SMAPE. All of these metrics are widely used in both academic and practical applications [21], and were chosen to evaluate the proposed hypothesis because each metric prioritizes a different characteristic of the time series.

3.4.1. Mean Absolute Error

The MAE measures the average absolute difference between the predicted values $\hat{y}$ and the actual values $y$. It is defined as:
$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{s} \sum_{i=1}^{s} \left| y_i - \hat{y}_i \right|. \tag{16}$$
Equation (16) is widely used because it provides a simple and intuitive measure of the average prediction error. It is particularly useful when the magnitude of errors is important and outliers or extreme values in the dataset should not be severely penalized [21].

3.4.2. Mean Squared Error

The MSE metric calculates the average of the squared differences between the predicted values $\hat{y}$ and the actual values $y$. It is given by:
$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{s} \sum_{i=1}^{s} \left( y_i - \hat{y}_i \right)^2. \tag{17}$$
Equation (17) is a commonly used metric that emphasizes larger errors due to the squaring operation. It provides a measure of the average squared deviation between the predicted and actual values. Contrary to MAE, MSE heavily penalizes outliers and extreme predictions [21].
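To make the contrast between Equations (16) and (17) concrete, the following sketch computes both metrics on a small series with and without a single extreme error. The values are hypothetical and are not drawn from the datasets in this study.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (y_i - y_hat_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 11.0, 13.0]
y_close = [11.0, 11.0, 12.0, 12.0]    # every forecast is off by 1
y_outlier = [11.0, 11.0, 12.0, 23.0]  # last forecast is off by 10

print(mae(y_true, y_close), mse(y_true, y_close))      # 1.0 1.0
print(mae(y_true, y_outlier), mse(y_true, y_outlier))  # 3.25 25.75
```

A single large error raises MAE from 1.0 to 3.25 but raises MSE from 1.0 to 25.75, which is the outlier sensitivity described above.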

3.4.3. Symmetric Mean Absolute Percentage Error

The SMAPE metric computes the average percentage difference between the predicted values $\hat{y}$ and the actual values $y$, considering the mean of their magnitudes in the denominator. It is given by:
$$\mathrm{SMAPE}(y, \hat{y}) = \frac{1}{s} \sum_{i=1}^{s} \frac{\left| y_i - \hat{y}_i \right|}{\left( y_i + \hat{y}_i \right) / 2}. \tag{18}$$
Equation (18) is a symmetric variant of the Mean Absolute Percentage Error (MAPE). The main difference is the denominator, where MAPE uses only $y$ as the denominator, and SMAPE uses the average between $y$ and $\hat{y}$. This is done to avoid a possible division by zero when $y = 0$ [68]. SMAPE is a useful metric because it represents the overall error as a percentage, allowing for comparison of the accuracy of a model in multiple domains with heterogeneous time series [69]. A general overview of the metrics is summarized in Table 1, in which sensitivity to outliers refers to how much the metric is impacted by the presence of an outlier in the dataset; explainability refers to how easy it is to explain the metric in non-technical terms; and interpretability refers to how easy it is to evaluate a model based on this metric alone. All three are qualitative criteria defined by the authors to summarize the metrics.
These metrics allow us to quantitatively evaluate the performance of our predictive model from different angles. MAE and MSE provide information on the magnitude of errors, while SMAPE offers information on the relative percentage deviations.
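The sketch below implements Equation (18) on a low-volume, intermittent series. Two choices are assumptions made here for robustness and are not stated in the text: absolute values are taken in the denominator, and a pair in which both the actual and predicted values are zero is counted as zero error. The function name `smape` and the data are hypothetical.

```python
def smape(y_true, y_pred):
    """Symmetric MAPE as a fraction (multiply by 100 for a percentage)."""
    total = 0.0
    for a, b in zip(y_true, y_pred):
        denom = (abs(a) + abs(b)) / 2.0  # assumption: magnitudes in the denominator
        if denom > 0.0:
            total += abs(a - b) / denom
        # assumption: a pair with a == b == 0 contributes zero error
    return total / len(y_true)


# Low-volume, intermittent example (hypothetical values).
y_true = [0.0, 2.0, 0.0, 4.0]
y_pred = [1.0, 2.0, 0.0, 2.0]
score = smape(y_true, y_pred)
```

In this example the first pair (actual 0, forecast 1) contributes the largest possible term, 2, even though the absolute error is only one unit. This shows how small absolute errors on low-volume series can dominate SMAPE, whereas MAPE would be undefined for that pair.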

3.5. Hypothesis Test

The methodology for testing the alternative hypothesis ($H_a$) consists of training two otherwise identical DNN models, where the only difference lies in the input of each model. The first model, named Model A, was trained on the dataset without any feature engineering. The second model, named Model B, was trained on a dataset with the statistical forecasters as covariates, as shown in Figure 4. In this figure, the circle marked with an “X” represents the merging of the original dataset with the statistical forecasts to generate covariates for Model B. Both models had the same training configuration and parameters. This procedure ensures a fair comparison between both models under identical experimental conditions. By employing the paired t-test, the evaluation focuses on the mean difference of performance metrics between the two models for each dataset, quantifying whether the improvements observed in Model B are statistically significant or could have occurred by chance. This analytical framework provides formal statistical evidence to support or reject the central hypothesis of this study.
Each model was evaluated with the three proposed metrics (MAE, MSE, and SMAPE) for each dataset (M5, Stallion, Stock Market, and Synthetic), resulting in 12 metric evaluations for each model. Next, we performed a two-sample paired t-test (also known as Student’s t-test). The objective of this test is to measure whether the means of two paired samples with the same origin differ significantly [70]. The two-sample paired test is often used to measure the influence that an event had on the samples. For example, it could be applied to a group of students to measure their grades before and after they studied a subject and to determine how effective the study was.
This is achieved by calculating the t-statistic ($t$) between the observations of the samples, where $t$ represents the difference in terms of standard deviations ($\sigma$) between them. To calculate $t$, it is necessary first to calculate the difference between the two paired observations ($d$), where $d = x - y$, with $x$ as the first observation and $y$ as the second. Next, the mean difference between the $n$ pairs ($\mu_d$) can be defined as:
$$\mu_d = \frac{\sum_i d_i}{n}. \tag{19}$$
The next step is to calculate the standard deviation of the differences ($\sigma_d$), defined as:
$$\sigma_d = \sqrt{\frac{\sum_i \left( d_i - \mu_d \right)^2}{n - 1}}. \tag{20}$$
Finally, $t$ can be defined as:
$$t = \frac{\mu_d}{\sigma_d}. \tag{21}$$
In Equation (21), a large positive value of $t$ indicates that the first observation of each pair is, on average, considerably larger than the second, and a large negative value indicates the opposite. After obtaining the t-statistic, a lookup table or specialized software can be used to find its corresponding p-value. The p-value represents the probability, assuming $H_0$ is true, of obtaining a statistic as extreme or more extreme than that observed in the experiment [70].
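The steps above can be sketched in a few lines of Python using only the standard library. Note that this sketch uses the conventional paired form, in which the mean difference is divided by the standard error $\sigma_d / \sqrt{n}$; this scaling is an assumption relative to Equation (21) as written. The function name `paired_t_statistic` and the error values are hypothetical, and the p-value lookup (a t distribution with $n - 1$ degrees of freedom) is left to statistical software.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return (mean_diff, sd_diff, t) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two observations")
    d = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.mean(d)
    sd_diff = statistics.stdev(d)  # sample standard deviation, divides by n - 1
    # Conventional form: divide by the standard error sd_diff / sqrt(n).
    t = mean_diff / (sd_diff / math.sqrt(len(d)))
    return mean_diff, sd_diff, t


# Hypothetical per-series errors: first sample (x) and second sample (y).
errors_x = [1.0, 2.0, 3.0, 4.0]
errors_y = [2.0, 2.0, 5.0, 5.0]
mean_diff, sd_diff, t = paired_t_statistic(errors_x, errors_y)
```

Here the differences are $-1, 0, -2, -1$, so the mean difference is $-1$ and the statistic is negative, meaning the first sample has the lower error on average.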

Significance Level

It is important to note that the Student t-test assumes that $H_0$ is true. Therefore, the outcome of the test either supports retaining $H_0$ or provides evidence that $H_0$ is not true, in which case $H_a$ can be assumed to be true [70].
Since a low p-value indicates that the observed data would be unlikely if $H_0$ were true, evidence for $H_a$ requires a very low p-value. For this reason, we adopt a significance level ($\alpha$) of 0.05, as demonstrated in Figure 4 [71].
However, as discussed by many authors over the decades, the meaning of the p-value and its associated thresholds is abstract and has received many criticisms [72,73,74]. Because of this, we adopt a region of uncertainty where $H_0$ can be neither accepted nor rejected, formalized in terms of the p-value $p$ as:
$$\begin{cases} p \le 0.05 & H_0 \text{ is rejected} \\ 0.05 < p \le 0.20 & H_0 \text{ can be neither accepted nor rejected} \\ p > 0.20 & H_0 \text{ is accepted} \end{cases}$$
Using the methodology outlined in this section, we successfully tested the proposed model and assessed the hypothesis presented in Section 1. The results will be detailed in the following section.
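The three-region criterion can be expressed as a small decision function. This sketch assumes that the thresholds are applied to the p-value returned by the paired t-test, as done in Section 4.3; the function name `decide` and the returned labels are hypothetical.

```python
def decide(p_value):
    """Map a p-value to the three-region decision rule described above."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value <= 0.05:
        return "H0 rejected"
    if p_value <= 0.20:
        return "inconclusive"
    return "H0 accepted"


for p in (0.01, 0.05, 0.12, 0.20, 0.35):
    print(p, decide(p))
```

Both boundaries are inclusive on the lower region, so a p-value of exactly 0.05 rejects $H_0$ and a p-value of exactly 0.20 remains inconclusive.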

4. Results and Discussion

This study uses four distinct datasets, each selected to represent different challenges in time series forecasting, such as intermittency, high variance, and non-linearity. The details of each dataset are as follows:
A. M5 competition: Released by Walmart for the Kaggle competition; comprises 30,490 daily series with 1840 observations each (around 5 years). The sales volume exhibits high variability, with a mean daily sales of 7.9 units and standard deviation of 21.5. A total of 6859 series show intermittence above 50%, confirming a predominance of erratic but dense time series.
B. Stallion competition: The Stallion dataset contains 1392 monthly time series (60 months each) across 24 SKUs and 58 agencies. It represents alcoholic beverage sales in liters. The data show high variance and short history, with a mean volume of 2340 L and standard deviation of 4120 L per SKU–agency pair. It is markedly more erratic than M5 but less intermittent.
C. Stock market: This dataset comprises 3457 stock tickers from NASDAQ, NYSE, and the S&P 500, with daily records spanning up to 2022. Each entry includes Open, High, Low, Close, and Volume values; however, only the Close price was used as the target in this study. The data exhibit no intermittency and show high volatility, with average volumes of 5.01, 7.06, and 9.48 for NASDAQ, NYSE, and the S&P 500, respectively. The corresponding standard deviations are 1.01, 0.83, and 1.68, which is consistent with typical stock behavior characterized by random-walk dynamics.
D. Synthetic data: This dataset is generated from four components: (i) seasonality, modeled by a sine wave with random amplitude, phase, and frequency; (ii) trend, modeled by a random linear coefficient, either positive, negative, or null; (iii) noise, modeled by Gaussian white noise; and (iv) gain, a random scalar value that multiplies the entire series. The first three components are summed before the gain is applied. The dataset contains 500 time series with 60 time steps each.
These datasets have been selected because they contain all the problems stated in Section 1, namely, intermittence, high gains and variance, and non-linearities. This provides a challenging and realistic scenario for training and evaluating the performance of the model. In this study, the model was trained independently on each dataset, allowing us to assess its performance under distinct data characteristics without inter-dataset dependencies.
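The construction of the synthetic dataset can be sketched as follows. The component structure (sine seasonality, linear trend, Gaussian noise, multiplicative gain) and the dataset size (500 series of 60 steps) follow the description above, while every numerical range, the noise standard deviation, and the function name `make_series` are assumptions chosen only for illustration.

```python
import math
import random


def make_series(n_steps=60, rng=random):
    """Generate one synthetic series: gain * (seasonality + trend + noise)."""
    amplitude = rng.uniform(0.5, 2.0)           # assumed range
    phase = rng.uniform(0.0, 2.0 * math.pi)     # assumed range
    frequency = rng.uniform(1.0 / 24, 1.0 / 4)  # cycles per step, assumed range
    # Trend coefficient is negative, null, or positive.
    slope = rng.choice([-1.0, 0.0, 1.0]) * rng.uniform(0.0, 0.1)
    gain = rng.uniform(1.0, 100.0)              # assumed range
    series = []
    for t in range(n_steps):
        seasonality = amplitude * math.sin(2.0 * math.pi * frequency * t + phase)
        trend = slope * t
        noise = rng.gauss(0.0, 0.1)             # Gaussian white noise, assumed sigma
        series.append(gain * (seasonality + trend + noise))
    return series


rng = random.Random(0)  # fixed seed for reproducibility
dataset = [make_series(60, rng) for _ in range(500)]
```

Because the gain multiplies the whole series, two series with identical seasonal and trend shapes can differ in scale by orders of magnitude, which is the "high gains" property the dataset is meant to exercise.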

4.1. Predictive Models Results in the Selected Datasets

The results obtained from the evaluation of the predictive model in different datasets are presented in Table 2. The table shows the metrics (MAE, MSE, and SMAPE) for both Model A (without covariates) and Model B (with covariates), along with the p-value and the t-statistic obtained from the two-sample paired t-test. Since the data was scaled for training and inference, we also provided the normalized metrics for MAE and MSE. However, the p-value and t-statistic are the same due to the linearity of the transformation. To facilitate interpretation, Figure 5 provides a visual comparison of the models’ performance across datasets, highlighting relative improvements and trends that are not immediately apparent from the numerical values in the table. Also, Figure 6 shows a prediction output to better illustrate the proposed model’s behavior.
For the M5 dataset, Model B with covariates showed slightly lower MAE and MSE compared to Model A without covariates. However, the performance difference between the two models was not statistically significant, since the p-value is greater than 0.05. The t-statistic is negative, indicating that Model B performed slightly better than Model A on average. In contrast, the SMAPE metric showed a statistically significant improvement in performance for Model B. The relatively low p-value and the negative t-statistic suggest that including covariates in the model resulted in a more accurate percentage-based error estimation.
For the Stallion dataset, Model B with covariates exhibited slightly lower MAE and MSE compared to Model A without covariates. However, similar to the M5 dataset, the performance difference was not statistically significant, as indicated by the p-value greater than 0.05. The t-statistic is also close to zero, indicating that there is no significant difference between the two models. It is important to note that the SMAPE metric showed a notable improvement in performance for Model B. While the p-value for SMAPE is greater than 0.05, the relatively low value indicates that the difference might be significant with a larger dataset.
Similar to the Stallion and M5 datasets, the improvement in MAE and MSE for the Stock Market dataset was not statistically significant; for this dataset, the difference was negligible. In contrast, the SMAPE metric improved substantially for Model B.
The Synthetic dataset showed the most significant improvement when using covariates. Both MAE and SMAPE exhibited substantial improvements, with p-values close to zero, indicating high statistical significance. The negative t-statistics further support the conclusion that Model B significantly outperformed Model A. The large improvement in SMAPE is noteworthy, as it suggests that the inclusion of covariates allowed for a better representation of the relative percentage error (see Figure 5).

4.2. Comparison with Baseline Statistical Models

To provide a broader context for our results, we also evaluated the performance of the individual statistical models used as covariates. Table 3 presents the metric scores for ARIMA, ETS, and Linear Regression on each dataset. This comparison highlights that while our proposed hybrid model (Model B) does not achieve a significant p-value in every case, its main contribution is providing a robust framework that consistently improves upon the pure DNN approach (Model A) and its base forecasters. Figure 7 illustrates the distribution of performance metrics for both models, emphasizing patterns and variations that support the interpretation of the results.

4.3. Overall Implications

The results indicate that the inclusion of covariates in the predictive model had varying degrees of impact on performance, depending on the dataset and the evaluation metric chosen. In some cases, such as the Stock Market and Synthetic datasets, including covariates significantly improved the model’s performance, particularly on the SMAPE metric. However, for the M5 and Stallion datasets, the impact of covariates on MAE and MSE was not statistically significant.
The varying results could be attributed to several factors, such as the nature of the time series data, the quality and relevance of the selected statistical forecasters, and the size of the datasets. However, a crucial finding is that performance never degraded in any scenario due to the usage of statistical covariates. This robustness is likely due to the weighted average mechanism of the input branches (Figure 2). The model can learn to assign lower weights to, or effectively ignore, inputs that do not contribute to improving the predictive accuracy during training.
Another key observation is that the most significant improvements were observed with the SMAPE metric. This was not unexpected, since this is the metric most sensitive to outliers and relative errors; therefore, any minor difference in the outputs, especially for low-volume series, results in a large difference in the metric. Finally, we observed that the performance benefit of Model B over Model A was most prominent in the synthetic dataset, where the underlying patterns (trend, seasonality) are well-defined and can be effectively captured by the statistical models.
Therefore, based on the statistical analysis obtained from the paired t-tests, the null hypothesis (H0) was rejected for datasets with p-values ≤ 0.05, while for cases with 0.05 < p ≤ 0.20, H0 could be neither accepted nor rejected according to the adopted significance criterion. Overall, these results confirm that incorporating statistical covariates into the DNN architecture tends to improve forecasting accuracy compared to the baseline model without covariates.
These results align with the findings reported in the M4 and M5 forecasting competitions, where hybrid and ensemble strategies have shown superior generalization performance.

5. Conclusions

In this paper, we address the challenge of improving multi-horizon forecasting in scenarios with limited data and functional covariates. We proposed the usage of statistical forecasters as covariates in a DNN model and evaluated its impact on forecast metrics. Our analysis covered four diverse datasets: M5, Stallion, Stock Market, and Synthetic.
The results demonstrated that the inclusion of statistical forecasters as covariates in the DNN model led to varying degrees of improvement in forecast performance, depending on the dataset and evaluation metric. While the improvements in MAE and MSE were modest and not always statistically significant, the SMAPE metric showed substantial and often significant gains across all datasets. This suggests that the hybrid approach is particularly effective at improving the relative accuracy of forecasts, which is critical for intermittent and low-volume time series. Crucially, the proposed method (Model B) consistently outperformed the pure DNN model (Model A) and never resulted in a degradation in performance against its base statistical forecasters.
Finally, the main limitations of this work include computational cost, limited hyperparameter tuning, and dataset restrictions. Although incorporating statistical covariates increases the computational cost of training and inference, our results suggest that this trade-off can be worthwhile. Notably, in none of the tested scenarios did the model with covariates underperform compared to the baseline model. This robustness can be attributed to the model’s architecture, which dynamically learns to weight the contribution of each covariate. When a covariate does not add relevant information, its influence is minimized during training. As a result, even when performance gains in terms of MAE or MSE are modest, the use of covariates offers a more stable and reliable forecasting strategy—particularly in datasets characterized by intermittency or high variance.
In general, our findings suggest that the incorporation of statistical forecasters as covariates in a DNN model can be a valuable approach to improving multi-horizon forecasting, especially in scenarios with data scarcity and intermittence. However, the effectiveness of this approach may depend on the nature of the data and the specific forecasting task at hand.
Future research could extend this framework by incorporating additional statistical or probabilistic forecasters—such as Prophet or Bayesian models—within the proposed covariate-based architecture. Although recent studies have explored hybridizations involving advanced neural architectures like Temporal Fusion Transformers or N-HiTS, integrating these models under a unified covariate-driven framework and statistically evaluating their benefits remains an open direction.
Overall, the findings highlight that integrating statistical forecasters as covariates in deep neural models represents a robust and effective approach for improving multi-horizon forecasting, especially in environments characterized by limited or irregular data.

Author Contributions

Writing—original draft, R.O.K.; writing—review and editing, software, methodology, and validation, R.O.K. and W.D.P.; writing—review and editing, and supervision, W.D.P., R.L.S.D., A.M.d.R.F., D.C.G., and G.V.G. All authors have read and agreed to the published version of the manuscript.

Funding

The authors sincerely thank the Foundation for Research and Innovation of the State of Santa Catarina (FAPESC) for providing the Master’s scholarship that supported the research presented in this paper. Without the financial support of FAPESC, this work would not have been possible.

Informed Consent Statement

Not applicable.

Data Availability Statement

Available upon request to the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zuege, C.V.; Stefenon, S.F.; Yamaguchi, C.K.; Mariani, V.C.; Gonzalez, G.V.; dos Santos Coelho, L. Wind speed forecasting approach using conformal prediction and feature importance selection. Int. J. Electr. Power Energy Syst. 2025, 168, 110700. [Google Scholar] [CrossRef]
  2. Lim, B.; Zohren, S. Time-series forecasting with deep learning: A survey. Philos. Trans. R. Soc. A 2021, 379, 20200209. [Google Scholar] [CrossRef]
  3. Lopes, H.; Pires, I.M.; Sánchez San Blas, H.; García-Ovejero, R.; Leithardt, V. PriADA: Management and Adaptation of Information Based on Data Privacy in Public Environments. Computers 2020, 9, 77. [Google Scholar] [CrossRef]
  4. Kourentzes, N.; Athanasopoulos, G. Elucidate structure in intermittent demand series. Eur. J. Oper. Res. 2021, 288, 141–152. [Google Scholar] [CrossRef]
  5. Tian, X.; Wang, H.; Erjiang, E. Forecasting intermittent demand for inventory management by retailers: A new approach. J. Retail. Consum. Serv. 2021, 62, 102662. [Google Scholar] [CrossRef]
  6. Jain, G.; Mallick, B. A study of time series models ARIMA and ETS. SSRN Electron. J. 2017. [Google Scholar] [CrossRef]
  7. Hyndman, R.J.; Khandakar, Y. Automatic time series forecasting: The forecast package for R. J. Stat. Softw. 2008, 27, 1–22. [Google Scholar] [CrossRef]
  8. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018. [Google Scholar]
  9. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
  10. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  11. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  12. Chevalier, G. LARNN: Linear attention recurrent neural network. arXiv 2018, arXiv:1808.05578. [Google Scholar] [CrossRef]
  13. Bui, V.; Le, N.T.; Nguyen, V.H.; Kim, J.; Jang, Y.M. Multi-behavior with bottleneck features LSTM for load forecasting in building energy management system. Electronics 2021, 10, 1026. [Google Scholar] [CrossRef]
  14. da Silva, E.C.; Finardi, E.C.; Stefenon, S.F. Enhancing hydroelectric inflow prediction in the Brazilian power system: A comparative analysis of machine learning models and hyperparameter optimization for decision support. Electr. Power Syst. Res. 2024, 230, 110275. [Google Scholar] [CrossRef]
  15. Klaar, A.C.R.; Stefenon, S.F.; Seman, L.O.; Mariani, V.C.; Coelho, L.S. Optimized EWT-Seq2Seq-LSTM with attention mechanism to insulators fault prediction. Sensors 2023, 23, 3202. [Google Scholar] [CrossRef] [PubMed]
  16. Stefenon, S.F.; Seman, L.O.; da Silva, L.S.A.; Mariani, V.C.; dos Santos Coelho, L. Hypertuned temporal fusion transformer for multi-horizon time series forecasting of dam level in hydroelectric power plants. Int. J. Electr. Power Energy Syst. 2024, 157, 109876. [Google Scholar] [CrossRef]
  17. Aquino, L.S.; Seman, L.O.; Mariani, V.C.; Coelho, L.D.S.; Stefenon, S.F.; González, G.V. Spatiotemporal wind energy forecasting: A comprehensive survey and a deep equilibrium-based case study with StemGNN. IEEE Access 2025, 13, 131461–131482. [Google Scholar] [CrossRef]
  18. Gardner, M.W.; Dorling, S. Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences. Atmos. Environ. 1998, 32, 2627–2636. [Google Scholar] [CrossRef]
  19. Ranganathan, A. The levenberg-marquardt algorithm. Tutoral Algorithm 2004, 11, 101–110. [Google Scholar]
  20. Stefenon, S.F.; Seman, L.O.; Yamaguchi, C.K.; Coelho, L.D.S.; Mariani, V.C.; Matos-Carvalho, J.P.; Leithardt, V.R.Q. Neural Hierarchical Interpolation Time Series (NHITS) for Reservoir Level Multi-Horizon Forecasting in Hydroelectric Power Plants. IEEE Access 2025, 13, 54853–54865. [Google Scholar] [CrossRef]
  21. González-Sopeña, J.; Pakrashi, V.; Ghosh, B. An overview of performance evaluation metrics for short-term statistical wind power forecasting. Renew. Sustain. Energy Rev. 2021, 138, 110515. [Google Scholar] [CrossRef]
  22. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
  23. Branco, N.W.; Cavalca, M.S.M.; Stefenon, S.F.; Leithardt, V.R.Q. Wavelet LSTM for Fault Forecasting in Electrical Power Grids. Sensors 2022, 22, 8323. [Google Scholar] [CrossRef]
  24. Smyl, S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Int. J. Forecast. 2020, 36, 75–85. [Google Scholar] [CrossRef]
  25. Stefenon, S.F.; Kasburg, C.; Freire, R.Z.; Silva Ferreira, F.C.; Bertol, D.W.; Nied, A. Photovoltaic power forecasting using wavelet neuro-fuzzy for active solar trackers. J. Intell. Fuzzy Syst. 2021, 40, 1083–1096. [Google Scholar] [CrossRef]
  26. Seman, L.O.; Stefenon, S.F.; Mariani, V.C.; dos Santos Coelho, L. Ensemble learning methods using the Hodrick–Prescott filter for fault forecasting in insulators of the electrical power grids. Int. J. Electr. Power Energy Syst. 2023, 152, 109269. [Google Scholar] [CrossRef]
  27. Box, G.; Jenkins, G. Analysis: Forecasting and Control; Holden Day: San Francisco, CA, USA, 1976. [Google Scholar]
  28. Hyndman, R.; Koehler, A.B.; Ord, J.K.; Snyder, R.D. Forecasting with Exponential Smoothing: The State Space Approach; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  29. Croston, J.D. Forecasting and stock control for intermittent demands. J. Oper. Res. Soc. 1972, 23, 289–303. [Google Scholar] [CrossRef]
  30. Syntetos, A.A.; Boylan, J.E. The accuracy of intermittent demand estimates. Int. J. Forecast. 2005, 21, 303–314. [Google Scholar] [CrossRef]
  31. Panagiotelis, A.; Athanasopoulos, G.; Gamakumara, P.; Hyndman, R.J. Forecast reconciliation: A geometric view with new insights on bias correction. Int. J. Forecast. 2021, 37, 343–359. [Google Scholar] [CrossRef]
  32. Taylor, S.J.; Letham, B. Forecasting at scale. Am. Stat. 2018, 72, 37–45. [Google Scholar] [CrossRef]
  33. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 Competition: Results, findings, conclusion and way forward. Int. J. Forecast. 2018, 34, 802–808. [Google Scholar] [CrossRef]
  34. Makridakis, S.; Hyndman, R.J.; Petropoulos, F. Forecasting in social settings: The state of the art. Int. J. Forecast. 2020, 36, 15–28. [Google Scholar] [CrossRef]
  35. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M5 competition: Background, organization, and implementation. Int. J. Forecast. 2022, 38, 1325–1336. [Google Scholar] [CrossRef]
  36. Stefenon, S.F.; Seman, L.O.; Aquino, L.S.; dos Santos Coelho, L. Wavelet-Seq2Seq-LSTM with attention for time series forecasting of level of dams in hydroelectric power plants. Energy 2023, 274, 127350. [Google Scholar] [CrossRef]
  37. Khaldi, R.; El Afia, A.; Chiheb, R.; Tabik, S. What is the best RNN-cell structure to forecast each time series behavior? Expert Syst. Appl. 2023, 215, 119140. [Google Scholar] [CrossRef]
  38. Stefenon, S.F.; Silva, M.C.; Bertol, D.W.; Meyer, L.H.; Nied, A. Fault diagnosis of insulators from ultrasound detection using neural networks. J. Intell. Fuzzy Syst. 2019, 37, 6655–6664. [Google Scholar] [CrossRef]
  39. Stefenon, S.F.; Singh, G.; Yow, K.C.; Cimatti, A. Semi-ProtoPNet deep neural network for the classification of defective power grid distribution structures. Sensors 2022, 22, 4859. [Google Scholar] [CrossRef] [PubMed]
  40. Starke, L.; Hoppe, A.F.; Sartori, A.; Stefenon, S.F.; Santana, J.F.D.P.; Leithardt, V.R.Q. Interference recommendation for the pump sizing process in progressive cavity pumps using graph neural networks. Sci. Rep. 2023, 13, 16884. [Google Scholar] [CrossRef]
  41. Stefenon, S.F.; Seman, L.O.; Klaar, A.C.R.; Ovejero, R.G.; Leithardt, V.R.Q. Hypertuned-YOLO for interpretable distribution power grid fault location based on EigenCAM. Ain Shams Eng. J. 2024, 15, 102722. [Google Scholar] [CrossRef]
  42. Stefenon, S.F.; Seman, L.O.; Singh, G.; Yow, K.C. Enhanced insulator fault detection using optimized ensemble of deep learning models based on weighted boxes fusion. Int. J. Electr. Power Energy Syst. 2025, 168, 110682. [Google Scholar] [CrossRef]
  43. Salazar, L.H.A.; Leithardt, V.R.Q.; Parreira, W.D.; da Rocha Fernandes, A.M.; Barbosa, J.L.V.; Correia, S.D. Application of Machine Learning Techniques to Predict a Patient’s No-Show in the Healthcare Sector. Future Internet 2022, 14, 3. [Google Scholar] [CrossRef]
  44. Fernandes, F.; Stefenon, S.F.; Seman, L.O.; Nied, A.; Ferreira, F.C.S.; Subtil, M.C.M.; Klaar, A.C.R.; Leithardt, V.R.Q. Long short-term memory stacking model to predict the number of cases and deaths caused by COVID-19. J. Intell. Fuzzy Syst. 2022, 6, 6221–6234. [Google Scholar] [CrossRef]
  45. Vieira, J.C.; Sartori, A.; Stefenon, S.F.; Perez, F.L.; de Jesus, G.S.; Leithardt, V.R.Q. Low-Cost CNN for Automatic Violence Recognition on Embedded System. IEEE Access 2022, 10, 25190–25202. [Google Scholar] [CrossRef]
  46. Larcher, J.H.K.; Stefenon, S.F.; dos Santos Coelho, L.; Mariani, V.C. Enhanced multi-step streamflow series forecasting using hybrid signal decomposition and optimized reservoir computing models. Expert Syst. Appl. 2024, 255, 124856. [Google Scholar] [CrossRef]
  47. Ribeiro, M.H.D.M.; da Silva, R.G.; Moreno, S.R.; Canton, C.; Larcher, J.H.K.; Stefenon, S.F.; Mariani, V.C.; dos Santos Coelho, L. Variational mode decomposition and bagging extreme learning machine with multi-objective optimization for wind power forecasting. Appl. Intell. 2024, 54, 3119–3134. [Google Scholar] [CrossRef]
  48. Stefenon, S.F.; Seman, L.O.; Schutel Furtado Neto, C.; Nied, A.; Seganfredo, D.M.; Garcia da Luz, F.; Sabino, P.H.; Torreblanca González, J.; Quietinho Leithardt, V.R. Electric field evaluation using the finite element method and proxy models for the design of stator slots in a permanent magnet synchronous motor. Electronics 2020, 9, 1975. [Google Scholar] [CrossRef]
  49. Stefenon, S.F.; Cristoforetti, M.; Cimatti, A. Automatic digitalization of railway interlocking systems engineering drawings based on hybrid machine learning methods. Expert Syst. Appl. 2025, 281, 127532. [Google Scholar] [CrossRef]
  50. Stefenon, S.F.; Bruns, R.; Sartori, A.; Meyer, L.H.; Ovejero, R.G.; Leithardt, V.R.Q. Analysis of the ultrasonic signal in polymeric contaminated insulators through ensemble learning methods. IEEE Access 2022, 10, 33980–33991. [Google Scholar] [CrossRef]
  51. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the VIII International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  52. Challu, C.; Olivares, K.G.; Oreshkin, B.N.; Ramirez, F.G.; Canseco, M.M.; Dubrawski, A. NHITS: Neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 6989–6997. [Google Scholar]
  53. Corso, M.P.; Stefenon, S.F.; Singh, G.; Matsuo, M.V.; Perez, F.L.; Leithardt, V.R.Q. Evaluation of visible contamination on power grid insulators using convolutional neural networks. Electr. Eng. 2023, 105, 3881–3894. [Google Scholar] [CrossRef]
  54. Baldi, P.; Sadowski, P.J. Understanding dropout. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
  55. Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1050–1059. [Google Scholar]
  56. Stefenon, S.F.; Ribeiro, M.H.D.M.; Nied, A.; Mariani, V.C.; Coelho, L.S.; Leithardt, V.R.Q.; Silva, L.A.; Seman, L.O. Hybrid wavelet stacking ensemble model for insulators contamination forecasting. IEEE Access 2021, 9, 66387–66397. [Google Scholar] [CrossRef]
  57. Javeri, I.Y.; Toutiaee, M.; Arpinar, I.B.; Miller, J.A.; Miller, T.W. Improving Neural Networks for Time-Series Forecasting using Data Augmentation and AutoML. In Proceedings of the 2021 IEEE Seventh International Conference on Big Data Computing Service and Applications (BigDataService), Oxford, UK, 23–26 August 2021; pp. 1–8. [Google Scholar] [CrossRef]
  58. Stefenon, S.F.; Seman, L.O.; da Silva, E.C.; Finardi, E.C.; Coelho, L.d.S.; Mariani, V.C. Hypertuned wavelet convolutional neural network with long short-term memory for time series forecasting in hydroelectric power plants. Energy 2024, 313, 133918. [Google Scholar] [CrossRef]
  59. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  60. Garza, F.; Mergenthaler Canseco, M.; Challú, C.; Olivares, K.G. StatsForecast: Lightning Fast Forecasting with Statistical and Econometric Models; PyCon: Salt Lake City, UT, USA, 2022. [Google Scholar]
  61. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
  62. Zhang, X.; Xu, M.; Li, Y.; Su, M.; Xu, Z.; Wang, C.; Kang, D.; Li, H.; Mu, X.; Ding, X.; et al. Automated multi-model deep neural network for sleep stage scoring with unfiltered clinical data. Sleep Breath. 2020, 24, 581–590. [Google Scholar] [CrossRef] [PubMed]
  63. Dubey, A.K.; Jain, V. Comparative study of convolution neural network’s relu and leaky-relu activation functions. In Applications of Computing, Automation and Wireless Systems in Electrical Engineering: Proceedings of MARC 2018; Springer: Berlin/Heidelberg, Germany, 2019; pp. 873–880. [Google Scholar]
  64. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  65. Borré, A.; Seman, L.O.; Camponogara, E.; Stefenon, S.F.; Mariani, V.C.; Coelho, L.S. Machine fault detection using a hybrid CNN-LSTM attention-based model. Sensors 2023, 23, 4512. [Google Scholar] [CrossRef] [PubMed]
  66. dos Santos, G.H.; Seman, L.O.; Bezerra, E.A.; Leithardt, V.R.Q.; Mendes, A.S.; Stefenon, S.F. Static attitude determination using convolutional neural networks. Sensors 2021, 21, 6419. [Google Scholar] [CrossRef]
  67. Nagi, J.; Ducatelle, F.; Di Caro, G.A.; Cireşan, D.; Meier, U.; Giusti, A.; Nagi, F.; Schmidhuber, J.; Gambardella, L.M. Max-pooling convolutional neural networks for vision-based hand gesture recognition. In Proceedings of the 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), Kuala Lumpur, Malaysia, 16–18 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 342–347. [Google Scholar]
  68. Makridakis, S. Accuracy measures: Theoretical and practical concerns. Int. J. Forecast. 1993, 9, 527–529. [Google Scholar] [CrossRef]
  69. Gustriansyah, R.; Ermatita, E.; Rini, D.P. An approach for sales forecasting. Expert Syst. Appl. 2022, 207, 118043. [Google Scholar] [CrossRef]
  70. Montgomery, D.C.; Runger, G.C. Applied Statistics and Probability for Engineers; John Wiley & Sons: Hoboken, NJ, USA, 2020. [Google Scholar]
  71. Wasserstein, R.L.; Lazar, N.A. The ASA statement on p-values: Context, process, and purpose. Am. Stat. 2016, 70, 129–133. [Google Scholar] [CrossRef]
  72. Goodman, S.N. Toward evidence-based medical statistics. 1: The P value fallacy. Ann. Intern. Med. 1999, 130, 995–1004. [Google Scholar] [CrossRef]
  73. Dixon, P. The p-value fallacy and how to avoid it. Can. J. Exp. Psychol. 2003, 57, 189. [Google Scholar] [CrossRef]
  74. Bertolaccini, L.; Viti, A.; Terzi, A. Are the fallacies of the P value finally ended? J. Thorac. Dis. 2016, 8, 1067. [Google Scholar] [CrossRef][Green Version]
Figure 1. One-step-ahead prediction distribution illustrating predictive uncertainty. Source: Authors’ own elaboration.
Figure 2. Model Architecture with n input branches. In this study, we use n + 1 = 4 branches: one for the original series and three for the statistical forecasters. Source: Authors’ own elaboration.
Figure 3. LSTM cell diagram, with omitted bias component to improve readability. Source: Authors’ own elaboration.
Figure 4. Workflow of the hypothesis testing procedure. Two deep neural network configurations were compared: Model A, which uses only the original time series, and Model B, which incorporates statistical forecasts (ARIMA, ETS, and Linear Regression) as covariates. Source: Authors’ own elaboration.
Figure 5. Comparison of Model A (DNN only) and Model B (DNN with covariates) across all datasets. Lower values indicate better performance. Source: Authors’ own elaboration.
Figure 6. Model B prediction on unseen data for the M5 dataset. The shaded area indicates the model’s uncertainty, modeled with dropout at inference. Source: Authors’ own elaboration.
Figure 7. Comparison of the proposed model against its covariate forecasters. Lower values indicate better performance. Source: Authors’ own elaboration.
Table 1. An overview of the metrics.

| Metric | Sensitivity to Outliers | Explainability | Interpretability |
|---|---|---|---|
| MAE | Low | Medium | Easy |
| MSE | Medium | Hard | Easy |
| SMAPE | High | Easy | Hard |
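As an illustration of the three metrics summarized in Table 1, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy series, and the SMAPE variant (absolute error divided by the mean of the absolute actual and predicted values, so values range from 0 to 2) are assumptions made for this example, as is the convention that a step where both values are zero contributes zero error.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: squaring weights large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Symmetric MAPE (assumed variant, bounded between 0 and 2)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        # Assumption: when actual and forecast are both zero, the step
        # counts as a perfect prediction instead of dividing by zero.
        total += 0.0 if denom == 0 else abs(t - p) / denom
    return total / len(y_true)


# Hypothetical toy series; the third step mimics an intermittent (zero) demand.
y_true = [10.0, 12.0, 0.0, 8.0]
y_clean = [11.0, 11.0, 0.0, 9.0]     # small errors everywhere
y_outlier = [11.0, 11.0, 0.0, 28.0]  # one large miss on the last step

for name, y_pred in [("clean", y_clean), ("outlier", y_outlier)]:
    print(name, mae(y_true, y_pred), mse(y_true, y_pred), smape(y_true, y_pred))
```

On this toy data, a single large miss raises MAE and MSE by very different factors, which is one way to see why the choice of metric changes how two models compare.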
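Table 2. Summary of Predictive Model Evaluation Results Across Various Datasets.

| Dataset | Metric | Model A | Model B | p-Value | t-Statistic |
|---|---|---|---|---|---|
| M5 | MAE | 15.71 | 15.12 | 0.22710 | 1.20 |
| M5 | nMAE | 0.262 | 0.252 | -- | -- |
| M5 | MSE | 1.41 × 10³ | 1.19 × 10³ | 0.15500 | 1.47 |
| M5 | nRMSE | 0.627 | 0.573 | -- | -- |
| M5 | SMAPE | 0.31 | 0.29 | 0.00710 | 2.69 |
| Stallion | MAE | 222.27 | 220.06 | 0.97100 | 0.04 |
| Stallion | nMAE | 0.186 | 0.184 | -- | -- |
| Stallion | MSE | 2.76 × 10³ | 2.51 × 10³ | 0.85800 | 0.17 |
| Stallion | nRMSE | 0.045 | 0.043 | -- | -- |
| Stallion | SMAPE | 0.78 | 0.63 | 0.16100 | 1.41 |
| Stock Market | MAE | 776.92 | 774.14 | 0.99610 | 0.01 |
| Stock Market | nMAE | 0.174 | 0.172 | -- | -- |
| Stock Market | MSE | 3.51 × 10³ | 3.53 × 10³ | 0.99650 | 0.01 |
| Stock Market | nRMSE | 0.013 | 0.013 | -- | -- |
| Stock Market | SMAPE | 0.44 | 0.29 | 0.00002 | 4.62 |
| Synthetic | MAE | 18.49 | 17.51 | 0.02890 | 2.21 |
| Synthetic | nMAE | 0.370 | 0.351 | -- | -- |
| Synthetic | MSE | 1.17 × 10³ | 1.09 × 10³ | 0.18800 | 1.34 |
| Synthetic | nRMSE | 0.682 | 0.659 | -- | -- |
| Synthetic | SMAPE | 0.82 | 0.55 | 0.00000 | 5.31 |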
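The relative SMAPE reductions quoted in the abstract can be checked against the SMAPE rows of Table 2 with a short calculation. The sketch below is only a worked arithmetic example: the dictionary and function names are hypothetical, and the reduction is assumed to be measured relative to Model A, that is, (A − B) / A.

```python
# SMAPE values copied from Table 2: (Model A, Model B) per dataset.
smape_table2 = {
    "M5": (0.31, 0.29),
    "Stallion": (0.78, 0.63),
    "Stock Market": (0.44, 0.29),
    "Synthetic": (0.82, 0.55),
}


def relative_reduction(model_a: float, model_b: float) -> float:
    """Fraction by which Model B lowers the metric relative to Model A."""
    return (model_a - model_b) / model_a


reductions = {
    name: relative_reduction(a, b) for name, (a, b) in smape_table2.items()
}

for name, r in reductions.items():
    print(f"{name}: SMAPE reduced by {100 * r:.1f}%")
```

Under that assumption the reductions come out at roughly 6% for M5, 19% for Stallion, 34% for Stock Market, and 33% for Synthetic, in line with the figures stated in the abstract.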
Table 3. Performance of Model B versus Covariate Forecasters (ARIMA, ETS, and LR) Across Datasets and Metrics.

| Dataset | Metric | Model B | ARIMA | ETS | LR |
|---|---|---|---|---|---|
| M5 | MAE | 15.12 | 21.57 | 20.97 | 18.22 |
| M5 | MSE | 1.19 × 10³ | 3.34 × 10³ | 2.81 × 10³ | 2.11 × 10³ |
| M5 | SMAPE | 0.29 | 0.39 | 0.40 | 0.34 |
| Stallion | MAE | 220.06 | 348.20 | 301.40 | 300.50 |
| Stallion | MSE | 2.51 × 10³ | 3.66 × 10³ | 2.54 × 10³ | 2.52 × 10³ |
| Stallion | SMAPE | 0.63 | 0.65 | 0.65 | 0.64 |
| Stock Market | MAE | 774.14 | 730.70 | 810.60 | 790.20 |
| Stock Market | MSE | 3.53 × 10³ | 3.81 × 10³ | 3.91 × 10³ | 3.61 × 10³ |
| Stock Market | SMAPE | 0.29 | 0.33 | 0.31 | 0.32 |
| Synthetic | MAE | 17.51 | 18.04 | 19.88 | 17.79 |
| Synthetic | MSE | 1.09 × 10³ | 1.21 × 10³ | 1.38 × 10³ | 1.16 × 10³ |
| Synthetic | SMAPE | 0.55 | 0.59 | 0.63 | 0.57 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Klehm, R.O.; Parreira, W.D.; Dazzi, R.L.S.; Fernandes, A.M.d.R.; García, D.C.; González, G.V. A Hybrid Prediction Model Using Statistical Forecasters and Deep Neural Networks. Appl. Sci. 2025, 15, 12393. https://doi.org/10.3390/app152312393
