Article

Investigating the Accuracy of Autoregressive Recurrent Networks Using Hierarchical Aggregation Structure-Based Data Partitioning

by José Manuel Oliveira 1,2,† and Patrícia Ramos 2,3,*,†
1 Faculty of Economics, University of Porto, rua Dr. Roberto Frias, 4200-464 Porto, Portugal
2 Institute for Systems and Computer Engineering, Technology and Science, rua Dr. Roberto Frias, 4200-465 Porto, Portugal
3 CEOS.PP, ISCAP, Polytechnic of Porto, rua Jaime Lopes Amorim s/n, 4465-004 São Mamede de Infesta, Portugal
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Big Data Cogn. Comput. 2023, 7(2), 100; https://doi.org/10.3390/bdcc7020100
Submission received: 10 April 2023 / Revised: 8 May 2023 / Accepted: 16 May 2023 / Published: 18 May 2023

Abstract:
Global models have been developed to tackle the challenge of forecasting sets of series that are related or share similarities, but they have not been developed for heterogeneous datasets. Various methods of partitioning by relatedness have been introduced to enhance the similarities of sets, resulting in improved forecasting accuracy but often at the cost of a reduced sample size, which could be harmful. To shed light on how the relatedness between series impacts the effectiveness of global models in real-world demand-forecasting problems, we perform an extensive empirical study using the M5 competition dataset. We examine cross-learning scenarios driven by the product hierarchy commonly employed in retail planning to allow global models to capture interdependencies across products and regions more effectively. Our findings show that global models outperform state-of-the-art local benchmarks by a considerable margin, indicating that they are not inherently more limited than local models and can handle unrelated time-series data effectively. The accuracy of data-partitioning approaches increases as the sizes of the data pools and the models’ complexity decrease. However, there is a trade-off between data availability and data relatedness. Smaller data pools lead to increased similarity among time series, making it easier to capture cross-product and cross-region dependencies, but this comes at the cost of a reduced sample, which may not be beneficial. Finally, it is worth noting that the successful implementation of global models for heterogeneous datasets can significantly impact forecasting practice.

1. Introduction

Sales forecasts at the SKU (stock-keeping unit) level are essential for effective inventory management, production planning, pricing and promotional strategies, and sales performance tracking [1]. SKUs represent individual products or product variants within a larger product line. By forecasting sales at the SKU level, businesses can optimize their inventory levels to ensure they have enough stock on hand to meet demand without overstocking and tying up capital. This helps to reduce inventory holding costs and avoid stockouts, which can result in lost sales and dissatisfied customers [2]. SKU-level sales forecasts can also help businesses plan their production schedules and ensure they have enough raw materials and resources to meet demand. This can help reduce production downtime and minimize waste and inefficiencies. SKU-level sales forecasts can help businesses determine the optimal pricing and promotional strategies for each SKU. For example, if a particular SKU is expected to have high demand, a business may choose to increase the price to maximize profit margins. Alternatively, if a SKU is not selling as well as expected, a business may choose to offer discounts or promotions to stimulate sales. SKU-level sales forecasts also allow businesses to track the performance of individual products and identify trends and patterns in consumer behaviour. This can help businesses make data-driven decisions and adjust their strategies accordingly [3].
Retailers typically offer a vast range of products, from perishable items such as fresh produce to non-perishable goods such as electronics and clothing. Each product has distinct demand patterns that may differ based on location, time, day of the week, season, and promotional events. Forecasting sales for each of these items can be a daunting and complicated task, particularly since retailers often sell products through multiple channels, including physical stores, online platforms, mobile apps, and marketplaces, each with its own set of difficulties and opportunities that must be considered when forecasting sales. Additionally, in the retail sector, demand forecasting is a regular occurrence, often performed weekly or daily, to ensure optimal inventory levels. As a result, advanced models and techniques are necessary to tackle the forecasting problem, which must be automated to reduce manual intervention, robust to handle various data types and scenarios, and scalable to accommodate large data volumes and changing business requirements [4].

1.1. Local versus Global Forecasting Models

For decades, the prevailing approach in time-series forecasting has been to view each time series as a standalone dataset [5,6]. This has led to the use of localized forecasting techniques that treat each series individually and make predictions based solely on the statistical patterns observed in that series. The Exponential Smoothing State Space Model (ETS) [7] and Auto-Regressive Integrated Moving Average Model (ARIMA) [8] are notable examples of such methods. While these approaches have been widely used and have produced useful results in many cases, they have their limitations [9]. Currently, businesses often collect vast amounts of time-series data from similar sources on a regular basis. For example, retailers may collect data on the sales of thousands of different products, manufacturers may collect data on machine measurements for predictive maintenance, and utility companies may gather data on smart-meter readings across many households. While traditional local forecasting techniques can still be used to make predictions in these situations, they may not be able to fully exploit the potential for learning patterns across multiple time series. This has led to a paradigm shift in forecasting, where instead of treating each individual time series separately, a set of series is seen as a dataset [10].
A global forecasting model (GFM) has the same set of parameters, such as weights in the case of a neural network [11], for all the time series (all time series in the dataset are forecast using the same function), in contrast to a local model, which has a unique set of parameters for each individual series. This means that the global model takes into account the interdependencies between the variables across the entire dataset, whereas local models focus only on the statistical properties of each individual series. In the retail industry, it is possible to capture cross-product and cross-region dependencies, which can result in more-accurate forecasts across the entire range of products. When we talk about cross-product dependencies, we are referring to the connection between different products. Alterations in one product can have an impact on the demand or performance of another product. For instance, if two products are complementary or substitutable, changes in the sales of one product can affect the sales of the other. Conversely, the demand for a particular product may exhibit a similar pattern for all varieties, brands, or packaging options in various stores. Cross-region dependencies refer to the link between different regions or locations. Changes in one region, such as fluctuations in economic conditions or weather patterns, may have an effect on the demand or performance in another region. Global forecasting models, typically built using advanced machine learning techniques such as deep learning and artificial neural networks, are gaining popularity, as seen in the works of [12,13,14,15], and have outperformed local models in various prestigious forecasting competitions such as the M4 [16,17] and the recent M5 [18,19,20], as well as those held on the Kaggle platform with a forecasting purpose [21]. In summary, the recent paradigm shift in forecasting recognizes that analysing multiple time series together as a dataset can yield significant improvements in accuracy and provide valuable insights into underlying patterns and trends. This shift has opened up new opportunities for businesses to leverage machine learning and other advanced techniques to gain a competitive advantage in forecasting and decision making. However, there are still many challenges to overcome, such as the need for skilled data scientists, significant amounts of data and time for training the models, and sufficient computational and data infrastructures. Additionally, to promote the adoption and sustained usage of GFMs in practice, it is essential to have expertise within the organization, along with model transparency and intelligibility, which are crucial attributes for establishing user trust.

1.2. Relatedness between Time Series

The aforementioned successful studies are based on the assumption that GFMs are effective because there exists a relationship between the series (all hypothetically come from similar data-generating processes), enabling the model to recognize complex patterns shared across them. Nevertheless, none of these studies endeavours to elucidate or establish the characteristics of this relationship. Some research has connected high levels of relatedness between series with greater similarity in their shapes or patterns and stronger cross-correlation [22,23], while other studies have suggested that higher relatedness corresponds to greater similarity in the extracted features of the series being examined [24].
Montero-Manso and Hyndman’s recent work [9] is the first to provide insights into this area. Their research demonstrates that it is always possible to find a GFM capable of performing just as well or even better than a set of local statistical benchmarks for any dataset, regardless of its heterogeneity. This implies that GFMs are not inherently more restricted than local models and can perform well even if the series are unrelated. Due to the utilization of more data, global models can be more complex than local ones (without suffering from overfitting) while still achieving better generalization. Montero-Manso and Hyndman suggest that the complexity of global models can be achieved by increasing the memory/order of autoregression, using non-linear/non-parametric methods, and employing data partitioning. The authors provide empirical evidence of their findings through the use of real-world datasets.
Hewamalage et al. [25] aimed to investigate the factors that influence GFM performance by simulating various datasets with controlled characteristics, including the homogeneity/heterogeneity of series, pattern complexity, forecasting model complexity, and series number/length. Their results reveal that relatedness has a strong connection with other factors, including data availability, data complexity, and the complexity of the forecasting approach adopted, when it comes to GFM performance. Furthermore, in challenging forecasting situations, such as those involving short or heterogeneous series and limited prior knowledge of data patterns, GFMs’ complex non-linear modelling capabilities make them a competitive option.
Rajapaksha et al. [26] recently introduced a novel local model-agnostic interpretability approach to address the lack of interpretability in GFMs. The approach employs statistical forecasting techniques to explain the global model forecast of a specific time series using interpretable components such as trend, seasonality, coefficients, and other model attributes. This is achieved by defining a locally defined neighbourhood, which can be done through either bootstrapping or model fitting. The authors conducted experiments on various benchmark datasets to evaluate the effectiveness of this framework. They evaluated the results both quantitatively and qualitatively and found that the two approaches proposed in the framework provided comprehensible explanations that accurately approximated the global model forecast.
Nevertheless, most major real-world datasets are, by nature, heterogeneous and include series that are clearly unrelated; the M4 forecasting competition dataset, for example, is a broad mix of unaligned time series across many different domains [25].

1.3. Model Complexity

Kolmogorov’s theory [27] explains the concept of complexity, which can be technically described as follows. We begin by establishing a syntax for expressing all computable functions, which could be an enumeration of all Turing machines or a list of syntactically correct programs in a universal programming language such as Java, Lisp, or C. From there, we define the Kolmogorov complexity of a finite binary string (every object can be coded as a string over a finite alphabet—say, the binary alphabet) as the length of the shortest Turing machine, Java program, etc., in the chosen syntax that produces that string. Thus, each finite string is assigned a positive integer as its Kolmogorov complexity through this syntax. Ultimately, the Kolmogorov complexity of a finite string represents the length of its most-compressed version and the amount of information (in the form of bits) contained within it. Although Kolmogorov complexity is theoretically incomputable [28], recent research by Cilibrasi and Vitanyi [29] has demonstrated that it can be approximated using the decompressor of modern real-world compression techniques. This approximation involves determining the length of a minimum and efficient description of an object that can be produced by a lossless compressor. As a result, to estimate the complexity of our models in this experiment, we rely on the size of their gzip compressions, which are considered very efficient and are widely used. If the output file of a model can be compressed to a very small size, it suggests that the information contained within it is relatively simple and structured and can be easily described using a small amount of information. This would indicate that the model is relatively simple. Conversely, if the output file of a model is difficult to compress and requires a large amount of storage space, this suggests that the information contained within it is more complex and is structured in a way that cannot be easily reduced. This indicates that the model is more complex. It is worth noting that this approach to measuring the algorithmic complexity of models may depend on the data used, but since all models in our experiment are based on the same data, we do not factor the data into the compression.
The number of parameters in a model can also be a useful heuristic for measuring the model’s complexity [30]. Each parameter represents a degree of freedom that the model has in order to capture patterns in the data. The more parameters a model has, the more complex its function can be, and the more flexible it is to fit a wide range of training data patterns. Deep learning models differ structurally from traditional machine learning models and have significantly more parameters. These models are consistently over-parametrised, implying that they contain more parameters than the optimal solutions and training samples. Nonetheless, research has demonstrated that extensively over-parametrised neural networks often show strong generalization capabilities. In fact, several studies suggest that larger and more-complex networks generally achieve superior generalization performance [31].
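To make both heuristics concrete, the sketch below (ours; the file name and serialization format are assumptions) gzip-compresses a serialized PyTorch model to approximate its Kolmogorov complexity and counts its trainable parameters.

```python
import gzip
import torch


def compressed_model_size(model: torch.nn.Module, path: str = "model.pt") -> int:
    """Approximate complexity: size in bytes of the gzip-compressed serialized model."""
    torch.save(model.state_dict(), path)          # serialize the learned weights to disk
    with open(path, "rb") as f:
        compressed = gzip.compress(f.read())      # lossless compression of the model file
    return len(compressed)                        # CMS in bytes


def number_of_parameters(model: torch.nn.Module) -> int:
    """Count the trainable parameters (degrees of freedom) of the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```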

1.4. Key Contributions

Despite all of the aforementioned efforts, there has been a lack of research on how the relatedness between series impacts the effectiveness of GFMs in real-world demand-forecasting problems, especially when dealing with challenging conditions such as the highly lumpy or intermittent data very common in retail. The research conducted in this study was driven precisely by this motivation: to investigate the cross-learning scenarios driven by the product hierarchy commonly employed in retail planning that enable global models to better capture interdependencies across products and regions. We provide the following contributions that help understand the potential and applicability of global models in real-world scenarios:
  • Our study investigates possible dataset partitioning scenarios, inspired by the hierarchical aggregation structure of the data, that have the potential to more effectively capture inter-dependencies across regions and products. To achieve this, we utilize a prominent deep learning forecasting model that has demonstrated success in numerous time-series applications due to its ability to extract features from high-dimensional inputs.
  • We evaluate the heterogeneity of the dataset by examining the similarity of the time-series features that we deem crucial for accurate forecasting. Some features, which are deliberately crafted, prove especially valuable for intermittent data.
  • In order to gauge the complexity of our models during the experiment, we offer two quantitative indicators: the count of parameters contained within the models and the compressibility of their output files as determined by Kolmogorov complexity.
  • A comprehensive evaluation of the forecast accuracy achieved by the global models of the various partitioning approaches and local benchmarks using two error measures is presented. These measures are also used to perform tests on the statistical significance of any reported differences.
  • The empirical results we obtained provide modelling guidelines that are easy for both retailers and software suppliers to implement regarding the trade-off between data availability and data relatedness.
The layout of the remainder of this paper is as follows. Section 2 describes our forecasting framework developed for the evaluation of the cross-learning approaches, and Section 3 provides the details about its implementation. Section 4 presents and discusses the results obtained, and Section 5 provides some concluding remarks and promising areas for further research.

2. Forecasting Models

Due to the impressive accomplishments of deep learning in computer vision, its implementation has extended to several areas, including natural language processing and robot control, making it a popular choice in the machine learning domain. Despite being a significant application of machine learning, the progress of using deep learning in time-series forecasting has been relatively slower compared to other areas. Moreover, the lack of a well-defined experimental protocol makes its comparison with other forecasting methods difficult. Given that deep learning has demonstrated superior performance compared to other approaches in multiple domains when trained on large datasets, we were confident that it could be effective in the current context. However, few studies have focused on deep learning approaches for intermittent demand [32]. Forecasting intermittent data involves dealing with sequences that have sporadic values [33]. This is a complex task, as it entails making predictions based on irregular observations over time and a significant number of zero values. We selected DeepAR, which is an autoregressive recurrent neural network (RNN) model that was introduced by Amazon in 2018 [23]. DeepAR is a prominent deep learning forecasting model that has demonstrated success in several time-series applications.

2.1. DeepAR Model

Formally, denoting the value of item i at time t by z_{i,t}, the goal of DeepAR is to predict the conditional probability P of future sales z_{i,t_0:T} based on past sales z_{i,1:t_0-1} and covariates x_{i,1:T}, where t_0 and T are, respectively, the first and last time points of the future
P(z_{i,t_0:T} \mid z_{i,1:t_0-1}, x_{i,1:T}).   (1)
Note that the time index t is relative, i.e., t = 1 may not correspond to the first time point of the time series. During training, z_{i,t} is available in both time ranges [1, t_0 - 1] and [t_0, T], known respectively as the conditioning range and the prediction range (corresponding to the encoder and decoder in a sequence-to-sequence model), but during inference, z_{i,t} is not available in the prediction range. The network output at time t can be expressed as
h_{i,t} = h(h_{i,t-1}, z_{i,t-1}, x_{i,t}; \Theta),   (2)
where h is a function that is implemented by a multi-layer RNN with long short-term memory (LSTM) cells [34] parameterised by \Theta. The model is autoregressive in the sense that it uses the sales value at the previous time step z_{i,t-1} as an input, and recurrent in the sense that the previous network output h_{i,t-1} is fed back as an input at the next time step. During training, given a batch of N items \{z_{i,1:T}\}_{i=1,\ldots,N} and corresponding covariates \{x_{i,1:T}\}_{i=1,\ldots,N}, the model parameters are learned by maximizing the log-likelihood of a fixed probability distribution as follows
\mathcal{L} = \sum_{i=1}^{N} \sum_{t=t_0}^{T} \log \ell(z_{i,t} \mid \theta(h_{i,t})),   (3)
where \theta denotes a linear mapping from the network output h_{i,t} to the distribution’s parameters, while \ell represents the likelihood of the distribution. Since the encoder model is the same as the decoder, DeepAR uses the whole time range [0, T] to calculate this loss (i.e., t_0 = 0 in Equation (3)). DeepAR is designed to predict a 1-step-ahead value. To forecast multiple future steps at inference time, the model repeatedly generates forecasts for the next period until the end of the forecast horizon. Initially, the model is fed with past sequences (t < t_0), and the forecast of the first period is generated by drawing samples from the trained probability distribution. The forecast of the first period is then used as an input to the model for generating the forecast of the second period, and so on for each subsequent period. As the forecast is based on past samples from the predicted distribution, the model’s output is probabilistic rather than deterministic, and it represents a distribution of sampled sequences. This sampling process is advantageous as it generates a probability distribution of forecasts, which can be used to evaluate the accuracy of the forecasts.
To address the issue of zero-inflated distribution in sales demands, we employed the negative log-likelihood of the Tweedie distribution for the loss function. The Tweedie distribution is a family of probability distributions that is characterized by two parameters: the power parameter, denoted as p, and the dispersion parameter, denoted as ϕ. The probability density function of the Tweedie distribution is defined as:
f ( y ; μ , ϕ , p ) = y p 1 exp y μ 1 p ϕ ( 1 p ) ϕ ( 1 p ) y p Γ 1 1 p , y > 0 ,
where μ is the mean parameter of the distribution, Γ is the gamma function, and p and ϕ are positive parameters. When 1 < p < 2, the Tweedie distribution is a compound Poisson–gamma distribution, which is commonly used to model data with a large number of zeros and positive skewness. The dispersion parameter ϕ controls the degree of variability or heterogeneity in the data: since the variance grows with ϕ, a small value of ϕ indicates low variability or homogeneity in the data, while a large value of ϕ corresponds to highly variable or dispersed data.
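In practice, the part of the Tweedie negative log-likelihood that depends on the mean is what gets minimized; a sketch of this commonly used form for 1 < p < 2 (our illustration, with the dispersion ϕ treated as a constant and dropped) is shown below.

```python
import torch


def tweedie_nll(y: torch.Tensor, mu: torch.Tensor, p: float = 1.5, eps: float = 1e-8) -> torch.Tensor:
    """Tweedie negative log-likelihood (terms independent of mu dropped), valid for 1 < p < 2.

    y  : observed demand (zero-inflated, non-negative)
    mu : predicted mean, must be strictly positive
    p  : power parameter of the Tweedie family
    """
    mu = torch.clamp(mu, min=eps)                     # guard against zero predictions
    term1 = -y * torch.pow(mu, 1.0 - p) / (1.0 - p)   # contribution of the observations
    term2 = torch.pow(mu, 2.0 - p) / (2.0 - p)        # normalizing contribution of the mean
    return (term1 + term2).mean()
```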
Our implementation of the DeepAR models used the PyTorch AI framework [35] with the DeepAREstimator method from the GluonTS Python library [36].
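Purely for illustration, a minimal sketch of such a setup with a recent GluonTS version is shown below; the data and hyperparameter values are placeholders, and the output distribution is left at the library default, whereas the Tweedie likelihood described above would be plugged in as a custom distribution output.

```python
from gluonts.dataset.common import ListDataset
from gluonts.torch import DeepAREstimator

# One entry per SKU-store series in the data pool (toy values for illustration).
train_ds = ListDataset(
    [{"start": "2011-01-29", "target": [0, 3, 0, 1, 2, 0, 5] * 100}],
    freq="D",
)

estimator = DeepAREstimator(
    freq="D",
    prediction_length=28,        # M5 forecast horizon
    context_length=56,           # length of the conditioning range (assumed value)
    num_layers=2,                # number of LSTM layers (tuned hyperparameter)
    hidden_size=40,              # LSTM cells per layer (tuned hyperparameter)
    dropout_rate=0.1,
    trainer_kwargs={"max_epochs": 10},
)

predictor = estimator.train(training_data=train_ds)
forecasts = list(predictor.predict(train_ds))   # probabilistic sample paths per series
```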

2.2. Benchmarks

Benchmarks are used to evaluate the performance of forecasting models by providing a standard against which the models can be compared [37]. By using benchmarks, researchers and practitioners can objectively assess the forecasting accuracy of different models and identify which model performs best for a given forecasting task. Comparing the accuracy of a forecasting model against a benchmark provides a baseline measure of its performance and helps to identify the added value of the model. The two most-commonly utilized models for time-series forecasting are Exponential Smoothing and ARIMA (AutoRegressive Integrated Moving Average). These benchmark models are good references for evaluating the forecasting performance of more-complex models. They provide a baseline for comparison and help to identify whether a more-complex model is justified based on its added accuracy on the benchmark. The seasonal naïve method can be very effective at capturing the seasonal pattern of a time series and is also frequently adopted as a benchmark to compare against more complex models.

2.2.1. ARIMA Models

The seasonal ARIMA model, denoted as ARIMA(p, d, q) × (P, D, Q)_m, can be written as:
\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\,\eta_t = c + \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t,
\phi_p(B) = 1 - \phi_1 B - \cdots - \phi_p B^p, \quad \Phi_P(B^m) = 1 - \Phi_1 B^m - \cdots - \Phi_P B^{Pm},
\theta_q(B) = 1 + \theta_1 B + \cdots + \theta_q B^q, \quad \Theta_Q(B^m) = 1 + \Theta_1 B^m + \cdots + \Theta_Q B^{Qm},
where \eta_t is the target time series, m is the seasonal period, D and d are the degrees of seasonal and ordinary differencing, respectively, B is the backward shift operator, \phi_p(B) and \theta_q(B) are the regular autoregressive and moving-average polynomials of orders p and q, respectively, \Phi_P(B^m) and \Theta_Q(B^m) are the seasonal autoregressive and moving-average polynomials of orders P and Q, respectively, c = \mu (1 - \phi_1 - \cdots - \phi_p)(1 - \Phi_1 - \cdots - \Phi_P), where \mu is the mean of (1-B)^d (1-B^m)^D \eta_t, and \varepsilon_t is a white-noise series (i.e., serially uncorrelated with zero mean and constant variance). Stationarity and invertibility conditions imply that the zeros of the polynomials \phi_p(B), \Phi_P(B^m), \theta_q(B), and \Theta_Q(B^m) must all lie outside the unit circle. Non-stationary time series can be made stationary by applying transformations such as logarithms to stabilise the variance and by taking proper degrees of differencing to stabilise the mean. After specifying values for p, q, P, and Q, the parameters of the model c, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \Phi_1, \ldots, \Phi_P, \Theta_1, \ldots, \Theta_Q can be estimated by maximising the log-likelihood. Akaike’s Information Criterion (AIC), which is based on the log-likelihood and on a regularization term (that includes the number of parameters in the model) to compensate for potential overfitting, can be used to determine the values of p, q, P, and Q. To implement the ARIMA models, we used the AutoARIMA function from the StatsForecast Python library [38], which is a mirror of Hyndman’s [39] auto.arima function in the forecast package of the R programming language.
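For illustration, a minimal usage sketch of AutoARIMA from the StatsForecast library is given below; the series identifier and sales values are toy placeholders.

```python
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

# Long-format frame expected by StatsForecast: one row per (series, day).
df = pd.DataFrame({
    "unique_id": ["FOODS_3_090_CA_3"] * 10,
    "ds": pd.date_range("2016-05-13", periods=10, freq="D"),
    "y": [3, 0, 2, 5, 0, 1, 4, 2, 0, 3],
})

sf = StatsForecast(models=[AutoARIMA(season_length=7)], freq="D")
forecasts = sf.forecast(df=df, h=28)   # automatic order selection per series, 28-day horizon
```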

2.2.2. Exponential Smoothing Models

Exponential smoothing models comprise a measurement (or observation) equation and one or several state equations. The measurement equation describes the relationship between the time series and its states or components, i.e., the level, the trend, and the seasonality. The state equations express how the components evolve over time [7,40]. The components can interact with themselves in an additive (A) or multiplicative (M) manner; an additive damped trend (Ad) or multiplicative damped trend (Md) is also possible. For each model, an additive or multiplicative error term can be considered. Each component is updated by the error process, which is the amount of change controlled by the smoothing parameter. For more details, the reader is referred to [7] and [41]. The existence of a consistent multiplicative effect on sales led us to use a logarithm transformation and, consequently, to adopt only linear exponential smoothing models. Table 1 presents the equations for these models in the state-space modelling framework: y_t is the time-series observation in period t, l_t is the local level in period t, b_t is the local trend in period t, s_t is the local seasonality in period t, and m is the seasonal frequency; \alpha, \beta, \gamma, and \phi are the smoothing parameters, and \varepsilon_t is the error term, usually assumed to be normally and independently distributed with mean 0 and variance \sigma^2, i.e., \varepsilon_t \sim NID(0, \sigma^2). To implement the exponential smoothing models, we used the AutoETS function from the StatsForecast Python library [38], which is a mirror of Hyndman’s [7] ets function in the forecast package of the R programming language.

2.2.3. Seasonal Naïve

The Seasonal Naïve model is a simple time-series forecasting model that assumes the future value of a series will be equal to the last observed value from the same season. It can be formulated as follows:
\hat{y}_t = y_{t-m},
where \hat{y}_t is the forecast value of the series at time t, y_{t-m} is the last observed value from the same season (m periods ago), and m is the number of periods in a season (e.g., seven for daily data with weekly seasonality).
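AutoETS and SeasonalNaive are available in the same library, so the three benchmarks can be fitted side by side; a sketch reusing the long-format frame df from the previous example (again with assumed placeholder data):

```python
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, SeasonalNaive

models = [
    AutoARIMA(season_length=7),      # automatic ARIMA order selection
    AutoETS(season_length=7),        # automatic exponential smoothing selection
    SeasonalNaive(season_length=7),  # y_hat_t = y_{t-7}
]
sf = StatsForecast(models=models, freq="D")
benchmark_forecasts = sf.forecast(df=df, h=28)   # df as in the previous sketch
```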

3. Empirical Setup

In this section, we present experimental scenarios that use hierarchical aggregation structure-based data partitioning to investigate quantitatively how the relatedness between series impacts the effectiveness of GFMs and their complexity.

3.1. Dataset

To ensure the significance of a study’s findings, it is crucial that it can be reproduced and compared with other relevant studies. Therefore, in this study, we used the M5 competition’s well-established and openly accessible dataset, which is widely recognized as a benchmark for the development and evaluation of time-series forecasting models. The M5 dataset is a large time-series set consisting of sales data for Walmart stores in the United States. The dataset was released in 2020 as part of the M5 forecasting competition, which was organized by the University of Nicosia and sponsored by Kaggle [19]. The M5 dataset includes daily sales data for 3049 products and spans a period of 5 years, from 29 January 2011 to 19 June 2016 (1969 days). The dataset is organized hierarchically, with products being grouped into states, stores, categories, and departments. The 3049 products were sold across ten different stores located in three states of the USA: California (CA), Texas (TX), and Wisconsin (WI). California has four stores (CA1, CA2, CA3, and CA4), while Texas and Wisconsin have three stores each (TX1, TX2, TX3 and WI1, WI2, WI3). For every store, the products are classified into three main categories: Household, Hobbies, and Foods. These categories are further divided into specific departments. Specifically, the Household and Hobbies categories are each subdivided into two departments (Household1, Household2 and Hobbies1, Hobbies2), while the Foods category is subdivided into three departments (Foods1, Foods2, and Foods3). The main goal of the M5 competition was to develop accurate sales forecasts for the last 28 days, from 23 May 2016 to 19 June 2016. The M5 dataset has become a standard reference due to its challenging properties, including high dimensionality, hierarchical structure, and intermittent demand patterns (i.e., many products have zero sales on some days).
A dataset is commonly regarded as heterogeneous when it comprises time series that exhibit different patterns, such as seasonality, trend, and cycles, and, conceivably, distinct types of information [9]. Therefore, heterogeneity is often associated with unrelatedness [25]. Our examination of the heterogeneity in the M5 dataset and assessment of the relatedness among its time series followed the methodology proposed by [25], which involved comparing the similarity of the time-series features. Similar to Kang et al.’s methodology [42], we applied Principal Component Analysis (PCA) [43] to decrease the feature dimensionality and depicted the similarity of the time-series features using a 2-D plot. Furthermore, we also identified a set of critical features that significantly impact the forecastability of a series, namely:
  • Spectral entropy (Entropy) to measure forecastability;
  • Strength of trend (Trend) to measure the strength of the trend;
  • Strength of seasonality (Seasonality) to measure the strength of the seasonality;
  • First-order autocorrelation (ACF1) to measure the first-order autocorrelation;
  • Optimal Box–Cox transformation parameter (Box–Cox) to measure the variance stability;
  • Ratio between the number of non-zero observations and the total number of observations (Non-zero demand) to measure the proportion of non-zero demand;
  • Ratio between the number of changes between zero and non-zero observations and the total number of observations (Changes) to measure the proportion of status changes from zero to non-zero demand.
The R programming language’s feasts package [44] was used to calculate time-series features using the features function. Additionally, we utilized the PCA function from the FactoMineR package [45] in the R programming language to conduct principal component analyses. Figure 1 shows the 2-D plot of the M5 dataset’s time-series features selected after applying principal component analysis. As expected, the time-series features of the M5 dataset show a scattered distribution in the 2-D space, indicating dissimilarity among them. This dissimilarity is an indicator of the dataset’s heterogeneity regarding those features, suggesting that we are examining a broad range of series within a single dataset.
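The feature extraction and PCA were carried out with the feasts and FactoMineR R packages; purely as an illustration of the idea (not the pipeline used in the paper), the two intermittency features and the 2-D projection could be computed in Python as follows, where the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def intermittency_features(y: np.ndarray) -> tuple[float, float]:
    """Proportion of non-zero demand and proportion of zero/non-zero status changes."""
    non_zero = float(np.mean(y != 0))
    # Approximate ratio of status changes using the n-1 consecutive pairs of observations.
    changes = float(np.mean(np.diff((y != 0).astype(int)) != 0))
    return non_zero, changes


# feature_matrix: one row per series, one column per feature (entropy, trend, ..., changes).
feature_matrix = np.random.rand(30490, 7)          # placeholder for the real feature table
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(feature_matrix))
# scores[:, 0] and scores[:, 1] give the 2-D coordinates of the kind plotted in Figure 1.
```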

3.2. Data Pools

The approach used in the presented framework employs partial pooling and is inspired by the hierarchical structure of Walmart. The multi-level data provided are used to prepare five distinct levels of data, including total, state, store, category, and department, as well as four cross-levels of data, including state–category, state–department, store–category, and store–department. Data pools are then obtained for each level and cross-level. The total pool comprises the entire M5 dataset, which consists of 30,490 time series. At the state level, there are three data pools corresponding to the three states (CA, TX, and WI). CA has 12,196 time series, while TX and WI have 9147 time series each. The store level has ten data pools, including four stores in California (CA1, CA2, CA3, and CA4) and three stores in both Texas and Wisconsin (TX1, TX2, TX3, and WI1, WI2, WI3), each with 3049 time series. The category level has three data pools corresponding to the three distinct categories: Household, Hobbies, and Foods, each with a different number of time series (10,470 for Household, 5650 for Hobbies, and 14,370 for Foods). The department level has seven data pools, consisting of three departments for the Foods category (Foods1, Foods2, and Foods3) and two departments each for the Household and Hobbies categories. The number of time series in each department ranges from 1490 to 8230. The state–category cross-level consists of nine data pools, which result from crossing the three states with the three categories. For instance, CA–Foods contains the products from the Foods category that are available in CA stores. The number of time series in the state–category pools ranges from 1695 to 5748. Similarly, the state–department cross-level comprises 21 data pools that arise from the combination of the three states with the seven departments. For example, CA–Foods3 includes the products from the Foods3 department that are sold in CA stores. The number of time series in the state–department pools varies from 447 to 3292. The store–category cross-level has 30 data pools generated by crossing the ten stores with the three categories. For example, CA3–Foods includes the products from the Foods category that are sold in CA3 store. The number of time series in the store–category pools ranges from 565 to 1437. Lastly, the store–department cross-level has 70 data pools that arise from the combination of the ten stores with the seven departments. For instance, CA3–Foods3 comprises the products from the Foods3 department that are sold in CA3 store. The number of time series in the store–department pools ranges from 149 to 823. All this information is provided in Appendix A.
It is noteworthy that we examined all feasible combinations of partial pools from the multi-level data available. We expect that as the sizes of the data pools decrease and the relatedness of the time series within them increases, the global models’ performance will improve, while their complexity will decrease. It is expected that the cross-learning scenarios developed, driven by the product hierarchy employed by the retailer, will result in improved global models that can capture interdependencies among products and regions more effectively. By utilizing data pools at the state and store levels, it may be possible to better understand cross-region dependencies and the impact of demographic, cultural, economic, and weather conditions on demand. Additionally, category and department data pools have the potential to uncover cross-product dependencies and improve the relationships between similar and complementary products. This partitioning method is simpler to implement than the current literature-based clustering methods that rely on feature extraction to identify similarities among the examined series.
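To illustrate how the data pools can be built, the sketch below (ours; it assumes the column names of the public M5 file sales_train_evaluation.csv) forms the store–department cross-level pools; the other levels and cross-levels follow by grouping on different key columns.

```python
import pandas as pd

# sales_train_evaluation.csv: one row per series with keys item_id, dept_id, cat_id,
# store_id, state_id and daily sales columns d_1 ... d_1941.
sales = pd.read_csv("sales_train_evaluation.csv")

# Store-department cross-level: 10 stores x 7 departments = 70 data pools.
store_dept_pools = {
    key: group for key, group in sales.groupby(["store_id", "dept_id"])
}

# Example: the CA3-Foods3 pool mentioned in the text (823 series).
ca3_foods3 = store_dept_pools[("CA_3", "FOODS_3")]
```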

3.3. Model Selection

A DeepAR model was trained using all the time series available in each data pool, regardless of any potential heterogeneity. For instance, a DeepAR model was trained for each state, namely CA, TX, and WI, making a total of three different models for the state level. Similarly, one DeepAR model was trained for each store, resulting in ten distinct DeepAR models for the store level, and so forth. Moreover, in the case of the total pool, only one DeepAR model was trained, using the entire M5 dataset, which consists of 30,490 time series. As a result, a total of 154 separate DeepAR models were trained, with each data pool having one model. Although complete pooling, which involves using a single forecasting model for the entire dataset, can capture interdependencies among products and regions, partial pooling, which uses a separate forecasting model for each pool, is often better suited for capturing the unique characteristics of each group.
We followed the structure of the M5 competition, which kept the last 28 days of each time series as the testing set for out-of-sample evaluation (23 May 2016 to 19 June 2016), while using the remaining data (29 January 2011 to 22 May 2016, 1941 days) for training the models. It is essential to find the appropriate model that can perform well during testing in order to achieve the highest possible level of accuracy. Typically, a validation set is employed to choose the most-suitable model. The effectiveness of a deep learning model largely depends on various factors such as hyperparameters and initial weights. To select the best model, the last 28 days of in-sample training from 25 April 2016 to 22 May 2016 were used for validation. The hyperparameters and their respective ranges that were utilized in model selection are presented in Table 2. The Optuna optimization framework [46] was used to carry out the hyperparameter optimization process by utilizing the Root Mean Squared Error (RMSE) [4] as the accuracy metric for model selection. For both ARIMA and ETS local benchmarks, a model was chosen for each time series using the AICc value, resulting in a total of 30,490 models.
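A minimal sketch of how such a search can be set up with Optuna is given below; the ranges are illustrative rather than the exact ones in Table 2, and train_deepar_on_pool and validation_rmse are hypothetical helpers standing in for model training and validation on the 25 April to 22 May 2016 window.

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    # Search space: illustrative ranges, not the exact ones listed in Table 2.
    params = {
        "num_layers": trial.suggest_int("num_layers", 1, 4),
        "hidden_size": trial.suggest_int("hidden_size", 20, 80),
        "dropout_rate": trial.suggest_float("dropout_rate", 0.0, 0.3),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
    }
    model = train_deepar_on_pool(params)   # hypothetical: fit DeepAR on one data pool
    return validation_rmse(model)          # hypothetical: RMSE on the validation window


study = optuna.create_study(direction="minimize")   # lower RMSE is better
study.optimize(objective, n_trials=50)
best_params = study.best_params
```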

3.4. Model Complexity

Data partitioning based on relatedness enhances a dataset’s similarities, making it easier for a model to identify complex patterns that are shared across time series, thereby reducing the model’s complexity. Therefore, it is essential to have heuristics that can estimate the model’s complexity. As discussed in Section 1.3, one way to do this is by counting the number of parameters (NP) in the model of the data pool and measuring the size of the gzip compression (CMS-compressed model size) of its output file, expressed in bytes. Each parameter represents a degree of freedom that the model has to capture patterns in the data. The more parameters a model has, the more complex and flexible it is to fit a wide range of training data patterns. A model’s output file can be compressed to a small size if the information contained within it is relatively simple, indicating that the model is simple. Conversely, if the output file is difficult to compress and requires significant storage space, this suggests that the information contained within it is more complex, indicating that the model is more complex. To obtain the total number of parameters (TNP) for each partitioning approach, we added up the number of parameters (NP) in the model for each of its data pools. Similarly, we calculated the total compressed model size (TCMS) in bytes by summing the sizes of the gzip output file of the model for each of its data pools.
Additionally, it should be noted that the complexity of a learned model is affected not only by its architecture but also by factors such as the distribution and complexity of the data, as well as the amount of information available. With this in mind, we also computed the weighted average number of parameters (WNP) and the weighted average compressed model size (WCMS) per model for each partitioning approach, as shown below.
\mathrm{WNP} = \frac{1}{ds} \sum_{i=1}^{n} ps_i \times \mathrm{NP}_i,
\mathrm{WCMS} = \frac{1}{ds} \sum_{i=1}^{n} ps_i \times \mathrm{CMS}_i,
where ds is the dataset size (number of time series), n is the number of data pools of the partitioning approach, and ps_i is the size of data pool i.
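For illustration, the totals and weighted averages can be computed directly from the pool sizes and per-pool complexity measures; the sketch below uses the state-level values reported in Table A1.

```python
import numpy as np

# State-level partitioning as an example (values from Table A1).
pool_sizes = np.array([12196, 9147, 9147])       # ps_i: series per data pool (CA, TX, WI)
pool_np = np.array([293523, 82603, 204603])      # NP_i: parameters of each pool's model
pool_cms = np.array([1103829, 310970, 763572])   # CMS_i: compressed model size in bytes

ds = pool_sizes.sum()                            # dataset size (30,490 series for M5)
tnp, tcms = pool_np.sum(), pool_cms.sum()        # TNP and TCMS of the partitioning approach
wnp = (pool_sizes * pool_np).sum() / ds          # weighted average number of parameters
wcms = (pool_sizes * pool_cms).sum() / ds        # weighted average compressed model size
```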
A conservative estimate for the number of parameters in both ARIMA and ETS local benchmark models was considered. For ARIMA, we assumed a maximum of 16 parameters, based on the highest possible orders for the autoregression and moving-average polynomials (p = 5, q = 5, P = 2, Q = 2) as well as the variance of the residuals. In the case of ETS, we estimated a maximum of 14 parameters per model by taking into account the number of smoothing parameters (\alpha, \beta, \gamma, and \phi), initial states (l_0, b_0, s_0, \ldots, s_{-6}), and the variance of the residuals. The TNP for both ARIMA and ETS models was calculated by multiplying the number of separate models (30,490 in total in this case study) by 16 and 14, respectively. As a result, the WNP per model for ARIMA and ETS are 16 and 14, respectively. To obtain the TCMS in bytes for these benchmark models, the sizes of the gzip output file for each individual model were added together. The WCMS per model can be calculated by dividing the TCMS by the number of models.

3.5. Evaluation Metrics

The performance of global and local models was evaluated with respect to two performance measures commonly found in the literature related to forecasting [47], namely the average of the Mean Absolute Scaled Error (MASE) and the average of the Root Mean Squared Scaled Error (RMSSE):
\mathrm{MASE}_i = \frac{\frac{1}{h}\sum_{t=n+1}^{n+h} \left| z_{i,t} - \hat{z}_{i,t} \right|}{\frac{1}{n-1}\sum_{t=2}^{n} \left| z_{i,t} - z_{i,t-1} \right|},
\mathrm{RMSSE}_i = \sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h} \left( z_{i,t} - \hat{z}_{i,t} \right)^2}{\frac{1}{n-1}\sum_{t=2}^{n} \left( z_{i,t} - z_{i,t-1} \right)^2}},
where z_{i,t} is the value of item i at time t, \hat{z}_{i,t} is the corresponding forecast, n is the length of the in-sample training, and h is the forecast horizon (28 days in this case study). RMSSE was employed to measure the accuracy of point forecasts in the M5 competition [18]. MASE and RMSSE are both scale-independent measures that can be used to compare forecasts across multiple products with different scales and units. This is achieved by scaling the forecast errors using the Mean Absolute Error (MAE) or Mean Squared Error (MSE) of the 1-step-ahead in-sample naive forecast errors in order to match the absolute or quadratic loss of the numerator. The use of squared errors favours forecasts that closely follow the mean of the target series, while the use of absolute errors favours forecasts that closely follow the median of the target series, thereby focusing on the structure of the data.
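Both measures translate directly into code; a sketch for a single series (ours) is shown below.

```python
import numpy as np


def mase(y_train: np.ndarray, y_test: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean Absolute Scaled Error: MAE of the forecasts scaled by the in-sample naive MAE."""
    scale = np.mean(np.abs(np.diff(y_train)))    # 1-step-ahead naive errors on training data
    return np.mean(np.abs(y_test - y_hat)) / scale


def rmsse(y_train: np.ndarray, y_test: np.ndarray, y_hat: np.ndarray) -> float:
    """Root Mean Squared Scaled Error: RMSE of the forecasts scaled by the in-sample naive RMSE."""
    scale = np.mean(np.diff(y_train) ** 2)       # squared 1-step-ahead naive errors
    return np.sqrt(np.mean((y_test - y_hat) ** 2) / scale)
```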

3.6. Statistical Significance of Models’ Differences

The MASE and RMSSE errors can be used to determine whether there are any statistically significant differences in the models’ performance. First, a Friedman test is performed to determine if at least one model performs significantly differently. Then, the post-hoc Nemenyi test [48] is used to group models with similar performance. Both of these tests are nonparametric, meaning that the distribution of the performance metric is not a concern. The post-hoc Nemenyi test ranks the performance of the models for each time series and calculates the mean of those ranks to produce confidence bounds. If the confidence bounds of different models overlap, their performances are not statistically different, and it can only be determined which method has a higher or lower mean rank. If the confidence bounds do not intersect, the difference in performance is statistically significant. The nemenyi() function in the R package tsutils [49] was used to implement these tests, and a significance level of α = 0.05 was employed for all tests.
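The tests themselves were run with the nemenyi() function from the R package tsutils; purely as an illustration of the first step, the Friedman test and the mean ranks can be computed on a series-by-models error matrix as follows (placeholder data), after which the Nemenyi bounds would be derived from the mean ranks.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# errors: one row per time series, one column per model (MASE or RMSSE values).
errors = np.random.rand(30490, 12)                  # placeholder error matrix

# Friedman test: each column (model) is one sample measured over the same series.
stat, p_value = friedmanchisquare(*[errors[:, j] for j in range(errors.shape[1])])

# Mean rank per model across all series (ignoring ties), as used for the Nemenyi plot.
mean_ranks = errors.argsort(axis=1).argsort(axis=1).mean(axis=0) + 1
```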

4. Results and Discussion

In this section, a comprehensive examination of the results achieved by the DeepAR global models of the various partitioning approaches and local benchmarks is presented. In addition to evaluating the forecast accuracy using MASE and RMSSE, a comparison of the complexities of the models is also provided. The results of the empirical study are presented in Table 3 and Appendix A. Table 3 includes the percentage difference of each partitioning approach and local benchmark from DeepAR-Total in terms of MASE and RMSSE. This comparison aims to evaluate the enhancement achieved by partial pooling using the hierarchical structure of the data. Furthermore, Appendix A exhibits tables that show the percentage difference of every data-pool model from the most-outstanding one within its aggregation level based on MASE and RMSSE. It is important to note that the results presented in these tables are ranked by MASE in each aggregation level. Table 3 highlights the most-effective data-partitioning approach in boldface within the MASE and RMSSE columns. In the field of forecasting, it is common to use forecast averaging as a complementary approach to using multiple models. Numerous studies have demonstrated the effectiveness of averaging the forecasts generated by individual models to enhance the accuracy of forecasts. Based on this idea, we computed the arithmetic mean of forecasts generated by the various partitioning approaches that were developed from the available data pools and denoted this as DeepAR-Comb.
The results presented in Table 3 show that the data-partitioning approaches exhibit significantly better performance than the state-of-the-art local benchmarks. This finding suggests that global models are not inherently more limited than local models and can perform well even on unrelated time-series data. In other words, global models can increase model complexity compared to local models due to their generality. They can afford to be more complex than local models because they generalize better.
Overall, the partitioning approaches outperform DeepAR-Total across all levels of aggregation. DeepAR-State–Department achieves the highest performance according to MASE, while DeepAR-Comb performs best based on RMSSE (which can be explained by the use of RMSE as an accuracy metric for model selection).
Generally, the accuracy of data-partitioning approaches improves as the sizes of the data pools decrease. This can be attributed to the increased similarity among the time series in smaller pools, making it easier to capture cross-product and cross-region dependencies. As a result, models with lower complexity are needed when the data become less heterogeneous. We have observed that both the weighted average number of parameters (WNP) and the weighted average compressed model size (WCMS) decrease accordingly per model. As anticipated, the application of ARIMA and ETS to each time series individually leads to substantially lower WNP and WCMS values than those obtained with global models used in the data-partitioning approaches. The global models tend to be over-parameterised, with a higher number of parameters than training samples. In the case of WNP, the difference is four orders of magnitude higher, while in the case of WCMS, it is three orders higher.
We have observed that the performance gain of the partitioning approaches over DeepAR-Total is not significant, with an improvement of less than 1% based on RMSSE and up to 3.6% based on MASE. It is noteworthy that the DeepAR-State-Department approach, which uses only 21 data pools, outperforms the other approaches with a higher number of data pools (namely 30 and 70). This suggests that there is a trade-off between data availability and relatedness, where data partitioning can improve the relatedness and similarities between time series by increasing homogeneity. This allows for more-effective capture of the distinct characteristics of the set but at the cost of a reduced sample size, which has been proven to be harmful. Therefore, the primary goal should be to optimize this trade-off. Notably, in addition to achieving the highest forecasting accuracy, the DeepAR-State-Department approach exhibits the lowest weighted average number of parameters (WNP) and weighted average compressed model size (WCMS) per model.
By referring to Appendix A, it can be observed that the DeepAR models associated with the Foods category or Foods1, Foods2, and Foods3 departments generally outperform the models of other categories/departments (see Figure A3, Figure A4, Figure A5, Figure A6, Figure A7 and Figure A8). This could be attributed to the higher proportion of non-zero demand (ratio between the number of non-zero observations and the total number of observations) in these data pools. However, it is not possible to establish a direct relationship between data homogeneity and model accuracy due to the different sizes of the data pools (with the exception of the Store level) and the different time series included in each data pool at each aggregation level.
Figure 2 presents the mean rank of the global and local models and the post-hoc Nemenyi test results at a 5% significance level for MASE and RMSSE errors, enabling a more-effective comparison of their performance. The forecasts are arranged by their mean rank, with their respective ranks provided alongside their names. The top-performing forecasts are located at the bottom of the plot. The variation in the ranks between Table 3 and Figure 2 can be explained by the distribution of the forecast errors. The mean rank is non-parametric, making it robust to outlying errors.
Once again, we have observed that global models outperform local benchmarks. Based on the MASE errors, there is no significant difference between ARIMA and ETS. In addition, DeepAR-State is grouped together with DeepAR-Store-Category and DeepAR-Store, while DeepAR-Comb does not differ from DeepAR-Store-Department and DeepAR-State-Category. The DeepAR-State-Department approach is ranked first and exhibits significant statistical differences from all other approaches. In a similar manner, there is evidence of significant differences among the other four models (DeepAR-Department, DeepAR-Category, DeepAR-Total, and Seasonal Naïve). With regard to the RMSSE, there is no evidence of statistically significant differences between DeepAR-Department, DeepAR-Total, DeepAR-Store, DeepAR-State-Category, DeepAR-Store-Category, and DeepAR-Store-Department. Likewise, DeepAR-State-Department is grouped together with DeepAR-Category and DeepAR-State, ranking on top. The remaining four models exhibit significant differences.

5. Conclusions

Retailers typically provide a wide range of merchandise, spanning from perishable products such as fresh produce to non-perishable items such as electronics and clothing. Each of these products exhibits unique demand patterns that can differ based on several factors, including location, time, day of the week, season, and promotional events. Forecasting sales for each product can be a daunting and complex undertaking, particularly given that retailers often sell through multiple channels, including physical stores, online platforms, mobile apps, and marketplaces. Furthermore, in the retail industry, demand forecasting is a routine task that is frequently conducted on a weekly or daily basis to maintain optimal inventory levels. Consequently, advanced models and techniques are required to address the forecasting challenge. These models must be automated to minimize manual intervention, robust enough to handle various data types and scenarios, and scalable to handle vast amounts of data and changing business conditions.
GFMs have shown superior performance to local state-of-the-art benchmarks in prestigious forecasting competitions such as the M4 and M5, as well as those on Kaggle with a forecasting purpose. The success of GFMs is based on the assumption that they are effective if there is a relationship between the time series in the dataset, but there are no established guidelines in the literature to define the characteristics of this relationship. Some studies suggest that higher relatedness between series corresponds to greater similarity in the extracted features, while others connect high relatedness with stronger cross-correlation and similarity in shapes or patterns.
To understand how relatedness impacts GFMs’ effectiveness in real-world demand forecasting, especially in challenging conditions such as highly lumpy or intermittent data, we conducted an extensive empirical study using the M5 competition dataset. We explored cross-learning scenarios driven by the product hierarchy, common in retail planning, to allow global models to capture interdependencies across products and regions more effectively.
Our findings demonstrate that global models outperform state-of-the-art local benchmarks by a significant margin, indicating their effectiveness even with unrelated time-series data. We also conclude that data-partitioning-approach accuracy improves as the sizes of data pools and model complexity decrease. However, there is a trade-off between data availability and data relatedness. Smaller data pools increase the similarity among time series, making it easier to capture cross-product and cross-region dependencies but at the cost of reduced information, which is not always beneficial.
Lastly, it is worth noting that the successful implementation of GFMs for heterogeneous datasets will significantly impact forecasting practice in the near future. It would be intriguing for future research to investigate additional deep learning models and assess their forecasting performance in comparison to the DeepAR model.

Author Contributions

Conceptualization, J.M.O. and P.R.; methodology, J.M.O. and P.R.; software, J.M.O. and P.R.; validation, J.M.O. and P.R.; formal analysis, J.M.O. and P.R.; investigation, J.M.O. and P.R.; resources, J.M.O. and P.R.; data curation, J.M.O. and P.R.; writing—original draft preparation, J.M.O. and P.R.; writing—review and editing, J.M.O. and P.R.; visualization, J.M.O. and P.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analysed in this study. These data can be found here: https://www.kaggle.com/competitions/m5-forecasting-accuracy/data (accessed on 12 December 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Performance of data-pool models evaluated with respect to MASE and RMSSE. Model complexity estimated by NP (number of parameters) and CMS (compressed model size). Percentages are differences from the best-performing data pool within the same aggregation level.

| Aggregation Level | Data Pool | No. of Time Series | MASE | MASE Diff. | RMSSE | RMSSE Diff. | NP | CMS (Bytes) |
|---|---|---|---|---|---|---|---|---|
| Total (1) | Total | 30,490 | 0.572 | | 0.7824 | | 204,603 | 776,553 |
| State (3) | CA | 12,196 | 0.545 | | 0.7945 | 5.88% | 293,523 | 1,103,829 |
| | TX | 9147 | 0.566 | 3.78% | 0.7504 | | 82,603 | 310,970 |
| | WI | 9147 | 0.588 | 7.81% | 0.7923 | 5.59% | 204,603 | 763,572 |
| Store (10) | CA3 | 3049 | 0.442 | | 0.7715 | 7.14% | 398,443 | 1,485,258 |
| | TX2 | 3049 | 0.484 | 9.50% | 0.7201 | | 45,483 | 172,918 |
| | CA1 | 3049 | 0.487 | 10.21% | 0.7585 | 5.33% | 409,683 | 1,526,328 |
| | CA2 | 3049 | 0.542 | 22.75% | 0.8560 | 18.87% | 204,603 | 764,226 |
| | WI1 | 3049 | 0.567 | 28.34% | 0.7949 | 10.38% | 131,683 | 493,603 |
| | WI3 | 3049 | 0.594 | 34.39% | 0.7698 | 6.89% | 398,443 | 1,484,911 |
| | TX1 | 3049 | 0.595 | 34.76% | 0.7537 | 4.66% | 28,003 | 107,798 |
| | TX3 | 3049 | 0.608 | 37.59% | 0.7805 | 8.39% | 33,843 | 129,442 |
| | WI2 | 3049 | 0.613 | 38.73% | 0.8221 | 14.16% | 177,363 | 660,158 |
| | CA4 | 3049 | 0.673 | 52.42% | 0.7969 | 10.66% | 293,523 | 1,097,343 |
| Category (3) | Foods | 14,370 | 0.431 | | 0.7910 | 4.80% | 556,363 | 2,135,537 |
| | Household | 10,470 | 0.670 | 55.62% | 0.7812 | 3.50% | 74,763 | 281,886 |
| | Hobbies | 5650 | 0.715 | 65.97% | 0.7548 | | 46,963 | 178,284 |
| Department (7) | Foods3 | 8230 | 0.404 | | 0.8006 | 7.15% | 285,403 | 1,063,266 |
| | Foods1 | 2160 | 0.435 | 7.46% | 0.7924 | 6.05% | 8923 | 36,872 |
| | Foods2 | 3980 | 0.477 | 17.86% | 0.7815 | 4.59% | 177,363 | 662,457 |
| | Household1 | 5320 | 0.496 | 22.55% | 0.7742 | 3.61% | 285,403 | 1,064,539 |
| | Hobbies1 | 4160 | 0.655 | 61.97% | 0.7472 | | 177,363 | 663,062 |
| | Household2 | 5150 | 0.832 | 105.76% | 0.7858 | 5.16% | 285,403 | 1,051,087 |
| | Hobbies2 | 1490 | 0.837 | 106.99% | 0.7645 | 2.32% | 74,763 | 280,484 |
| State-Category (9) | CA-Foods | 5748 | 0.397 | | 0.8067 | 8.97% | 45,483 | 172,929 |
| | TX-Foods | 4311 | 0.429 | 7.98% | 0.7508 | 1.42% | 183,523 | 685,556 |
| | WI-Foods | 4311 | 0.433 | 8.95% | 0.8118 | 9.66% | 556,363 | 2,071,918 |
| | TX-Household | 3141 | 0.630 | 58.72% | 0.7487 | 1.13% | 8923 | 36,834 |
| | CA-Household | 4188 | 0.639 | 61.02% | 0.7951 | 7.40% | 46,963 | 178,210 |
| | CA-Hobbies | 2260 | 0.677 | 70.45% | 0.7597 | 2.62% | 28,003 | 107,975 |
| | TX-Hobbies | 1695 | 0.681 | 71.57% | 0.7498 | 1.28% | 16,203 | 63,598 |
| | WI-Hobbies | 1695 | 0.708 | 78.38% | 0.7403 | | 45,483 | 171,773 |
| | WI-Household | 3141 | 0.742 | 86.80% | 0.7988 | 7.90% | 409,683 | 1,527,057 |
| State-Department (21) | CA-Foods3 | 3292 | 0.371 | | 0.8159 | 13.73% | 131,683 | 493,415 |
| | WI-Foods3 | 2469 | 0.413 | 11.21% | 0.8163 | 13.78% | 240,523 | 895,494 |
| | TX-Foods3 | 2469 | 0.416 | 12.07% | 0.7505 | 4.62% | 293,523 | 1,095,219 |
| | TX-Household1 | 1596 | 0.430 | 16.02% | 0.7174 | | 61,203 | 232,020 |
| | TX-Foods1 | 648 | 0.431 | 16.04% | 0.7669 | 6.90% | 61,203 | 228,161 |
| | CA-Foods1 | 864 | 0.438 | 18.17% | 0.8305 | 15.77% | 33,843 | 128,967 |
| | WI-Foods2 | 1194 | 0.442 | 19.13% | 0.8012 | 11.69% | 131,683 | 491,591 |
| | CA-Foods2 | 1592 | 0.477 | 28.53% | 0.7814 | 8.92% | 123,803 | 464,170 |
| | WI-Foods1 | 648 | 0.483 | 30.05% | 0.7993 | 11.42% | 46,963 | 177,586 |
| | TX-Foods2 | 1194 | 0.487 | 31.22% | 0.7438 | 3.69% | 82,603 | 308,889 |
| | CA-Household1 | 2128 | 0.514 | 38.58% | 0.7971 | 11.11% | 79,843 | 300,371 |
| | WI-Household1 | 1596 | 0.522 | 40.80% | 0.8058 | 12.32% | 293,523 | 1,093,001 |
| | TX-Hobbies1 | 1248 | 0.624 | 68.23% | 0.7454 | 3.90% | 240,523 | 897,953 |
| | CA-Hobbies1 | 1664 | 0.625 | 68.42% | 0.7532 | 4.99% | 74,763 | 281,545 |
| | WI-Hobbies1 | 1248 | 0.688 | 85.39% | 0.7481 | 4.28% | 123,803 | 457,077 |
| | CA-Household2 | 2060 | 0.721 | 94.25% | 0.7895 | 10.05% | 5563 | 24,078 |
| | WI-Hobbies2 | 447 | 0.780 | 110.19% | 0.7558 | 5.36% | 28,003 | 105,157 |
| | TX-Hobbies2 | 447 | 0.787 | 112.00% | 0.7507 | 4.65% | 183,523 | 674,968 |
| | TX-Household2 | 1545 | 0.823 | 121.85% | 0.7749 | 8.02% | 46,963 | 178,373 |
| | CA-Hobbies2 | 596 | 0.827 | 122.87% | 0.7820 | 9.01% | 12,283 | 49,220 |
| | WI-Household2 | 1545 | 0.946 | 154.98% | 0.7888 | 9.95% | 123,803 | 465,523 |
| Store-Category (30) | CA3-Foods | 1437 | 0.306 | | 0.7503 | 5.25% | 82,603 | 311,464 |
| | CA1-Foods | 1437 | 0.356 | 16.09% | 0.7648 | 7.28% | 398,443 | 1,485,242 |
| | TX2-Foods | 1437 | 0.394 | 28.48% | 0.7244 | 1.61% | 82,603 | 310,872 |
| | WI2-Foods | 1437 | 0.410 | 33.94% | 0.8221 | 15.32% | 240,523 | 894,731 |
| | WI1-Foods | 1437 | 0.432 | 41.13% | 0.8241 | 15.60% | 183,523 | 686,270 |
| | WI3-Foods | 1437 | 0.441 | 43.86% | 0.7875 | 10.47% | 177,363 | 663,389 |
| | TX1-Foods | 1437 | 0.465 | 51.94% | 0.7393 | 3.70% | 123,803 | 463,983 |
| | CA2-Foods | 1437 | 0.468 | 52.91% | 0.9308 | 30.56% | 177,363 | 662,127 |
| | TX3-Foods | 1437 | 0.472 | 54.08% | 0.7949 | 11.51% | 28,003 | 107,712 |
| | CA4-Foods | 1437 | 0.518 | 69.28% | 0.8060 | 13.06% | 79,843 | 301,432 |
| | CA3-Household | 1047 | 0.527 | 72.19% | 0.8099 | 13.61% | 8923 | 36,906 |
| | TX2-Household | 1047 | 0.536 | 75.11% | 0.7129 | | 7603 | 31,610 |
| | CA1-Household | 1047 | 0.570 | 86.04% | 0.7545 | 5.84% | 204,603 | 763,936 |
| | CA2-Household | 1047 | 0.580 | 89.37% | 0.8127 | 13.99% | 74,763 | 281,706 |
| | TX2-Hobbies | 565 | 0.582 | 89.89% | 0.7273 | 2.01% | 2203 | 11,317 |
| | CA3-Hobbies | 565 | 0.591 | 92.99% | 0.7595 | 6.54% | 5563 | 23,999 |
| | CA1-Hobbies | 565 | 0.607 | 98.32% | 0.7421 | 4.09% | 240,523 | 896,050 |
| | WI1-Hobbies | 565 | 0.626 | 104.50% | 0.7201 | 1.02% | 7603 | 31,629 |
| | TX1-Household | 1047 | 0.650 | 112.07% | 0.7657 | 7.40% | 79,843 | 300,570 |
| | CA2-Hobbies | 565 | 0.656 | 114.04% | 0.7572 | 6.22% | 45,483 | 172,569 |
| | TX3-Household | 1047 | 0.687 | 124.35% | 0.7710 | 8.16% | 177,363 | 660,928 |
| | WI3-Hobbies | 565 | 0.697 | 127.57% | 0.7363 | 3.29% | 177,363 | 658,671 |
| | WI1-Household | 1047 | 0.708 | 131.05% | 0.7951 | 11.54% | 398,443 | 1,483,148 |
| | TX3-Hobbies | 565 | 0.727 | 137.51% | 0.7692 | 7.90% | 74,763 | 279,746 |
| | WI2-Hobbies | 565 | 0.765 | 149.71% | 0.7637 | 7.12% | 123,803 | 459,177 |
| | TX1-Hobbies | 565 | 0.784 | 155.86% | 0.7625 | 6.96% | 43,003 | 163,129 |
| | WI3-Household | 1047 | 0.786 | 156.67% | 0.7902 | 10.84% | 5563 | 23,963 |
| | WI2-Household | 1047 | 0.795 | 159.65% | 0.8440 | 18.40% | 204,603 | 757,647 |
| | CA4-Household | 1047 | 0.796 | 159.78% | 0.7940 | 11.38% | 177,363 | 663,235 |
| | CA4-Hobbies | 565 | 0.840 | 174.22% | 0.7863 | 10.30% | 74,763 | 280,400 |
| Store-Department (70) | CA3-Foods3 | 823 | 0.279 | | 0.7704 | 12.10% | 45,483 | 172,555 |
CA1-Foods38230.31312.15%0.777013.07%79,843300,564
CA2-Foods12160.34925.33%0.854024.28%5,56323,940
TX2-Foods38230.35326.53%0.73296.65%28,003107,549
CA3-Foods23980.36430.53%0.71003.32%398,4431,484,856
TX2-Foods12160.37333.83%0.68840.17%16,20362,830
TX2-Household15320.38538.08%0.6872285,4031,056,979
CA1-Foods12160.38939.41%0.785714.33%177,363654,331
WI2-Foods23980.38939.49%0.821919.60%293,5231,088,478
WI2-Foods38230.39742.40%0.825020.06%20,72380,547
WI1-Foods38230.39842.65%0.833421.28%74,763281,499
CA3-Foods12160.41147.28%0.817618.98%12,28349,156
WI3-Foods38230.41950.46%0.788714.77%74,763281,689
TX1-Foods38230.42351.73%0.73807.39%556,3632,049,755
CA1-Foods23980.42953.99%0.75029.17%293,5231,094,123
CA2-Foods38230.44057.69%0.946037.66%123,803463,412
CA4-Foods38230.44057.96%0.779313.41%28,003107,727
CA1-Household15320.44459.14%0.765411.38%74,763281,499
TX1-Foods23980.44860.77%0.69691.42%183,523686,853
TX2-Foods23980.45061.38%0.73426.84%123,803460,022
WI3-Foods23980.45563.39%0.763611.11%398,4431,473,384
TX1-Household15320.45764.12%0.73416.83%204,603753,716
WI1-Foods12160.45864.35%0.847623.34%5,56323,949
TX3-Foods38230.46165.54%0.804417.06%82,603310,350
CA3-Household15320.46265.76%0.832221.11%79,843300,324
WI2-Foods12160.46767.63%0.793415.46%131,683484,866
TX3-Household15320.46867.83%0.74648.61%46,963178,155
CA2-Household15320.47269.24%0.800116.43%131,683493,217
TX1-Foods12160.47369.64%0.816918.87%220311,276
WI1-Foods23980.48473.55%0.831020.92%79,843295,408
WI3-Household15320.49778.40%0.759010.44%123,803463,353
CA3-Hobbies14160.51785.63%0.74408.27%177,363657,200
WI2-Household15320.52588.37%0.855524.49%123,803463,933
WI1-Household15320.53190.34%0.806917.42%409,6831,526,577
CA2-Foods23980.54194.17%0.872026.89%177,363661,534
CA1-Hobbies14160.561101.24%0.74258.05%285,4031,060,472
TX2-Hobbies14160.563102.04%0.73176.48%123,803463,582
CA4-Foods23980.578107.44%0.809617.81%204,603764,210
WI3-Foods12160.580108.05%0.810817.98%131,683489,188
WI1-Hobbies14160.582108.81%0.71213.62%177,363654,673
CA3-Household25150.603116.32%0.787314.57%892336,765
TX3-Foods12160.608117.95%0.881328.25%760331,574
TX3-Foods23980.631126.41%0.854024.27%204,603760,157
CA4-Household15320.641130.07%0.784014.09%556,3632,051,567
CA4-Foods12160.642130.47%0.892029.80%74,763281,538
WI3-Hobbies14160.654134.46%0.71333.79%177,363656,325
TX2-Hobbies21490.657135.81%0.71514.07%82,603307,310
CA2-Hobbies14160.661137.24%0.767411.67%293,5231,089,744
CA2-Household25150.688146.99%0.825420.11%28,003106,699
TX1-Hobbies14160.689147.29%0.73677.21%45,483172,532
TX2-Household25150.699150.63%0.73456.88%293,5231,074,703
CA1-Household25150.705152.98%0.74298.11%398,4431,469,196
CA2-Hobbies21490.705153.09%0.73106.38%20,72380,526
CA3-Hobbies21490.738164.93%0.776212.96%43,003163,390
WI2-Hobbies14160.740165.62%0.777013.07%123,803463,481
CA4-Hobbies14160.754170.61%0.763411.09%556,3632,053,276
TX3-Hobbies21490.757171.68%0.72665.73%293,5231,088,079
WI2-Hobbies21490.767175.25%0.71974.73%104,043389,680
WI1-Hobbies21490.769175.77%0.75349.64%46,963177,610
TX3-Hobbies14160.790183.56%0.809917.85%61,203229,601
CA1-Hobbies21490.806189.04%0.765611.41%79,843296,575
WI3-Hobbies21490.838200.56%0.800416.48%12,28349,227
WI1-Household25150.852205.65%0.786614.46%61,203231,415
TX1-Household25150.870212.20%0.805217.18%177,363659,714
TX3-Household25150.889218.87%0.790615.05%183,523676,504
WI3-Household25150.900222.93%0.772912.47%104,043386,689
TX1-Hobbies21490.950240.72%0.816918.88%28,003106,599
CA4-Household25150.983252.58%0.806917.42%45,483172,371
CA4-Hobbies21491.006260.81%0.839522.16%79,843299,234
WI2-Household25151.047275.49%0.808417.64%123,803459,988
Figure A1. Time-series features of state data pools after applying principal component analysis, ordered by MASE.
Figure A2. Time-series features of store data pools after applying principal component analysis, ordered by MASE.
Figure A3. Time-series features of category data pools after applying principal component analysis, ordered by MASE.
Figure A4. Time-series features of department data pools after applying principal component analysis, ordered by MASE.
Figure A5. Time-series features of state–category data pools after applying principal component analysis, ordered by MASE.
Figure A6. Time-series features of state–department data pools after applying principal component analysis, ordered by MASE.
Figure A7. Time-series features of store–category data pools after applying principal component analysis, ordered by MASE.
Figure A8. Time-series features of store–department data pools after applying principal component analysis, ordered by MASE.

References

  1. Fildes, R.; Ma, S.; Kolassa, S. Retail forecasting: Research and practice. Int. J. Forecast. 2022, 38, 1283–1318. [Google Scholar] [CrossRef]
  2. Oliveira, J.M.; Ramos, P. Assessing the Performance of Hierarchical Forecasting Methods on the Retail Sector. Entropy 2019, 21, 436. [Google Scholar] [CrossRef] [PubMed]
  3. Seaman, B. Considerations of a retail forecasting practitioner. Int. J. Forecast. 2018, 34, 822–829. [Google Scholar] [CrossRef]
  4. Ramos, P.; Oliveira, J.M.; Kourentzes, N.; Fildes, R. Forecasting Seasonal Sales with Many Drivers: Shrinkage or Dimensionality Reduction? Appl. Syst. Innov. 2023, 6, 3. [Google Scholar] [CrossRef]
  5. Ramos, P.; Santos, N.; Rebelo, R. Performance of state space and ARIMA models for consumer retail sales forecasting. Robot. Comput. Integr. Manuf. 2015, 34, 151–163. [Google Scholar] [CrossRef]
  6. Ramos, P.; Oliveira, J.M. A procedure for identification of appropriate state space and ARIMA models based on time-series cross-validation. Algorithms 2016, 9, 76. [Google Scholar] [CrossRef]
  7. Hyndman, R.J.; Koehler, A.B.; Ord, J.K.; Snyder, R.D. Forecasting with Exponential Smoothing: The State Space Approach; Springer Series in Statistics; Springer: Berlin, Germany, 2008. [Google Scholar] [CrossRef]
  8. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis, 4th ed.; Wiley: Hoboken, NJ, USA, 2008. [Google Scholar]
  9. Montero-Manso, P.; Hyndman, R.J. Principles and algorithms for forecasting groups of time series: Locality and globality. Int. J. Forecast. 2021, 37, 1632–1653. [Google Scholar] [CrossRef]
  10. Januschowski, T.; Gasthaus, J.; Wang, Y.; Salinas, D.; Flunkert, V.; Bohlke-Schneider, M.; Callot, L. Criteria for classifying forecasting methods. Int. J. Forecast. 2020, 36, 167–177. [Google Scholar] [CrossRef]
  11. Rabanser, S.; Januschowski, T.; Flunkert, V.; Salinas, D.; Gasthaus, J. The Effectiveness of Discretization in Forecasting: An Empirical Study on Neural Time Series Models. arXiv 2020, arXiv:2005.10111. [Google Scholar]
  12. Laptev, N.; Yosinski, J.; Li, L.E.; Smyl, S. Time-series extreme event forecasting with neural networks at Uber. In Proceedings of the International Conference on Machine Learning, Workshop, Sydney, Australia, 6–11 August 2017; Volume 34, pp. 1–5. [Google Scholar]
  13. Gasthaus, J.; Benidis, K.; Wang, Y.; Rangapuram, S.S.; Salinas, D.; Flunkert, V.; Januschowski, T. Probabilistic Forecasting with Spline Quantile Function RNNs. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Naha, Japan, 16–18 April 2019; Chaudhuri, K., Sugiyama, M., Eds.; Volume 89, pp. 1901–1910. [Google Scholar]
  14. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv 2020, arXiv:1905.10437. [Google Scholar]
  15. Bandara, K.; Hewamalage, H.; Liu, Y.H.; Kang, Y.; Bergmeir, C. Improving the accuracy of global forecasting models using time series data augmentation. Pattern Recognit. 2021, 120, 108148. [Google Scholar] [CrossRef]
  16. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 Competition: 100,000 time series and 61 forecasting methods. Int. J. Forecast. 2020, 36, 54–74. [Google Scholar] [CrossRef]
  17. Smyl, S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Int. J. Forecast. 2020, 36, 75–85. [Google Scholar] [CrossRef]
  18. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M5 competition: Background, organization, and implementation. Int. J. Forecast. 2022, 38, 1325–1336. [Google Scholar] [CrossRef]
  19. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. M5 accuracy competition: Results, findings, and conclusions. Int. J. Forecast. 2022, 38, 1346–1364. [Google Scholar] [CrossRef]
  20. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V.; Chen, Z.; Gaba, A.; Tsetlin, I.; Winkler, R.L. The M5 uncertainty competition: Results, findings and conclusions. Int. J. Forecast. 2022, 38, 1365–1385. [Google Scholar] [CrossRef]
  21. Bojer, C.S.; Meldgaard, J.P. Kaggle forecasting competitions: An overlooked learning opportunity. Int. J. Forecast. 2021, 37, 587–603. [Google Scholar] [CrossRef]
  22. Duncan, G.T.; Gorr, W.L.; Szczypula, J. Forecasting Analogous Time Series. In Principles of Forecasting: A Handbook for Researchers and Practitioners; Armstrong, J.S., Ed.; Springer: Boston, MA, USA, 2001; pp. 195–213. [Google Scholar] [CrossRef]
  23. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
  24. Bandara, K.; Bergmeir, C.; Smyl, S. Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert Syst. Appl. 2020, 140, 112896. [Google Scholar] [CrossRef]
  25. Hewamalage, H.; Bergmeir, C.; Bandara, K. Global models for time series forecasting: A Simulation study. Pattern Recognit. 2022, 124, 108441. [Google Scholar] [CrossRef]
  26. Rajapaksha, D.; Bergmeir, C.; Hyndman, R.J. LoMEF: A framework to produce local explanations for global model time series forecasts. Int. J. Forecast. 2022. [Google Scholar] [CrossRef]
  27. Kolmogorov, A.N. Three approaches to the quantitative definition of information. Int. J. Comput. Math. 1968, 2, 157–168. [Google Scholar] [CrossRef]
  28. Li, M.; Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  29. Cilibrasi, R.; Vitanyi, P. Clustering by compression. IEEE Trans. Inf. Theory 2005, 51, 1523–1545. [Google Scholar] [CrossRef]
  30. Semenoglou, A.A.; Spiliotis, E.; Makridakis, S.; Assimakopoulos, V. Investigating the accuracy of cross-learning time series forecasting methods. Int. J. Forecast. 2021, 37, 1072–1084. [Google Scholar] [CrossRef]
  31. Novak, R.; Bahri, Y.; Abolafia, D.A.; Pennington, J.; Sohl-Dickstein, J. Sensitivity and Generalization in Neural Networks: An Empirical Study. arXiv 2018, arXiv:1802.08760. [Google Scholar]
  32. Kourentzes, N. Intermittent demand forecasts with neural networks. Int. J. Prod. Econ. 2013, 143, 198–206. [Google Scholar] [CrossRef]
  33. Croston, J.D. Forecasting and Stock Control for Intermittent Demands. J. Oper. Res. Soc. 1972, 23, 289–303. [Google Scholar] [CrossRef]
  34. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  35. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  36. Alexandrov, A.; Benidis, K.; Bohlke-Schneider, M.; Flunkert, V.; Gasthaus, J.; Januschowski, T.; Maddix, D.C.; Rangapuram, S.; Salinas, D.; Schulz, J.; et al. GluonTS: Probabilistic and Neural Time Series Modeling in Python. J. Mach. Learn. Res. 2020, 21, 4629–4634. [Google Scholar]
  37. Petropoulos, F.; Apiletti, D.; Assimakopoulos, V.; Babai, M.Z.; Barrow, D.K.; Ben Taieb, S.; Bergmeir, C.; Bessa, R.J.; Bijak, J.; Boylan, J.E.; et al. Forecasting: Theory and practice. Int. J. Forecast. 2022, 38, 705–871. [Google Scholar] [CrossRef]
  38. Garza, F.; Canseco, M.M.; Challú, C.; Olivares, K.G. StatsForecast: Lightning Fast Forecasting with Statistical and Econometric Models; PyCon: Salt Lake City, UT, USA, 2022. [Google Scholar]
  39. Hyndman, R.J.; Khandakar, Y. Automatic time series forecasting: The forecast package for R. J. Stat. Softw. 2008, 27, 1–22. [Google Scholar] [CrossRef]
  40. Hyndman, R.J.; Koehler, A.B.; Snyder, R.D.; Grose, S. A state space framework for automatic forecasting using exponential smoothing methods. Int. J. Forecast. 2002, 18, 439–454. [Google Scholar] [CrossRef]
  41. Ord, J.K.; Fildes, R.; Kourentzes, N. Principles of Business Forecasting, 2nd ed.; Wessex Press Publishing Co.: London, UK, 2017. [Google Scholar]
  42. Kang, Y.; Hyndman, R.J.; Smith-Miles, K. Visualising forecasting algorithm performance using time series instance spaces. Int. J. Forecast. 2017, 33, 345–358. [Google Scholar] [CrossRef]
  43. Jolliffe, I. Principal Component Analysis, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2002. [Google Scholar] [CrossRef]
  44. O’Hara-Wild, M.; Hyndman, R.; Wang, E. feasts: Feature Extraction and Statistics for Time Series. 2022. Available online: https://github.com/tidyverts/feasts/ (accessed on 12 December 2022).
  45. Lê, S.; Josse, J.; Husson, F. FactoMineR: A Package for Multivariate Analysis. J. Stat. Softw. 2008, 25, 1–18. [Google Scholar] [CrossRef]
  46. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019. [Google Scholar]
  47. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688. [Google Scholar] [CrossRef]
  48. Hollander, M.; Wolfe, D.A.; Chicken, E. Nonparametric Statistical Methods; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2015. [Google Scholar] [CrossRef]
  49. Kourentzes, N. tsutils: Time Series Exploration, Modelling and Forecasting, R Package Version 0.9.3; 2022. Available online: https://github.com/trnnick/tsutils/ (accessed on 12 December 2022).
Figure 1. Time-series features of M5 dataset after applying principal component analysis.
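Figure 1 projects a set of per-series features onto their first two principal components. The snippet below is a minimal, self-contained sketch of such a projection using scikit-learn; the feature matrix here is randomly generated as a placeholder with illustrative column names, whereas in the paper the features were extracted with the feasts package [44] and the PCA performed with FactoMineR [45].

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix: rows = series, columns = time-series features
# (e.g., trend strength, seasonal strength, spectral entropy, ...).
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.normal(size=(30490, 8)),
    columns=[f"feature_{i}" for i in range(8)],  # illustrative feature names
)

# Standardise the features and keep the first two principal components,
# which give the 2-D instance space of the kind plotted in Figure 1 and Figures A1-A8.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(features))
projection = pd.DataFrame(scores, columns=["PC1", "PC2"])
print(projection.describe())
```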
Figure 2. Post-hoc Nemenyi test results at a 5% significance level based on MASE and RMSSE.
Table 1. Linear exponential smoothing models.
Trend N, Seasonal N: $y_t = l_{t-1} + \varepsilon_t$; $l_t = l_{t-1} + \alpha\varepsilon_t$
Trend N, Seasonal A: $y_t = l_{t-1} + s_{t-m} + \varepsilon_t$; $l_t = l_{t-1} + \alpha\varepsilon_t$; $s_t = s_{t-m} + \gamma\varepsilon_t$
Trend A, Seasonal N: $y_t = l_{t-1} + b_{t-1} + \varepsilon_t$; $l_t = l_{t-1} + b_{t-1} + \alpha\varepsilon_t$; $b_t = b_{t-1} + \beta\varepsilon_t$
Trend A, Seasonal A: $y_t = l_{t-1} + b_{t-1} + s_{t-m} + \varepsilon_t$; $l_t = l_{t-1} + b_{t-1} + \alpha\varepsilon_t$; $b_t = b_{t-1} + \beta\varepsilon_t$; $s_t = s_{t-m} + \gamma\varepsilon_t$
Trend Ad, Seasonal N: $y_t = l_{t-1} + \phi b_{t-1} + \varepsilon_t$; $l_t = l_{t-1} + \phi b_{t-1} + \alpha\varepsilon_t$; $b_t = \phi b_{t-1} + \beta\varepsilon_t$
Trend Ad, Seasonal A: $y_t = l_{t-1} + \phi b_{t-1} + s_{t-m} + \varepsilon_t$; $l_t = l_{t-1} + \phi b_{t-1} + \alpha\varepsilon_t$; $b_t = \phi b_{t-1} + \beta\varepsilon_t$; $s_t = s_{t-m} + \gamma\varepsilon_t$
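As a reading aid for Table 1, the following sketch implements the additive-error, additive-trend, additive-seasonality recursion, ETS(A,A,A), directly from the equations above. It uses naive initial states and fixed smoothing parameters purely for illustration; in practice the local ETS benchmarks are fitted automatically (parameter estimation and model selection by the forecasting package, not shown here).

```python
import numpy as np

def ets_aaa_forecast(y, alpha, beta, gamma, m, h):
    """Illustrative ETS(A,A,A) recursion following Table 1.

    y: observed series, m: seasonal period, h: forecast horizon.
    Smoothing parameters and initial states are not estimated here;
    a real implementation selects and fits them automatically.
    """
    y = np.asarray(y, dtype=float)
    level, trend = y[:m].mean(), 0.0   # naive initial level and trend
    season = list(y[:m] - level)       # naive initial seasonal states
    for t in range(m, len(y)):
        eps = y[t] - (level + trend + season[t - m])            # one-step innovation
        level, trend = level + trend + alpha * eps, trend + beta * eps
        season.append(season[t - m] + gamma * eps)
    # h-step-ahead forecasts; the damped-trend variant (phi < 1) is omitted here.
    return np.array([level + (k + 1) * trend + season[len(y) - m + (k % m)]
                     for k in range(h)])

# Example usage on a toy daily series with weekly seasonality (m = 7):
toy = np.tile([10, 12, 14, 13, 15, 20, 22], 20) + np.arange(140) * 0.1
print(ets_aaa_forecast(toy, alpha=0.2, beta=0.05, gamma=0.1, m=7, h=28)[:7])
```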
Table 2. DeepAR hyperparameter ranges of values considered in the optimization process.
Hyperparameter | Values considered
Context length | 28
Prediction length | 28
Number of hidden layers | {1, 2, 3, 4}
Hidden size | {20, 40, 60, 80, 100, 120, 140}
Learning rate | $[1 \times 10^{-5}, 1 \times 10^{-1}]$
Dropout rate | [0, 0.2]
Batch size | {16, 32, 64, 128}
Scaling | True
Number of epochs | 100
Number of parallel samples | 100
Number of trials | 50
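A search space such as the one in Table 2 can be explored with Optuna [46] on top of the GluonTS DeepAR estimator [23,36]. The sketch below outlines one way to wire the two together; it assumes the PyTorch DeepAREstimator interface (argument names and import paths can differ slightly between GluonTS versions) and that train_ds, valid_ds, and an evaluate_mase scoring helper are defined elsewhere for the data pool being tuned.

```python
import optuna
from gluonts.torch import DeepAREstimator  # import path may vary across GluonTS versions

def objective(trial: optuna.Trial) -> float:
    # Sample one configuration from the ranges listed in Table 2.
    estimator = DeepAREstimator(
        freq="D",
        prediction_length=28,
        context_length=28,
        num_layers=trial.suggest_int("num_layers", 1, 4),
        hidden_size=trial.suggest_categorical("hidden_size", [20, 40, 60, 80, 100, 120, 140]),
        lr=trial.suggest_float("lr", 1e-5, 1e-1, log=True),
        dropout_rate=trial.suggest_float("dropout_rate", 0.0, 0.2),
        batch_size=trial.suggest_categorical("batch_size", [16, 32, 64, 128]),
        num_parallel_samples=100,
        scaling=True,
        trainer_kwargs={"max_epochs": 100},
    )
    predictor = estimator.train(training_data=train_ds)  # train_ds: the pool's training data
    return evaluate_mase(predictor, valid_ds)             # hypothetical scoring helper

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)                    # 50 trials, as in Table 2
print(study.best_params)
```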
Table 3. Performance of global and local models evaluated with respect to MASE and RMSSE. Model complexity estimated by TNP (total number of parameters), TCMS (total compressed model size), WNP (weighted average number of parameters), and WCMS (weighted average compressed model size), per model.
Forecasting Method | No. of Pools | MASE | RMSSE | TNP | WNP | TCMS (Bytes) | WCMS (Bytes)
Partitioning approaches
DeepAR-Total | 1 | 0.572 | 0.78245 | 204,603 | 204,603 | 776,553 | 776,553
DeepAR-State | 3 | 0.564 (−1.38%) | 0.78060 (−0.24%) | 580,729 | 203,571 | 2,178,371 | 763,894
DeepAR-Store | 10 | 0.560 (−1.98%) | 0.78241 (−0.01%) | 2,121,070 | 212,107 | 7,921,985 | 792,199
DeepAR-Category | 3 | 0.566 (−1.08%) | 0.78094 (−0.19%) | 678,089 | 296,591 | 2,595,707 | 1,136,317
DeepAR-Department | 7 | 0.559 (−2.15%) | 0.78138 (−0.14%) | 1,294,621 | 226,679 | 4,821,767 | 843,542
DeepAR-State-Category | 9 | 0.553 (−3.23%) | 0.78080 (−0.21%) | 1,340,627 | 168,267 | 5,015,850 | 629,156
DeepAR-State-Department | 21 | 0.551 (−3.58%) | 0.78064 (−0.23%) | 2,419,623 | 131,080 | 9,042,778 | 490,142
DeepAR-Store-Category | 30 | 0.556 (−2.74%) | 0.78340 (0.12%) | 3,708,210 | 134,902 | 13,867,558 | 504,447
DeepAR-Store-Department | 70 | 0.554 (−3.11%) | 0.78421 (0.23%) | 10,310,650 | 155,903 | 38,339,800 | 579,556
DeepAR-Comb | 154 | 0.558 (−2.37%) | 0.77620 (−0.80%) | 22,658,222 | 192,634 | 83,740,297 | 723,979
Local benchmarks
ARIMA | 1 | 0.798 (39.51%) | 0.93436 (19.42%) | 487,840 * | 16 * | 24,431,329 | 801
ETS | 1 | 0.808 (41.24%) | 0.92853 (18.67%) | 426,860 * | 14 * | 24,411,454 | 801
Seasonal Naïve | 1 | 0.905 (58.35%) | 1.23763 (58.17%) | — | — | — | —
* Conservative estimate of the number of parameters in the models. In ARIMA, they include the orders $0 \le p \le 5$, $0 \le q \le 5$, $0 \le P \le 2$, and $0 \le Q \le 2$, the constant c (if it exists), and the residual variance; thus, a maximum of 16 parameters. In ETS models, they include the smoothing parameters $\alpha$, $\beta$, $\gamma$, and $\phi$, the initial states $l_0$, $b_0$, $s_0, \ldots, s_{-6}$, and the residual variance; thus, a maximum of 14 parameters. The most-effective data-partitioning approaches within the MASE and RMSSE columns are highlighted in boldface.
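For reference, the two accuracy measures reported in Tables 3 and A1 scale out-of-sample errors by the mean in-sample one-step error of a naive reference forecast [47]. The sketch below is a minimal numpy version; the scaling lag is exposed as a parameter (m = 1 for the naive method, m = 7 for a seasonal naive on daily data), since the exact scaling choice is not restated here and should be taken from the paper's evaluation setup.

```python
import numpy as np

def mase_rmsse(y_train, y_test, y_pred, m=1):
    """Scaled accuracy measures in the spirit of Hyndman & Koehler (2006).

    The denominator is the mean in-sample error of the lag-m naive forecast;
    MASE scales absolute errors, RMSSE squared errors.
    """
    y_train, y_test, y_pred = (np.asarray(a, dtype=float) for a in (y_train, y_test, y_pred))
    scale_err = y_train[m:] - y_train[:-m]
    mase = np.mean(np.abs(y_test - y_pred)) / np.mean(np.abs(scale_err))
    rmsse = np.sqrt(np.mean((y_test - y_pred) ** 2) / np.mean(scale_err ** 2))
    return mase, rmsse

# Toy example: 100 in-sample points, a 28-day holdout, and a flat forecast.
rng = np.random.default_rng(1)
history, actuals = rng.poisson(3, 100), rng.poisson(3, 28)
print(mase_rmsse(history, actuals, np.full(28, history.mean())))
```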
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
