Assessing the Performance of Hierarchical Forecasting Methods on the Retail Sector

Retailers need demand forecasts at different levels of aggregation to support a variety of decisions along the supply chain. To ensure aligned decision-making across the hierarchy, forecasts at the most disaggregated level must add up to the forecasts at the aggregate levels above. It is not clear whether these aggregate forecasts should be generated independently or by a hierarchical forecasting method that ensures coherent decision-making at the different levels but does not guarantee, at a minimum, the same accuracy. To give guidelines on this issue, our empirical study investigates the relative performance of independent and reconciled forecasting approaches, using real data from a Portuguese retailer. We consider two alternative forecasting model families for generating the base forecasts, namely state space models and ARIMA. Appropriate models from both families are chosen for each time-series by minimising the bias-corrected Akaike information criterion. The results show significant improvements in forecast accuracy, providing valuable information to support management decisions. Reconciled forecasts using the Minimum Trace Shrinkage estimator (MinT-Shrink) generally improve on the accuracy of the ARIMA base forecasts for all levels and for the complete hierarchy, across all forecast horizons. The accuracy gains generally increase with the horizon, varying between 1.7% and 3.7% for the complete hierarchy. The gains are also more substantial at the higher levels of aggregation: the information about the individual dynamics of the series, lost through aggregation, is recovered from the lower levels by the reconciliation process, substantially improving forecast accuracy over the base forecasts.


Introduction
Retailers need demand forecasts at different levels of aggregation to support decision-making at operational and short-term strategic levels [1]. Consider a retailer warehouse storing inventory that is used to replenish multiple retail stores: store-level forecasts at different product levels are needed to manage inventory in the store or to allocate shelf space, but aggregate forecasts are also required for the inventory decisions of the retailer warehouse [2]. Understanding whether these aggregate forecasts should be generated independently at each level of the hierarchy, based on the aggregated demand, or obtained using a hierarchical forecasting method, which depends on the aggregation constraints of the hierarchy but ensures coherent decision-making at the different levels, is the gap we seek to address in this paper.
were then used to disaggregate the base forecasts of the top level series. The disadvantage of all top-down approaches, including this one, is that they do not produce unbiased coherent forecasts [13].
The remainder of the paper is organized as follows. The next section presents a brief description of the two most widely-used approaches to time-series forecasting: State space models and ARIMA models. The procedure for using information criteria in model selection is also discussed. Section 3 describes the methods more commonly used to forecast hierarchical time-series. Section 4 presents the case study of a Portuguese retailer, explains the evaluation setup implemented and error measures used, and discusses the results obtained. Finally, Section 5 offers the concluding remarks.

Pure Forecasting Models
We consider two alternative forecasting methods for generating the base forecasts used by hierarchical forecasting approaches, namely state space models and ARIMA models. These are briefly described in this section, with a special focus on the use of information criteria for model selection.

State Space Models
Forecasts generated by exponential smoothing methods are weighted averages of past observations, where the weights decrease exponentially as the observations get older. The component form representation of these methods comprises the forecast equation and one smoothing equation for each of the components considered, which can be the level, the trend, and the seasonality. The possibilities for each of these components are: Trend = {N, A, A_d} and Seasonality = {N, A, M}, where N, A, A_d, and M mean, respectively, none, additive, additive damped, and multiplicative. By considering all combinations of the trend and seasonal components, nine exponential smoothing methods are possible. Each method is usually labelled by a pair of letters, (T, S), specifying the type of trend and seasonal components. Denoting the time-series by y_t, t = 1, 2, ..., n, and the forecast of y_{t+h} based on all data up to time t by ŷ_{t+h|t}, the component form of the additive Holt-Winters' method, (A, A), is

ŷ_{t+h|t} = l_t + h b_t + s_{t+h−m(k+1)}, (1)

where l_t, b_t, and s_t denote, respectively, the estimates of the series level, trend (slope), and seasonality at time t; m denotes the period of seasonality; and k is the integer part of (h − 1)/m. The smoothing parameters α, β*, and γ are constrained to ensure that the smoothing equations can be interpreted as weighted averages. Fitted values are calculated by setting h = 1 with t = 0, 1, ..., n − 1. H-step-ahead forecasts, for h = 1, 2, ..., can then be obtained using the last estimated values of the level, trend, and seasonality (t = n). Details about all the other methods may be found in Hyndman and Athanasopoulos [18]. To be able to produce forecast intervals and use a model selection criterion, Hyndman et al. [19] (amongst others) developed a statistical framework, where an innovation state space model can be written for each of the exponential smoothing methods.
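As a concrete illustration, the additive Holt-Winters recursions behind the forecast Equation (1) can be sketched in Python. The heuristic initialisation and the fixed smoothing parameter values below are illustrative assumptions; the paper estimates them by maximum likelihood via the ets() function in R.

```python
import numpy as np

def holt_winters_additive(y, m, alpha=0.2, beta=0.1, gamma=0.1, h=12):
    """Additive Holt-Winters (A, A): level, trend, and additive seasonality.

    Initial states use simple heuristics (illustrative only); in practice
    they are estimated jointly with the smoothing parameters.
    """
    y = np.asarray(y, dtype=float)
    l = y[:m].mean()                               # initial level
    b = (y[m:2 * m].mean() - y[:m].mean()) / m     # initial trend
    s = list(y[:m] - l)                            # initial seasonal indices
    for t in range(len(y)):
        s_old = s[t % m]
        l_new = alpha * (y[t] - s_old) + (1 - alpha) * (l + b)
        b = beta * (l_new - l) + (1 - beta) * b
        s[t % m] = gamma * (y[t] - l_new) + (1 - gamma) * s_old
        l = l_new
    # h-step-ahead forecasts: yhat_{t+h|t} = l_t + h*b_t + s_{t+h-m(k+1)}
    n = len(y)
    return [l + (j + 1) * b + s[(n + j) % m] for j in range(h)]
```

For a flat series the recursions leave the level unchanged and the forecasts equal that level, which is a quick sanity check of the update equations.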
Each state space model comprises a measurement equation, which describes the observed data, and state equations, which describe how the unobserved components (level, trend, and seasonality) change with time. For each exponential smoothing method, two possible state space models are considered, one with additive errors and one with multiplicative errors, giving a total of 18 models. To distinguish state space models with additive and multiplicative errors, an extra letter E was added: the triplet (E, T, S) identifies the type of error, trend, and seasonality. The general state space model is

y_t = w(x_{t−1}) + r(x_{t−1}) ε_t, (2a)
x_t = f(x_{t−1}) + g(x_{t−1}) ε_t, (2b)

where y_t denotes the observation at time t, x_t is the state vector, {ε_t} is a white noise process with variance σ², referred to as the innovation (new and unpredictable), w(·) is the measurement function, r(·) is the error term function, f(·) is the transition function, and g(·) is the persistence function. Equation (2a) is the measurement equation and Equation (2b) comprises the state equations.

Estimation of State Space Models
Maximum likelihood estimates of the parameters and initial states of the state space model (2) can be obtained by maximizing its likelihood. The probability density function for y = (y_1, ..., y_n) is given by [19]

p(y | θ, x_0, σ²) = ∏_{t=1}^{n} p(y_t | x_{t−1}),

where θ is the parameters vector, x_0 is the initial states vector, and σ² is the innovation variance. By assuming that the distribution of {ε_t} is Gaussian, this likelihood has the form

L(θ, x_0, σ² | y) = (2πσ²)^{−n/2} |∏_{t=1}^{n} r(x_{t−1})|^{−1} exp(−(1/(2σ²)) ∑_{t=1}^{n} ε_t²), (9)

and its logarithm is

log L = −(n/2) log(2πσ²) − ∑_{t=1}^{n} log|r(x_{t−1})| − (1/(2σ²)) ∑_{t=1}^{n} ε_t². (10)

The maximum likelihood estimate of σ² can be obtained by taking the partial derivative of (10) with respect to σ² and setting it to zero:

σ̂² = (1/n) ∑_{t=1}^{n} ε_t².

This estimate can be used to eliminate σ² from the likelihood (9), which becomes a function of θ and x_0 only. Hence, twice the negative logarithm of this likelihood is

−2 log L(θ, x_0 | y) = n log(∑_{t=1}^{n} ε_t²) + 2 ∑_{t=1}^{n} log|r(x_{t−1})| + c_n,

where c_n = n log(2πe) − n log(n). Thus, maximum likelihood estimates for the parameters θ and the initial states x_0 can be obtained by minimizing

L*(θ, x_0) = n log(∑_{t=1}^{n} ε_t²) + 2 ∑_{t=1}^{n} log|r(x_{t−1})|.

The innovations can be computed recursively, using the relationships

ε_t = (y_t − w(x_{t−1})) / r(x_{t−1}),   x_t = f(x_{t−1}) + g(x_{t−1}) ε_t.
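For the special case of simple exponential smoothing with additive errors, ETS(A,N,N), we have w(x) = f(x) = l, r(x) = 1, and g(x) = α, so minimising the concentrated criterion reduces to minimising the sum of squared innovations. The grid search below is a deliberately simple sketch of this estimation; real implementations (e.g. ets() in R) use numerical optimisers instead.

```python
import numpy as np
from itertools import product

def sse_ann(y, alpha, l0):
    """Sum of squared innovations for ETS(A,N,N); since r(.) = 1, minimising
    n*log(sum eps_t^2) is equivalent to minimising this SSE."""
    l, sse = l0, 0.0
    for yt in y:
        eps = yt - l              # eps_t = (y_t - w(x_{t-1})) / r(x_{t-1})
        sse += eps * eps
        l = l + alpha * eps       # x_t = f(x_{t-1}) + g(x_{t-1}) eps_t
    return sse

def fit_ann(y):
    """Grid search over the smoothing parameter alpha and the initial level l0."""
    alphas = np.linspace(0.01, 0.99, 99)
    levels = np.linspace(min(y), max(y), 50)
    return min(product(alphas, levels), key=lambda p: sse_ann(y, *p))
```

On a constant series the fitted initial level recovers that constant exactly, since the innovations then vanish.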

Information Criteria for Model Selection
Forecast accuracy measures can be used to select a model for a given time-series, as long as the errors are computed from a test set and not from the training set used to estimate the model. However, the errors usually available are not enough to draw reliable conclusions. One possible solution is to use an information criterion (IC), based on the likelihood L(θ, x_0 | y), that includes a regularization term to compensate for potential overfitting. The Akaike Information Criterion (AIC) for state space models is defined as [18]

AIC = −2 log L(θ, x_0 | y) + 2k,

where L(θ, x_0 | y) is the likelihood and k is the number of parameters and initial states of the estimated model. Akaike based his model selection criterion on the Kullback-Leibler (K-L) discrimination information, also known as negative entropy, defined by

I(f, g) = ∫ f(x) log( f(x) / g(x | θ) ) dx,

which measures the information lost when the model g is used to approximate the real model f. He found that he could estimate the expectation of the K-L information by the maximized log-likelihood corrected for bias, where the bias can be approximated by the number of estimated parameters in the approximating model. Thus, the model selection procedure is to choose the model amongst the candidates having the minimum value of the AIC. The Bayesian Information Criterion (BIC) is defined as [20]

BIC = −2 log L(θ, x_0 | y) + k log(n).

The BIC is order-consistent, but is not asymptotically efficient like the AIC. The AIC corrected for small-sample bias, denoted by AIC_c, is defined as [19]

AIC_c = AIC + 2k(k + 1) / (n − k − 1).

Appropriate models can be selected by minimizing the AIC, the BIC, or the AIC_c.
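The three criteria are straightforward to compute once a model's maximised log-likelihood and parameter count are known. The sketch below uses the definitions above; the candidate model names and log-likelihood values are hypothetical, chosen only for illustration.

```python
import math

def aic(loglik, k, n):
    """AIC = -2 log L + 2k (n is unused; kept for a uniform signature)."""
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n):
    """AIC corrected for small-sample bias."""
    return aic(loglik, k, n) + 2.0 * k * (k + 1) / (n - k - 1)

def bic(loglik, k, n):
    """BIC = -2 log L + k log(n)."""
    return -2.0 * loglik + k * math.log(n)

def select_model(candidates, n, criterion=aicc):
    """Choose the (name, loglik, k) candidate minimising the criterion."""
    return min(candidates, key=lambda c: criterion(c[1], c[2], n))
```

A model with a higher likelihood can still lose the comparison if its extra parameters do not pay for themselves, which is exactly the overfitting penalty at work.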

ARIMA Models
ARIMA models are generally accepted as one of the most versatile classes of models for forecasting time-series [21,22]. Many different types of stochastic seasonal and non-seasonal time-series can be represented by them, including pure autoregressive (AR), pure moving average (MA), and mixed AR and MA processes, all of which require stationary data. Although many time-series are non-stationary, they can be transformed to stationary time-series by taking proper degrees of differencing (regular and/or seasonal). The multiplicative seasonal ARIMA model, denoted as ARIMA(p, d, q) × (P, D, Q)_m, has the following form [23]:

φ_p(B) Φ_P(B^m) (1 − B)^d (1 − B^m)^D y_t = θ_q(B) Θ_Q(B^m) ε_t,

where m is the period of seasonality, D is the degree of seasonal differencing, d is the degree of ordinary differencing, B is the backward shift operator, φ_p(B) and θ_q(B) are the regular autoregressive and moving average polynomials of orders p and q, respectively, Φ_P(B^m) and Θ_Q(B^m) are the seasonal autoregressive and moving average polynomials of orders P and Q, respectively, and ε_t is a zero-mean Gaussian white noise process with variance σ². To ensure causality and invertibility, the roots of the polynomials φ_p(B), θ_q(B), Φ_P(B^m), and Θ_Q(B^m) should lie outside the unit circle. One of the main tasks in ARIMA forecasting is selecting the values of p, q, P, Q, d, and D. Usually, the following steps are used [23]: plot the series, identify outliers, and choose a proper variance-stabilizing transformation. For that purpose, a Box-Cox transformation may be applied [24]:

y_t^{(λ)} = (y_t^λ − 1)/λ if λ ≠ 0,   y_t^{(λ)} = log(y_t) if λ = 0,

where the parameter λ is a real number, often between −1 and 2. Then, the sample ACF (Auto-Correlation Function) and sample PACF (Partial Auto-Correlation Function) can be computed to decide appropriate degrees of differencing (d and D). Alternatively, unit-root tests may be applied. The Canova-Hansen test [25] can be used to choose D. After D is selected, d can be chosen by applying successive KPSS (Kwiatkowski, Phillips, Schmidt & Shin) tests [26].
Finally, the sample ACF and sample PACF are matched with the theoretical patterns of known models, to identify the orders of p, q, P, and Q.
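The variance-stabilising and differencing steps above are easy to express directly. A minimal sketch follows; in practice λ would be chosen by a data-driven method and d, D by the unit-root tests just mentioned, so the values passed in here are assumptions of the caller.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox variance-stabilising transformation."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def difference(y, d=0, D=0, m=1):
    """Apply D seasonal differences (period m) followed by d ordinary ones."""
    y = np.asarray(y, dtype=float)
    for _ in range(D):
        y = y[m:] - y[:-m]
    for _ in range(d):
        y = y[1:] - y[:-1]
    return y
```

One ordinary difference removes a linear trend and one seasonal difference removes a stable seasonal pattern, which is why the resulting series can be treated as stationary before fitting the ARMA part.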

Information Criteria for Model Selection
As for state space models, the values of p, q, P, and Q may be selected by an information criterion, such as the Akaike Information Criterion [18]:

AIC = −2 log L(θ, σ² | y) + 2(p + q + P + Q + k + 1), (23)

where k = 1 if c ≠ 0 and k = 0 otherwise (c being the constant term of the model), and log L(θ, σ² | y) is the log-likelihood of the model fitted to the properly transformed and differenced data, given by [27]

log L(θ, σ² | y) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{t=1}^{n} ε_t²,

where θ is the parameter vector of the model and σ² is the innovation variance (the last term in parentheses in (23) is the total number of parameters that have been estimated, including the innovation variance). Note that the AIC is defined by considering the same principles of maximum likelihood and negative entropy discussed in Section 2.1. The AIC corrected for small sample sizes, AIC_c, is defined as

AIC_c = AIC + 2(p + q + P + Q + k + 1)(p + q + P + Q + k + 2) / (n − p − q − P − Q − k − 2).

The Bayesian Information Criterion is defined as

BIC = AIC + [log(n) − 2](p + q + P + Q + k + 1).

As for the state space models, appropriate ARIMA models may be obtained by minimizing either the AIC, AIC_c, or BIC.

Hierarchical Time-Series
For the purpose of illustration, consider the example of the hierarchical structure shown in Figure 1. At the top of the hierarchy (level 0) is the most aggregated time-series, denoted by Total. The observation at time t of the Total series is denoted by y_{Total,t}. The Total series is disaggregated into series A and series B at level 1. The t-th observation of series A is denoted as y_{A,t} and the t-th observation of series B is denoted as y_{B,t}. The series A and B are disaggregated, respectively, into two and three series that are at the bottom level (level 2). For example, y_{AA,t} denotes the t-th observation of series AA. In this case, the total number of series is n = 8 and the number of series at the bottom level is m = 5. For any time t, the observations at the bottom level will sum to the observations of the series above. Hence, in this case, we have

y_{Total,t} = y_{AA,t} + y_{AB,t} + y_{BA,t} + y_{BB,t} + y_{BC,t},
y_{A,t} = y_{AA,t} + y_{AB,t},
y_{B,t} = y_{BA,t} + y_{BB,t} + y_{BC,t}. (27)

These aggregation constraints can be easily represented using matrix notation,

y_t = S b_t,

where y_t = (y_{Total,t}, y_{A,t}, y_{B,t}, y_{AA,t}, y_{AB,t}, y_{BA,t}, y_{BB,t}, y_{BC,t})' is an n-dimensional vector, b_t = (y_{AA,t}, y_{AB,t}, y_{BA,t}, y_{BB,t}, y_{BC,t})' is an m-dimensional vector, and S is the summing matrix of order n × m, given by

S = [ 1 1 1 1 1 ;
      1 1 0 0 0 ;
      0 0 1 1 1 ;
      I_5 ].

Note that the first three rows of S correspond, respectively, to the three aggregation constraints in (27). The identity matrix I_5 below guarantees that each bottom level observation on the right-hand side of the equation is equal to itself on the left-hand side. These concepts can be applied to an arbitrary set of n time-series that are subject to an aggregation structure, with m series at the bottom level [18]. The goal is to produce coherent forecasts for each series in the hierarchy; that is, forecasts that add up according to the aggregation constraints of the hierarchical structure.
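The summing matrix for the hierarchy of Figure 1 can be written down directly; the sketch below aggregates a bottom-level vector into the full coherent hierarchy vector.

```python
import numpy as np

# Summing matrix S (n = 8 rows, m = 5 columns) for the Figure 1 hierarchy.
S = np.vstack([
    np.ones((1, 5)),              # Total = AA + AB + BA + BB + BC
    [[1, 1, 0, 0, 0]],            # A = AA + AB
    [[0, 0, 1, 1, 1]],            # B = BA + BB + BC
    np.eye(5),                    # each bottom-level series equals itself
])

def aggregate(b):
    """Return the full coherent vector y_t = S b_t for bottom observations b_t."""
    return S @ np.asarray(b, dtype=float)
```

Any vector produced this way satisfies the three constraints in (27) by construction, which is precisely the meaning of coherence.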

Hierarchical Forecasting Methods
Let ŷ_{t+h|t} be an n-dimensional vector containing the forecasts of the values of all series in the hierarchy at time t + h (with h = 1, 2, ...), obtained using observations up to and including time t, and stacked in the same order as y_t. These are usually called base forecasts. They are calculated independently for each time-series, not taking into account any relationship that might exist between them due to the aggregation constraints. Any forecasting method, such as ETS or ARIMA, can be used to generate these forecasts. The issue is that it is very unlikely that these will be coherent forecasts, hence some reconciliation method should be further applied. All existing reconciliation methods can be expressed as

ỹ_{t+h|t} = S P ŷ_{t+h|t},

where ỹ_{t+h|t} is an n-dimensional vector of reconciled forecasts, which are now coherent, and P is a matrix of dimension m × n, which maps the base forecasts ŷ_{t+h|t} into reconciled bottom level forecasts, which are then aggregated by the summing matrix S. If the bottom-up (BU) approach is used, then P = [0_{m×(n−m)} | I_m], where 0_{m×(n−m)} is the null matrix of order m × (n − m) and I_m is the identity matrix of order m [4-6,9,10,28,29]. For the hierarchy shown in Figure 1, P is given by

P = [0_{5×3} | I_5].

This approach is computationally very efficient, since it only requires summing the bottom level base forecasts. It also has the advantage of forecasting the series at the most disaggregated level and, although these series are more difficult to model, no information about the dynamics of the series is lost due to aggregation. However, it usually provides very poor forecasts for the upper levels in the hierarchy [13]. If a top-down (TD) approach is used, then P = [p | 0_{m×(n−1)}], where p = (p_1, ..., p_m)' is an m-dimensional vector containing the disaggregation proportions, which indicate how the top level base forecast at time t + h is to be distributed to obtain forecasts for the bottom level series, which are then summed by S [8,17,30-33].
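In code, the bottom-up P simply selects the bottom-level base forecasts, and any reconciliation method is the map ỹ = S P ŷ. A quick sketch for the Figure 1 hierarchy, with hypothetical base forecasts:

```python
import numpy as np

n, m = 8, 5
S = np.vstack([np.ones((1, m)), [[1, 1, 0, 0, 0]], [[0, 0, 1, 1, 1]], np.eye(m)])
P_bu = np.hstack([np.zeros((m, n - m)), np.eye(m)])   # P = [0 | I_m]

def reconcile(S, P, yhat):
    """Generic reconciliation y_tilde = S P y_hat; coherent by construction."""
    return S @ (P @ np.asarray(yhat, dtype=float))
```

One can also verify numerically that S P_bu S = S, the condition required for unbiased reconciled forecasts.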
For the hierarchy shown in Figure 1, P is given by

P = [p | 0_{5×7}].

The most common top-down methods performed quite well in Gross and Sohl [8]. In method "a" of Gross and Sohl [8] (referred to in the results that follow as TD_GSa), each proportion p_i is the average of the historical proportions of the bottom level series y_{i,j}, relative to the top level series y_{T,j}, over the time period j = 1, ..., t:

p_i = (1/t) ∑_{j=1}^{t} y_{i,j} / y_{T,j}.

In method "f" (referred to in the results that follow as TD_GSf), each proportion p_i is the average value of the historical data of the bottom level series y_{i,j}, relative to the average value of the historical data of the top level series y_{T,j}, over the time period j = 1, ..., t:

p_i = ( ∑_{j=1}^{t} y_{i,j} / t ) / ( ∑_{j=1}^{t} y_{T,j} / t ).

These two methods are very simple to implement, since they only require forecasts for the most aggregated series in the hierarchy. They seem to provide reliable forecasts for the aggregate levels. However, they are not able to capture the individual dynamics of the series that are lost due to aggregation. Moreover, since they are based on historical proportions, they tend to produce less accurate forecasts than the bottom-up approach at lower levels of the hierarchy, as they do not take into account how these proportions may change over time. To address this issue, Athanasopoulos et al. [9] proposed to obtain proportions based on forecasts rather than historical data:

p_i = ∏_{l=0}^{K−1} ŷ^{(l)}_{i,t+h|t} / Ŝ^{(l+1)}_{i,t+h|t},

where K is the number of levels of the hierarchy, ŷ^{(l)}_{i,t+h|t} is the base forecast at time t + h of the series that corresponds to the node which is l levels above node i, and Ŝ^{(l)}_{i,t+h|t} is the sum of the base forecasts at time t + h of the series that correspond to the nodes directly below the node that is l levels above node i. In the results that follow, this top-down method is referred to as TD_fp. In the methods discussed so far, no real reconciliation has been performed, because these have been based on base forecasts from a single level of the hierarchy.
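The two historical-proportion schemes differ only in where the averaging happens. A sketch, with B holding the bottom-level history (one row per series) and top the top-level history; the numbers used to exercise it are hypothetical:

```python
import numpy as np

def td_gsa(B, top):
    """Gross-Sohl method 'a': average of the per-period proportions y_ij / y_Tj."""
    return (np.asarray(B, float) / np.asarray(top, float)).mean(axis=1)

def td_gsf(B, top):
    """Gross-Sohl method 'f': average bottom value over average top value."""
    return np.asarray(B, float).mean(axis=1) / np.asarray(top, float).mean()
```

The resulting proportions multiply the top-level base forecast to give bottom-level forecasts, which S then sums back up the hierarchy. Note that only method "f" proportions are guaranteed to sum to 1 whenever the historical data are themselves coherent.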
However, processes that reconcile the base forecasts from the whole hierarchy structure in order to produce coherent forecasts can also be considered. Hyndman et al. [13] proposed an approach based on the regression model

ŷ_{t+h|t} = S β_{t+h|t} + ε_h,

where β_{t+h|t} is the unknown conditional mean of the most disaggregated series and ε_h is the coherency error, assumed to have mean zero and covariance matrix Σ_h. If Σ_h were known, the generalised least squares (GLS) estimator of β_{t+h|t} would lead to the following reconciled forecasts

ỹ_{t+h|t} = S (S' Σ_h^† S)^{−1} S' Σ_h^† ŷ_{t+h|t},

where Σ_h^† is the generalised inverse of Σ_h. Hyndman et al. [13] also showed that, if the base forecasts ŷ_{t+h|t} are unbiased, then the reconciled forecasts ỹ_{t+h|t} will be unbiased, provided that SPS = S. This condition is true for this reconciliation approach and also for the bottom-up, but not for top-down methods. So, the top-down approaches will never give unbiased reconciled forecasts, even if the base forecasts are unbiased. Recently, Wickramasuriya et al. [15] showed that, in general, Σ_h is not identifiable. They showed that the covariance matrix of the h-step ahead reconciled forecast errors is given by

Var(y_{t+h} − ỹ_{t+h|t}) = S P W_h P' S',

for any P such that SPS = S, where W_h = Var(y_{t+h} − ŷ_{t+h|t}) = E(ê_{t+h|t} ê'_{t+h|t}) is the covariance matrix of the corresponding h-step ahead base forecast errors. The goal is to find the matrix P that minimises the error variances of the reconciled forecasts, which are on the diagonal of the covariance matrix Var(y_{t+h} − ỹ_{t+h|t}). Wickramasuriya et al. [15] showed that the optimal reconciliation matrix P that minimises the trace of S P W_h P' S', such that SPS = S, is

P = (S' W_h^{−1} S)^{−1} S' W_h^{−1}.

Therefore, the optimal reconciled forecasts are given by

ỹ_{t+h|t} = S (S' W_h^{−1} S)^{−1} S' W_h^{−1} ŷ_{t+h|t},

which is referred to as the MinT (Minimum Trace) estimator. Note that the MinT and GLS estimators only differ in the covariance matrix. We still need to estimate W_h, which is a matrix of order n that can be quite large. The following simplifying approximations were considered by Wickramasuriya et al.
[15]: (1) W_h = k_h I_n for all h, with k_h > 0, where I_n is the identity matrix of order n. In this case, the MinT estimator corresponds to the ordinary least squares (OLS) estimator of β_{t+h|t}. It is the most simplifying approximation considered, with P independent of the data (it only depends on S), which means that this method does not account for differences in scale between the levels of the hierarchy (captured by the error variances of the base forecasts), or the relationships between the series (captured by the error covariances of the base forecasts). This is optimal only when the base forecast errors are uncorrelated and equivariant, which are unrealistic assumptions for a hierarchical time-series. In the results that follow, this method is referred to as OLS.
(2) W_h = k_h diag(Ŵ_1) for all h, with k_h > 0, where Ŵ_1 = (1/T) ∑_{t=1}^{T} ê_t ê_t' is the sample covariance estimator of the in-sample 1-step ahead base forecast errors. Then, W_h is a diagonal matrix with the diagonal entries of Ŵ_1, which are the variances of the in-sample 1-step ahead base forecast errors, stacked in the same order as y_t. This approximation scales the base forecasts using the variance of the residuals. In the results that follow, this specification is referred to as MinT-VarScale. (3) W_h = k_h Λ for all h, with k_h > 0, and Λ = diag(S1), where 1 is a unit vector of dimension m. This method was proposed by Athanasopoulos et al. [34] for temporal hierarchies, and assumes that the bottom level base forecast errors are uncorrelated between nodes and have variance k_h. Hence, the diagonal entries of Λ are the number of forecast error variances contributing to each node, stacked in the same order as y_t. This estimator only depends on the aggregation constraints, being independent of the data. Therefore, it is usually referred to as structural scaling, and we label it as MinT-StructScale. Notice that this specification only assumes equivariant base forecast errors at the bottom level, which is an advantage over OLS. It is particularly useful when the residuals are not available, which is the case when the base forecasts are generated by judgmental forecasting.
(4) W_h = k_h Ŵ*_1 for all h, with k_h > 0, where Ŵ*_1 = λ Ŵ_{1,D} + (1 − λ) Ŵ_1 is a shrinkage estimator that shrinks the off-diagonal elements of Ŵ_1 towards zero (while the diagonal elements remain unchanged), Ŵ_{1,D} is a diagonal matrix with the diagonal entries of Ŵ_1, and λ is the shrinkage intensity parameter. By parameterizing the shrinkage in terms of variances and correlations, rather than variances and covariances, and assuming that the variances are constant, Schäfer and Strimmer [35] proposed the following shrinkage intensity parameter

λ̂ = ( ∑_{i≠j} Var̂(r̂_{ij}) ) / ( ∑_{i≠j} r̂_{ij}² ),

where r̂_{ij} is the ij-th element of R̂_1, the sample correlation matrix of the in-sample 1-step ahead base forecast errors. In contrast to the variance and structural scaling estimators, which are diagonal covariance estimators accommodating only differences in scale between the levels of the hierarchy, this shrinkage estimator is a full covariance estimator that also accounts for the relationships between the series, while the shrinkage parameter regulates the complexity of the matrix W_h. In the results that follow, this method is referred to as MinT-Shrink. In all estimators, k_h is a proportionality constant that needs to be estimated only to obtain prediction intervals.
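Putting the pieces together, MinT reconciliation with a shrunk covariance can be sketched as follows. For simplicity, the shrinkage intensity λ is passed in directly rather than estimated by the Schäfer-Strimmer formula, and the small three-series hierarchy used to exercise it is hypothetical.

```python
import numpy as np

def mint_shrink(S, yhat, E, lam):
    """MinT reconciliation y_tilde = S (S'W^-1 S)^-1 S'W^-1 y_hat, where W
    shrinks the sample covariance of the in-sample 1-step base forecast
    errors E (T x n) towards its diagonal with intensity lam."""
    W1 = np.cov(E, rowvar=False)                       # full sample covariance
    W = lam * np.diag(np.diag(W1)) + (1 - lam) * W1    # shrink off-diagonals
    Winv = np.linalg.inv(W)
    P = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)    # optimal P matrix
    return S @ (P @ np.asarray(yhat, dtype=float))
```

Because PS = I_m for any invertible W, the condition SPS = S holds automatically, so already-coherent base forecasts pass through unchanged while incoherent ones are adjusted towards coherence.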

Case Study Data
The Jerónimo Martins Group is an international company, based in Portugal, with 225 years of accumulated experience in the retail sector. Food distribution is its main business and represents more than 95% of its consolidated sales. In Portugal, it leads the supermarket segment through a chain called Pingo Doce. This empirical study was performed using a real database of product sales from one of the largest stores of Pingo Doce. The data were aggregated on a weekly basis and span the period between 3 January 2012 and 27 April 2015, comprising a total of 173 weeks. Only the products with at least one sale every week were considered, since these are the most challenging for inventory planning. The hierarchical structure of products adopted by the retailer, from the top level to the bottom level, is: Store > Area > Division > Family > Category > Sub-category > SKU. The total number of time-series considered is 1751 (aggregated and disaggregated) and their split in the six levels of the hierarchy is summarised in Table 1. The most aggregated level, referred to as the top level, comprises the total sales at the store level. Level 1 comprises these sales disaggregated by the six main areas: grocery, specialized perishables, non-specialized perishables, beverages, detergents and cleaning, and personal care. These are further disaggregated, at level 2, into 21 divisions; at level 3, into 73 families; at level 4, into 203 categories; at level 5, into 459 sub-categories; and, at the bottom level, into 988 SKUs (Stock Keeping Units). Figure 2 plots the sales at the top level and at level 1 of the hierarchy, aggregating these by the store and by each of the 6 main areas. The scale on the y axis was removed for confidentiality reasons.
The strong peak in sales in 2012, observed in all series, corresponds to a promotional event carried out at a national level by Pingo Doce on 1 May (Labour Day), after which the company shifted from an Every Day Low Price strategy to a continuous promotional cycle.
All the series show local upward and downward trends, although these are less prominent in the detergents/cleaning and personal care time-series. The store time-series shows a behaviour similar to the perishables time-series, as the latter represents the major proportion of the total sales. These aggregate series do not show any seasonal variation. For a better understanding of the hierarchical structure of the data, we show, in Table 2, the complete hierarchy for the milk division (level 2). The total sales of the milk division are disaggregated, at level 3, into 2 families: Raw and UHT. The raw family is disaggregated into the Pasteurized category at level 4, which is further disaggregated into the Brik sub-category at level 5, which comprises 5 SKUs. The UHT family is disaggregated into the Current and Special categories. The Current category is disaggregated into the Semi-skimmed and Skimmed sub-categories, which comprise 2 and 3 SKUs, respectively. The Special category is disaggregated into the Semi-skimmed, Skimmed, and Flavored sub-categories, which comprise 10, 10, and 3 SKUs, respectively. The plots in Figure 3 show the sales of the SKUs within each sub-category of the milk division. These help us to visualise the diverse individual dynamics within each sub-category and the relative importance of each SKU. As we move down the hierarchy, the signal-to-noise ratio of the series decreases; therefore, the series at the bottom level show a lot more random variation than those at the higher levels.
Figure 3. Sales of the SKUs within each sub-category of the milk division.

Experimental Setup
Generating accurate forecasts for each of the 1751 time-series within the hierarchical structure is crucial for the planning operations of the store. We can always forecast the series at each level of the hierarchy independently (we refer to these as base forecasts), based on forecasting models fitted individually for each series. However, by ignoring the aggregation constraints, it is very unlikely that the resulting forecasts will be coherent. To ensure aligned decision-making across the various levels of management, it is essential that these forecasts are reconciled across all levels of the hierarchy.
We consider two alternative forecasting model families for generating the base forecasts; namely, ETS and ARIMA, as discussed in Section 2. The appropriate ETS model for each time-series is chosen from the 18 potential models by minimising AIC c , and the smoothing parameters and initial states are estimated by maximising the likelihood L [19], as implemented in the forecast package in the R software [36]. The ARIMA model is chosen following the algorithm proposed by Hyndman and Khandakar [37], also implemented in the forecast package. First, the number of seasonal and ordinary differences D and d required for stationarity are selected, and then the orders of p, q, P, and Q are identified, based on AIC c . ETS and ARIMA models are the two most widely-used approaches to time-series forecasting. They are based on different perspectives to the problem and often, but not always, perform differently, although they share some mathematically equivalent models [21,22,[38][39][40]. ARIMA can potentially capture higher-order time-series dynamics than ETS [34]. Therefore, we use both approaches to generate base forecasts, in order to evaluate how these can influence the performance of each reconciliation process. To make incoherent ETS and ARIMA forecasts coherent, we use the implementations of the hierarchical forecasting approaches, as discussed in Section 3.2, available in the hts package [41] for R.
We evaluate the forecasting accuracies of several competing methods using a rolling origin, as illustrated in Figure 4. By increasing the number of forecast errors available, we increase the confidence in our results. We start with the training set containing the first 139 weeks and generate 1- to 12-week-ahead base forecasts for each of the 1751 series using ETS and ARIMA. These base forecasts are then reconciled, using the alternative hierarchical methods. The training set is then expanded by one week, and the process is repeated until week 161. This gives a total of 23 forecast origins for each of the 1751 series. For each forecast origin, new ETS and ARIMA models based on the updated training data are specified, from which we generate new base forecasts, which are again reconciled, and the corresponding errors for both are calculated. The performance of the hierarchical forecasting methods was evaluated using the Average Relative Mean Squared Error (AvgRelMSE) [42]. As we are comparing forecast accuracy across time-series with different units, it is important to use a scale-independent error measure. For each time-series i, we calculate the Relative Mean Squared Error (RelMSE) [43]

RelMSE_{i,h} = MSE_{i,h} / MSE^base_{i,h}, (42)

where MSE_{i,h} is the mean squared error of the forecast of interest for horizon h, averaged across all forecast origins, and MSE^base_{i,h} is the mean squared error of the base forecast for horizon h, averaged across all forecast origins, which is used as a benchmark. If the hierarchical forecasting method reconciles ARIMA (ETS) base forecasts, then the ARIMA (ETS) base forecasts are taken as the benchmark. For each forecast horizon h, we averaged (42) across the time-series of the hierarchy using the following geometric mean

AvgRelMSE_{L,h} = ( ∏_{i∈L} RelMSE_{i,h} )^{1/n_L},

where L is the level (i.e., Top level, Level 1, . . ., Level 5, Bottom level, All) and n_L is the number of series at level L. The geometric mean should be used for averaging benchmark ratios, since it gives equal weight to reciprocal relative changes [44].
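The RelMSE and its geometric-mean aggregation are simple to compute; in the sketch below the MSE values are hypothetical numbers used only to exercise the function.

```python
import numpy as np

def avg_rel_mse(mse, mse_base):
    """Geometric mean of RelMSE_i = MSE_i / MSE_base_i across the series of a
    level; values below 1 mean the method improves on the base forecasts."""
    rel = np.asarray(mse, float) / np.asarray(mse_base, float)
    return float(np.exp(np.log(rel).mean()))
```

Note how a halving and a doubling of the MSE average out to exactly 1 under the geometric mean, which is the reciprocal-symmetry property that motivates its use for benchmark ratios.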
An advantage of AvgRelMSE is its interpretability: when it is smaller than 1, (1 − AvgRelMSE) × 100% is the average percentage improvement in MSE of the evaluated forecast over the benchmark. Table 3 presents the results of AvgRelMSE for the series of each hierarchical level, while Table 4 presents the results of AvgRelMSE for the complete hierarchy. BU refers to the bottom-up method, TD_GSa to the top-down "a" method of Gross and Sohl [8], TD_GSf to the top-down "f" method of Gross and Sohl [8], TD_fp to top-down with forecast proportions, OLS to Ordinary Least Squares, MinT-VarScale to the Minimum Trace Variance Scaling estimator, MinT-StructScale to the Minimum Trace Structural Scaling estimator, MinT-Shrink to the Minimum Trace Shrinkage estimator, and Base to the base forecasts. The left side of these tables shows the results using ARIMA base forecasts, while the right side shows the results using ETS base forecasts. As the base forecasts were used to scale the errors, in the rows labelled Base the AvgRelMSE is equal to 1 across all columns. We provide forecast results for 1 week, 2 weeks, 4 weeks (about one month), 8 weeks (about two months), and 12 weeks (about three months). The column labelled Rank provides the mean rank of each method across all forecast horizons: a method with a rank of 1 is the best on all horizons, while one with a rank of 9 is always the worst. Figure 5 supports the comparisons between the methods that are expected to perform better. It is immediately clear that the MinT-Shrink forecasts improved on the accuracy of the ARIMA base forecasts for all levels and for the complete hierarchy, across all forecast horizons. The only exceptions were the bottom level for the short-term horizons h = 1 and h = 2, albeit with marginal differences. The gains in forecast accuracy were more substantial at the higher levels of aggregation.
This was not the case for all other reconciliation methods, attesting to the difficulty of producing reconciled forecasts that were (at least) as accurate as the base forecasts. Furthermore, the MinT-Shrink method using ARIMA base forecasts returned the most accurate coherent forecasts for all levels, the only exceptions being the Store level, for which MinT-VarScale returned the most accurate forecasts, and the Area level, where MinT-StructScale performed best. The improvements in accuracy of the MinT-Shrink forecasts, across all forecast horizons, are more pronounced with the ARIMA base forecasts than with the ETS base forecasts (with the exception of horizon h = 1 at the bottom level), although the former were almost always more accurate than the latter (see Table 5). This could be due to a limitation of the ets() function in the forecast package, which restricts seasonality to a maximum period of 24. Without this limitation, ARIMA can potentially capture seasonalities of a higher order than ETS.

Results
Clearly, the least accurate method was OLS, for both ETS and ARIMA forecasts and across all forecast horizons; it only improved forecast accuracy over the base forecasts at the top level. This is because OLS ignores both the differences in scale between the levels of the hierarchy and any relationships between the series. A major drawback of the TD GSa and TD GSf methods is that they only consider information from the top level. Interestingly, their forecasts only improved on the accuracy of the ARIMA base forecasts for the Area level, never improving over the ETS base forecasts (the forecasts at the top level are equal to the base forecasts). The TD fp proportions were based on forecasts from all disaggregated levels of the hierarchy, but the method still performed badly, never improving the forecast accuracy over the ARIMA base forecasts across all forecast horizons. This was to be expected, since top-down approaches never give unbiased reconciled forecasts, even if the base forecasts are unbiased. BU provided poor forecasts for all aggregate levels in the hierarchy, showing average increases in the MSE relative to the base forecasts for all levels of aggregation and all forecast horizons (the forecasts at the bottom level are equal to the base forecasts). These losses in forecast accuracy were more substantial at higher levels of aggregation. Like OLS, MinT-StructScale only depends on the structure of the aggregations and not on the actual data, resulting in poor forecasts, especially at the lower levels of aggregation; in our case, at the Category, Sub-category, and SKU levels, which comprised about 94% of the time-series of the complete hierarchy (see Figure 5). On the other hand, by accommodating the differences in scale between the levels of the hierarchy, MinT-VarScale almost always performed well, generally improving the forecast accuracy over the base forecasts.
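To make the contrast between these two families concrete, the following Python sketch uses a toy two-series hierarchy (illustrative numbers, not the retailer's data) to show bottom-up aggregation and a top-down split using forecast proportions; the TD GSa and TD GSf variants would instead derive the proportions from historical observations rather than from forecasts.

```python
import numpy as np

# Toy 2-level hierarchy: the top series is the sum of two bottom series.
bottom_fc = np.array([30.0, 70.0])  # base forecasts at the bottom level
top_fc = 110.0                      # independent base forecast at the top

# Bottom-up: keep the bottom-level base forecasts and sum them upwards,
# discarding the (incoherent) top-level base forecast.
bu_top = bottom_fc.sum()            # 100.0

# Top-down with forecast proportions: keep the top-level base forecast
# and split it using the shares implied by the bottom-level forecasts.
props = bottom_fc / bottom_fc.sum()  # [0.3, 0.7]
td_bottom = top_fc * props           # [33.0, 77.0]; sums back to 110.0
```

Both outputs are coherent, but each discards the base forecasts of one end of the hierarchy, which is why neither can combine information from all levels.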
MinT-Shrink also accounted for the inter-relationships between the series in the hierarchy, always performing better than MinT-VarScale, across both ETS and ARIMA forecasts for all forecast horizons; the only exception was at the Store level (which comprised only one time-series). To improve on the accuracy of the base forecasts, the reconciliation methods have to take advantage of the combination of informative signals from all levels of aggregation. It is clear that MinT-Shrink was able to do this and, hence, improvements in forecast accuracy over the base forecasts were attained. For the complete hierarchy, the accuracy gains generally increased with the forecast horizon, varying between 1.7% and 3.7%. It is also evident that the gains in forecast accuracy were more substantial at higher levels of aggregation, which means that information about the individual dynamics of the series, which was lost due to aggregation, was brought back from the lower levels of aggregation to the higher levels by the reconciliation process, substantially improving the forecast accuracy over the base forecasts.
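All the trace-minimisation variants compared above share the same reconciliation formula, y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat, and differ only in the choice of W: the identity for OLS, a diagonal matrix of error variances for MinT-VarScale, the row sums of S for MinT-StructScale, and a shrunk full error covariance for MinT-Shrink. A minimal numpy sketch on a toy two-series hierarchy (illustrative values, not the paper's data):

```python
import numpy as np

def reconcile(S, y_hat, W):
    """Trace-minimisation reconciliation:
    y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat."""
    Winv = np.linalg.inv(W)
    G = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)  # bottom-level weights
    return S @ G @ y_hat

# Summing matrix for a hierarchy with a total above two bottom series.
S = np.array([[1.0, 1.0],   # top = b1 + b2
              [1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([110.0, 30.0, 70.0])  # incoherent base forecasts

ols = reconcile(S, y_hat, np.eye(3))                 # W = I -> OLS
var = reconcile(S, y_hat, np.diag([4.0, 1.0, 1.0]))  # W = diag of error
                                                     # variances -> VarScale
# MinT-Shrink would pass a shrunk full error covariance matrix as W.
```

Whatever W is chosen, the output is coherent by construction (the reconciled top equals the sum of the reconciled bottom series); only how the incoherence in y_hat is distributed across levels changes.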
These results are in accordance with those obtained by Kourentzes and Athanasopoulos [45], which compared MinT-Shrink and MinT-VarScale forecasts with base forecasts in the context of generating coherent cross-temporal forecasts for Australian tourism. Both MinT-Shrink and MinT-VarScale improved the forecast accuracy over the base ETS and ARIMA forecasts for the bottom level and the complete hierarchy. MinT-Shrink performed better than MinT-VarScale across both ETS and ARIMA forecasts.
In order to find out whether the forecast error differences between the several competing methods were statistically significant, we conducted a Nemenyi test [46]. The results of this test are shown in Figure 6. The panels on the left side show the results for the complete hierarchy using ARIMA base forecasts, for each forecast horizon, while the panels on the right side show the respective results using ETS base forecasts. On the vertical axis, the methods are sorted by MSE mean rank; on the horizontal axis, they are ordered as in Tables 3 and 4. In each row, the cell in black represents the method being tested, any blue cell indicates a method with no evidence of statistically significant differences at the 5% level, and the white cells indicate methods for which significant differences were found. We use the Nemenyi test implementation available in the tsutils [47] package for R. Analysing the results for ARIMA presented in the panels on the left side, we observe that, for h = 1, BU and Base are grouped together as the top-performing methods. They are immediately followed by MinT-Shrink and MinT-VarScale, which are found to be statistically indifferent. For the forecast horizon 1 − 2 (h = 2), BU, Base, MinT-Shrink, and MinT-VarScale are grouped together as the top-performing methods. For the forecast horizon 1 − 4 (h = 4), MinT-Shrink and MinT-VarScale belong to the top-performing group of forecasts, and BU and Base perform significantly worse. For the long-term forecasts, MinT-Shrink performs significantly better than MinT-VarScale, BU, and Base. The TD fp and MinT-StructScale methods perform significantly worse than MinT-Shrink, MinT-VarScale, BU, and Base across all forecast horizons, and are found to be statistically indifferent, outperforming only TD GSa, TD GSf, and OLS.
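The mechanics of the test can be sketched briefly: the methods are ranked on each series, and two methods are declared significantly different when their mean ranks differ by more than the critical distance CD = q_α √(k(k+1)/(6N)), for k methods and N series. The Python fragment below is a toy illustration with made-up MSEs for k = 3 methods (our study uses the tsutils implementation in R, with k = 9); the q value 2.343 is the tabulated Nemenyi critical value for k = 3 at the 5% level.

```python
import numpy as np

# Toy example: MSEs of k = 3 methods (columns) on N = 4 series (rows).
mse = np.array([[1.0, 1.2, 0.9],
                [2.0, 1.8, 2.1],
                [0.5, 0.7, 0.6],
                [3.0, 3.5, 2.9]])

# Rank the methods within each series (1 = lowest MSE; ties ignored here).
ranks = mse.argsort(axis=1).argsort(axis=1) + 1
mean_ranks = ranks.mean(axis=0)

# Nemenyi critical distance at the 5% level for k = 3 methods.
N, k = mse.shape
cd = 2.343 * np.sqrt(k * (k + 1) / (6.0 * N))

# Two methods are statistically indifferent when their mean ranks are
# within cd of each other -- the "blue cell" groups in Figure 6.
```

With these toy numbers, methods 1 and 3 tie on mean rank and method 2 trails, but no pair is separated by more than the critical distance, so all three would be shaded as one group.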
Analysing the results for ETS presented in the panels on the right side, we observe that, for h = 1, BU and Base are again grouped together as the top-performing methods, followed by MinT-VarScale and MinT-Shrink. For the forecast horizon 1 − 2 (h = 2), MinT-VarScale and Base are grouped together as the top-performing methods, immediately followed by MinT-Shrink and BU, which are found to be statistically indifferent. For the other forecast horizons, MinT-VarScale performs better, always followed by MinT-Shrink. Overall, for both ETS and ARIMA, the MinT approach outperforms the other competing methods, with the exception of the short horizon h = 1.

Conclusions
Retailers need forecasts for a huge number of related time-series which can be organised into a hierarchical structure. Sales at the SKU level can be naturally aggregated into categories, families, areas, stores, and regions. To ensure aligned decision-making across the hierarchy, it is essential that forecasts at the most disaggregated level add up to forecasts at the aggregate levels above. It is not immediately clear if these aggregate forecasts should be generated independently or by using a hierarchical forecasting method that ensures coherent decision-making at the different levels but does not guarantee (at least) the same accuracy. To give guidelines on this issue, our empirical study investigates the relative performance of independent and reconciled forecasting approaches.
We use weekly SKU sales data from one large store of a Portuguese retailer, spanning the period between 3 January 2012 and 27 April 2015, and consider the hierarchical structure of products adopted by the company from the top level to the bottom level, comprising six levels overall. We generate the independent forecasts using two alternative forecasting model families; namely, ETS and ARIMA. These are compared to the most commonly used hierarchical forecasting approaches. We evaluate the forecast accuracy of the several competing methods through the Average Relative Mean Squared Error, using a cross-validation based on a rolling forecast origin.
It is clear that the MinT-Shrink forecasts generally improve on the accuracy of the ARIMA base forecasts for all levels and for the complete hierarchy, across all forecast horizons. The accuracy gains generally increase with the horizon, varying between 1.7% and 3.7% for the complete hierarchy. That is not the case for all other reconciliation methods, attesting to the difficulty of producing reconciled forecasts that are at least as accurate as the base forecasts. The improvements in the accuracy of the MinT-Shrink forecasts, across all forecast horizons, are more pronounced with the ARIMA base forecasts than with the ETS base forecasts (with the exception of horizon h = 1 at the bottom level), although the former are almost always more accurate than the latter.
To improve on the accuracy of the base forecasts, the reconciliation methods have to take advantage of the combination of informative signals from all levels of aggregation. It is clear that MinT-Shrink is able to do this and, hence, improvements in forecast accuracy over the base forecasts are attained. It is also evident that the gains in forecast accuracy are more substantial at higher levels of aggregation, which means that the information about the individual dynamics of the series, lost when aggregating, is brought back from the lower levels of aggregation to the higher levels by the reconciliation process, substantially improving the forecast accuracy over the base forecasts.