1. Introduction
When multiple forecasts of a target variable are available, well-designed forecast combination methods can often outperform the best individual forecaster, as the last fifty years of applications of forecast combination in areas such as economics, finance, tourism and wind power generation have demonstrated.
Many combination methods have been proposed from different perspectives since the seminal work on forecast combination by Bates & Granger [1]. See the discussions and summaries in Clemen [2], Newbold & Harvey [3] and Timmermann [4] for key developments and many references. More recently, Lahiri et al. [5] provide theoretical and numerical comparisons between adaptive and simple forecast combination methods; Armstrong, Green and Graefe [6] propose important principles, centered on the golden rule of being conservative, for building accurate forecasts, and verify them empirically through an examination of previously-published studies; Green & Armstrong [7] review studies that compare simple and complicated methods and conclude that complexity substantially increases forecast error. They advocate the use of sophisticatedly simple methods instead of complicated ones that are hard to understand. This is in line with the fact that complicated methods often incur unnecessary instability and variability in prediction (see, e.g., Subsection 3.1 of Yang [8]). While researchers agree that forecast combination is very useful, they differ in their opinions on how to carry it out properly. Needless to say, one can envision many possibly drastically different scenarios for the forecast combination problem in terms of the accuracy of the candidate forecasts, their relationships, structural changes, the characteristics of the forecast errors and more, and these scenarios naturally favor different methods as top performers. Therefore, the availability of many combination methods and the disputes on their rankings and merits are, in our view, not only expected but also helpful for collectively reaching a better understanding of the key issues in this research area through further rigorous theoretical and empirical investigations.
The present work concerns forecast combination when the forecast errors exhibit heavy-tailed behaviors, meaning that the probability density function (or an estimate of it) of the forecast errors decays much more slowly than that of the normal distribution. To our knowledge, few studies have proposed or discussed forecast combination methods that target such situations, where the familiar forecast combination methods, such as the simple average, least squares regression with or without constraints, or methods based on the variance-covariance of the forecasts, may perform very poorly (some numerical examples are provided in Sections 4 and 5 of this paper).
Heavy-tailed behaviors of forecast errors may come from different sources. First, many important variables in finance, economics and other areas are known to have heavy tails. For example, currency exchange rates have long been believed to exhibit heavy-tailed behaviors, and Marinelli et al. [9], for instance, discussed the evidence for modeling them with heavy-tailed distributions. Some key macroeconomic indices, such as GDP, are also believed to have heavy-tailed tendencies, and Harvey [10], for instance, modeled U.S. GDP with Student's t-distributions with low degrees of freedom. The heavy tails of the variables to be forecast naturally tend to cause heavy-tailed behaviors of the forecast errors. Second, even if the target variables themselves have light tails, the variables in the information set may have long tails for various reasons, which can induce heavy tails in the forecast errors. Third, for a difficult target variable, we may also observe heavy-tailed forecast errors from predictive models when the data available for model training are limited, even if the true data-generating processes have relatively normal tails.
Clearly, when some of the forecast errors of the candidate forecasters are unusually large, a forecast combination method that does not take this into consideration may produce a final forecast that fully inherits the large prediction errors, which may then have severe practical consequences for decisions based on the forecast. Therefore, it is crucial to devise combination methods that can deal with heavy-tailed forecast errors for robust and reliable final performances. In the rest of this work, for convenience, heavy-tailed distributions may sometimes loosely refer to distributions with tails heavier than Gaussian, although specific choices, such as scaled t-distributions, will be studied.
In this paper, we propose two forecast combination methods. One is specially designed for situations in which there is strong evidence that the forecast errors are heavy tailed and can be modeled by a scaled Student's t-distribution (see below). The other is designed for more general uses. The design of these two methods follows the spirit of the adaptive forecasting through exponential re-weighting (AFTER) combination scheme of Yang [8]. The idea of the AFTER scheme is that the exponentiated cumulative historical performances of the candidate forecasts are informative and can be used to assign their combination weights for the future. This way of using the historical performances of the candidate forecasts for weighting has a natural tie to information theory and provides a near optimal final performance in mean forecasting errors. For example, if the random errors in the true model are from a normal distribution, then the weight of a candidate forecast by AFTER decreases exponentially in its cumulative historical squared forecast errors. For the first method mentioned above, we assume that the forecast errors follow a scaled Student's t-distribution with a possibly unknown scale parameter and degrees of freedom. Note that if a random variable X satisfies X = sT for some s > 0, where T follows a standard t-distribution with degrees of freedom ν, we say X has a scaled t-distribution with scale parameter s. For situations in which the heaviness of the tails of the forecast errors cannot be identified, the second method considers normal, double-exponential and scaled Student's t-distributions simultaneously as candidates for the distribution of the forecast errors. In either case, no parametric assumptions are needed on the relationships of the candidate forecasts.
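To make the exponential re-weighting idea concrete, the following minimal sketch (illustrative only; the error variance is treated as known and the function name is ours, not from any reference implementation) computes AFTER-style weights from cumulative squared errors under a normal error assumption:

```python
import math

# Toy sketch of AFTER-style exponential re-weighting under a normal
# error model with a known variance (a simplification for illustration).
def after_weights_normal(errors, sigma2=1.0):
    """errors[j] is the list of historical forecast errors of forecaster j."""
    # Cumulative sum of squared errors for each forecaster.
    sse = [sum(e * e for e in ej) for ej in errors]
    # Exponential re-weighting: smaller cumulative error -> larger weight.
    raw = [math.exp(-s / (2.0 * sigma2)) for s in sse]
    total = sum(raw)
    return [r / total for r in raw]

# Forecaster 0 has smaller historical errors, so it receives more weight.
w = after_weights_normal([[0.1, -0.2, 0.1], [1.0, -0.8, 1.2]])
```

A forecaster with a history of large errors is thus down-weighted exponentially rather than excluded outright.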
Technically, if the forecast errors are assumed to follow a normal or a double-exponential distribution with zero mean, then the conditional probability density functions used in the combining process of the AFTER scheme can be estimated relatively easily for all of the candidate forecasters, because the estimation of the conditional scale parameters is straightforward (see, e.g., Zou & Yang [11] and Wei & Yang [12] for more details). However, this is not true if a scaled t-distribution is assumed. Among the literature discussing maximum likelihood parameter estimation in Student's t-regressions in the last few decades, Fernandez & Steel [13] and Fonseca et al. [14] provided comprehensive summaries of the convergence properties of the parameter estimates in different situations. Both showed that simultaneously estimating the degrees of freedom and the scale parameter in a scaled Student's t-regression model suffers from an unbounded likelihood: the likelihood goes to infinity as the scale parameter goes to zero if the degrees of freedom ν are not large enough. To deal with this difficulty, methods other than maximum likelihood estimation have been proposed in the literature. For example, one may fix the degrees of freedom first and then estimate the scale parameter using the method of moments or other tools (see, e.g., Kan & Zhou [15]).
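As a hedged illustration of this fix-ν strategy (a sketch of the general idea, not the estimator of any particular paper): for ν > 2, a scaled t-variable with scale s has variance s²ν/(ν − 2), so a moment-based scale estimate is the sample standard deviation multiplied by sqrt((ν − 2)/ν):

```python
import math
import random

# Moment-based scale estimate for a scaled t-distribution with the
# degrees of freedom nu held fixed: for nu > 2,
#   Var(X) = s^2 * nu / (nu - 2),  so  s_hat = sd * sqrt((nu - 2) / nu).
def t_scale_mom(sample, nu):
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return math.sqrt(var * (nu - 2) / nu)

random.seed(0)
nu, s_true = 5, 2.0
# Simulate scaled t_5 draws via the normal / chi-square representation:
# T = Z / sqrt(V / nu) with Z standard normal, V a chi-square with nu df.
sample = [s_true * random.gauss(0, 1) /
          math.sqrt(sum(random.gauss(0, 1) ** 2 for _ in range(nu)) / nu)
          for _ in range(20000)]
s_hat = t_scale_mom(sample, nu)
```

With 20,000 simulated draws, `s_hat` lands close to the true scale of 2.0.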
We follow a two-step procedure to estimate the density function given a forecast error sequence. First, we estimate the scale parameter for each element in a given candidate pool of degrees of freedom; note that each combination of degrees of freedom and scale parameter leads to a different estimate of the density function. Second, each density estimate is assigned a weight based on its relative historical performance, and the final density estimate is a mixture of all of the candidate density estimates using these weights. More details about this procedure, including how to determine the pool of candidates, are given in Section 2. This procedure has three major advantages. First, because a pool of degrees of freedom (rather than a single candidate) is considered, it reduces the risk of picking degrees of freedom far from the truth. Second, the likelihood that each candidate density estimate is the best is decided purely by the data. Third, the calculation of the combined estimator is easy and fast.
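The two-step procedure can be sketched as follows (illustrative code; the pool, the scale estimates and the likelihood-based weighting are simplified stand-ins, not the exact procedure of Section 2):

```python
import math

# Sketch of the two-step idea: fix each candidate degrees of freedom,
# plug in a scale estimate, then weight the resulting density estimates
# by their exponentiated historical log-likelihoods and mix them.
def t_pdf(x, nu, s=1.0):
    # Density of a scaled Student's t-distribution with scale s.
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c / s * (1 + (x / s) ** 2 / nu) ** (-(nu + 1) / 2)

def mixture_density(past_errors, pool, scales):
    # Step 2: weight each (nu, s_hat) candidate by its likelihood on the
    # observed past errors, then mix the candidate densities.
    loglik = [sum(math.log(t_pdf(e, nu, s)) for e in past_errors)
              for nu, s in zip(pool, scales)]
    m = max(loglik)  # subtract the max for numerical stability
    raw = [math.exp(l - m) for l in loglik]
    total = sum(raw)
    weights = [r / total for r in raw]
    return weights, lambda x: sum(w * t_pdf(x, nu, s)
                                  for w, nu, s in zip(weights, pool, scales))

pool = [3, 5, 30]                       # candidate degrees of freedom
scales = [1.0, 1.0, 1.0]                # stand-in scale estimates from step 1
errors = [0.2, -3.5, 0.4, 2.8, -0.1]    # heavy-tailed-looking errors
weights, dens = mixture_density(errors, pool, scales)
```

On this error sequence with two large outliers, the heavy-tailed candidate (ν = 3) earns a larger weight than the near-normal one (ν = 30), as intended.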
It is worth pointing out that some popular combination methods in the literature make assumptions on the distributions of forecast errors that do not necessarily exclude heavy-tailed behaviors. For example, methods based on the estimation of the variance-covariance of the forecasters require the existence of variances, and regression-based forecast combination methods (see, e.g., Granger and Ramanathan [16]) assume the existence of certain moments of the forecast errors. However, to our knowledge, these methods are not designed to handle heavy-tailed errors and are not expected to work well in such situations.
Prior to our work, efforts have been made to handle error distributions with tails heavier than normal by adaptive forecast combination methods. For example, Sancetta [17] assumed that the tails of the target variables decay no more slowly than exponentially, which restricts the heaviness of the tails of the forecast errors. Wei & Yang [12] designed a method for errors heavier than normal, but no heavier than double-exponential. More recently, Cheng & Yang [18] advocate incorporating a smooth surrogate loss in the performance measure for weighting to reduce the occurrence of outlier forecasts. However, none of these methods can deal with forecast errors with tails as heavy as those of Student's t-distributions. The new AFTER methods in this paper will be shown to handle such situations.
The performance of the proposed methods will be examined via simulations and a real data example. We consider two simulation settings, with data-generating processes from regression models and from time series models. Several error distributions with different degrees of tail heaviness are used. The new methods are compared to earlier versions of AFTER, as well as to some popular combination methods. Their performances in heavy-tailed situations are indeed better than the competitors', and they remain among or close to the best even when the forecast errors have normal tails. For a real data application, we use 1428 time series variables from the M3-competition data (see Makridakis & Hibon [19]). The M3-competition data are very popular in empirical studies in econometrics, machine learning and statistics for validating the performances of forecasting methods. For each of the variables in this dataset, forecast sequences based on 24 popular forecasting methods are provided. The overall evaluation on the 1428 variables shows that our proposed methods, especially the one for general purposes, compare favorably to the others. To gain more insight, we pick out a subset of the 1428 variables that have heavy-tailed forecast errors, and the new methods behave nicely on it, as intended.
The plan of the paper is as follows. Section 2 introduces the forecast combination method designed for heavy-tailed error distributions. In Section 3, a more general combination method is proposed. Simulations are presented in Section 4, and Section 5 provides a real data example. Section 6 includes a brief concluding discussion. The proofs of the theoretical results are in the Appendix.
2. The t-AFTER Methodology
In this section, we propose a forecast combination method when there is strong evidence that the random errors in the data-generating process are heavy tailed and can be modeled by a scaled Student’s t-distribution.
2.1. Problem Setting
Suppose at each time period i there are J forecasters available for predicting the target variable y_i, and the forecast combination starts at time n_0. Note that some combination methods may require n_0 to be large enough, e.g., 10, to give reasonably accurate combinations. Let ŷ_{i,j} be the forecast of y_i from the j-th forecaster, and let ŷ_i = (ŷ_{i,1}, ..., ŷ_{i,J}) be the vector of candidate forecasts for y_i made at time point i − 1.
Suppose y_i = m_i + ε_i, where m_i is the conditional mean of y_i given all available information prior to observing y_i and ε_i is the random error at time i. Assume ε_i is from a distribution with probability density function (pdf) (1/s_i) h(x/s_i), where s_i is a scale parameter that depends on the data before observing y_i and h is a pdf with mean zero and scale parameter one.
Let W_i = (W_{i,1}, ..., W_{i,J}) be a vector of combination weights for ŷ_i. It is assumed that ∑_{j=1}^{J} W_{i,j} = 1 and W_{i,j} ≥ 0 for any i, j. Let W_1 be the initial weight vector. The combined forecast for y_i from a combination method is:

ŷ_i^c = ⟨W_i, ŷ_i⟩,   (1)

where ⟨u, v⟩ stands for the inner product of vectors u and v. Specifically, when needed, we use a superscript δ on each W_i to denote the combination weights that correspond to the method δ. For example, in the following sections, W_i^t and W_i^g stand for the combination weights from the t- and g-AFTER methods, respectively.
2.2. The Existing AFTER Methods: The L2- and L1-AFTER Methods
As one recent method of adaptive forecast combination, the general scheme of adaptive forecasting through exponential re-weighting (AFTER) was proposed by Yang [8]. It has been applied and studied in, e.g., Fonseca et al. [14], Inoue & Kilian [20], Sanchez [21], Altavilla & De Grauwe [22] and Lahiri et al. [5], and Zhang et al. [23] handled the case where the variable to be predicted is categorical.
In the general AFTER formulation, the relative cumulative predictive accuracies of the forecasters are used to decide their combining weights. Let ‖v‖_1 be the L1-norm of a vector v.
The general form of W_i for the AFTER approach is:

W_{i,j} = W_{1,j} ∏_{l=n_0}^{i−1} ĥ_{l,j}(y_l − ŷ_{l,j}) / ∑_{k=1}^{J} W_{1,k} ∏_{l=n_0}^{i−1} ĥ_{l,k}(y_l − ŷ_{l,k}),   (2)

where the W_{1,j} are the initial weights, and for any i ≥ n_0,

ĥ_{i,j}(x) = (1/ŝ_{i,j}) h(x/ŝ_{i,j}),   (3)

where ĥ_{i,j} is an estimate of the conditional error density from the j-th forecaster at time point i, based on an estimate ŝ_{i,j} of the scale parameter.
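The weighting just described can be sketched in a sequential form as follows (a toy illustration with a fixed Gaussian density; in actual use, each forecaster's estimated conditional error density would be plugged in):

```python
import math

# Sketch of the sequential AFTER weight update: after each observation,
# every forecaster's weight is multiplied by the estimated density of
# its realized forecast error and then renormalized.  A Gaussian density
# with a fixed scale is used here purely for illustration.
def normal_pdf(x, s=1.0):
    return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def after_update(weights, y, forecasts, density=normal_pdf):
    raw = [w * density(y - f) for w, f in zip(weights, forecasts)]
    total = sum(raw)
    return [r / total for r in raw]

# Equal initial weights over J = 2 forecasters, updated through a short
# history of (observation, forecast vector) pairs.
w = [0.5, 0.5]
history = [(1.0, [0.9, 2.0]), (1.5, [1.4, 0.1]), (0.5, [0.6, 1.8])]
for y, f in history:
    w = after_update(w, y, f)
```

Because forecaster 0 is consistently closer to the realized values, its weight grows with each update.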
Below, the most commonly-used AFTER procedures, the L2-AFTER from Zou & Yang [11] and the L1-AFTER from Wei & Yang [12], are briefly introduced.
L2-AFTER: When the random errors in the data-generating process follow a normal distribution or a distribution close to normal, the L2-AFTER is both theoretically and empirically competitive: it provides combined forecasts that perform at least as well as any individual forecaster in any performance evaluation period, up to a small penalty. Let φ be the pdf of the standard normal distribution. To get the L2-AFTER weights, first use φ as the h in (3), then plug the resulting ĥ into (2). The scale estimate used in the L2-AFTER is the sample standard deviation of the past forecast errors, computed under the assumption that the random errors are independent and identically distributed.
L1-AFTER: Let g be the pdf of a double-exponential distribution with scale parameter one and location parameter zero. To get the L1-AFTER weights, one can follow the same procedure as for the L2-AFTER, but use g as the h in (3). The scale estimate used in the L1-AFTER is the mean of the absolute values of the past forecast errors. The L1-AFTER method was designed for robust combination when the random errors have occasional outliers. See Wei and Yang [12] for details.
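The two plug-in scale estimates described above can be contrasted in a small sketch (assuming i.i.d. errors; the function names are ours): the sample standard deviation for the normal-error case versus the mean absolute error, which estimates the double-exponential scale parameter.

```python
import math

# The two plug-in scale estimates: the sample standard deviation
# (normal-error case) versus the mean absolute error (double-exponential
# case, whose scale parameter equals the mean absolute deviation).
def scale_normal(errors):
    n = len(errors)
    mean = sum(errors) / n
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / (n - 1))

def scale_double_exp(errors):
    return sum(abs(e) for e in errors) / len(errors)

errors = [0.5, -0.4, 0.3, -6.0, 0.2]   # one outlier among small errors
s2 = scale_normal(errors)
s1 = scale_double_exp(errors)
```

The single outlier inflates the squared-error-based estimate `s2` considerably more than the absolute-error-based `s1`, which is one intuition behind the robustness of the L1 variant.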
2.3. The t-AFTER Methods
Since simultaneously estimating the degrees of freedom and the scale parameter in a scaled Student's t-regression setting suffers from the theoretical difficulties mentioned in the Introduction, we use a different strategy in this paper. Specifically, we take an estimation procedure with two steps:

First, we choose a pool of K candidate values of the degrees of freedom, considered likely to be close to the degrees of freedom of the Student's t-distribution that describes the random errors well. For each element of the pool, we treat it as the true degrees of freedom and estimate the corresponding scale parameter. This yields K estimated pairs of degrees of freedom and scale parameter.

Second, for each of the K estimated pairs, we assess the probability that it is the true one based on relative historical performances.

This two-step procedure is used in the t-AFTER method for forecast combination when the random errors have heavy tails that can be described well by a Student's t-distribution.
Let Ω be a set of candidate degrees of freedom for Student's t-distributions. The choice of Ω will be discussed later in this subsection. Let W_{1,j}^ν (1 ≤ j ≤ J and ν ∈ Ω) be the initial combination weight of forecaster j under the degrees of freedom ν.
Let the combining weight of ŷ_{i,j} from a t-AFTER method be W_{i,j}^t and the combined forecast be ŷ_i^t. Then, W_{i,j}^t and ŷ_i^t are obtained via the following steps:

Estimate s_i (e.g., by MLE) for each ν ∈ Ω and for each candidate forecaster. The estimate of s_i from the j-th forecaster given ν is denoted ŝ_{i,j}^ν.

Calculate W_{i,j}^t and ŷ_i^t:

W_{i,j}^t = ∑_{ν∈Ω} W_{i,j}^ν  and  ŷ_i^t = ∑_{j=1}^{J} W_{i,j}^t ŷ_{i,j},   (4)

where W_{n_0,j}^ν = W_{1,j}^ν, and for i > n_0 and any j and ν ∈ Ω,

W_{i,j}^ν = W_{1,j}^ν ∏_{l=n_0}^{i−1} (1/ŝ_{l,j}^ν) h_ν((y_l − ŷ_{l,j})/ŝ_{l,j}^ν) / ∑_{ν′∈Ω} ∑_{k=1}^{J} W_{1,k}^{ν′} ∏_{l=n_0}^{i−1} (1/ŝ_{l,k}^{ν′}) h_{ν′}((y_l − ŷ_{l,k})/ŝ_{l,k}^{ν′}),   (5)

where h_ν is the pdf of a Student's t-distribution with degrees of freedom ν.
For convenience, it is assumed that the elements of Ω are natural numbers. In general, when no specific information is available to choose the candidate degrees of freedom efficiently, one can start with a large but relatively sparse pool and then narrow it down based on performances on some training datasets. When there is strong evidence that the tails of the forecast errors are heavy, the size of Ω can be relatively small, say no more than three or five; in our experience, small pools of low degrees of freedom work well in this situation.
Obviously, when the random errors in the true model follow a scaled Student's t-distribution with a known degree of freedom ν, then Ω = {ν}, and (5) simplifies to:

W_{i,j}^t = W_{1,j} ∏_{l=n_0}^{i−1} (1/ŝ_{l,j}) h_ν((y_l − ŷ_{l,j})/ŝ_{l,j}) / ∑_{k=1}^{J} W_{1,k} ∏_{l=n_0}^{i−1} (1/ŝ_{l,k}) h_ν((y_l − ŷ_{l,k})/ŝ_{l,k}),   (6)

where W_{1,j} is the initial weight of the j-th forecaster and ŝ_{i,j} is an estimate of s_i from the j-th forecaster using all information at and before time point i − 1, when the true ν is known.
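A minimal sketch of this known-ν case follows (the scale estimates are taken as given here, whereas in practice they would be estimated, e.g., by MLE, and the function names are ours):

```python
import math

# Sketch of simplified t-AFTER weights when nu is known: each
# forecaster's weight is proportional to its initial weight times the
# product of scaled-t densities evaluated at its past forecast errors.
def t_pdf(x, nu, s):
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c / s * (1 + (x / s) ** 2 / nu) ** (-(nu + 1) / 2)

def t_after_weights(errors, scales, nu, init=None):
    """errors[j], scales[j]: past errors / scale estimates of forecaster j."""
    J = len(errors)
    init = init or [1.0 / J] * J
    # Work on the log scale for numerical stability.
    logw = [math.log(init[j]) +
            sum(math.log(t_pdf(e, nu, s)) for e, s in zip(errors[j], scales[j]))
            for j in range(J)]
    m = max(logw)
    raw = [math.exp(l - m) for l in logw]
    total = sum(raw)
    return [r / total for r in raw]

errs = [[0.2, -0.3, 0.1], [2.5, -1.8, 3.0]]   # forecaster 0 is more accurate
scls = [[1.0] * 3, [1.0] * 3]                 # stand-in scale estimates
w = t_after_weights(errs, scls, nu=5)
```

Working on the log scale and subtracting the maximum before exponentiating avoids underflow when the error history is long.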
2.4. Risk Bounds of the t-AFTER
To avoid potential redundancy, we first give a risk bound on the t-AFTER assuming ν is known. A more general theorem that treats ν (and even the form of the error distribution) as unknown will be given in Section 3 (see the third remark after Theorem 2).
2.4.1. Conditions
Condition 1: There exists a constant
, such that for any
,
Condition 2: There exists a constant
, such that for any
and
:
Condition 2′: There exists a constant
, such that for any
and
:
Condition 1 holds when the forecast errors are bounded, which is true in many real applications, although it excludes some time series models, such as AR(1). It is required for the development of the theorems in this paper. Note that this condition does not require the random errors to be bounded, so it allows large outliers to occur. When the conditional mean of y_i is known to stay in a certain range and the candidate forecasts are accordingly restricted, the condition holds. See Subsection 3.1 of Wei & Yang [12] for more discussion of this condition.
Condition 2 generally requires that the estimates of the scale parameters not be too small compared with the truth; Condition 2′ requires that the estimates not be too far from the truth in either direction.
2.4.2. Risk Bounds for the t-AFTER with a Known ν
Assume the true forecast errors follow a scaled Student's t-distribution with a known degree of freedom ν. Let σ_i and s_i be the conditional standard deviation and scale parameter, respectively, of the random error at time point i, and let ŝ_{i,j} be an estimator of s_i from the j-th forecaster.
Let h_i be the actual conditional error density function at time point i, and let ĥ_i denote the weighted mixture of the candidate density estimates, with the weights defined in (4). Therefore, ĥ_i is the mixture estimator of h_i from the t-AFTER procedure. Let D(f‖g) be the Kullback–Leibler divergence between two density functions f and g. Then D(h_i‖ĥ_i) measures the performance of ĥ_i as an estimate of h_i under the Kullback–Leibler divergence at time point i.
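As a numeric illustration of the Kullback–Leibler divergence as a performance measure (a simple Riemann-sum approximation for illustration, not part of the theory):

```python
import math

# Numeric sketch of the Kullback-Leibler divergence D(f || g) between
# two error densities, here two scaled Student's t densities, using a
# midpoint Riemann sum on a wide grid.
def t_pdf(x, nu, s=1.0):
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c / s * (1 + (x / s) ** 2 / nu) ** (-(nu + 1) / 2)

def kl_divergence(f, g, lo=-60.0, hi=60.0, n=40000):
    dx = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        fx = f(x)
        if fx > 0:
            total += fx * math.log(fx / g(x)) * dx
    return total

# The divergence from a t_3 error density to an estimate with the
# correct scale is zero; a wrong scale yields a positive divergence.
d_good = kl_divergence(lambda x: t_pdf(x, 3), lambda x: t_pdf(x, 3, s=1.0))
d_bad = kl_divergence(lambda x: t_pdf(x, 3), lambda x: t_pdf(x, 3, s=2.0))
```

This matches the role of D(h_i‖ĥ_i) above: better density estimates make the divergence smaller, with zero attained only when the estimate coincides with the truth.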
Theorem 1. If the random errors are from a scaled Student’s t-distribution with degrees of freedom ν and Condition 2 holds, then: Further, if ν is strictly larger than two and Conditions 1 and 2 hold, then In the above, C, and are constants. and depend on and , respectively. is a function of ν, and C depends on τ and .
Remarks:
When only Condition 2 is satisfied, Theorem 1 shows that the cumulative distance between the true densities and their t-AFTER estimators is upper bounded by the cumulative (standardized) forecast errors of the best candidate forecaster plus a penalty with two parts: the squared relative estimation errors of the scale parameters and the logarithm of the initial weights. This risk bound is obtained without assuming the existence of the variances of the random errors, and the scale parameter estimates are only required to be suitably lower bounded.
When ν is assumed to be strictly larger than two and both Conditions 1 and 2 are satisfied, Theorem 1 shows that the cumulative forecast errors have the same convergence rate as the cumulative forecast errors of the best candidate forecaster, plus a penalty that depends on the initial weights and the efficiency of the scale parameter estimation. The risk bounds hold even if the distribution of the random errors has tails as heavy as those of a scaled Student's t-distribution with ν > 2.
If there is no prior information for choosing the initial weights in (6), then equal initial weights can be used, i.e., W_{1,j} = 1/J for all j. In this case, it is easy to see that the number of candidate forecasters plays a role in the penalty. When the candidate pool is large, some preliminary analysis should be done to eliminate the significantly less competitive candidates before applying the t-AFTER.
6. Conclusions
Forecast combination is an important tool to achieve better forecasting accuracy when multiple candidate forecasters are available. Although many popular forecast combination methods do not necessarily exclude heavy-tailed situations, little is found in the literature that examines the performances of forecast combination methods in such situations with theoretical characterizations.
In this paper, we propose combination methods designed for cases when forecast errors exhibit heavy-tailed behaviors that can be modeled by a scaled Student’s t-distribution and for the cases when the heaviness of the forecast errors is not easy to identify. The t-AFTER models the heavy-tailed random errors with scaled Student’s t-distributions with unknown (or known) degrees of freedom and scale parameters. A candidate pool of degrees of freedom is proposed to solve the estimation problem, and the resulting t-AFTER works well, as seen in the simulation and real example analysis.
However, in many cases, the heaviness of the tails of the random errors is difficult to identify. We therefore design a combination process for general use, called the g-AFTER. For these situations, instead of assuming a specific distributional form for the random errors, a set of candidate tail heaviness levels is considered, and the combination process automatically decides which ones are more plausible by giving them higher weights. The numerical results suggest that the performance of the g-AFTER is more robust than that of other popular combination methods because of its adaptive capability. The design of the g-AFTER embodies a general idea: when there are multiple reasonable candidate distributions for the random errors, combining them in an AFTER scheme like the g-AFTER should work well for forecast combination.
In the present numerical work, the numbers of candidate forecasts considered are relatively small. In some situations, there are large numbers of candidate forecasts to begin with. It has been shown in the literature that proper screening before combining can be beneficial, and information criteria can be used to choose top performers to be combined (see, e.g., Yuan & Yang [29] and Zhang et al. [23]). Alternatively, one may use model confidence sets (see Hansen et al. [30] and Ferrari & Yang [31]) to narrow the pool of candidates before applying a combining method. Samuels & Sekkel [32] provide an interesting comparative study on the effect of screening via the model confidence set of Hansen et al. [30], showing that removing poor candidates indeed improves the final performance of the combined forecast. In the future, it will be useful to investigate how the t- and g-AFTER methods behave when a screening step is applied before combining.