Forecast Combination Under Heavy-Tailed Errors

Forecast combination has been proven to be a very important technique to obtain accurate predictions. In many applications, forecast errors exhibit heavy tail behaviors for various reasons. Unfortunately, to our knowledge, little has been done to deal with forecast combination for such situations. The familiar forecast combination methods such as simple average, least squares regression, or those based on variance-covariance of the forecasts, may perform very poorly. In this paper, we propose two nonparametric forecast combination methods to address the problem. One is specially proposed for the situations that the forecast errors are strongly believed to have heavy tails that can be modeled by a scaled Student's t-distribution; the other is designed for relatively more general situations when there is a lack of strong or consistent evidence on the tail behaviors of the forecast errors due to shortage of data and/or evolving data generating process. Adaptive risk bounds of both methods are developed. Simulations and a real example show superior performance of the new methods.


Introduction
When multiple forecasts are available for a target variable, well designed forecast combination methods can often outperform the best individual forecaster, as demonstrated in the literature of applications of forecast combinations in fields such as tourism, wind power generation, finance and economics in the last fifty years.
Many combination methods have been proposed from different perspectives since the seminal work of forecast combination by Bates & Granger (1969). See the discussions and summaries in Clemen (1989), Newbold & Harvey (2002) and Timmermann (2006) for key developments and many references. More recently, Lahiri et. al (2013) provided theoretical and numerical comparisons between adaptive and simple forecast combination methods.
However, to our knowledge, few studies have proposed/discussed forecast combination methods that target at cases where the forecast errors exhibit heavy tail behaviors. In this paper, heavy tailed distributions may sometimes loosely refer to distributions with tails heavier than Gaussian distributions, although specific choices such as t-distributions will be studied.
In many such situations, the familiar forecast combination methods such as simple average, Therefore, it is practically very useful to design forecast combination methods to handle heavy tailed situations.
In this paper, we propose two forecast combination methods following the spirit of the AFTER strategy by Yang (2004). One is specially designed for situations when there is strong evidence that the forecast errors are heavy-tailed and can be modeled by a scaled Student's t-distribution. The other one is designed for more general uses. For the former case, we assume that the forecast errors follow a scaled Student's t-distribution with possibly unknown scaled parameter and degrees of freedom. For situations when the identification of the heaviness of tails of the forecast errors is not feasible, normal, double-exponential and scaled Student's t-distributions are considered at the same time as candidates for the distribution form of the forecast errors. In either case, no parametric assumptions are needed on the relationships of the candidate forecasts.
Technically, if the forecast errors are assumed to follow a normal or a double-exponential distribution with zero mean, then the conditional probability density functions used in the combining process of the AFTER scheme can be estimated relatively easily for all the candidate forecasters because the estimation of the conditional scale parameters is straightforward.
See, e.g., Zou & Yang (2004) and Wei & Yang (2012), for more details. However, this is not thue if a scaled Student's t-distribution is assumed. Among the literature discussing the maximum likelihood parameter estimation in Student's t-regressions in the last few decades, Fernandez & Steel (1999) and Fonseca et. al (2008) provided comprehensive summaries of the convergence properties of the parameter estimations in different situations. Both of them showed that the estimation of the degrees of freedom and the scale parameter simultaneously in a scaled Student's t-regression models suffers from monotonic likelihood because the likelihood goes to infinity as the scale parameter goes to zero if the degrees of freedom ν is not large enough. To deal with this difficulty, methods other than maximum likelihood estimation have been proposed in the literature. For example, one may fix the degrees of freedom first then estimate the scale parameter using method of moments or other tools (see, e.g., Kan & Zhou, 2003).
In this paper, we follow a two-step procedure to estimate the density function given a forecast error sequence. First, estimate the scale parameter for each element in a given candidate pool of degrees of freedom. Note that each combination of the degrees of freedom and the scale parameter leads to a different estimate of the density function. Second, the weight of a density estimate is assigned from its relative historical performance. The final density estimate is a weighted mean of all the candidate density estimates. More details about this procedure, including how to determine the pool of candidate estimates, are available in section 2. There are three major advantages of this procedure: first, because a pool of degrees of freedom (rather than a single candidate) is considered, it reduces the potential risk of picking a degrees of freedom parameter that is far from the truth. Second, the likelihood that each candidate density estimate is the best is purely decided by data. Third, the calculation of the combined estimator is easy and fast.
It is worth pointing out that some popular combination methods in the literature make assumptions on the distributions of forecast errors that do not necessarily exclude heavy tailed behaviors. For example, methods that are based on the estimation of variance-covariance of forecasters require the existence of variances. Regression based forecast combination methods (see, e.g., Granger & Ramanathan, 1984) assume the existence of certain moments of the forecast errors. However, to our knowledge, these methods are not really designed to handle heavy-tailed errors and are not expected to work well for such situations.
Prior to our work, efforts have been made to deal with error distributions that have tails heavier than normal by adaptive forecast combination methods. For example, Sancetta (2010) assumed that the tails of the target variables are no heavier than exponential decays, which restrict the heaviness of the tails of the forecast errors. Wei & Yang (2012) designed a method for errors heavier than the normal distributions but not heavier than the doubleexponential distributions. However, none of these methods can deal with forecast errors with tails as heavy as that of Student's t-distributions. The new AFTER methods in this paper will be shown to handle such situations.
The plan of the paper is as follows: section 2 introduces the forecast combination method designed for heavy-tailed error distributions; in section 3, a more general combination method is proposed. Simulations are presented in section 4, and section 5 provides a real data example. Section 6 includes a brief concluding discussion. The proofs of the theoretical results are in the appendix.

t-AFTER
In this section, we propose a forecast combination method when there is strong evidence that the random errors in the data-generating process are heavy-tailed and can be modeled by a scaled Student's t-distribution.

Problem Setting
Suppose at each time period i ≥ 1, there are J forecasters available for predicting y i and the forecast combination starts at i 0 ≥ 1. Note that some combination methods may require i 0 to be large enough, e.g., 10, to give reasonably accurate combinations. Letŷ i,j be the forecast of y i from the j-th forecaster. LetŶ i := (ŷ i,1 , · · · ,ŷ i,J ) be the vector of candidate forecasts for y i made at time point i − 1.
Suppose y i := m i + ǫ i , where m i is the conditional mean of y i given all available information prior to observing y i and ǫ i is the random error at time i. Assume ǫ i is from a distribution with probability density function (pdf ) 1 where s i is the scale parameter that depends on the data before observing y i and h(·) is a pdf with mean 0 and scale parameter 1.
w J ) be the initial weight vector. The combined forecast for y i from a combination method is: where a, b stands for the inner-product of vectors a and b. Specifically, when needed, we use a superscript δ on each W i to denote the combination weights that correspond to the method δ. For example, in the following sections, W A 2 i and W A 1 i stand for the combination weights from the L 2 -and L 1 -AFTER methods, respectively.

The Existing AFTER Methods
As one recent method of adaptive forecast combination, the general scheme of adaptive forecast combination via exponential re-weighting (AFTER) was proposed by Yang (2004 In the general AFTER formulation, the relative cumulative predictive accuracies of the forecasters are used to decide their combining weights. Let ||x|| 1 := n i=1 |x i | be the l 1 -norm of vector x = (x 1 , · · · , x n ).
The general form of W i for the AFTER approach is: where l i−1 = (l i−1,1 , · · · , l i−1,J ) and for any 1 ≤ j ≤ J, whereŝ i ′ ,j is an estimate of s i ′ from the j-th forecaster at time point i ′ − 1.
Below, the most commonly used AFTER procedures, the L 2 -AFTER from Zou & Yang (2004) and the L 1 -AFTER from Wei & Yang (2012), are briefly introduced.
L 2 -AFTER When the random errors in the data generating process follow a normal distribution or a distribution close to a normal distribution, the L 2 -AFTER is both theoretically and empirically competitive in providing combined forecasts that perform at least as well as any individual forecaster in any performance evaluation period plus a small penalty. Let f N be the pdf of N(0, 1). To get W A 2 i , first use f N as the h in (3), then plug the new l i−1 into (2). Theŝ i,j used in the L 2 -AFTER, denoted asσ i,j , is the sample standard deviation of assuming the random errors are independent and identically distributed. L 1 -AFTER Let f DE be the pdf of a double-exponential distribution with scale parameter 1 and location parameter 0. To get W A 1 i , one can follow the same procedure for W A 2 i but use f DE as the h in (3). Theŝ i,j used in the L 1 -AFTER, denoted asd i,j , is the mean of The L 1 -AFTER method was designed for robust combination when the random errors have occasional outliers. See Wei & Yang (2012) for details.

The t-AFTER Methods
Since the estimation of the degrees of freedom and the scale parameter simultaneously in a scaled Student's t-regression setting suffers from certain theoretical difficulties as mentioned in the introduction, we use a different strategy in this paper. Specifically, we take an estimation procedure that has two steps: 1. We decide a pool of candidate degrees of freedom with size K. The elements in the pool are considered to be close to the degrees of freedom of the Students' t-distribution that describes the random errors well. For each element in the set, we assume it is the true degrees of freedom to estimate the related scale parameter. So we have K sets of estimate for the degrees of freedom and scale parameter pair.
2. For each of the K sets of estimate, we find its probability to be the true one based on the relative historical performances.
This two-step procedure is used in the t-AFTER method for forecast combination when the random errors have heavy tails that can be described well by a Students' t-distribution.
Let Ω := (ν 1 , · · · , ν K ) be a set of degrees of freedom for Student's t-distributions. The choice of Ω will be discussed later in this subsection. Let w j,k (w j,k ≥ 0 and K k=1 J j=1 w j,k = 1) be the initial combination weight of the forecaster j under the degrees of freedom ν k .
Let the combining weight ofŶ i from a t-AFTER method be W At 1. Estimate (e.g., by MLE) s i for each ν k ∈ Ω and for each candidate forecaster. The estimate for s i from the j-th forecaster given ν k is denoted asŝ i,j,k .
2. Calculate W At i andŷ At i : where l At i−1 = (l At i−1,1 , · · · , l At i−1,J ) and for 1 ≤ j ≤ J and any i ≥ i 0 + 1, where f t (·|ν) is the pdf of a Student's t-distribution with degrees of freedom ν.
It is assumed that the elements in Ω are natural numbers for the sake of convenience.
In general, when no specific information is available to estimate the size of candidate degrees of freedom efficiently, one can start with a large but relatively sparse pool (say, Obviously, when the random errors in the true model follow a scaled Student's t-distribution with a known degrees of freedom ν, then Ω := {ν}. Then (5) can be simplified into: where w j is the initial weight of the j-th forecaster andŝ i,j is an estimate of s i from the j-th forecaster using all information at and before time point i − 1 when the true ν is known.

Risk Bounds of the t-AFTER
To avoid potential redundancy, we first give a risk bound on the t-AFTER assuming ν is known. A more general theorem that treats ν (and even the form of error distribution) as unknown will be given in section 3.

Conditions
Condition 1. There exists a constant τ > 0 such that for any i ≥ i 0 , Condition 2. These exists a constant ξ 1 > 0 such that for any i ≥ i 0 and 1 ≤ j ≤ J: Condition 2 ′ . These exists a constant 0 < ξ ′ 1 < 1 such that for any i ≥ i 0 and 1 ≤ j ≤ J: Condition 1 holds when the forecast errors are bounded, which is true in many real applications, although it excludes some time series models such as AR(1). It is required for the development of the theorems in this paper. As you can see that this condition does not require y i to be bounded so that it allows large outliers to occur in the random errors.
When the conditional mean of y i is known to stay in certain range and the related forecasts Let ν be the actual conditional error density function at time point i Further, if ν is strictly larger than 2 and Conditions 1 and 2 ′ hold, then In the above, C, B 1 , B 2 and B 3 are constants. B 1 and B 3 depend on ξ 1 and ξ ′ 1 , respectively. B 2 is a function of ν and C depends on τ and ξ ′ 1 .

Remarks.
1. When only Condition 2 is satisfied, Theorem 1 shows that the cumulative distance between the true densities and their estimators from the t-AFTER is upper bounded by the cumulative (standardized) forecast errors of the best candidate forecaster plus a penalty that has two parts: squared relative estimation errors of the scale parameters and logarithm of the initial weights. This risk bound is obtained without assuming the existence of variances of the random errors andŝ i,j /s i is only required to be lowerbounded.
2. When ν is assumed to be strictly larger than 2 and both Conditions 1 and 2 ′ are satisfied, Theorem 1 shows that the cumulative forecast errors have the same convergence rate of the cumulative forecast errors of the best candidate forecaster plus a penalty that depends on the initial weights and efficiency of scale parameters estimation. The risk bounds hold even if the the distribution of random errors have tails as heavy as t 3 .
3. If there is no prior information to decide the w j 's in (6), then equal initial weights could be applied. That is, w j = 1/J for all j. In this case, it is easy to see that the number of candidate forecasters plays a role in the penalty. When the candidate pool is large, some preliminary analysis should be done to eliminate the significantly less competitive ones before applying the t-AFTER.

g-AFTER
In section 2, the theoretical risk bounds of the combined forecasts from the t-AFTER are provided when the random errors are known to have Student's t-distributions. However, the error distribution is typically unknown.
In this section, we propose a forecast combination method, g-AFTER, for situations when there is a lack of strong or consistent evidence on the tail behaviors of the forecast errors due to shortage of data and/or evolving data-generating process. A theorem that allows the random errors to be from one of the three popular distribution families (normal, double-exponential, and scaled Student's t) is provided to characterize the performance of the g-AFTER.

The g-AFTER Method
Let the combining weight ofŶ i from the g-AFTER be W where l where l A 2 i−1,j , l A 1 i−1,j and l At i−1,j are from the L 2 -, L 1 -and t-AFTERs, respectively and c 1 and c 2 are non-negative constants that control the relative importances of the L 2 -, L 1 -and t-AFTERs in the g-AFTER. For instance, c 1 and c 2 can be small when one has evidence that suggests the random errors are likely to be normally distributed.

Conditions
Condition 3. Suppose the random errors have zero mean and are from one of the three families (normal, double exponential, and scaled Student's t), and there exists a constant 0 < ξ 2 ≤ 1 such that for any i ≥ i 0 , with probability 1, we have where s i the actual conditional scale parameter at time point i andŝ i refers to any estimate of s i used in the g-AFTER.
This condition requires all the estimates of the scale parameters stay in a reasonable range around the true values. For the j-th candidate forecaster,ŝ i isσ i,j when associated with normal errors, isd i,j when associated with the double exponential, and isŝ i,j,k when associated with the scaled Student's t with degrees of freedom ν k , whereσ i,j ,d i,j ,ŝ i,j,k and ν k are defined in section 2.2 and 2.3.

Condition 4.
When the random errors in the true model follow a scaled Student's tdistribution with degrees of freedom ν, assume there exist positive constants ν, λ andν such that,

Risk Bounds for the g-AFTER
Let w A 2 j and w A 1 j be the initial combination weights of the forecaster j in the L 2 -and L 1 -AFTERs respectively and w At j,k be the initial combination weight of the j-th forecaster under the degrees of freedom ν k in the t-AFTER.
k are the weights of the density estimates under normal, double-exponential and scaled Student's t with degrees of freedom ν k in the g-AFTER procedure at time point i − 1 from the j-th forecast, respectively. Let where c 1 and c 2 are defined in (8). Let q i be the pdf of ǫ i at time point i and its estimator from a g-AFTER procedure be:

Theorem 2. If Conditions 3 and 4 hold, then forŷ
Ag i from a g-AFTER procedure, we have: If Condition 1 also holds, then In the above, C, B 1 , B 2 and B 3 are constants depending on τ , ξ 2 and parameters in Condition 4.

Remarks.
1. Theorem 2 provides a risk bound for more general situations compared to Theorem 1. That is, as long as the the true random errors are from one of the three popular families, similar risk bounds hold.
2. When strong evidence is shown that the errors are highly heavy-tailed, Ω can be very small with only small degrees of freedom and the c 2 w At j,k in G can be relatively large (relative to w A 2 j and c 1 w A 1 j ). The more information on the tails of the error distributions is available, the more efficient the allocation of the initial weights can be.
3. Specially, when the true random errors have tails significantly heavier than normal and double-exponential, they could be assumed to be from a scaled Student's t-distribution with unknown ν and a (general) t-AFTER procedure is more reasonable. In this case, s i,j,k ν k andŵ At i,j,k ≥ 0 for all j and k. Without assuming Condition 1 is satisfied, it follows for any n ≥ 1: where w At j,k is defined the same as that in section 2.3 and If Condition 1 is also satisfied, then it follows: where C, B 1 , B 2 and B 3 are the same as in Theorem 2.

Simulations
We consider two simulation scenarios, with candidate forecasters from linear regression models and autoregressive (AR) models. Results from the linear regression models show improvements of the t-and g-AFTERs over the L 1 -and L 2 -AFTERs when the random errors have heavy tails. In the AR settings, the t-and g-AFTERs are compared to many other popular combination methods in various situations, including cases that the forecast errors are with extremely symmetric/asymmetric heavy tails. We also compared the performances of the tand g-AFTERs to other combination methods on the linear regression models and similar results are found. Only representative results are given here.
In this and the following sections, we have the following settings: • Since it is usually the case that g-AFTER is preferred when the users have no consistent and strong evidences to identify the distribution of the error terms from the three candidate distribution families, we put equal initial weights to the candidate distributions. So are used in the g-AFTER. Note that, for example, if there is clear and consistent evidence that the error distribution is more likely to be from the normal distribution family, then putting relatively large initial weights on the L 2 -AFTER procedure in a g-AFTER can be more appropriate than using equal weights.
• Theŝ i,j,k 's are the sample median of the absolute forecast errors before time point i from the forecaster j divided by the theoretical median of the absolute value of a random variable with distribution t ν k .

Simulation Settings
There are p predictors (X 1 , · · · , X p ) available and the true model uses the first p 0 predictors with coefficients β = (β 1 , · · · , β p 0 ). That is, Y = p 0 i=1 X i β i + ǫ. The p candidate forecasters are generated from the following p models: We take p = 2p 0 − 1 for this scenario. Other settings for p and p 0 were also considered and they gave similar results.
The p predictors are generated from a multivariate normal distribution with zero mean and covariance matrix Σ with sample size n = 125. For the entries in Σ, the diagonal elements are 1 and off-diagonal elements are 0.8. The forecasters are generated after the 90-th observation, and the combination is generated after the 5th forecasts. Various distributions for the random errors (ǫ) are considered. Note that, we also tried other structures of Σ, including the ones with Σ i,j = 0.5 |i−j| and Σ i,j = I(i = j) ∀1 ≤ i, j ≤ p. The results are similar.
For each set of β, we generate 200 sets of (X 1 , · · · , X p , Y ) and on each of the 200 sets,

Results
Three sets of results (p 0 = 3, 5, 10 respectively) are presented in Table 1 in this subsection.
In this table, A2, At and Ag stand for the ratios of the mean ASEEs of the L 2 -, t-and g-

Summary
From Table 1, in the linear regression setting, we see that the overall performances of the tand g-AFTERs are relatively more robust than that of the L 1 -and L 2 -AFTERs. Specifically: 1. When the random errors have heavy tails, the t-and g-AFTERs provide more accurate forecasts than the L 2 -and L 1 -AFTERs consistently.
2. When the tails of the random errors distributions are not or only mildly heavy, say a normal or a scaled Student's t-distribution with a large degrees of freedom, the g-AFTER is better than the t-AFTER in terms of forecast accuracy.
3. The L 1 -AFTER outperforms the L 2 -AFTER when the random errors have heavy tails while L 2 -AFTER is more accurate than the L 1 -AFTER when the random errors are not heavy-tailed.

Simulation Settings
Let the true model be a AR(p 0 ) process with random errors from certain distributions and the candidate forecasters be based on AR(1), AR(2), · · · , AR(p) (1 ≤ p 0 ≤ p), respectively.
For results on asymptotically optimal model selection for AR models, see, e.g., Ing (2007) and Ing et. al (2012). We here compare forecast combination methods.

Other Combination Methods
Some other popular combination methods are included in this part and compared with the newly proposed methods. Simple average combination strategy (SA) uses the average of the candidate forecasts as the combined forecasts. The MD and T M strategies use the median and the trimmed mean (remove the largest and smallest before averaging) of candidate forecasts, respectively. The variance-covariance estimation based combination method (denoted as BG because it was first proposed by Bates & Granger (1969)) we use in this paper is the version in Hansen (2008). Also, a modified BG method with a discount factor 0 < ρ < 1 is considered and the results of multiple ρ's are presented. In the modified BG, the estimate of the (conditional) variance of the forecast errors of a forecaster at any time point is the associated discounted mean squared forecast error with factor ρ. See, e.g, Stock & Watson (2006), for more details. Hereafter, for example, BG 0.9 denotes a BG method with ρ = 0.9. Two linear-regression based combination methods are also considered: one is the combination via ordinary linear regression (LR) and the other one is a constrained linear regression (CLR) combination. The constraints of the CLR are: all coefficients are non-negative and the sum of the coefficients is 1 (without intercept in the regressions).

Results
Tables 2 and 3 provide the summaries of the simulation results. In these two tables, A2, At, Ag, SA, MD, T M, BG, LR and CLR stand for the relative performances of these methods over that of the L 1 -AFTER. The other entries are defined as in Table 1. Table 2 presents the results for the cases that the random errors are not (or only mildly) heavy-tailed, while Table 3 contains the results when the random errors have significant heavy tails.

Summary
In the autoregression scenario, we see that the t-and g-AFTERs consistently outperform all other non-AFTER based combination methods in all the simulated situations (heavy tailed or not) and outperform the L 1 -and L 2 -AFTERs when the random errors are not normal.
Below are some important details: 1. In between the t-and g-AFTER, the latter is more robust since its performances under all scenarios are the best or close to the best. For the t-AFTER, its advantages over the  Tables 2 and 3, the CLR is the most competitive method outside the AFTER family. It is because the constraints in the CLR make its weights relatively more stable and resistant to dramatic changes. The CLR gets more competitive when the random errors have heavier tails.

In both
3. The SA and T M are vulnerable to outliers, which hurts their overall performances. We can see this from both tables. 4. In our settings, similar to many real application situations, since some of the candidate forecasters are highly correlated, using only the conditional variances to assign relative combining weights may not be enough. This explains why the BG and the discounted BG's are not quite competitive as seen in Tables 2 and 3.

Real Data Example
The M3-competition data contain 3003 financial/economical variables in which 1428 (N1402-N2829) have 18 forecasts and the rest have only 6 or 8 forecasts. For each of the 3003 variables, notice that the forecasts are generated all at once (1-, 2-,· · · and up to 6, 8 or 18-step ahead) by each forecaster. There were 24 candidate forecasters for each of the variables. We use the 1428 variables with 18 forecasts to conduct the simulation study because some combination methods need a few forecasts to train the parameters before achieving a reasonable level of reliability.

Data and Settings
Letŷ i ′ be the forecast of y i ′ for n 0 ≤ i ′ ≤ n 1 , then the mean squared forecast error (MSFE) is We use the mean squared forecast errors to measure the prediction performances of the combination methods on each of the 1428 variables. For each variable, the MSFE of each of the other combination methods over the MSFE of the SA is reported.
Specifically, using the same notations as those in section 4.2, the averaged relative performances (MSFE) of the MD, T M, BG, discounted BG's, A2, A1, At and Ag over the SA over the 1428 variables are presented. The main reason that we use the SA as the benchmark on this real data set is that the SA is one of the most popular combination methods with a great reputation in a broad range of applications. Since there are too many candidate forecasters compared to the forecast periods available, the two linear regression related combination methods discussed in section 4.2 are not considered here.
For each of the variables with 18 forecast periods, the combination starts after the 6-th forecasts and the MSFE of the last 9 forecasts of each method is recorded for performance comparisons. For each variable, the MSFE ratio of each method over that of the SA is reported. The summaries, mean (and its standard error), median, minimum, the 1st, 3rd quartiles (denoted as Q 1 and Q 3 , respectively) and maximum, of the 1428 ratios of each method are reported in Table 4.
Also, the comparison on a subset of M3-competition data is provided. On this subset, the variables are considered to have high potentials to be heavy tailed. For each of the 1428 variables with 18 forecast periods, there are some training data (about 70-128 months). We modeled the training data to find the ones with high potential to have heavy tailed errors.
Specifically, let y t be the observed value of a variable at time t and we fit each variable with a model as: y t = β 0 + 1 j=1 1β j I(m t = j) + β 12 y t−1 + · · · + β 16 y t−5 using AIC in backward selection and the ones with kurtosis larger than 3 are considered to have heavy tails. There are 199 out of 1428 variables are selected.
On the heavy tailed subset, we want to focus on the comparison between the g-AFTER and the non-AFTER methods because the comparison inside AFTER family is well addressed in simulation settings. The reason we choose the g-AFTER instead of the t-AFTER for further comparison is because g-AFTER is practically more efficient since it performs well even the signal of heavy tails is not extremely strong. So, on this subset, the benchmark method is the g-AFTER and the results are reported in 5. Table 4, the overall performances of the AFTER based methods are better than the other popular combination methods considered. It also shows that the AFTERs can occasionally be significantly worse than the SA and other methods. Table 4, it is worth noticing that the performances of the AFTERs can be a thousand times better while only about 10 times worse than that of SA. An examination reveals that for certain variables, such as N1837 and N2217, some candidate forecasters are consistently and significantly worse than others. In this situation, since the SA can not remove the extreme 'disturbing' ones before averaging, its performance is extremely poor. However, the AFTERs essentially ignore the 'unreasonable' candidate forecasts so they can be significantly better than the SA. Table 4 suggests that the t-and g-AFTERs have competitive performances in general while being more robust than others since their overall performances are outstanding and are still acceptable for the worst cases. Table 5, the g-AFTER is significantly better than the non-AFTER methods when the random errors are suspected to have heavy tails. So the robustness of g-AFTER is supported by the M3-Competition data.

Conclusions
Forecast combination is an important tool to achieve better forecasting accuracy when multiple candidate forecasters are available. Although many popular forecast combination methods do not necessarily exclude heavy tailed situations, little is found in the literature that examines the performances of forecast combination methods in such situations with theoretical characterizations.
In this paper, we propose combination methods designed for cases when forecast errors exhibit heavy tail behaviors that can be modeled by a scaled Student's t-distribution and for the cases when the heaviness of the forecast errors is not easy to identify. The t-AFTER models the heavy-tailed random errors with scaled Student's t-distributions with unknown (or known) degrees of freedom and scale parameters. A candidate pool of degrees of freedom are proposed to solve the estimation problem and the resulting t-AFTER works well as seen in simulation and real example analysis.
However, in many cases the heaviness of the tails of the random errors is difficult to identify. Therefore, we design a combination process for general use and call it g-AFTER.
For these situations, instead of assuming a certain distribution form for the random errors, a set of possible heaviness of the tails are considered and the combination process automatically decides which ones are more reasonable by giving them high weights. The numerical results suggest the performance of the g-AFTER is more robust than other popular combination methods because of its adaptive capability. The design of the g-AFTER provides a general idea: when there are multiple reasonable candidate distributions for the random errors, combining them in an AFTER scheme like the g-AFTER for forecast combination should work well.

Acknowledgement
This work is partially supported by National Science Foundation grant DMS-1106576.

A.2
Lemma 1 Let h ν (x) be the density function of t ν , ν > 0 and λ > 0 be constants. Then for where C 1 , C 2 and C 3 are constants depending on s 0 , ν,ν and λ.
Proof: After a proper reorganization, we have • Let ν * = min(ν, ν ′ ) and using the Facts 1, 2 and 3, then: • Using Fact 2 in A.1, it follows: • It is easy to show that: where C * 3 is a constant depending on s 0 , ν,ν and λ.
The proof can be completed by combining these steps.
Note that if ν is known, then ν = ν ′ . Then, Lemma 2 Let h(x) be the density function of a double-exponential distribution with µ = 0 and d = 1, then for s 0 > 0 and s ≥ s 0 it follows: where C 4 and C 5 are constants depending only on s 0 .

Lemma 3
Let h(y) be the density function of a standard normal distribution, then for s 0 > 0 and s ≥ s 0 it follows: where C 6 and C 7 are constants depending only on s 0 .

A.3
In this subsection, we prove Theorem 1.
Conditional on the information available until time point i, it is assumed that where s i is the conditional scale parameter at time i. Letŝ i,j be the estimator of s i from the j-th forecaster.
, where h(·) is the density function of t ν and π j is the initial combining weight of the j-th forecaster. So, q n is the estimator of f n .
Then, for any 1 ≤ j ′ ≤ J, Conditional on all the information before time point i, where B 2 is a function of ν and B 3 is deducted the same as B 1 but under Condition 2 ′ instead of Condition 2.

A.4
Essential part of the proof of Theorem 2 is provided in this subsection. We only provide the steps of the proof when the random errors are scaled Student's t-distributed since proof of other situations are similar.