1. Introduction
There is a consensus that any attempt to justify the comparative superiority of forecasts from a given model is incomplete, and indeed inadmissible, if no consideration has been given to the statistical significance of the comparison. Tests for forecast evaluation and comparison have a long and detailed history, which can be found in Chapter 3 of [1]. A few historically popular examples of such statistical tests are discussed in [2,3,4,5]. Of these, the Diebold-Mariano (DM) test [5] is highly cited, and its popularity is evident from statements such as that in ([6], p. 8), according to which, “for comparing forecasts, DM is the only game in town.”
Whilst there is no question regarding the popularity of the DM test, it is pertinent to note that the DM test is by no means a panacea. There exist other, improved procedures for evaluating the statistical significance of differences between forecasts; two sound examples are Hansen’s [7] Superior Predictive Ability (SPA) test and Hansen et al.’s [8] Model Confidence Set (MCS). In addition, there has recently been renewed interest in research on testing the predictive accuracy of forecasts through the work in [9,10,11,12]. Clark and McCracken [10] in particular show that the DM test is inappropriate for use with nested forecasting models.
The aim of this paper is to introduce a complementary statistical test (which differs from the tests noted above) for comparing the predictive accuracy of forecasts whilst overcoming the constraints of the DM test identified below. Interestingly, regardless of the existence of superior tests, the DM test continues to be cited in the forecasting literature, both in isolation and at times alongside the SPA and MCS tests; see, for example, [13,14,15]. In this paper, the DM test is used as the benchmark, with the reasons justified in what follows.
The DM test can be briefly introduced as an asymptotic z-test of the hypothesis that the mean of the loss differential is zero [6]. Whilst it is not the intention of this paper to discredit any proven test currently adopted for comparing the accuracy of forecasts, we believe the need for a complementary statistical test arises owing to the following reasons, which relate to both theoretical and empirical issues with the DM test. Firstly, the original DM test is limited by its finite sample properties [5]. Secondly, as a parametric test, the DM test requires the loss differential to be covariance stationary [6]. The failure to meet this assumption invalidates the results and restricts the applicability of the test. These issues were later addressed in [2], where a solution was achieved via the inclusion of a new assumption whereby all autocovariances of the mean loss differential beyond some lag length are assumed to be zero. However, according to the recent findings in [18,19], it has been proven that the sum of the sample autocorrelation function (ACF) over all lags greater than or equal to 1 is always equal to −1/2. In fact, according to [2] the modified DM statistic is the original DM statistic d̄/√V̂(d̄) multiplied by the correction factor √[(n + 1 − 2h + n⁻¹h(h − 1))/n], where V̂(d̄) = n⁻¹[γ̂₀ + 2∑_{k=1}^{h−1} γ̂_k] and γ̂_k is the kth autocovariance of the loss differential d_t. Then, the recent findings in [18,19] imply that the sum of the sample autocovariances satisfies ∑_{k=1}^{n−1} γ̂_k = −γ̂₀/2, which in turn ensures that V̂(d̄) = 0 when all available lags enter the variance estimate, and therefore the modified DM test statistic tends to infinity. Thus, if two models are used to forecast n data points without repeating or updating the data, then the modified DM test cannot be applied, as the sum of the autocovariances will be zero. Thirdly, the modified DM test statistic with improved small sample properties relies on the Student’s t distribution [2], which cannot be justified unless the forecast errors are independent and normally distributed. In addition, even though [2] asserts that the modified DM test can provide reliable results in small samples, in practice there are instances in which this assertion fails to hold. For example, in some instances where the Ratio of the Root Mean Squared Error (RRMSE) criterion shows that the forecasts from a particular model are, say, 60% more accurate than the forecasts from another model (with a large sample size), the DM test fails to show a statistically significant difference between such forecasts. Moreover, when comparing, for example, a small sample of h-step ahead forecasts, there is a tendency for the modified DM test to always report a significant difference between forecasts even when the RRMSE criterion is at around 99%. Finally, according to the simulation results reported in [2], the modified DM test is not accurately sized for both small and large samples beyond the one-step ahead forecasting horizon.
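To make this point concrete, the following minimal Python sketch (our own illustration, not part of the original derivation) verifies numerically that the sample ACF computed over all available lags always sums to −1/2, so that γ̂₀ + 2∑_{k=1}^{n−1} γ̂_k = 0 for any series:

```python
import numpy as np

# Any series will do: the result is an algebraic identity, not a property of the data.
rng = np.random.default_rng(42)
x = rng.normal(size=200)

n = len(x)
xc = x - x.mean()

# Biased sample autocovariances: gamma_k = (1/n) * sum_t (x_t - xbar)(x_{t+k} - xbar)
gamma = np.array([np.dot(xc[: n - k], xc[k:]) / n for k in range(n)])
rho = gamma / gamma[0]                    # sample ACF

print(rho[1:].sum())                      # -0.5 up to floating-point error
print(gamma[0] + 2 * gamma[1:].sum())     # 0, i.e., the variance estimate collapses
```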
The proposed test is founded upon the principles of the Kolmogorov-Smirnov (KS) test [20] and is non-parametric in nature. The choice of a non-parametric test is important, as in the real world we are mostly faced with data which fail to meet the assumptions of normality and stationarity underlying parametric tests. The proposed test (referred to as the Kolmogorov-Smirnov Predictive Accuracy or KSPA test) was motivated by the work in [21,22], where cumulative distribution functions (c.d.f.s) of the absolute value of forecast errors are exploited to determine whether one forecasting technique provides superior forecasts in comparison to another. The approach presented in those papers is in fact based on the concept of stochastic dominance. However, the evidence presented relies purely on graphical representations and lacks a formal statistical test of significance, which in turn leaves the final result open to debate.
The beauty of the proposed KSPA test is that it not only enables one to distinguish between the distributions of the forecast errors from two models, but also to determine whether the model with the lowest error also reports the lowest stochastic error in comparison to the alternative model. Moreover, the test is not affected by any autocorrelation that may be present in the forecast errors, which is yet another advantage. The ability to exploit the KSPA test for determining the model with the lowest stochastic error stems from the literature on stochastic dominance and as such deserves to be noted. Whilst the consideration of stochastic dominance in the forecasting literature is novel, as noted in [23] stochastic dominance is widely used in the econometric and actuarial literature and is therefore a well established and recognized concept. The use of KS tests for first and second order stochastic dominance dates back to the work in [24], where the author considered KS tests with independent samples of equal numbers of observations. Moreover, as the KS test compares each point of the c.d.f. [24,25], it has the potential of being a consistent test which considers all of the restrictions imposed by stochastic dominance [25].
The nature of the proposed KSPA test is such that it evaluates differences in the distribution of forecast errors, as opposed to relying on the mean difference in errors as is done in the DM approach. This in itself gives the KSPA test several advantages. Firstly, relying on the distribution of errors enables the KSPA test to have more power than the DM test, because the KSPA test essentially considers an infinite number of moments whilst the DM test considers only the first moment, i.e., the mean. Secondly, the presence of outliers can severely affect the DM test, as the mean is highly sensitive to outliers in the data, whereas the cumulative distribution function of the errors is far less affected. Thirdly, a test statistic which is concentrated around a mean fails to account for the variation in the data. For example, it is possible to have two populations with identical means, and yet these two populations are not really identical if the variation around the mean differs. By considering the distribution of the data, as is done via the proposed KSPA test, we are able to obtain a richer understanding of the underlying characteristics, which in turn enables a more efficient and accurate decision.
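To illustrate the third point, consider two error populations with the same mean but different spread: a purely location-based comparison sees no difference, whereas a comparison of the full distributions does. The following minimal Python sketch is our own illustration and uses a generic mean-based t-test in place of the DM test:

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

rng = np.random.default_rng(3)
a = rng.normal(loc=1.0, scale=0.2, size=300)   # errors with mean 1, small spread
b = rng.normal(loc=1.0, scale=1.0, size=300)   # errors with mean 1, large spread

print(a.mean(), b.mean())                      # practically identical means
print(ttest_ind(a, b, equal_var=False))        # mean-based comparison: typically not significant
print(ks_2samp(a, b))                          # distribution-based comparison: detects the difference in spread
```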
The remainder of this paper is organized as follows. Section 2 presents the theoretical foundation underlying the proposed statistical test for comparing forecasting models. Section 3 is dedicated to the results of the simulation study, which compares the size and power properties of the KSPA and modified DM tests for different sample sizes and forecasting horizons. Section 4 presents empirical evidence from applications to real data, where the performance of the KSPA test is compared with that of the modified DM test, and the paper concludes in Section 5.
2. Theoretical Foundation
In this section we begin by briefly introducing the theory underlying the Kolmogorov-Smirnov test, followed by the hypotheses for the two-sided and one-sided KS tests which are of interest in this study. Thereafter, the KSPA test is presented for distinguishing between the distributions of forecast errors and identifying the model with the lower stochastic error. The first part of the KSPA test, the two-sample two-sided KSPA test, aims at identifying a statistically significant difference between the distributions of two sets of forecast errors (and thereby compares the predictive accuracy of forecasts). The second part, the two-sample one-sided KSPA test, aims at ascertaining whether the forecast with the lowest error according to some loss function also has a stochastically smaller error in comparison to the competing forecast (and thereby also enables a comparison of the predictive accuracy of forecasts).
2.1. The Kolmogorov-Smirnov (KS) Test
The cumulative distribution function (c.d.f.) is an integral component of the KS test. As such, let us begin by defining the c.d.f., F(x), for a random variable X. The c.d.f. of X is denoted as:

F(x) = P(X ≤ x),   (1)

where x ranges over the set of possible values of the random variable X. In brief, the c.d.f. gives the probability of X taking on a value less than or equal to x. The next step is to obtain the empirical c.d.f. This is because the one sample KS test (which is introduced below) compares the theoretical c.d.f. with an empirical c.d.f., whereby the latter is an approximation of the former. The empirical c.d.f. can be defined as:

F_n(x) = (1/n) ∑_{i=1}^{n} I(X_i ≤ x),   (2)

where n is the number of observations, and I is an indicator function such that I(X_i ≤ x) equals 1 if X_i ≤ x and 0 otherwise. According to [26], as implied by the law of large numbers, for any fixed point x ∈ ℝ the proportion of the sample contained in the set (−∞, x] approximates the probability of this set as:

F_n(x) = (1/n) ∑_{i=1}^{n} I(X_i ≤ x) → 𝔼[I(X ≤ x)] = P(X ≤ x),  as n → ∞,   (3)
where 𝔼 represents the expectation.
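As a simple illustration of Equations (2) and (3), the following Python sketch (our own illustration, not part of the formal development) computes the empirical c.d.f. of a simulated sample at a fixed point and shows that it approaches the theoretical c.d.f. as the sample size grows:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x0 = 0.5                                   # fixed evaluation point x

for n in (20, 200, 2000, 20000):
    sample = rng.normal(size=n)            # X_1, ..., X_n from a standard normal
    ecdf_at_x0 = np.mean(sample <= x0)     # F_n(x0) = (1/n) * sum I(X_i <= x0)
    print(n, ecdf_at_x0, norm.cdf(x0))     # F_n(x0) approaches F(x0) = P(X <= x0)
```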
Then, the one sample Kolmogorov–Smirnov statistic for any given theoretical c.d.f. F(x) can be calculated as

D_n = sup_x |F_n(x) − F(x)|,   (4)

where sup_x denotes the supremum of the set of distances between the two c.d.f.’s. Note that the one sample KS test in Equation (4) compares the empirical c.d.f. with a theoretical c.d.f. However, presented next is the two sample KS test statistic, which is of direct relevance to the proposed KSPA test. In contrast to the one sample KS test, the two sample KS test compares the empirical c.d.f.’s of two random variables in order to find out whether both random variables share an identical distribution, or whether they come from different distributions. Assuming two random variables X and Y with empirical c.d.f.’s F_n(x) and G_m(x) based on n and m observations, respectively, the two sample KS test statistic is

D_{n,m} = sup_x |F_n(x) − G_m(x)|.   (5)
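The two sample statistic in Equation (5) can be computed directly, or via scipy.stats.ks_2samp, as in the following minimal Python sketch (our own illustration; the variable names are arbitrary):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=100)   # sample from X
y = rng.normal(loc=0.3, scale=1.0, size=120)   # sample from Y

# Manual computation of D_{n,m} = sup_x |F_n(x) - G_m(x)| over the pooled sample points
grid = np.sort(np.concatenate([x, y]))
F_n = np.searchsorted(np.sort(x), grid, side="right") / len(x)
G_m = np.searchsorted(np.sort(y), grid, side="right") / len(y)
d_manual = np.max(np.abs(F_n - G_m))

stat, p_value = ks_2samp(x, y)                 # library implementation of the same statistic
print(d_manual, stat, p_value)                 # d_manual and stat agree
```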
Next, we introduce the hypotheses which are relevant for the proposed KSPA test. Let us begin by presenting the hypotheses for the two-sided KS test. Let X and Y be two random variables with c.d.f.’s F(x) and G(x), respectively. Then, a two sample, two-sided KS test tests the hypothesis that both c.d.f.’s represent an identical distribution, and the resulting null and alternate hypotheses can be expressed as:

H₀: F(x) = G(x) for all x ∈ ℝ,   (6)
H₁: F(x) ≠ G(x) for at least one x ∈ ℝ.   (7)

In simple terms, the null hypothesis in Equation (6) states that both X and Y share an identical distribution whilst the alternate hypothesis states that X and Y do not share the same distribution.
Finally, the hypotheses for the two sample one-sided KS test, which is also known as the one-sided test of stochastic dominance, are presented as in [24]:

H₀: F(x) ≤ G(x) for all x ∈ ℝ,   (8)
H₁: F(x) > G(x) for at least one x ∈ ℝ.   (9)

The important point to note here is that the alternate hypothesis in Equation (9) states that the c.d.f. of X lies above and hence to the left of the c.d.f. of Y, which in turn means that X has a lower stochastic error than Y. Note that in our case X and Y are considered in absolute or squared terms, for example.
As with all tests, the decision making process requires the calculation of a probability value. For the KS test, there are various formulas for calculating the p-value, each with its own advantages and limitations; see, for example, [27,28,29]. Here we rely on the formulae used in [29] to calculate the p-values for both the two-sided and one-sided KS tests. In what follows, we introduce the two-sided and one-sided KSPA tests, which are built on the foundations of the KS test concisely explained above.
2.2. Testing for Statistically Significant Differences between the Distribution of Two Sets of Forecast Errors
The aim here is to exploit the two sample two-sided KS test (referred to as the two-sided KSPA test hereafter) to ascertain the existence of a statistically significant difference between the distributions of two sets of forecast errors. Let us begin by defining the forecast errors. Suppose we have a real valued, non-zero time series Y_N = (y_1, ..., y_N) of sufficient length N. Y_N is divided into two parts, i.e., a training set and a test set, such that (y_1, ..., y_T) represents the training set and (y_{T+1}, ..., y_N) represents the test set. The observations in the training set are used to model the data whilst the observations in the test set are set aside for evaluating the forecasting accuracy of each model. Assume we use two forecasting techniques known as M_1 and M_2. A loss function L can be used to assess and compare the out-of-sample forecast errors. Whilst there are varied options for L, here we define L as:

L = √( (1/h) ∑_{i=1}^{h} (y_{T+i} − ŷ_{T+i})² ),   (10)

where h = N − T denotes the forecasting horizon, and ŷ_{T+i} denotes the h-step ahead forecast of y_{T+i}. If the forecast error is denoted by ε, then we have the expression

ε_{T+i} = y_{T+i} − ŷ_{T+i},  i = 1, ..., h.   (11)
In this case the forecast errors for Y_N, obtained using models M_1 and M_2, can be denoted by

ε^{M_1} = (ε^{M_1}_{T+1}, ..., ε^{M_1}_{T+h}) and ε^{M_2} = (ε^{M_2}_{T+1}, ..., ε^{M_2}_{T+h}),   (12)

where ε^{M_1} contains the h-step ahead forecast errors generated from model M_1 and ε^{M_2} contains the h-step ahead forecast errors generated from model M_2. The most common loss functions consider errors in the form of absolute values or squared values (see, for example, the Mean Absolute Percentage Error and the Root Mean Squared Error). As such, we can use either the absolute values of the errors or the squared errors when calculating the KSPA test, depending on the loss function in use. Then, the absolute values and squared values of the forecast errors can be calculated as

|ε^{M_j}_{T+i}|,  i = 1, ..., h,  j = 1, 2,   (13)

(ε^{M_j}_{T+i})²,  i = 1, ..., h,  j = 1, 2.   (14)
The forecast errors in (13) or (14) are the inputs into the KSPA test for determining the existence of a statistically significant difference in the distribution of the forecasts from models M_1 and M_2. As the requirement is to test the distributions of two samples of forecast errors, the two sample two-sided KSPA test statistic can be calculated as:

D_h = sup_x |F̂_{M_1}(x) − F̂_{M_2}(x)|,   (15)

where F̂_{M_1}(x) and F̂_{M_2}(x) denote the empirical c.d.f.’s of the forecast errors from the two different models.
Accordingly, in terms of forecast errors, the two-sided KSPA test hypotheses can be represented as follows: where ε^{M_1} and ε^{M_2} are the absolute or squared forecast errors from two forecasting models M_1 and M_2 with unknown continuous c.d.f.’s F_{M_1}(x) and F_{M_2}(x), the two-sided KSPA test tests the hypotheses:

H₀: F_{M_1}(x) = F_{M_2}(x) for all x,   (16)
H₁: F_{M_1}(x) ≠ F_{M_2}(x) for at least one x.   (17)
Then, if the observed significance value of the two-sample two-sided KSPA test statistic is less than α (which is usually considered at the 1%, 5% or 10% level), we reject the null hypothesis and accept the alternate hypothesis, namely that the forecast errors ε^{M_1} and ε^{M_2} do not share the same distribution. In such circumstances we are able to conclude with (1 − α) confidence that there exists a statistically significant difference between the distributions of the forecasts provided by models M_1 and M_2, and thereby conclude the existence of a statistically significant difference between the two forecasts based on the two-sided KSPA test.
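In practice, the two-sided KSPA test can be carried out by applying a two-sample, two-sided KS test to the absolute (or squared) out-of-sample errors of the two models. The following Python sketch is a minimal illustration of ours; the arrays y_test, fc_m1 and fc_m2 are hypothetical placeholders for the test set and the two sets of forecasts:

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical test-set observations and h-step ahead forecasts from models M_1 and M_2
y_test = np.array([102.0, 105.0, 101.0, 98.0, 104.0, 107.0, 103.0, 100.0])
fc_m1  = np.array([101.0, 104.5, 102.0, 97.0, 103.5, 106.0, 103.5, 99.0])
fc_m2  = np.array([ 99.0, 108.0,  97.5, 94.0, 108.0, 103.0, 107.0, 104.5])

e1 = np.abs(y_test - fc_m1)       # absolute out-of-sample errors, model M_1 (cf. Equation (13))
e2 = np.abs(y_test - fc_m2)       # absolute out-of-sample errors, model M_2

# Two-sided KSPA test: do the two error distributions differ?
stat, p_value = ks_2samp(e1, e2, alternative="two-sided")
print(stat, p_value)              # reject H0 at level alpha if p_value < alpha
```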
2.3. Testing for the Lower Stochastic Error
The aim of the two sample one-sided KS test (referred to as the one-sided KSPA test hereafter) is to identify whether the model which reports the lowest error based on some loss function also reports a stochastically smaller error in comparison to the alternative model. The usefulness of the one-sided KSPA test in distinguishing between the predictive accuracy of forecasts is most apparent in circumstances where the forecast errors from two models may share, to some degree, an identical distribution (as otherwise this would mean the two forecasts are exactly the same), such that one model clearly reports a comparatively lower forecast error based on some loss function. In such instances, the two-sided KSPA test would fail to identify a statistically significant difference between the two forecasts, but the one-sided KSPA test is able to probe the out-of-sample forecasts further in order to identify whether the model with the lower error also reports a stochastically smaller error, and thereby test for the existence of a statistically significant difference between the two forecasts.
In terms of forecast errors, the two-sample, one-sided KSPA test hypotheses can be represented as follows. Once again, where ε^{M_1} and ε^{M_2} are the absolute or squared forecast errors from two forecasting models M_1 and M_2 with unknown continuous c.d.f.’s F_{M_1}(x) and F_{M_2}(x), and M_1 is the model reporting the lower error based on the chosen loss function, the two sample one-sided KSPA test tests the hypotheses:

H₀: F_{M_1}(x) ≤ F_{M_2}(x) for all x,   (18)
H₁: F_{M_1}(x) > F_{M_2}(x) for at least one x.   (19)

The acceptance of the alternate hypothesis in this case translates to the c.d.f. of the forecast errors from model M_1 lying above, and hence to the left of, the c.d.f. of the forecast errors from model M_2. More specifically, the acceptance of the alternate hypothesis confirms that model M_1 reports a lower stochastic error than model M_2. Recall the relationship identified in [21]: if the c.d.f. of the absolute forecast errors from one model lies above, and hence to the left of, that of the other model, then the model lying above has a lower stochastic error than the other model. The one-sided KSPA test evaluates this notion and provides the statistically valid foundation which was previously lacking.
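A minimal Python sketch of the one-sided KSPA test follows (our own illustration, using hypothetical error arrays e1 and e2, with e1 the errors of the model reporting the lower loss). The mapping of the hypotheses above onto scipy's alternative argument reflects our reading of that library's convention, under which alternative="greater" corresponds to the alternative that the c.d.f. of the first sample lies above that of the second:

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical absolute out-of-sample errors; model M_1 (errors e1) reports the lower loss.
e1 = np.array([1.0, 0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0])
e2 = np.array([3.0, 3.0, 3.5, 4.0, 4.0, 4.0, 4.0, 4.5])

# One-sided KSPA test: under alternative="greater", the alternative hypothesis is that the
# empirical c.d.f. of e1 lies above that of e2 for some x, i.e., that the errors of M_1 are
# stochastically smaller than those of M_2 (our reading of scipy's convention).
stat, p_value = ks_2samp(e1, e2, alternative="greater")
print(stat, p_value)   # a small p-value supports a lower stochastic error for M_1
```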
5. Conclusions
Building on the ideas presented in [21,22] with respect to using an empirical c.d.f. for determining whether the forecast errors from one model are stochastically smaller than those obtained from a competing model, we introduce a complementary statistical test for distinguishing between the predictive accuracy of forecasts. The proposed non-parametric Kolmogorov-Smirnov Predictive Accuracy (KSPA) test serves two purposes via the two-sided KSPA test and the one-sided KSPA test. A simulation study is called upon to evaluate the efficiency and robustness of the KSPA test, and this is followed by an application to real data. The need for the KSPA test is further evidenced by the limitations of the DM test in relation to sample size and to inherent assumptions which have been invalidated by recent findings.
Through the simulation study, the KSPA test is directly compared with the widely accepted modified DM test. In order to enable a meaningful comparison, we consider the same distributions as used in [2] for their simulation study. The simulation results provide a clear indication that the proposed KSPA test is more robust than the DM test, especially when the number of out-of-sample forecast errors available for comparison is considerably small.
We also consider applications to real data, which capture forecasts from different real world cases, in order to validate the proposed KSPA test, and we compare the results against those obtained via the modified DM test. As expected, we observe that when the number of observations is small the KSPA test is able to accurately identify a statistically significant difference between forecasts whilst the modified DM test fails to do so. Furthermore, through another real world scenario we show that the KSPA test can be applied in forecasting exercises where the modified DM test is not applicable. In addition, a further scenario is used to show that the two variations of the KSPA test can be extremely useful in practice.
Another advantage of the proposed KSPA test is that, given its nature, which is to compare the empirical c.d.f.’s of the errors from two forecasting models, we are able to compare both parametrically estimated model-based forecasts and survey-based forecasts, with no restriction on whether the models are nested or non-nested. This is because, regardless of the model used, a forecast error is always calculated as the actual value minus the predicted value, and the proposed KSPA test compares the distributions of these errors to differentiate between them. In addition, as the KSPA test is non-parametric, it does not depend on any assumptions about the properties of the underlying errors, which is also advantageous in practice.
In conclusion, the KSPA test has shown promising results in comparison to the modified DM test and is presented as a viable alternative for comparing the predictive accuracy of forecasts. The non-parametric nature of the test enables one to overcome issues with the assumptions underlying the DM test which have recently been shown to be invalid (see, for example, [18,19]). Additionally, in the process we have provided statistical validity to the ideas presented in [21,22], whilst showing the relevance and applicability of the KSPA test via simulations and applications to real data. Our research now continues to ascertain whether the use of the KSPA test can be extended to enable comparisons between more than two forecasts, as this would add further value to its practical use.