## 1. Introduction

A natural question to ask of a group of forecasters is whether some are better than others, or whether the fact that forecaster A makes a more accurate forecast than forecaster B this period is simply due to chance, so that the situation is as likely as not to be reversed next time. In the context of forecasting inflation and GDP growth one quarter and one year ahead, D’Agostino et al. (2012) find little evidence to suggest that some forecasters are ‘really better’. In our paper, we apply their methodology to the probability assessments of GDP growth and inflation reported to the US Survey of Professional Forecasters (SPF). There have been a number of assessments of the individuals’ probability assessments, but none that directly address the question of interest: are some forecasters’ probability assessments more accurate than those of other forecasters? For example, Clements (2018) compares the individual forecasts to a benchmark, but does not consider whether there are systematic differences between forecasters.

Why does it matter whether some forecasters are better than others? In recent years, there has been much work on expectations formation. If some forecasters were superior to others, this might have a bearing on some of the ingredients of the models of expectations. For example, some of the models currently discussed in the literature replace the assumption of full-information rational expectations—in which all agents know the true structure of the economy and have access to the same information set—with ‘informational rigidities’. Agents are assumed to form their expectations rationally subject to the information constraints they face. For example, under noisy information agents only ever observe noisy signals about economic fundamentals.

The baseline noisy information model assumes the noise variance contaminating agents’ signals is equal across agents. Consequently, agents’ forecasts are all equally good. If this is false, then the homogeneous-signal assumption could be dropped, although Coibion and Gorodnichenko (2012, 2015) argue the macro-level evidence supports the baseline model.

D’Agostino et al. (2012) conclude that there are no real differences between forecasters in terms of the accuracy of their point forecasts. Why then might we expect differences in terms of the accuracy of their probability assessments? Producing histogram forecasts is typically more time-consuming and costly than producing point predictions, for a number of reasons. Point predictions are often reported in the media, and made available in reports to which professional forecasters are likely to have ready access. Moreover, professional forecasters may need to produce point forecasts regularly, and not just at the behest of the SPF. There is less likely to be a prevailing view about the probability assessment that respondents are able to draw on, and respondents may only produce such assessments for their SPF survey returns. The playing field is therefore more uneven, and there is more scope for some forecasters to outperform others.

In addition, there is evidence in the literature that the probability assessments and point predictions do not always tally in terms of what they imply about the most likely outcome, or the expected outcome: see Engelberg et al. (2009) and Clements (2009, 2010, 2014b) for the US SPF. Hence there should be no presumption that the findings for point predictions carry over to the histograms.

Point forecasts are, of course, less valuable than probability density forecasts, because they do not provide information on the range of likely outcomes. This point has been made repeatedly in the literature. For example, in discussing point forecasts, Granger and Newbold (1986, p. 149) state: ‘It would often be very much better if a more sophisticated type of forecast were available, such as a confidence interval with the property that the probability that the true value would fall into this interval takes some specified value.’ See also Chatfield (1993) for a discussion of interval forecasts. Some surveys, such as the US SPF, provide the ‘sophisticated’ forecasts Granger and Newbold wished for.

Clements (2019b) provides a general treatment of survey expectations, and Castle et al. (2019) discuss forecast uncertainty for a non-technical audience. The focus of this paper—whether there are real differences in terms of the accuracy of individuals’ probability assessments—is at least as important as asking this question of point forecasts.

The plan of the remainder of the paper is as follows. Section 2 describes the SPF forecast data. Section 3 describes the bootstrap test of D’Agostino et al. (2012), applied by those authors to the US SPF point predictions, and Section 4 our application of their test to the SPF probability assessments. Section 5 presents the results. Section 6 compares our findings to those from using alternative approaches that have been applied to point predictions. Section 7 offers some concluding remarks.

## 2. Forecast Data: SPF Respondents’ Forecasts

The US Survey of Professional Forecasters (SPF) is a quarterly survey of macroeconomic forecasters of the US economy that began in 1968, and is currently administered by the Philadelphia Fed (see Croushore 1993). The SPF is made freely available by the Philadelphia Fed. It is perhaps the foremost survey of its kind, and a staple for academic research: an academic bibliography listed 101 papers as of February 2018.

We use the 150 surveys from 1981:3 to 2018:4 inclusive, and consider the probability assessments of output growth (real GDP) and GDP deflator inflation, as well as the point predictions for these two variables.

The survey asks for histogram forecasts of the annual rates of growth (of GDP, and the GDP deflator) for the year of the survey relative to the previous year, as well as of the next year, relative to the current year. This results in a sequence of ‘fixed-event’ histogram forecasts. Using annual inflation in 2016 (compared to 2015) as an illustration, the 2016:Q1 survey yields a forecast of just under a year ahead, and the 2016:Q4 survey, a forecast with a horizon of just under a quarter. (We do not use the longer-horizon forecasts of 2016 which are provided in the 2015 surveys). This means we only have one histogram a year if we want a series of fixed-horizon forecasts: we could take the forecasts made in the first quarter, say, to give an annual series of year-ahead forecasts, or in the fourth quarter, to give an annual series of one-quarter-ahead forecasts.

As a result, there are only approximately one quarter as many histogram forecasts (of a given horizon) as there are point predictions used by D’Agostino et al. (2012). Each survey also provides point predictions of quarterly growth rates for the current quarter and the next four quarters, providing a quarterly series of fixed-horizon forecasts—these are the forecasts used by D’Agostino et al. (2012). Although our main interest is in the probability assessments, we also apply the bootstrap test to a matching sequence of annual fixed-event point predictions. The SPF elicits forecasts each quarter of the annual calendar-year rates of growth of GDP and of inflation, matching the histogram forecasts. This serves as a check of whether the results for our annual series of point predictions are similar to those of D’Agostino et al. (2012) for the quarterly series, given that we will (of necessity) be working with an annual series for histograms.

## 3. The Bootstrap Test

The test compares the empirical distribution of forecaster performance to the distribution which would be obtained under the null hypothesis of equal ability, constructed by randomly assigning forecasts to forecasters. The test is detailed in D’Agostino et al. (2012, p. 718), and described below. It accounts for the unbalanced nature of the panel, and the possibility that comparisons across forecasters might be distorted by some forecasters being active at times which are conducive to more accurate forecasting, and others participating during more uncertain times.

The changing composition of the SPF panel is evident from Table 1, which provides an illustrative snapshot of the (standardized) RPS statistics of the ten forecasters who filed the most inflation histogram responses to Q4 surveys over our sample period. (The ranked probability score (RPS)—which is based on the forecaster’s histograms—and the meaning of ‘standardized’ are described subsequently.) The panel changes because of entry and exit, as well as non-participation by members in particular periods (as documented by Engelberg et al. 2011, amongst others). Entry, exit and non-participation are clearly evident in Table 1. For example, the first forecaster joined the survey in 1991:Q4, and then remained an active respondent over the sample period, albeit with occasional periods of non-response. Forecaster 2 joined a year earlier. Forecaster 3 was active from the beginning of the sample, and left in 2009. There are of course more missing values for the less prolific SPF respondents. The unbalanced nature of panels of survey forecasters such as the US SPF is an impediment to inter-forecaster comparisons, but is allowed for in the approach of D’Agostino et al. (2012, p. 718).

By comparing the empirical distribution with the null distribution, we can determine whether the ‘best forecaster’ occupies that position by virtue of being inherently the most accurate, or by chance, and likewise for any percentile of the empirical distribution. For example, is the forecaster at the 25th percentile of the actual distribution inherently more accurate than three-quarters of the forecasters, or might this be due to chance? We use the bootstrap described below to calculate confidence intervals for the percentiles of the actual distribution, and also to calculate the probability of obtaining, by chance, a lower (better) score at a given percentile than that actually found.

The bootstrap test is as follows. Let ${s}_{it}$ denote the ‘score’ for individual $i$ in response to survey $t$, where we abstract from the forecast horizon. The score is either the squared forecast error, when we consider point forecasts, or the QPS or RPS value (as given by (1) and (2), and discussed below), when we consider the histogram forecasts. In either case, the score is non-negative, and zero indicates a perfect forecast. To account for different macroeconomic conditions, the scores are normalized by the average of all respondents’ scores for that period (i.e., the scores for all the forecasts made at survey $t$):

$${S}_{it}=\frac{{s}_{it}}{{N}_{t}^{-1}{\sum}_{j=1}^{{N}_{t}}{s}_{jt}},$$

where ${N}_{t}$ is the number of such forecasts. ${S}_{it}$ measures the performance of forecaster $i$ relative to that of all others forecasting at that time. The overall mean score of forecaster $i$ is then given by ${S}_{i}$:

$${S}_{i}=\frac{1}{{n}_{i}}{\sum}_{t\in {N}_{i}}{S}_{it},$$

where ${N}_{i}$ is the set of surveys to which $i$ filed a response, and ${n}_{i}$ is the number of elements in the set. Note that inter-forecaster comparisons should be legitimate even when forecasters $i$ and $j$ respond to different surveys and to different numbers of surveys, due to the normalization and averaging.
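The normalization and averaging can be sketched as follows, using a small hypothetical panel of scores (the forecaster identities, score values and missingness pattern are illustrative, not SPF data):

```python
import numpy as np

# Illustrative scores s_it: rows = forecasters i, columns = surveys t;
# NaN marks surveys to which a forecaster did not respond.
s = np.array([
    [0.2, 0.5, np.nan, 0.8],
    [0.4, np.nan, 0.3, 0.6],
    [0.6, 0.7, 0.9, np.nan],
])

period_mean = np.nanmean(s, axis=0)   # average score of all respondents at survey t
S = s / period_mean                   # S_it: performance relative to contemporaries
S_i = np.nanmean(S, axis=1)           # overall mean score of forecaster i
```

By construction the normalized scores average to one within each survey, so a forecaster with $S_i$ below one was more accurate than average over the surveys to which they responded.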

For each survey $t$, the ${S}_{it}$ are randomly assigned (with replacement) across a set of ‘imaginary’ forecasters who match the SPF forecasters exactly in terms of participation. For example, if the third SPF forecaster did not participate at time $t$, the third imaginary forecaster’s time $t$ forecast will also be missing. We continue for each $t$, for $t=1,\dots ,T$. As stressed by D’Agostino et al. (2012), forecasters at each $t$ can only be assigned a forecast from another forecaster made at that time. The scores are then calculated for this random reshuffling, by averaging over the non-missing values for each $i$, to give the simulated vector of values ${S}^{1}$, with typical element ${S}_{i}^{1}$, where $i=1,\dots ,{N}_{f}$, with ${N}_{f}$ the number of actual forecasters. We repeat the above another 999 times to obtain 1000 bootstrap distributions, $\left\{{S}^{1},\dots ,{S}^{1000}\right\}$.

Following D’Agostino et al. (2012), we compare selected percentiles of the actual distribution (e.g., the ‘best’, the forecaster occupying the position of the 5th percentile, etc.) to the 5th and 95th percentiles for those ‘positions’ calculated from the bootstrap distribution (generated under the assumption of equal accuracy). For example, to obtain a confidence interval for the best actual score under the null of equal forecast accuracy, we first calculate the best (minimum) score for each vector ${S}^{j}$, $j=1,\dots ,1000$. Denoting these by $\left\{{S}_{min}^{j}\right\}$, $j=1,\dots ,1000$, we then calculate the 50th and 950th largest values. If the best actual score lies within the confidence interval calculated under the null that forecasters are equally accurate, we do not reject the null at the 10% level. In addition, we calculate a ‘p-value’ as the proportion of the $\left\{{S}_{min}^{j}\right\}$ which are less than the actual best (say, ${S}_{min}$). As shown below when we explain the results, this may provide useful additional information.

As another example, for the 5th percentile of the actual distribution, we select the 5th percentile values from each vector ${S}^{j}$, $j=1,\dots ,1000$, and then proceed as for the best, i.e., we calculate the 50th and 950th largest of the 5th-percentile-values, as well as the p-value.
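The resampling scheme can be sketched in a few lines (the function name and toy data are our own, not the authors’ code; this is a minimal illustration of the reshuffling described above):

```python
import numpy as np

def simulate_null(S, n_boot=1000, seed=0):
    """Under the equal-ability null, reassign the observed normalized scores
    at each survey (column) at random, with replacement, across the
    forecasters active at that survey (preserving the participation
    pattern), then average over each 'imaginary' forecaster's surveys."""
    rng = np.random.default_rng(seed)
    n_f, n_t = S.shape
    sims = np.empty((n_boot, n_f))
    for b in range(n_boot):
        shuffled = np.full_like(S, np.nan)
        for t in range(n_t):
            active = ~np.isnan(S[:, t])             # who responded to survey t
            shuffled[active, t] = rng.choice(S[active, t], size=active.sum())
        sims[b] = np.nanmean(shuffled, axis=1)      # simulated S_i vector
    return sims

# Toy normalized scores (NaN = non-participation):
S = np.array([[0.5, 0.9, np.nan, 1.1],
              [1.2, np.nan, 0.8, 0.9],
              [1.3, 1.1, 1.2, np.nan]])
sims = simulate_null(S)
best_null = sims.min(axis=1)              # best score in each simulated panel
ci = np.percentile(best_null, [5, 95])    # 90% interval under the null
p_val = np.mean(best_null < np.nanmean(S, axis=1).min())
```

If the actual best score falls below this interval, the best forecaster’s performance cannot be explained by a chance reassignment of the same forecasts.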

## 4. The Loss Functions

Following D’Agostino et al., we evaluate point predictions using squared-error loss, that is, the squared forecast error (SFE). As argued by Clements and Hendry (1993a, 1993b), comparisons of forecasts on MSFE (mean SFE) suffer from a number of limitations and shortcomings, and these are inherited by RPS and QPS. Namely, MSFEs lack invariance to nonsingular linear transformations of the forecasts when the forecasts are multivariate (acknowledging that a forecaster at time $t$ typically produces forecasts of both inflation and output growth, either point forecasts or histograms) and/or cover multiple horizons. Hendry and Martinez (2017) extend the work of Clements and Hendry (1993a) on multivariate measures; see also Anderson and Vahid (2011) for an application.

It should also be borne in mind that we consider whether one forecaster is better than another, in terms of point forecast performance assessed by SFE, or the quality of their histograms, as assessed by QPS or RPS. Suppose Forecaster A’s MSFE is lower than that of Forecaster B. It could still be the case that the forecasts of B add value to those of A: this happens if A fails to ‘forecast encompass’ B. Chong and Hendry (1986) develop the concept of forecast encompassing, and Ericsson (1992) discusses the relationship between MSFE dominance (our concern in this paper) and forecast encompassing. The notion of forecast encompassing is typically framed in terms of point forecasting and squared-error loss, although Clements and Harvey (2010, 2011) develop tests for probability forecasts, and the concept is applicable to the evaluation of histograms.

For the probability assessments, the loss function we choose is determined in part by the form in which the assessments are reported, namely as histograms. For assessing both point predictions and histograms, the actual values are the advance estimates, so that the rate of GDP growth in 2010, say, is the value available at the end of January 2011, as provided in the Real-time Data Research Center. This seems preferable to using latest-vintage data, and using ‘real-time’ data is common practice in macroeconomic research. We also check whether our results change if we use the second quarterly estimates (in our example, the vintage available in 2011:Q2).

The histogram forecasts are assessed using two scoring rules, the quadratic probability score (QPS: Brier 1950) and the ranked probability score (RPS: Epstein 1969). A scoring rule (or loss function) assigns a numerical score based on the density and the realization of the variable. Although the log score is perhaps the most popular scoring rule for densities (see, e.g., Winkler 1967), QPS and RPS have the advantage that they can be calculated directly from the histograms, that is, without making any additional assumptions.

QPS and RPS are defined by:

$$QPS={\sum}_{k=1}^{K}{\left({p}^{k}-{y}^{k}\right)}^{2}\qquad (1)$$

and:

$$RPS={\sum}_{k=1}^{K}{\left({P}^{k}-{Y}^{k}\right)}^{2}\qquad (2)$$

for a single histogram with $K$ bins (indexed by the superscript $k$), where ${p}^{k}$ is the probability assigned to bin $k$, and ${y}^{k}$ is an indicator variable equal to 1 when the actual value is in bin $k$, and zero otherwise. In the definition of $RPS$, ${P}^{k}$ is the cumulative probability (i.e., ${P}^{k}={\sum}_{s=1}^{k}{p}^{s}$), and similarly ${Y}^{k}$ cumulates ${y}^{s}$. Note that if ${y}^{{s}_{1}}=1$, then ${Y}^{k}=1$ for all $k\ge {s}_{1}$.

We consider both scores. Because it is based on cumulative distributions, RPS penalizes forecasts that place probability close to the bin containing the actual value less severely than QPS does. Under QPS, a given probability outside the bin in which the actual value falls has the same cost regardless of how near or far it is from the outcome bin. For this reason, RPS seems preferable to QPS.

The calculation of QPS and RPS only requires knowledge of the probabilities assigned to each bin, $\left\{{p}^{k}\right\}$, provided explicitly by the survey respondents, and a stance on what constitutes the actual value (and therefore ${y}^{k}$). Unlike fitting parametric distributions, no difficulties arise when probability mass is assigned to only one or two bins.
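Both scores can be computed directly from a reported histogram; a minimal sketch (the function and the example probabilities are our own, for illustration):

```python
import numpy as np

def qps_rps(p, outcome_bin):
    """QPS and RPS for a single histogram forecast.
    p: probabilities over the K bins (summing to one);
    outcome_bin: index of the bin containing the actual value."""
    p = np.asarray(p, dtype=float)
    y = np.zeros_like(p)
    y[outcome_bin] = 1.0                  # indicator y^k
    qps = np.sum((p - y) ** 2)            # equation (1)
    P, Y = np.cumsum(p), np.cumsum(y)     # cumulative P^k and Y^k
    rps = np.sum((P - Y) ** 2)            # equation (2)
    return qps, rps

# Mass adjacent to the outcome bin vs. mass far from it:
near = qps_rps([0.5, 0.5, 0.0], outcome_bin=0)
far = qps_rps([0.5, 0.0, 0.5], outcome_bin=0)
```

In this example both forecasts have a QPS of 0.5, but the RPS is 0.25 for the near miss and 0.5 for the distant miss, illustrating why RPS seems preferable.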

## 6. Related Literature

To the best of our knowledge, there are no studies attempting to determine whether some survey forecasters’ probability assessments are really better than those of others. The results in this paper are the first to bear on this question.

In terms of point predictions, Clements (2019a) considers the set of forecasters who made the most forecasts over a given period. The ranks of the forecasters are calculated over two sub-periods (using normalized forecast errors), and the null of equal accuracy is interpreted to mean a zero correlation between the ranks in the two periods, that is, no persistence in the relative performance of the forecasters over the two sub-periods. The approach is based on the simple idea that if there are real differences between forecasters, then the ranking of forecasters in one period ought to be informative about the ranking in a subsequent period. Clements (2019a) considers different variables (consumers’ expenditure growth, investment growth and real GDP growth) jointly, using either the trace or determinant of the second moment matrix of either one- or four-step-ahead quarterly forecast errors. The Spearman test of no correlation in the ranks is rejected at conventional significance levels. As well as the different set of variables under consideration, the study uses the quarterly series of fixed-horizon forecasts, and so uses a relatively long span of forecast data compared to the annual series we have considered.

There would appear to be advantages and disadvantages to both approaches: the bootstrap test, and the rank correlation test across sub-samples. The bootstrap test considers the forecasts en masse, whereas the requirement that an individual makes a reasonable number of forecasts in each of the two sub-periods leads to a focus on a smaller number of forecasters: Clements (2019a) considers the 50 most prolific respondents. The most prolific might not be typical of the wider population of forecasters, so the null is not the same as in the bootstrap test. The rank correlation test asks whether, within a given sample of forecasters (the most prolific), some appear to perform consistently better than others. The results of the bootstrap test depend on all forecasters (subject to the caveat above about exclusion) and are liable to be unduly influenced by a small number of very poor forecasts.

In terms of implementation, the bootstrap test simply requires choosing the minimum number of forecasts required for an individual to be included. The sub-sample approach requires a decision on where to split the sample into two periods, and on the minimum number of forecasts in each of the two samples required for an individual to be included.

Table 8 illustrates the sub-sample approach, where we split the sample in the middle, into 1981:3 to 1999:4 and 2000:1 to 2018:4, and require that each respondent makes a minimum of 15 forecasts in each sub-period. This gives 17/18 respondents for the tests of the histograms, and 19/20 for the point predictions. We bundle the forecasts made in the four quarters of the year together, that is, the ${S}_{i}$ average over forecasts of horizons of one quarter to one year ahead. The test is based on a comparison of the ranks of ${S}_{i}^{1}$ and ${S}_{i}^{2}$, where the superscript denotes the sub-sample, and $i$ indexes the eligible individuals.
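The rank comparison can be sketched with hypothetical sub-sample mean scores (the numbers below are illustrative, not the Table 8 values); with no ties, Spearman’s rho is simply the Pearson correlation of the ranks:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation (assumes no ties, as in this sketch)."""
    rx = np.argsort(np.argsort(x))   # ranks 0..n-1
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical mean scores S_i^1 and S_i^2 for eight eligible forecasters:
S1 = np.array([0.70, 0.90, 1.00, 1.10, 1.20, 0.80, 1.30, 1.05])
S2 = np.array([0.80, 1.00, 0.95, 1.20, 1.15, 0.85, 1.25, 1.05])
rho = spearman_rho(S1, S2)
```

Under the null of no persistence the ranks in the two periods are uncorrelated; a significantly positive rho indicates persistent differences between forecasters.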

The top panel of Table 8 shows that we reject the null of uncorrelated ranks at conventional significance levels for the histogram forecasts (scored by RPS), and for the inflation point forecasts, but not for the GDP growth point predictions.

Broadly, the two approaches are in tune in suggesting that there is more evidence that there are real differences between individuals in terms of their ability to make accurate probability assessments. However, for the reasons we have explained, it need not be the case that the two tests agree, as the tests emphasize different facets of the inter-forecaster comparisons (e.g., the behaviour of the most prolific forecasters, or of the top forecaster, or the median forecaster, and so on).

## 7. Conclusions

D’Agostino et al. (2012) propose a bootstrap test to ascertain whether some survey respondents’ forecasts really are more accurate than those of others. They suggest that the actual forecasters in the upper half of the empirical distribution occupy those positions by chance. Using annual series of calendar-year growth rates of real GDP and the GDP deflator, we broadly confirm their findings, which were based on quarterly series of quarterly growth rates. We then present novel empirical evidence regarding whether some forecasters are able to make superior probability assessments to others, and find more evidence that this is the case (compared to when they make point predictions).

Both the probability assessments (in the form of histograms) and the point predictions are of annual calendar year growth rates, at horizons of 1 to 4 quarters ahead. Hence the forecast target and horizons match, and so the different conclusions we draw—regarding whether some forecasters really are better than others—are attributable to the differences in the type of forecast, and not simply because of different forecast horizons, etc. Given that providing a histogram forecast is likely to be more costly, in terms of knowledge acquisition, and the processing of information, it is perhaps not surprising that there are real differences between individuals, reflecting different amounts of resources being devoted to the task.

We investigate whether inexperience affects the results. If we exclude histogram forecasts made by respondents when they were newcomers, and apply the bootstrap test to forecasts made when they were more experienced, are the results unchanged? We find that ‘real’ differences in histogram forecaster accuracy are still apparent.

Findings based on comparing the rankings of forecasters across two sub-samples tell a similar story: there are persistent differences between individuals in terms of their ability to make accurate histogram forecasts.