New York FED Staff Nowcasts and Reality: What Can We Learn about the Future, the Present, and the Past?

The paper was presented at the 2nd Vienna Workshop on Forecasting and at the 21st IWH-CIREQ-GW Macroeconometric Workshop. The author is grateful to two anonymous reviewers as well as workshop participants for their comments. The views are solely those of the author and under no circumstances represent those of Latvijas Banka.

Abstract: We assess the forecasting performance of the nowcasting model developed at the New York FED. We show that the observation made earlier in the literature regarding a striking difference in a model's predictive ability across business cycle phases also applies here. During expansions, the nowcasting model's forecasts are at best as good as those of the historical mean model, whereas during recessionary periods there are very substantial gains, corresponding to a reduction in MSFE of about 90% relative to the benchmark model. We show how the asymmetry in relative forecasting performance can be verified by the use of such recursive measures of relative forecast accuracy as the Cumulated Sum of Squared Forecast Error Difference (CSSFED) and the Recursive Relative Mean Squared Forecast Error based on Rearranged observations (R²MSFE(+R)). Ignoring these asymmetries results in a biased judgement of the relative forecasting performance of the competing models over the sample as a whole, as well as during economic expansions, when the forecasting accuracy of a more sophisticated model relative to naive benchmark models tends to be overstated. Hence, care needs to be exercised when ranking models by their forecasting performance without taking into consideration the various states of the economy.


Introduction
The outbreak of the Great Financial Crisis about a decade ago significantly spurred the quest for reliable forecasting of economic conditions not only in the distant future, but also for reliable assessment of the current health of the economy. Forecasting academics and practitioners responded by developing econometric models that specifically aim at forecasting GDP growth in the current or, at most, the next quarter. This process of forecasting the present, the recent past, or the near future was naturally labelled "nowcasting" (Banbura et al. 2011).
Among the recent contributions to the nowcasting literature, the project initiated and maintained at the Federal Reserve Bank of New York (FRBNY) is worth mentioning. The model described in the academic contribution of Bok et al. (2018) is promoted to the general public in a series of online blogs. The initial online announcement (Aarons et al. 2016) of the model and its regular, publicly available nowcast releases was made in 2016. A subsequent blog entry (Giannone et al. 2017) describes, in a Q&A format accessible to the general public, what nowcasting is and how the assessment of current economic conditions is carried out by means of a purely data-driven approach. The model developers, striving for model transparency, even went as far as putting the underlying code in an online code repository (Adams et al. 2018), such that anyone interested in nowcasting can go through the code and, if needed, adapt it for nowcasting economic conditions in a country or region of their choice. More importantly, as initially announced, the nowcasts have been regularly made public on the dedicated website in a forecast-as-you-go fashion since the inception of this project in 2016 (https://www.newyorkfed.org/research/policy/nowcast, accessed on 21 February 2021). In the most recent addition to the project, Adams et al. (2019) made the archive of nowcasts simulated backward for the period from 2002 to 2015 publicly available. As a result of this contribution, the sequence of nowcasts for every quarter from 2002 until the most recent one is available for analysis.
For our purposes, this most recent blog entry is important, as the sample for which nowcasts are available extends far enough into the past to include the Great Financial Crisis, and it covers one complete business cycle including both expansion and recession phases. In this study, we intend to verify the conclusions of Chauvet and Potter (2013) on the asymmetric forecasting performance of state-of-the-art macroeconometric models during expansionary and recessionary phases of the business cycle. This asymmetry displays itself in the fact that absolute forecast errors during recessions tend to be larger than during expansions, i.e., forecast accuracy tends to decrease during economic downturns compared to upturns. Moreover, they find that during expansions, a simple univariate benchmark model that utilizes only its own past delivers forecast accuracy comparable with that produced by very sophisticated models that draw information from various economic and financial indicators, which are often available much earlier than official GDP releases. Since this aspect of the nowcasting performance of the model developed at the Federal Reserve Bank of New York is addressed neither in the online blog entries nor in the published paper, it shapes the contribution of our study to the nowcasting literature: namely, we ask whether the conclusions of Chauvet and Potter (2013) reached for other types of models can be generalized to the model in question.
Compared to the models utilized in Chauvet and Potter (2013), where for each quarter the accuracy of single one- and two-step ahead forecasts was evaluated, the output of the NY FRB Nowcasting model for each targeted quarter comprises a sequence of about 20 weekly nowcasts available for analysis. These weekly sequences of nowcasts provide us with additional information, helping us address the question of how far ahead in the future one can forecast on a weekly rather than a quarterly time scale. Willingly or not, by making the historical record of model nowcasts publicly available, Adams et al. (2019) provide a benchmark that other forecasters may be tempted to use in order to compare the nowcasting performance of their models; see, e.g., Babii et al. (2019, p. 21) and Cimadomo et al. (2020). Interestingly enough, Bok et al. (2018) do not provide a formal comparison of the forecasting performance of their model with commonly used univariate benchmark models. This constitutes an additional motivation for our study, where we specifically evaluate the forecasting performance of the NY FED Nowcasting model relative to univariate benchmark models for the full period and across business cycle phases. Our results can be informative for those studies that use NY FED Nowcasting model predictions of US GDP growth as the benchmark. More generally, our research is related to studies such as Cai et al. (2019) and Alessi et al. (2014), where the actual forecasting experience at policy-making institutions such as the ECB and FRBNY is scrutinized.
The rest of the paper is organized as follows. In Section 2, a review of the relevant literature is provided. The NY FED Nowcasting model and its output is detailed in Section 3. A description of benchmark models and forecast evaluation metrics used for model comparison is provided in Sections 4 and 5, respectively. The accuracy of the nowcasting performance of the model against different releases of GDP data (advance, second, final, and latest) for the full sample as well as separately for the periods of the economic downturn (the Great Recession) and upturns is reported in Section 6. The final section concludes.

Literature Review
Instabilities in the forecasting performance of macroeconometric models have long been acknowledged in the literature. For example, Rossi (2013), in a comprehensive review of the relevant literature, points out several stylized facts. First, the predictive strength of variables varies substantially over time, such that excellent predictive performance in the past does not warrant similar forecasting excellence in the near future, let alone the distant one. Second, empirical validation of models based on their in-sample performance often serves as a poor approximation of their out-of-sample forecasting ability.
While there are many potential reasons for the unstable predictive ability of macroeconometric models, one explanation that seems obvious is the presence of business cycles that at more or less regular intervals shake up individual countries, whole regions, or even spread all over the globe. Naturally, economic dynamics are completely different during recessions than during expansions; hence, it is natural to expect that forecasting performance varies with the state of the business cycle. Rossi (2013) only briefly mentions business cycles as a possible explanation of forecasting instability, but a more thorough investigation of this topic is provided in Chauvet and Potter (2013), who specifically evaluated the predictive ability of several of the most widely used macroeconometric models using US GDP growth as an example. The list of these models includes a structural DSGE model, reduced-form VARs estimated using either Bayesian or frequentist approaches, a dynamic factor model with a Markov-switching mechanism, and the cumulative depth of recession model. The conclusions reached in Chauvet and Potter (2013) are surprisingly uniform across these very diverse models. First, the forecasting accuracy of the models worsened in recessions, i.e., on average, forecast errors tended to be larger in economic downturns than during upturns. Second, during expansions, the forecasting performance of highly sophisticated models was matched by that of a simple univariate autoregressive benchmark model. Siliverstovs (2020a) extends the analysis of Chauvet and Potter (2013) to a different class of models, namely, models that combine data observed at heterogeneous frequencies: quarterly GDP growth and several monthly economic and financial indicators, such as industrial production, sentiment indices, labor-market and housing statistics, a stock market index, and interest rates that are commonly used for assessing current economic conditions in the US.
In particular, Siliverstovs (2020a) re-examines the forecasting performance of a multiple-indicator U-MIDAS-type model suggested in Carriero et al. (2015). The model generalises the Unrestricted MIDAS model suggested in Foroni et al. (2015) in several directions by allowing more than one skip-sampled explanatory variable, optional inclusion of stochastic volatility, and Bayesian estimation of model parameters. The adopted mixed-frequency setup allows one to monitor changes in forecast accuracy as more information is incorporated into the forecasting model from one month to the next, in contrast to Chauvet and Potter (2013), where forecasts were made once per quarter. Siliverstovs (2020a) shows that the at first glance impressive reduction in the RMSFE over the benchmark AR(2) model of up to 22% reported in Carriero et al. (2015), when evaluated over the whole forecast sample from 1985Q1 until 2011Q3, is mainly driven by a few observations during recessions, with the most prominent contribution traced to observations during the Great Recession. Evaluation of the model's forecasting performance during NBER recessions and expansions indicates that, during expansions, the performance of this model is closely matched by that of the benchmark model, conforming with the conclusion of Chauvet and Potter (2013). At the same time, it is worth pointing out that during recessions the improvement over the benchmark model is very dramatic: almost up to 60% in terms of the RMSFE. All in all, it seems that ignoring the asymmetry in the forecasting performance of a more sophisticated model over a benchmark model results in a biased assessment of the model's forecasting performance. The predictive ability of the former model tends to be overstated during expansions, which last longer than recessions, but at the same time it is severely understated during the rather rare recessions, when prevailing economic distress makes the demand for accurate forecasts more acute.
The findings of Chauvet and Potter (2013) and Siliverstovs (2020a), reported for a single time series (US GDP growth), were extended in Siliverstovs and Wochner (2021) to each time series in the Stock-Watson dataset comprising more than 200 US macroeconomic variables. The aim of this exercise was to replicate the study of Stock and Watson (2002) on a more recent data vintage, but to evaluate the forecasting performance of the diffusion-index model separately for the NBER expansions and recessions, in a similar way as done in Chauvet and Potter (2013). Siliverstovs and Wochner (2021) confirm that there are systematic differences in forecasting accuracy across the business cycle phases, both in absolute terms and relative to the benchmark models. During expansions, the diffusion-index models and benchmark models generally display similar forecasting performance. However, the more sophisticated model tends to yield substantial forecasting gains around turning points relative to the benchmark models. Quite often, such forecasting gains of the complicated model outweigh its relative losses during economic upturns, such that when models are judged on their average forecasting performance over the whole forecast evaluation period, the performance of the more sophisticated model in normal times appears better than it really is. The opposite side of the coin is that its performance during recessions tends to be understated.

Model
The NY FED Nowcasting model is similar to the one introduced in Giannone et al. (2008). This dynamic factor model conveniently accommodates features of the data that a forecaster faces when making forecasts in real time. These features include mixed-frequency data, i.e., GDP data available at the quarterly frequency and auxiliary economic and financial data that are often released at the monthly or even higher frequency. A data set of the auxiliary indicators can be unbalanced both at the beginning and at the end of the sample. Missing data at the beginning of the sample most often arise because some time series start later than others. Missing data at the current edge are due to differences in the release timing of different indicators during a month and to the different publication lags of these indicators. For example, indicators released at the end of the current month can have their latest available observation for the current month, the previous month, or even a month further back in the past. In fact, the nowcasting framework developed in Giannone et al. (2008) proved very robust to the challenges posed by the above characteristics of economic data. In the case of nowcasting GDP growth in Switzerland, the model, once coded at the end of 2009, ran reliably without a single breakdown during weekly nowcasting exercises at the KOF Swiss Economic Institute (ETH Zurich). The model is described in Siliverstovs and Kholodilin (2012), and the track record of its nowcasting performance in "real time squared" is documented in Siliverstovs (2012) and Siliverstovs (2017).

Timing
Since the nowcasting project of the NY FED is ongoing, we have to truncate the information flow to reflect data availability at the time of writing. More specifically, we evaluate the accuracy of nowcasts for the period from 2002Q1 until 2020Q2. For every quarter in this sample, we collected a sequence of 21 nowcasts released at a weekly frequency. For several quarters, the number of weekly nowcasts exceeds 21; we chose to concentrate on 21 weekly forecasts because this yields an equal number of forecasts for each quarter, making our measures of forecasting accuracy, computed at each forecast origin, comparable. The limiting factor was that the sequence of nowcasts for quarter 2018Q1 started one week later than usual for other quarters and comprises exactly 21 weekly forecasts. The first nowcast in each sequence is released 20 weeks ahead of the week when the advance GDP estimate for the targeted quarter is published. The second nowcast precedes the advance GDP release by 19 weeks, and so on. The release of the last nowcast for the targeted quarter coincides with the release of the advance GDP estimate for this quarter, which typically takes place at the end of the first month following the end of the quarter in question. Given such weekly releases, we label nowcasts by their forecast origin, measuring the distance in weeks to the week when the final nowcast for each quarter was released.
For example, the sequence of nowcasts for quarter 2009Q3, the first quarter after the end of the Great Recession, is shown in Figure 1 together with the second estimate of GDP growth in this quarter. There are several GDP releases (advance, second, and final) that are sequentially published at the end of the first, second, and third months of the following quarter; each can be compared with the nowcasts. In addition, one can compare nowcasts with GDP growth estimates from the vintage (released on 30 September 2020) that was available to us at the time of writing this manuscript. In the main text, we describe the results with respect to the second estimates of GDP growth. In Appendix A, we verify the robustness of our results by evaluating forecast accuracy against the other versions of GDP releases: advance, final, and latest; see Tables A1-A6. This sequence of weekly nowcasts illustrates the benefits of nowcasting very well. We can observe a gradual improvement in the outlook as more data become incorporated in the nowcasting model. Starting from a rather pessimistic nowcast of -1.76% released on 29 May 2009, each subsequent week brings largely positive news pushing up the nowcasts, until the final nowcast of 5.1% was made on 30 October 2009. In the course of 2020, the unfolding COVID-19 pandemic brought about new challenges for the world economy and also for forecast practitioners, who were forced to forecast unprecedented swings in GDP growth. In Figure 2, we present the corresponding nowcast sequence generated by the NY FED nowcasting model. As can be seen, the earliest nowcasts made at the beginning of March 2020 did not signal a severe recession for the US economy. It is only since the middle of April and in the course of May, when the enforced restrictions crippled the economy, that the model sent a strong negative signal that turned out to be very close to the GDP estimates published much later.
For example, according to the second GDP release on 31 July 2020, the US economy shrank at an annualized rate of 31.7%. Starting from the end of May, the model continuously signaled an improving outlook for the US economy. When assessing the model's predictive accuracy, we group nowcasts by their forecast origin. In doing so, we can track how nowcast precision evolves as more information about the relevant quarter becomes available over time. Tracking nowcast accuracy also makes it possible to determine how far ahead in the future one can forecast more accurately using extraneous information from various economic and financial indicators than, for example, naive benchmark models that use information from past GDP data only.
We further group each weekly forecast origin by forecast horizon. We distinguish between two forecast horizons, h = 1 and h = 2, depending on the distance in quarters between a targeted quarter and the latest quarter for which an official estimate of GDP growth was already released. For example, recall the sequence of forecasts displayed in Figure 1. For the forecast made on 29 May for 2009Q3, the GDP growth estimate for 2009Q1 was already available; hence, this forecast is labelled a two-step ahead forecast (h = 2). Meanwhile, for the forecast made on 4 September for the same quarter 2009Q3, the advance GDP estimate for 2009Q2 was already available; hence, this nowcast is labelled a one-step ahead forecast (h = 1). Similarly, in Figure 2, we distinguish between two- and one-step ahead forecasts made before and after the release of the advance GDP estimate for 2020Q1 on 29 April 2020.
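The horizon labelling just described boils down to a date comparison. A minimal sketch, using the 2009Q3 example from the text (the function name and interface are our own illustrative choices, not part of the NY FED codebase):

```python
from datetime import date

def forecast_horizon(nowcast_date, prev_quarter_advance_release):
    """Label a weekly nowcast by its forecast horizon: h = 2 if the
    nowcast precedes the advance GDP release for the previous quarter
    (only data up to two quarters back are then official), h = 1 after
    that release."""
    return 2 if nowcast_date < prev_quarter_advance_release else 1

# 2009Q3 example: the advance estimate for 2009Q2 was released on 31 July 2009.
advance_2009q2 = date(2009, 7, 31)
h_may = forecast_horizon(date(2009, 5, 29), advance_2009q2)   # two-step ahead
h_sep = forecast_horizon(date(2009, 9, 4), advance_2009q2)    # one-step ahead
```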
Such a breakdown of nowcasts into one- and two-step ahead forecasts is helpful when we compare their forecasting accuracy with that of the benchmark models, discussed in the next section. Forecasts from the benchmark models are made only once per quarter, whenever a release of the advance GDP estimate takes place. For example, a two-step ahead forecast for 2009Q3 from a benchmark model was made on 29 April, when the data for 2009Q1 were released. Consequently, a one-step ahead forecast for 2009Q3 from a benchmark model was made on 31 July, when the data for 2009Q2 were released.
To put the challenges posed by the COVID-19 pandemic into perspective, we show GDP growth outturns (second estimates) as well as model forecasts at three selected forecast origins, i.e., 20, 10, and 0 weeks preceding the advance GDP releases, in Figure 3. The shaded areas indicate the recessionary periods in our sample. The first is the Great Financial Crisis (2007Q4-2009Q2), and the second recessionary period spans the last two quarters in our sample, 2020Q1 and 2020Q2, with reported negative GDP growth. The expansionary period is correspondingly defined as 2002Q1-2007Q3 and 2009Q3-2019Q4.

Benchmark Models
In this section, we discuss the choice of a benchmark model against which one can compare the predictive accuracy of the factor model. A standard model that is routinely used as a benchmark in forecasting exercises for US GDP is an autoregressive model of order two, AR(2). For example, this benchmark model was used in Chauvet and Potter (2013) and Carriero et al. (2015). An alternative benchmark is the historical-mean model (HMM), which uses the average GDP growth rate as the forecast for the upcoming two quarters. As argued in Siliverstovs (2020a), this very simple model provides forecasts of US GDP growth that during NBER expansions match the predictive accuracy not only of an autoregressive model, but also of the mixed-frequency model of Carriero et al. (2015). Both benchmark models are estimated using recursively expanding windows that start in 1970Q1. At each forecast origin, one- and two-step ahead forecasts from the benchmark models are made using the real-time GDP vintage that was historically available. Please note that two-step ahead forecasts from the AR(2) model are obtained iteratively.
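As a minimal sketch of the two benchmarks, the snippet below fits an AR(2) by OLS on the available history and iterates the two-step ahead forecast, alongside the historical-mean forecast. Function names are illustrative, and the real-time vintage handling used in the paper is omitted:

```python
import numpy as np

def ar2_forecasts(y):
    """Fit y_t = c + a1*y_{t-1} + a2*y_{t-2} by OLS on the history y,
    then produce one- and two-step ahead forecasts, the latter obtained
    iteratively by feeding the one-step forecast back into the equation."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
    c, a1, a2 = np.linalg.lstsq(X, y[2:], rcond=None)[0]
    f1 = c + a1 * y[-1] + a2 * y[-2]   # one-step ahead forecast
    f2 = c + a1 * f1 + a2 * y[-1]      # two-step ahead, iterated
    return f1, f2

def hmm_forecasts(y):
    """Historical-mean model: the sample mean serves as the forecast
    at every horizon."""
    m = float(np.mean(y))
    return m, m
```

In the paper, both models are re-estimated at each forecast origin on the expanding GDP vintage available at that time, starting in 1970Q1.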
For the sake of brevity, we refer to the autoregressive and historical-mean models as ARM and HMM, respectively, and the NY FED Nowcasting model as DFM.

Forecast Accuracy Evaluation Metrics
In this section, we present such traditional measures of relative forecasting performance as the (Root) Mean Squared Forecast Error ((R)MSFE) and its relative counterparts, which deliver point estimates, as well as a more recent measure of relative forecasting performance due to Welch and Goyal (2008) that allows one to determine the influential observations contributing most to relative forecast accuracy, referred to as the Cumulated Sum of Squared Forecast Error Difference (CSSFED). Finally, we base our analysis on an innovative measure of relative forecasting accuracy based on rearranged observations, suggested in Siliverstovs (2020b). Siliverstovs (2020b) proposes to use this metric in order to gauge the leverage of influential observations directly on the relative (R)MSFE, in a similar way as the CSSFED allows one to sort out the effect of influential observations on the difference in (Root) Mean Squared Forecast Errors. In order to distinguish the newly introduced and traditional measures of relative forecast accuracy, we label those based on the rearranged observations as R²MSFE(+R) and R³MSFE(+R), denoting the Recursive Relative MSFE and the Recursive Relative Root MSFE (both based on rearranged observations), respectively.
By complementing our analysis with these recursive measures of forecast accuracy, we address the main shortcoming of measures such as the (Root) Mean Squared Forecast Error. In terms of this metric, the model ranking is based on comparing average values of squared forecast errors, which are not informative about whether one model should be preferred because it systematically produces lower (squared) forecast errors and is therefore genuinely better than its competitor, or whether the results are driven by a limited number of observations that artificially boost the difference in the reported (Root) Mean Squared Forecast Errors.

Traditional Measures of Point Forecast Accuracy
Models' predictive accuracy is evaluated using the following measures of point forecast accuracy: the Mean Squared Forecast Error,

$\mathrm{MSFE} = \frac{1}{T}\sum_{t=1}^{T} e_t^2,$

with T standing for the number of observations in the forecast evaluation period; the relative MSFE,

$\mathrm{rMSFE} = \mathrm{MSFE}_1 / \mathrm{MSFE}_2;$

and the Cumulated Sum of Squared Forecast Error Difference,

$\mathrm{CSSFED}_t = \sum_{j=1}^{t} \left( e_{1,j}^2 - e_{2,j}^2 \right).$

The MSFE and rMSFE are point estimates of forecast accuracy and represent a typical yardstick for comparing models' predictive accuracy in terms of average squared forecast errors. In contrast, the CSSFED, introduced in Welch and Goyal (2008), is a cumulative sequence of the differences in squared forecast errors that allows one to dissect the models' relative forecasting performance observation by observation.
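The three measures can be computed in a few lines. A minimal sketch, with toy error series invented purely for illustration:

```python
import numpy as np

def msfe(e):
    """Mean Squared Forecast Error: (1/T) * sum of squared forecast errors."""
    return float(np.mean(np.asarray(e) ** 2))

def cssfed(e1, e2):
    """Cumulated Sum of Squared Forecast Error Difference (Welch and
    Goyal 2008): the running sum of e1_t^2 - e2_t^2, tracing relative
    performance observation by observation."""
    return np.cumsum(np.asarray(e1) ** 2 - np.asarray(e2) ** 2)

e_model = np.array([0.5, -0.2, 0.1, -3.0])   # errors of the model of interest
e_bench = np.array([0.6, -0.5, 0.4, -6.0])   # errors of the benchmark
r_msfe = msfe(e_model) / msfe(e_bench)       # relative MSFE (< 1 favours the model)
path = cssfed(e_model, e_bench)              # final element = T * (MSFE_1 - MSFE_2)
```

Note how the last (large) error pair dominates both the relative MSFE and the final jump of the CSSFED path, which is exactly the leverage effect discussed in the text.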
There are a number of interesting patterns that this sequence can take, and these patterns reveal how and when one model dominates another in terms of forecasting accuracy. For example, a continuous upward or downward trend indicates that the first model tends to produce systematically larger or smaller (squared) forecast errors, respectively, than the second model does. Hovering around a horizontal line indicates that neither model produces smaller forecast errors in a systematic way. Naturally, breaks in the trend slope, i.e., situations when an initially positive slope turns negative, indicate reversals in the relative forecasting performance, or instabilities in forecasting performance as thoroughly discussed in Rossi (2013). Finally, jumps in the CSSFED sequence indicate an unusually large discrepancy in (squared) forecast errors in a given period, which can have a disproportionately large leverage on the calculated RMSFE of one model or on the relative ranking of two models based on their MSFEs or rMSFEs.
In Bayesian econometrics, there is a natural counterpart of the CSSFED referred to as the Cumulated Sum of Logarithmic Score Difference (CSLSD), or the cumulative log predictive Bayes factor. The significance of using recursive metrics such as the CSLSD for model comparison was emphasized in Geweke and Amisano (2010), who state that this metric "… shows how individual observations contribute to the evidence in favour of one model over another. For example, it may show that a few observations are pivotal in the evidence strongly favouring one model over another." This conclusion naturally extends to the CSSFED.

R²MSFE(+R)/R³MSFE(+R)
The R²MSFE(+R) and R³MSFE(+R) are extensions of the CSSFED metric to recursive estimation of the relative MSFE, and the link can be straightforwardly derived from

$\mathrm{rMSFE} = \mathrm{MSFE}_1 / \mathrm{MSFE}_2.$

Opening the MSFEs and cancelling the number of observations T results in

$\mathrm{rMSFE} = \frac{\sum_{t=1}^{T} e_{1,t}^2}{\sum_{t=1}^{T} e_{2,t}^2} = 1 + \frac{\sum_{t=1}^{T}\left(e_{1,t}^2 - e_{2,t}^2\right)}{\sum_{t=1}^{T} e_{2,t}^2}.$

Please note that the expression in the numerator of the last term corresponds to the cumulated sum of squared forecast error difference (CSSFED) introduced in Welch and Goyal (2008), see Equation (5).
As it stands, the rMSFE is a point estimate of the models' relative forecasting performance computed over the whole forecast evaluation sample. However, as argued above, the relative forecasting performance may change over time, and point estimates like the rMSFE are not informative about these changes. Hence, in order to gauge how the relative forecasting performance depends on separate observations, one needs a recursive version that, similarly to the CSSFED, measures the contributions of individual observations to wedges in the forecast accuracy of the competing models.
To this end, Siliverstovs (2020b) suggests a recursively computed relative MSFE that exposes the leverage of individual observations on the relative MSFE. It is computed recursively over a sequence of observations rearranged in ascending order of the absolute value of the squared forecast error difference, $|e_{1,j}^2 - e_{2,j}^2|$, i.e., observation $j_i$ precedes observation $j_k$ whenever $|e_{1,j_i}^2 - e_{2,j_i}^2| < |e_{1,j_k}^2 - e_{2,j_k}^2|$.
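A minimal sketch of this rearranged recursive relative MSFE, in the spirit of Siliverstovs (2020b), might look as follows (the function name is our own illustrative choice):

```python
import numpy as np

def r2msfe_plus_r(e1, e2):
    """Recursive relative MSFE based on rearranged observations: sort
    observations in ascending order of |e1_j^2 - e2_j^2|, then compute
    the relative MSFE on the first t rearranged observations for
    t = 1, ..., T. Jumps near the end of the resulting sequence expose
    observations with large leverage on the full-sample relative MSFE."""
    d1 = np.asarray(e1, dtype=float) ** 2
    d2 = np.asarray(e2, dtype=float) ** 2
    order = np.argsort(np.abs(d1 - d2), kind="stable")
    return np.cumsum(d1[order]) / np.cumsum(d2[order])
```

Since rearranging does not change the sums, the last element of the sequence equals the ordinary full-sample relative MSFE.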

Results
In this section, the results of the forecasting competition between the nowcasting model developed at the NY FED and the simple benchmark models are presented. In the main text, we report the results based on second GDP releases. We verify the robustness of the conclusions using the alternative GDP releases (advance, final, and latest), which are reported in Appendix A. This section is divided into two parts. The first part discusses the predictive ability of the models in terms of the traditional measures based on squared forecast errors averaged over the full evaluation sample or its recessionary and expansionary sub-samples. The second part applies the recursive measures of forecast accuracy, dissecting differences in predictive ability observation by observation.

Point Estimates of the Relative Forecasting Accuracy
The point estimates of the forecast accuracy (MSFE and relative MSFE) are reported in Tables 1 and 2, respectively. These two tables are organized in the following way. The left panel reports the measures of forecasting accuracy for the pre-COVID period, 2002Q1-2019Q4. In the left panel of Table 1, we report the MSFE for the full sample as well as separately for the expansionary period (2002Q1-2007Q3 and 2009Q3-2019Q4) and the period of the Great Financial Crisis (2007Q4-2009Q2). The left panel of Table 2 correspondingly contains the derived relative MSFEs of the DFM and ARM with respect to the benchmark HMM model. The right panel of each table contains the nominal and relative MSFEs for the full sample at our disposal (2002Q1-2020Q2) and the two recessionary periods (2007Q4-2009Q2 and 2020Q1-2020Q2). Since the expansionary period for the full sample is the same as for the pre-COVID sample, the relevant column was omitted from the right panel of Tables 1 and 2. The results presented in this way allow us to disentangle the effect of extending the forecasting exercise with the two COVID quarters on the nominal and relative forecast accuracy measures reported for the full sample, i.e., averaging across expansionary and recessionary quarters, as well as for the case when one is interested in differences in the models' predictive ability across the expansionary and recessionary phases.

First, we address differences in MSFEs brought about by extending the sample with the COVID recessionary period. The evolution of the MSFE at each forecast origin for each of the three models under scrutiny is shown for the pre-COVID and full samples in the left and right panels of Figure 4, respectively. Upon comparing these two plots, it becomes evident that the average squared forecast error substantially increased at every forecast origin and for every model. We also observe that at the earlier forecast origins the relative model ranking has changed.
In the pre-COVID period, the univariate benchmark models were characterized by a lower MSFE than the NY FED model. In the full sample, this advantage in forecast accuracy of the benchmark models disappeared. One more detail deserves attention. In the pre-COVID period, the MSFE of the DFM showed a clear downward trend, implying increasing forecast accuracy as more information was incorporated into the model. In the full sample, this pattern is no longer observed. In fact, the most accurate predictions are those made about 7-12 weeks before the advance GDP releases. Forecasts made at shorter forecast horizons are characterized by increasing MSFE values. An explanation for this observation can be found in Figure 2, where the sequence of nowcasts for 2020Q2 is presented. One can observe that, at the forecast origins of 7-12 weeks, the nowcasts are very close to the GDP outturn, whereas this is not the case for nowcasts made either earlier or later. In short, this example illustrates that a single data point can have a rather large influence on measures of forecast accuracy based on averages of squared forecast errors.
Comparative forecasting performance of the benchmark models deserves a special mention. As can be seen in Figure 4, when evaluated over the full sample (either with or without the COVID period), the ARM produces lower MSFE values than the HMM. At first glance, this observation should support the choice of the autoregressive model as the harder-to-beat benchmark. However, when one examines the relative MSFE of the ARM with respect to the HMM reported in Table 2, it becomes evident that during the expansionary phase the MSFE values of the ARM are up to 20% higher than those of the HMM. It is only because the ARM's gains in forecast accuracy during recessions overcompensate its losses relative to the historical mean model during expansions that the ARM comes out ahead on average. This implies that the HMM is the harder-to-beat benchmark during the expansions that take the lion's share of observations in our sample. This is the main reason why the relative measures of forecast accuracy in this study are reported with respect to the historical mean model.
Motivated by this conclusion, we present the evolution of the rMSFE DFM/HMM for the samples without and with the COVID observations in Figure 5. The overall conclusion that can tentatively be made is very comforting for the NY FED nowcasting model. The reduction in MSFEs relative to the HMM is up to 55% for the pre-COVID forecast evaluation sample and about 80% for the full sample under scrutiny. Another dimension for the analysis of the predictive ability of the NY FED nowcasting model is to compare the MSFE values for the expansionary and recessionary periods. Chauvet and Potter (2013) observe that forecasts are harder to make during recessions, in the sense that forecast errors tend to be larger than those observed during expansions. The corresponding MSFE values are shown in Figure 6. For the pre-COVID sample, one can observe a pattern that largely conforms to the observation made by Chauvet and Potter (2013). At the earlier forecast origins, the forecasts are less precise during the GFC than during the expansionary phases. At the same time, for forecasts made less than three weeks ahead of the advance GDP estimate releases, the forecasting accuracy during the GFC and expansions is very similar. However, the latter observation can no longer be confirmed when one compares the MSFEs computed for both the GFC and COVID recessionary periods with the MSFE computed for the expansionary period. As can be seen from the right panel of Figure 6, at all forecast origins, forecasts of GDP growth during the recessions are less precise than those during the expansions.
More importantly, given such large differences in the nominal measures of forecast accuracy during recessions and expansions, it is worthwhile to verify whether there are noticeable differences in the relative measures. In Figure 7, we plot the evolution of the relative MSFE of the NY FED model with respect to the historical mean model, rMSFE DFM/HMM, separately for the recessionary and expansionary periods. The left panel shows the results for the pre-COVID sample and the right panel for the sample extended with the COVID recessionary period. In both panels of Figure 7, one can observe a very pronounced asymmetry in the relative forecast accuracy of the DFM and HMM. During expansions, the HMM produces more precise forecasts at forecast origins more than four weeks ahead of advance GDP releases. Only when forecasts are made less than four weeks ahead of advance GDP releases does the forecast accuracy of the two models become very similar. Given the timing of a typical advance GDP release, this corresponds to forecast origins at the end of the last month of a targeted quarter and during the weeks of the first month after the end of the targeted quarter. This observation is consistent with that made by Chauvet and Potter (2013), i.e., simple univariate models are robust forecasting devices during expansions; see also Siliverstovs (2020a) for an assessment of the predictive ability of the model of Carriero et al. (2015) for US GDP growth during expansions and recessions. At the same time, during economic crisis periods, the NY FED nowcasting model produces much more accurate forecasts than the historical mean model.
At this point, it is instructive to compare the rMSFE DFM/HMM shown in Figure 5 for the full sample (excluding or including the COVID recession) with the values of rMSFE DFM/HMM shown in Figure 7. As can be seen, the advantages of the more sophisticated model over the very simple benchmark reported for the full sample are brought about by rather few observations during economic crises, be it only the Great Financial Crisis or both the GFC and the COVID pandemic. Hence, when one ignores this asymmetry in the models' forecasting performance across business cycle phases, the forecasting ability of a more sophisticated model during expansions tends to be severely overstated. In this sense, recessions serve as the breadwinner for forecasters devoted to developing evolved models, a point made by Siliverstovs and Wochner (2019, 2021) after a comprehensive and systematic evaluation of the forecastability of more than 200 US time series during expansions and recessions.

Recursive Estimates of the Relative Forecasting Accuracy
In this section, we present an analysis of the relative forecasting accuracy of the NY FED dynamic factor model (DFM) and the historical mean model (HMM). The main focus of our analysis is on Figures 9 and 10, which present the CSSFED DFM/HMM and R 2 MSFE(+R) recursive measures. The auxiliary plots depicting the SFED DFM/HMM in its natural temporal ordering and rearranged by absolute value within each sub-period (expansionary, GFC, and COVID) are shown in Figures 8 and 11, respectively.
For the sake of brevity and without loss of generality, we concentrate on the forecasts made at three selected forecast origins, namely 20, 10, and 0 weeks ahead of advance GDP estimate releases. Figure 8 depicts the SFED DFM/HMM computed for each out-of-sample forecast evaluation period. As can be seen, there are rather few observations for which we observe substantial differences in forecast accuracy at the 20-week forecast origin, and even fewer at the other two forecast origins. Note that the largest SFED in our sample occurs during the COVID pandemic in 2020Q2. As expected, we also observe substantial differences in forecast accuracy between the two models during the GFC. The raw plots of the SFED DFM/HMM are informative in pointing out that the models' relative forecasting accuracy varies from observation to observation and that the differences tend to be more pronounced during periods of economic distress. However, a simple operation, cumulative summation, makes these differences informative about changes in the relative ranking of the models based on their forecasting performance. The resulting cumulative sums of the SFED DFM/HMM are shown in the respective panels of Figure 9. Points above the zero line indicate that, up until that observation, the DFM produced on average higher squared forecast errors than the HMM; points below the zero line indicate the opposite. As for the forecasts made at the 20-week origin (see the upper panel of Figure 9), we can conclude that, based on the evidence from all but one observation (2020Q2), the forecast accuracy of the HMM was superior to that of the DFM. It was only the latest observation in our forecast evaluation sample that reversed the conclusion in favour of the DFM over the HMM. This is a good example of a single observation leading to a complete overhaul of the models' relative ranking based on their average forecasting performance.
From the middle panel of Figure 9, we can infer that the conclusion on the superior average forecasting ability of the HMM over the DFM was reversed much earlier, namely during the GFC period. In any case, for the 10- and 0-week forecast origins, the 2020Q2 observation strongly reinforces the evidence of the superior average forecasting accuracy of the NY FED nowcasting model that first surfaced during the GFC.
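The construction of the CSSFED described above can be sketched in a few lines of Python. This is a minimal sketch, not the author's code, and it follows the sign convention stated above: points above zero mean the model has so far accumulated larger squared errors than the benchmark. The error series are hypothetical.

```python
import numpy as np

def cssfed(errors_model, errors_bench):
    """Cumulated Sum of Squared Forecast Error Differences.
    SFED_t = e_model_t^2 - e_bench_t^2, cumulated over time; with this
    sign convention, values above zero mean the model has accumulated
    larger squared errors than the benchmark up to observation t."""
    e_m = np.asarray(errors_model, dtype=float)
    e_b = np.asarray(errors_bench, dtype=float)
    return np.cumsum(e_m ** 2 - e_b ** 2)

# Hypothetical errors: the model loses slightly in calm quarters and
# wins heavily in one crisis quarter (the last observation).
e_dfm = [0.6, 0.5, 0.7, -1.0]
e_hmm = [0.4, 0.3, 0.5, -8.0]
print(cssfed(e_dfm, e_hmm))  # drifts up, then drops sharply at the end
```

The sharp drop at the final observation mimics the pattern in Figure 9, where a single crisis quarter overturns the cumulative ranking of the two models.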
Last but not least, we conclude the analysis of relative predictive ability with the R 2 MSFE(+R) of Siliverstovs (2020b). The R 2 MSFE(+R) allows one to track the evolution of the rMSFE DFM/HMM directly, as for its computation one adds observations of increasing intensity, measured in terms of the absolute value of the SFED, |SFEDt|. One option is to reorder the observations purely by the magnitude of |SFEDt| in ascending order. We, however, apply a slight modification, rearranging the observations by capitalising on the knowledge of the expansionary and recessionary phases of the business cycle in our data. We present in Figure 10 the R 2 MSFE(+R) based on the observations rearranged by the magnitude of |SFEDt| in ascending order within each of the three sub-samples: the expansions (2002Q1-2007Q3 and 2009Q3-2019Q4), the GFC period (2007Q4-2009Q2), and the COVID period (2020Q1-2020Q2). The main advantage of presenting the results in this way is that they are directly comparable to the numerical results reported in Table 2. The underlying SFEDs, rearranged in ascending order of their modulus within the three sub-periods (expansionary, GFC, and COVID), are shown in Figure 11. The R 2 MSFE(+R) sequences calculated for the three selected forecast origins, shown in Figure 10, visually display the asymmetry in the relative forecasting ability of the dynamic factor and historical mean models across the sub-samples. As for the expansionary period, we can clearly observe that the HMM, on average, produces lower squared forecast errors than its sophisticated counterpart at the forecast origins 20 and 10 weeks ahead of advance GDP releases. The last point in the red sequence in the upper and middle panels of the figure corresponds to the rMSFE reported in Column (3) of Table 2.
These values indicate that the MSFE of the DFM is 71.8% and 28.0% higher than that of the HMM for forecasts made 20 and 10 weeks before the releases of advance GDP estimates, respectively. From the lower panel of the figure, we can infer that the rMSFE DFM/HMM is very close to zero for forecasts released during the same week in which advance GDP estimates are published. In fact, the corresponding entry of -0.022 in Table 2 indicates that the MSFE of the DFM is only about 2% lower than that of the HMM during expansions.
The sequence of green dots shows how the relative MSFE changes when observations from the GFC period are added. The last green dot corresponds to the value of rMSFE reported in Column (1) of Table 2. For the 20-week-ahead forecasts, we can read off the corresponding value of 0.165, which indicates that the average forecast accuracy of the HMM remains superior to that of the DFM even when the observations from the GFC are taken into account. However, for the shorter forecast horizon of 10 weeks, the corresponding entry is -0.308, indicating a change in the models' relative ranking brought about by the observations from the GFC period. This perfectly illustrates how large gains in forecasting ability during the Great Recession, accrued by the more sophisticated model, can overcompensate for forecast accuracy losses during much longer periods of economic expansion. As for the forecasts released during the same week as the advance GDP releases, the relevant value of -0.546 in Table 2 signals a reduction of about 55% in MSFE delivered by the DFM.
Finally, the blue dots indicate how the rMSFE DFM/HMM changes when the two remaining observations (2020Q1-2020Q2) are added to the sample. The last dots in these sequences correspond to the entries in Column (7) of Table 2 for the DFM. As discussed above, the addition of these observations changes the relative ranking of the models for the 20-week-ahead forecasts, now indicating a reduction of about 5.4% in MSFE delivered by the DFM, and substantially lowers the relative MSFE for forecasts made at shorter horizons. The corresponding reductions in MSFE are 81.8% and 68.6% relative to the HMM.
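The rearrangement underlying the R 2 MSFE(+R) can be sketched as follows. This is a minimal Python reading of the procedure described above, not the author's implementation: within each business cycle sub-period, taken in a user-supplied order, observations are sorted by |SFEDt| in ascending order, and the relative MSFE is recomputed each time an observation is appended. All series, regime labels, and the function name are hypothetical.

```python
import numpy as np

def r2msfe_plus_r(e_model, e_bench, regime, regime_order):
    """Recursive relative MSFE on rearranged observations.
    Within each regime (taken in regime_order), observations are sorted
    by |SFED_t| ascending; the relative MSFE is recomputed after each
    observation is appended to the rearranged sample."""
    e_m = np.asarray(e_model, dtype=float)
    e_b = np.asarray(e_bench, dtype=float)
    sfed_abs = np.abs(e_m ** 2 - e_b ** 2)
    order = []
    for reg in regime_order:
        idx = [i for i, r in enumerate(regime) if r == reg]
        order += sorted(idx, key=lambda i: sfed_abs[i])
    # Ratio of cumulated squared errors equals the ratio of means,
    # since both cumulate over the same number of observations.
    cum_m = np.cumsum(e_m[order] ** 2)
    cum_b = np.cumsum(e_b[order] ** 2)
    return cum_m / cum_b - 1.0  # one rMSFE value per added observation

# Hypothetical example: expansion quarters first, then the two recessions
e_dfm = [0.5, -0.4, 1.0, -2.0]
e_hmm = [0.4, -0.6, 3.0, -9.0]
reg   = ["exp", "exp", "gfc", "cvd"]
path  = r2msfe_plus_r(e_dfm, e_hmm, reg, ["exp", "gfc", "cvd"])
```

The last element of the returned sequence corresponds to the full-sample rMSFE, analogous to the last blue dot in Figure 10.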

Conclusions
In this study, we analyze the predictive performance of the dynamic factor model (DFM) developed and maintained at the NY FED. In contrast to many forecasting exercises carried out on past historical data vintages, this project publishes forecasts in a do-it-as-you-go manner and is therefore free of data-snooping bias. This fact allows us to evaluate the genuine forecasting ability of the econometric model in question, not least during the unfolding COVID-19 pandemic, when uncertainty about current economic conditions is substantially elevated compared with tranquil times and the demand for an accurate assessment of the current state of the economy is especially acute.
The dataset that we analyze comprises forecasts of US quarterly GDP growth made in real time since 2016 and forecasts calculated backwards for 2002-2015 using historical data vintages. We summarize the nominal and relative accuracy of the DFM forecasts made at 21 weekly forecast origins. The earliest forecast origin precedes the release of advance GDP estimates by 20 weeks, the next by 19 weeks, and so on until the week in which the advance GDP estimate for the targeted quarter is released. The DFM forecast accuracy is compared with that of two benchmark models: a historical mean model (HMM) and an autoregressive model of order two (ARM).
The main contribution to the forecasting literature is that we analyze the DFM predictive performance over the whole period as well as separately over its expansionary and recessionary sub-periods. The recessionary sub-period includes two distinct episodes: the Great Financial Crisis (2007Q4-2009Q2) and the first two quarters of the unfolding COVID-19 pandemic (2020Q1-2020Q2). In doing so, we intend to verify whether the conclusions of Chauvet and Potter (2013) on the asymmetric predictive ability of a wide range of modern macroeconometric models also apply to the NY FED nowcasting model.
Our main conclusion is that we indeed observe a very strong variation in the forecasting ability of the DFM across business cycle phases. This conclusion is supported when one examines the predictive performance during the pre-COVID period, which contains only one recessionary episode (the GFC), and it is further reinforced when the latest observations from the COVID pandemic period are included in the analysis. As for the expansionary period, we find that at longer forecast horizons the accuracy of the DFM predictions is inferior to that of the historical mean model. It is only at forecast origins less than four weeks ahead of advance GDP releases that the sophisticated and the naive benchmark models deliver similar forecasting accuracy. By contrast, the DFM delivers superior forecast accuracy during the recessionary quarters.
Our analysis also demonstrates that the widespread practice of reporting measures of forecast accuracy based on average squared forecast errors (and their differences) over longer periods that include both expansionary and recessionary phases of the business cycle is prone to deliver a biased assessment of the models' nominal and relative forecasting ability. Typically, since the relative gains in forecasting accuracy of a more sophisticated model during recessions significantly outweigh its relative losses during expansions, averaging across both expansions and recessions tends to artificially exaggerate the predictive ability of the more sophisticated model relative to naive benchmarks, both for the period as a whole and for its expansionary sub-sample.
In order to avoid such misrepresentation of the results of a forecasting exercise, it is advisable to complement the measures of forecasting ability reported for the whole sample with results reported for more homogeneous sub-samples, e.g., expansions and recessions. Additional information on the models' relative forecasting ability can be provided by recursive measures of forecast accuracy such as the CSSFED of Welch and Goyal (2008) and the R 2 MSFE(+R) of Siliverstovs (2020b). These recursive measures dissect the relative predictive performance of the competing models observation by observation and allow one to gauge the leverage of one or a few observations on the models' relative ranking.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: All computations were performed in R (R Core Team 2012). The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A
The tables presented in the Appendix replicate the results shown in Tables 1 and 2 in the main text but rely on forecast accuracy metrics computed for the advance, final, and latest releases of GDP estimates. For additional information, please see the notes to Table A3. The results for the expansionary period (2002Q1-2007Q3 and 2009Q3-2019Q4) are identical in the left and right panels and are therefore reported only in the left panel.