Using the Entire Yield Curve in Forecasting Output and Inflation

In forecasting a variable (forecast target) using many predictors, a factor model with principal components (PC) is often used. When the predictors are the yield curve (a set of many yields), the Nelson–Siegel (NS) factor model is used in place of the PC factors. These PC or NS factors are combining information (CI) in the predictors (yields). However, these CI factors are not “supervised” for a specific forecast target in that they are constructed by using only the predictors but not using a particular forecast target. In order to “supervise” factors for a forecast target, we follow Chan et al. (1999) and Stock and Watson (2004) to compute PC or NS factors of many forecasts (not of the predictors), with each of the many forecasts being computed using one predictor at a time. These PC or NS factors of forecasts are combining forecasts (CF). The CF factors are supervised for a specific forecast target. We demonstrate the advantage of the supervised CF factor models over the unsupervised CI factor models via simple numerical examples and Monte Carlo simulation. In out-of-sample forecasting of monthly US output growth and inflation, it is found that the CF factor models outperform the CI factor models especially at longer forecast horizons.


Introduction
The predictive power of the yield curve for macroeconomic variables has been documented in the literature for a long time. Many different points on the yield curve have been used and various methodologies have been examined. For example, Stock and Watson (1989) find that two interest rate spreads, the difference between the six-month commercial paper rate and the six-month Treasury bill rate, and the difference between the ten-year and one-year Treasury bond rates, are good predictors of real activity, thus contributing to their index of leading indicators. Bernanke (1990), Friedman and Kuttner (1993), Estrella and Hardouvelis (1991), and Kozicki (1997), among many others, have investigated a variety of yields and yield spreads individually on their ability to forecast macroeconomic variables. Hamilton and Kim (2002) as well as Diebold et al. (2005) provide a brief summary of this line of research and the link between the yield curve and macroeconomic variables.
Various macroeconomic models for exploring the yield curve information for real activity prediction have been proposed. Ang and Piazzesi (2003) and Piazzesi (2005) study the role of macroeconomic variables in an arbitrage-free affine yield curve model. Estrella (2005) constructs an analytical rational expectations model to investigate the reasons for the success of the slope of the yield curve (the spread between long-term and short-term government bond rates) in predicting real economic activity and inflation. The model in Ang et al. (2006) is an arbitrage-free dynamic model (using lags of GDP growth and yields as regressors) that characterizes expectations of GDP growth. Rudebusch and Wu (2008) provide an example of a macro-finance specification that employs more macroeconomic structure and includes both rational expectations and inertial elements. Stock and Watson (1999, 2002) investigate forecasts of output growth and inflation using over a hundred economic indicators, including many interest rates and yield spreads. Stock and Watson (2002, 2012) advocate methods that aim at solving the large-N predictor problem, particularly those using principal components (PC). Ang et al. (2006) suggest the use of the short rate, the five-year to three-month yield spread, and lagged GDP growth in forecasting GDP growth out-of-sample. They argue for these two yield curve characteristics because they have an almost one-to-one correspondence with the first two principal components of the short rate and five yield spreads, which account for 99.7% of quarterly yield curve variation.
As an alternative to the PC factor approach on the large-N predictor information set, Diebold and Li (2006) propose the Nelson and Siegel (1987) (NS) factors for the large-N yields. They use a modified three-factor NS model to capture the dynamics of the yield curve and show that the three NS factors may be interpreted as level, slope, and curvature. Diebold et al. (2006) examine the correlations between NS yield factors and macroeconomic variables. They find that the level factor is highly correlated with inflation and that the slope factor is highly correlated with real activity. For more on the yield curve background and the three characteristics of the yield curve, see Litterman and Scheinkman (1991) and Diebold and Li (2006).
In this paper, we utilize the yield curve information for prediction of macroeconomic variables. Using a large number of yield curve points with different maturities yields a large-N problem in the predictive regression. The PC factors or the NS factors of the yield curve may be used to reduce the large dimension of the predictors. However, the PC and NS factors of the yield curve are not supervised for a specific variable to forecast. These factors simply combine information (CI) of many predictors (yields) without having to look at a forecast target. Hence, the conventional CI factor models (using factors of the predictors) are unsupervised for any forecast target.
Our goal in this paper is to consider factor models where the factors are computed with a particular forecast target in mind. Specifically, we consider the PC or NS factors of forecasts (not of predictors), with each of the forecasts formed using one predictor at a time. (It could be generalized to make each forecast from using more than one predictor, e.g., a subset of the N predictors, in which case there can be as many as $2^N$ forecasts to combine.) These factors will combine the forecasts (CF). The PC factors of forecasts are combined forecasts using the combining weights that solve a singular value problem for a set of forecasts, while the NS factors of forecasts are combined forecasts using the combining weights obtained from orthogonal polynomials that emulate the shape of a yield curve (in level, slope, and curvature). The PC or NS factors of the many forecasts are supervised for a forecasting target. The main idea of the CF-factor model is to focus on the space spanned by forecasts rather than the space spanned by predictors. The factorization of forecasts (CF-factor model) can substantially improve forecasting performance compared to the factorization of predictors (CI-factor model). This is because the CF-factor model takes the forecast target into the factorization, while the conventional CI-factor model is blind to the forecast target because the factorization uses only information on predictors.
For both CI and CF schemes, the NS factor model is relevant only when the yield curve is used as predictors, while the PC factor model can be used in general. The NS factors are specific to the yield curve: level, slope, and curvature factors. When the predictors are points on the yield curve, the NS factor models proposed here are nearly the same as the PC factor models. Given the similarity of NS and PC and the generality of PC, we begin the paper with the PC models to understand the mechanism of the supervision in CF-factor models. We demonstrate how the supervised CF factor models outperform the unsupervised CI factor model in the presence of many predictors (50 points on the yield curve at each time). The empirical work shows that there are potentially big gains in the CF-factor models. In out-of-sample forecasting of U.S. monthly output growth and inflation, it is found that the CF factor models (CF-NS and CF-PC) are substantially better than the conventional CI factor models (CI-NS and CI-PC). The advantage of supervised factors is even greater for longer forecast horizons.
The paper is organized as follows: in Section 2, we describe the CI and CF frameworks and principal component approaches for their estimation, present theoretical results about supervision, and give an example to provide intuition. Section 3 provides simulations of supervision under different noise, predictor correlation, and predictor persistence conditions. In Section 4, we introduce the NS component approaches for the CI and CF frameworks. In Section 5, we show the out-of-sample performance of the proposed methods in forecasting U.S. monthly output growth and inflation. Section 6 presents the conclusions.

Factor Models
Let $y_{t+h}$ denote the variable to be forecast (output growth or inflation) using yield curve information stamped at time $t$, where $h$ denotes the forecast horizon. The predictor vector $x_t$ contains information about the yield curve at various maturities: $x_t := (x_{1t}, x_{2t}, \ldots, x_{Nt})'$, where $x_{it} := x_t(\tau_i)$ denotes the yield at time $t$ with maturity $\tau_i$ ($i = 1, 2, \ldots, N$).
Consider the CI model when $N$ is large,
$$y_{t+h} = (1 \;\; x_t')\,\alpha + \varepsilon_{t+h}, \tag{1}$$
for which the forecast at time $T$ is
$$\hat y^{\text{CI-OLS}}_{T+h} = (1 \;\; x_T')\,\hat\alpha_T, \tag{2}$$
with $\hat\alpha_T$ estimated by OLS using the information up to time $T$. A problem is that here the mean-squared forecast error (MSFE) is of order $O(N/T)$, increasing with $N$. A solution to this problem is to reduce the dimension, either by selecting a subset of the $N$ predictors, e.g., by Lasso-type regression (Tibshirani 1996), or by using factor models as in, e.g., Stock and Watson (2002). In this paper, we focus on using the factor model rather than selecting a subset of the $N$ predictors. 2

CI-Factor Model
The conventional factor model is the CI factor model for $x_t$ of the form
$$x_t = \Lambda_{CI} f_{CI,t} + v_{CI,t}, \tag{3}$$
where $\Lambda_{CI}$ is $N \times k_{CI}$ and $f_{CI,t}$ is $k_{CI} \times 1$. The estimated factor loadings $\hat\Lambda_{CI}$ are obtained either by following Stock and Watson (2002) and Bai (2003), or by following Nelson and Siegel (1987) and Diebold and Li (2006). The latter approach is discussed in Section 4. The factors are then estimated by
$$\hat f_{CI,t} = \hat\Lambda_{CI}' x_t. \tag{4}$$
As this model computes the factors from all $N$ predictors of $x_t$ directly, it will be called "CI-factor".
The forecast $\hat y_{T+h} = (1 \;\; \hat f_{CI,T}')\,\hat\alpha_{CI}$ can be formed using $\hat\alpha_{CI}$ estimated at time $T$ from the regression
$$y_{t+h} = (1 \;\; \hat f_{CI,t}')\,\alpha_{CI} + u_{CI,t+h}. \tag{5}$$
In matrix form, we write the factor model (3) and (5) for the vector of forecast target observations $y$ and for the $T \times N$ matrix of predictors $X$ as follows:
$$X = F_{CI}\Lambda_{CI}' + v_{CI}, \tag{6}$$
$$y = F_{CI}\alpha_{CI} + u_{CI}, \tag{7}$$
where $y$ is the $T \times 1$ vector of observations, $F_{CI}$ is a $T \times k_{CI}$ matrix of factors, $\Lambda_{CI}$ is an $N \times k_{CI}$ matrix of factor loadings, $\alpha_{CI}$ is a $k_{CI} \times 1$ parameter vector, $v_{CI}$ is a $T \times N$ random matrix, and $u_{CI}$ is a $T \times 1$ vector of random errors.
2 Bai and Ng (2008) consider CI factor models with a selected subset (targeted predictors).
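The two CI steps in (4) and (5) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `ci_pc_forecast` and its arguments are hypothetical, the loadings are taken to be principal-component loadings, and the predictors are centered before the factors are extracted.

```python
import numpy as np

def ci_pc_forecast(X, y_lead, x_T, k):
    """Illustrative CI-PC forecast (hypothetical helper, not from the paper).

    X      : (T, N) predictor matrix, rows are x_t'
    y_lead : (T,) target aligned so that y_lead[t] is y_{t+h}
    x_T    : (N,) most recent predictor vector
    k      : number of principal components (k_CI)
    """
    mu = X.mean(axis=0)
    Xc = X - mu                                   # assumption: center the predictors
    # Loadings: first k right singular vectors of the centered predictors.
    _, _, Wt = np.linalg.svd(Xc, full_matrices=False)
    Lam = Wt[:k].T                                # (N, k)
    F = Xc @ Lam                                  # factors, f_t = Lam' x_t as in (4)
    Z = np.column_stack([np.ones(len(F)), F])     # regressors (1, f_t') as in (5)
    alpha, *_ = np.linalg.lstsq(Z, y_lead, rcond=None)
    f_T = (x_T - mu) @ Lam
    return float(np.concatenate([[1.0], f_T]) @ alpha)
```

Note that `y_lead` never enters the computation of `Lam` or `F`; it is used only in the final regression, which is exactly the sense in which the CI factors are unsupervised.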

Remark 1. (No supervision in CI-factor model):
Consider the joint density of $(y_{t+h}, x_t)$,
$$D(y_{t+h}, x_t; \theta) = D_1(y_{t+h} \mid x_t; \theta)\, D_2(x_t; \theta), \tag{8}$$
where $D_1$ is the conditional density of $y_{t+h}$ given $x_t$, and $D_2$ is the marginal density of $x_t$. The CI-factor model assumes a situation where the joint density operates a "cut" in the terminology of Barndorff-Nielsen (1978) and Engle et al. (1983), such that
$$D(y_{t+h}, x_t; \theta) = D_1(y_{t+h} \mid x_t; \theta_1)\, D_2(x_t; \theta_2), \tag{9}$$
where $\theta = (\theta_1' \;\; \theta_2')'$, and $\theta_1 = \alpha$, $\theta_2 = (F, \Lambda)$ are "variation-free". Under this situation, the forecasting equation in (5) is obtained from the conditional model $D_1$ and the factor equation in (3) is obtained solely from the marginal model $D_2$ of the predictors. The computation of the factors is entirely from the marginal model $D_2$, which is blind to the forecast target $y_{t+h}$.
While the CI factor analysis of a large predictor matrix $X$ solves the dimensionality problem, it computes the factors using information in $X$ only, without accounting for the variable $y$ to be forecast, and therefore the factors are not supervised for the forecast target. Our goal in this paper is to improve this approach by accounting for the forecast target in the computation of the factors. The procedure will be called supervision.
There are some attempts in the literature to supervise factor computation for a given forecast target. For example, Bair et al. (2006) and Bai and Ng (2008) consider factors of selected predictors that are informative for a specified forecast target; Zou et al. (2006) consider sparse loadings of principal components; De Jong (1993) and Groen and Kapetanios (2016) consider partial least squares regression; De Jong and Kiers (1992) consider principal covariate regression; Armah and Swanson (2010) select variables for factor proxies that have the maximum predictive power for the variable being forecast; and some weighted principal components have been used to downweight noisier series.
In this paper, we consider the CF-factor model that computes factors from forecasts rather than from predictors. This approach has been proposed in Chan et al. (1999) and in Stock and Watson (2004), there labeled "principal component forecast combination". We will refer to this approach as CF-PC (combining forecasts principal components). The details are as follows.

CF-Factor Model
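The forecasts from a CF-factor model are computed in two steps. The first step is to estimate the factors of the individual forecasts. Let the individual forecasts be formed by regressing the forecast target $y_{t+h}$ on the $i$th individual predictor $x_{it}$:
$$\hat y^{(i)}_{t+h} := \hat a_i + \hat b_i x_{it}, \qquad i = 1, \ldots, N. \tag{10}$$
Stacking the $N$ individual forecasts into the vector $\hat y_{t+h} := (\hat y^{(1)}_{t+h}, \ldots, \hat y^{(N)}_{t+h})'$, consider the factor model
$$\hat y_{t+h} = \Lambda_{CF} f_{CF,t+h} + v_{CF,t+h}. \tag{11}$$
The CF-factor is estimated from
$$\hat f_{CF,t+h} := \hat\Lambda_{CF}' \hat y_{t+h}. \tag{12}$$
The second step is to estimate the forecasting equation (for which the estimated CF-factors from the first step are used as regressors) 4
$$y_{t+h} = \hat f_{CF,t+h}'\,\alpha_{CF} + u_{CF,t+h}. \tag{13}$$
Then, the CF-factor forecast at time $T$ is
$$\hat y^{CF}_{T+h} = \hat f_{CF,T+h}'\,\hat\alpha_{CF}, \tag{14}$$
where $\hat\alpha_{CF}$ is estimated. See Chan et al. (1999), Huang and Lee (2010), and Stock and Watson (2004).
To write the CF-factor model in matrix form, we assume for notational simplicity that the data have been centered so that we do not include a constant term. We regress $y$ on the columns $x_i$ of $X$, $i = 1, \ldots, N$, one at a time, and write the fitted values in (10) as
$$\hat y_i := x_i b_i, \qquad b_i := (x_i'x_i)^{-1} x_i'y. \tag{15}$$
Collect the fitted values in the matrix
$$\hat Y := [\hat y_1 \;\; \hat y_2 \;\; \cdots \;\; \hat y_N] = XB, \tag{16}$$
where $B = \mathrm{diag}(b_1, \ldots, b_N) \in R^{N \times N}$ is a diagonal matrix containing the regression coefficients. We call $B$ the supervision matrix. Then, the CF-factor model is
$$XB = F_{CF}\Lambda_{CF}' + v_{CF}, \tag{17}$$
$$y = F_{CF}\alpha_{CF} + u_{CF}, \tag{18}$$
where $F_{CF}$ is a $T \times k_{CF}$ matrix of factors of $\hat Y = XB$, $\Lambda_{CF}$ is an $N \times k_{CF}$ matrix of factor loadings, $\alpha_{CF}$ is a $k_{CF} \times 1$ parameter vector, $v_{CF}$ is a $T \times N$ random matrix, and $u_{CF}$ is a $T \times 1$ vector of random errors.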
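The two-step CF procedure can be sketched as follows. This is a minimal illustration, assuming centered data and principal-component loadings for the forecasts; the function name `cf_pc_forecast` and its arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def cf_pc_forecast(X, y_lead, x_T, k):
    """Illustrative CF-PC forecast (hypothetical helper, not from the paper).

    X      : (T, N) predictor matrix
    y_lead : (T,) target aligned so that y_lead[t] is y_{t+h}
    x_T    : (N,) most recent predictor vector
    k      : number of principal components of the forecasts (k_CF)
    """
    mu_x, mu_y = X.mean(axis=0), y_lead.mean()
    Xc, yc = X - mu_x, y_lead - mu_y              # centering, so no constant term
    # Step 1a: one-predictor-at-a-time OLS slopes; these are the diagonal of B.
    b = (Xc.T @ yc) / (Xc ** 2).sum(axis=0)
    Y_hat = Xc * b                                # T x N matrix of forecasts, X B
    # Step 1b: principal components of the forecasts, not of the predictors.
    _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    Lam = Vt[:k].T                                # (N, k)
    F = Y_hat @ Lam
    # Step 2: regress the target on the CF factors.
    alpha, *_ = np.linalg.lstsq(F, yc, rcond=None)
    f_T = ((x_T - mu_x) * b) @ Lam
    return float(mu_y + f_T @ alpha)
```

In contrast to the CI construction, the target enters already in the first step through `b`, so columns of `X` that are strongly related to the target are scaled up before the principal components are taken.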
In the rest of the paper, the subscripts CI and CF may be omitted for simplicity. We use principal components (PC) as discussed in Stock and Watson (2002), Bai (2003), and Bai and Ng (2006). For the specific case of yield curve data, we use NS components as discussed in Nelson and Siegel (1987) and Diebold and Li (2006). We use both CF and CI approaches together with PC factors and NS factors. Our goal is to show that forecasts using supervised factor models (CF-PC and CF-NS) are better than forecasts from conventional unsupervised factor models (CI-PC and CI-NS).
4 Given the dependent nature of macroeconomic and financial time series, the forecasting equation can be extended to allow the supervision to be based on the relation between $y_t$ and some predictors after controlling for lagged dependent variables and to allow the dynamic factor structure, which we leave for future work.
We show analytically and in simulations how supervision works to improve factor computation with respect to a specified forecast target. In Section 5, we present empirical evidence.

Remark 2. (Estimation of B):
The CF-factor model in (17) and (18) with $B = I_N$ (identity matrix) is a special case when there is no supervision. In this case, the CF-factor model collapses to the CI-factor model. If $B$ were consistently estimated by minimizing the forecast error loss, then the CF-factor model with the "optimal" $B$ would outperform the CI-factor model. However, as the dimension of the supervision matrix $B$ grows with $N^2$, $B$ is an "incidental parameter" matrix and cannot be estimated consistently. See Neyman and Scott (1948) and Lancaster (2000). Any estimation error in $B$ translates into forecast error in the CF-factor model. Whether there is any virtue in considering Bayesian methods of estimating $B$, while still avoiding this problem, is left for future research. Instead, in this paper, we circumvent this difficulty by imposing that $B = \mathrm{diag}(b_1, \ldots, b_N)$ be a diagonal matrix and by estimating the diagonal elements $b_i$ from the ordinary least squares regression in (10) or (15) with one predictor $x_i$ at a time. The supervision matrix $B$ can be non-diagonal in general. As imposing the diagonality on $B$ may be restrictive, it is an interesting empirical question whether the CF-factor forecast with this restriction and this estimation strategy for $B$ can still outperform the CI-factor forecast with $B = I_N$. Our empirical results in Section 5 (Table 1) support this simple estimation strategy for the diagonal matrix $B$, in favor of the CF-factor model.

Remark 3. (Combining forecasts with many predictors):
It is generally believed that it is difficult to estimate the forecast combination weights when $N$ is large. Therefore, the equal weights $1/N$ have been widely used instead of estimating weights. 5 It is often found in the literature that equally-weighted combined forecasts perform best. Stock and Watson (2004) call this the "forecast combination puzzle". See also Timmermann (2006). Smith and Wallis (2009) explore a possible explanation of the forecast combination puzzle and conclude that it is due to estimation error of the combining weights. Now, we note that, in the CF-factor model described above, we can consistently estimate the combining weights. From the CF-factor forecast (14) and the estimated factor (12),
$$\hat y^{CF}_{T+h} = \hat f_{CF,T+h}'\,\hat\alpha_{CF} = \hat y_{T+h}'\,\hat\Lambda_{CF}\hat\alpha_{CF} =: \hat y_{T+h}'\,\hat w, \tag{19}$$
where
$$\hat w := \hat\Lambda_{CF}\hat\alpha_{CF} \tag{20}$$
is estimated consistently as long as $\hat\Lambda_{CF}$ and $\hat\alpha_{CF}$ are estimated consistently.
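The implied combining weights in (20) can be computed directly once the loadings and the second-step coefficients are available. The sketch below assumes a centered forecast matrix and principal-component loadings; `cf_pc_weights` is a hypothetical name used only for illustration.

```python
import numpy as np

def cf_pc_weights(Y_hat, y, k):
    """Implied combining weights w = Lam @ alpha for a (T, N) forecast matrix.

    Hypothetical helper: Y_hat holds the N individual forecasts column by
    column, y is the (centered) target, k is the number of CF factors.
    """
    _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    Lam = Vt[:k].T                                # (N, k) loadings
    F = Y_hat @ Lam                               # CF factors as in (12)
    alpha, *_ = np.linalg.lstsq(F, y, rcond=None) # second-step coefficients
    return Lam @ alpha                            # (N,) weights as in (20)
```

The combined forecast is then `Y_hat @ w`, so the N weights are functions of only k estimated coefficients plus the loadings. Comparing `w` with the equal-weight vector `np.full(N, 1 / N)` is one simple way to see how far the estimated combination departs from simple averaging.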

Singular Value Decomposition
In this section, we formalize the concept of supervision and explain how it improves factor extraction. We compare the two different approaches CI-PC (Combining Information-Principal Components) and CF-PC (Combining Forecasts-Principal Components) in a linear forecast problem of the time series $y$ given predictor data $X$. We explain the advantage of the CF-PC approach over CI-PC in Section 2.3 and give some examples in Section 2.4. We explore the advantage of supervision in simulations in Section 3.2. As an alternative to PC factors, we propose the use of NS factors in Section 4.

Principal components of predictors X (CI-PC):
Let $X \in R^{T \times N}$ be a matrix of regressors and let
$$X = R\Sigma W' \tag{21}$$
be the singular value decomposition of $X$, with $\Sigma \in R^{T \times N}$ diagonal rectangular, that is, a diagonal square matrix padded with zero rows below the square if $\min(T, N) = N$ or padded with zero columns next to the square if $\min(T, N) = T$, and with $R \in R^{T \times T}$ and $W \in R^{N \times N}$ unitary. Write
$$X'X = W\Sigma'R'R\Sigma W' = W\Sigma'\Sigma W', \tag{22}$$
where $\Sigma'\Sigma := \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)$ is diagonal and square. Therefore, $W$ contains the eigenvectors of $X'X$. For a matrix $A \in R^{T \times N}$, denote by $A_k \in R^{T \times k}$ the matrix consisting of the first $k \le N$ columns of $A$. Then, $W_k$ is the matrix containing the singular vectors corresponding to the $k = k_{CI}$ largest singular values $(\sigma_1, \ldots, \sigma_k)$. The first $k$ principal components are given by
$$F_{CI} := XW_k = R\Sigma W'W_k = R\Sigma_k = R_k\Sigma_{kk}, \tag{23}$$
where $\Sigma_{kk}$ is the $k \times k$ upper-left diagonal block of $\Sigma$. The projection (forecast) of $y$ onto $F_{CI}$ is given by
$$\hat y_{\text{CI-PC}} := F_{CI}(F_{CI}'F_{CI})^{-1}F_{CI}'y = R_kR_k'y, \tag{24}$$
as $R_k'R_k = I_k$. Therefore, the CI forecast, $\hat y_{\text{CI-PC}}$, is the projection of $y$ onto $R_k$. The CI forecast error and the CI sum of squared errors (SSE) are
$$y - \hat y_{\text{CI-PC}} = (I_T - R_kR_k')\,y, \tag{25}$$
$$SSE_{\text{CI-PC}} = \|y - \hat y_{\text{CI-PC}}\|^2 = y'(I_T - R_kR_k')\,y, \tag{26}$$
as $(I_T - R_kR_k')$ is symmetric idempotent.
Bai (2003) shows that, under general assumptions on the factor and error structure, $\hat F_{CI}$ is a consistent and asymptotically normal estimator of $F_{CI}H$, where $H$ is an invertible $k \times k$ matrix. This identification problem is also clear from Equation (24), and it conveniently allows us to identify the principal components with $F_{CI} = R_k$. The principal components are scalar multiples of the first $k$ columns of $R$. Bai's result shows that principal components can be estimated consistently only up to linear combinations. Bai and Ng (2006) show that the parameter vector $\alpha$ in the forecast equation can be estimated consistently for $\alpha'H^{-1}$ with an asymptotically normal distribution.
5 An exception is Wright (2009), who uses Bayesian model averaging (BMA) for pseudo out-of-sample prediction of U.S. inflation, and finds that it generally gives more accurate forecasts than simple equal-weighted averaging. He uses N = 107 predictors.

Principal components of forecasts Ŷ (CF-PC):
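To generate forecasts in a CF-factor scheme, we regress $y$ on the columns $x_i$ of $X$, $i = 1, \ldots, N$, one at a time, and calculate the fitted values of (15). Collect the fitted values in the matrix $\hat Y = XB$ as in (16), with $B = \mathrm{diag}(b_1, \ldots, b_N)$ containing the regression coefficients on its diagonal. Compute the singular value decomposition of $\hat Y$:
$$\hat Y = S\Theta V', \tag{27}$$
with $\Theta \in R^{T \times N}$ diagonal rectangular, and $S \in R^{T \times T}$, $V \in R^{N \times N}$ unitary. Pick the first $k = k_{CF}$ principal components of $\hat Y$,
$$F_{CF} := \hat YV_k = S\Theta V'V_k = S\Theta_k = S_k\Theta_{kk}, \tag{28}$$
where $V_k$ is the $N \times k$ matrix of the singular vectors corresponding to the $k$ largest singular values $(\theta_1, \ldots, \theta_k)$ and $\Theta_{kk}$ is the $k \times k$ upper-left diagonal block of $\Theta$. Again, we can identify the estimated $k$ principal components of $\hat Y$ with $F_{CF} = S_k$, where $F_{CF}$ is the $T \times k_{CF}$ matrix of factors of $\hat Y$. The projection (forecast) of $y$ onto $F_{CF}$ is given by
$$\hat y_{\text{CF-PC}} := F_{CF}(F_{CF}'F_{CF})^{-1}F_{CF}'y = S_kS_k'y, \tag{29}$$
as $S_k'S_k = I_k$. The CF forecast, $\hat y_{\text{CF-PC}}$, is the projection of $y$ onto $S_k$. The CF forecast error and the CF SSE are
$$y - \hat y_{\text{CF-PC}} = (I_T - S_kS_k')\,y, \tag{30}$$
$$SSE_{\text{CF-PC}} = \|y - \hat y_{\text{CF-PC}}\|^2 = y'(I_T - S_kS_k')\,y, \tag{31}$$
as $(I_T - S_kS_k')$ is symmetric idempotent.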
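The projection identity used for both CI-PC and CF-PC, namely that regressing on the first k principal components is the same as projecting onto the first k left singular vectors, can be checked numerically. The helper name `pc_projection_sse` is hypothetical, and the sketch assumes the first k singular values are strictly positive.

```python
import numpy as np

def pc_projection_sse(M, y, k):
    """SSE from projecting y on the first k principal components of M (T x N).

    Uses only the first k left singular vectors U_k of M, so that the fitted
    value is U_k U_k' y and the SSE is y'(I - U_k U_k')y.
    """
    U = np.linalg.svd(M, full_matrices=False)[0]
    U_k = U[:, :k]
    resid = y - U_k @ (U_k.T @ y)
    return float(resid @ resid)
```

Calling the function with `M = X` gives the CI-PC SSE, and calling it with `M = X * b` (the forecast matrix built from the single-regression slopes `b`) gives the CF-PC SSE.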

Supervision
In this sub-section, we explain the advantage of CF-PC over CI-PC in factor computation. We call the advantage "supervision", which is defined as follows:
Definition 1. (Supervision). The advantage of CF-PC over CI-PC, called supervision, is the selection of principal components according to their contribution to variation in $y$, as opposed to selection of principal components according to their contribution to variation in the columns of $X$. This is achieved by selecting principal components from a matrix of forecasts of $y$.
We use the following measures of supervision of CF-PC in comparison with CI-PC.

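Definition 2. (Absolute Supervision). Absolute supervision is the difference of the sums of squared errors (SSE) of CI-PC and CF-PC:
$$s_{abs}(X, y, k_{CI}, k_{CF}) := \|y - \hat y_{\text{CI-PC}}\|^2 - \|y - \hat y_{\text{CF-PC}}\|^2 = y'\left(S_{k_{CF}}S_{k_{CF}}' - R_{k_{CI}}R_{k_{CI}}'\right)y. \tag{32}$$
Definition 3. (Relative Supervision). Relative supervision is the ratio of the sums of squared errors of CI-PC over CF-PC:
$$s_{rel}(X, y, k_{CI}, k_{CF}) := \frac{\|y - \hat y_{\text{CI-PC}}\|^2}{\|y - \hat y_{\text{CF-PC}}\|^2} = \frac{y'\left(I_T - R_{k_{CI}}R_{k_{CI}}'\right)y}{y'\left(I_T - S_{k_{CF}}S_{k_{CF}}'\right)y}. \tag{33}$$
When all components are used, the two SSEs coincide and there is no room for supervision, because $SS' = RR' = I_T$. Relative supervision is defined only for $k_{CF} < N$.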
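The two supervision measures can be computed from the data as in the following sketch. The names `supervision` and `_sse` are hypothetical, and the data are assumed to be centered (or to need no constant), as in the matrix form of the CF-factor model.

```python
import numpy as np

def _sse(M, y, k):
    """y'(I - U_k U_k')y, with U_k the first k left singular vectors of M."""
    U_k = np.linalg.svd(M, full_matrices=False)[0][:, :k]
    r = y - U_k @ (U_k.T @ y)
    return float(r @ r)

def supervision(X, y, k_ci, k_cf):
    """Return (absolute, relative) supervision as in Definitions 2 and 3."""
    b = (X.T @ y) / (X ** 2).sum(axis=0)   # diagonal of the supervision matrix B
    sse_ci = _sse(X, y, k_ci)              # components chosen by variation in X
    sse_cf = _sse(X * b, y, k_cf)          # components chosen by variation in XB
    return sse_ci - sse_cf, sse_ci / sse_cf
```

The relative measure divides by the CF-PC SSE, so the function should only be called with `k_cf` small enough that this SSE is not zero.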
For the sake of simplifying the notation and presentation, we consider the same number of factors in the CI and CF factor models, $k_{CI} = k_{CF} = k$, for the rest of the paper.
Remark 5. $S_k$ is a block of a basis change matrix that in the expression $y'S_k$ returns the first $k$ coordinates of $y$ with respect to the new basis. This new basis is the one with respect to which the mapping $\hat Y\hat Y' = XBB'X' = S\Theta\Theta'S'$ becomes diagonal, with singular values in descending order such that the first $k$ columns of $S$ correspond to the $k$ largest singular values. Therefore, $y'S_kS_k'y$ is the sum of the squares of these coordinates. Broadly speaking, the columns of $S_k$ are the $k$ largest components of $y$ in the sense of $\hat Y$ and its construction from the single regression coefficients. Thus, $y'S_kS_k'y$ is the sum of the squares of the $k$ coordinates of $y$ that contribute most to the variation in the columns of $\hat Y$.
Analogously, $R_k$ is a block of a basis change matrix that for $y'R_k$ returns the first $k$ coordinates of $y$ with respect to the basis that diagonalizes the mapping $XX' = R\Sigma\Sigma'R'$. Therefore, $y'R_kR_k'y$ is the sum of squares of the $k$ coordinates of $y$ selected according to their contribution to variation in the columns of $X$.
We emphasize, however, that the factors that explain most of the variation of the columns of $X$, i.e., the eigenvectors associated with the largest eigenvalues of $XX'$, which are selected in the principal component analysis of $X$, may have little to do with the factors that explain most of the variation of $y$. The relation between $X$ and $y$ in the data-generating process can, at worst, completely reverse the order of principal components in the columns of $X$ and in $y$. We demonstrate this in the following Example 1.

Example 1
In this subsection, we give a small example to facilitate intuition for the supervision mechanics of CF-PC. Example 1 illustrates how the supervision of factor computation defined in Definition 1 operates. In Example 2 in the next section, we add randomness to Example 1 to explore the effect of stochasticity in a well-understood problem. Let
$$X = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 0 \\ 0 & 1/3 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1/4 \\ 0 & 0 & 0 & 1/5 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad y = (1, 2, 3, 4, 5, 0)',$$
with $T = 6$ and $N = 5$. The singular value decomposition of $X$ is $X = R\Sigma W'$ with $R = I_6$, $\Sigma$ the $6 \times 5$ matrix with diagonal $(1, 1/2, 1/3, 1/4, 1/5)$ and a zero last row, and $W = [e_3 \; e_1 \; e_2 \; e_5 \; e_4]$, where $e_j$ denotes the $j$th unit vector. Then, the diagonal matrix $B$ that contains the coefficients of $y$ w.r.t. each column of $X$ is $B = \mathrm{diag}(4, 9, 1, 25, 16)$, and
$$\hat Y = XB = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 2 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 4 \\ 0 & 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
The singular value decomposition of $\hat Y$ is $\hat Y = S\Theta V'$ with $S = [e_5 \; e_4 \; e_3 \; e_2 \; e_1 \; e_6]$, $\Theta$ the $6 \times 5$ matrix with diagonal $(5, 4, 3, 2, 1)$ and a zero last row, and $V = [e_4 \; e_5 \; e_2 \; e_1 \; e_3]$. We set $k_{CI} = k_{CF} = k$ and compare CI-PC and CF-PC with the same number of principal components. Recall from (23) that $F_{CI} = R\Sigma_k$ and from (28) that $F_{CF} = S\Theta_k$. The absolute supervision and relative supervision, defined in (32) and (33), are computed for each $k$; see Appendix A for the calculation. The absolute supervision is positive and the relative supervision is larger than 1 for all $k < N$.
As noted in Remarks 1 and 5, the relation between $X$ and $y$ is crucial. In this example, the magnitude of the components in $y$ is reversed from the order in $X$. For $X$, the ordering of the columns of $X$ with respect to the largest eigenvalues of $XX'$ is $\{3, 1, 2, 5, 4\}$. For $y$, the ordering of the columns of $X$ with respect to the largest eigenvalues of $\hat Y\hat Y'$ is $\{4, 5, 2, 1, 3\}$. For example, consider the case $k = 2$, i.e., we choose two out of five factors in the principal component analysis. CI-PC, the analysis of $X$, will pick columns 3 and 1 of $X$, that is, the vectors $(1, 0, 0, 0, 0, 0)'$ and $(0, 1/2, 0, 0, 0, 0)'$. These correspond to the two largest singular values 1 and 1/2 of $X$. CF-PC, the analysis of $\hat Y$, will pick columns 4 and 5 of $X$, that is, the vectors $(0, 0, 0, 0, 1/5, 0)'$ and $(0, 0, 0, 1/4, 0, 0)'$. These correspond to the two largest singular values 5 and 4 of $\hat Y$. The regression coefficients in $B = \mathrm{diag}(4, 9, 1, 25, 16)$ de-emphasize columns 3 and 1 of $X$ and emphasize columns 4 and 5 of $X$.
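The numbers in Example 1 can be reproduced with a short NumPy script. The matrix `X` and vector `y` below are the ones implied by the column vectors, singular values, and the matrix B = diag(4, 9, 1, 25, 16) quoted in the text; the helper `sse` is a hypothetical name.

```python
import numpy as np

# Example 1 data: T = 6, N = 5.
X = np.zeros((6, 5))
X[0, 2], X[1, 0], X[2, 1], X[3, 4], X[4, 3] = 1, 1/2, 1/3, 1/4, 1/5
y = np.array([1., 2., 3., 4., 5., 0.])

b = (X.T @ y) / (X ** 2).sum(axis=0)    # diagonal of B: 4, 9, 1, 25, 16
Y_hat = X * b                           # matrix of individual forecasts, X B

def sse(M, y, k):
    """SSE after projecting y on the first k left singular vectors of M."""
    U_k = np.linalg.svd(M, full_matrices=False)[0][:, :k]
    r = y - U_k @ (U_k.T @ y)
    return float(r @ r)

# k -> (SSE of CI-PC, SSE of CF-PC)
results = {k: (sse(X, y, k), sse(Y_hat, y, k)) for k in range(1, 5)}
for k, (ci, cf) in results.items():
    print(k, round(ci, 6), round(cf, 6), round(ci - cf, 6), round(ci / cf, 6))
# SSE pairs (CI, CF): k=1: (54, 30), k=2: (50, 14), k=3: (41, 5), k=4: (25, 1)
```

For every k below N the CF-PC error is smaller, because CF-PC picks up the large entries of y first while CI-PC picks up the small ones first.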

Monte Carlo
There are several simplifications in the construction of Example 1, which we relax by the following extensions: (a) Adding randomness makes the estimation of the regression coefficients in $B$ a statistical problem. The sampling errors influence the selection of the components of $\hat Y$. (b) Adding correlation among regressors (columns of $X$) introduces correlation among individual forecasts (columns of $\hat Y$), increasing the effect of sampling error in the selection of the components of $\hat Y$. (c) Increasing $N$ to realistic magnitudes, in particular in the presence of highly correlated regressors, will increase estimation error in the principal components due to collinearity.
We address the first extension (a) in Example 2. All three extensions (a), (b), (c) are addressed in Example 3 of Section 3.2.

Example 2
Consider adding some noise to $X$ and $y$ in Example 1. Let $v$ be a $T \times N$ matrix of independent random numbers, each entry distributed as $N(0, \sigma_v^2)$, and $u$ be a vector of independent random numbers, each distributed as $N(0, \sigma_u^2)$. In this example, the new regressor matrix $X$ is the sum of $X$ in Example 1 and the noise term $v$, and the new $y$ is the sum of $y$ in Example 1 and the noise term $u$. For simplicity, we set $\sigma_v = \sigma_u$ in the simulations and let both range from 0.01 to 3. This covers a substantial range of randomness given the magnitude of the numbers in $X$ and $y$. For each scenario of $\sigma_v = \sigma_u$, we generate 1000 random matrices $v$ and random vectors $u$ and calculate the Monte Carlo average of the sums of squared errors (SSE).
Figure 1 plots the Monte Carlo average of the SSEs for selection of $k = 1$ to $k = 4$ components. For standard deviations $\sigma_v = \sigma_u$ close to zero, the sums of squared errors are as calculated in Example 1. As the noise increases, the advantage of CF over CI decreases but remains substantial, in particular for smaller numbers of principal components. For $k = 5$ estimated components (not shown), the SSEs of CI-PC and CF-PC coincide because $k = N$.
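The Monte Carlo experiment of Example 2 can be sketched as below. The function names are hypothetical, the seed handling is an assumption made for reproducibility, and the default of 1000 replications follows the text.

```python
import numpy as np

def example1_data():
    """Deterministic X (6 x 5) and y (6,) of Example 1."""
    X = np.zeros((6, 5))
    X[0, 2], X[1, 0], X[2, 1], X[3, 4], X[4, 3] = 1, 1/2, 1/3, 1/4, 1/5
    return X, np.array([1., 2., 3., 4., 5., 0.])

def sse(M, y, k):
    """SSE after projecting y on the first k left singular vectors of M."""
    U_k = np.linalg.svd(M, full_matrices=False)[0][:, :k]
    r = y - U_k @ (U_k.T @ y)
    return float(r @ r)

def mc_average_sse(sigma, k, n_rep=1000, seed=0):
    """Monte Carlo average of (SSE of CI-PC, SSE of CF-PC) with sigma_v = sigma_u = sigma."""
    rng = np.random.default_rng(seed)
    X0, y0 = example1_data()
    total = np.zeros(2)
    for _ in range(n_rep):
        X = X0 + sigma * rng.standard_normal(X0.shape)   # noisy regressors
        y = y0 + sigma * rng.standard_normal(y0.shape)   # noisy target
        b = (X.T @ y) / (X ** 2).sum(axis=0)             # estimated diagonal of B
        total += np.array([sse(X, y, k), sse(X * b, y, k)])
    return total / n_rep
```

Evaluating `mc_average_sse` on a grid of `sigma` values between 0.01 and 3 for k = 1, ..., 4 gives curves of the kind plotted in Figure 1.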

Example 3
We consider the data-generating process (DGP)
$$X = F\Lambda' + v,$$
$$y = F\alpha + u,$$
where $y$ is the $T \times 1$ vector of observations, $F$ is a $T \times r$ matrix of factors, $\Lambda$ is an $N \times r$ matrix of factor loadings, $\alpha$ is an $r \times 1$ parameter vector, $v$ is a $T \times N$ random matrix, and $u$ is a $T \times 1$ vector of random errors. We set $T = 200$, $N = 50$ and consider $r = 3$ data-generating factors. Note that, under this DGP, the CI-PC model in Equations (6) and (7) is correctly specified if the correct number of factors is identified, i.e., $k_{CI} = r$. Even under this DGP, however, an insufficient number of factors, $k_{CI} < r$, can still result in an advantage of the CF-PC model over the CI-PC model. We will explore this question in this section.
Factors and persistence: For each run in the simulation, we generate the $r$ factors in $F$ as independent AR(1) processes with zero mean and a normally distributed error with mean zero and variance one:
$$F_{t,i} = \phi F_{t-1,i} + \varepsilon_{t,i}, \qquad \varepsilon_{t,i} \sim N(0, 1), \qquad i = 1, \ldots, r.$$
We consider a grid of 19 different AR(1) coefficients $\phi$, equidistant between 0 and 0.90. We consider $r = 3$ data-generating factors and $k \in \{1, 2, 3, 4\}$ estimated factors.
Contemporaneous factor correlation: Given a correlation coefficient $\rho$ for adjacent regressors, the $N \times r$ matrix $\Lambda$ of factor loadings is obtained from the first $r$ columns of an upper triangular matrix from a Cholesky decomposition of the $N \times N$ correlation matrix implied by $\rho$. We consider a grid of 19 different values for $\rho$, equidistant between the points −0.998 and 0.998. In this setup, the 10th value is very close to $\rho = 0$. Then, the covariance matrix of the regressors is given by
$$E X'X = \Lambda\,(E F'F)\,\Lambda' + E v'v = \Omega_F + \Omega_v,$$
where $\Omega_F = \Lambda\Lambda'$ and $\Omega_v = E v'v$ is given by the identity matrix in our simulations. The relation $E F'F = I$ is due to the independence of the factors, but may be subject to substantial finite sample error, in particular for $\phi$ close to one, for well-known reasons.
Relation of X and y: The $r \times 1$ parameter vector $\alpha$ is drawn randomly from a standard normal distribution for each run in the simulation. This allows $\alpha$ to randomly shuffle which factors are important for $y$.
Noise level: We set $\sigma_u = \sigma_v$ and let it range between 0.1 and 3 in steps of 0.1. We add the case of 0.01 that essentially corresponds to a deterministic factor model.
For a given number $r = 3$ of data-generating factors, the simulation setup varies along the dimensions $\phi$ (19 points), $k$ (4 points), $\rho$ (19 points), and $\sigma_u = \sigma_v$ (31 points). For every single scenario, we run 1000 simulations and calculate the SSEs of CI-PC and CF-PC, and the relative supervision $s_{rel}(X, y, k, k)$. Then, we take the Monte Carlo average of the SSEs and of $s_{rel}(X, y, k, k)$ over the 1000 simulations. 7
The Monte Carlo results are presented in Figures 2-4. Each figure contains four panels that plot the situation for $k = 1, 2, 3, 4$ estimated factors. The main findings from the figures can be summarized as follows:
1. Figure 2: When the correct number of factors or more are estimated ($k \ge r$), as in the bottom panels, the advantage of supervision increases with the noise level $\sigma_u = \sigma_v$. Even in this case, when CI-PC is the correct model ($k \ge r$), supervision becomes larger as the noise increases.
2. Figure 3: The advantage of supervision is greatest when the contemporaneous correlation $\rho$ between predictors is minimal. For almost perfect correlation, the advantage of supervision disappears. This is true regardless of whether the correct number of factors is estimated or not. Intuitively, for near-perfect factor correlation, the difference between those factors that explain variation in the columns of $X$ and those that explain variation in $\hat Y$ vanishes, and so supervision becomes meaningless.
3. Figure 4: The persistence of the factors is varied through AR(1) coefficients $\phi$ ranging from 0 to 0.9, while the noise level is fixed at $\sigma_u = \sigma_v = 1$ and the contemporaneous regressor correlation is $\rho = 0$.
7 In relation to the empirical application using the yield data in Section 5, we could have calibrated the simulation design to make the Monte Carlo more realistic. Nevertheless, our Monte Carlo design covers wide ranges of the parameter values for the noise levels and correlation structures ($\rho$ and $\phi$) in the yield data. Figure 2 shows that the supervision is smaller with larger noise levels, which may be rather obvious intuitively. Figure 4 shows that the advantage of supervision when the factors are persistent depends on the number of factors $k$ relative to the true number of factors $r$. Particularly interesting is Figure 3, which shows that the advantage of supervision is smaller when the contemporaneous correlation $\rho$ between predictors is larger, which may be relevant for the yield data because the yields with different maturities may be moderately contemporaneously correlated. We thank a referee for pointing this out.
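One draw from the data-generating process of Example 3 can be sketched as follows. Several details are assumptions and are marked as such in the code: the correlation matrix is taken to be Toeplitz with entries rho^|i-j|, the upper triangular Cholesky factor is the transpose of NumPy's lower factor, and a burn-in period is used to initialize the AR(1) factors. The function name `simulate_dgp` is hypothetical.

```python
import numpy as np

def simulate_dgp(T=200, N=50, r=3, phi=0.5, rho=0.0, sigma=1.0, seed=0, burn=100):
    """One draw of (X, y) from X = F Lam' + v, y = F alpha + u (illustrative)."""
    rng = np.random.default_rng(seed)
    # Independent AR(1) factors with N(0, 1) innovations; burn-in is an assumption.
    eps = rng.standard_normal((T + burn, r))
    F = np.zeros_like(eps)
    for t in range(1, T + burn):
        F[t] = phi * F[t - 1] + eps[t]
    F = F[burn:]
    # ASSUMPTION: correlation matrix with entries rho^|i-j| (not given in the text).
    idx = np.arange(N)
    C = rho ** np.abs(idx[:, None] - idx[None, :])
    # Upper triangular Cholesky factor; keep its first r columns as loadings.
    Lam = np.linalg.cholesky(C).T[:, :r]
    alpha = rng.standard_normal(r)                  # random relation of X and y
    X = F @ Lam.T + sigma * rng.standard_normal((T, N))
    y = F @ alpha + sigma * rng.standard_normal(T)
    return X, y
```

Looping this function over the grids for `phi`, `rho`, and `sigma` described above, and averaging the SSEs of CI-PC and CF-PC over repeated draws, would give a simulation of the same design; results will depend on the assumed correlation matrix.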

Supervising Nelson-Siegel Factors
In the previous section, we have examined the factor model based on principal components. When the predictors are points on the yield curve, an alternative factor model can be constructed based on Nelson-Siegel (NS) components. We introduce two new factor models, CF-NS and CI-NS, by replacing principal components with NS components in the CF-PC and CI-PC models. Like CI-PC, CI-NS is unsupervised. Like CF-PC, CF-NS is supervised for the particular forecast target of interest.

Nelson-Siegel Components of the Yield Curve
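As an alternative to using principal components in the factor model, one can apply the modified Nelson-Siegel (NS) three-factor framework of Diebold and Li (2006) to factorize the yield curve. Nelson and Siegel (1987) propose Laguerre polynomials
$$L_n(z) = \frac{e^{z}}{n!}\,\frac{d^{n}}{dz^{n}}\left(z^{n} e^{-z}\right)$$
with weight function $w(z) = e^{-z}$ to model the instantaneous nominal forward rate (forward rate curve)
$$f_t(\tau) = \beta_1 + (\beta_2 + \beta_3)\, L_0(z)\, e^{-\theta\tau} - \beta_3\, L_1(z)\, e^{-\theta\tau},$$
where $z = \theta\tau$, $L_0(z) = 1$, $L_1(z) = 1 - \theta\tau$, and $\beta_j \in \mathbb{R}$ for all $j$. The decay parameter θ may change over time, but we fix θ = 0.0609 for all t following Diebold and Li (2006). 8 Then, the continuously compounded zero-coupon nominal yield $x_t(\tau)$ of the bond with maturity τ months at time t is
$$x_t(\tau) = \frac{1}{\tau}\int_0^{\tau} f_t(s)\,ds = \beta_1 + \beta_2\,\frac{1 - e^{-\theta\tau}}{\theta\tau} + \beta_3\left(\frac{1 - e^{-\theta\tau}}{\theta\tau} - e^{-\theta\tau}\right).$$
Allowing the $\beta_j$'s to change over time and adding the approximation error $v_{it}$, we obtain the following approximate NS factor model for the yield curve for $i = 1, \ldots, N$:
$$x_t(\tau_i) = \lambda_i' f_t + v_{it}, \qquad (48)$$
where $f_t = (\beta_{1t}, \beta_{2t}, \beta_{3t})'$ are the three NS factors and
$$\lambda_i' = \left(1,\; \frac{1 - e^{-\theta\tau_i}}{\theta\tau_i},\; \frac{1 - e^{-\theta\tau_i}}{\theta\tau_i} - e^{-\theta\tau_i}\right)$$
are the factor loadings. The three factors are associated with level, slope, and curvature of the yield curve.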
8 Diebold and Li (2006) show that fixing the Nelson-Siegel decay parameter at θ = 0.0609 maximizes the curvature loading at a maturity of about 30 months and allows better identification of the three NS factors. They also show that allowing θ to be a free parameter does not improve the forecasting performance. Therefore, following their advice, we fix θ = 0.0609 and do not estimate it. A small θ (for a slowly decaying curve) fits the curve better at long maturities, and a large θ (for a fast decaying curve) fits the curve better at short maturities.
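To make the loadings concrete, the following is a minimal sketch of how the three NS loadings could be computed for a set of maturities with θ fixed at 0.0609. The function name `ns_loadings` and the short maturity list are illustrative assumptions, not part of the paper.

```python
import numpy as np

THETA = 0.0609  # decay parameter, fixed as in the text


def ns_loadings(maturities, theta=THETA):
    """Return the N x 3 matrix of NS loadings (level, slope, curvature).

    `maturities` are in months. Name and signature are illustrative.
    """
    tau = np.asarray(maturities, dtype=float)
    z = theta * tau
    level = np.ones_like(tau)               # loading on beta_1
    slope = (1.0 - np.exp(-z)) / z          # loading on beta_2
    curvature = slope - np.exp(-z)          # loading on beta_3
    return np.column_stack([level, slope, curvature])


# Hypothetical subset of maturities (months), for illustration only
maturities = [1, 3, 6, 12, 24, 36, 60, 120]
Lambda = ns_loadings(maturities)
print(np.round(Lambda, 3))
```

The level loading is constant, the slope loading starts near one and decays with maturity, and the curvature loading is hump-shaped, which is what gives the three factors their interpretation.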

NS Components of Predictors X (CI-NS)
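We have N yields as predictors, $x_t = (x_{1t}, x_{2t}, \ldots, x_{Nt})'$, where $x_{it} = x_t(\tau_i)$ denotes the yield at maturity $\tau_i$ months at time t ($i = 1, 2, \ldots, N$). Stacking $x_{it}$ for $i = 1, 2, \ldots, N$, (48) can be written as
$$x_t = \Lambda_{CI}\, f_{CI,t} + v_t$$
or
$$x_{it} = \lambda_{CI,i}'\, f_{CI,t} + v_{it},$$
where $\lambda_{CI,i}'$ denotes the i-th row of
$$\Lambda_{CI} = \begin{pmatrix} 1 & \dfrac{1 - e^{-\theta\tau_1}}{\theta\tau_1} & \dfrac{1 - e^{-\theta\tau_1}}{\theta\tau_1} - e^{-\theta\tau_1} \\ \vdots & \vdots & \vdots \\ 1 & \dfrac{1 - e^{-\theta\tau_N}}{\theta\tau_N} & \dfrac{1 - e^{-\theta\tau_N}}{\theta\tau_N} - e^{-\theta\tau_N} \end{pmatrix}, \qquad (51)$$
which is the N × 3 matrix of known factor loadings because we fix θ = 0.0609 following Diebold and Li (2006). The NS factors $\hat{f}_{CI,t} = (\hat{\beta}_{1t}, \hat{\beta}_{2t}, \hat{\beta}_{3t})'$ are estimated by regressing $x_{it}$ on $\lambda_{CI,i}$ (over $i = 1, \ldots, N$), that is, by fitting the yield curve period by period for each t.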
Then, to forecast y_{t+h} (such as output growth or inflation), we consider a forecast equation that is linear in the estimated NS factors. We first estimate its coefficient vector α̂_CI using the information up to time T and then form the forecast, which we call CI-NS. This method is comparable to CI-PC with the number of factors fixed at k = 3. It differs from CI-PC, however, in that the three NS factors (β̂_1t, β̂_2t, β̂_3t) have intuitive interpretations as level, slope, and curvature of the yield curve, while the first three principal components may not have a clear interpretation. In the empirical section, we also consider two alternative CI-NS forecasts that include only the level factor β̂_1t (denoted CI-NS(k = 1)) or only the level and slope factors (β̂_1t, β̂_2t) (denoted CI-NS(k = 2)), to see whether the level factor or the combination of level and slope factors makes the dominant contribution in forecasting output growth and inflation.
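The two steps of CI-NS can be sketched as follows: a cross-sectional least-squares fit of the yield curve in each period, followed by a time-series regression of the target on the estimated factors. This is only a sketch under stated assumptions: the function names, the alignment convention (`y[t]` is the target dated t), and the inclusion of an intercept in the forecast equation are ours and are not taken from the paper.

```python
import numpy as np


def ns_loadings(maturities, theta=0.0609):
    """N x 3 matrix of NS loadings (level, slope, curvature); maturities in months."""
    tau = np.asarray(maturities, dtype=float)
    z = theta * tau
    slope = (1.0 - np.exp(-z)) / z
    return np.column_stack([np.ones_like(tau), slope, slope - np.exp(-z)])


def ci_ns_forecast(X, y, maturities, h, k=3):
    """CI-NS(k) forecast of the target h periods after the last row of X.

    X : (T, N) array of yields, one row per period
    y : (T,) array, y[t] is the target dated t (so row t of X is paired with y[t + h])
    """
    Lam = ns_loadings(maturities)
    # Step 1: fit the yield curve period by period (OLS of x_t on the known loadings)
    F = np.linalg.lstsq(Lam, X.T, rcond=None)[0].T      # (T, 3) estimated factors
    # Step 2: regress y_{t+h} on the first k factors (intercept is an assumption)
    Z = np.column_stack([np.ones(len(F)), F[:, :k]])
    alpha = np.linalg.lstsq(Z[:-h], y[h:], rcond=None)[0]
    return float(Z[-1] @ alpha)
```

Because the loadings are known, step 1 involves no estimation of the loading matrix; only the three factors per period and the coefficients of the forecast equation are estimated.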

NS Components of Forecasts Ŷ (CF-NS)
While CI-NS solves the large-N dimensionality problem by reducing the N yields to three factors f̂_CI,t = (β̂_1t, β̂_2t, β̂_3t), it computes the factors entirely from the yield curve information x_t, without accounting for the variable y_{t+h} to be forecast. Similar in spirit to CF-PC, we can improve on CI-NS by supervising the factor computation, which we term CF-NS.
The CF-NS forecast is based on the NS factors of $\hat{y}_{t+h} := (\hat{y}^{(1)}_{t+h}, \hat{y}^{(2)}_{t+h}, \ldots, \hat{y}^{(N)}_{t+h})'$, a vector of the N individual forecasts as in (10) and (11), with $\Lambda_{CF} = \Lambda_{CI}$ in (51). Hence, $\Lambda_{CI} = \Lambda_{CF} = \Lambda$ for the NS factor models. Note that, when the NS factor loadings are normalized to sum up to one, the three CF-NS factors
$$\hat{f}_{CF,t+h} = \Lambda' \hat{y}_{t+h} \qquad (55)$$
are weighted individual forecasts with the three normalized NS loadings as weights: writing $\lambda_{ij}$ for the $(i, j)$ element of the unnormalized loading matrix in (51), the j-th factor is $\sum_{i=1}^{N} w_{ij}\, \hat{y}^{(i)}_{t+h}$ with $w_{ij} = \lambda_{ij} / \sum_{l=1}^{N} \lambda_{lj}$ for $j = 1, 2, 3$. The CF-NS forecast can be obtained from the forecasting equation
$$\hat{y}^{\text{CF-NS}}_{T+h} = \hat{f}_{CF,T+h}'\, \hat{\alpha}_{CF},$$
which is denoted CF-NS(k = 3). The parameter vector $\hat{\alpha}_{CF}$ is estimated using information up to time T. Using only the first factor or the first two factors, one can obtain the forecasts CF-NS(k = 1) and CF-NS(k = 2). Note that, while the CF-PC method can be used for data of many kinds, the CF-NS method we propose is tailored to forecasting using the yield curve. It uses fixed factor loadings in Λ that are the NS exponential factor loadings for yield curve modeling, and hence avoids the estimation of factor loadings. In contrast, CF-PC needs to estimate Λ.
Also note that, by construction, CF-NS(k = 1) is the equally weighted combined forecast.
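The construction of the CF-NS factors can be illustrated with a short sketch. It takes the N individual forecasts as given (how they are produced in (10) and (11) is not reproduced here), normalizes each column of the NS loading matrix to sum to one, and forms the factors as weighted averages of the individual forecasts. All function and argument names are hypothetical.

```python
import numpy as np

THETA = 0.0609  # decay parameter, fixed as in the text


def normalized_ns_weights(maturities, theta=THETA):
    """N x 3 NS loadings with each column rescaled to sum to one."""
    tau = np.asarray(maturities, dtype=float)
    z = theta * tau
    slope = (1.0 - np.exp(-z)) / z
    loadings = np.column_stack([np.ones_like(tau), slope, slope - np.exp(-z)])
    return loadings / loadings.sum(axis=0)


def cf_ns_forecast(Y_hat, y_target, y_hat_last, maturities, k=3):
    """CF-NS(k) forecast from N individual forecasts.

    Y_hat      : (n, N) individual forecasts, row s paired with y_target[s]
    y_target   : (n,) realized values of the forecast target
    y_hat_last : (N,) individual forecasts for the period to be predicted
    """
    W = normalized_ns_weights(maturities)[:, :k]
    F = Y_hat @ W                                        # CF-NS factors, one row per period
    alpha = np.linalg.lstsq(F, y_target, rcond=None)[0]  # no intercept, as in the forecasting equation
    return float((y_hat_last @ W) @ alpha)
```

Since the level loadings are all equal, the first column of the normalized weights is 1/N for every maturity, so the first CF-NS factor is the simple average of the individual forecasts.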

Forecasting Output Growth and Inflation
This section presents the empirical analysis, where we describe the data, apply the forecasting methods introduced in the previous sections to output growth and inflation, and analyze out-of-sample forecasting performance. This allows us to analyze the differences between output growth and inflation forecasting using the same yield curve information and to compare the strengths of the different methods.

Data
Let y t+h denote the variable to be forecast (output growth or inflation) using yield information up to time t, where h denotes the forecast horizon.The predictor vector x t = (x t (τ 1 ), x t (τ 2 ), . . ., x t (τ N )) contains the information about the yield curve at various maturities: x t (τ i ) denotes the zero coupon yield of maturity τ i months at time t (i = 1, 2, . . ., N).
Two forecast targets, output growth and inflation, are constructed respectively as monthly growth rate of Personal Income (PI, seasonally adjusted annual rate) and monthly change in CPI (Consumer Price Index for all urban consumers: all items, seasonally adjusted) from 1970:01 to 2010:01.PI and CPI data are obtained from the web site of the Federal Reserve Bank of St. Louis (FRED2).
We apply the following data transformations. For the monthly growth rate of PI, we set y_{t+h} = 1200[(1/h) ln(PI_{t+h}/PI_t)] as the forecast target (as used in Ang et al. (2006)). For the consumer price index (CPI), we set y_{t+h} = 1200[(1/h) ln(CPI_{t+h}/CPI_t)] as the forecast target (as used in Stock and Watson (2007)). 9
9 y_{t+h} = 1200[(1/h) ln(CPI_{t+h}/CPI_t) − ln(CPI_t/CPI_{t−1})] is used in Bai and Ng (2008).
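As a small illustration of the transformation used for both targets, the sketch below computes the annualized average h-month log growth rate from a monthly level series. The function name and the synthetic series are assumptions for illustration; no actual PI or CPI data are used.

```python
import numpy as np


def annualized_growth(level, h):
    """Return y_{t+h} = 1200 * (1/h) * ln(level_{t+h} / level_t) for all available t.

    Entry s of the result is dated s + h in the original series.
    """
    level = np.asarray(level, dtype=float)
    return 1200.0 / h * np.log(level[h:] / level[:-h])


# Synthetic monthly series growing at a constant 0.2% per month (log scale)
level = 100.0 * np.exp(0.002 * np.arange(24))
print(annualized_growth(level, h=1)[:3])
print(annualized_growth(level, h=6)[:3])
```

Dividing by h before annualizing makes the targets comparable across horizons: a series with constant monthly growth has the same value of y_{t+h} for every h.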
Our yield curve data consist of U.S. government bond prices, coupon rates, and coupon structures, as well as issue and redemption dates from 1970:01 to 2009:12. 10 We calculate zero-coupon bond yields using the unsmoothed Fama and Bliss (1987) approach. We measure bond yields on the second day of each month. We also apply several data filters designed to enhance data quality and focus attention on maturities with good liquidity. First, we exclude floating rate bonds, callable bonds, and bonds extended beyond the original redemption date. Second, we exclude outlying bond prices less than 50 or greater than 130 because their price discounts or premia are too high and imply thin trading, and we exclude yields that differ greatly from yields at nearby maturities. Finally, we use only bonds with maturity greater than one month and less than fifteen years because other bonds are not actively traded. To simplify our subsequent estimation, we use linear interpolation to pool the bond yields into fixed maturities of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 72, 78, 84, 90, 96, 102, 108, and 120 months (N = 50 maturities in total), where a month is defined as 30.4375 days. 11
We examine some descriptive statistics (not reported for space) of the two forecast targets and of the yield curve level, slope, and curvature (empirical measures) over the full sample from 1970:01 to 2009:12 and over the out-of-sample evaluation period from 1995:02 to 2010:01. We observe that both PI growth and CPI inflation become more moderate and less volatile from around the mid-1980s. This has become a stylized fact known as the "Great Moderation". In particular, there is a substantial drop in the persistence of CPI inflation. The volatility and persistence of the yield curve slope and curvature do not change much. The yield curve level, however, decreases and stabilizes.
In predicting macroeconomic variables using the term structure, yield spreads between yields with various maturities and the short rate are commonly used in the literature. One possible reason for this practice is that yield levels are treated as I(1) processes, so yield spreads will likely be I(0). Similarly, macroeconomic variables are typically assumed to be I(1) and transformed properly into I(0), so that, in using yield spreads to forecast macro targets, issues such as spurious regression are avoided. In this paper, however, we use yield levels (not spreads) to predict PI growth and CPI inflation (not change in inflation), for the following reasons. First, whether yields and inflation are I(1) or I(0) is still arguable. Stock and Watson (1999, 2012) use yield spreads and treat inflation as I(1), so they forecast change in inflation. Inoue and Kilian (2008), however, treat inflation as I(0). Since our target is forecasting inflation, not change in inflation, we treat CPI inflation as well as yields as I(0) in our empirical analysis. Second, we emphasize real-time, out-of-sample forecasting performance more than in-sample concerns. As long as out-of-sample forecast performance is unaltered or even improved, we think the choice of treating the variables as I(1) or I(0) does not matter much. 12 Third, using yield levels allows us to provide clearer interpretations for questions such as which part of the yield curve contributes the most towards predicting PI growth or CPI inflation, and how the different parts of the yield curve interact in the prediction.

Out-of-Sample Forecasting
All forecasting models are estimated in a rolling window scheme with window size R = 300 months ending at month t (starting at t − R + 1). The out-of-sample evaluation period is from 1995:02 to 2010:01 (hence out-of-sample size P = 180). 13 In all NS-related methods (CI and CF), we set θ, the parameter that governs the exponential decay rate, at 0.0609 for reasons discussed in Diebold and Li (2006). 14 We compare h-months-ahead out-of-sample forecasting results of the methods introduced so far for h = 1, 3, 6, 12, 18, 24, 30, and 36 months ahead.
10 As a robustness check, we apply our method to the original yield data of Diebold and Li (2006) and also to the sub-samples in our data set. The results are essentially the same as those summarized at the end of Section 5.
11 It may be interesting to explore whether different maturity yields might have different effects on the forecast outcome. However, the present paper is focused on the comparison between CF and CI, rather than a detailed CI-only analysis, e.g., to find the best maturity yield for the forecast outcome. Nevertheless, our CI-NS model reflects such effects, as the three NS factors (level, slope, and curvature) are different combinations of bond maturities, as shown in Equation (55). The different coefficients on the NS factors suggest that different bond maturities have different effects on the forecast outcome, as Gogas et al. (2015) have found.
12 While not reported for space, we tried forecasting change in inflation and found that forecasting inflation directly using all yield levels improves the out-of-sample performance of most forecasting methods by a large margin.
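The rolling-window scheme can be sketched as follows. For brevity the sketch uses a single-predictor regression as a stand-in for any of the forecasting methods, and it assumes that, within each window, only pairs (x_s, y_{s+h}) whose target is already observed at the forecast origin are used for estimation. That alignment, like the function names, is an assumption of this sketch rather than a detail confirmed by the text.

```python
import numpy as np


def rolling_forecasts(y, x, h, R):
    """Rolling-window h-step-ahead forecasts from a one-predictor regression.

    y, x : (T,) arrays, y[t] is the target dated t and x[t] the predictor at t
    R    : window size; the window ending at `end` covers end - R + 1, ..., end
    """
    forecasts, actuals = [], []
    T = len(y)
    for end in range(R - 1, T - h):                  # `end` is the forecast origin t
        start = end - R + 1
        xs = x[start:end - h + 1]                    # predictors x_s, s = start, ..., end - h
        ys = y[start + h:end + 1]                    # matching targets y_{s+h}
        Z = np.column_stack([np.ones(len(xs)), xs])
        coef = np.linalg.lstsq(Z, ys, rcond=None)[0]
        forecasts.append(coef[0] + coef[1] * x[end])  # forecast of y_{t+h}
        actuals.append(y[end + h])
    return np.array(forecasts), np.array(actuals)


def rmsfe(forecasts, actuals):
    """Root mean squared forecast error over the evaluation period."""
    return float(np.sqrt(np.mean((actuals - forecasts) ** 2)))
```

With monthly data, R = 300 and an evaluation period of 1995:02 to 2010:01 correspond to P = 180 forecast-target pairs per horizon.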
Figure 5 illustrates what economic content the factors in CF-PC may bear. It shows that the first PC assigns about equal weights to all N = 50 individual forecasts that use yields at various maturities (in months), so that it may be interpreted as the factor that captures the level of the yield curve; the second PC assigns roughly increasing weights, so that it may be interpreted as the factor capturing the slope; and the third PC assigns weights that roughly first decrease and then increase, so that it may be interpreted as the factor capturing curvature.
Tables 1 and 2 present the root mean squared forecast errors (RMSFE) of PC methods with k = 1, 2, 3, 4, 5 and of NS methods with k = 1, 2, 3, for PI growth (Table 1A) and for CPI inflation (Table 2A) forecasts using all 50 yield levels. 15 In Panel A of Tables 1 and 2, we report the RMSFE, which is the square root of the MSFE of a model. 16 In Panel B of Tables 1 and 2, we report the relative supervision of CI-PC vs. CF-PC and of CI-NS vs. CF-NS according to Definition 3. We find that, in general, supervised factorization performs better. The CF schemes (CF-PC and CF-NS) perform substantially better than the CI schemes (CI-PC and CI-NS). Within the same CF or CI scheme, the two alternative factorizations work similarly: CF-PC and CF-NS are about the same, and CI-PC and CI-NS are about the same. We summarize our findings from Figure 5 and Tables 1 and 2 as follows.
4. We often get the best supervised predictions with a single factor (k = 1) with the CF-factor models. 18 Since CF-NS(k = 1) is the equally weighted combined forecast, as noted in Section 4.2.2, this is another case of the forecast combination puzzle discussed in Remark 3, namely that the equal-weighted forecast combination is hard to beat. Since CF-PC(k = 1) is numerically identical to CF-NS(k = 1), as shown in Figure 5, CF-PC(k = 1) is also effectively equally weighted forecast averaging. 19
18 Figlewski and Urich (1983) discussed various constrained models for forming a combination of forecasts and examined when more than the simple averaging combined forecast is needed. They gave a sufficient condition under which the simple average of forecasts is the optimal forecast combination: "Under the most extensive set of constraints, forecast errors are assumed to have zero mean and to be independent and identically distributed. In this case the optimal forecast is the simple average." This corresponds to CF-PC(k = 1) and CF-NS(k = 1) when the first factor (k = 1) in PC or NS is sufficient for the CF factor model. It is clearly the case in CF-NS, as shown in Equation (55). One can show that the first PC (corresponding to the largest singular value) would also be the simple average. Hence, in terms of the CF-factor model, the forecast combination puzzle amounts to the fact that we often do not need the second PC factor. Interestingly, Figlewski and Urich (1983, p. 696) continued to note the cases when the simple average is not optimal: "However, the hypothesis of independence among forecast errors is overwhelmingly rejected for our data - errors are highly positively correlated with one another." On the other hand, they also noted other reasons why the simple average may still be preferred, as they wrote, "Because the estimated error structure was not completely stable over time, the models which adjusted for correlation did not achieve lower mean squared forecast error than the simple average in out-of-sample tests. Even so, we find...that forecasts from these models, while less accurate than the simple mean, do contain information which is not fully reflected in prices in the money market, and is therefore economically valuable." We thank a referee for pointing us to Figlewski and Urich (1983).
19 While the simple equally weighted forecast combination can be implemented without the use of PCA and without reference to the NS model, it is important to note that the simple average combined forecast indeed corresponds to the first CF-PC factor (CF-PC(k = 1)) or the first CF-NS factor (CF-NS(k = 1)). In view of Figlewski and Urich (1983), it is useful to know when the first factor (k = 1) is enough, so that the simple average is good, and when the higher-order factors (k > 1) may be necessary because they contain information in addition to the first CF factor. This is important in understanding the forecast combination puzzle, which is about whether to include only the first CF factor or more.
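The statement that the first PC would also be the simple average can be checked numerically in a stylized case. The sketch below assumes that the N individual forecasts have a common variance and a common pairwise correlation ρ (an equicorrelation structure, which is our simplifying assumption, with hypothetical values for N and ρ) and shows that the eigenvector associated with the largest eigenvalue then puts equal weight on every forecast.

```python
import numpy as np

N, rho = 50, 0.8   # hypothetical number of forecasts and common correlation

# Equicorrelated second-moment matrix of the individual forecasts (assumption)
Sigma = (1.0 - rho) * np.eye(N) + rho * np.ones((N, N))

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
first_pc = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
weights = first_pc / first_pc.sum()        # rescale so that the weights sum to one

print(eigvals[-1])                         # largest eigenvalue, 1 + (N - 1) * rho
print(weights.min(), weights.max())        # both equal 1 / N up to rounding
```

When the forecasts are only approximately equicorrelated, as in Figure 5, the first PC weights are approximately rather than exactly equal.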

Conclusions
For forecasting in the presence of many predictors, it is often useful to reduce the dimension by a factor model (in a dense case) or by variable selection (in a sparse case). In this paper, we consider a factor model. In particular, we examine the supervised principal component analysis of Chan et al. (1999). The model is called CF-PC, as the principal components of many forecasts are the combined forecasts.
The CF-PC extracts factors from the space spanned by forecasts rather than from the space spanned by predictors. This factorization of the forecasts improves forecast performance compared to factor analysis of the predictors. We extend CF-PC to CF-NS, which uses the NS factor model in place of the PC factor model, for the application where the predictors are the yield curve. While the yield curve is functional data consisting of many different maturity points on a curve at each time, the NS factors can parsimoniously capture the shapes of the curve.
We have applied the CF-PC and CF-NS models in forecasting output growth and inflation using a large number of bond yields to examine whether the supervised factorization improves forecast performance. In general, we have found that CF-PC and CF-NS perform substantially better than CI-PC and CI-NS, that the advantage of supervised factor models is even larger for longer forecast horizons, and that the two alternative factor models based on PC and NS factors perform similarly.

Figure 1. For Example 2. Monte Carlo averages of the sum of squared errors (SSE) against a grid of standard deviations σ_u = σ_v ranging from 0.01 to 3 in the factor and forecast equations, for a selection of k = 1 to k = 4 components. When the standard deviation is close to zero, the SSEs are close to the ones reported in Example 1. With increasing noise, the advantage of CF over CI decreases but remains substantial, in particular for few components. For k = 5 = N (not shown), the SSEs of CI-PC and CF-PC coincide, as shown in Remark 4.

Figure 2: If the number of estimated factors k is below the true number r = 3, as shown in the top panels, the supervision becomes smaller with increasing noise.
Figure 4: If the correct number of factors or more are estimated (k ≥ r), the advantage of supervision decreases with factor persistence φ. High persistence induces spurious contemporaneous correlation, and in this sense the situation is related to the result in No. 2. If the number of estimated factors is below the true number of factors (k < r), however, the advantage of supervision increases with factor persistence.

Figure 2. Supervision dependent on noise. Relative supervision against a grid of standard deviations σ_u = σ_v in the factor and forecast equations, ranging from 0.01 to 3, while the factor serial correlation is fixed at φ = 0 and the contemporaneous factor correlation at ρ = 0.
In Panel B of Tables 1 and 2, we report the relative supervision of CI-PC vs. CF-PC and of CI-NS vs. CF-NS, according to Definition 3, which is the ratio of the MSFEs of the CI and CF models. The relative supervision in Panel B can be obtained from the RMSFEs in Panel A. For simplicity of presentation in Panel B, we present the relative supervision only with the same number of factors in the CI and CF models (k_CI = k_CF, for both the PC and the NS models).
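The link between the two panels is a one-line calculation. Definition 3 is not reproduced in this section, so the sketch below assumes that relative supervision is the MSFE of the CI model divided by the MSFE of the CF model, which makes values above one favor the supervised model. The numbers are hypothetical and are not taken from Tables 1 and 2.

```python
def relative_supervision(rmsfe_ci, rmsfe_cf):
    """Ratio of MSFEs implied by two RMSFEs (CI over CF; the direction is an assumption)."""
    return (rmsfe_ci / rmsfe_cf) ** 2


# Hypothetical RMSFEs, for illustration only
print(relative_supervision(3.0, 2.0))   # 2.25
```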

Table 1. Out-of-sample forecasting of personal income growth.
The forecast target is output growth, y_{t+h} = 1200 × log(PI_{t+h}/PI_t) ÷ h. The out-of-sample forecasting period is 02/1995-01/2010. Panel A reports the root mean squared forecast errors (the square root of the MSFE of a model). Panel B reports the relative supervision of CI-PC vs. CF-PC and of CI-NS vs. CF-NS, according to Definition 3, which is the ratio of the MSFEs of the two models. For simplicity of presentation, Panel B presents the relative supervision only with the same number of factors in the CI and CF models (k_CI = k_CF = k, for both the PC and the NS models).

Table 2. Out-of-sample forecasting of CPI inflation.
Panel B. Relative supervision s_rel(X, y, k_CI, k_CF)
The forecast target is inflation, y_{t+h} = 1200 × log(CPI_{t+h}/CPI_t) ÷ h. The out-of-sample forecasting period is 02/1995-01/2010. Panel A reports the root mean squared forecast errors (the square root of the MSFE of a model). Panel B reports the relative supervision of CI-PC vs. CF-PC and of CI-NS vs. CF-NS, according to Definition 3, which is the ratio of the MSFEs of the two models. For simplicity of presentation, Panel B presents the relative supervision only with the same number of factors in the CI and CF models (k_CI = k_CF = k, for both the PC and the NS models).