An Alternative Estimation Method for Time-Varying Parameter Models

Abstract: A multivariate, non-Bayesian, regression-based, feasible generalized least squares (GLS)-based approach is proposed to estimate time-varying VAR parameter models. Although it has long been known that the Kalman-smoothed estimate can be obtained by GLS for univariate models, we assess the accuracy of the feasible GLS estimator compared with commonly used Bayesian estimators. Unlike the maximum likelihood estimator often used together with the Kalman filter, the possibility of the pile-up problem occurring is shown to be negligible. In addition, this approach enables us to deal with stochastic volatility models, models with a time-dependent variance-covariance matrix, and models with non-Gaussian errors that allow for abrupt changes or structural breaks in the time-varying parameters.


Introduction
Most macroeconomists recognize that time-varying parameter models are sufficiently flexible to capture the complex nature of a macroeconomic system, thereby fitting the data better than models with constant parameters. The instability of the parameters in econometric models has often been incorporated in Markov-switching models (e.g., Hamilton 1989) and structural change models (e.g., Perron 1989). However, time-varying models allow the parameters to change gradually over time, which is the main difference between time-varying models and Markov-switching or structural break models.
In the literature on the application of time-varying vector autoregressive (TV-VAR) models in macroeconomics, Bernanke and Mihov (1998) consider that the autoregressive parameters may be time-varying. However, after confirming the stability of the parameters using the parameter consistency test of Hansen (1992), they employ the time-invariant (i.e., usual) VAR model. Indeed, Cogley and Sargent (2005) find that Hansen's (1992) test has low power and is unreliable and instead propose a TV-VAR model with stochastic volatility in the error term. Primiceri (2005) sheds light on a technical aspect of the time-varying model, namely, the Bayesian estimation technique used for the time-varying parameters. In general, difficulties in dealing with time-varying parameter models arise when free parameters and unobserved variables need to be estimated. Primiceri (2005) thus presents a clear estimation procedure based on the Bayesian Markov Chain Monte Carlo (MCMC) method.
Several studies, including Primiceri (2005), claim that the Bayesian method is preferred to the maximum likelihood (ML) method because the former (i) is less likely to suffer from the so-called pile-up problem (Sargan and Bhargava 1983); (ii) is less likely to have computational problems such as a degenerate likelihood function or multiple local minima; and (iii) facilitates statistical inference, such as obtaining standard errors. However, both the Bayesian and the ML methods require Kalman filtering to estimate an unobservable state vector that includes the time-varying parameters. Duncan and Horn (1972), Sant (1977), Maddala and Kim (1998), and, more recently, Chan and Jeliazkov (2009) and Durbin and Koopman (2012) have attempted to understand Kalman filtering through the lens of conventional regression models. To the best of our knowledge, Duncan and Horn (1972) are the first to show that the generalized least squares (GLS) estimator for basic state-space models equivalently uncovers the unobserved state vector estimated through Kalman filtering. Sant (1977) proves the equivalence between the GLS estimator and the Kalman-smoothed estimator. From a practical perspective, the series of papers by Ito et al. (2014, 2016, 2021) apply the TV-VAR, time-varying autoregressive (TV-AR), and time-varying vector error correction (TV-VEC) models to stock prices and exchange rates using the regression method as opposed to the Kalman filter.
In this study, in the spirit of Duncan and Horn (1972) and Sant (1977), we propose a multivariate, regression-based or GLS-based approach that utilizes ordinary least squares (OLS) or GLS in lieu of the Kalman smoother. More precisely, our GLS-based approach includes OLS as a special case of GLS. It is employed by Ito et al. (2014, 2016, 2021) to evaluate market efficiency in stock markets and foreign exchange markets. Historically, the main purpose of employing the Kalman filter (or smoother) was to avoid the system of large matrices required by GLS, at least until computers became capable of handling such matrices.
The equivalence between GLS and the Kalman smoother leads us to the following question: if GLS yields Kalman-smoothed estimates, to what extent can the GLS-based approach recover time-varying parameters? This question is practical and important because the finite sample properties of the GLS estimator are, generally, unknown. Another question pertains to the seriousness of the pile-up problem, which occurs when the ML estimate of the variance of the state equation error is zero, even though its true value is non-zero (but small). While our proposed method is not identical to ML because we do not maximize the likelihood function with respect to the variances of the errors, whether our GLS-based approach suffers from the pile-up problem to the same degree as ML is not immediately obvious.
We also consider the possibility of non-independent and identically distributed (i.i.d.) or non-Gaussian errors in the model. The former are frequently used in this field because it is reasonable to assume that the variance of the errors may be time-varying. The latter are important in empirical studies because they allow us to model abrupt changes or structural breaks in time-varying parameters similar to the strategy employed by Perron and Wada (2009) among others.
To summarize, the contributions of this study are as follows. GLS estimates the true time-varying parameters fairly well even when the errors are not i.i.d. or not Gaussian, provided an appropriate way to implement feasible GLS (FGLS) is carefully chosen based primarily on the relative size of the variances of the errors, or signal-to-noise ratio (SNR). The pile-up problem that is often cumbersome for ML is shown to be negligible. In addition, our GLS outperforms the commonly used Bayesian estimation method in recovering the time-varying parameter values.
The rest of this paper is organized as follows. Section 2 presents our model together with its likelihood function. We explain the GLS-based approach for the class of TV-AR models and the steps used to implement FGLS in Section 3. Section 4 evaluates the GLS-based approach under a variety of conditions such as a small SNR, non-i.i.d. errors, and non-Gaussian errors.
An application to macroeconomic data, including a comparison with the Bayesian MCMC method, is demonstrated in Section 5. Section 6 concludes.

Model
Our model covers the class of TV-AR models and permits two different matrix forms. The first matrix form is that of Durbin and Koopman (2012), which they use as a device to find the Kalman-smoothed estimate of an unobserved state vector. The second matrix form is an extended version of Maddala and Kim (1998), which we employ in this study. As will become clear, this form allows us to use GLS to estimate the time-varying parameters. We can then formally demonstrate that the Kalman-smoothed estimate of the first matrix form is equivalent to the GLS estimate of the second matrix form, showing that GLS is an alternative estimation method to the Kalman smoother.

Basic State-Space Model of the Class of TV-AR Models
Our model is given by:

$$y_t = Z_t \beta_t + \varepsilon_t, \qquad (1)$$

$$\beta_t = \beta_{t-1} + \eta_t, \qquad (2)$$

where y_t is a k × 1 vector of observable variables; Z_t is a k × m matrix of observable variables; β_t is an m × 1 vector of time-varying parameters; and ε_t and η_t are k × 1 and m × 1 vectors of normally distributed error terms with variance-covariance matrices H_t and Q_t, respectively:

$$\varepsilon_t \sim N(0, H_t), \qquad \eta_t \sim N(0, Q_t).$$

The variance-covariance matrices H_t and Q_t are allowed to be time-dependent, as in the stochastic volatility model. For the initial value of β_t, we assume

$$\beta_0 \sim N(b_0, P_0).$$

Unless b_0 and P_0 are known, it is reasonable to use a diffuse prior for P_0 because β_t follows a non-stationary process. In this case, the diagonal elements of P_0 should be large numbers (e.g., Harvey 1989; Koopman 1997). Alternatively, we can simply set P_0 = 0 when we assume β_0 is known and not stochastic.
Equations (1) and (2) can be used for a variety of TV-AR models. For example, when k = 1, Z t = y t−1 yields a TV-AR(1) model. Similarly, the TV-VAR(1) model y t = A t y t−1 + ε t with A t = A t−1 + η t is expressed by setting Z t = y t−1 ⊗ I k and β t = vec(A t ). It is also possible to include intercepts that vary over time. For a TV-AR(1) model, for example, one can set Z t = (1, y t−1 ), meaning that the first element of β t is the time-varying intercept.
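To make this mapping concrete, the sketch below (an illustrative Python/NumPy implementation of our own, not code from the paper) simulates a small TV-VAR(1) model, building Z_t = y_{t-1}' ⊗ I_k and β_t = vec(A_t) at each date:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_tv_var1(T, k, q_sd=0.03, h_sd=0.2):
    """Simulate y_t = A_t y_{t-1} + eps_t, where beta_t = vec(A_t) follows a
    random walk; in state-space form, Z_t = y_{t-1}' (kron) I_k."""
    beta = 0.1 * np.eye(k).flatten()          # beta_0 = vec(A_0)
    y = np.zeros(k)                           # y_0
    ys, Zs, betas = [], [], []
    for _ in range(T):
        beta = beta + q_sd * rng.standard_normal(k * k)   # state equation (2)
        Z = np.kron(y[None, :], np.eye(k))                # Z_t = y_{t-1}' (kron) I_k
        y = Z @ beta + h_sd * rng.standard_normal(k)      # observation equation (1)
        ys.append(y); Zs.append(Z); betas.append(beta)
    return np.array(ys), np.array(Zs), np.array(betas)

y, Z, beta = simulate_tv_var1(T=100, k=2)
print(y.shape, Z.shape, beta.shape)   # (100, 2) (100, 2, 4) (100, 4)
```

The identity A_t y_{t-1} = (y_{t-1}' ⊗ I_k) vec(A_t) is what turns the VAR coefficient matrix into the state vector of Equations (1) and (2).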
Below, we present two specifications of our model, (1) and (2). The first specification allows us to derive the Kalman-smoothed estimate as explained by Durbin and Koopman (2012). The second specification is in the spirit of Duncan and Horn (1972), leading us to the GLS-based approach. As we will see, both specifications yield the same smoothed estimate.

Matrix Formulation of the State-Space Model
Following Durbin and Koopman (2012), we employ the matrix formulation of Equations (1) and (2). For t = 1, . . . , T, we have the system of equations:

$$Y_T = Z\beta + \varepsilon, \qquad (3)$$

$$\beta = C(b_0^* + \eta), \qquad (4)$$

where Y_T = (y_1', . . . , y_T')', β = (β_1', . . . , β_T')', ε = (ε_1', . . . , ε_T')', η = (η_1', . . . , η_T')', Z = diag(Z_1, . . . , Z_T), b_0^* = (b_0', 0', . . . , 0')' is the mT × 1 vector containing the initial value in its first block, H = diag(H_1, . . . , H_T), Q = diag(Q_1, . . . , Q_T), and C is the mT × mT block matrix whose lower-triangular blocks are all I_m, so that Cη accumulates the state shocks and E(β) = Cb_0^*. Unlike a more general state-space model in which Equation (2) has a transition matrix that includes unknown parameters to be estimated, the matrix formulation of the time-varying parameter model is largely simplified. For example, matrix C is often called the random walk generating matrix (e.g., Tanaka 2017), which is non-singular, and there are no free parameters to be estimated in the matrix. In addition, if H_t and Q_t are time invariant (i.e., if no GARCH effects or stochastic volatility exists in the model), matrices H and Q simplify substantially.
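The stacked matrices are straightforward to build directly. The following sketch (our own illustrative Python; the block structure is an assumption matching the notation of the matrix formulation) constructs Z, C, b*_0, H, and Q:

```python
import numpy as np

def stack_system(Z_list, b0, H_t, Q_t):
    """Build the matrix formulation: Y_T = Z beta + eps, beta = C(b0* + eta).
    C is the random walk generating matrix: lower-triangular ones (kron) I_m."""
    T = len(Z_list)
    k, m = Z_list[0].shape
    Z = np.zeros((T * k, T * m))
    for t, Zt in enumerate(Z_list):               # block-diagonal observation matrix
        Z[t * k:(t + 1) * k, t * m:(t + 1) * m] = Zt
    C = np.kron(np.tril(np.ones((T, T))), np.eye(m))       # cumulative-sum operator
    b0_star = np.concatenate([b0, np.zeros((T - 1) * m)])  # (b0', 0', ..., 0')'
    H = np.kron(np.eye(T), H_t)                   # time-invariant variance blocks
    Q = np.kron(np.eye(T), Q_t)
    return Z, C, b0_star, H, Q

# C accumulates shocks into a random walk: with m = 1, C @ [1, 2, 3] = [1, 3, 6]
Z, C, b0s, H, Q = stack_system([np.ones((1, 1))] * 3, np.zeros(1),
                               np.eye(1), np.eye(1))
print(C @ np.array([1.0, 2.0, 3.0]))   # [1. 3. 6.]
```

Note that C is non-singular by construction: its inverse is the first-difference operator, which is what makes the regression form used later well defined.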
For simplicity, we assume b_0 is known and non-stochastic; hence, P_0 = 0.

Likelihood Function
Since we assume that ε and η are normally distributed, our matrix formulation of (1) and (2) allows us to write the log-likelihood function for Y_T, given the covariance matrices of the errors (H and Q) and the initial value vector (b_0^*), as

$$\log L(Y_T \mid H, Q, b_0^*) = -\frac{kT}{2}\log 2\pi - \frac{1}{2}\log\lvert\Omega\rvert - \frac{1}{2}\,(Y_T - ZCb_0^*)'\,\Omega^{-1}\,(Y_T - ZCb_0^*), \qquad (5)$$

where

$$\Omega = H + ZCQC'Z'.$$

Interestingly, provided that H, Q, and b_0^* are known, the likelihood function does not involve our main parameter vector of interest, β.
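This likelihood can be evaluated in a few lines. The sketch below (our own Python, assuming stacked matrices Y, Z, C, b*_0, H, Q as in the matrix formulation) computes the Gaussian log-likelihood of Y_T with mean ZCb*_0 and variance Ω = H + ZCQC'Z':

```python
import numpy as np

def loglik(Y, Z, C, b0_star, H, Q):
    """Gaussian log-likelihood of Y_T ~ N(Z C b0*, Omega), Omega = H + Z C Q C' Z'."""
    CQC = C @ Q @ C.T
    Omega = H + Z @ CQC @ Z.T
    r = Y - Z @ (C @ b0_star)                 # demeaned observations
    _, logdet = np.linalg.slogdet(Omega)      # numerically stable log-determinant
    return -0.5 * (len(Y) * np.log(2 * np.pi) + logdet
                   + r @ np.linalg.solve(Omega, r))
```

As the text notes, β itself does not appear: given H, Q, and b*_0, the likelihood is fully determined.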

Regression Lemma and Kalman Smoothing
Before showing the equivalence of our estimator and the Kalman smoother, let us clarify the outcomes of the Kalman smoother when the model is described by Equations (1) and (2). According to Durbin and Koopman (2012), the Kalman-smoothed state of β is given by the expectation of β conditional on the information in all the observations of y_t:

$$E(\beta \mid Y_T) = E(\beta) + \mathrm{Cov}(\beta, Y_T)\,\mathrm{Var}(Y_T)^{-1}\,\bigl(Y_T - E(Y_T)\bigr). \qquad (6)$$

Note that we assume normal errors to derive Equation (6). The variance of β, given all the observations Y_T, is then

$$\mathrm{Var}(\beta \mid Y_T) = \mathrm{Var}(\beta) - \mathrm{Cov}(\beta, Y_T)\,\mathrm{Var}(Y_T)^{-1}\,\mathrm{Cov}(Y_T, \beta). \qquad (7)$$

The Kalman-smoothed estimate and its mean squared error (MSE) are given by (6) and (7), respectively. Durbin and Koopman (2012) call these equations the regression lemma, which derives the mean and variance of the distribution of β conditional on Y_T, assuming the joint distribution of β and Y_T is multivariate normal. It follows that for (3) and (4), the Kalman-smoothed estimate is

$$\hat\beta = Cb_0^* + CQC'Z'\,\Omega^{-1}\,(Y_T - ZCb_0^*), \qquad (8)$$

and the conditional variance (or MSE) of the smoothed estimate is

$$\mathrm{Var}(\beta \mid Y_T) = CQC' - CQC'Z'\,\Omega^{-1}\,ZCQC', \qquad (9)$$

where Ω = H + ZCQC'Z'. Equations (8) and (9) are obtained by noting that Cov(β, Y_T) = Var(β)Z' = CQC'Z' and Var(Y_T) = ZCQC'Z' + H = Ω, and by substituting these into Equations (6) and (7). It is well known that (8) is the minimum-variance linear unbiased estimate of β, given Y_T, even when we do not assume the errors are normally distributed (e.g., Durbin and Koopman 2012).
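In code, the smoothed estimate and its MSE are a few lines of linear algebra. The sketch below (illustrative Python of our own, consistent with the stacked notation) implements the regression-lemma formulas:

```python
import numpy as np

def regression_lemma_smoother(Y, Z, C, b0_star, H, Q):
    """Kalman-smoothed estimate and MSE via the regression lemma:
    beta_hat = C b0* + CQC'Z' Omega^{-1} (Y - Z C b0*), Omega = H + ZCQC'Z'."""
    CQC = C @ Q @ C.T                           # Var(beta)
    Omega = H + Z @ CQC @ Z.T                   # Var(Y_T)
    gain = CQC @ Z.T @ np.linalg.inv(Omega)
    mean_beta = C @ b0_star                     # E(beta)
    beta_hat = mean_beta + gain @ (Y - Z @ mean_beta)
    mse = CQC - gain @ Z @ CQC
    return beta_hat, mse
```

A quick sanity check on the formula: as the observation noise H shrinks to zero (with Z invertible), the gain approaches the identity mapping and the smoothed estimate collapses onto the data, as it should.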

Equivalence of the GLS-Based Estimator and Kalman Smoother
Equations (3) and (4) can be written in another matrix form to apply conventional regression analysis:

$$\begin{pmatrix} Y_T \\ -b_0^* \end{pmatrix} = \begin{pmatrix} Z \\ -C^{-1} \end{pmatrix}\beta + \begin{pmatrix} \varepsilon \\ \eta \end{pmatrix}, \qquad (10)$$

where the stacked error vector has variance-covariance matrix diag(H, Q). This specification is similar to those of Duncan and Horn (1972) and Maddala and Kim (1998). The main difference between our specification and that of Duncan and Horn (1972) is that the former applies to a time-varying parameter model, while the latter is for a more general state-space model, which allows the transition equation to have a transition matrix F (i.e., when Equation (2) is β_t = Fβ_{t-1} + η_t). Since we do not need to estimate the transition matrix, our regressors in Equation (10) are all known. As mentioned by Duncan and Horn (1972), the confusion around the similarities and differences between Kalman filtering (including smoothing) and the conventional regression model stems from the fact that the former is the expectation of β conditional on the information in Y_T, which is the linear projection of β onto the space spanned by Y_T (provided that the errors are normally distributed). By contrast, the latter is a linear projection of the dependent variable onto the space spanned by the regressor, that is, the projection of the left-hand side of Equation (10) onto the space spanned by the stacked matrix (Z', −(C^{-1})')'. However, Duncan and Horn (1972) show that GLS for (10) using observations up to time t yields the Kalman-filtered estimate.
As shown by Sant (1977), a natural extension of Duncan and Horn (1972) is that we obtain the Kalman-smoothed estimate of β when GLS is applied to all the observations, Y_T. We summarize the equivalence of the GLS estimator of model (10) with the Kalman-smoothed estimator (8) and its MSE matrix (9) in Appendix A. We further establish the equivalence when the time-varying model has time-invariant intercepts in Appendix B. With the likelihood function (5), GLS yields the ML estimator for the time-invariant intercepts and the Kalman-smoothed estimator for the rest (the time-varying coefficients).
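The equivalence is easy to verify numerically. The sketch below (illustrative Python; a univariate example of our own construction) runs GLS on the stacked regression (10) and compares it with the regression-lemma / Kalman-smoother formula (8):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 40
Zd = rng.standard_normal(T)                  # scalar regressor for each t (k = m = 1)
Z = np.diag(Zd)
C = np.tril(np.ones((T, T)))                 # random walk generating matrix
b0_star = np.zeros(T)                        # known initial value b_0 = 0
H = 0.2**2 * np.eye(T)
Q = 0.03**2 * np.eye(T)

beta_true = C @ (b0_star + 0.03 * rng.standard_normal(T))
Y = Z @ beta_true + 0.2 * rng.standard_normal(T)

# GLS on the stacked regression (10): [Y; -b0*] = [Z; -C^{-1}] beta + (eps', eta')'
X = np.vstack([Z, -np.linalg.inv(C)])
lhs = np.concatenate([Y, -b0_star])
Vi = np.linalg.inv(np.block([[H, np.zeros((T, T))],
                             [np.zeros((T, T)), Q]]))
beta_gls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ lhs)

# Regression-lemma / Kalman-smoother formula (8)
CQC = C @ Q @ C.T
Omega = H + Z @ CQC @ Z.T
beta_sm = C @ b0_star + CQC @ Z.T @ np.linalg.solve(Omega, Y - Z @ C @ b0_star)

print(np.allclose(beta_gls, beta_sm))        # True
```

Analytically, the agreement follows from the push-through identity (Z'H⁻¹Z + (CQC')⁻¹)⁻¹Z'H⁻¹ = CQC'Z'(H + ZCQC'Z')⁻¹, so the two routes to the smoothed estimate coincide up to floating-point error.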

GLS in Practice
As shown in the previous subsections, when the variance-covariance matrices of the errors (H and Q) are known, the GLS estimator of β is identical to the Kalman-smoothed estimate. However, in practice, these variance-covariance matrices are generally unknown. The following two-step approach is often used to obtain the feasible GLS (FGLS) estimator. First, β is estimated by OLS, and the OLS residuals are used to estimate H and Q, denoted Ĥ and Q̂, respectively. In the second step, FGLS is applied to our model, treating Ĥ and Q̂ as the variance-covariance matrices of ε and η, respectively.
However, FGLS suffers from two problems. First, Ĥ and Q̂ may involve too many unknown parameters. For example, when a TV-VAR(p) model has many variables (i.e., when k is large), H consists of (T − p) matrices of dimension k × k, and Q consists of the same number of k(kp + 1) × k(kp + 1) matrices. The second problem is possible heteroskedasticity. Suppose that ε is much greater in magnitude than η; more precisely, the average trace of H is much larger than the average trace of Q. Then our GLS-based approach faces heteroskedasticity in regression Equation (10), potentially causing an imprecise estimate of β. This concern is largely mitigated when the average traces of H and Q are similar.
To solve these two problems, we propose the following FGLS procedure.

• Step 1 (OLS). We estimate model (10) by OLS and obtain the OLS estimate of β, β̂^O. From the OLS residuals ε̂_t and η̂_t, we construct the first-step estimates of H_t and Q_t:

$$\hat H_t = \hat\varepsilon_t \hat\varepsilon_t', \qquad \hat Q_t = \hat\eta_t \hat\eta_t'.$$

Then, to construct the estimates of H and Q, denoted Ĥ^O and Q̂^O, respectively, we set Ĥ_{p+1} = Ĥ_{p+2} = · · · = Ĥ_T and Q̂_{p+1} = Q̂_{p+2} = · · · = Q̂_T, assuming that the variances of ε and η are time-invariant (in effect, each is replaced by its sample average). This assumption is undesirable because a number of studies of TV-VAR models have focused on stochastic volatility models, which require Q_t ≠ Q_{t+1}, for example. However, thanks to this assumption, Ĥ and Q̂ are always invertible, and their inverses are readily computed. The simulations in the next section reveal how severely this assumption affects our estimation when stochastic volatility is present. With Ĥ^O and Q̂^O, the log-likelihood is computed by (A6) or (5).
• Step 2 (1FGLS). Given Ĥ^O and Q̂^O, we apply FGLS to obtain β̂^{G1}, the FGLS or 1FGLS estimate of β. We also compute the estimates of H and Q, denoted Ĥ^{G1} and Q̂^{G1}, respectively, in the same way as we computed Ĥ^O and Q̂^O in the first step. Then, the value of the log-likelihood function is computed.
• Step 3 (2FGLS). We repeat Step 2, computing β̂^{G2}, the (second-round) FGLS or 2FGLS estimate of β. More precisely, we use the FGLS residuals from Step 2 to construct Ĥ_t and Q̂_t and then obtain β̂^{G2}. Then, the value of the log-likelihood function is computed. If the likelihood ratio (from OLS to 1FGLS or from 1FGLS to 2FGLS) cannot be computed or is extraordinarily large (e.g., greater than 1e+10), we disregard the 1FGLS and 2FGLS estimators, because both outcomes indicate that the variance-covariance matrix is not precisely estimated (degenerate). In such a case, we record only OLS. In addition, we define 2FGLS' as GLS using ε̂ = Y_T − Zβ̂^{G1} and η̂ = C^{-1}β̂^{G1} − b*_0 in place of the Step 2 residuals to compute Ĥ_t and Q̂_t, where β̂^{G1} is the 1FGLS estimate.
The reason we consider the 2FGLS' estimate is that it is expected to ameliorate the effects of a poorly estimated β̂^{G1}, which may arise from misspecified Ĥ and Q̂. When those matrices are not correctly estimated, β̂^{G1} may be far from its true value; hence, the residuals computed from β̂^{G1} should not be used for further FGLS, because the repeated use of the wrong variance-covariance matrices may make the estimator worse. In such a case, it makes sense to obtain the 2FGLS' estimate, as it does not repeat the same type of misspecification. In summary, our procedure is based on the assumptions that the error terms have time-invariant variances and that the heteroskedasticity arising from the different sizes of H and Q can be handled by the repeated use of FGLS. To validate these assumptions and the procedure, we investigate the degree to which our procedure precisely estimates the true β using simulations.
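For concreteness, the OLS/1FGLS/2FGLS iteration can be sketched as follows (illustrative Python for the univariate case k = m = 1; function and variable names are our own, and the time-invariant variance estimates follow the Step 1 assumption):

```python
import numpy as np

def fgls_iterations(Y, Zd, C, b0_star, n_rounds=2):
    """Round 0 = OLS (Step 1); round r > 0 = r-th FGLS (Steps 2-3). H and Q are
    re-estimated from the previous round's residuals as time-invariant variances
    (each H_t and Q_t replaced by the residual sample average)."""
    T = len(Y)
    Ci = np.linalg.inv(C)
    X = np.vstack([np.diag(Zd), -Ci])            # stacked regressor of Eq. (10)
    lhs = np.concatenate([Y, -b0_star])
    V = np.eye(2 * T)                            # identity weights => OLS
    estimates = []
    for _ in range(n_rounds + 1):
        Vi = np.linalg.inv(V)
        beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ lhs)
        estimates.append(beta)
        eps = Y - Zd * beta                      # observation-equation residuals
        eta = Ci @ beta - b0_star                # state-equation residuals
        h, q = np.mean(eps**2), np.mean(eta**2)  # time-invariant variance estimates
        V = np.diag(np.concatenate([np.full(T, h), np.full(T, q)]))
    return estimates                             # [OLS, 1FGLS, 2FGLS, ...]
```

The 2FGLS' variant described above would instead rebuild the residuals from the 1FGLS estimate before the final weighting step, rather than reusing the second-round regression residuals.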

Simulations
Among influential empirical studies of TV-VAR models, both Cogley and Sargent (2001, 2005) and Primiceri (2005) employ a three-variable TV-VAR(2) model. Hence, in our simulation study, we adopt the same specification and use simulations to assess how well the GLS-based approach recovers the true time-varying parameters relative to the Bayesian approach. First, we compute the means and variances of the estimated time-varying parameters and compare them with those of the true time-varying parameters to evaluate the accuracy of both the GLS-based and the Bayesian approaches. While comparing the first and second moments of the estimates with those of the true process may be insufficient to determine whether the GLS-based approach yields precise estimates, it is a useful way to grasp the overall accuracy of the estimates. Second, we consider the possibility of the pile-up problem. According to Primiceri (2005), the Bayesian approach is preferred when estimating time-varying parameter models because, among other reasons, it can potentially avoid the pile-up problem. However, the extent to which this problem affects our estimates is not immediately obvious because the literature (e.g., Shephard and Harvey 1990) provides theoretical explanations only for limited (simple) cases. By contrast, our model can have a vector of time-varying terms (β_t), unlike prior studies that have analyzed scalar time-varying terms for simplicity. Therefore, it is reasonable to conduct a simulation study to reveal the extent to which our GLS-based approach suffers from the pile-up problem. Because concern about the pile-up problem grows when the variance of the state equation error, or the SNR, is small, we study the performance of the GLS-based approach more comprehensively by altering the SNR in the data-generating process.
Third, we also evaluate the performance of the GLS-based approach when stochastic volatility and non-Gaussian errors are present. We investigate the effect of stochastic volatility on the GLS-based approach because macroeconomic research, including Cogley and Sargent (2005) and Primiceri (2005), has allowed such shocks in the TV-VAR model. While the GLS-based approach does not require the assumption of i.i.d. errors to estimate β_t, we are interested in the extent to which its accuracy is affected by the stochastic volatility of the errors. For the non-Gaussian errors, we focus on possible structural breaks in the time-varying coefficients, β_t. By allowing a mixture of normal errors, as explained in the following subsection, we can model structural breaks or abrupt changes in β_t, as opposed to the gradual changes that the time-varying model generally assumes. Our simulation study is thus expected to shed light on the performance of the GLS-based approach when such errors are present.
Finally, as mentioned in Section 1, since we consider OLS to be a component of the GLS-based approach, we study its performance using simulations to clarify the relative performance of FGLS and OLS, especially in small samples.

Data-Generating Process
We generate pseudo data by the system of Equations (3) and (4) with T ∈ {100, 250}, H = 0.02²I, 0.2²I, or 1²I, and Q = 0.03²I. By changing the variance of the error in the observation equation, we consider the role of the SNR, which we define as the average trace (per element) of the variance-covariance matrix of η_t relative to that of ε_t:

$$\mathrm{SNR} = \frac{\frac{1}{T}\sum_{t=1}^{T}\mathrm{tr}(Q_t)/m}{\frac{1}{T}\sum_{t=1}^{T}\mathrm{tr}(H_t)/k}.$$

In our simulation, we consider the SNRs 0.03²/0.02² = 2.25, 0.03²/0.2² = 0.0225, and 0.03²/1² = 0.0009. The SNR is particularly important when we consider the possibility of the pile-up problem, which is discussed in the next section. For the initial values, we set b*_0 = 0.
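To make these SNR values concrete, a small sketch (our own Python; the per-element trace normalization is an assumption chosen to match the reported values):

```python
import numpy as np

def snr(Q_t, H_t):
    """Average diagonal element of Q_t relative to that of H_t."""
    return (np.trace(Q_t) / Q_t.shape[0]) / (np.trace(H_t) / H_t.shape[0])

m, k = 21, 3                      # three-variable TV-VAR(2) with intercepts
Q_t = 0.03**2 * np.eye(m)
for h_sd in (0.02, 0.2, 1.0):
    print(h_sd, round(snr(Q_t, h_sd**2 * np.eye(k)), 6))
# 0.02 -> 2.25, 0.2 -> 0.0225, 1.0 -> 0.0009
```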

Non-Gaussian Errors
The original motivation to employ time-varying models for macroeconomic research was to allow β_t to change gradually. However, structural breaks or abrupt changes may exist in β_t, which means that β_t is almost constant over time until some point in the sample, say T_b, and then jumps to a different level afterward. One way to model such a break is to assume non-Gaussian errors for η_t. In particular, we assume a mixture of normal distributions (e.g., Perron and Wada 2009) for each element of the error vector η_t:

$$\eta_{i,t} = \delta_t \zeta_{1,t} + (1 - \delta_t)\,\zeta_{2,t}, \qquad \delta_t \sim \mathrm{Bernoulli}(0.95), \quad \zeta_{1,t} \sim N(0, 0.03^2), \quad \zeta_{2,t} \sim N(0, 0.1^2).$$

Intuitively, with a probability of 95%, η_t is ζ_{1,t}, drawn from a normal distribution with a small variance. This small η_t keeps β_t nearly constant over time. With a probability of 5%, however, η_t is ζ_{2,t}, drawn from a normal distribution with a (relatively) large variance, causing β_t to jump to a new level. Since we use the assumption of Gaussian errors to derive the equivalence between GLS and the Kalman-smoothed estimator, how non-Gaussian errors affect the accuracy of the GLS estimator of β_t should be evaluated using simulations.
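Drawing such mixture errors is straightforward; a sketch (our own Python) using the 95/5 mixture above:

```python
import numpy as np

rng = np.random.default_rng(42)

def mixture_eta(T, m, p=0.95, sd_small=0.03, sd_large=0.1):
    """Elementwise normal mixture: N(0, sd_small^2) w.p. p, N(0, sd_large^2) w.p. 1-p."""
    delta = rng.random((T, m)) < p               # Bernoulli(p) indicators
    return np.where(delta,
                    rng.normal(0.0, sd_small, (T, m)),
                    rng.normal(0.0, sd_large, (T, m)))

eta = mixture_eta(T=250, m=21)
beta = np.cumsum(eta, axis=0)     # occasional large shocks appear as level shifts
```

Cumulating the mixture shocks produces paths that are nearly flat most of the time but occasionally shift level, which is exactly the break behavior described above.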

Stochastic Volatility and Autoregressive Stochastic Volatility
As Cogley and Sargent (2005) argue in response to the criticism of Cogley and Sargent (2001), it is more flexible and realistic to assume that the variance of the shock ε_t is time-varying. Intuitively, not all shocks are generated from the same i.i.d. process. Since the GLS-based approach can handle heteroskedasticity in H_t and Q_t, we can, at least in principle, estimate the time-varying model with stochastic volatility, such as the one used by Primiceri (2005). The error term ε_t may also follow the autoregressive stochastic volatility process described by Taylor (2007), among others.
However, in general, FGLS is merely a remedy for estimating the coefficients (in our case, β_t) more precisely when heteroskedasticity is present; it is not primarily designed to estimate the process that the error term (or its variance) follows.
Nevertheless, we use the following data-generating process to assess the performance of the GLS-based approach:

$$\varepsilon_{i,t} = \sqrt{h_{i,t}}\,\xi_{i,t}, \qquad \log h_{i,t} = \rho \log h_{i,t-1} + e_t,$$

where ρ = 1 for stochastic volatility (a random walk in log volatility) and ρ = 0.9 for autoregressive stochastic volatility; log h_{i,0} = 0, e_t ∼ N(0, 0.02²), and ξ_{i,t} ∼ N(0, 1).
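The sketch below (our own Python) generates observation errors under both the random-walk (ρ = 1) and autoregressive (ρ = 0.9) log-volatility specifications:

```python
import numpy as np

rng = np.random.default_rng(7)

def sv_errors(T, k, rho, e_sd=0.02):
    """eps_{i,t} = sqrt(h_{i,t}) * xi_{i,t}, log h_{i,t} = rho*log h_{i,t-1} + e_t,
    with log h_{i,0} = 0, e_t ~ N(0, e_sd^2), xi_{i,t} ~ N(0, 1)."""
    log_h = np.zeros(k)
    eps = np.empty((T, k))
    for t in range(T):
        log_h = rho * log_h + e_sd * rng.standard_normal(k)
        eps[t] = np.exp(0.5 * log_h) * rng.standard_normal(k)
    return eps

eps_rw = sv_errors(250, 3, rho=1.0)   # stochastic volatility (RW)
eps_ar = sv_errors(250, 3, rho=0.9)   # autoregressive stochastic volatility (AR)
```

With e_sd as small as 0.02, the two specifications produce similar volatility paths over a few hundred observations, which foreshadows the simulation finding that the RW and AR cases give nearly identical results.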

Eliminating Outliers
Since time-varying parameters mean that the generated series are generally nonstationary, some such series may be explosive and thus disregarded because they lack practical usefulness. In particular, if the standard deviation of the last 50 observations of a generated series is at least three times greater than that of the first 50 observations or if the inverse of Z Z does not exist, the series is discarded.
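This screening rule can be implemented directly (illustrative Python; the function name is ours):

```python
import numpy as np

def keep_series(y, Z):
    """Return False for draws that should be discarded: explosive series
    (sd of the last 50 obs >= 3x the sd of the first 50) or singular Z'Z."""
    if np.std(y[-50:]) >= 3.0 * np.std(y[:50]):
        return False
    if np.linalg.matrix_rank(Z.T @ Z) < Z.shape[1]:   # inverse does not exist
        return False
    return True
```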

Mean and Variance of the Estimated β t and Likelihood
Since we simulate a TV-VAR(2) model with time-varying intercepts, β_t is a 21 × 1 vector. Let β_{t,i,n} denote the true (data-generating process) value of the i-th element of vector β_t in replication n, and let β̂_{t,i,n} denote its estimate. In the Bayesian MCMC case, we use the posterior mean for β̂_{t,i,n}. Since b*_0 is unknown in practice, we estimate b*_0 as the coefficient vector from a full-sample time-invariant (i.e., usual) VAR(2) model before estimating β_{t,i,n} by OLS, GLS, or the Bayesian method. The sample means and sample standard deviations of the estimates over the sample period are then computed:

$$\hat m_{i,n} = \frac{1}{T}\sum_{t=1}^{T} \hat\beta_{t,i,n}, \qquad (11)$$

$$\hat s_{i,n} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} \bigl(\hat\beta_{t,i,n} - \hat m_{i,n}\bigr)^2}. \qquad (12)$$

Similarly, we compute those of the true (data-generating) process:

$$m_{i,n} = \frac{1}{T}\sum_{t=1}^{T} \beta_{t,i,n}, \qquad (13)$$

$$s_{i,n} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} \bigl(\beta_{t,i,n} - m_{i,n}\bigr)^2}. \qquad (14)$$

From (11) and (12) and their data-generating process counterparts, (13) and (14), we have 21 means and standard deviations for each replication. After N = 1000 replications, we compute the averages of m̂_{i,n}, m_{i,n}, ŝ_{i,n}, and s_{i,n} over the replications. We then have 21 means of time-varying parameters and 21 means of standard deviations (i.e., i = 1, 2, . . . , 21).
Since both m̂ and ŝ are aggregate means, a small difference between m̂ and m or between ŝ and s is only an indication that the estimator is close to what it is expected to estimate. Hence, we further investigate the similarity of β̂ and β. Comparing each element of β, we define the distance, "dist", as

$$\mathrm{dist}_i = \frac{1}{N}\sum_{n=1}^{N} \sqrt{\frac{1}{T}\sum_{t=1}^{T} \bigl(\hat\beta_{t,i,n} - \beta_{t,i,n}\bigr)^2}.$$

Similarly, we compare the standard deviation of each element of β̂ with that of the true process as the ratio

$$\mathrm{rat}_i = \frac{1}{N}\sum_{n=1}^{N} \frac{\hat s_{i,n}}{s_{i,n}}.$$

In this simulation study, we focus on both dist_i and rat_i. Our criteria for a good estimator are whether the dist_i of an estimate is close to zero and whether the rat_i of that estimate is close to one.
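Assuming dist_i is a replication-averaged root-mean-squared deviation and rat_i a replication-averaged ratio of sample standard deviations (our reading of the definitions above), the two measures can be computed as:

```python
import numpy as np

def dist_and_rat(beta_hat, beta_true):
    """beta_hat, beta_true: arrays of shape (N, T, n_coef) over replications,
    time, and coefficients. Returns dist_i (close to 0 is good) and
    rat_i (close to 1 is good), each of length n_coef."""
    rmse = np.sqrt(np.mean((beta_hat - beta_true) ** 2, axis=1))  # (N, n_coef)
    dist = rmse.mean(axis=0)
    rat = (beta_hat.std(axis=1) / beta_true.std(axis=1)).mean(axis=0)
    return dist, rat

# A perfect estimator attains dist = 0 and rat = 1 exactly:
rng = np.random.default_rng(9)
truth = np.cumsum(rng.standard_normal((100, 250, 21)), axis=1)
dist, rat = dist_and_rat(truth.copy(), truth)
```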

Simulation Results 1: The SNR, Sample Size and Estimation Precision
Tables 1 and 2 display the medians of dist_i and rat_i, as well as the medians of m̂_i and ŝ_i, for T = 100 and T = 250, respectively. Notes for Tables 1 and 2: (1) the numbers in the column under "True" are computed from the data-generating process described in Section 4.1; (2) "m", "s", "dist", and "rat" stand for the mean, the standard deviation, the distance from the true values, and the ratio of the standard deviation of the estimates to that of the true values of β, respectively; (3) the bold numbers are the smallest (for median "dist") and the closest to one (for median "rat"), indicating the best of the five methods (OLS, 1FGLS, 2FGLS, 2FGLS', Primiceri).
We focus on the median values of the estimated parameters because we sometimes encounter outliers. Our view is that the non-stationary nature of the data-generating process, together with the possibility of poorly estimated H and Q, especially when the SNR is low, creates those outliers in the estimated coefficients. OLS works relatively well when the SNR is relatively large because, as Tables 1 and 2 show, the median distance of the estimate from the true process (i.e., dist_i) is small, and the median sample variance of the estimated β_t is closer to that of the true process than under the other approaches (i.e., rat_i is closer to one). On the contrary, 2FGLS' works relatively well when the SNR is small. The general tendencies from Table 1 can be summarized as follows. First, OLS, 1FGLS, and 2FGLS share largely the same characteristics. However, the volatility of β_t estimated by 2FGLS' is much smaller than that estimated by OLS, 1FGLS, and 2FGLS. Second, OLS, 1FGLS, 2FGLS, and 2FGLS' all tend to have a larger rat_i as the SNR increases. More precisely, OLS, 1FGLS, 2FGLS, and 2FGLS' overestimate (underestimate) the volatility of β_t when the SNR is very small (large). Third, in terms of the median distance of the estimate from the true process (i.e., dist_i) for T = 250, the best case for OLS and 1FGLS is when the SNR is 2.25. This is easy to understand because an SNR that is either too small or too large makes the estimation of β_t difficult, since an SNR far from one implies a serious degree of heteroskedasticity in regression (10). In such a situation, OLS cannot recover β_t well, and its residuals may thus be unsuitable for implementing FGLS.
What is the effect of increasing the sample size? A comparison of Tables 1 and 2 shows that the degree to which the volatility of β_t is over- or underestimated is largely mitigated for OLS, 1FGLS, and 2FGLS when the sample size increases from 100 to 250. At the same time, the median distances of the estimates from the true process for OLS, 1FGLS, and 2FGLS generally shorten as the sample size increases, showing that the accuracy of OLS and 1FGLS improves with the sample size. However, such effects of an increased sample size do not clearly hold for 2FGLS'. Furthermore, Primiceri's (2005) Bayesian estimation is inaccurate at recovering the true parameter values throughout the simulation. As Primiceri (2005) explains in detail, one caveat is that the prior for Q is crucial for the volatility of the posterior mean of β_t. Hence, it may not be appropriate to use the same set of priors as Primiceri (2005) for this simulation study, because more suitable prior values may improve the results. However, we note that the other estimators do not necessitate such a choice depending on the SNR.
Additionally, instead of computing the median value of each measure over i, we can look closely at dist_i and rat_i for each i. There are some parameters (i) for which Primiceri's (2005) method outperforms the others. Yet, our focus here is the overall performance of each method. Table 3 presents the effects of non-Gaussian errors as well as stochastic volatility and autoregressive stochastic volatility.

Simulation Results 2: The Effects of Non-i.i.d. and Non-Gaussian Errors
Notes for Table 3: (1) the numbers in the column under "True" are computed from the data-generating process ε_{i,t} = √h_{i,t} ξ_{i,t}, log h_{i,t} = ρ log h_{i,t−1} + e_t, where ρ = 1 when stochastic volatility is considered (labeled RW) and ρ = 0.9 when autoregressive stochastic volatility is considered (labeled AR); ε_{i,t} is the i-th element of ε_t; log h_{i,0} = 0, e_t ∼ N(0, 0.02²), and ξ_{i,t} ∼ N(0, 1). (2) "m", "s", "dist", and "rat" stand for the mean, the standard deviation, the distance from the true values, and the ratio of the standard deviation of the estimates to that of the true values of β, respectively. (3) The bold numbers are the smallest (for median "dist") and the closest to one (for median "rat"), indicating the best of the five methods (OLS, 1FGLS, 2FGLS, 2FGLS', Primiceri).
The general tendencies that appear in the Gaussian error case (Tables 1 and 2) are maintained. While OLS, 1FGLS, and 2FGLS overestimate the volatility of β_t, the degree of overestimation is largely mitigated when the sample size increases. Moreover, 2FGLS' underestimates the volatility of β_t, although increasing the sample size helps 2FGLS' estimate β_t more accurately, and Primiceri's (2005) Bayesian approach cannot estimate β_t well. Remarkably, given the value of the autoregressive parameter ρ, there is also a negligible difference between the stochastic volatility and autoregressive stochastic volatility cases.
When only the non-Gaussian error is considered, as Tables 4 and 5 show, we obtain mostly the same results as those presented in Tables 1 and 2. Once again, except for the Bayesian estimator, the degree of overestimation (or underestimation) depends on the SNR. Similar to the results in Tables 1 and 2, a larger sample size generally improves the estimation by OLS, 1FGLS and 2FGLS in that the degree of over- or underestimation is largely reduced when the sample size increases. In addition, for OLS, 1FGLS and 2FGLS, the median distance between the true and estimated β_t shortens with the sample size. However, this tendency does not apply to 2FGLS'.
What is the effect of stochastic volatility or autoregressive stochastic volatility in the observation equation error (ε_t) on our estimation? Table 6 shows that, except for Primiceri's (2005) Bayesian approach, the results arising from such errors are similar to the small-SNR cases in Tables 1 and 2. This is because the observation error (ε_t) has a variance larger than one due to the stochastic volatility (√h_t) term. The results of the stochastic volatility and autoregressive stochastic volatility cases are therefore similar.
Notes to Table 6: (1) The numbers in the column under "True" are computed from the data-generating process with Bernoulli(0.95) mixing, ζ_{1,t} ∼ N(0, 0.03²) and ζ_{2,t} ∼ N(0, 0.1²). (2) "m", "s", "dist", and "rat" stand for the mean, the standard deviation, the distance from the true values, and the ratio of the standard deviation of the estimates to that of the true values of β. (3) The bold numbers are the smallest (for median "dist") and the closest to one (for median "rat"), indicating the best method of the five (OLS, 1FGLS, 2FGLS, 2FGLS', Primiceri).
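The Bernoulli-mixture error in the notes above, which generates occasional abrupt changes in β_t, can be sketched as follows. The exact mixing equation is truncated in the source, so this code assumes the standard form: the Bernoulli(0.95) indicator selects the small-variance component ζ_{1,t} and, with probability 0.05, the large-variance component ζ_{2,t}.

```python
import numpy as np

def mixture_state_errors(T, p=0.95, s1=0.03, s2=0.1, seed=0):
    """Two-component Gaussian mixture for the state-equation error
    (assumed form; the source truncates the mixing equation):
    with probability p draw from N(0, s1^2) (small, gradual moves),
    otherwise from N(0, s2^2) (occasional large moves, i.e., abrupt
    changes or structural breaks in beta_t).
    """
    rng = np.random.default_rng(seed)
    pick = rng.random(T) < p              # Bernoulli(p) indicator
    z1 = rng.normal(0.0, s1, T)           # zeta_{1,t}
    z2 = rng.normal(0.0, s2, T)           # zeta_{2,t}
    return np.where(pick, z1, z2)
```

Cumulating these errors yields a parameter path that drifts slowly most of the time but occasionally jumps, which is the non-Gaussian feature the GLS-based approach is designed to accommodate.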

Discussion: The Pile-Up Problem
Our results suggest that the GLS-based approach does not suffer from the pile-up problem and that lower SNRs often lead to the overestimation of the volatility of β_t, especially when OLS, 1FGLS or 2FGLS is used (Tables 1 and 2). Moreover, the overestimation of the sample variance of β_t becomes more severe when the sample size is small. This may be puzzling given that OLS and ML are generally equivalent under i.i.d. errors and that GLS and ML are equivalent when the errors are not i.i.d. (i.e., when they are heteroskedastic or autocorrelated). However, this equivalence breaks down if FGLS fails to deal with non-i.i.d. errors appropriately. As OLS, 1FGLS and 2FGLS would then be unable to produce an estimate of β_t equivalent to that under ML, this explains why the GLS-based approach does not suffer from the pile-up problem. Interestingly, our simulations reveal that the Bayesian estimator yields much smaller variations over time.
Therefore, our simulations suggest that 2FGLS' is recommended when the sample size is small and that OLS (as well as 1FGLS and 2FGLS) can recover the time-varying parameters fairly well when the sample size is large.
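The equivalence between the GLS-based estimate and the Kalman-smoothed estimate that underlies this discussion can be illustrated in the simplest setting, a univariate local-level model y_t = β_t + ε_t with β_t = β_{t−1} + η_t and known variances. This is a minimal sketch, not the paper's exact matrix construction: the stacked regression is written as a penalized GLS problem with a first-difference matrix, and the function names are illustrative.

```python
import numpy as np

def gls_smooth(y, b0, sig_eps2, sig_eta2):
    """GLS/stacked-regression estimate of the time-varying level beta_t.
    Minimizes ||y - beta||^2/sig_eps2 + ||D beta - e1*b0||^2/sig_eta2,
    where D is the first-difference matrix and beta_0 = b0 is given."""
    T = len(y)
    D = np.eye(T) - np.eye(T, k=-1)        # rows: beta_1, beta_2-beta_1, ...
    e1 = np.zeros(T); e1[0] = 1.0
    A = np.eye(T) / sig_eps2 + D.T @ D / sig_eta2
    rhs = np.asarray(y) / sig_eps2 + (D.T @ e1) * b0 / sig_eta2
    return np.linalg.solve(A, rhs)

def kalman_rts(y, b0, sig_eps2, sig_eta2):
    """Kalman filter + RTS smoother for the same local-level model,
    with beta_0 = b0 known exactly."""
    T = len(y)
    a = np.zeros(T); P = np.zeros(T)       # filtered means / variances
    ap = np.zeros(T); Pp = np.zeros(T)     # predicted means / variances
    m, V = b0, 0.0
    for t in range(T):
        ap[t], Pp[t] = m, V + sig_eta2     # predict
        K = Pp[t] / (Pp[t] + sig_eps2)     # Kalman gain
        m = ap[t] + K * (y[t] - ap[t])     # update
        V = (1.0 - K) * Pp[t]
        a[t], P[t] = m, V
    s = np.zeros(T); s[-1] = a[-1]
    for t in range(T - 2, -1, -1):         # backward (RTS) pass
        J = P[t] / Pp[t + 1]
        s[t] = a[t] + J * (s[t + 1] - ap[t + 1])
    return s
```

Because the model is linear and Gaussian, the penalized GLS solution and the smoothed posterior mean coincide, so the two functions return the same path up to numerical precision; in the feasible versions (1FGLS, 2FGLS, 2FGLS'), the unknown variances are replaced by estimates, which is where the two approaches diverge in practice.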

Application to the TV-VAR(2) Model with Interest Rates, Inflation, and Unemployment
A number of studies that employ TV-VAR models, including Cogley and Sargent (2005) and Primiceri (2005), focus on recovering the structural parameters from the estimated reduced form. Although we do not aim to identify fundamental shocks or compute impulse responses, we present the estimated TV-VAR(2) parameters using OLS (Figure 1). While Primiceri's (2005) Bayesian MCMC posterior means are virtually time-invariant and the estimates by 2FGLS' are slightly more volatile, the estimates provided by OLS, 1FGLS and 2FGLS are much more volatile.
Interestingly, as detailed in the supplementary appendix (available upon request), the coefficients on the interest rate vary noticeably over time, exhibiting distinct patterns in the early 1980s (dip), late 1990s (up) and early 2000s (down). Similar to Primiceri's (2005) Bayesian posterior means, the intercepts (three time-varying coefficients) are largely stable over time. This finding is consistent with our simulation results. Overall, Primiceri's (2005) Bayesian estimator for the time-varying parameter (β t ) tends to have very small variation over time, resulting in virtually time-invariant parameters. 13

Conclusions
The multivariate, non-Bayesian, regression-based or GLS-based approach for the time-varying parameter model is presented and assessed through simulations. Although this approach has already been theoretically justified and (at least partly) used by Ito et al. (2014, 2016, 2021), it is shown that the GLS-based approach has at least four advantages. First, this approach can produce equivalent estimates without needing Kalman filtering or smoothing; it is also applicable to a wide range of time-varying parameter models (e.g., TV-AR, TV-VAR, and TV-VEC models) by adjusting the regression matrix accordingly. Second, the GLS-based approach works reasonably well in practice in that it can estimate the time-varying parameters even when non-i.i.d. or non-Gaussian errors exist in the model. In particular, we find that GLS outperforms Primiceri's (2005) Bayesian approach in recovering the true parameter values because it can take into account generally heteroskedastic error terms. The ability to deal with non-Gaussian errors is particularly important in empirical studies because it allows us to consider possible abrupt changes in time-varying parameters instead of only the gradual changes implied by Gaussian errors. Remarkably, GLS can even outperform Primiceri's (2005) Bayesian approach, which is often employed to estimate TV-VAR models. In practice, however, the most appropriate method (OLS, 1FGLS, 2FGLS or 2FGLS') should be chosen depending on the sample size and the SNR. More precisely, OLS is acceptable when the SNR is not far from one or the sample size is not small; otherwise, 2FGLS' is recommended. The reason why the sample size and the SNR matter for this choice is that 1FGLS, 2FGLS and 2FGLS' are not ideal GLS; hence, they cannot fully deal with the heteroskedasticity arising from our regression equation, which includes both observation equation errors and state equation errors.
However, because we do not maximize the unconditional likelihood function with respect to the variances of the errors, and because 1FGLS, 2FGLS and 2FGLS' are not ideal GLS, the true variances are imprecisely estimated; at the same time, our GLS-based approach does not suffer from the pile-up problem that often occurs with ML.
While our focus in this paper is the estimation of relatively simple TV models, more flexible models, such as a TV-VAR with time-varying variances of the structural shocks, are increasingly needed for macroeconomic analysis. Extending our approach to such complex models is of great importance to both econometricians and macroeconomists.
Proof of Proposition A1. The GLS estimate of β is obtained from the normal equations of the GLS regression; here, we use the Lemma.

Appendix B. Model with Time-Invariant Intercepts
While our model, (1) and (2), and its matrix formulation, (3) and (4), are flexible enough to admit time-varying coefficients, the class of TV-AR models is sometimes assumed to have time-invariant intercepts. For the purpose of deriving the likelihood function, here we modify our model to admit time-invariant intercepts. Suppose we have a k × 1 vector of time-invariant intercepts, v, in our model; (1) and (2) are modified accordingly. In this case, it is convenient to use a matrix form to derive the likelihood function. With the vector of intercepts, our model in matrix form, (3) and (4), is then modified to (A4) and (A5), where I = [I_k I_k · · · I_k] and I_k is a k × k identity matrix. Similar to our assumption that time-varying intercepts, if they exist, are unknown, we assume that the vector of time-invariant intercepts, v, is an unknown parameter vector. From our matrix formulation (A4) and (A5), the log-likelihood function for Y_T given the covariance matrices of the errors (H and Q), the intercept (v), and the initial value vector (b*_0) is given by (A6).

Appendix B.1. The GLS Estimator under the Presence of Time-Invariant Intercepts
As we discuss in the previous section, our model admits time-invariant intercepts; therefore, it is straightforward to define the GLS estimator for such models. To do so, assuming that the time-invariant intercepts are unknown, let us define the vector of unknown parameters β* = [v′ β′]′. Then, the matrix form for regression analogous to the earlier formulation is (A7). Here, one of the advantages of utilizing the regression approach (A7) over Kalman smoothing of (A5) is that the unknown intercept vector v is estimated simultaneously with β. It can then be shown that the GLS estimate of v is indeed the maximum likelihood estimate.
Proposition A2. The GLS estimate v of model (A7) is the maximum likelihood estimate (MLE) of (A5), v ML conditional on H, Q, and b * 0 .
Proof. From the likelihood function (A6), we derive the normal equations pertaining to v together with the GLS estimates for β* in model (A7). Using the Lemma, we arrive at the result (see Appendix B.2 "Detailed Proof of Proposition A2" for details).

Proposition A3. The GLS estimate β of model (A7) is the Kalman-smoothed estimate of model (A5).
Proof. With the intercept, the Kalman-smoothed estimate is modified accordingly; from (A9), the equivalence follows.
It is clear that the GLS-based approach can compute the Kalman-smoothed β and estimate the unknown intercepts, v, simultaneously. The next question is how to obtain statistical inference about β. More precisely, at issue is whether the GLS-based approach yields the same MSE as the Kalman smoother. The answer to this question is negative for β.

Proposition A4. The mean squared error of the Kalman-smoothed estimate differs from the variance estimated from the GLS-based approach (A9).
This assumption does not change our conclusions below. The main difference is that Var(β) = C(P*_0 + Q)C′ and Var(Y_T) = ZC(P*_0 + Q)C′Z′ + H = Ω. An exception is when the diffuse prior is used and the likelihood function is computed excluding the first few observations. In such a case, the estimates of the unknown intercept parameters under the two approaches would differ.
7 By contrast, Duncan and Horn (1972) assume that matrix F is known, which renders their estimation impractical. The original form of Maddala and Kim (1998, pp. 469–70) is similar to ours, but it is a general form for a scalar y_t. Hence, it does not aim to deal with the autoregressive part of time-varying parameter models, nor does it consider vector processes.
8 For the Bayesian approach, we focus on the posterior mean from MCMC. Since our simulations are based on Primiceri's (2005) model, we use the same priors as his. The Matlab codes provided by D. Korobilis are used, which can be downloaded from: https://drive.google.com/file/d/1pYNP96FeGgBH1KpnDEEdXGqZ62ZPw_PQ/view, accessed on 14 March 2022.
9 In addition, we can compute the values of the log-likelihood function to evaluate whether the repeated use of FGLS improves estimation accuracy. Our simulations tend to show that 2FGLS has a higher likelihood value than 1FGLS.
10 After computing β_{t,i,n}, we discard the first 40 (t = 1, . . . , 40) of them in accordance with Korobilis's Matlab codes for the Bayesian MCMC method. Hence, we use T = (length of the data) − 40 to compute the sample moments of β by (11) through (18).

11 Throughout this simulation study, we use bold numbers to highlight the best (the smallest median "dist" and the median "rat" closest to one) estimation method of the five approaches (OLS, 1FGLS, 2FGLS, 2FGLS' and Primiceri).
12 We use the data and MATLAB codes provided by Koop and Korobilis (2010).
13 Note that the impulse responses of Primiceri's (2005) VAR vary largely over time. This is not because the time-varying parameters (β_t) are very volatile over time, but mainly because the variances of the shocks are time-dependent and vary greatly, as shown in Figure 1 of Primiceri (2005, p. 832) and as discussed in the conclusion thereof.