Model Validation and DSGE Modeling

The primary objective of this paper is to revisit DSGE models with a view to bringing out their key weaknesses: statistical misspecification, non-identification of deep parameters, substantive inadequacy, weak forecasting performance, and potentially misleading policy analysis. It is argued that most of these weaknesses stem from failing to distinguish between statistical and substantive adequacy and to secure the former before assessing the latter. The paper untangles the statistical from the substantive premises of inference to delineate these issues and propose solutions. The discussion revolves around a typical DSGE model using US quarterly data. It is shown that this model is statistically misspecified and that, when respecified to attain statistical adequacy, it gives rise to the Student's t VAR model. This statistical model is shown to (i) provide a sound basis for testing the DSGE overidentifying restrictions as well as probing the identifiability of the deep parameters, (ii) suggest ways to ameliorate the substantive inadequacy of the DSGE model, and (iii) give rise to reliable forecasts and policy simulations.


Introduction
The Real Business Cycle (RBC) models proposed by Kydland and Prescott (1982) and Prescott (1986) were heralded by Wickens (1995) as "A Needed Revolution in Macroeconometrics": "The main failing of most macroeconometric models is in not taking macroeconomic theory seriously enough with the result that little or nothing is learned about key parameter values, a fault no amount of econometric sophistication will compensate for" (p. 1637). The original RBC models were subsequently extended in several directions that eventually led to the broader family of Dynamic Stochastic General Equilibrium (DSGE) models. DSGE models combined the RBC perspective with Lucas's (1976) call for structural models built on sound microfoundations, with the parameters of interest reflecting primarily the preferences of the decision-makers as well as the relevant technical and institutional constraints. DSGE models are built upon an inter-temporal general equilibrium framework with a well-defined long-run structure and intrinsic dynamics. Lucas (1976) argued that such structural models with deep parameters are likely to be invariant to policy interventions and thus provide a better basis for prediction and policy evaluation. This produced structural models founded on the interdependence of certain representative rational agents' (e.g., household, firm, government, central bank) intertemporal optimization (e.g., the maximization of life-time utility) that integrates their expectations; see (Canova 2007; DeJong and Dave 2011).
From the theory perspective, DSGE modeling has been a success in revolutionizing macroeconomics by providing more cogent microfoundations and introducing intrinsic and extrinsic dynamics through shocks into macroeconomic models. DSGE models currently dominate both empirical modeling in macroeconomics and economic policy evaluation; see (Hashimzade and Thornton 2013).
From the empirical perspective, however, DSGE models have been criticized on several grounds. First, DSGE models do not fully account for the probabilistic structure of the data; see (Favero 2001). Second, the use of 'calibration' to quantify DSGE models has been called into question; see (Gregory and Smith 1993; Kim and Pagan 1994). Third, the identification of their 'deep' parameters remains problematic; see (Canova 2009; Consolo et al. 2009). Fourth, the appropriateness of the Hodrick-Prescott (H-P) filter has been seriously challenged; see (Chang et al. 2007; Harvey and Jaeger 1993; Saijo 2013). Fifth, the forecasting capacity of DSGE models is rather weak; see (Edge and Gurkaynak 2010). In light of that, one can make a case that, despite the current popularity of DSGE models, a lot remains to be done to ensure their empirical adequacy for inference and policy simulation purposes.
On the positive side, there have been several attempts to remedy some of these weaknesses, including a trend toward estimating the parameters of such models (Ireland 2004; Smets and Wouters 2003), as well as identifying the structural parameters using statistical techniques (Consolo et al. 2009). In addition, questions relating to various forms of possible 'substantive' misspecification of DSGE models have been raised; see (Canova 2009; Del Negro and Schorfheide 2009): "Over the last 20 years dynamic stochastic general equilibrium (DSGE) models have become more detailed and complex and numerous features have been added to the original real business cycle core. Still, even the best practice DSGE model is likely to be misspecified; either because features, such as heterogeneities in expectations, are missing or because researchers leave out aspects deemed tangential to the issues of interest".
Unfortunately, the form of expectations and potentially relevant variables omitted from a DSGE model (e.g., Tables 1-3) pertain to substantive (structural) misspecifications that will have statistical implications, but the literature has ignored statistical misspecification: invalid probabilistic assumptions imposed on one's data. The two forms of misspecification are very different and before one can reliably probe for substantive misspecification one needs to secure statistical adequacy to ensure the reliability of the inference procedures used in probing substantive misspecification; see (Spanos 2006b).
The primary aim of the discussion that follows is to propose novel ways to address some of the empirical weaknesses mentioned above by bridging the gap between DSGE models and the relevant data more coherently. In particular, the paper proposes modeling strategies that bring out the statistical model M θ (z) implicit in every DSGE model M ϕ (z), and suggests effective ways to test the validity of the probabilistic assumptions comprising M θ (z) vis-a-vis data Z 0 to establish its statistical adequacy. It is argued that a misspecified M θ (z) will undermine the reliability of any inference based on the estimated DSGE model, rendering the ensuing evidence untrustworthy. To avoid that, one needs to respecify the original M θ (z) to account for all the statistical information (the chance regularity patterns) exhibited by data Z 0 . When a statistically adequate model is secured, one could then proceed to probe the substantive adequacy of the DSGE model by testing the validity of its overidentifying restrictions, as well as any other relevant issues in [b]. Section 2 focuses on the importance of separating the statistical (M θ (z)) from the substantive (M ϕ (z)) model by discussing how a statistically misspecified M θ (z) undermines the reliability of all inferences based on M ϕ (z). This perspective is then used in Sections 3 and 4 to revisit the Smets and Wouters (2007) DSGE model to: (a) test the statistical adequacy of its implicit statistical model M θ (z), (b) respecify M θ (z) to attain statistical adequacy, (c) appraise the empirical validity of the DSGE model M ϕ (z), (d) evaluate the reliability of its forecasting and impulse response analysis, and (e) propose a procedure to probe the identifiability of its structural parameters ϕ.

Macroeconometric Models
Arguably, the single most important weakness of the macroeconometric models of the 1970s (Bodkin et al. 1991; McCarthy 1972) was their unreliable inferences, including poor forecasting performance. When these models were compared on forecasting grounds with data-driven single-equation AR(p) models, they were found wanting; see (Nelson 1972). In retrospect, the poor forecasting performance of these models can be attributed to several different sources.
The new classical macroeconomics of the 1980s attributed their inadequacy to their ad hoc specification and lack of proper theoretical microfoundations; see (Lucas 1980). Indeed, these weaknesses are often used to motivate the introduction of calibration in RBC modeling (DeJong and Dave 2011): ". . . an important component of Kydland and Prescott's advocacy of calibration is based on a criticism of the probability approach. . . . In sum, the use of calibration exercises as a means for facilitating the empirical implementation of DSGE models arose in the aftermath of the demise of system of equations analyses" (p. 257). Equally plausible, however, is the argument that traces their predictive failure to statistical misspecification, in the sense that these empirical models did not account for the statistical regularities in the data; see (Spanos 2010b, 2021). As argued by (Granger and Newbold 1986, p. 280), statistically misspecified models are likely to give rise to untrustworthy empirical evidence and poor predictive performance. Lucas (1987) called attention to the substantive adequacy of macroeconometric models but ignored their statistical misspecification as a source of untrustworthiness of the ensuing empirical evidence.

Structural vs. Statistical Models
A strong case can be made that the predictive failure of the empirical macroeconometric models of the 1980s can be traced to the questionable modeling strategy of foisting a substantive (structural) model M ϕ (z) on data z 0 and proceeding to draw inferences. Such a strategy, however, will invariably give rise to an empirical model that is both substantively and statistically misspecified. This stems primarily from the fact that the modeler (a) treats the substantive model as established knowledge, and not as a tentative explanation to be evaluated against the data, and/or (b) largely ignores the validity of the probabilistic assumptions imposed (directly or indirectly) on the data via the error term. This raises a serious philosophical problem known as Duhem's conundrum, where one cannot separate the two sources of misspecification (statistical or substantive) and apportion blame with a view to finding ways to address them; see (Mayo 1996).
A case can be made (Spanos 1986) that the key to addressing this conundrum is to untangle the statistical model, M θ (z), that is implicit in every substantive model, M ϕ (z), whose generic forms are:

M ϕ (z) = { f (z; ϕ), ϕ∈Φ⊂R p }, M θ (z) = { f (z; θ), θ∈Θ⊂R m }, z∈R n ,

where p < m < n, f (z; θ) denotes the distribution of the sample Z:= (Z 1 , . . . , Z n ), and Θ⊂R m and R n denote the parameter and sample spaces, respectively. Most importantly, the substantive model constitutes a reparametrization/restriction of the statistical model via restrictions that can be generically specified by G(ϕ, θ) = 0, where ϕ∈Φ and θ∈Θ denote the structural and statistical parameters, respectively. This can be achieved by delimiting the statistical model to comprise solely the probabilistic assumptions imposed on data z 0 , or more accurately on the stochastic process {Z t , t∈N:= (1, 2, . . . , n, . . . )} underlying z 0 , by viewing the statistical model as a particular parametrization of the stochastic process {Z t , t∈N} without any substantive restrictions imposed; see (Spanos 2006b). To illustrate, consider the linear Simultaneous Equations Model (SEM) that has dominated textbook econometrics since the 1960s:

Γ(ϕ)Y t + ∆(ϕ)X t = ε t , ε t ∼ N(0, Ω). (1)

It can be shown that the implicit statistical model, M θ (z), is its (unrestricted) reduced form (Spanos 1986):

Y t = B(θ)X t + u t , u t ∼ N(0, Σ). (2)

The two models are related via B(θ) = −Γ(ϕ) −1 ∆(ϕ) and u t = Γ(ϕ) −1 ε t , yielding the identifying restrictions G(ϕ, θ) = 0, where the structural parameters ϕ:=(Γ, ∆, Ω) are said to be identified if, for a given θ:=(B, Σ), there exists a unique solution of G(ϕ, θ) = 0 for ϕ. The reduced form in (2), when interpreted as an unrestricted parameterization of the stochastic process {Z t :=(Y t , X t ), t∈N}, is the implicit M θ (z), which can be specified in terms of a complete, internally consistent and testable set of probabilistic assumptions [1]-[5] as shown in Table 1; see (Spanos 1990).
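The role of G(ϕ, θ) = 0 in identification can be illustrated numerically: without restrictions on (Γ, ∆, Ω), distinct structural parameterizations are observationally equivalent, since premultiplying the structural equations by any nonsingular matrix M leaves the reduced-form parameters (B, Σ) unchanged. A minimal sketch (the particular matrices are illustrative, not taken from the paper):

```python
import numpy as np

def reduced_form(Gamma, Delta, Omega):
    """Map structural parameters phi = (Gamma, Delta, Omega) to the
    reduced-form (statistical) parameters theta = (B, Sigma):
    B = -Gamma^{-1} Delta,  Sigma = Gamma^{-1} Omega Gamma^{-T}."""
    Gi = np.linalg.inv(Gamma)
    return -Gi @ Delta, Gi @ Omega @ Gi.T

# An arbitrary (illustrative) structural parameterization
Gamma = np.array([[1.0, 0.5], [0.2, 1.0]])
Delta = np.array([[0.3, -0.1], [0.4, 0.7]])
Omega = np.array([[1.0, 0.2], [0.2, 2.0]])

B1, S1 = reduced_form(Gamma, Delta, Omega)

# Premultiply the structural form by a nonsingular matrix M: the
# structural parameters all change, but the reduced form does not,
# so (Gamma, Delta, Omega) is not identified without restrictions.
M = np.array([[2.0, 1.0], [0.0, 1.5]])
B2, S2 = reduced_form(M @ Gamma, M @ Delta, M @ Omega @ M.T)

assert np.allclose(B1, B2) and np.allclose(S1, S2)
```

Identification thus amounts to imposing enough restrictions (normalizations, exclusions) that no such transformation survives, i.e., G(ϕ, θ) = 0 has a unique solution for ϕ.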
This untangling of the statistical and substantive models enables one to distinguish clearly between two different forms of adequacy:

[a] Statistical adequacy: does the statistical model M θ (z) account for the chance regularities in z 0 ? Equivalently, does data Z 0 constitute a truly typical realization of the statistical Generating Mechanism (GM) in M θ (z)? The validity of M θ (z) can be evaluated using thorough Mis-Specification (M-S) testing; see (Spanos 2006b, 2018).

[b] Substantive adequacy: does the substantive (structural) model M ϕ (z) adequately capture (describe, explain, predict) the phenomenon of interest? Substantive inadequacy arises from errors in narrowing down the relevant aspects of the phenomenon of interest, flawed ceteris paribus clauses, missing crucial variables and/or confounding factors, etc.; see (Spanos 2006a, 2010b).

What renders the inference procedures based on the estimated structural model (1) reliable, and the ensuing evidence statistically trustworthy, is the validity of assumptions [1]-[5] for data Z 0 .
In the traditional approach to DSGE modeling, the statistical model M θ (z) is specified indirectly by attaching errors (shocks) to the behavioral equations comprising the structural model M ϕ (z). As a result, the primary concern in the DSGE literature has been with substantive and not statistical misspecification; see (Del Negro and Schorfheide 2009; Del Negro et al. 2007). This is an important development, but it has a crucial weakness: probing for substantive misspecifications in M ϕ (z) without first ensuring the statistical adequacy of M θ (z) will undermine any form of substantive probing based on M ϕ (z); see (Consolo et al. 2009).
The statistical adequacy of M θ (z) needs to be established first because that will ensure the 'optimality' and reliability of the inference procedures employed to probe the substantive adequacy of M ϕ (z). This is because statistical misspecification undermines the optimality and reliability of frequentist inference via:
(i) Rendering the distribution of the sample f (z; θ), z∈R n , as well as the likelihood function L(θ; z 0 )∝ f (z 0 ; θ), θ∈Θ, erroneous.
(ii) Distorting the relevant sampling distribution, f (y n ; θ) = dF n (y)/dy, ∀y n ∈R, of any statistic (estimator, test, predictor), Y n = g(Z 1 , Z 2 , . . . , Z n ), that underlies the inference in question, since F n (y) = P(Y n ≤ y) is derived by integrating f (z; θ) over the region {z: g(z) ≤ y}.
(iii) Undermining the reliability of inference procedures by belying their optimality and inducing sizeable discrepancies between the actual and the nominal (the ones assuming the validity of M θ (z)) error probabilities. Applying a 0.05 significance level (nominal) test when the actual type I error is greater than 0.80 will lead an inference astray.
This unreliability affects not just testing and estimation but also goodness-of-fit and prediction measures, rendering them highly misleading. Statistical adequacy secures the reliability of inference by securing its optimality and ensuring that the actual error probabilities approximate closely the nominal ones. As shown in (Spanos and McGuirk 2001), such discrepancies can easily arise for what are often considered 'minor' statistical misspecifications.
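The gap between actual and nominal error probabilities in (iii) is easy to demonstrate by simulation. The sketch below (illustrative, not from the paper) applies the standard t-test for a zero mean, whose nominal size presumes IID data, to data that are actually AR(1)-dependent; the actual type I error rises far above the nominal 0.05 as the dependence strengthens:

```python
import numpy as np

def actual_size(rho, n=100, reps=2000, seed=0):
    """Monte Carlo rejection frequency of the nominal 5% two-sided
    t-test for H0: E(X)=0, when the data are AR(1) with coefficient
    rho (rho=0 means the IID assumption actually holds)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        e = rng.standard_normal(n + 100)      # 100 burn-in draws
        x = np.zeros(n + 100)
        for t in range(1, n + 100):
            x[t] = rho * x[t - 1] + e[t]
        x = x[100:]
        tstat = np.sqrt(n) * x.mean() / x.std(ddof=1)
        rejections += abs(tstat) > 1.96       # nominal 5% critical value
    return rejections / reps

print(actual_size(0.0))   # close to the nominal 0.05
print(actual_size(0.9))   # far above 0.05: the test is unreliable
```

This mirrors the point of (Spanos and McGuirk 2001): an 'innocuous-looking' departure such as temporal dependence is enough to make the actual error probabilities bear little resemblance to the nominal ones.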
In relation to statistical misspecification, it is also important to emphasize that all approaches to inference (frequentist, Bayesian, nonparametric), as well as the Akaike-type model selection procedures, invoke statistical models, and thus they are all vulnerable to statistical misspecification. In the case of Bayesian inference, a misspecified model M θ (z) gives rise to a false f (z; θ), leading to an erroneous likelihood function L(θ; z 0 )∝ f (z 0 ; θ), θ∈Θ, and that in turn gives rise to an incorrect posterior: π(θ|z 0 ) ∝ π(θ)·L(θ; z 0 ), θ∈Θ, undermining all forms of Bayesian inference. Moreover, no amount of finessing of the prior π(θ) can rectify the statistical misspecification problem induced by an invalid L(θ; z 0 ); see (Spanos 2010a). This is particularly relevant for the recent trend to estimate DSGE models using Bayesian methods; see (Smets and Wouters 2005, 2007; Del Negro et al. 2007; Galí and Wouters 2011).
The importance of the distinction between a substantive (structural), M ϕ (z), and its implicit statistical model, M θ (z), stems from the fact that the error-reliability of inference stems solely from the validity of the probabilistic assumptions defining M θ (z) vis-a-vis data z 0 ; see (Spanos 1986).
At a practical level, one can summarize the proposed modeling process in the form of the following stages.
Stage 1. Untangle the statistical model M θ (z) from the substantive model M ϕ (z), without compromising the integrity of either source of information because the two models are ontologically distinct. From this perspective, the structural model M ϕ (z) derives its statistical meaningfulness from M θ (z) and the latter derives its theoretical meaningfulness from the former.
Stage 2. Establish the statistical adequacy of M θ (z) using comprehensive M-S testing (Mayo and Spanos 2004; Spanos 2018) by assessing the validity of the probabilistic assumptions comprising M θ (z). Without it, one cannot rely on statistical inference to reliably assess any substantive questions of interest, including the adequacy of M ϕ (z) vis-a-vis the phenomenon of interest; the reliability of such inferences will be unknown; see (Spanos 2009a, 2012).
Stage 3. When the original statistical model, M θ (z), is misspecified, one needs to respecify it to account for all the chance (statistical) regularities in the data, i.e., ensure that the respecified model M ϑ (z) is statistically adequate for data z 0 .
Stage 4. Armed with a statistically adequate M θ (z), one can proceed to evaluate the substantive adequacy of M ϕ (z), including testing the overidentifying restrictions stemming from the implicit restrictions G(ϕ, θ) = 0. If these restrictions do not belie data z 0 , the estimated M ϕ (z) can be deemed empirically valid, but not substantively adequate until further probing vis-a-vis the phenomenon of interest reveals no serious flaws relating to [b] above; see (Spanos 2006a).

Revisiting DSGE Modeling
DSGE models aim to describe the behavior of the economy in an equilibrium steady-state stemming from optimal microeconomic decisions associated with several representative agents (households, firms, governments, central banks, etc.). These decisions are based on the intertemporal optimizing behavior of the representative agents, with the first-order conditions of the optimization problem linearized around a constant steady-state using a first-order Taylor approximation; second-order terms raise problems beyond the scope of the present paper; see (DeJong and Dave 2011; Heer and Maussner 2009; Klein 2000). After linearization, the model is specified in terms of log differences, which are thought to be substantively more meaningful.

Smets and Wouters 2007 DSGE Model
Consider the DSGE model proposed by Smets and Wouters (2007). Exogenous shocks: the model includes 7 exogenous shocks.
These relate to a set of deeper structural parameters: there are 41 deep structural parameters in the S-W model, of which 5 are calibrated and the rest are estimated using data Z 0 ; see Smets and Wouters (2007).
After linearization, the DSGE model M ψ (z t ; ξ t ; ε t ) is expressed in terms of three types of variables (Table 2): (i) observable variables (e.g., l t -labor hours, π t -inflation rate, w t -real wage rate, r t -interest rate); (ii) latent variables ξ t (e.g., z t , q t , k s t ); and (iii) the exogenous shocks ε t . The linearized system comprises, among others, the resource constraint, the rental rate of capital, the Taylor rule, exogenous spending, and monetary policy equations. The estimable form of M ψ (z t ; ξ t ; ε t ), the structural DSGE model M ϕ (z), is derived by solving a system of linear expectational difference equations and eliminating certain variables. M ϕ (z) is specified in terms of the observables Z t := ( c t , i t , y t , w t , π t , l t , r t ): the log differences of real GDP ( y t = y t −y t−1 + γ), real consumption ( c t = c t −c t−1 + γ), real investment ( i t = i t −i t−1 + γ) and the real wage ( w t = w t −w t−1 + γ), log hours worked ( l t = l t + l), the log difference of the GDP deflator ( π t = π t + π), and the federal funds rate ( r t = r t + r), where γ = 100(γ−1) is the common quarterly growth rate of real GDP, π = 100(Π * −1) is the quarterly steady-state inflation rate, r = 100(β −1 γ σ c Π * −1) is the steady-state nominal interest rate, and l is steady-state hours worked, normalized to equal zero. All three steady-state values (γ, π and r) are evaluated using observed data as part of the modeling. The variables in the model can be classified as follows. Purely backward (purely predetermined) variables: those that appear only at the current and past periods in the model, but not at any future period. Purely forward variables: those that appear only at the current and future periods, but not at past periods. Mixed variables: those that appear at current, past, and future periods. Static variables: those that appear only at the current period, not at past or future periods.

Using the Dynare software, the solution of the structural model (Smets and Wouters 2007) yields the restricted state-space formulation:

X t = A 1 (ψ)X t−1 + A 2 (ψ)ε t , (5)

where X t :=[s t : x t ] is a vector of 40 variables consisting of 20 state variables s t (14 predetermined variables and 6 mixed variables) and 20 control variables x t (6 purely forward variables and 14 static variables), and ε t is a vector of 7 exogenous shocks. The restricted state-space solution (5) of the DSGE model provides the basis for calibration; note that A 1 (ψ) and A 2 (ψ) are derived using the Blanchard and Kahn (1980) algorithm. The calibration is accomplished using the following steps.
Step 1. Select substantively meaningful values for the structural parameters ϕ.
Step 2. Select the sample size, say n, and the initial values x 0 .
Step 3. Use the values in steps 1-2, together with Normal pseudo-random numbers for ε t to simulate N samples of size n.
Step 4. After 'de-trending' using the Hodrick-Prescott (H-P) filter, use the simulated data Z s 0 to evaluate the first two moment statistics (mean, variances, covariances) of interest for each run of size n, and their empirical distributions for all N.
Step 5. Compare the relevant moments of the simulated data Z s 0 with those of the actual data Z 0 , fine-tuning the original values of ϕ to ensure that (i) these moments are close to each other, using the minimization: min ϕ∈Φ ‖Cov(Z s 0 ; ϕ) − Cov(Z 0 )‖, and (ii) the model gives rise to realistic-looking data, i.e., the simulated data mimic the actual data.
Calibration. In applying this procedure, five parameters are fixed at specific values; see Smets and Wouters (2007).
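Steps 1-5 above can be sketched for a toy one-equation 'structural model' (an AR(1) whose single parameter stands in for ϕ; the function names and the single matched moment are illustrative assumptions, not the paper's procedure):

```python
import numpy as np

def hp_detrend(y, lam=1600.0):
    """Hodrick-Prescott cyclical component: y minus the trend g that
    minimizes sum((y-g)^2) + lam*sum((g[t+1]-2g[t]+g[t-1])^2)."""
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    trend = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
    return y - trend

def simulate_model(phi, n, rng):
    """Step 3: simulate the toy structural model, an AR(1) with
    Normal pseudo-random shocks."""
    z = np.zeros(n)
    for t in range(1, n):
        z[t] = phi * z[t - 1] + rng.standard_normal()
    return z

def calibrate(data, grid, n_reps=30, seed=1):
    """Steps 4-5: pick the phi on the grid whose simulated H-P
    cyclical variance best matches that of the actual data."""
    rng = np.random.default_rng(seed)
    target = np.var(hp_detrend(data))
    gaps = []
    for phi in grid:
        sims = [np.var(hp_detrend(simulate_model(phi, len(data), rng)))
                for _ in range(n_reps)]
        gaps.append(abs(np.mean(sims) - target))
    return grid[int(np.argmin(gaps))]
```

With data simulated from a high-persistence value of ϕ, calibrating over a coarse grid recovers that value; in practice Steps 4-5 match a whole vector of means, variances and covariances rather than a single variance, and Step 5(ii) adds an informal visual check.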

Confronting the DSGE Model with Data
Smets and Wouters (2007) estimate their model using Bayesian methods, where the reliability of inference depends crucially on the statistical adequacy of the implicit M θ (z), since the posterior π(θ|z 0 ) ∝ π(θ)·L(θ; z 0 ), θ∈Θ, invokes its validity via the likelihood function L(θ; z 0 )∝ f (z 0 ; θ), θ∈Θ. The data used for the estimation/calibration of the DSGE model in Table 2 are US quarterly time series for the period 1947:2-2004:4 (n = 231): the log difference of real GDP ( y t ), real consumption ( c t ), real investment ( i t ), and the real wage ( w t ), log hours worked ( l t ), the log difference of the GDP deflator ( π t ), and the federal funds rate ( r t ).
The validation of the DSGE structural model M ϕ (z) will be achieved in three steps.
Step 1. Unveil the statistical model M θ (z) implicit in the DSGE model M ϕ (z).
Step 2. Secure the statistical adequacy of M θ (z) using M-S testing and respecification.
Step 3. Test the overidentifying restrictions in the context of a statistically adequate model secured in Step 2.
An obvious form of potential statistical misspecification stems from the fact that most of the data series exhibit non-stationarity that cannot be fully accounted for using log differences, as implicitly assumed. As shown below, one needs to add trends to account for the mean-heterogeneity in the data. The implicit statistical model M θ (z) behind the linearized structural model M ϕ (z), in terms of the observables Z t :=( y t , c t , i t , π t , w t , l t , r t ), is a Normal VAR(p) model (Table 4), with p = 2; Table 4 specifies its probabilistic assumptions [1] Normality, [2] Linearity, [3] Homoskedasticity, [4] Markov(p) dependence, and [5] t-invariance. For the details connecting the structural and the statistical model in Tables 2 and 4, see Appendix A.

Evaluating the Validity of the Implicit Statistical Model
The solution of the Smets and Wouters (2007) structural model gives rise to a Normal VARMA(2,1) model. However, since the latter imposes unnecessary statistical restrictions due to the MA(1) component, the implicit statistical model is a VAR(p), p ≥ 2, which imposes no such restrictions, with the value of p decided on statistical adequacy grounds.
Although Mis-Specification (M-S) testing can take a variety of forms (Lutkepohl 2005), in the case of the Normal VAR(p) [N-VAR(p)] model the most coherent procedure is to use joint M-S tests based on auxiliary regressions relating to the first two conditional moments; see Spanos (2018, 2019). The auxiliary regressions for testing the validity of assumptions [1]-[5] are written in terms of the standardized residuals of the seven observable variables. For instance, in the case of a single estimated equation based on Y t with residuals u t , the auxiliary regressions take the generic form of regressions of u t and u t 2 on terms such as trends, lags and fitted values, chosen to probe departures from each assumption. The form of the auxiliary regressions used for joint M-S testing depends on a number of different factors, and the robustness of the results is evaluated by examining several alternative formulations. The hypotheses being tested by the different joint M-S tests are given in Table 5. The M-S test for Normality is the Anderson and Darling (1952) test, because it is more robust to a few outliers than the Skewness-Kurtosis or the Kolmogorov test; see (Spanos 1990).

The results of the joint M-S tests in Table 5, reported in Tables 6 and 7 [p-values in square brackets], indicate that the N-VAR(2) and N-VAR(3) models are statistically misspecified; only assumptions [2] Linearity and [4] Markov(2) dependence are valid for data Z 0 . The decision to leave out the MA component on statistical adequacy grounds stems from the fact that, after thorough M-S testing, the estimated VAR(2) model fully accounts for the temporal dependence in data Z 0 ; if an MA error term were needed to account for that dependence, the Markov(2) assumption would have been rejected by data Z 0 . As shown below, however, [1] Normality, [3] Homoskedasticity and [5] t-invariance are invalid for the VAR(p) model for both p = 2 (Table 6) and p = 3 (Table 7), giving rise to statistically misspecified models.
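The flavor of such an auxiliary-regression M-S test can be conveyed with a single-equation sketch (illustrative, not the paper's exact specification): regress the residuals on a constant, a linear trend and one lag, and F-test the joint significance of the added terms; significance signals a departure from t-invariance or leftover temporal dependence.

```python
import numpy as np
from scipy import stats

def aux_regression_Ftest(u):
    """Auxiliary regression u_t = a0 + a1*t + a2*u_{t-1} + error;
    F-test of H0: a1 = a2 = 0 (no trend, no leftover dependence)."""
    y = u[1:]
    n = len(y)
    X_r = np.ones((n, 1))                          # restricted: constant only
    X_u = np.column_stack([np.ones(n), np.arange(n), u[:-1]])
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0])**2)
    q, k = 2, X_u.shape[1]
    F = ((rss(X_r) - rss(X_u)) / q) / (rss(X_u) / (n - k))
    return F, stats.f.sf(F, q, n - k)              # test statistic, p-value

rng = np.random.default_rng(0)
# White-noise residuals: the test should (mostly) not reject.
_, p_ok = aux_regression_Ftest(rng.standard_normal(200))
# Residuals with a leftover trend: the test should reject decisively.
t = np.arange(200)
_, p_bad = aux_regression_Ftest(0.05 * t + rng.standard_normal(200))
```

The joint tests reported in Tables 6 and 7 follow the same logic, but over richer auxiliary regressions (including squared residuals for the second conditional moment) and across all seven equations at once.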

Hence, no reliable inferences can be drawn based on a calibrated or an estimated N-VAR(p), for p = 2, 3, including tests of the validity of the DSGE restrictions. In light of that, any inference, including forecasting, based on the estimated/calibrated DSGE model will give rise to spurious/untrustworthy results.

Respecifying the Implicit Statistical Model
In light of the above detected statistical misspecification, the next step is to respecify the original N-VAR(2) model to account for the statistical information that lingers in the residuals. The departures indicating non-Normality, heteroskedasticity, and second-order temporal dependence, in conjunction with the validity of the linearity assumption, suggest that the best way to respecify the N-VAR(2) model is to replace Normality with another distribution from the Elliptically Symmetric (ES) family. This family retains the bell-shaped symmetry and the linearity of the autoregression, but allows for heteroskedasticity and second-order temporal dependence; within the ES family, homoskedasticity characterizes the Normal distribution; see (Spanos 2019), chp. 7. Hence, an obvious choice is to assume that the process {Z t , t∈N} is Student's t, Markov(p), covariance stationary but mean trending. This gives rise to the Student's t VAR(p) [St-VAR(p)] model with ν degrees of freedom in Table 8.
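The key property driving this choice can be stated explicitly. For a multivariate Student's t vector (Y, X) with ν degrees of freedom, location (μ_y, μ_x) and scale matrix partitioned conformably, the conditional mean is linear but the conditional variance depends on the conditioning values (a standard result for the multivariate t, stated here for reference):

```latex
\begin{aligned}
E(\mathbf{Y}\mid \mathbf{X}=\mathbf{x}) &= \boldsymbol{\mu}_y
  + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}-\boldsymbol{\mu}_x),\\
\operatorname{Var}(\mathbf{Y}\mid \mathbf{X}=\mathbf{x}) &=
  \frac{\nu + (\mathbf{x}-\boldsymbol{\mu}_x)^{\top}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}-\boldsymbol{\mu}_x)}
       {\nu + m - 2}\,
  \bigl(\boldsymbol{\Sigma}_{yy}-\boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\bigr),
  \qquad m=\dim(\mathbf{X}).
\end{aligned}
```

In the St-VAR(p) case, X plays the role of the past history Z 0 t−1 , so Var(Z t |Z 0 t−1 ) is a quadratic function of the past observations: this is the heteroskedasticity allowed for in Table 8, and it collapses to a constant only in the Normal limit ν → ∞.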
There are two key differences between the N-VAR(2) (Table 4) and St-VAR(2) (Table 8) models. The first is that the St-VAR(2) allows for trends µ(t) to account for the mean heterogeneity in the data series. The second is that the St-VAR(2) is heteroskedastic [Var(Z t |σ(Z 0 t−1 )) is a function of Z 0 t−1 ] and its conditional variance is heterogeneous [Var(Z t |σ(Z 0 t−1 )) is a function of t via the unconditional mean µ(t)]. In relation to this distinction, it is important to note that (Primiceri and Justiniano 2008) model the volatility in the context of DSGE models using conditional variance heterogeneity and not heteroskedasticity. As shown below, both play a very important role in accounting for the volatility in the data series. It is important to note that to reach a statistically adequate model with p = 2 and ν = 5, the interest rate term (r t ) had to be replaced with (z t :=∆ ln r t ).
In Tables 9 and 10, the estimates of the autoregressive functions of the St-VAR(2) and N-VAR(2) models are compared and contrasted. First, there are significant differences between the estimates corresponding to the same coefficients, even though their respective autoregressive functions are identical; significant differences are indicated by the sign ( ). Second, the trend polynomials for the St-VAR(2) model are very significant, and their absence from the N-VAR(2) model gives rise to misleading results because the coefficients are based on deviations from the 'wrong' mean, calling into question the use of the steady-state. In relation to the trend polynomials, it is important to emphasize that filtering the data using the H-P filter does not eliminate potential departures from the t-invariance of the conditional variance; instead, it distorts the mean heterogeneity as well as the temporal dependence in data Z 0 . Third, the most crucial difference is that the homoskedasticity assumption for the N-VAR(p) model is clearly invalid (see Tables 6 and 7). The inappropriateness of the constant conditional variance-covariance associated with the N-VAR(2) model is illustrated in Figures 1 and 2, where the squared residuals from the N-VAR(2), which exhibit great volatility, are plotted with Var(y t |Z 0 t−1 ) and Var(c t |Z 0 t−1 ) based on the St-VAR(2) model (Table 11), indicating that the latter capture most of the volatility. Note that all seven conditional variances are scaled versions of each other.

Evaluating the Statistical Adequacy of the St-VAR(2) Model
To take into account the heteroskedastic conditional variance-covariance, one needs to reconsider what constitutes the 'relevant residuals' for M-S testing purposes. In the case of the St-VAR(p) model, the relevant residuals are the standardized ones, obtained by scaling the raw residuals with the factor L t of the conditional variance-covariance; here L t changes with t and Z 0 t−1 , as opposed to being constant as in the N-VAR(2) model. The auxiliary regressions based on these residuals are specified, as before, in terms of the fitted values y t and the conditional variance σ 2 t = Var(y t |σ(Z 0 t−1 )). The hypotheses being tested are directly analogous to those in Table 5 above.
The results of the M-S testing for the estimated Student's t VAR(2) model, reported in Table 12, indicate no departures from its assumptions [1]-[5]. In assessing these results, it is important to note that p-values decrease with the sample size n, implying that one needs to decrease the appropriate threshold as n increases; for n = 231, a more appropriate threshold is α = 0.01; see (Spanos 2014). The statistical adequacy of the St-VAR(2) is also reflected in the constancy of the variation around a constant mean exhibited by its residuals in Figure 3. This should be contrasted with the N-VAR(2) residuals in Figure 4, which seem to indicate a shift in both the mean and variance between the period 1983-2000 and afterward. This, however, is misleading, since Figure 4 represents the residuals of a statistically misspecified model. On the other hand, Figure 3 represents the residuals of a statistically adequate model and suggests that the lower volatility arises as an inherent chance regularity stemming from {Z t , t∈N} being a Student's t Markov(2) process. Indeed, the sequence of successive periods of large and small volatility represents a chance regularity pattern reflecting second-order temporal dependence, initially noted by (Mandelbrot 1963): ". . . large changes tend to be followed by large changes-of either sign-and small changes tend to be followed by small changes" (p. 418).
This calls into question the hypothesis known as the 'great moderation' (Stock and Watson 2002) based on Figure 4, since the residuals from the N-VAR(2) model do not account for the second-order dependence. That is, the 'great moderation' hypothesis stems from an erroneous interpretation based on statistical misspecification. The relevant residuals from a statistically adequate St-VAR(2) model in Figure 3 represent a realization of a Student's t Martingale Difference process, as they should.

Fig. 4: Normal-VAR(2) residuals

Testing the Over-Identifying Restrictions
In light of the fact that the Student's t VAR(2) is a statistically adequate model, one can proceed to probe the empirical adequacy of the DSGE model, knowing that the actual error probabilities provide a close approximation to the nominal (assumed) ones. This includes testing the m = 98 DSGE over-identifying restrictions G(θ, ϕ) = 0. For α = 0.01 the critical value is c_α = 68.4, and the observed test statistic far exceeds it; the tiny p-value provides indisputably strong evidence against the DSGE model. Hence, when a DSGE model M_ϕ(z) is tested against reliable statistical evidence in the form of a statistically adequate M_θ(z) [Student's t VAR(2)], M_ϕ(z) is strongly rejected. The natural way forward for DSGE modeling is to find ways to modify DSGE models to account for the statistical regularities in the data brought out by the Student's t VAR(2). These regularities include the leptokurticity as well as the second-order temporal dependence exemplified by the heteroskedastic Var(Y_t|Z^0_{t−1}). It is important to note that the (Del Negro et al. 2007) substantive misspecification analysis based on the DSGE-VAR(λ) model differs from the above frequentist over-identifying restrictions test; see (Consolo et al. 2009). Apart from the fact that the former uses a Bayesian approach, the key difference is that the probabilistic assumptions underlying the VAR(λ) specification are presumed valid and are not tested against the data.
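The logic of such a test can be sketched as a generic likelihood-ratio comparison of the restricted (DSGE-constrained) and unrestricted models (a hedged illustration: the log-likelihood inputs below are hypothetical placeholders, not the values underlying the reported c_α = 68.4):

```python
from scipy import stats

def overid_lr_test(loglik_unrestricted, loglik_restricted, m, alpha=0.01):
    """Likelihood-ratio test of m over-identifying restrictions G(theta, phi) = 0.

    The restricted model is the DSGE-constrained parameterization nested
    inside the statistically adequate model; under H0 the LR statistic is
    asymptotically chi-square with m degrees of freedom.
    """
    lr = 2.0 * (loglik_unrestricted - loglik_restricted)
    c_alpha = stats.chi2.ppf(1.0 - alpha, df=m)   # critical value at level alpha
    p_value = stats.chi2.sf(lr, df=m)             # upper-tail p-value
    return lr, c_alpha, p_value, lr > c_alpha
```

The point of performing the test within a statistically adequate M_θ(z) is that the nominal chi-square error probabilities used here are then close to the actual ones.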

Forecasting Performance
Typical examples of the out-of-sample forecasting capacity of both the DSGE and the Student's t VAR(2) models for 8 periods ahead [2003Q1-2004Q4; estimation period 1947Q2-2002Q4] are shown in Figures 5 and 6 for wages and consumption growth, with the actual data denoted by a solid line. Figures 5 and 6 are typical cases of the forecasting performance of the Smets and Wouters DSGE model and illustrate a good and a bad case. In general, the forecast line of the DSGE model tends to be oversmoothed in a way that largely ignores the systematic temporal dependence/heterogeneity in the data. When the forecast line happens to overlap with the actual data, it appears to track the trend reasonably well but not the cycles; see Figure 5.
When the forecast line misses the actual data line (see Figure 6), the forecasts are particularly bad because the DSGE model over- or under-predicts systematically, giving rise to systematic (non-white-noise) prediction errors. This pattern is symptomatic of statistical misspecification. In relation to this, it is very important to emphasize that when the forecast errors are statistically systematic-they exhibit over- or under-prediction-the Root Mean Square Error (RMSE) can be highly misleading as a measure of forecasting capacity. The RMSE is a reliable measure only when the forecast errors are statistically non-systematic. In that sense, Figures 5 and 6 show that the performance of the St-VAR(2) model is much better than that of the DSGE model, irrespective of the RMSEs.
Note that in the case of the St-VAR model, statistically non-systematic means that its residuals and forecast errors constitute realizations of martingale difference processes.
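The warning about RMSE can be made concrete with a small diagnostic sketch (illustrative only; the function and its inputs are hypothetical, not the paper's procedure): alongside the RMSE it reports two simple symptoms of systematic error, a non-zero mean error (persistent over/under-prediction) and first-order autocorrelation in the errors (non-white-noise dynamics).

```python
import numpy as np

def forecast_error_diagnostics(actual, forecast):
    """Check whether forecast errors look non-systematic before trusting RMSE.

    Returns (rmse, mean_error, lag1_autocorrelation) of the forecast errors.
    A mean error far from zero or sizeable lag-1 autocorrelation signals
    systematic (non-martingale-difference) errors, in which case the RMSE
    is an unreliable measure of forecasting capacity.
    """
    e = np.asarray(actual, float) - np.asarray(forecast, float)
    rmse = np.sqrt(np.mean(e**2))
    mean_err = e.mean()
    e_c = e - mean_err
    denom = np.sum(e_c**2)
    acf1 = np.sum(e_c[1:] * e_c[:-1]) / denom if denom > 0 else 0.0
    return rmse, mean_err, acf1
```

For instance, a forecast that under-predicts every period by the same amount has a perfectly respectable-looking RMSE, yet its mean error exposes the systematic bias.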
Interestingly, the poor forecasting performance of DSGE models is well-known, but it is rendered acceptable by comparing it to that of N-VAR models: ". . . we find that the benchmark estimated medium scale DSGE model forecasts inflation and GDP growth very poorly, although statistical and judgemental forecasts do equally poorly". (Edge and Gurkaynak 2010, p. 209) This claim fails to recognize that the poor forecasting performance stems primarily from the statistical inadequacy of the underlying estimated model; see Tables 6 and 7.

Potentially Misleading Impulse Response Analysis
The statistical inadequacy of the underlying statistical model also affects the reliability of its impulse response analysis, giving rise to misleading results about the reaction to exogenous shocks over time. Indeed, the estimated Var(y_t|Z^0_{t−1}) brings out the potential unreliability of any impulse response and variance decomposition analysis based on assuming a constant conditional variance. Figure 7 compares the impulse responses of a 1% increase in labor hours (l_t) on inflation (π_t) from the Normal and Student's t VAR models. The statistically adequate St-VAR(2) model produces a sharp, large decline and a slow recovery in the rate of inflation. However, the Normal VAR model produces a different impulse response: the rate of inflation decreases sharply first, and then increases sharply before falling again and rising slowly. Figure 8 compares the impulse responses of a 1% increase in the labor hours rate (l_t) on output (y_t) from the Normal and Student's t VAR models. The heterogeneous St-VAR model produces a mild decline and a slow recovery in the growth rate of per-capita real GDP. But the effects produced by the stationary Normal VAR model are completely different: the growth rate increases sharply first, then falls sharply and recovers slowly.
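For a constant-coefficient reduced-form VAR, the mechanics of tracing such responses can be sketched via the companion form (a textbook illustration under that assumption; it does not reproduce the heteroskedastic St-VAR responses discussed above):

```python
import numpy as np

def var_impulse_responses(A_list, shock, horizons=12):
    """Impulse responses of a reduced-form VAR(p) via its companion form.

    A_list : list of p (k, k) coefficient matrices A_1, ..., A_p
    shock  : length-k initial shock vector (e.g. a 1% move in one variable)
    Returns a (horizons+1, k) array of responses of all k variables.
    """
    p = len(A_list)
    k = A_list[0].shape[0]
    # Stack the VAR(p) into a VAR(1) companion matrix [[A_1 ... A_p], [I, 0]].
    C = np.zeros((k * p, k * p))
    C[:k, :] = np.hstack(A_list)
    if p > 1:
        C[k:, :-k] = np.eye(k * (p - 1))
    state = np.zeros(k * p)
    state[:k] = shock
    out = [state[:k].copy()]
    for _ in range(horizons):
        state = C @ state          # propagate the shock one period forward
        out.append(state[:k].copy())
    return np.array(out)
```

Under the St-VAR specification the conditional variance itself changes over time, so responses computed this way, which hold the second moments fixed, can be seriously misleading; that is precisely the point of the comparison in Figures 7 and 8.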

Identification of the 'Deep' Structural Parameters
A crucial issue raised in the DSGE literature is the identification of the structural parameters; see (Canova 2007;Iskrev 2010). The problem is that there is often no direct way to relate the statistical parameters (θ) to the structural parameters (ϕ) because the implicit function G(θ, ϕ) = 0 is not only highly non-linear, but it also involves algorithms like the Schur decomposition of the structural matrices involved.
An indirect way to probe the identification of the above DSGE model is to use the estimated statistical model, the St-VAR(2), to generate, say, N faithful (true to the probabilistic structure of Z_0) replications of the original data Z_0. The statistical adequacy of the estimated St-VAR(2) ensures that it accounts for the statistical regularities in the data, and thus the simulated data will have the same probabilistic structure as the original observations. This enables the modeler to learn about the identifiability of the deep parameters by using these faithful replicas of the original data to estimate the structural DSGE model.
The N simulated data series can be used to estimate the structural parameters (ϕ) using the original 'quantification' procedures in (Smets and Wouters 2007). When the histogram of each ϕ i , for i = 1, 2, . . . , p, is concentrated around a particular value, with a narrow interval of support, then ϕ i can be regarded as identifiable. When the histogram exhibits a large range of values or/and multiple modes, it indicates that the parameter in question is non-identifiable.
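The simulation-based probe described above can be sketched as follows (the `simulate_data` and `estimate_phi` interfaces are hypothetical stand-ins for the fitted St-VAR(2) simulator and the Smets-Wouters quantification procedure, respectively):

```python
import numpy as np

def identification_probe(simulate_data, estimate_phi, n_reps=1000, seed=0):
    """Probe identifiability of structural parameters by repeated simulation.

    simulate_data : callable(rng) -> one faithful replication of the data,
                    drawn from the statistically adequate model.
    estimate_phi  : callable(data) -> length-p vector of structural estimates.
    Returns an (n_reps, p) array of estimates.  A parameter whose column is
    tightly concentrated around a single mode is potentially identifiable;
    a diffuse or multimodal column signals non-identifiability.
    """
    rng = np.random.default_rng(seed)
    draws = [estimate_phi(simulate_data(rng)) for _ in range(n_reps)]
    return np.asarray(draws)
```

The histograms of each column are then inspected, and each estimated/calibrated value is compared with the mode and support of its histogram, as done below.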
Out of 36 parameters, the simulation is applied to 27, keeping the remaining 9 (7 of which are shock variances) constant. The 27 histograms in Figures 9 and 10 were generated using N = 3000 replications of the original data of sample size n = 230 based on the estimated statistical model, the St-VAR(2); increasing N does not change the results. Looking at these histograms, we can distinguish three different groups of identified/non-identified parameters. First is the group where the estimated/calibrated value is close to the mode of the histogram. In this case, the parameters π, µ_p, σ_c, Φ, ρ_p, ρ_g are potentially identifiable vis-à-vis the data.
Second is the group where the estimated/calibrated values are significantly different from the mode of the histogram. In such cases, the parameters in question, l, α, ϕ, h, ι_w, ι_p, σ_l, r_π, ρ, ρ_a, ρ_r, ρ_w, are likely to be unidentifiable.
Third is the group where the estimated/calibrated value lies outside the actual range of values of the histogram. The parameters in question, ρ_ga, µ_w, ξ_w, ξ_p, r_y, ρ_i, ρ_b, ρ_p, are clearly non-identifiable.
Taken together, only six of the twenty-seven parameters have estimated or calibrated values that are potentially identifiable vis-a-vis the data.
This simulation exercise indicates that, in contrast to the statistical parameters of the St-VAR(2), which are inherently identifiable, the identification and constancy of the 'deep' DSGE parameters is called into question. The results also question the appropriateness of the 'estimation' of these deep parameters using traditional methods such as the method of moments, maximum likelihood, and Bayesian techniques; see (Smets and Wouters 2003, 2005, 2007; Ireland 2004, 2011). A question that needs to be addressed is whether the Bayesian techniques narrow down the range of values of the deep parameters to render them "artificially" identifiable. Indeed, the broader question which naturally arises when one is dealing with a calibrated model is: what is one calibrating/evaluating when the structural parameters are non-identifiable?

Substantive vs. Statistical Adequacy
The argument in (Lucas 1980) that: "Any model that is well enough articulated to give clear answers to the questions we put to it will necessarily be artificial, abstract, patently 'unreal'" (p. 696), is highly misleading because it blurs the distinction between substantive and statistical adequacy. There is nothing wrong with constructing a simple, abstract, and idealized theory-model M_ϕ(z) aiming to capture key features of the phenomenon of interest, with a view to shed light on (understand, explain, forecast) economic phenomena of interest, as well as gain insight concerning alternative policies. Unreliability-of-inference problems arise when the statistical model M_θ(z) implicitly specified by M_ϕ(z) is statistically misspecified, and no attempt is made to reliably assess whether M_ϕ(z) does, indeed, capture the key features of the phenomenon of interest; see (Spanos 2009b). That is, the strategy 'theory or bust' makes no sense in empirical modeling. As argued by (Hendry 2009): "This implication is not a tract for mindless modeling of data in the absence of economic analysis, but instead suggests formulating more general initial models that embed the available economic theory as a special case, consistent with our knowledge of the institutional framework, historical record, and the data properties. . . Applied econometrics cannot be conducted without an economic theoretical framework to guide its endeavors and help interpret its findings. Nevertheless, since economic theory is not complete, correct, and immutable, and never will be, one also cannot justify an insistence on deriving empirical models from theory alone". (pp. 56-57) Statistical misspecification is not the inevitable result of abstraction and simplification but stems from imposing invalid probabilistic assumptions on the data.
Moreover, statistical misspecification goes a long way toward explaining the poor forecasting performance of the traditional macroeconometric models in the 1970s (Nelson 1972) and can explain the poor forecasting performance of DSGE models.
Unfortunately, the current literature on DSGE modeling adopts the (Kydland and Prescott 1991) view that misspecification is inevitable. For instance, (Canova 2007) goes further by arguing: "DSGE models are misspecified in the sense that they are, in general, too simple to capture the complex probabilistic nature of the data. Hence, it may be fruitless to compare their outcomes with the data . . . Both academic economists and policy makers use DSGE models to tell stories about how the economy responds to unexpected movements in the exogenous variables". (p. 160) There is nothing complicated about the probabilistic nature of economic time series data. The probabilistic assumptions needed to account for the chance regularity patterns in such data come from three broad categories, (D) Distribution, (M) Dependence, and (H) Heterogeneity, with simple generic ways to account for (M) and (H) using lags and trend polynomials. Moreover, when any of the probabilistic assumptions are found wanting, they can easily be replaced with more appropriate ones (respecification); see Spanos (2019). Regarding the use of empirical modeling as 'story-telling', it should be noted that when an estimated DSGE model M_ψ(z) is statistically misspecified, the stories based on it have nothing to do with the economy that gave rise to the data, since the empirical evidence invoked is untrustworthy. It would be better for the scientific reputation of macroeconometrics to skip the 'data' part and just tell the stories associated with simulating M_ψ(z) using parameter values that seem 'appropriate' to a DSGE modeler. Why pretend that these values stem from the data?

Summary and Conclusions
The literature on DSGE modeling rightly points out that reliance on chance regularities for statistical inference purposes, as in the case of the traditional VAR(p) model, is not sufficient to represent substantively meaningful (structural) models that can be used to forecast and evaluate different macroeconomic policies. On the other hand, estimating structural models that belie the chance regularities in the data would only give rise to untrustworthy inference results.
Estimating the structural model directly often leads to an impasse, since the estimated model is typically both statistically and substantively inadequate. This renders any proposed substantive respecifications of the original structural model (Del Negro and Schorfheide 2009; Del Negro et al. 2007) questionable, since the respecified model is declared 'better' or 'worse' on the basis of untrustworthy evidence when the estimated statistical model is misspecified.
A way to address this quandary is to separate, ab initio, the structural model M ϕ (z) from the statistical model M θ (z), and establish statistical adequacy before posing any substantive questions of interest. An estimated DSGE model M ϕ (z) whose statistical premises M θ (z) are misspecified constitutes an unreliable basis for any form of inference. From a purely probabilistic perspective, M θ (z) is viewed as a parameterization of the process {Z t , t∈N} underlying data Z 0 , chosen so that it (parametrically) nests M ϕ (z) via G(θ, ϕ) = 0. The crucial distinction between statistical and substantive premises suggests that various traditional conundrums, such as theory-driven vs. data-driven, realistic vs. unrealistic, and policy-oriented vs. non-policy-oriented models, are largely false dilemmas. Statistical adequacy of M θ (z) is a necessary precondition for securing the reliability of any form of inference.
Using quarterly US data for the period 1947:2-2004:4, the confrontation of the (Smets and Wouters 2007) DSGE model M ϕ (z) with a statistically adequate M θ (z) [Student's t VAR(2)] strongly rejects M ϕ (z), and calls into question the reliability of any inferences based on it. The Bayesian estimation techniques used by the authors are likely to be equally unreliable because the implicit likelihood function is invalid. Indeed, in light of the unidentifiability of most of the structural parameters shown in Section 4.4, questions arise about the role of the priors in quantifying such parameters. Based on the above discussion, a way forward for DSGE modeling is to engage in the following recommendations.
(a) The modeler needs to bring out the statistical model M_θ(z) implicitly specified by the structural model M_ϕ(z), with the former specified in terms of a complete set of testable probabilistic assumptions, as in Tables 1, 4 and 8.
(b) When a DSGE model M_ϕ(z) is estimated directly, the statistical reliability of any inferences drawn is questionable. Before any reliable inferences can be drawn, the modeler needs to test the validity of the assumptions of the statistical model.
(c) When the statistical model M_θ(z) is found to be misspecified, the modeler needs to respecify it to account for the statistical information in the data. Only when the statistical adequacy of M_θ(z) is established should one proceed to the inference stage.
(d) The evaluation of the empirical validity of the structural model M_ϕ(z) begins with testing the validity of the over-identifying restrictions G(θ, ϕ) = 0 in the context of a statistically adequate M_θ(z).
(e) In cases where the over-identifying restrictions are rejected, the modeler needs to return to M_ϕ(z) in order to respecify it substantively, to account for the statistical regularities summarized by the statistically adequate M_θ(z). The misspecification/respecification scenarios proposed by (Del Negro et al. 2007) and (Del Negro and Schorfheide 2009) enter the modeling at this stage, and not before.

Author Contributions:
The two authors contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Data Availability Statement:
The data set used in this paper is the same as that used in (Smets and Wouters 2007).

Conflicts of Interest:
The authors declare no conflict of interest.