Indirect Inference: Which Moments to Match?

Frazier, David T.; Renault, Eric

doi:10.3390/econometrics7010014

by David T. Frazier ^1,* and Eric Renault ^2,†

^1 Department of Econometrics and Business Statistics, Monash University, Melbourne 3800, Australia

^2 Department of Economics, University of Warwick, Coventry CV4 7AL, UK

^* Author to whom correspondence should be addressed.

^† We thank Geert Dhaene for helpful comments and discussions.

Econometrics 2019, 7(1), 14; https://doi.org/10.3390/econometrics7010014

Submission received: 19 December 2018 / Revised: 17 February 2019 / Accepted: 7 March 2019 / Published: 19 March 2019

(This article belongs to the Special Issue Resampling Methods in Econometrics)

Abstract

The standard approach to indirect inference estimation considers that the auxiliary parameters, which carry the identifying information about the structural parameters of interest, are obtained from some just identified vector of estimating equations. In contrast to this standard interpretation, we demonstrate that the case of overidentified auxiliary parameters is both possible and, indeed, more commonly encountered than one may initially realize. We then revisit the “moment matching” and “parameter matching” versions of indirect inference in this context and devise efficient estimation strategies in this more general framework. Perhaps surprisingly, we demonstrate that if one were to consider the naive choice of an efficient Generalized Method of Moments (GMM)-based estimator for the auxiliary parameters, the resulting indirect inference estimators would be inefficient. In this general context, we demonstrate that efficient indirect inference estimation actually requires a two-step estimation procedure, whereby the goal of the first step is to obtain an efficient version of the auxiliary model. These two-step estimators are presented both within the context of moment matching and parameter matching.

Keywords: indirect inference; auxiliary models; overidentification

JEL Classification: C10; C14; C15

1. Introduction

More than twenty years ago, with the publication of their manuscript on “Efficient Method of Moments” (hereafter, EMM), Gallant and Tauchen (1996) (hereafter, GT) made a seminal contribution to the field of simulation-based estimation and inference. The EMM estimation approach proposed by GT estimates parameters of the underlying structural model by “matching moments” defined through a score generator, namely the score function of some hypothesized auxiliary model. In the EMM approach, efficiency of the resulting structural parameter estimators is achieved when the score function for the chosen auxiliary model asymptotically spans the score function of the well-specified parametric model that has generated the data.

The efficiency argument underlying EMM estimation was further developed in Gallant and Long (1997), where it was argued that the score function of the SNP (SemiNonParametric) density of Gallant and Nychka (1987) spans the score of most relevant distributions, at least when the number of terms, K, in the SNP expansion diverges to infinity as the sample size diverges. Similarly, and concurrent with the development of EMM, following Smith (1993), Gourieroux et al. (1993) (hereafter, GMR) demonstrate that, for well-chosen weighting matrices used for moment matching, indirect inference estimators based on score matching will be asymptotically equivalent to indirect inference estimators based on the direct matching of auxiliary parameters.

Therefore, the question of “Which Moments to Match” is actually independent of the efficient indirect inference estimation strategy employed, which we dub, by analogy with the trinity of asymptotic tests, the score approach (i.e., matching the score function of the auxiliary model) and the Wald approach (i.e., matching directly the estimators of auxiliary parameters). In contrast, GMR have shown that a likelihood ratio type of approach (also proposed by Smith (1993)) can lead to inefficient indirect inference estimators. Given the asymptotic equivalence between the score and Wald indirect inference approaches, the only pending issue for satisfactory application of indirect inference is then the selection of the auxiliary model.

Several authors, including Gallant and Long (1997), Andersen and Lund (1997) (hereafter, AL), and Gallant et al. (1997), have carefully discussed the choice of auxiliary model in the context of EMM, namely through the use of some SNP score generator. Not surprisingly they find that, as in any moment matching exercise, to achieve good finite-sample performance of the indirect inference estimator “it is important to conserve on the number of elements in the score generator” (AL), that is, on the number of moments to match within estimation. While this concern for parsimony is obviously sensible, one may consider different ways that it can be achieved. Basically, the aforementioned studies, as well as other EMM studies, put forward two principles.

Principle 1: Following Eastwood (1991), implementation of the SNP approach requires choosing the truncation degree in the expansion in an adaptive (i.e., random, data-dependent) manner. While AL interpret the results of Eastwood (1991) to suggest that AIC is the optimal model choice strategy for this adaptive truncation, they eventually decide to elicit a “choice of score generator(s) (...) guided by the more conservative HQC and BIC criteria”. In addition, Gallant and Long (1997) and Gallant et al. (1997) also use the BIC in determining the choice of the auxiliary model, while the latter stress that “to implement the EMM estimator we require a score generator that fits these data well”, leading to the use of the BIC to measure the trade-off between parsimony and goodness-of-fit.
Principle 2: For a given number of terms in the SNP expansion, the score generator can be interpreted as the score of an unconstrained parametric model, which ensures that, by definition, we end up with a just-identified set of moment conditions to match: the number of auxiliary parameters to estimate is exactly the number of components in the score vector. For instance, and in contrast to Gallant and Long (1997), who allow for conditional heterogeneity in the innovation density, AL (see p. 364) “find no evidence that such an extension is required”, and, by the same token, AL eliminate the additional heterogeneity parameter introduced by Gallant and Long (1997) and the corresponding moments to match. However, one may realize that an alternative approach would have been to add moment conditions aimed at utilizing the knowledge that this kind of heterogeneity is not present in the data, which would then lead to an overidentified set of moments to match.

The purpose of the present paper is to revisit the issue of selecting an auxiliary model that is optimal for the purpose of indirect inference, in terms of efficiency, without tying our hands by having to adhere to Principles 1 and 2 above. We contend that these principles are stricter than necessary for the following reasons.

(i): We argue that the first principle puts too much emphasis on the idea that “to implement the EMM estimator we require a score generator that fits these data well”. Gallant and Long (1997) show (see their Lemma 1, p. 135) that, under convenient regularity conditions, asymptotic efficiency is reached if and only if the linear span of a “true score” (i.e., the score of a well-specified parametric model for the structural model) is asymptotically included (at the true value of the structural parameters) in the linear span of the score of the auxiliary model. This does not require in any way that the auxiliary model is (even asymptotically for an arbitrarily large number of parameters) a well-specified model, in the sense that it is consistent with the Data Generating Process (DGP). Of course, as emphasized by GT, a sufficient condition for this score spanning property is the so-called smooth embedding of the score generator, which means that there is a one-to-one and twice continuously differentiable mapping between the two parametrizations (i.e., the auxiliary and structural). Then score spanning is just a consequence of computing compounded derivatives. However, this sufficient condition for score spanning is definitely not necessary and, thus, there is no logical argument to impose a model selection criterion, like AIC or BIC, to select an auxiliary model. We remind the reader that the purpose of the auxiliary model is not to describe the DGP, but to provide informative estimating equations. After all, it may be possible to satisfy the linear spanning condition by using a vector of moments that define well-suited auxiliary parameters but have no interpretation as a score function of a quasi-likelihood.
(ii): The next point is the realization that what determines the efficiency of indirect inference estimators is the moments they match, and not necessarily the auxiliary parameters. Hence, and in contrast to the example given in Principle 2 above, one may well contemplate using a set of moment conditions that overidentify the vector of unknown auxiliary parameters. Of course, moment estimation of the auxiliary parameters will eventually resort to a just-identified set of moment conditions, through the choice of a particular linear combination of the (possibly) overidentified moment conditions. However, we argue that the choice of this just-identified set of moment conditions should not be guided by efficiency of the resulting estimator of auxiliary parameters (as would be the case with an efficient two-step Generalized Method of Moments (GMM) estimator) but, on the contrary, by our goal of obtaining an asymptotically efficient indirect estimator of the structural parameters. We demonstrate that this new focus of interest produces a novel way to devise a two-step GMM estimator of the auxiliary parameters that is, in general, different from standard efficient two-step GMM estimators.

The fact that we wish to relax Principles 1 and 2, with our main focus on asymptotic efficiency of the elicited indirect inference estimator, does not mean that we overlook the need “to conserve on the number” of moments to match, for the purpose of finite-sample performance. However, our claim is that the trade-off between parsimony and asymptotic efficiency should set the focus on indirect estimation of structural parameters and not on goodness-of-fit of the auxiliary model. Ideally, one should devise a procedure to select the valid and relevant moments to match, for instance by resorting to “an information-based LASSO for GMM with many moments” as recently proposed by Cheng and Liao (2015). The key idea of the Cheng and Liao (2015) approach is to define a new adaptive penalty that ensures the valid and relevant moment conditions are consistently selected in the GMM shrinkage estimation. This adaptive penalty depends on a measure of the information content of the moment conditions. However, while the measure of information used by Cheng and Liao (2015) is naturally based on the asymptotic variance of the resulting GMM estimator, we do not really care about the asymptotic variance of the GMM estimator of our auxiliary parameters. As explained above, our focus of interest is the asymptotic variance of the resulting indirect inference estimator of structural parameters. To this end, the present paper demonstrates the relevant way to measure the information content of the moments we wish to match. While we only consider here a given finite set of moments, the use of this new measure of information for the purpose of moment selection in a possibly infinite set, by a well-tuned procedure in the spirit of Cheng and Liao (2015), is left for future research.

The remainder of the paper is organized as follows. In Section 2, we present a very general framework for indirect inference consisting of a set of moment conditions that are able to identify both the structural parameters (given a value of the auxiliary parameters) and the auxiliary parameters (given a value of the structural parameters). This duality leads us to revisit the issue of model choice criteria for eliciting an informative auxiliary model. In Section 3, we demonstrate that there exists an efficient choice of the auxiliary model that can be used to construct asymptotically efficient indirect inference estimators. The equivalence between moment matching and parameter matching is maintained in this general setting, and, in both cases, we devise an efficient two-step procedure to construct efficient indirect inference estimators, where the goal of the first-step estimator is to obtain an efficient choice of the auxiliary model. We conclude the paper in Section 4 by paving the way for these tools to be used in the context of several popular auxiliary models, including moment models and vector auto-regressive models, for which there is a clear trade-off between asymptotic efficiency and parsimony. Proofs of certain results are detailed in Appendix A.

2. Auxiliary versus Structural Models

We argue that the most general framework for Indirect Inference can be accommodated through a set of k moment conditions whose information content is two-fold identification:

Identification of the true unknown value $θ^{0}$ of the structural parameters $θ \in Θ \subset R^{d_{θ}}$ , for a given value (the true unknown one) of the auxiliary parameters.
Identification of the true unknown value $β^{0}$ of the auxiliary parameters $β \in B \subset R^{d_{β}}$ , for a given value (the true unknown one $θ^{0}$ ) of the structural parameters.

While the former identification will be encapsulated in the definition of the structural model, including a parametric model that will allow us to replace analytically intractable moment conditions by their Monte Carlo counterparts, evaluated at any possible value of $θ \in Θ$, the latter identification will remain true to the standard GMM setting, where the true unknown value $θ^{0}$ is implicitly contained in the data generating process (hereafter, DGP). Note that both identification schemes maintain that the moment conditions are numerous enough to identify, or overidentify, the structural (resp., auxiliary) parameters when the true values of the auxiliary (resp., structural) parameters are given; i.e., it must be that $k \geq \max \{d_{β}, d_{θ}\}$, since identifying $β$ requires $k \geq d_{β}$ and identifying $θ$ requires $k \geq d_{θ}$.

2.1. Auxiliary Model

We consider an auxiliary model characterized by a finite number of moment restrictions. We set our focus on a vector of auxiliary moment functions $g (y, β)$ taking values in $R^{k}$, which (possibly) depends on a random vector y and unknown auxiliary parameters $β \in B \subset R^{d_{β}}$. This model admits a true unknown value $β^{0}$ of the auxiliary parameters, which satisfies the moment conditions

$$E [g (y, β^{0})] = 0, \tag{1}$$

where $E [\cdot]$ denotes expectation taken with respect to the distribution of y. In other words, the unknown DGP is assumed to fulfill a vector of moment restrictions that are known up to a vector of auxiliary parameters $β$ that must be consistently estimated from data.

For this purpose, we have at our disposal a sequence of stationary observations ${\{y_{t}\}}_{t = 1}^{T}$ on the random vector y, which allows us to compute sample counterparts of the moment functions

$${\bar{g}}_{T} (β) = \frac{1}{T} \sum_{t = 1}^{T} g (y_{t}, β) .$$

In this section, we maintain the classical assumptions of Generalized Method of Moments (hereafter, GMM), implying that the true unknown value $β^{0}$ is identified, both globally and locally at first-order, and the moments are not redundant at the true value.

Assumption 1.

The following assumptions are satisfied.

(i): $E [g (y, β)] = 0 \Leftrightarrow β = β^{0}$ .
(ii): $J = \frac{\partial E [g (y, β)]}{\partial β^{'}} |_{β = β^{0}}$ is full column rank.
(iii): $Ω = \lim_{T \to \infty} Var [\sqrt{T} {\bar{g}}_{T} (β^{0})]$ is a positive definite matrix.

Any consistent estimator of $β^{0}$ based on these moment conditions, and denoted by ${\hat{β}}_{T} (K)$, can be seen, at least asymptotically, as the solution of the just identified set of equations

$$K {\bar{g}}_{T} ({\hat{β}}_{T} (K)) = \frac{1}{T} \sum_{t = 1}^{T} K g (y_{t}, {\hat{β}}_{T} (K)) = 0, \tag{2}$$

for some ($d_{β} \times k$)-dimensional selection matrix K, with rank $d_{β}$. For example, Equation (2) may be defined from the first-order conditions of the minimization problem

$$\min_{β \in B} {\| {\bar{g}}_{T} (β) \|}_{W}^{2}, \tag{3}$$

where ${\| x \|}_{W}^{2} := x^{'} W x$ denotes the weighted Euclidean norm. In such an example, the selection matrix K would actually be a random matrix depending on the observed sample, in which the Jacobian matrix J and possibly also the weighting matrix in the squared norm are replaced by sample counterparts.
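As a minimal illustration of how the first-order conditions of a weighted minimum-distance problem produce a selection matrix of the form $K = J^{'} W$, the following sketch uses a hypothetical pair of linear moments (two noisy measurements of the same mean). The data, the weighting matrix `W`, and all variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, beta0 = 500, 2.0

# Hypothetical data: two noisy measurements of the same mean beta0.
y = beta0 + rng.normal(size=(T, 2)) * np.array([1.0, 2.0])

def g_bar(beta):
    """Sample moments: mean over t of g(y_t, beta) = (y_1t - beta, y_2t - beta)'."""
    return y.mean(axis=0) - beta

# Jacobian of the moments with respect to beta (k x d_beta = 2 x 1).
J = -np.ones((2, 1))

# An arbitrary positive definite weighting matrix (an assumption).
W = np.array([[2.0, 0.5], [0.5, 1.0]])

# Closed-form minimizer of g_bar(beta)' W g_bar(beta) for these linear moments.
ones = np.ones(2)
beta_hat = (ones @ W @ y.mean(axis=0)) / (ones @ W @ ones)

# The first-order condition reads J' W g_bar(beta_hat) = 0, i.e. K = J' W.
K = J.T @ W
foc = K @ g_bar(beta_hat)
```

Here `foc` is zero up to rounding error, so the minimizer solves a just identified system of $d_{β} = 1$ equations built from $k = 2$ moments.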

However, recalling that our ultimate goal is efficient estimation of $θ^{0}$ and not efficient estimation of $β^{0}$, the results of Bates and White (1993) (see their Section 3.2), which imply that among all the discrepancy functions allowing us to consistently estimate $β^{0}$ the optimal one is a quadratic form of ${\bar{g}}_{T} (β)$, are not applicable in this context. Therefore, Equation (2) must be viewed as general estimating equations where the selection matrix K may be replaced by a random sample counterpart ${\hat{K}}_{T}$. However, as far as first-order asymptotic distributional theory is concerned, the choice of a consistent estimator ${\hat{K}}_{T}$ of a selection matrix K is immaterial. For this reason, we simplify the notation by overlooking the (possible) dependence of K on the observed sample. Since ${\hat{β}}_{T} (K)$ is defined only through the estimating Equation (2), and not necessarily via a minimization problem like (3), we will require a slight strengthening of Assumption 1 (ii).

Assumption 2.

The matrix $K J$ is non-singular.

We note that Assumption 2 would be implied by Assumption 1 in the case where the estimating equations in (2) are obtained from the minimization of a quadratic form. However, since we do not wish to impose such a restriction, we must explicitly maintain Assumption 2 in the general case. Assumption 2 implies that the selection matrix K is of full row rank and also further restricts it with respect to the Jacobian matrix J.

Note that it is quite natural in practice to expect that Assumption 2 is fulfilled. For example, if a subset $\tilde{g} (y, β)$ of the components of $g (y, β)$ just identifies $β$ (with a Jacobian matrix $\tilde{J}$ conformable to Assumption 1), then a selection matrix K built by combining $(k - d_{β})$ zero columns with a non-singular ($d_{β} \times d_{β}$)-dimensional matrix $\tilde{K}$ will satisfy Assumption 2 so long as the zero columns of K are such that they only elicit the components of $\tilde{g} (y, β)$ in the sense that:

$$K J = \tilde{K} \tilde{J} .$$

Under standard regularity conditions, Assumptions 1 and 2 together with a Taylor expansion of Equation (2) will allow us to view ${\hat{β}}_{T} (K)$ as a “linear estimator” with asymptotic expansion

$$\sqrt{T} [{\hat{β}}_{T} (K) - β^{0}] = - {(K J)}^{- 1} K \sqrt{T} {\bar{g}}_{T} (β^{0}) + o_{P} (1) . \tag{4}$$
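The step from Equation (2) to Equation (4) can be sketched as follows. This is only an outline, assuming that ${\bar{g}}_{T}$ is continuously differentiable near $β^{0}$ and that its sample Jacobian converges to J, which we take to be part of the unspecified regularity conditions.

```latex
\begin{aligned}
0 &= K \bar{g}_{T}\bigl(\hat{\beta}_{T}(K)\bigr) \\
  &= K \bar{g}_{T}(\beta^{0}) + K J \,\bigl[\hat{\beta}_{T}(K) - \beta^{0}\bigr] + o_{P}\bigl(T^{-1/2}\bigr), \\
\text{so that}\quad
\sqrt{T}\,\bigl[\hat{\beta}_{T}(K) - \beta^{0}\bigr]
  &= -(K J)^{-1} K \sqrt{T}\, \bar{g}_{T}(\beta^{0}) + o_{P}(1),
\end{aligned}
```

where the inversion in the last line is exactly where the non-singularity of $K J$ (Assumption 2) is used.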

In particular, $\sqrt{T} [{\hat{β}}_{T} (K) - β^{0}]$ will be asymptotically normal with variance

$$Σ_{K} (β^{0}) = {(K J)}^{- 1} K Ω K^{'} {[{(K J)}^{'}]}^{- 1} .$$

While efficient GMM estimation of $β^{0}$ would minimize this variance matrix, by eliciting a selection matrix $K = J^{'} Ω^{- 1}$ (to obtain the asymptotic variance ${[J^{'} Ω^{- 1} J]}^{- 1}$), it is not the purpose of the present paper.
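The sandwich form of $Σ_{K} (β^{0})$ and its reduction to ${[J^{'} Ω^{- 1} J]}^{- 1}$ at $K = J^{'} Ω^{- 1}$ can be checked numerically. The sketch below uses randomly generated stand-ins for J and $Ω$; these matrices and the helper name `sandwich_variance` are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d_beta = 4, 2

# Hypothetical Jacobian J (k x d_beta) and positive definite variance Omega.
J = rng.normal(size=(k, d_beta))
A = rng.normal(size=(k, k))
Omega = A @ A.T + k * np.eye(k)

def sandwich_variance(K, J, Omega):
    """Asymptotic variance (KJ)^{-1} K Omega K' [(KJ)']^{-1} of beta_hat(K)."""
    KJ_inv = np.linalg.inv(K @ J)
    return KJ_inv @ K @ Omega @ K.T @ KJ_inv.T

# Efficient GMM choice K = J' Omega^{-1}.
K_eff = J.T @ np.linalg.inv(Omega)
V_eff = sandwich_variance(K_eff, J, Omega)

# An alternative selection matrix that keeps only the first d_beta moments.
K_alt = np.hstack([np.eye(d_beta), np.zeros((d_beta, k - d_beta))])
V_alt = sandwich_variance(K_alt, J, Omega)

# Eigenvalues of V_alt - V_eff; non-negative up to rounding error.
gap_eigs = np.linalg.eigvalsh(V_alt - V_eff)
```

The non-negative eigenvalues in `gap_eigs` reflect that, for the auxiliary parameters alone, no selection matrix improves on $K = J^{'} Ω^{- 1}$.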

More generally, we note that, for any weighting matrix W, minimization of ${\| {\bar{g}}_{T} (β) \|}_{W}^{2}$ would amount to a selection matrix $K = J^{'} W$. However, as we will see in Section 3, the optimal selection matrix, for the purpose of efficient indirect inference estimation, does not belong to this general family.

2.2. The Structural Model

We assume the parametric structural model is characterized by a transition density function $p (y_{t} |Y_{t - 1}; θ)$, $θ \in Θ \subset R^{d_{θ}}$, with $Y_{t - 1} = \{y_{1}, y_{2}, \ldots, y_{t - 1}\}$. We denote, for $h = 1, \ldots, H$, by ${\{y_{t}^{(h)} (θ)\}}_{t = 1}^{T}$ a simulated sample path from $p (\cdot |\cdot; θ)$. Then, we end up with $(H + 1)$ mutually independent paths, each consisting of T stationary observations on y: the H simulated paths ${\{y_{t}^{(h)} (θ)\}}_{t = 1}^{T}$ ($h = 1, \ldots, H$), which can be computed for any possible value $θ \in Θ$ and always using the same random seeds, and the observed path ${\{y_{t}\}}_{t = 1}^{T}$. Throughout the remainder we let $E_{θ} [\cdot]$ denote the expectation taken with respect to the distribution of $y_{t}^{(h)} (θ)$.
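The requirement that simulated paths be computable for any $θ$ "always using the same random seeds" can be sketched as below. The AR(1) structural model, the function name `simulate_paths`, and the initialization at the first shock (rather than at a draw from the stationary distribution) are simplifying assumptions made for illustration, not part of the paper.

```python
import numpy as np

T, H = 200, 3

# Draw the shocks once, one fixed seed per simulated path, so that the same
# random numbers are reused for every candidate value of theta.
shocks = np.stack([np.random.default_rng(seed).normal(size=T) for seed in range(H)])

def simulate_paths(theta):
    """Return an (H, T) array of AR(1) paths y_t = theta * y_{t-1} + e_t."""
    paths = np.zeros((H, T))
    paths[:, 0] = shocks[:, 0]
    for t in range(1, T):
        paths[:, t] = theta * paths[:, t - 1] + shocks[:, t]
    return paths

paths_a = simulate_paths(0.5)
paths_b = simulate_paths(0.5)      # same seeds and same theta: identical paths
paths_c = simulate_paths(0.5001)   # nearby theta: paths move only slightly
```

Holding the shocks fixed makes the simulated moments a smooth, deterministic function of $θ$, which is what a numerical optimizer needs when minimizing over $θ$.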

As announced at the beginning of this section, we require identification assumptions about the structural parameters (for given $β^{0}$) similar to the identification assumptions maintained about $β$ in Assumption 1, and where $θ^{0}$ is implicitly given by the DGP. Note that, by definition of the true value $θ^{0}$, it corresponds to the DGP in the sense that, for all $β \in B$,

$$E_{θ^{0}} [g (y, β)] = E [g (y, β)] .$$

To ensure identification of $θ^{0}$, we maintain the following global and local identification assumptions.

Assumption 3.

The following assumptions are satisfied.

(i): $E_{θ} [g (y, β^{0})] = 0 \Leftrightarrow θ = θ^{0}$ .
(ii): $Γ = \frac{\partial E_{θ} [g (y, β^{0})]}{\partial θ^{'}} |_{θ = θ^{0}}$ is full column rank.

Assumption 3 is required to define a consistent and asymptotically normal indirect inference (hereafter, II) estimator of $θ^{0}$. More precisely, we consider II estimators defined through the following minimization program:

$${\hat{θ}}_{T, H} (K, W) = \arg \min_{θ \in Θ} {\left[ \frac{1}{T H} \sum_{h = 1}^{H} \sum_{t = 1}^{T} g (y_{t}^{(h)} (θ), {\hat{β}}_{T} (K)) \right]}^{'} W \left[ \frac{1}{T H} \sum_{h = 1}^{H} \sum_{t = 1}^{T} g (y_{t}^{(h)} (θ), {\hat{β}}_{T} (K)) \right], \tag{5}$$

where W is a given positive definite weighting matrix.^1
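To make program (5) concrete, here is a toy sketch in which the auxiliary parameter is estimated from one selected moment while all $k = 2$ simulated moments are matched. The structural model ($y_{t} = θ + e_{t}$ with standard normal shocks), the moment function, the identity weighting matrix, and the grid search are all assumptions chosen for simplicity; they are not the paper's application.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, theta0 = 2000, 5, 1.0

# Hypothetical structural model: y_t = theta + e_t with e_t ~ N(0, 1).
y_obs = theta0 + rng.normal(size=T)

# Simulation shocks are drawn once and reused for every theta.
sim_shocks = rng.normal(size=(H, T))

def g(y, beta):
    """Auxiliary moments (k = 2, d_beta = 1): E[y] = beta and E[y^2] = beta^2 + 1."""
    return np.stack([y - beta, y**2 - beta**2 - 1.0], axis=-1)

# Step 1: auxiliary estimator solving K g_bar(beta) = 0 with K = [1, 0],
# so beta_hat is the sample mean (only the first moment is selected).
beta_hat = y_obs.mean()

# Step 2: match all k = 2 simulated moments, as in program (5), with W = I.
W = np.eye(2)

def objective(theta):
    m = g(theta + sim_shocks, beta_hat).reshape(-1, 2).mean(axis=0)
    return m @ W @ m

# Crude grid search over theta (an optimizer could be used instead).
grid = np.linspace(-1.0, 3.0, 4001)
theta_hat = grid[np.argmin([objective(th) for th in grid])]
```

In this sketch the second moment plays no role in estimating $β$ but still enters the criterion minimized over $θ$, which is the feature of (5) that distinguishes it from matching only the selected moments.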

The II estimator ${\hat{θ}}_{T, H} (K, W)$ is a generalization of the score-matching one initially proposed by GT. In particular, GT consider a similar minimization program to (5), but specifically require that $g (\cdot)$ is the score vector of some parametric auxiliary model with parameters $β$, which ensures that the dimension of $g (\cdot)$ and the dimension of $β$ coincide. As a result, no selection matrix K is required in the GT approach.

In particular, our setting includes the case where, as in GT, the function $g (\cdot)$ is the score vector of some auxiliary model, but where this model is subject to a set of exclusion restrictions, such that the number of free parameters $d_{β}$ is much smaller than the dimension k of the score vector $g (\cdot)$. A typical example of this situation would be a Vector Auto-Regression (VAR) auxiliary model (see, e.g., Smith (1993)), which, for the sake of parsimony, is subject to some exclusion restrictions. Note that when AL eliminate the additional heterogeneity parameters introduced by Gallant and Long (1997), we would again be in a case where $k > d_{β}$ if some components of the complete score vector had not been arbitrarily eliminated.

In contrast to the GT approach, the II estimator (5) only requires $k \geq d_{β}$. This estimator is our focus of interest in this paper. We first note that we can describe this estimator through a standard linear representation, similarly to common II estimators: under standard regularity conditions, a Taylor expansion of the first-order conditions for (5) and the linear expansion of $\sqrt{T} [{\hat{β}}_{T} (K) - β^{0}]$ in (4) yield^2

$$\begin{aligned} \sqrt{T} [{\hat{θ}}_{T, H} (K, W) - θ^{0}] = {} & {[Γ^{'} W Γ]}^{- 1} Γ^{'} W \times \\ & \left\{ J {(K J)}^{- 1} K \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} g (y_{t}, β^{0}) - \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \left[ \frac{1}{H} \sum_{h = 1}^{H} g (y_{t}^{(h)} (θ^{0}), β^{0}) \right] \right\} + o_{P} (1) . \end{aligned} \tag{6}$$

From Equation (6), we can conclude that $\sqrt{T} [{\hat{θ}}_{T, H} (K, W) - θ^{0}]$ is an asymptotically normal vector whose asymptotic variance may be reduced by the fact that our minimization (5) involves the entire vector of moment functions $g (y, β)$ and not just the subset $g_{K} (y, β) = K g (y, β)$.

The reason for this possibility of variance reduction is two-fold. First, the component of (6) that is computed from simulated data uses the entire set of estimating functions $g (y, β)$ and not only the subset $g_{K} (y, β)$. This should be beneficial in terms of asymptotic variance in the same way that in standard GMM (see, e.g., the linear representation (4)), adding valid moment functions can only decrease (weakly) the asymptotic variance of the efficient GMM estimator. However, in the case of (6), this efficiency gain would vanish as the number of simulated paths, H, diverges to infinity.

Second, and more importantly, the multiplicative factor in the asymptotic expansion for ${\hat{θ}}_{T, H} (K, W)$ depends on the matrix ${[Γ^{'} W Γ]}^{- 1}$, which is determined by the Jacobian matrix for the entire set of estimating functions $g (\cdot, β)$ and not the subset of moments $g_{K} (\cdot, β)$. As such, and for $Ω$ as defined in Assumption 1, we know from the theory of efficient GMM estimation that an asymptotic variance ${[Γ^{'} Ω^{- 1} Γ]}^{- 1}$ is smaller than (or equal to) the asymptotic variance ${[Γ^{'} K^{'} {(K Ω K^{'})}^{- 1} K Γ]}^{- 1}$, where the latter is obtained by considering only the subset $g_{K} (y, β)$ of estimating functions.
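The ranking of these two variance matrices can be illustrated numerically. In the sketch below, `Gamma`, `Omega`, and `K` are randomly generated stand-ins (assumptions for illustration only), with dimensions chosen so that $k > d_{β} > d_{θ}$.

```python
import numpy as np

rng = np.random.default_rng(2)
k, d_beta, d_theta = 5, 3, 2

# Hypothetical structural Jacobian Gamma (k x d_theta), variance Omega (k x k)
# and selection matrix K (d_beta x k); none of these come from a real model.
Gamma = rng.normal(size=(k, d_theta))
A = rng.normal(size=(k, k))
Omega = A @ A.T + k * np.eye(k)
K = rng.normal(size=(d_beta, k))

# Variance based on the entire set of moments.
V_full = np.linalg.inv(Gamma.T @ np.linalg.inv(Omega) @ Gamma)

# Variance based only on the selected moments g_K = K g.
Gamma_K = K @ Gamma
Omega_K = K @ Omega @ K.T
V_sub = np.linalg.inv(Gamma_K.T @ np.linalg.inv(Omega_K) @ Gamma_K)

# Eigenvalues of V_sub - V_full; non-negative up to rounding error.
gap_eigs = np.linalg.eigvalsh(V_sub - V_full)
```

For a generic K the difference is strictly positive in some directions, which is the information loss that the optimal selection matrix is designed to avoid.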

The intuition behind this possible efficiency gain will be formally confirmed in Section 3. For this purpose, it is worth first shedding more light on the issue of model choice that, as explained in the introduction, has often been used in the extant literature for the selection of the auxiliary model.

2.3. Moment Selection Criterion

As already explained, the standard strategy for II based on a score generator amounts to picking a parsimonious auxiliary model, with the implication being that certain components within a possibly large set of estimating functions are eliminated (jointly with some auxiliary parameters) to arrive at a vector of just identified auxiliary parameters $β$. We can nest this particular case within our general setup by examining the case where, all along the II estimation strategy, including the minimization (5), we decide to use only a given just identified subset of $g (y, β)$, namely

$$g_{K} (y, β) = K g (y, β) .$$

When considering the just identified subset of moments $g_{K} (y, β)$, we must redefine the Jacobian matrices J and $Γ$ by $J_{K} = K J$ and $Γ_{K} = K Γ$, respectively, as well as the variance matrix $Ω_{K} = K Ω K^{'}$. Denoting $W_{K}$ as the weighting matrix used in the minimization program (5) (with $g (y, β)$ replaced by $g_{K} (y, β)$), we can deduce that the corresponding II estimator has the following linear asymptotic expansion:

$$\begin{aligned} \sqrt{T} [{\hat{θ}}_{T, H}^{(K)} (K, W_{K}) - θ^{0}] = {} & {[Γ_{K}^{'} W_{K} Γ_{K}]}^{- 1} Γ_{K}^{'} W_{K} \times \\ & \left\{ \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} g_{K} (y_{t}, β^{0}) - \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \left[ \frac{1}{H} \sum_{h = 1}^{H} g_{K} (y_{t}^{(h)} (θ^{0}), β^{0}) \right] \right\} + o_{P} (1), \end{aligned}$$

and the asymptotic variance of this II estimator is given by

$$Σ_{H}^{(K)} (K, W_{K}) = \left( 1 + \frac{1}{H} \right) {[Γ_{K}^{'} W_{K} Γ_{K}]}^{- 1} Γ_{K}^{'} W_{K} Ω_{K} W_{K} Γ_{K} {[Γ_{K}^{'} W_{K} Γ_{K}]}^{- 1} .$$

In this case, the optimal choice of the weighting matrix is $W_{K} = Ω_{K}^{- 1}$, which would lead to an allegedly optimal asymptotic variance given by

$$Σ_{H}^{* (K)} (K) = \left( 1 + \frac{1}{H} \right) {[Γ_{K}^{'} Ω_{K}^{- 1} Γ_{K}]}^{- 1} . \tag{7}$$

However, the optimality of this II estimator is questionable. In particular, a more efficient II estimator could be obtained if we could overcome the information loss due to the selection of moments through the matrix K, which would allow us to obtain the asymptotic variance

$$Σ_{H}^{*} = \left( 1 + \frac{1}{H} \right) {[Γ^{'} Ω^{- 1} Γ]}^{- 1} . \tag{8}$$

Surprisingly enough, we will show in the next section that (8) is an efficiency bound and that it can be feasibly reached in the standard context of II where $d_{β} \geq d_{θ}$, i.e., when the vector of auxiliary parameters is large enough to possibly identify the structural parameters. However, it will turn out that obtaining such an efficiency bound will require a two-step II estimation procedure, where an optimal selection matrix $K^{*}$ will be estimated in the first step, and the second step will use this estimator to deduce an II estimator that is capable of reaching the efficiency bound (8).

As one would suspect, two-step II estimators that reach the efficiency bound in (8) can be obtained under both the “moment matching” and “parameter matching” approaches to II. We note here that the reference to “moment matching” and “parameter matching” is explicit, and from here on we will no longer refer to such approaches as “score matching” and “Wald approach”, since the auxiliary model may not be a parametric model endowed with a score function. In particular, in this generalization it may not be possible to interpret the estimator of the auxiliary parameters as a quasi-maximum likelihood estimator, as these auxiliary estimators will be based on a preliminary estimator of an optimal selection matrix, $K^{*}$, the solution of the equation

$$Γ_{K}^{'} Ω_{K}^{- 1} Γ_{K} = Γ^{'} Ω^{- 1} Γ . \tag{9}$$

In this way, the optimal choice of the selection matrix K must be related to the Jacobian matrix $Γ$ of the structural model and is definitely not what would be elicited by a model selection criterion blindly applied to the auxiliary model. As will be made clear later, the optimal selection matrix $K^{*}$, defined through (9), is not in general consistent with GMM estimation of the auxiliary parameters.

3. Which Moments to Match?

3.1. Optimal Selection Matrix K

The goal of this subsection is to deduce a selection matrix K that solves Equation (9). For this purpose, it is worth revisiting Equation (7) in terms of orthogonal projections. More precisely, let us associate to any given selection matrix K, of dimension $d_{β} \times k$ and rank $d_{β}$, a full column rank matrix X, defined as

$$X = {(Ω^{1 / 2})}^{'} K^{'},$$

where $Ω^{1 / 2}$ is any matrix such that

$$Ω = (Ω^{1 / 2}) {(Ω^{1 / 2})}^{'} .$$

We can then write

$$\begin{aligned} Γ_{K}^{'} Ω_{K}^{- 1} Γ_{K} & = Γ^{'} K^{'} {(K Ω K^{'})}^{- 1} K Γ \\ & = Γ^{'} {[{(Ω^{1 / 2})}^{'}]}^{- 1} X {(X^{'} X)}^{- 1} X^{'} {[Ω^{1 / 2}]}^{- 1} Γ \\ & = {\tilde{Γ}}^{'} P_{X} \tilde{Γ}, \end{aligned}$$

where

$$\tilde{Γ} = {[Ω^{1 / 2}]}^{- 1} Γ \quad \text{and} \quad P_{X} = X {(X^{'} X)}^{- 1} X^{'} .$$

Taking $A \succeq B$ to mean that the difference $(A - B)$ is positive semi-definite, and using the fact that the orthogonal projection matrix $P_{X}$ is symmetric and idempotent, we deduce

$${[Γ_{K}^{'} Ω_{K}^{- 1} Γ_{K}]}^{- 1} = {[{(P_{X} \tilde{Γ})}^{'} (P_{X} \tilde{Γ})]}^{- 1} \succeq {[{\tilde{Γ}}^{'} \tilde{Γ}]}^{- 1} = {[Γ^{'} Ω^{- 1} Γ]}^{- 1},$$

with equality if

$$P_{X} \tilde{Γ} = \tilde{Γ} . \tag{10}$$

The above confirms that the asymptotic variance of the II estimator ${\hat{θ}}_{T, H} (K, Ω_{K}^{- 1})$ does not, in general, reach the efficiency bound (8) but can reach this bound when condition (10) is fulfilled, which requires that the columns of $\tilde{Γ}$ are in the range of the matrix X; i.e., this requires that

$${[Ω^{1 / 2}]}^{- 1} Γ = X Λ = {(Ω^{1 / 2})}^{'} K^{'} Λ$$

for some matrix $Λ$.

In other words, a matrix K will satisfy (9) if there exists some matrix $Λ$ such that

$$K^{'} Λ = Ω^{- 1} Γ . \tag{11}$$
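The projection representation behind conditions (10) and (11) can be verified numerically. The sketch below takes the Cholesky factor as one admissible choice of $Ω^{1 / 2}$; the matrices `Gamma`, `Omega`, and `K` are random placeholders assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
k, d_beta, d_theta = 5, 3, 2

# Hypothetical inputs (illustrative only).
Gamma = rng.normal(size=(k, d_theta))
A = rng.normal(size=(k, k))
Omega = A @ A.T + k * np.eye(k)
K = rng.normal(size=(d_beta, k))

# One valid square root: the Cholesky factor, Omega = L L'.
L = np.linalg.cholesky(Omega)

X = L.T @ K.T                            # X = (Omega^{1/2})' K'
P_X = X @ np.linalg.inv(X.T @ X) @ X.T   # orthogonal projection onto range(X)
Gamma_tilde = np.linalg.solve(L, Gamma)  # [Omega^{1/2}]^{-1} Gamma

# Both sides of the identity Gamma_K' Omega_K^{-1} Gamma_K = Gamma~' P_X Gamma~.
lhs = (K @ Gamma).T @ np.linalg.inv(K @ Omega @ K.T) @ (K @ Gamma)
rhs = Gamma_tilde.T @ P_X @ Gamma_tilde

# Gap in condition (10); generally non-zero for an arbitrary K.
gap_10 = np.linalg.norm(P_X @ Gamma_tilde - Gamma_tilde)
```

With an arbitrary K the gap `gap_10` is non-zero, so the bound (8) is missed; it closes only when the columns of $Ω^{- 1} Γ$ lie in the column space of $K^{'}$, as stated in (11).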

If the selection matrix K fulfills Equation (11), not only will the corresponding II estimator ${\hat{θ}}_{T, H} (K, Ω_{K}^{- 1})$ reach the efficiency bound in (8), but it will also be asymptotically equivalent to the optimal II estimator ${\hat{θ}}_{T, H} (K, W)$ defined in (5). To see this, we note from Equation (6) that

$$\begin{aligned} \underset{T \to \infty}{\operatorname{plim}} \operatorname{Var} \{\sqrt{T} [{\hat{θ}}_{T, H} (K, W) - θ^{0}]\} = {} & {[Γ^{'} W Γ]}^{- 1} Γ^{'} W J {(K J)}^{- 1} K Ω K^{'} {(J^{'} K^{'})}^{- 1} J^{'} W Γ {[Γ^{'} W Γ]}^{- 1} \\ & + \frac{1}{H} {[Γ^{'} W Γ]}^{- 1} Γ^{'} W Ω W Γ {[Γ^{'} W Γ]}^{- 1} . \end{aligned}$$

The second term is obviously minimized by choosing

W = Ω^{- 1}

. Interestingly enough, this choice is also optimal for the minimization of the first term, at least when condition (11) is fulfilled. To see this, note that, at this choice of W and under (11),

\begin{matrix} Γ^{'} W J {(K J)}^{- 1} K Ω K^{'} {(J^{'} K^{'})}^{- 1} J^{'} W Γ & = & Γ^{'} Ω^{- 1} J {(K J)}^{- 1} K Ω K^{'} {(J^{'} K^{'})}^{- 1} J^{'} Ω^{- 1} Γ \\ & = & Λ^{'} K J {(K J)}^{- 1} K Ω K^{'} {(J^{'} K^{'})}^{- 1} J^{'} K^{'} Λ \\ & = & Λ^{'} K Ω K^{'} Λ \\ & = & Γ^{'} Ω^{- 1} Ω Ω^{- 1} Γ \\ & = & Γ^{'} Ω^{- 1} Γ . \end{matrix}

Hence, when K satisfies Equation (11), the asymptotic variance of

{\hat{θ}}_{T, H} (K, W)

is minimized at

W = Ω^{- 1}

, and this asymptotic variance achieves the efficiency bound in (8). Moreover, we note that when

d_{β} \geq d_{θ}

, it is always possible to construct a choice of K conformable to (11); a selection matrix satisfying Equation (11) is given by

K^{*'} = [\begin{matrix} Ω^{- 1} Γ & C \end{matrix}],

(12)

for C an arbitrary

k \times (d_{β} - d_{θ})

-dimensional matrix, with rank

(d_{β} - d_{θ})

and whose columns do not belong to the space spanned by the columns of

Ω^{- 1} Γ

. In this case, Equation (11) is fulfilled for

Λ = [\begin{matrix} Id_{d_{θ}} \\ 0 \end{matrix}],

where the zero lower block of

Λ

has dimension

(d_{β} - d_{θ}) \times d_{θ} .
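As a numerical illustration of the construction in (12), the sketch below builds a selection matrix of this form from randomly drawn inputs, checks condition (11) with the block matrix Λ given above, and evaluates the asymptotic variance displayed after Equation (11) at the inverse of Ω. All matrices, dimensions, and function names are hypothetical choices made for this example.

```python
import numpy as np

def optimal_selection(Gamma, Omega, C):
    # K* from Equation (12): K*' = [Omega^{-1} Gamma, C].
    return np.hstack([np.linalg.solve(Omega, Gamma), C]).T

def avar_moment_matching(K, W, Gamma, Omega, J, H):
    # Two-term asymptotic variance of the moment matching estimator for given K and W.
    A = np.linalg.solve(Gamma.T @ W @ Gamma, Gamma.T @ W)
    M = J @ np.linalg.solve(K @ J, K)        # J (K J)^{-1} K
    return A @ (M @ Omega @ M.T + Omega / H) @ A.T

rng = np.random.default_rng(1)
k, d_beta, d_theta, H = 4, 3, 2, 10          # illustrative dimensions (assumption)
A0 = rng.standard_normal((k, k))
Omega = A0 @ A0.T + np.eye(k)                # positive definite by construction
Gamma = rng.standard_normal((k, d_theta))
J = rng.standard_normal((k, d_beta))
C = rng.standard_normal((k, d_beta - d_theta))

K_star = optimal_selection(Gamma, Omega, C)
Lam = np.vstack([np.eye(d_theta), np.zeros((d_beta - d_theta, d_theta))])
lhs_11 = K_star.T @ Lam                      # left-hand side of (11)
rhs_11 = np.linalg.solve(Omega, Gamma)       # right-hand side of (11)

W = np.linalg.inv(Omega)
avar_star = avar_moment_matching(K_star, W, Gamma, Omega, J, H)
target = (1 + 1 / H) * np.linalg.inv(Gamma.T @ W @ Gamma)
```

In this sketch `avar_star` coincides with `target`, the sum of the two terms of the variance when each equals its minimized value, and the result does not depend on the particular draw of C.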

To summarize, three comments are in order. First,

Σ_{H}^{*}

defined by (8) is obviously an asymptotic efficiency bound for estimation of

θ^{0}

, at least when the number H of simulations goes to infinity. The asymptotic variance

{lim}_{H \to \infty} Σ_{H}^{*}

corresponds to the asymptotic variance of the efficient GMM estimator based on the infeasible moment conditions for

θ

:

E_{θ} [g (y_{t}^{(h)} (θ), β^{0})] = 0,

which are infeasible since they depend on the unknown value

β^{0}

of the auxiliary parameters.

Second, and importantly, this efficiency bound is feasible, insofar as

d_{β} \geq d_{θ}

and, of course, up to consistent estimation of a selection matrix

K^{*}

satisfying (12) (the following two subsections describe the construction of such a consistent estimator).^3 It must be stressed that the identities in (7) and (8) imply that, when

K = K^{*}

, there is no additional efficiency gain when working with the whole vector

g (y_{t}^{(h)} (θ), {\hat{β}}_{T} (K))

of moment functions in the moment matching estimator (5).

Third, and in contrast to the above point, when

d_{β} < d_{θ}

we cannot reach this efficiency bound in general and it would be more efficient to use the entire vector

g (y_{t}^{(h)} (θ), {\hat{β}}_{T} (K))

of moment functions in the moment matching estimator (5). Otherwise, any selection matrix K used in the computation of

{\hat{θ}}_{T, H}^{(K)} (K, W)

will likely lead to some information loss, since it cannot be chosen in an optimal manner (according to minimal asymptotic variance).

3.2. Efficient Two-Step Moment Matching

When

d_{β} \geq d_{θ}

, an efficient two-step II estimation procedure for

θ^{0}

can proceed as follows. Let

{\tilde{β}}_{T} = {\hat{β}}_{T} (K)

be associated to some arbitrary (

d_{β} \times k

)-dimensional selection matrix K with rank

d_{β}

. As with standard two-step GMM, this will allow us to compute a consistent estimator

{\hat{Ω}}_{T}

of the matrix

Ω .

In turn, this allows us to compute a consistent estimator

{\tilde{θ}}_{T}

of

θ^{0}

as

{\tilde{θ}}_{T} = arg min_{θ \in Θ} {[\frac{1}{T} \sum_{t = 1}^{T} g (y_{t}^{(1)} (θ), {\tilde{β}}_{T})]}^{'} {\hat{Ω}}_{T}^{- 1} [\frac{1}{T} \sum_{t = 1}^{T} g (y_{t}^{(1)} (θ), {\tilde{β}}_{T})] .

(13)

If

{\tilde{β}}_{T}

were almost surely equal to the true unknown value

β^{0}

, the estimator of

θ^{0}

defined by (13) would actually reach the efficiency bound (8). To see this, recall that, as already mentioned, the efficiency bound actually coincides with the asymptotic variance of an efficient GMM estimator of

θ^{0}

based on the moment conditions

E_{θ} [g (y, β^{0})] = 0

. Unfortunately, these moment conditions are not feasible in general and, thus,

{\tilde{θ}}_{T}

incurs some efficiency loss because we have used a first-step estimator

{\tilde{β}}_{T} = {\hat{β}}_{T} (K)

, of

β^{0}

, that is sub-optimal (as far as estimation of

θ^{0}

is concerned). Therefore,

{\tilde{θ}}_{T}

is nothing but a possibly inefficient “first-step” consistent estimator of

θ^{0}

. Indeed, there is no compelling reason to consider the first-step estimator

{\tilde{θ}}_{T}

, defined by (13), over the more naive first-step estimator

{\tilde{θ}}_{T} = arg min_{θ \in Θ} {∥\frac{1}{T} \sum_{t = 1}^{T} g (y_{t}^{(1)} (θ), {\tilde{β}}_{T})∥}^{2},

since both estimators are inefficient.

So long as

{\tilde{θ}}_{T}

is a consistent estimator of

θ^{0}

(irrespective of the weighting matrix used for its computation), we can deduce a consistent estimator

{\hat{Γ}}_{T, H^{*}}

, of

Γ (θ^{0})

, using^4

{\hat{Γ}}_{T, H^{*}} = \frac{\partial}{\partial θ^{'}} {\{\frac{1}{T H^{*}} \sum_{h = 1}^{H^{*}} \sum_{t = 1}^{T} g (y_{t}^{(h)} (θ), {\tilde{β}}_{T})\}}_{θ = {\tilde{θ}}_{T}} .

Using this consistent estimator of

Γ (θ^{0})

, define the selection matrix

{\hat{K}}_{T}

as

{\hat{K}}_{T}^{'} = [\begin{matrix} {\hat{Ω}}_{T}^{- 1} {\hat{Γ}}_{T, H^{*}} & C_{T} \end{matrix}]

where

C_{T}

is an arbitrary

k \times (d_{β} - d_{θ})

-dimensional matrix with rank

(d_{β} - d_{θ})

, and whose columns do not belong to the space spanned by the columns of

{\hat{Ω}}_{T}^{- 1} {\hat{Γ}}_{T, H^{*}}

.

For

C_{T} \to_{p} C

, where C is an arbitrary

k \times (d_{β} - d_{θ})

-dimensional matrix, with rank

(d_{β} - d_{θ})

, and whose columns do not belong to the space spanned by the columns of

Ω^{- 1} Γ (θ^{0})

, we can conclude that

\underset{T \to \infty}{plim} [\begin{matrix} {\hat{Γ}}_{T, H^{*}}^{'} {\hat{Ω}}_{T}^{- 1} \\ C_{T}^{'} \end{matrix}] = [\begin{matrix} Γ^{'} (θ^{0}) Ω^{- 1} \\ C^{'} \end{matrix}] = K^{*}

Now, letting

{\hat{β}}_{T} = {\hat{β}}_{T} ({\hat{K}}_{T})

be defined as the solution of

{\hat{K}}_{T} {\bar{g}}_{T} ({\hat{β}}_{T}) = \frac{1}{T} \sum_{t = 1}^{T} {\hat{K}}_{T} g (y_{t}, {\hat{β}}_{T}) = 0,

it directly follows, from the above arguments, that the feasible two-step II estimator

{\hat{θ}}_{T, H} ({\hat{K}}_{T}, {\hat{Ω}}_{T}^{- 1}) = arg min_{θ \in Θ} {[\frac{1}{T H} \sum_{h = 1}^{H} \sum_{t = 1}^{T} g (y_{t}^{(h)} (θ), {\hat{β}}_{T} ({\hat{K}}_{T}))]}^{'} {\hat{Ω}}_{T}^{- 1} [\frac{1}{T H} \sum_{h = 1}^{H} \sum_{t = 1}^{T} g (y_{t}^{(h)} (θ), {\hat{β}}_{T} ({\hat{K}}_{T}))]

will reach the efficiency bound in Equation (8).
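The two-step procedure can be sketched end to end on a toy design. The example below is only an illustration under strong assumptions that are not in the paper: a hypothetical structural model y = θ + √θ·e with e standard normal and true value 1, hypothetical auxiliary moments g(y, β) = (y − β, y² − β² − 1), i.i.d. data (so a plain sample covariance is used for the estimate of Ω), and the case where β and θ have the same dimension, so that no block C_T is needed. The Newton solver and the golden-section search are convenience choices, not the authors' implementation.

```python
import numpy as np

THETA0 = 1.0  # true structural parameter in this toy design (assumption)

def g(y, beta):
    # Toy auxiliary moments (k = 2, d_beta = 1): mean and second moment of N(beta, 1).
    return np.column_stack([y - beta, y**2 - beta**2 - 1.0])

def simulate(theta, shocks):
    # Toy structural model y = theta + sqrt(theta) * e; the shocks are held fixed.
    return theta + np.sqrt(theta) * shocks

def sim_moments(theta, beta, shocks):
    # Average of g over all simulated observations (all paths pooled).
    return g(simulate(theta, shocks).ravel(), beta).mean(axis=0)

def objective(theta, beta, shocks, W):
    m = sim_moments(theta, beta, shocks)
    return m @ W @ m

def golden_min(f, lo, hi, iters=80):
    # Golden-section search for the minimizer of a unimodal scalar function.
    r = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - r * (b - a), a + r * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

def solve_beta(K, y, start, iters=50):
    # Newton iterations on the scalar estimating equation K @ gbar(beta) = 0.
    beta = start
    for _ in range(iters):
        val = K @ g(y, beta).mean(axis=0)
        slope = K @ np.array([-1.0, -2.0 * beta])   # K times d gbar / d beta
        beta = beta - val / slope
    return float(beta)

def two_step_ii(y, shocks, shocks_star, K0):
    # Step 1: preliminary beta from an arbitrary K0, then Omega-hat and theta-tilde.
    beta_tilde = solve_beta(K0, y, start=y.mean())
    W = np.linalg.inv(np.cov(g(y, beta_tilde), rowvar=False))
    theta_tilde = golden_min(lambda th: objective(th, beta_tilde, shocks[:1], W), 0.2, 3.0)
    # Gamma-hat: central finite difference of the simulated moments at theta-tilde.
    eps = 1e-4
    Gamma_hat = (sim_moments(theta_tilde + eps, beta_tilde, shocks_star)
                 - sim_moments(theta_tilde - eps, beta_tilde, shocks_star)) / (2 * eps)
    # Step 2: estimated selection matrix, re-estimated beta, final II estimator.
    K_hat = W @ Gamma_hat
    beta_hat = solve_beta(K_hat, y, start=beta_tilde)
    theta_hat = golden_min(lambda th: objective(th, beta_hat, shocks, W), 0.2, 3.0)
    return theta_tilde, theta_hat, K_hat

rng = np.random.default_rng(0)
T, H, H_star = 5000, 5, 20
y_obs = simulate(THETA0, rng.standard_normal(T))
theta_tilde, theta_hat, K_hat = two_step_ii(
    y_obs, rng.standard_normal((H, T)), rng.standard_normal((H_star, T)),
    K0=np.array([1.0, 0.0]))
```

Holding the simulation shocks fixed across values of θ keeps the objective smooth, which is what makes both the finite-difference derivative and the scalar search well behaved in this sketch.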

3.3. Efficient Two-Step Parameter Matching

We now examine a version of the above II estimator based on parameter matching, where we explicitly work under the assumption that the “auxiliary” parameters

β

“identify” the structural parameters

θ

. Note that this assumption was not maintained for the general moment matching considered above. Assumption 4 thus complements Assumption 3 in this respect.

Assumption 4.

The selection matrix K is a (

d_{β} \times k

)-dimensional matrix with rank

d_{β}

, which fulfills the following conditions.

(i): There exists a function $b_{K} (.)$ such that, for all $θ \in Θ$ ,

$E_{θ} [K g (y, b_{K} (θ))] = 0 .$
(ii): $b_{K} (θ^{0}) = β^{0} = {plim}_{T \to \infty} {\hat{β}}_{T} (K)$ and $b_{K} (θ) = {plim}_{T \to \infty} {\tilde{β}}_{T, H} (θ, K)$ , where ${\tilde{β}}_{T, H} (θ, K)$ is the solution of

$\frac{1}{T H} \sum_{t = 1}^{T} \sum_{h = 1}^{H} K g (y_{t}^{(h)} (θ), {\tilde{β}}_{T, H} (θ, K)) = 0 .$
(iii): $b_{K} (θ) = β^{0} \Leftrightarrow θ = θ^{0}$ .
(iv): The Jacobian matrix $\partial b_{K} (θ^{0}) / \partial θ^{'}$ has rank $d_{θ}$ .

In this context, we can define a parameter matching II estimator of

θ^{0}

as

{\hat{θ}}_{T, H}^{P} (K, W) = arg min_{θ \in Θ} {[{\hat{β}}_{T} (K) - {\tilde{β}}_{T, H} (θ, K)]}^{'} W_{T} [{\hat{β}}_{T} (K) - {\tilde{β}}_{T, H} (θ, K)],

(14)

where

W = {plim}_{T \to \infty} W_{T}

is a

(d_{β} \times d_{β})

-dimensional positive definite matrix. GMR show that (for given K), asymptotically efficient estimation of

θ^{0}

is delivered by an optimal choice

W (K)

of W that is the inverse of the asymptotic variance of

{\hat{β}}_{T} (K)

. Therefore, from the expansion in Equation (4), the optimal

W (K)

is given by

W (K) = {\{{(K J)}^{- 1} (K Ω K^{'}) {(J^{'} K^{'})}^{- 1}\}}^{- 1} .

Hence, by using the result of GMR, the asymptotic variance of the II estimator in Equation (14) is given by

\underset{T \to \infty}{plim} Var [\sqrt{T} [{\hat{θ}}_{T, H}^{P} [K, W (K)] - θ^{0}]] = (1 + \frac{1}{H}) {\{\frac{\partial b_{K}^{'} (θ^{0})}{\partial θ} {(K J)}^{'} {(K Ω K^{'})}^{- 1} (K J) \frac{\partial b_{K} (θ^{0})}{\partial θ^{'}}\}}^{- 1} .

(15)

Moreover, by differentiating the identity

K E_{θ} [g (y_{t}^{(h)} (θ), b_{K} (θ))] = 0,

we obtain

K Γ (θ^{0}) + K J \frac{\partial b_{K} (θ^{0})}{\partial θ^{'}} = 0

(16)

so that Equation (15) can be rewritten as

\underset{T \to \infty}{plim} Var [\sqrt{T} [{\hat{θ}}_{T, H}^{P} [K, W (K)] - θ^{0}]] = (1 + \frac{1}{H}) {\{Γ^{'} (θ^{0}) K^{'} {(K Ω K^{'})}^{- 1} K Γ (θ^{0})\}}^{- 1} .
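The passage from (15) to the expression in terms of Γ can be verified numerically. The sketch below assumes randomly drawn matrices and recovers the Jacobian of the binding function from Equation (16); names and dimensions are illustrative only.

```python
import numpy as np

def optimal_weight(K, Omega, J):
    # W(K) = {(KJ)^{-1} (K Omega K') (J'K')^{-1}}^{-1} = (KJ)' (K Omega K')^{-1} (KJ).
    KJ = K @ J
    return KJ.T @ np.linalg.solve(K @ Omega @ K.T, KJ)

def avar_eq15(K, Gamma, Omega, J, H):
    # Equation (15), with db_K/dtheta' solved from Equation (16).
    db = -np.linalg.solve(K @ J, K @ Gamma)
    info = db.T @ optimal_weight(K, Omega, J) @ db
    return (1 + 1 / H) * np.linalg.inv(info)

def avar_gamma_form(K, Gamma, Omega, H):
    # The rewritten variance, expressed through K Gamma only.
    GK = K @ Gamma
    info = GK.T @ np.linalg.solve(K @ Omega @ K.T, GK)
    return (1 + 1 / H) * np.linalg.inv(info)

rng = np.random.default_rng(2)
k, d_beta, d_theta, H = 4, 3, 2, 10          # illustrative dimensions (assumption)
A0 = rng.standard_normal((k, k))
Omega = A0 @ A0.T + np.eye(k)
Gamma = rng.standard_normal((k, d_theta))
J = rng.standard_normal((k, d_beta))
K = rng.standard_normal((d_beta, k))

v_param = avar_eq15(K, Gamma, Omega, J, H)
v_moment = avar_gamma_form(K, Gamma, Omega, H)
```

The two matrices `v_param` and `v_moment` agree up to rounding for any full-rank K with invertible KJ, since J enters (15) only through the product that Equation (16) ties to KΓ.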

Not surprisingly, we find that the optimal parameter matching II estimator

{\hat{θ}}_{T, H}^{P} [K, W (K)]

based on a selection matrix K is asymptotically equivalent to the optimal moment matching II estimator

{\hat{θ}}_{T, H} (K, Ω^{- 1})

based on the same selection matrix. As a consequence, an optimal choice

K^{*}

satisfying Equation (11) will allow us to reach the efficiency bound (8). Moreover, this optimal selection matrix is the same for the optimal II estimator based on moment matching and the optimal II estimator based on parameter matching, with the two estimators being asymptotically equivalent. Indeed, an efficient two-step parameter matching estimator can be devised in a similar fashion to the efficient two-step moment matching estimator described in the previous subsection.

However, it is important to note the distinction between the required identification conditions underpinning the two estimators, i.e., the moment matching estimator,

{\hat{θ}}_{T, H} (K, W)

, and the parameter matching estimator,

{\hat{θ}}_{T, H}^{P} (K, W)

. Specifically, the identification conditions underpinning the parameter matching approach are much more restrictive than those required for the moment matching estimator. In particular, Assumption 3 simply assumes, for given

β^{0}

, identification of the moment conditions at

θ^{0}

, as well as the existence of a consistent and asymptotically normal estimator

{\hat{β}}_{T} (K)

. However, in addition to the existence of a consistent and asymptotically normal estimator

{\hat{β}}_{T} (K)

, Assumption 4 assumes that we consider only selection matrices K for which there exists both a continuously differentiable limit map,

b_{K} (θ)

, with full column-rank Jacobian, as well as a consistent simulation-based estimator of this map.

3.4. Interpretation of Results and Discussion

As alluded to in Section 3.1, the optimal asymptotic variance (8) demonstrates that we do not pay a price for ignoring the value of

β^{0}

in terms of optimal II estimation of

θ^{0}

, which means that

{lim}_{H \to \infty} Σ_{H}^{*}

corresponds to the asymptotic variance of an efficient GMM estimator for the infeasible moment conditions about

θ

,

E_{θ} [g (y_{t}^{(h)} (θ), β^{0})] = 0 .

The intuition behind this result is relatively simple and can be expressed as follows: when the binding function

b (.)

is known, which is precisely the case for an infinite number of simulations, we can estimate

β^{0}

through additional estimating equations

β - b (θ) = 0

that just identify

β

. This result echoes the following well-known result in GMM estimation theory (see, e.g., (Breusch et al. 1999)): when additional moment restrictions just identify the additional nuisance parameters that they introduce, they do not modify the accuracy of the efficient GMM estimator of the parameters of interest. As already announced in the introduction, what really matters for the efficiency of the II estimators is a choice of the auxiliary model well focused on the structural model. Hence the definition of the optimal selection matrix

K^{*}

.

To better understand this issue, imagine a favorable case where the overidentifying information pertains to the complete path of the binding function, meaning that, for all

θ \in Θ

,

E_{θ} [g (y, b (θ))] = 0 .

(17)

In other words, the binding function

θ \mapsto b (θ)

does not depend on a specific selection matrix K, and thus there is no conflict between efficient estimation of

β^{0}

and efficient estimation of

θ^{0}

. We can actually check this directly, since differentiating the identity (17) yields

Γ (θ^{0}) + J \frac{\partial b (θ^{0})}{\partial θ^{'}} = 0 .

(18)

Therefore,

Ω^{- 1} Γ (θ^{0}) = - Ω^{- 1} J \frac{\partial b (θ^{0})}{\partial θ^{'}} = - K^{'} \frac{\partial b (θ^{0})}{\partial θ^{'}}

(19)

when K has been chosen for efficient GMM estimation of

β^{0}

(i.e.,

K^{'} = Ω^{- 1} J

). In this case, the selection matrix K corresponding to efficient GMM estimation of

β^{0}

does fulfill the optimality condition (11) that produces an efficient II estimator of

θ^{0}

. The identity (19) indeed demonstrates that, for this choice of K, the columns of

Ω^{- 1} Γ (θ^{0})

are linear combinations of the columns of

K^{'}

. However, it may be argued that an identity like (17) should be the exception rather than the rule. There is a striking difference between the identity (16) that we obtained in a general setting (but for a given selection matrix K) and the much stronger condition (18) that does not depend on K and allows us to obtain, at the same time, efficient II estimation of

θ^{0}

and efficient GMM estimation of

β^{0}

. In contrast, it is worth stressing that any choice of K based on a GMM estimator of

β^{0}

is generically inconsistent with efficient II estimation of

θ^{0}

since, in general, for any weighting matrix W and any matrix

Λ

K = J^{'} W ⟹ K^{'} Λ = W J Λ \neq Ω^{- 1} Γ .
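A small worked example makes the tension concrete. The numbers below are hypothetical and come from a toy design chosen for this sketch (y distributed N(θ, θ) with true value 1, auxiliary moments (y − β, y² − β² − 1) and β⁰ = 1); they are not taken from the paper. The code compares the naive selection matrix built for efficient GMM estimation of β with the optimal one, and then repeats the comparison in the favorable case where Γ equals −J times the Jacobian of the binding function.

```python
import numpy as np

# Hypothetical toy values (see the lead-in); all three are assumptions of this sketch.
Omega = np.array([[1.0, 2.0], [2.0, 6.0]])   # variance of g at the truth
Gamma = np.array([[1.0], [3.0]])             # derivative of E_theta[g] w.r.t. theta
J = np.array([[-1.0], [-2.0]])               # derivative of E[g] w.r.t. beta

def avar_selected(K, Gamma, Omega):
    # [Gamma_K' Omega_K^{-1} Gamma_K]^{-1}, leaving aside the (1 + 1/H) factor.
    GK = K @ Gamma
    return np.linalg.inv(GK.T @ np.linalg.solve(K @ Omega @ K.T, GK))

K_gmm = np.linalg.solve(Omega, J).T          # naive choice K = J' Omega^{-1}
K_star = np.linalg.solve(Omega, Gamma).T     # optimal choice K* = Gamma' Omega^{-1}

avar_gmm = avar_selected(K_gmm, Gamma, Omega)[0, 0]
avar_star = avar_selected(K_star, Gamma, Omega)[0, 0]
bound = np.linalg.inv(Gamma.T @ np.linalg.solve(Omega, Gamma))[0, 0]

# Favorable case (17)-(19): Gamma = -J * db/dtheta', here with db/dtheta' = 1.
Gamma_fav = -J
avar_gmm_fav = avar_selected(K_gmm, Gamma_fav, Omega)[0, 0]
bound_fav = np.linalg.inv(Gamma_fav.T @ np.linalg.solve(Omega, Gamma_fav))[0, 0]
```

In this design the naive choice yields an asymptotic variance of about 1, whereas the optimal choice attains the bound of about 2/3; in the favorable case the naive choice attains the bound as well.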

It is worth realizing that situations of tension between efficient estimation of

β^{0}

on the one hand and efficient estimation of

θ^{0}

on the other hand (typically when (17) is violated) go beyond the simple framework considered in this paper. For instance, Sargan (1983) and Dovonon and Renault (2013) have stressed that for non-linear GMM,

β

may be globally identified by (1) while first-order identification may fail at some particular value

β^{0}

because the matrix

J

is not full column rank. It turns out that in many circumstances (see, e.g., (Dovonon and Renault 2013)) the particular value at which rank deficiency occurs is precisely the case of interest. Dovonon and Hall (2018) have documented the implication of such a lack of first-order identification for II when using the naive selection matrix

K = J^{'} Ω^{- 1}

.

Recall that the messages of this paper are two-fold: one, the naive selection matrix

K = J^{'} Ω^{- 1}

may not be an efficient choice; two, more importantly, the efficient choice is based on a matrix

Γ

, the rank of which has no reason to be deficient when there is a rank deficiency in the matrix J. Therefore, it may well be the case that standard asymptotic theory for II is still valid, in contrast with the case of Dovonon and Hall (2018), when II is performed efficiently.

A similar argument applies in the case of weak identification (see, e.g., (Stock and Wright 2000) and (Kleibergen 2005)), that is, when the matrix

J

is only asymptotically rank deficient. A general theory of II in the case of first-order under-identification or weak identification of the auxiliary parameters is left for future research.

4. Conclusions

The overall message of this paper can be summarized as follows: application of the II methodology may require, for the sake of finite-sample performance, the imposition of certain constraints on the auxiliary model, leading to auxiliary parameters that are defined by an overidentified system of moment conditions. Typically, when II is based on a score generator, à la GT, possibly due to some interpretation of the auxiliary parameters, one would imagine that there exist more restrictions than are needed to identify the auxiliary parameters. In this context, we demonstrate that efficient indirect estimators of the structural parameters should take advantage of the overidentifying restrictions, but for the purpose of optimizing the accuracy of the II estimator of the structural parameters

θ

, not for the GMM estimator of the auxiliary parameters

β

, which is generally not equivalent.

The general characterization of a two-step efficient II estimator proposed in this paper may be used in various contexts. As a first example, we have in mind procedures that select valid and relevant moments to match, as recently devised by Cheng and Liao (2015), through an “information-based LASSO”. If we contemplate applying this procedure to the choice of the auxiliary moment model, the information criterion of Cheng and Liao (2015) would be based on the asymptotic variance

{[J^{'} Ω^{- 1} J]}^{- 1}

of the efficient GMM estimator of

β

. In contrast, for the purpose of efficient II estimation of

θ^{0}

, it would be more relevant to use as an information criterion the efficient asymptotic variance

{[Γ^{'} Ω^{- 1} Γ]}^{- 1}

of an II estimator of

θ

(up to a correction factor for the number of simulations). Theoretically speaking, nothing prevents us from revisiting the theory of Cheng and Liao (2015) in this new context of indirect moment-based estimation.

A second example may be inspired by the recent work of Hansen (2016) on “Stein Combination Shrinkage” for Vector Auto-Regressions (VAR). Since the work of Smith (1993), VAR models have been a popular class of auxiliary models for II estimation of structural models stemming from macroeconomic theory; however, there exists little guidance in the extant literature about the trade-off between efficiency and parsimony in the specification of these VAR models for II estimation. Moreover, the poor finite-sample performance of estimators for VAR models of dimension larger than two, and with more than one lag, has been widely documented, but the consequences of employing such a class of auxiliary models within II estimation have not been meaningfully studied. Hansen (2016) uses the tool of model averaging for the aggregation of estimators of VAR parameters provided by different possible shrinkage strategies. In our framework, this can be understood as averaging over estimators of auxiliary parameters

β

provided by different selection matrices K. Then, in our context, the procedure of model averaging should be elicited with regard to efficient II estimation of the structural parameters

θ

, which differs from the issue of efficient estimation of the VAR parameters

β

.

More generally, this paper contributes to the search for efficiency in the context of II estimation. We emphasize the fact that the moments to match, or equivalently, the score generator provided by the auxiliary model, should not be treated as a statistical object whose inference must be efficient within the logic of the auxiliary world. Instead, auxiliary models should only be used as lenses focused on minimizing the asymptotic variance of the indirect estimator of

θ

obtained by calibrating the estimating equations on

β

, without overlooking some finite sample issues related to the parsimony in the choice of these equations. Future research includes extensions to the cases of weak identification, first-order under-identification, and model misspecification.

Author Contributions

The creation and publication of this manuscript has been a collaborative effort, with both authors contributing to every facet of the manuscript. This includes conceptualization and investigation of the main ideas in the manuscript, methodology proposals, and formal analysis, as well as all aspects of the writing process.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs

Appendix A.1. Proof of Equation (6)

The first-order conditions that define

{\hat{θ}}_{T, H} (K, W)

can be written

Γ^{'} (θ^{0}) W \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \{\frac{1}{H} \sum_{h = 1}^{H} g (y_{t}^{(h)} ({\hat{θ}}_{T, H} (K, W)), {\hat{β}}_{T} (K))\} = o_{P} (1),

which, after a first-order Taylor expansion, gives

\begin{matrix} o_{P} (1) & = & Γ^{'} (θ^{0}) W \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \{\frac{1}{H} \sum_{h = 1}^{H} g (y_{t}^{(h)} (θ^{0}), β^{0})\} + Γ^{'} (θ^{0}) W Γ (θ^{0}) \sqrt{T} [{\hat{θ}}_{T, H} (K, W) - θ^{0}] \\ & & + Γ^{'} (θ^{0}) W J \sqrt{T} [{\hat{β}}_{T} (K) - β^{0}] . \end{matrix}

Plugging in the linear expansion (4) of the estimator

{\hat{β}}_{T} (K)

, i.e.,

\sqrt{T} [{\hat{β}}_{T} (K) - β^{0}] = - {(K J)}^{- 1} K \sqrt{T} {\bar{g}}_{T} (β^{0}) + o_{P} (1),

we obtain

\begin{matrix} \sqrt{T} [{\hat{θ}}_{T, H} (K, W) - θ^{0}] & = & {[Γ^{'} (θ^{0}) W Γ (θ^{0})]}^{- 1} Γ^{'} (θ^{0}) W \\ \{J {(K J)}^{- 1} K \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \{g (y_{t}, β^{0})\} - \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \{\frac{1}{H} \sum_{h = 1}^{H} g (y_{t}^{(h)} (θ^{0}), β^{0})\}\} + o_{P} (1) \end{matrix}

(A1)

In other words,

\sqrt{T} [{\hat{θ}}_{T, H} (K, W) - θ^{0}]

is asymptotically a linear function of the asymptotically Gaussian vector

{[\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \{g (y_{t}, β^{0})\}, \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} g (y_{t}^{(h)} (θ^{0}), β^{0})]}_{1 \leq h \leq H}
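The variance in Equation (6) then follows from the variance of this linear function. The sketch below makes the step explicit under assumptions stated here rather than in the proof: the observed-data limit and the H simulated-path limits are treated as mutually independent Gaussian vectors, each with variance Ω. The matrices and function names are illustrative.

```python
import numpy as np

def avar_formula(K, W, Gamma, Omega, J, H):
    # Two-term sandwich of Equation (6).
    A = np.linalg.solve(Gamma.T @ W @ Gamma, Gamma.T @ W)
    M = J @ np.linalg.solve(K @ J, K)                 # J (K J)^{-1} K
    return A @ (M @ Omega @ M.T + Omega / H) @ A.T

def avar_from_expansion(K, W, Gamma, Omega, J, H):
    # Variance of the linear map in (A1) applied to the stacked vector (u_0, u_1, ..., u_H).
    k = Omega.shape[0]
    A = np.linalg.solve(Gamma.T @ W @ Gamma, Gamma.T @ W)
    M = J @ np.linalg.solve(K @ J, K)
    B = np.hstack([M] + [-np.eye(k) / H] * H)         # loadings on u_0 and on each u_h
    V = np.kron(np.eye(H + 1), Omega)                 # independence across paths (assumption)
    return A @ B @ V @ B.T @ A.T

rng = np.random.default_rng(3)
k, d_beta, d_theta, H = 4, 3, 2, 5                    # illustrative dimensions (assumption)
A0 = rng.standard_normal((k, k))
Omega = A0 @ A0.T + np.eye(k)
Gamma = rng.standard_normal((k, d_theta))
J = rng.standard_normal((k, d_beta))
K = rng.standard_normal((d_beta, k))
W0 = rng.standard_normal((k, k))
W = W0 @ W0.T + np.eye(k)                             # any positive definite weighting matrix

v_formula = avar_formula(K, W, Gamma, Omega, J, H)
v_expansion = avar_from_expansion(K, W, Gamma, Omega, J, H)
```

The block-diagonal structure of `V` is what separates the variance into a term driven by the observed data and a term of order 1/H driven by the simulations.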

Appendix A.2. Proof of Equation (15)

If

{\hat{β}}_{T} (K)

has been computed from the sample counterpart of estimating equations

K {\bar{g}}_{T} ({\hat{β}}_{T} (K)) = 0

for some matrix K of dimension

d_{β} \times k

and of rank

d_{β}

, we have, by assumption,

\sqrt{T} [{\hat{β}}_{T} (K) - β^{0}] = - {(K J)}^{- 1} K \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \{g (y_{t}, β^{0})\} + o_{P} (1)

and

\sqrt{T} [{\tilde{β}}_{T, H} (θ^{0}, K) - β^{0}] = - {(K J)}^{- 1} K \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \frac{1}{H} \sum_{h = 1}^{H} g (y_{t}^{(h)} (θ^{0}), β^{0}) + o_{P} (1) .

Thus, taking the difference

\begin{matrix} \sqrt{T} [{\hat{β}}_{T} (K) - {\tilde{β}}_{T, H} (θ^{0}, K)] \\ = & - {(K J)}^{- 1} K \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \{g (y_{t}, β^{0}) - \frac{1}{H} \sum_{h = 1}^{H} g (y_{t}^{(h)} (θ^{0}), β^{0})\} + o_{P} (1) \end{matrix}

Therefore,

\sqrt{T} [{\hat{β}}_{T} (K) - {\tilde{β}}_{T, H} (θ^{0}, K)]

is asymptotically normal with asymptotic variance

(1 + \frac{1}{H}) {(K J)}^{- 1} K Ω K^{'} {[{(K J)}^{'}]}^{- 1}
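The factor (1 + 1/H) can be illustrated by simulation. The following is a rough Monte Carlo sketch, assuming that the limiting moment vectors of the observed data and of the H simulated paths are independent Gaussian draws with variance Ω; the numerical values of Ω, K, J and H are hypothetical.

```python
import numpy as np

def mc_variance_of_difference(K, J, Omega, H, n_draws=200_000, seed=0):
    # Monte Carlo variance of -(KJ)^{-1} K (u_obs - average of H simulated u's).
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Omega)
    k = Omega.shape[0]
    u_obs = rng.standard_normal((n_draws, k)) @ L.T       # observed-data limit
    u_sim = rng.standard_normal((n_draws, H, k)) @ L.T    # limits for H simulated paths
    D = -np.linalg.solve(K @ J, K)                         # -(K J)^{-1} K
    diff = (u_obs - u_sim.mean(axis=1)) @ D.T
    return np.atleast_2d(np.cov(diff, rowvar=False))

def analytic_variance(K, J, Omega, H):
    D = -np.linalg.solve(K @ J, K)
    return (1 + 1 / H) * D @ Omega @ D.T

Omega = np.array([[1.0, 2.0], [2.0, 6.0]])   # illustrative values (assumption)
J = np.array([[-1.0], [-2.0]])
K = np.array([[1.0, 0.0]])
H = 4

v_mc = mc_variance_of_difference(K, J, Omega, H)
v_exact = analytic_variance(K, J, Omega, H)
```

With these values `v_exact` equals 1.25, and `v_mc` should be close to it up to Monte Carlo error.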

Hence, by using the result of GMR, the asymptotic variance of the indirect inference estimator is then

\{\underset{T \to \infty}{plim} Var [\sqrt{T} [{\hat{θ}}_{T, H}^{P} [K, W (K)] - θ^{0}]]\} = (1 + \frac{1}{H}) {\{\frac{\partial b_{K}^{'} (θ^{0})}{\partial θ} {(K J)}^{'} {(K Ω K^{'})}^{- 1} (K J) \frac{\partial b_{K} (θ^{0})}{\partial θ^{'}}\}}^{- 1}

(A2)

Moreover, by differentiating the identity

K E_{θ} [g (y_{t}^{(h)} (θ), b_{K} (θ))] = 0

we obtain

K Γ (θ^{0}) + K J \frac{\partial b_{K} (θ^{0})}{\partial θ^{'}} = 0

so that (A2) can be rewritten:

\{\underset{T \to \infty}{plim} Var [\sqrt{T} [{\hat{θ}}_{T, H}^{P} [K, W (K)] - θ^{0}]]\} = (1 + \frac{1}{H}) {\{Γ^{'} (θ^{0}) K^{'} {(K Ω K^{'})}^{- 1} K Γ (θ^{0})\}}^{- 1} .

References

Andersen, Torben G., and Jesper Lund. 1997. Estimating continuous-time stochastic volatility models of the short-term interest rate. Journal of Econometrics 77: 343–77.
Bates, Charles E., and Halbert White. 1993. Determination of estimators with minimum asymptotic covariance matrices. Econometric Theory 9: 633–48.
Breusch, Trevor, Hailong Qian, Peter Schmidt, and Donald Wyhowski. 1999. Redundancy of moment conditions. Journal of Econometrics 91: 89–111.
Cheng, Xu, and Zhipeng Liao. 2015. Select the valid and relevant moments: An information-based lasso for GMM with many moments. Journal of Econometrics 186: 443–64.
Dovonon, Prosper, and Alastair R. Hall. 2018. The asymptotic properties of GMM and indirect inference under second-order identification. Journal of Econometrics 205: 76–111.
Dovonon, Prosper, and Eric Renault. 2013. Testing for common conditionally heteroskedastic factors. Econometrica 81: 2561–86.
Eastwood, Brian J. 1991. Asymptotic normality and consistency of semi-nonparametric regression estimators using an upwards F test truncation rule. Journal of Econometrics 48: 151–81.
Gallant, A. Ronald, and Douglas W. Nychka. 1987. Semi-nonparametric maximum likelihood estimation. Econometrica: Journal of the Econometric Society 55: 363–90.
Gallant, A. Ronald, and George Tauchen. 1996. Which moments to match? Econometric Theory 12: 657–81.
Gallant, A. Ronald, and Jonathan R. Long. 1997. Estimating stochastic differential equations efficiently by minimum chi-squared. Biometrika 84: 125–41.
Gallant, A. Ronald, David Hsieh, and George Tauchen. 1997. Estimation of stochastic volatility models with diagnostics. Journal of Econometrics 81: 159–92.
Gourieroux, Christian, Alain Monfort, and Eric Renault. 1993. Indirect inference. Journal of Applied Econometrics 8: S85–S118.
Hansen, Bruce E. 2016. Stein Combination Shrinkage for Vector Autoregressions. Working Paper A39, Sir Clive Granger Building, University Park, PA, USA.
Kleibergen, Frank. 2005. Testing parameters in GMM without assuming that they are identified. Econometrica 73: 1103–23.
Sargan, J. D. 1983. Identification and lack of identification. Econometrica: Journal of the Econometric Society 51: 1605–33.
Smith, A. A., Jr. 1993. Estimating nonlinear time-series models using simulated vector autoregressions. Journal of Applied Econometrics 8: S63–S84.
Stock, James H., and Jonathan H. Wright. 2000. GMM with weak identification. Econometrica 68: 1055–96.
Tauchen, George. 1997. New minimum chi-square methods in empirical finance. Econometric Society Monographs 28: 279–317.

1	Similar to the selection matrix K, the weighting matrix W can be replaced by a consistent, data-dependent estimator, say $W_{T}$ such that $W_{T} \to_{p} W$ as $T \to \infty$ , without altering the first-order asymptotic theory of the II estimator.
2	See Appendix A for more detailed derivations.
3	A discussion and interpretation of this surprising feasibility (in spite of the fact that $β^{0}$ is unknown) is given in Section 3.4.
4	Note that one may want to compute this derivative numerically and to choose $H^{*}$ very large (and possibly different from H in the rest of the procedure) to take advantage of the smoothness properties of the population moments.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
