Article

Estimating Linear Dynamic Panels with Recentered Moments

Department of Economics, Purdue University, West Lafayette, IN 47907, USA
Econometrics 2024, 12(1), 3; https://doi.org/10.3390/econometrics12010003
Submission received: 20 November 2023 / Revised: 8 January 2024 / Accepted: 15 January 2024 / Published: 17 January 2024

Abstract

This paper proposes estimating linear dynamic panels by explicitly exploiting the endogeneity of lagged dependent variables and expressing the cross moments between the endogenous lagged dependent variables and disturbances in terms of model parameters. These moments, when recentered, form the basis for model estimation. The resulting estimator's asymptotic properties are derived under different asymptotic regimes (a large number of cross-sectional units or a long time span), stability conditions (with or without a unit root), and error characteristics (homoskedasticity or heteroskedasticity of different forms). Monte Carlo experiments show that it has very good finite-sample performance.
JEL Classification:
C23; C13; C26

1. Introduction

It is well known that the standard fixed-effects or within-group (WG) estimator for linear dynamic panel (DP) models suffers from the issue of bias (Nickell 1981) when the time series dimension T is small. Alternative estimators include those based on instrumental variables (IVs) (Anderson and Hsiao 1981), the generalized method of moments (GMM) (e.g., Arellano and Bond 1991; Holtz-Eakin et al. 1988), the maximum likelihood (ML) approach (e.g., Alvarez and Arellano 2003, 2022; Blundell and Smith 1991; Hsiao et al. 2002; Lancaster 2002), those that correct the score function in the ML framework (e.g., Alvarez and Arellano 2022; Breitung et al. 2022; Dhaene and Jochmans 2016), and those that directly correct the bias of the WG estimator (e.g., Bao and Yu 2023; Bun and Carree 2005; Dhaene and Jochmans 2015; Everaert and Pozzi 2007; Gouriéroux et al. 2010; Hahn and Kuersteiner 2002; Kiviet 1995).1
This paper proposes estimating linear dynamic panels by recentering the cross moments between the lagged dependent variables, which are endogenous, and the error term in the model by their non-zero expectations.2 The resulting estimator is named the recentered method of moments (RMM) estimator accordingly. These recentered moments are functions of model parameters and, together with moment conditions from other exogenous regressors, if any, form the basis for model estimation. Essentially, it is based on the idea that the best “instrument” for any of the endogenous lagged dependent variables is itself. As such, one does not need to search for IVs and can avoid issues of weak instruments and many instruments that exist in the GMM framework. It is closely related to the bias-correction literature, but there is no correction procedure involved. In particular, Appendix B illustrates that the estimator proposed in this paper is numerically equivalent to the indirect inference (II) estimator, as introduced by Bao and Yu (2023). However, they are motivated very differently. The strategy of Bao and Yu (2023) starts with a biased estimator, but this paper does not have a biased estimator to begin with and designs recentered moment conditions directly by expressing the cross moments between the endogenous lagged dependent variables and disturbances in terms of model parameters.3 Furthermore, there are three major contributions in this paper that are absent elsewhere.
First, it allows for more general assumptions regarding the data-generating process (DGP). Specifically, the most general form of error heteroskedasticity, arising from cross-sectional units, time, or both, is considered. Breitung et al. (2022) focus primarily on the situation of cross-sectional heteroskedasticity.4 Alvarez and Arellano (2022) consider only time-series heteroskedasticity. Bao and Yu (2023) discuss a robust version of their II estimator that is valid under both forms of heteroskedasticity, though falling short of deriving its asymptotic distribution. They focus instead on the case when there is only time-series heteroskedasticity. Bun and Carree (2006) derive the asymptotic bias of the WG estimator when both forms of heteroskedasticity are present and propose bias correcting the WG estimator when T is fixed for the first-order DP (DP(1)) model. Juodis (2013) points out that the bias-correction procedure of Bun and Carree (2006) is, in fact, inconsistent and designs a consistent one for a higher-order DP.5 Note that both Bun and Carree (2006) and Juodis (2013) estimate the temporal heteroskedasticity parameters so that they can be plugged into the bias expression of the WG estimator for the purpose of bias correction. However, they do not provide insights on how to conduct asymptotic inference on the resulting bias-corrected estimator. In contrast, Alvarez and Arellano (2022) jointly estimate the temporal heteroskedasticity parameters and model parameters and also derive their joint asymptotic distribution.
Second, this paper explicitly includes the case when there is a unit root. If the time span is short, this may not matter, as the inference procedure is under the large-N asymptotic regime, where N is the number of cross-sectional units. Bun and Carree (2006), Juodis (2013), Alvarez and Arellano (2022), and Bao and Yu (2023) all consider short panels. But for long panels, the issue of unit root cannot be simply ignored. Dhaene and Jochmans (2015) do not consider the unit-root case, and their jackknife method is developed under the rectangular-array asymptotic regime (namely, both N and T are large). Breitung et al. (2022) also present some results under the rectangular-array regime, but do not explicitly consider the unit-root case.
Third, asymptotic distribution results are derived that, in general, do not require both N and T to be large. The asymptotic distribution of the proposed RMM estimator under large T resembles the familiar ordinary least squares (OLS) result in traditional regression analysis, and its asymptotic variance achieves the efficiency bound under homoskedasticity. As in the time-series literature, the convergence rate of the estimator of the autoregressive parameters is different when there is a unit root under large T, but the standard t-test procedure carries through for hypothesis testing. Under homoskedasticity, Han and Phillips (2010) report a unified asymptotic distribution result for their first-difference least squares (FDLS) estimator for DP(1) when the unit-root case is allowed. In their setup, the fixed effects disappear under the unit-root case. Hayakawa (2009) derives the asymptotic properties of the IV estimator for higher-order DP models, with neither heteroskedasticity nor exogenous regressors present, when the panel is dynamically stable under large N and large T, though he remarks that the result should also hold under other asymptotic regimes.
The plan of this paper is as follows. The next section introduces the model and notation. Section 3 presents the RMM estimator under the baseline set-up of homoskedasticity, though it is further shown that the estimator is robust to cross-sectional heteroskedasticity as well. An important message here is that when T is large, there is no asymptotic bias, standing in contrast to the consistent WG estimator that may possess an asymptotic bias. Section 4 introduces a robust estimator under cross-sectional and temporal heteroskedasticity. Section 5 illustrates the good finite-sample performance of the proposed estimator by Monte Carlo experiments. The last section concludes and discusses possible future research. Technical details, including lemmas that are used for the proofs of the main results in this paper, together with some extended discussions and additional simulation results, are provided in the appendices. Throughout, matrix/vector dimension subscripts are typically omitted unless confusion may arise. A subscript 0 is used to signify the true value of a parameter that is to be estimated.

2. Model, Notation, and Assumptions

The p-th order linear dynamic panel model, DP(p) for short, is
$$y_{it}=\sum_{\ell=1}^{p}\phi_{\ell}\,y_{i,t-\ell}+x_{it}'\beta+\alpha_i+u_{it}=w_{it}'\theta+\alpha_i+u_{it}, \qquad (1)$$
$i=1,\ldots,N$, $t=1,\ldots,T$, where the dependent variable $y_{it}$ is related to its lagged values up to order $p$, fixed effects $\alpha_i$, the $k\times 1$ vector of exogenous variables $x_{it}=(x_{it,1},\ldots,x_{it,k})'$, and an idiosyncratic disturbance term $u_{it}$. The vector $w_{it}=(y_{i,t-1},\ldots,y_{i,t-p},x_{it}')'$ collects all the right-hand-side variables, and $\theta=(\phi',\beta')'$, where $\phi=(\phi_1,\ldots,\phi_p)'$. For each $i$, let $y_i=(y_{i1},\ldots,y_{iT})'$, $y_{i,(-\ell)}=(y_{i,1-\ell},\ldots,y_{i,T-\ell})'$, $Y_i=(y_{i,(-1)},\ldots,y_{i,(-p)})$, $X_i=(x_{i1},\ldots,x_{iT})'$, $W_i=(Y_i,X_i)=(w_{i1},\ldots,w_{iT})'$, and $u_i=(u_{i1},\ldots,u_{iT})'$. Stacking over $i$, one has $\alpha=(\alpha_1,\ldots,\alpha_N)'$, $y=(y_1',\ldots,y_N')'$, $y_{(-\ell)}=(y_{1,(-\ell)}',\ldots,y_{N,(-\ell)}')'$, $Y=(Y_1',\ldots,Y_N')'=(y_{(-1)},\ldots,y_{(-p)})$, $X=(X_1',\ldots,X_N')'$, $W=(Y,X)=(W_1',\ldots,W_N')'$, and $u=(u_1',\ldots,u_N')'$. Let $\otimes$ denote the matrix Kronecker product operator and $\mathbf{1}=\mathbf{1}_T$ be a $T\times 1$ column vector of ones. Then, in matrix notation, (1) can be written as
$$y=W\theta+\alpha\otimes\mathbf{1}+u. \qquad (2)$$
Suppose there exists a matrix $A$ such that pre-multiplying (2) by it wipes out the fixed effects, namely,
$$Ay=AW\theta+Au. \qquad (3)$$
The $NT\times NT$ matrix $I_N\otimes M_T$ comes as a natural choice for $A$, where $M_T=I_T-T^{-1}\mathbf{1}\mathbf{1}'\equiv M$, and $I_N$ ($I_T$) denotes the identity matrix of size $N$ ($T$). In this case, applying the OLS procedure to (3) yields the WG estimator. For a given $A$, the regression model (3) contains the endogenous lagged dependent variables $y_{(-1)},\ldots,y_{(-p)}$ as regressors and $E(W'A'Au)\neq 0$. The IV/GMM literature focuses on using various instruments for $y_{(-\ell)}$ (or its various differences). The best "instrument" for $y_{(-\ell)}$ is of course itself, though it violates the definition of an instrument. Nevertheless, if one can explicitly analyze how $y_{(-\ell)}$ is correlated with $u$, subject to the transformation induced by $A$, then this piece of information can be used to estimate the model parameters. This is essentially the idea of this paper. In the sequel, $A=I_N\otimes M$ is used (and thus, $A'A=A$), and the following assumptions are made. Different assumptions about the idiosyncratic error term are deferred to the next two sections, where different forms of heteroskedasticity are discussed.
Assumption 1.
The series of fixed effects $\alpha_i$, $i=1,\ldots,N$, is i.i.d. across individuals with finite moments up to the fourth order.
Assumption 2.
The error terms $u_{it}$ and fixed effects $\alpha_i$ are independent of each other for any $i=1,\ldots,N$, $t=1,\ldots,T$.
Assumption 3.
The regressors $X$, if present, are either fixed or random, and $(NT)^{-1}X'AX$ converges (in probability) to a nonsingular matrix as $NT\to\infty$. When they are fixed, $x_{it}=O(1)$. When they are random: (i) they are strictly exogenous with respect to the error terms; (ii) $x_{it}=O_p(1)$ with finite moments up to the fourth order; (iii) $E(\alpha_i^{r_1}x_{it,s}^{r_2})=O(1)$ and $\mathrm{Cov}(\alpha_i^{r_1}x_{it,s_i}^{r_2},\alpha_j^{r_1}x_{jt,s_j}^{r_2})=0$, $r_1+r_2\le 4$, $r_1\ge 0$, $r_2\ge 0$, $i\neq j$, $i,j=1,\ldots,N$, $s,s_i,s_j=1,\ldots,k$.
Assumption 4.
The initial values $y_{i,-s}$, $i=1,\ldots,N$, $s=0,1,\ldots,p-1$, are either fixed or random. When they are fixed, $y_{i,-s}=O(1)$. When they are random: (i) $y_{i,-s}=O_p(1)$ with finite moments up to the fourth order; (ii) $E(\alpha_i^{r_1}y_{i,-s}^{r_2})=O(1)$ and $\mathrm{Cov}(\alpha_i^{r_1}y_{i,-s}^{r_2},\alpha_j^{r_1}y_{j,-s}^{r_2})=0$, $r_1+r_2\le 4$, $r_1\ge 0$, $r_2\ge 0$, $i\neq j$, $i,j=1,\ldots,N$, $s=0,\ldots,p-1$; (iii) $\mathrm{Cov}(y_{i,-s},u_{it})=0$, $s=0,\ldots,p-1$, $t=1,\ldots,T$, $i=1,\ldots,N$.
Without loss of generality, in what follows, the analysis conditions on $X$ for ease of presentation. No distributional assumption is made, so long as some moment conditions hold. Assumption 3 does not rule out possible correlation between $X$ and the fixed effects, but rules out correlation among their products. For simplicity, the strictly exogenous regressors contained in $X$ are bounded (in probability). For the estimation strategy to be introduced in the next section, the inclusion of explosive exogenous regressors can affect the convergence rate of the associated parameter estimators, but does not affect the convergence rate of those associated with the lagged dependent variables. Assumption 4 does not spell out how the initial values are generated, except that there are restrictions on how they may be correlated with the fixed effects and idiosyncratic errors. The panel under consideration is not restricted to be dynamically stable. The fixed effects in Assumption 1 are treated as randomly generated from some distribution. This assumption itself is not needed for the purpose of estimation, since the recentered moment conditions to be presented wipe out the fixed effects. However, it is used, together with Assumptions 2–4 and Assumption 5 or 6, to ensure that the (scaled) moments and the associated gradient have properly defined probability limits (see, for example, Lemma A5 in Appendix C), which in turn are used to establish the asymptotic distribution of the RMM estimator.6 When there is a unit root, the asymptotic behavior of the RMM estimator depends on the fixed effects, so for the sake of tractability, they are assumed to be deterministic, following Hahn and Kuersteiner (2002).7
In the next two sections, the asymptotic properties are derived of the RMM estimator under homoskedasticity and heteroskedasticity, without or with a unit root. Under large N and finite T, the asymptotic distribution is of the same form, whereas under large T, subject to the appropriate scaling factors, the asymptotic distribution resembles the OLS result in traditional regression analysis.
Note that, as established in Bao and Yu (2023), the vectors of lagged observations from model (1) (at the true parameter vector θ 0 ) for each cross-sectional unit can be written as
$$y_{i,(-\ell)}=\Phi_p^{-1}\Big[L^{\ell}(u_i+\alpha_i\mathbf{1}+X_i\beta_0)+\sum_{s=0}^{\ell-1}\Phi_s L^{\ell-1-s}e_1 y_{i,-s}+\sum_{s=\ell}^{p-1}\Phi_{(s-p)}L^{\ell-1-s}e_1 y_{i,-s}\Big], \qquad (4)$$
$\ell=1,\ldots,p$, where $L$ is a $T\times T$ strictly lower triangular matrix with 1's on the first sub-diagonal and 0's elsewhere, $e_1=(1,0,\ldots,0)'$ is a $T\times 1$ vector, $\Phi_p=\Phi_p(\phi_0)$, $\Phi_p(\phi)=I-\phi_1 L-\cdots-\phi_p L^p$, and $\Phi_{(s-p)}=\Phi_s(\phi_0)-\Phi_p(\phi_0)$.8 Then, stacking over index $i$, one has, for $\ell=1,\ldots,p$,
$$y_{(-\ell)}=(I_N\otimes\Phi_p^{-1}L^{\ell})u+(I_N\otimes\Phi_p^{-1}L^{\ell})X\beta_0+(I_N\otimes\Phi_p^{-1}L^{\ell}\mathbf{1})\alpha+\sum_{s=0}^{\ell-1}(I_N\otimes\Phi_p^{-1}\Phi_s L^{\ell-1-s}e_1)y_{(-s)}+\sum_{s=\ell}^{p-1}(I_N\otimes\Phi_p^{-1}\Phi_{(s-p)}L^{\ell-1-s}e_1)y_{(-s)}, \qquad (5)$$
where $y_{(-s)}=(y_{1,-s},\ldots,y_{N,-s})'$ is an $N\times 1$ vector collecting the initial cross-sectional observations at times $-s$, $s=0,\ldots,p-1$. This representation of $y_{(-\ell)}$ is in terms of $u$, $X$, $\alpha$, and the initial conditions, so $W'Au$ essentially boils down to linear and quadratic forms in the random vector $u$. This facilitates the derivation of the expectation of $W'Au$, which forms the basis of the moment conditions in this paper for estimating model parameters. When $u_{it}$ is independent and identically distributed (i.i.d.) across $i$ and $t$, moments of quadratic forms in $u$ can be found in Bao and Ullah (2010), and in the presence of heteroskedasticity, results are provided in Appendix A.7 of Ullah (2004).
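As a concrete illustration, the representation above can be verified numerically for DP(1); the following sketch (in Python, with illustrative parameter values that are not from the paper) simulates one cross-sectional unit and checks the identity $y_{i,(-1)}=\Phi_1^{-1}[L(u_i+\alpha_i\mathbf{1}+X_i\beta_0)+e_1 y_{i,0}]$:

```python
import numpy as np

# Numerical check of the representation for DP(1); all values are illustrative.
rng = np.random.default_rng(0)
T, phi0, beta0, alpha_i, y0 = 8, 0.6, 1.5, 0.3, 0.7
x = rng.standard_normal(T)
u = rng.standard_normal(T)

# Simulate y_{i1},...,y_{iT} from y_{it} = phi0*y_{i,t-1} + beta0*x_{it} + alpha_i + u_{it}
y = np.empty(T)
prev = y0
for t in range(T):
    y[t] = phi0 * prev + beta0 * x[t] + alpha_i + u[t]
    prev = y[t]
y_lag = np.concatenate(([y0], y[:-1]))   # y_{i,(-1)} = (y_{i,0},...,y_{i,T-1})'

L = np.eye(T, k=-1)                      # strict lower-triangular shift matrix
Phi1 = np.eye(T) - phi0 * L
e1 = np.eye(T)[:, 0]
rhs = np.linalg.solve(Phi1, L @ (u + alpha_i * np.ones(T) + beta0 * x) + e1 * y0)
assert np.allclose(y_lag, rhs)
```

The check is exact up to floating-point rounding, since the representation is an algebraic identity.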

3. The Baseline Set-Up

This section presents the estimator in the baseline setting where the idiosyncratic errors are homoskedastic across $i$ and $t$.
Assumption 5.
$u_{it}$, $i=1,\ldots,N$, $t=1,\ldots,T$, is i.i.d. across $i$ and $t$, $E(u_{it})=0$, $\mathrm{Var}(u_{it})=\sigma^2$, and has finite moments up to the fourth order.

3.1. Estimation

Given Assumptions 1–5 and using the representation (5), one has, at $\theta_0$,
$$E(W'Au)=\begin{pmatrix}N\sigma^2\,\mathrm{tr}(M\Phi_p^{-1}L)\\ \vdots\\ N\sigma^2\,\mathrm{tr}(M\Phi_p^{-1}L^p)\\ 0_k\end{pmatrix}=-\frac{N\sigma^2}{T}\begin{pmatrix}\mathbf{1}'\Phi_p^{-1}L\mathbf{1}\\ \vdots\\ \mathbf{1}'\Phi_p^{-1}L^p\mathbf{1}\\ 0_k\end{pmatrix}, \qquad (6)$$
where $\mathrm{tr}(\cdot)$ is the matrix trace operator. The first equality in (6) follows when one substitutes (5) into $W=(y_{(-1)},\ldots,y_{(-p)},X)$ and takes expectations of quadratic forms in $u$, and the second equality follows from $\mathrm{tr}(M\Phi_p^{-1}L^{\ell})=\mathrm{tr}(\Phi_p^{-1}L^{\ell})-T^{-1}\mathbf{1}'\Phi_p^{-1}L^{\ell}\mathbf{1}$, where, since $L^{\ell}\Phi_p^{-1}$ is strictly lower triangular, $\mathrm{tr}(\Phi_p^{-1}L^{\ell})=0$.
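The trace identity behind the second equality in (6) can be checked numerically; a minimal sketch (illustrative values for a DP(2) lag polynomial):

```python
import numpy as np

# Check: since Phi_p^{-1} L^ell is strictly lower triangular,
# tr(M Phi_p^{-1} L^ell) = -T^{-1} 1' Phi_p^{-1} L^ell 1.
# T and phi below are illustrative, not from the paper.
T, phi = 6, np.array([0.5, 0.2])
L = np.eye(T, k=-1)                          # strict lower-triangular shift
Phi = np.eye(T) - phi[0] * L - phi[1] * (L @ L)
M = np.eye(T) - np.ones((T, T)) / T          # within-group demeaning matrix
ones = np.ones(T)
Phi_inv = np.linalg.inv(Phi)
for ell in (1, 2):
    Lp = np.linalg.matrix_power(L, ell)
    lhs = np.trace(M @ Phi_inv @ Lp)
    rhs = -ones @ Phi_inv @ Lp @ ones / T
    assert abs(lhs - rhs) < 1e-12
```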
This set of moment conditions additionally involves the variance parameter $\sigma^2$. Since $E(u'Au)=N(T-1)\sigma^2$, one can define the following:
$$g_{NT}(\theta)=\frac{1}{NT}W'A(y-W\theta)+\frac{1}{NT}(y-W\theta)'A(y-W\theta)\,h(\theta), \qquad (7)$$
where $h(\theta)=[T(T-1)]^{-1}\big(\mathbf{1}'\Phi_p^{-1}(\phi)L\mathbf{1},\ldots,\mathbf{1}'\Phi_p^{-1}(\phi)L^p\mathbf{1},0_k'\big)'$, such that $E(g_{NT}(\theta_0))=0$. Thus, an estimator can be defined as $\hat\theta=\arg_\theta\{g_{NT}(\theta)=0\}$. Appendix B discusses several closely related estimators that are motivated very differently. The traditional method of moments (MM) or GMM is based on moment conditions that have expectations (or probability limits) equal to zero exactly, where these moments are usually defined as cross moments between exogenous regressors or instruments and disturbances in the model under consideration. Here, the cross moments pertaining to the endogenous lagged dependent variables have non-zero expectations, and the moment conditions $g_{NT}(\theta)=0$ for estimation are designed by recentering the cross moments by their non-zero expectation parts. For this reason, $\hat\theta$ is named the RMM estimator in this paper.
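To make the estimation strategy concrete, the following sketch implements the moment condition (7) for DP(1) with $k=0$ (so $\theta=\phi$ is scalar) and solves $g_{NT}(\phi)=0$ by bisection. All simulation settings are illustrative, and the bracket is assumed to contain the root:

```python
import numpy as np

# A minimal sketch of the RMM estimator for DP(1), k = 0; settings illustrative.
rng = np.random.default_rng(42)
N, T, phi0 = 500, 6, 0.5
alpha = rng.standard_normal(N)
y = np.empty((N, T + 1))
y[:, 0] = alpha + rng.standard_normal(N)        # arbitrary initial values
for t in range(1, T + 1):
    y[:, t] = phi0 * y[:, t - 1] + alpha + rng.standard_normal(N)

M = np.eye(T) - np.ones((T, T)) / T
L = np.eye(T, k=-1)
ones = np.ones(T)

def g(phi):
    """Recentered moment (7): cross moment plus h(phi) times residual variation."""
    Phi_inv_L1 = np.linalg.solve(np.eye(T) - phi * L, L @ ones)
    h = ones @ Phi_inv_L1 / (T * (T - 1))
    e = y[:, 1:] - phi * y[:, :-1]              # residuals, unit by unit
    W_lag = y[:, :-1]
    return (np.sum((W_lag @ M) * e) + np.sum((e @ M) * e) * h) / (N * T)

# Bisection on a bracket assumed to contain the root of g.
lo, hi = -0.95, 0.95
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if g(lo) * g(mid) <= 0:
        hi = mid
    else:
        lo = mid
phi_hat = 0.5 * (lo + hi)
```

With these settings, `phi_hat` should be close to the true value 0.5, without any explicit bias correction.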
Note that one can equivalently write the set of moment conditions (7) as
$$g_{NT}(\theta)=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big[(w_{it}-\bar w_i)\epsilon_{it}(\theta)+(\epsilon_{it}(\theta)-\bar\epsilon_i(\theta))\epsilon_{it}(\theta)\,h(\theta)\big]\equiv\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}(z_{it}(\theta)-\bar z_i(\theta))\epsilon_{it}(\theta), \qquad (8)$$
where $\epsilon_{it}(\theta)=y_{it}-w_{it}'\theta$, $z_{it}(\theta)=w_{it}+h(\theta)\epsilon_{it}(\theta)$, $\bar w_i=T^{-1}\sum_{t=1}^{T}w_{it}$, and $\bar\epsilon_i(\theta)$ and $\bar z_i(\theta)$ are defined similarly. It turns out that $T^{-1}\sum_{t=1}^{T}(z_{it}(\theta)-\bar z_i(\theta))\epsilon_{it}(\theta)$ is the same as $m_{Ti}(\theta)$ in Breitung et al. (2022).9 They show that $m_{Ti}(\theta)$ has the property $E(m_{Ti}(\theta_0))=0$ under cross-sectional heteroskedasticity, namely, $E(u_{it}^2)=\sigma_i^2$. Essentially, this is because, in this case, $\hat\sigma_{Ti}^2(\theta_0)=(T-1)^{-1}\sum_{t=1}^{T}(\epsilon_{it}(\theta_0)-\bar\epsilon_i(\theta_0))\epsilon_{it}(\theta_0)$ is an unbiased estimator of $\sigma_i^2$, and thus $(T-1)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}(\epsilon_{it}(\theta_0)-\bar\epsilon_i(\theta_0))\epsilon_{it}(\theta_0)=(T-1)^{-1}u'Au$ is, in fact, an unbiased estimator of $\sum_{i=1}^{N}\sigma_i^2$. In other words, if Assumption 5 is modified such that $E(u_{it}^2)=\sigma_i^2$, then it is still the case that $E(g_{NT}(\theta_0))=0$ and the estimation procedure does not change. This is invalid, however, when there is time-series heteroskedasticity ($E(u_{it}^2)=\sigma_t^2$) or both forms of heteroskedasticity ($E(u_{it}^2)=\sigma_{it}^2$). Remark 2 of Breitung et al. (2022) suggests that $m_{Ti}(\theta)$ may still be considered a valid set of moment conditions when both forms of heteroskedasticity are present, in the sense that the limit of $m_{Ti}(\theta)$ (as $T\to\infty$) goes to zero, provided that (i) there is no unit root and (ii) $T$ is large. In the next section, a robust estimator is introduced that is based on a set of moment conditions having exact expectation zero under the most general form of heteroskedasticity, without imposing the two conditions. To fix ideas, various results in this section are presented by assuming homoskedasticity.
Obviously,
$$G_{NT}(\theta)=\frac{\partial g_{NT}(\theta)}{\partial\theta'}=\frac{1}{NT}\big[(y-W\theta)'A(y-W\theta)\,H(\theta)-W'AW-2h(\theta)\,(y-W\theta)'AW\big], \qquad (9)$$
where
$$H(\theta)=\frac{\partial h(\theta)}{\partial\theta'}=\frac{1}{T(T-1)}\begin{pmatrix}R_p(\phi)&O_{p\times k}\\O_{k\times p}&O_k\end{pmatrix},$$
and $R_p(\phi)$ is a $p\times p$ matrix with $\mathbf{1}'\Phi_p^{-1}(\phi)L^s\Phi_p^{-1}(\phi)L^{\ell}\mathbf{1}$ in its $(\ell,s)$-th position, $\ell,s=1,\ldots,p$. In light of (5), one notices that both $g_{NT}=g_{NT}(\theta_0)=(NT)^{-1}[W'Au+u'Au\,h]$ and $G_{NT}=G_{NT}(\theta_0)=(NT)^{-1}[u'Au\,H-W'AW-2h\,u'AW]$ are in terms of linear and quadratic forms in $u$, whereas $h=h(\phi_0)$ and $H=H(\phi_0)$ are purely functions of $\phi_0$. From Lemma A1 in Appendix C, regardless of $N$ and $T$, under homoskedasticity,
$$\mathrm{Var}\big(\sqrt{NT}\,g_{NT}\big)=\frac{1}{NT}\mathrm{Var}(W'Au)-\sigma^4\left[\frac{2(T-1)}{T}+\frac{\gamma_2(T-1)^2}{T^2}\right]hh'\equiv\Omega_{NT}, \qquad (10)$$
where $\gamma_2$ is the excess kurtosis of $u_{it}$.10
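The Jacobian building block $R_p(\phi)$ can be checked against finite differences, since its $(\ell,s)$ entry is the derivative of $\mathbf{1}'\Phi_p^{-1}(\phi)L^{\ell}\mathbf{1}$ with respect to $\phi_s$. A minimal sketch (illustrative values for $p=2$):

```python
import numpy as np

# Finite-difference check: the (ell, s) entry of the Jacobian of the vector
# with entries 1' Phi_p^{-1}(phi) L^ell 1 is 1' Phi_p^{-1} L^s Phi_p^{-1} L^ell 1.
# T and phi are illustrative, not from the paper.
T, phi = 7, np.array([0.4, 0.3])
L = np.eye(T, k=-1)
ones = np.ones(T)

def h_unscaled(ph):
    """Vector of 1' Phi_p^{-1}(phi) L^ell 1, ell = 1, 2 (i.e., T(T-1)h(phi))."""
    Phi = np.eye(T) - ph[0] * L - ph[1] * (L @ L)
    Pinv = np.linalg.inv(Phi)
    return np.array([ones @ Pinv @ np.linalg.matrix_power(L, ell) @ ones
                     for ell in (1, 2)])

eps = 1e-6
Phi_inv = np.linalg.inv(np.eye(T) - phi[0] * L - phi[1] * (L @ L))
for s in (1, 2):
    dphi = phi.copy()
    dphi[s - 1] += eps
    fd = (h_unscaled(dphi) - h_unscaled(phi)) / eps      # numeric column s
    for ell in (1, 2):
        Ls = np.linalg.matrix_power(L, s)
        Le = np.linalg.matrix_power(L, ell)
        analytic = ones @ Phi_inv @ Ls @ Phi_inv @ Le @ ones
        assert abs(fd[ell - 1] - analytic) < 1e-3
```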
At the true parameter vector $\theta_0$, when there is no unit root, $(NT)^{-1}W'AW=O_p(1)$, $Th=O(1)$, $TH=O(1)$, $N^{-1}u'AW=O(1)+O_p(N^{-1/2}T^{1/2})$, and $(NT)^{-1}u'Au=O_p(1)$, so $G_{NT}=O_p(1)+[O(T^{-2})+O_p(N^{-1/2}T^{-3/2})]+O_p(T^{-1})$. When $T$ is large, $G_{NT}$ is obviously dominated by the $O_p(1)$ term, namely, $-(NT)^{-1}W'AW$. (When there is a unit root, the properly scaled $G_{NT}$ has the scaled $W'AW$ as its leading term.) From Lemma A6 in Appendix C (or Lemma A14 when there is a unit root), the estimator is locally identified in large samples. When $T$ is finite but $N$ is large, it is still very likely that $(NT)^{-1}W'AW$ is relatively large compared with the other two terms of finite order in $G_{NT}$. In finite samples, one can always check numerically whether $G_{NT}(\theta)$ is singular against a grid of values of $\theta$. This is similar to the approach adopted by Gospodinov et al. (2017) and Bao and Yu (2023). In this section and the one to follow, it is implicitly assumed that the estimator is (first-order) locally identified. A general statement about sufficient conditions for identification would be desirable but extremely difficult.11

3.2. Inference under Large N

If $N\to\infty$ and $T$ is finite, by writing $g_{NT}=(NT)^{-1}\sum_{i=1}^{N}(W_i'Mu_i+u_i'Mu_i\,h)\equiv(NT)^{-1}\sum_{i=1}^{N}g_i$, where $g_i$ is independent (across $i$) with mean $0$ and variance $O(1)$ under Assumptions 1–5, one can apply Lyapunov's central limit theorem to $\sqrt{N}\,g_{NT}$. This straightforwardly gives the following result.
Theorem 1.
Under Assumptions 1–5, if $T$ is finite and $G_T=\operatorname{plim}_{N\to\infty}G_{NT}$ and $\Omega_T=\lim_{N\to\infty}\Omega_{NT}$ exist and are nonsingular, then as $N\to\infty$,
$$\sqrt{N}\,(\hat\theta-\theta_0)\xrightarrow{d}N\big(0,\,G_T^{-1}\Omega_T G_T^{-1}\big). \qquad (11)$$
For practical inference, while $G_T$ may be consistently estimated by $\hat G_T=(NT)^{-1}[(y-W\hat\theta)'A(y-W\hat\theta)\hat H-W'AW-2\hat h\,(y-W\hat\theta)'AW]$ with $\hat h=h(\hat\theta)$ and $\hat H=H(\hat\theta)$, it is not advisable to use a plug-in approach to estimate $\Omega_T$. Recall that $\Omega_{NT}$ (see (10)) contains $\mathrm{Var}(W'Au)$, which, as shown in Bao and Yu (2023), has a very complicated expression involving further the skewness and kurtosis of $u_{it}$, the fixed effects, and possible interactions between the fixed effects and initial conditions. Given these complications and the fact that $W_i'Mu_i+u_i'Mu_i\,h$ is independent (across $i$), a White-type (White 1980) estimator is natural for estimating $\Omega_T$ in this asymptotic regime of large $N$ and finite $T$. In consequence, a consistent estimator of the asymptotic variance of $\sqrt{N}(\hat\theta-\theta_0)$ is
$$\hat V(\hat\theta)=\hat G_T^{-1}\Big(\frac{1}{N}\sum_{i=1}^{N}\hat g_{Ti}\hat g_{Ti}'\Big)\hat G_T^{-1}, \qquad (12)$$
where $\hat g_{Ti}=T^{-1}(W_i'M\hat u_i+\hat u_i'M\hat u_i\,\hat h)$ with $M\hat u_i=M(y_i-W_i\hat\theta)$.
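A minimal sketch of the White-type sandwich estimator $\hat V(\hat\theta)$ for DP(1) with $k=0$; for brevity it is evaluated at the true $\phi$ rather than at an estimate, and all simulation settings are illustrative:

```python
import numpy as np

# Sketch of the White-type variance estimator for DP(1), k = 0 (illustrative).
rng = np.random.default_rng(7)
N, T, phi = 500, 6, 0.5
alpha = rng.standard_normal(N)
y = np.empty((N, T + 1))
y[:, 0] = alpha
for t in range(1, T + 1):
    y[:, t] = phi * y[:, t - 1] + alpha + rng.standard_normal(N)

M = np.eye(T) - np.ones((T, T)) / T
L = np.eye(T, k=-1)
ones = np.ones(T)
Phi_inv = np.linalg.inv(np.eye(T) - phi * L)
h = ones @ Phi_inv @ L @ ones / (T * (T - 1))
H = (ones @ Phi_inv @ L @ Phi_inv @ L @ ones) / (T * (T - 1))   # R_1(phi) scaled

W_lag, u_hat = y[:, :-1], y[:, 1:] - phi * y[:, :-1]
Mu = u_hat @ M                                   # row i is (M u_i)'
g_i = (np.sum(W_lag * Mu, axis=1) + np.sum(u_hat * Mu, axis=1) * h) / T
G_hat = (np.sum(u_hat * Mu) * H - np.sum(W_lag * (W_lag @ M))
         - 2 * h * np.sum(u_hat * (W_lag @ M))) / (N * T)
V_hat = np.mean(g_i ** 2) / G_hat ** 2           # sandwich, scalar case
se = np.sqrt(V_hat / N)                          # standard error for phi_hat
```

The per-unit terms `g_i` mirror $\hat g_{Ti}$, so the same code extends directly to cross-sectional heteroskedasticity.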
One can show that $G_T$ is equal to $\operatorname{plim}_{N\to\infty}\nabla m_{NT}(\theta_0)$ in Breitung et al. (2022), which, in turn, is equal to $\lim_{N\to\infty}N^{-1}\sum_{i=1}^{N}E[\nabla m_{Ti}(\theta_0)]$, where $\nabla m_{NT}(\theta)$ $(=\partial m_{NT}(\theta)/\partial\theta')$ and $E[\nabla m_{Ti}(\theta)]$ are given in Appendix A of Breitung et al. (2022). Further, $\Omega_T$ is the same as $\operatorname{plim}_{N\to\infty}N^{-1}\sum_{i=1}^{N}m_{Ti}(\theta_0)m_{Ti}(\theta_0)'$. Thus, $\hat V(\hat\theta)$ is asymptotically the same as the fixed-$T$ consistent variance matrix estimator of Breitung et al. (2022) (see their Equation (6)). Again, despite the discussion in this section being under the set-up of homoskedastic errors, it also applies to the case of cross-sectional heteroskedasticity.

3.3. Inference under Large T

If $T\to\infty$, regardless of $N$, upon substituting (5), one can check that the term $y_{(-\ell)}'Au$, $\ell=1,\ldots,p$, in $W'Au$ will be dominated by $u'A(I_N\otimes\Phi_p^{-1}L^{\ell})u+u'A(I_N\otimes\Phi_p^{-1}L^{\ell})X\beta_0$. When there is a unit root, $u'A(I_N\otimes\Phi_p^{-1}L^{\ell}\mathbf{1})\alpha$ is also a dominating term. In the following, the stable and unit-root cases are discussed separately, since the leading terms in $y_{(-\ell)}'Au$ are of different orders. The relevant matrices in the quadratic forms are uniformly (in $T$) bounded in row and column sums in the stable case, but not in the unit-root case. These differences create different distribution results for $\hat\theta$ that are mainly in terms of the scaling factors. In what follows, let $\lambda_r$, $r=1,\ldots,p$, denote the inverses of the roots of $1-\phi_1 z-\cdots-\phi_p z^p=0$ when $\phi=\phi_0$. The stable case refers to the situation where $|\lambda_r|<1$, $r=1,\ldots,p$, and, without loss of generality, the unit-root case refers to the scenario where $\lambda_1=1$ and $|\lambda_r|<1$, $r=2,\ldots,p$.

3.3.1. Stable Case

In this case, the matrix $M\Phi_p^{-1}L^{\ell}$ in $u'A(I_N\otimes\Phi_p^{-1}L^{\ell})u+u'A(I_N\otimes\Phi_p^{-1}L^{\ell})X\beta_0=u'(I_N\otimes M\Phi_p^{-1}L^{\ell})u+u'(I_N\otimes M\Phi_p^{-1}L^{\ell})X\beta_0$ is uniformly (with respect to $T$) bounded in row and column sums. This boundedness property also holds for $I_N\otimes M\Phi_p^{-1}L^{\ell}$, and thus the central limit theorem of Kelejian and Prucha (2010) on linear and quadratic forms can be invoked.12 That is, as $T\to\infty$, $\Omega_{1,NT}^{-1/2}\sqrt{NT}\,g_{NT}\xrightarrow{d}N(0,I)$ (see Lemma A7), in which
$$\Omega_{1,NT}=\frac{1}{NT}\begin{pmatrix}\sigma^4 N\Gamma_1+\sigma^2\Gamma_2 & \sigma^2 F_1'\\ \sigma^2 F_1 & \sigma^2 X'AX\end{pmatrix}, \qquad (13)$$
where $\Gamma_1$ and $\Gamma_2$ are $p\times p$ matrices, consisting respectively of the elements $\mathrm{tr}[(\Phi_p^{-1}L^{\ell})'M\Phi_p^{-1}L^s]$ and $\beta_0'X'[I_N\otimes(\Phi_p^{-1}L^{\ell})'M\Phi_p^{-1}L^s]X\beta_0$, $\ell,s=1,\ldots,p$, in their $(\ell,s)$-positions, and $F_1$ is $k\times p$ with $X'(I_N\otimes M\Phi_p^{-1}L^{\ell})X\beta_0$, $\ell=1,\ldots,p$, as its $\ell$-th column. Note that $\Omega_{1,NT}$ is not the exact variance of $\sqrt{NT}\,g_{NT}$, which is $\Omega_{NT}$ as defined by (10). It is better interpreted as an approximation, namely, $\Omega_{1,NT}=\Omega_{NT}+o(1)$. It follows that $\Omega_{NT}^{-1/2}\sqrt{NT}\,g_{NT}=\Omega_{1,NT}^{-1/2}\sqrt{NT}\,g_{NT}+o_p(1)$, regardless of $N$. In contrast to the exact variance matrix $\Omega_{NT}$, the variance matrix $\Omega_{1,NT}$ involves $\theta_0$, $X$, and $\sigma^2$, but not the fixed effects, the initial conditions, or the skewness and kurtosis of the error distribution.
Theorem 2.
Under Assumptions 1–5 and that DP$(p)$ is dynamically stable, if $\operatorname{plim}_{T\to\infty}G_{NT}$ and $\lim_{T\to\infty}\Omega_{1,NT}$ exist and are nonsingular, then as $T\to\infty$,
$$\sqrt{NT}\,(\hat\theta-\theta_0)\xrightarrow{d}N\Big(0,\ \Big(\operatorname{plim}_{T\to\infty}G_{NT}\Big)^{-1}\lim_{T\to\infty}\Omega_{1,NT}\Big(\operatorname{plim}_{T\to\infty}G_{NT}\Big)^{-1}\Big). \qquad (14)$$
If $N$ is finite, one could have replaced the convergence rate $\sqrt{NT}$ by $\sqrt{T}$ and denoted $\operatorname{plim}_{T\to\infty}G_{NT}$ by $G_N$ and $\lim_{T\to\infty}\Omega_{1,NT}$ by $\Omega_{1,N}$. But since there is no restriction on $N$ (except that the probability limit of $G_{NT}$ and the limit of $\Omega_{1,NT}$ are well defined as $T\to\infty$), which may be finite or may also diverge as $T\to\infty$, no attempt is made to implement such replacements here (and also in the next (sub)sections).13 Lemma A6 gives $\operatorname{plim}_{T\to\infty}(NT)^{-1}W'AW=\lim_{T\to\infty}\sigma^{-2}\Omega_{1,NT}$. Combining all these results, one has the following.
Corollary 1.
Under the conditions in Theorem 2, as $T\to\infty$,
$$\sqrt{NT}\,(\hat\theta-\theta_0)\xrightarrow{d}N\Big(0,\ \Big(\operatorname{plim}_{T\to\infty}\tfrac{1}{NT}W'AW\Big)^{-1}\lim_{T\to\infty}\Omega_{1,NT}\Big(\operatorname{plim}_{T\to\infty}\tfrac{1}{NT}W'AW\Big)^{-1}\Big) \qquad (15)$$
$$\xrightarrow{d}N\Big(0,\ \sigma^4\Big(\lim_{T\to\infty}\Omega_{1,NT}\Big)^{-1}\Big) \qquad (16)$$
$$\xrightarrow{d}N\Big(0,\ \sigma^2\Big(\operatorname{plim}_{T\to\infty}\tfrac{1}{NT}W'AW\Big)^{-1}\Big). \qquad (17)$$
The asymptotic variances in (15)–(17) resemble the variance formula of a consistent OLS estimator. Recall that the WG estimator, which is also named the least squares dummy variable (LSDV) estimator, is essentially an OLS estimator and is consistent under large T, though it may still possess an asymptotic bias, depending on the divergence rates of N and T. The asymptotic distribution result (17) is the same as Theorem 1(ii) of Breitung et al. (2022) that is derived under cross-sectional heteroskedasticity.14 In view of Theorem 3 in Hahn and Kuersteiner (2002), the asymptotic variance (17) achieves the efficiency bound.
In practice, the asymptotic variance may be estimated by $\hat G_{NT}^{-1}\hat\Omega_{1,NT}\hat G_{NT}^{-1}$, where $\hat\Omega_{1,NT}=\Omega_{1,NT}(\hat\theta)$ and $\hat G_{NT}=G_{NT}(\hat\theta)$, with $\hat\theta$ replacing $\theta_0$ in $\Omega_{1,NT}$ and $G_{NT}$, respectively; or by $(W'AW/NT)^{-1}\hat\Omega_{1,NT}(W'AW/NT)^{-1}$; or $\hat\sigma^4\hat\Omega_{1,NT}^{-1}$; or $\hat\sigma^2(W'AW/NT)^{-1}$, where $\hat\sigma^2=(y-W\hat\theta)'A(y-W\hat\theta)/[N(T-1)]$. The last choice is perhaps the easiest one to use.
When $k=0$ and $p=1$, $\Omega_{1,NT}/\sigma^4=\mathrm{tr}[(\Phi_1^{-1}L)'M\Phi_1^{-1}L]/T=1/(1-\phi_0^2)+O(T^{-1})$, so
$$\sqrt{NT}\,(\hat\phi-\phi_0)\xrightarrow{d}N(0,\ 1-\phi_0^2), \qquad (18)$$
as $T\to\infty$. This is also the asymptotic distribution of the bias-corrected WG estimator in Hahn and Kuersteiner (2002).

3.3.2. Unit-Root Case

When there is a unit root, the leading terms in W A u are linear forms in u . The matrices in quadratic forms in u that appear in W A u are no longer uniformly (in T) bounded in row and column sums, but they are of lower order compared with the leading linear forms. Thus, define the scaling matrix
$$\Upsilon=\begin{pmatrix}NT^3 I_p & O_{p\times k}\\ O_{k\times p} & NT\,I_k\end{pmatrix}. \qquad (19)$$
Correspondingly, define $g_{NT}=\Upsilon^{-1}[W'Au-E(W'Au)]$ and
$$\Omega_{1,NT}(\alpha)=\Upsilon^{-1/2}\,\sigma^2\begin{pmatrix}\Gamma_2+\Gamma_3(\alpha)&(F_1+F_2(\alpha))'\\F_1+F_2(\alpha)&X'AX\end{pmatrix}\Upsilon^{-1/2}, \qquad (20)$$
where $\Gamma_2$ and $F_1$ are defined in (13), $\Gamma_3(\alpha)$ has $\mathbf{1}'(\Phi_p^{-1}L^{\ell})'M\Phi_p^{-1}L^s\mathbf{1}\,\alpha'\alpha$ in its $(\ell,s)$-position, $\ell,s=1,\ldots,p$, and $F_2(\alpha)$ has $X'(I_N\otimes M\Phi_p^{-1}L^{\ell}\mathbf{1})\alpha$ as its $\ell$-th column, $\ell=1,\ldots,p$. Note that the matrix $\Omega_{1,NT}(\alpha)$ involves the fixed effects. If the fixed effects are deterministic and of finite magnitudes, then $\Omega_{1,NT}(\alpha)$ can be interpreted as the approximate variance of $\Upsilon^{1/2}g_{NT}$, namely, $\Omega_{1,NT}(\alpha)=\mathrm{Var}(\Upsilon^{1/2}g_{NT})+o(1)$. Lemma A15 in Appendix C shows that $\Omega_{1,NT}^{-1/2}(\alpha)\Upsilon^{1/2}g_{NT}\xrightarrow{d}N(0,I)$ as $T\to\infty$. Denote $G_{NT}=G_{NT}(\theta_0)$ with $G_{NT}(\theta)=\Upsilon^{-1/2}\{\partial[W'A(y-W\theta)-E(W'A(y-W\theta))]/\partial\theta'\}\Upsilon^{-1/2}$.
Theorem 3.
Under Assumptions 3–5 and that DP$(p)$ has a unit root, assume that the fixed effects are deterministic and of finite magnitudes. If $\operatorname{plim}_{T\to\infty}G_{NT}$ and $\lim_{T\to\infty}\Omega_{1,NT}(\alpha)$ exist and are nonsingular, then as $T\to\infty$,
$$\Upsilon^{1/2}(\hat\theta-\theta_0)\xrightarrow{d}N\Big(0,\ \Big(\operatorname{plim}_{T\to\infty}G_{NT}\Big)^{-1}\lim_{T\to\infty}\Omega_{1,NT}(\alpha)\Big(\operatorname{plim}_{T\to\infty}G_{NT}\Big)^{-1}\Big). \qquad (21)$$
This, together with Lemmas A13 and A14, implies the following.
Corollary 2.
Under the conditions in Theorem 3, as $T\to\infty$,
$$\Upsilon^{1/2}(\hat\theta-\theta_0)\xrightarrow{d}N\Big(0,\ \Big(\operatorname{plim}_{T\to\infty}\Upsilon^{-1/2}W'AW\Upsilon^{-1/2}\Big)^{-1}\lim_{T\to\infty}\Omega_{1,NT}(\alpha)\Big(\operatorname{plim}_{T\to\infty}\Upsilon^{-1/2}W'AW\Upsilon^{-1/2}\Big)^{-1}\Big)$$
$$\xrightarrow{d}N\Big(0,\ \sigma^2\Big(\operatorname{plim}_{T\to\infty}\Upsilon^{-1/2}W'AW\Upsilon^{-1/2}\Big)^{-1}\Big). \qquad (22)$$
When there are no fixed effects ($\alpha=0$) but there are still relevant exogenous regressors contained in $X$, the asymptotic distribution result still holds as $T\to\infty$. When there is no $X$ and $p=1$, the leading term in $W'AW=\sum_{i=1}^{N}y_{i,(-1)}'My_{i,(-1)}$ is $\sum_{i=1}^{N}\alpha_i^2\,\mathbf{1}'L'(\Phi_1^{-1})'M\Phi_1^{-1}L\mathbf{1}$. As $T\to\infty$, $T^{-3}\mathbf{1}'L'(\Phi_1^{-1})'M\Phi_1^{-1}L\mathbf{1}\to 1/12$ and $T^{-3}\mathrm{Var}(W'Au)\to\sigma^2\sum_{i=1}^{N}\alpha_i^2/12$.
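The $1/12$ limit can be confirmed numerically: with $\phi_0=1$, $\Phi_1^{-1}L\mathbf{1}=(0,1,\ldots,T-1)'$, whose within-sample variation grows like $T^3/12$. A minimal sketch:

```python
import numpy as np

# Check: T^{-3} 1' L' Phi_1'^{-1} M Phi_1^{-1} L 1 -> 1/12 when phi_0 = 1.
for T in (50, 200):
    L = np.eye(T, k=-1)
    M = np.eye(T) - np.ones((T, T)) / T
    v = np.linalg.solve(np.eye(T) - L, L @ np.ones(T))   # equals (0, 1, ..., T-1)'
    val = v @ M @ v / T**3
    assert abs(val - 1 / 12) < 1e-4
```

In fact $v'Mv=T(T^2-1)/12$ exactly, so the error is $1/(12T^2)$.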
Corollary 3.
Under Assumptions 4 and 5, if further $k=0$, $p=1$, $\phi_0=1$, the fixed effects are deterministic and of finite magnitudes, and $\alpha\neq 0$, then as $T\to\infty$,
$$\sqrt{NT^3}\,(\hat\phi-\phi_0)\xrightarrow{d}N\big(0,\ 12\sigma^2/(N^{-1}\alpha'\alpha)\big). \qquad (23)$$
This echoes Theorem 5 of Hahn and Kuersteiner (2002). If $N\to\infty$ as well, then the asymptotic variance in (23) is replaced accordingly with $12\sigma^2/(\lim_{N\to\infty}N^{-1}\alpha'\alpha)$ (assuming that the limit exists).15
It should be pointed out that the unit-root situation considered here is different from those in Han and Phillips (2010) and Kruiniger (2018), both focusing on DP(1) and ruling out trends due to the fixed effects and/or exogenous regressors. The specification in Han and Phillips (2010) is $y_{it}=\phi y_{i,t-1}+(1-\phi)\alpha_i+u_{it}$. Their FDLS estimator has a unified convergence rate of $\sqrt{N(T-1)}$ for $\phi_0\in(-1,1]$, and the limiting distribution is Gaussian for any $N/T$ ratio as long as $N(T-1)\to\infty$. The model in Kruiniger (2018) is $y_{it}=\phi y_{i,t-1}+x_{it}'(\beta-\phi\beta)+(1-\phi)\alpha_i+u_{it}$. He shows that the modified ML approach (Lancaster 2002) is undesirable in the unit-root case, since $\phi_0$ is first-order unidentified and only second-order identified under the large-$N$, finite-$T$ asymptotic regime. His generalized modified ML estimator, on the other hand, exists in the large-$N$, finite-$T$ asymptotic regime with probability approaching 1; it is uniquely identified with probability 1 if it exists, and the asymptotic distribution of the estimated autoregressive parameter is non-Gaussian, with a convergence rate of $N^{1/4}$ in the unit-root case. In this paper, it is assumed that either there are always fixed effects or, when the fixed effects do not exist, there are relevant exogenous regressors. In this framework, the convergence rate of the RMM estimator is different only under large $T$ in the unit-root case, and the asymptotic distribution is still normal.

4. Heteroskedasticity

This section considers the situation of heteroskedastic idiosyncratic errors. Assumption 5 is modified accordingly such that both forms of heteroskedasticity are allowed.
Assumption 6.
The series of error terms u i t , i = 1 , , N , t = 1 , , T , is independent across i and t, E ( u i t ) = 0 , Var ( u i t ) = σ i t 2 , and has finite ( 4 + η ) -th moments, η > 0 .
Let Var ( u i ) = Σ i = Dg ( σ i 1 2 , , σ i T 2 ) , where Dg ( · ) denotes a diagonal matrix with its arguments in order on the diagonal. (When Dg ( · ) has a matrix argument, it collects, in order, the diagonal elements of the matrix and forms a diagonal matrix.) It is obvious that
E ( W Au ) = i = 1 N tr ( M Φ p 1 L 1 Σ i ) tr ( M Φ p 1 L p Σ i ) 0 k .
Following Bao and Yu (2023), define
Ψ ( ϕ ) = T T 2 Dg ( M Φ p 1 ( ϕ ) L ) tr ( M Φ p 1 ( ϕ ) L ) ( T 1 ) ( T 2 ) I .
Then, using tr ( M Φ p 1 L Σ i ) = E ( u i M Ψ M u i ) , where Ψ = Ψ ( ϕ 0 ) , one may define the moment vector as
g N T ( θ ) = 1 N T i = 1 N y i , ( 1 ) M ( y i W i θ ) ( y i W i θ ) M Ψ 1 ( ϕ ) M ( y i W i θ ) y i , ( p ) M ( y i W i θ ) ( y i W i θ ) M Ψ p ( ϕ ) M ( y i W i θ ) X i M ( y i W i θ ) 1 N T i = 1 N g i ( θ )
such that E ( g N T ) = 0 . Note that this set of moment conditions is still valid under Assumption 5. In the previous section under homoskedasticity, σ 2 is treated as a nuisance parameter, and it is replaced with u A u / [ N ( T 1 ) ] as if it were estimated. (It has also been emphasized that u A u / [ N ( T 1 ) ] , in fact, is unbiased for estimating i = 1 N σ i 2 under cross-sectional heteroskedasticity.) Under the most general form of heteroskedasticity, there is no feasible way of estimating all the variance parameters unbiasedly. So this section does not seek to estimate Σ i . Instead, the set of moment conditions (25) is designed such that it is robust to heteroskedasticity.
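The recentering identity tr ( M Φ p 1 L Σ i ) = E ( u i M Ψ M u i ) holds exactly for any diagonal Σ i , which is what makes (25) valid under both Assumption 5 and Assumption 6. This can be checked numerically; the following is a minimal sketch (all helper names are illustrative, not from any accompanying code) of constructing Ψ ( ϕ ) as in (24) and the per-unit moment vector g i ( θ ) in (25):

```python
import numpy as np

def demeaner(T):
    # M = I_T - (1/T) 1 1': the within-group (time-demeaning) projector
    return np.eye(T) - np.ones((T, T)) / T

def phi_inv_L(phi, T, ell):
    # Phi_p(phi) = I - phi_1 L - ... - phi_p L^p is lower-triangular Toeplitz,
    # hence invertible; returns Phi_p^{-1}(phi) L^ell
    Phi = np.eye(T)
    for j, ph in enumerate(phi, start=1):
        Phi -= ph * np.eye(T, k=-j)
    return np.linalg.solve(Phi, np.eye(T, k=-ell))

def Psi_ell(phi, T, ell):
    # Psi_ell(phi) = T/(T-2) Dg(M Phi^{-1} L^ell)
    #              - tr(M Phi^{-1} L^ell)/((T-1)(T-2)) I, as in (24)
    C = demeaner(T) @ phi_inv_L(phi, T, ell)
    return (T / (T - 2)) * np.diag(np.diag(C)) \
        - (np.trace(C) / ((T - 1) * (T - 2))) * np.eye(T)

def g_i(theta, y_i, Ylag_i, X_i):
    # Per-unit moment vector g_i(theta) in (25); Ylag_i stacks the lags
    # y_{i,(-1)}, ..., y_{i,(-p)} as columns (presample values assumed given)
    T, p = Ylag_i.shape
    M = demeaner(T)
    v = M @ (y_i - np.hstack([Ylag_i, X_i]) @ theta)   # M(y_i - W_i theta)
    top = [Ylag_i[:, l] @ v - v @ Psi_ell(theta[:p], T, l + 1) @ v
           for l in range(p)]
    return np.concatenate([np.array(top), X_i.T @ v])
```

A direct calculation (using that Ψ is diagonal and M idempotent) shows tr ( M Ψ M Σ ) = tr ( M Φ p 1 L Σ ) term by term, so no variance parameters need to be estimated.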
Denote Ω N T = Var ( N T g N T ) as before, but its expression is different, see (A9) in Appendix C. The gradient (evaluated at θ 0 ) now becomes
G N T = 1 N T i = 1 N 2 u i M Ψ 1 M W i y i , ( 1 ) M W i 2 u i M Ψ p M W i y i , ( p ) M W i X i M W i 1 N T i = 1 N u i M Ψ 11 M u i u i M Ψ 1 p M u i 0 k u i M Ψ p 1 M u i u i M Ψ p p M u i 0 k 0 k 0 k O k ,
where
Ψ s = Ψ ( θ 0 ) ϕ s = T T 2 Dg ( M Φ p 1 L s Φ p 1 L ) tr ( M Φ p 1 L s Φ p 1 L ) ( T 1 ) ( T 2 ) I .
By substituting (4), one can see that y i , ( ) M u i u i M Ψ M u i that populates g i ( θ 0 ) is in terms of linear and quadratic forms in u i (see (A10) in the proof of Lemma A1 in Appendix C). This structure is to be exploited for one to study the asymptotic distribution of the resulting RMM estimator from the set of sample moment conditions (25).

4.1. Inference under Large N

Given any finite T, for each = 1 , , p , one can verify that the Lyapunov condition holds by checking N 1 i = 1 N Var ( y i , ( ) M u i u i M Ψ M u i ) = O ( 1 ) (see Lemma A1 in Appendix C for the expression of Var ( y i , ( ) M u i u i M Ψ M u i ) ) and E ( | y i , ( ) M u i u i M Ψ M u i | 2 + δ ) < for some δ > 0 . The Lyapunov condition also holds for u i M X i , since N 1 i = 1 N Var ( u i M X i ) = N 1 i = 1 N tr ( M X i X i M Σ i ) = O ( 1 ) and E ( | u i M X i | 2 + δ ) < . Moreover, for any linear combination of y i , ( 1 ) M u i u i M Ψ 1 M u i , ⋯, y i , ( p ) M u i u i M Ψ p M u i , and u i M X i , the Lyapunov condition holds as well. Thus, with Ω N T and G N T updated, one has the same asymptotic distribution result as in Theorem 1, and to save space, it is not listed explicitly as a separate theorem.

4.2. Inference under Large T

As in the previous section, cases without and with a unit root are discussed separately, since the orders of various dominating terms in G N T and Ω N T are different.

4.2.1. Stable Case

Note that, for any = 1 , , p , y i , ( ) M u i is dominated by u i M Φ p 1 L ( u i + X i β 0 ) . Dg ( M Φ p 1 L ) has diagonal elements of order O ( T 1 ) (see Lemma A2 in Appendix C). Therefore, y i , ( ) M u i u i M Ψ M u i is dominated by u i M Φ p 1 L ( u i + X i β 0 ) .
Theorem 4.
Under Assumptions 1–4 and 6 and that DP ( p ) is dynamically stable, if plim T G N T and lim T Ω 1 , N T exist and are nonsingular, then as T ,
N T ( θ ^ θ 0 ) d N 0 , plim T G N T 1 lim T Ω 1 , N T plim T G N T 1 ,
where
Ω 1 , N T = 1 N T Γ 1 + Γ 2 F 1 F 1 i = 1 N X i M Σ i M X i
in which Γ 1 and Γ 2 are p × p matrices with, respectively, i = 1 N tr ( Σ i L Φ p 1 M Σ i M Φ p 1 L s ) and i = 1 N β 0 X i L Φ p 1 M Σ i M Φ p 1 L s X i β 0 in their ( , s )-th positions, , s = 1 , , p , and F 1 is k × p with X i M Σ i M Φ p 1 L X i β 0 as its ℓ-th column, = 1 , , p .
In view of Lemmas A9 and A10 in Appendix C, one also has the following corollary.
Corollary 4.
Under the assumptions in Theorem 4, as T ,
N T ( θ ^ θ 0 )
d N 0 , plim T 1 N T W A W 1 lim T Ω 1 , N T plim T 1 N T W A W 1
d N 0 , plim T 1 N T W A W 1 plim T 1 N T i = 1 N W i M Σ i M W i plim T 1 N T W A W 1 .
The asymptotic variance (31) resembles the OLS variance result under heteroskedasticity in cross-sectional regressions. In practice, for meaningful inference, one needs to estimate the meat part of the sandwich form of the asymptotic variance. From the proof of Lemma A8, g N T is dominated by ( N T ) 1 i = 1 N ( y i , ( 1 ) M u i , , y i , ( p ) M u i , u i M X i ) = ( N T ) 1 i = 1 N W i M u i . Therefore, as T ,
plim T 1 N T i = 1 N W i M u i u i M W i = lim T 1 N T i = 1 N E ( W i M u i u i M W i ) = lim T Var ( N T g N T ) = lim T Ω 1 , N T .
But still, u i is not observable. Given that M u i = M ( y i W i θ 0 ) M v i , one may be tempted to use M v ^ i = M ( y i W i θ ^ ) = M [ u i + W i ( θ 0 θ ^ ) ] . By substitution,
1 N T i = 1 N W i M v ^ i v ^ i M W i = 1 N T i = 1 N W i M u i u i M W i + 1 N T i = 1 N W i M W i ( θ ^ θ 0 ) ( θ ^ θ 0 ) W i M W i 1 N T i = 1 N W i M W i ( θ ^ θ 0 ) u i M W i 1 N T i = 1 N W i M u i ( θ ^ θ 0 ) W i M W i ,
where θ ^ θ 0 = O p ( 1 / N T ) , W i M W i = O p ( T ) , and W i M u i = O ( 1 ) + O p ( T / N ) . If it is also the case that N diverges, then ( N T ) 1 i = 1 N W i M W i ( θ ^ θ 0 ) u i M W i = o p ( 1 ) and ( N T ) 1 i = 1 N W i M W i ( θ 0 θ ^ ) ( θ 0 θ ^ ) W i M W i = o p ( 1 ) .
Therefore, if N , T , a feasible consistent estimator of the asymptotic variance is
V ^ ( θ ^ ) = N T W A W 1 i = 1 N W i M v ^ i v ^ i M W i W A W 1 .
The asymptotic distribution result (31) with the variance matrix estimated by (32) resembles Theorem 3 of Hansen (2007). Appendix D discusses the asymptotic distribution of ( N T ) 1 i = 1 N W i M v ^ i v ^ i M W i if N is fixed and its implications for asymptotic inference when one is using (32). In particular, the N / ( N 1 ) t N 1 approximation proposed by Hansen (2007), where t N 1 is the t distribution with N 1 degrees of freedom, may be used to approximate the asymptotic distribution of the t-statistic.
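As a sketch (function and variable names are illustrative), the feasible variance (32) can be assembled from the per-unit residuals v ^ i = y i W i θ ^ :

```python
import numpy as np

def feasible_variance(W_list, y_list, theta_hat):
    # V_hat = N T (W'AW)^{-1} [sum_i W_i' M vhat_i vhat_i' M W_i] (W'AW)^{-1},
    # as in (32); consistency requires both N and T to be large
    N = len(W_list)
    T = W_list[0].shape[0]
    M = np.eye(T) - np.ones((T, T)) / T        # within-group projector
    WAW = sum(W.T @ M @ W for W in W_list)     # W'AW = sum_i W_i' M W_i
    meat = np.zeros_like(WAW)
    for W, y in zip(W_list, y_list):
        s = W.T @ M @ (y - W @ theta_hat)      # W_i' M vhat_i
        meat += np.outer(s, s)
    WAW_inv = np.linalg.inv(WAW)
    return N * T * WAW_inv @ meat @ WAW_inv
```

The result is symmetric positive semidefinite by construction; when N is small, the t-ratio built from it may be referred to the N / ( N 1 ) t N 1 approximation discussed above.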

4.2.2. Unit-Root Case

In this case, the diagonal elements of M Φ p 1 L Σ i = Φ p 1 L Σ i T 1 1 1 Φ p 1 L Σ i are O ( 1 ) , since Φ p 1 L Σ i is strictly lower triangular and T 1 1 1 Φ p 1 L has O ( 1 ) elements. Further, Ψ is dominated by ( T 2 ) 1 T Dg ( Φ p 1 L Σ i ) . As before, the scaling matrix Υ is used such that the set of moment conditions g N T ( θ ) and gradient matrix G N T ( θ ) are defined accordingly. Specifically, g N T ( θ ) is Υ 1 [ N T g N T ( θ ) ] with g N T ( θ ) given by (25) and G N T ( θ ) is Υ 1 / 2 [ N T G N T ( θ ) Υ 1 / 2 ] with G N T ( θ ) given by (26).
Theorem 5.
Under Assumptions 3, 4 and 6 and that DP ( p ) has a unit root, assume the fixed effects are deterministic and of finite magnitudes. If plim T G N T and lim T Ω 1 , N T ( α ) exist and are nonsingular, then as T ,
Υ 1 / 2 ( θ ^ θ 0 ) d N 0 , plim T G 1 lim T Ω 1 , N T ( α ) plim T G 1 ,
where
Ω 1 , N T ( α ) = Υ 1 / 2 Γ 2 + Γ 3 ( α ) F 1 + F 2 ( α ) F 1 + F 2 ( α ) i = 1 N X i M Σ i M X i Υ 1 / 2 ,
in which Γ 2 and F 1 are defined in (29), Γ 3 ( α ) is p × p with i = 1 N 1 Σ i L Φ p 1 M Σ i M Φ p 1 L s 1 α i 2 in its ( , s ) -th position, , s = 1 , , p , and F 2 ( α ) is k × p with i = 1 N X i M Σ i M Φ p 1 L 1 α i as its ℓ-th column, = 1 , , p .
Based on Lemmas A17 and A18, one also has the following result.
Corollary 5.
Under the assumptions in Theorem 5, as T ,
Υ 1 / 2 ( θ ^ θ 0 ) d N 0 , V ,
where V = ( plim T Υ 1 / 2 W A W Υ 1 / 2 ) 1 plim T Υ 1 / 2 ( i = 1 N W i M Σ i M W i ) Υ 1 / 2
( plim T Υ 1 / 2 W A W Υ 1 / 2 ) 1 .
Similar to (32), for practical inference, one needs N , T so that W i M Σ i M W i can be replaced with W i M v ^ i v ^ i M W i in (35) and the probability limits are replaced with the sample analogs to form an estimated asymptotic variance.

5. Monte Carlo Evidence

In this section, Monte Carlo simulations are conducted to assess the finite-sample performance of the proposed estimator. Bao and Yu (2023) provide simulation results for the first- and second-order models when the idiosyncratic errors are either homoskedastic or temporally heteroskedastic. Recall that under the baseline framework, the proposed estimator in this paper is also robust to cross-sectional heteroskedasticity. This section considers a third-order model, and for one to get a comprehensive spectrum of possible heteroskedasticity, four scenarios are included: homoskedasticity (across i and t), cross-sectional heteroskedasticity (across i only), temporal heteroskedasticity (over t only), and double (both cross-sectional and temporal) heteroskedasticity.
The following DP(3) is used:
y i t = α i + ϕ 1 y i , t 1 + ϕ 2 y i , t 2 + ϕ 3 y i , t 3 + β 1 x 1 , i t + β 2 x 2 , i t + u i t , x 1 , i t = ρ x x 1 , i , t 1 + ξ 1 , i t , ρ x = 0.8 , x 2 , i t = ρ i α i + ξ 2 , i t ,
where α i is i.i.d. (across i) following a standard normal distribution, ρ i is i.i.d. (across i) uniformly on the interval [ 0 , 1 ] , ξ 1 , i t is i.i.d. (across i and t) following a standard normal distribution, and ξ 2 , i t is similarly defined. The initial value of x 1 , i t is set to be ξ 0 , i / 1 ρ x 2 , where ξ 0 , i is i.i.d. (across i) following a standard normal distribution. The initial observations on y i t are simulated as α i / ( 1 ϕ 01 ϕ 02 ϕ 03 ) + x i t β 0 + u i t Var ( δ t ) if there is no unit root and α i + x i t β 0 + u i t otherwise, where δ t follows a stationary zero-mean third-order autoregressive (AR(3)) process with coefficients ϕ 01 , ϕ 02 , and ϕ 03 , and its shock term is a unit-variance white noise.
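The DGP above can be sketched as follows (names are illustrative; for the stable configurations, a burn-in period is used here in place of the paper's exact stationary initialization, which is a simplification):

```python
import numpy as np

def simulate_dp3(N, T, u, phi=(0.3, 0.3, 0.2), beta=(1.0, 1.0),
                 rho_x=0.8, burn=200, seed=1):
    # DP(3) design: y_it = alpha_i + phi_1 y_{i,t-1} + phi_2 y_{i,t-2}
    #               + phi_3 y_{i,t-3} + beta_1 x_{1,it} + beta_2 x_{2,it} + u_it
    rng = np.random.default_rng(seed)
    alpha = rng.standard_normal(N)                     # fixed effects
    rho_i = rng.uniform(0.0, 1.0, N)                   # loading of x2 on alpha_i
    Tt = T + burn
    x1 = np.zeros((N, Tt))
    x1[:, 0] = rng.standard_normal(N) / np.sqrt(1 - rho_x ** 2)  # stationary start
    for t in range(1, Tt):
        x1[:, t] = rho_x * x1[:, t - 1] + rng.standard_normal(N)
    x2 = rho_i[:, None] * alpha[:, None] + rng.standard_normal((N, Tt))
    uf = np.hstack([rng.standard_normal((N, burn)), u])  # burn-in shocks, then u
    y = np.zeros((N, Tt))
    for t in range(3, Tt):
        y[:, t] = (alpha + phi[0] * y[:, t - 1] + phi[1] * y[:, t - 2]
                   + phi[2] * y[:, t - 3]
                   + beta[0] * x1[:, t] + beta[1] * x2[:, t] + uf[:, t])
    return y[:, burn:], x1[:, burn:], x2[:, burn:]
```

The error array u is passed in so that the same DGP can be paired with each of the four heteroskedasticity scenarios described below.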
Let e i t be i.i.d. (across i and t) with mean 0 and variance 1. The error term u i t is simulated as follows in the four different scenarios (homoskedasticity, cross-sectional heteroskedasticity, temporal heteroskedasticity, and double heteroskedasticity):
u i t = e i t z i e i t z i U [ 0.5 , i ] or χ 10 2 if U [ 0.5 , i ] 100 z t e i t z t U [ 0.5 , t 2 ] or χ 10 2 if U [ 0.5 , t 2 ] 100 z i t e i t z i t U [ 0.5 , i ] · U [ 0.5 , t 2 ] or χ 10 2 if U [ 0.5 , i ] · U [ 0.5 , t 2 ] 100
where, for example, z i U [ 0.5 , i ] or χ 10 2 if U [ 0.5 , i ] 100 means that z i is independently (across i) drawn from a uniform distribution on the interval [ 0.5 , i ] , and if the realization is no smaller than 100, then it is redrawn from a chi-squared distribution with 10 degrees of freedom. e i t is simulated from a normal distribution, but results under non-normal distributions are available upon request and the conclusions in this section are largely consistent across all distributions.16
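The four error scenarios can be sketched as follows (function names are illustrative):

```python
import numpy as np

def _scale(upper, rng):
    # draw from U[0.5, upper]; redraw from chi-squared(10) if the draw >= 100
    z = rng.uniform(0.5, upper)
    return rng.chisquare(10) if z >= 100 else z

def simulate_errors(N, T, scheme, seed=0):
    # scheme: "homo", "cross" (sigma_i), "time" (sigma_t), or "double" (sigma_it)
    rng = np.random.default_rng(seed)
    e = rng.standard_normal((N, T))                    # e_it i.i.d. N(0, 1)
    if scheme == "homo":
        return e
    if scheme == "cross":
        z = np.array([_scale(i, rng) for i in range(1, N + 1)])
        return z[:, None] * e
    if scheme == "time":
        z = np.array([_scale(t ** 2, rng) for t in range(1, T + 1)])
        return z[None, :] * e
    if scheme == "double":
        z = np.empty((N, T))
        for i in range(1, N + 1):
            for t in range(1, T + 1):
                prod = rng.uniform(0.5, i) * rng.uniform(0.5, t ** 2)
                z[i - 1, t - 1] = rng.chisquare(10) if prod >= 100 else prod
        return z * e
    raise ValueError(f"unknown scheme: {scheme}")
```

In the cross-sectional scheme the scale varies with i only, in the temporal scheme with t only, and in the double scheme the scale is the (truncated) product of the two uniform draws.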
In reality, data themselves rarely reveal clearly the lack or presence of heteroskedasticity, so estimation results from the estimator using the recentered moment conditions (7) and the robust one using (25), referred to as RMM and RMM r , respectively, are presented. When the empirical rejection rates of a 5 % two-sided t-test of the relevant parameter equal to its true value are reported, for WG and GMM, also included are their empirical sizes from t-ratios using the White-type robust standard errors (clustered at the individual level, referred to as WG(h) and GMM(h), respectively). There are different choices for the GMM estimator, depending on what and how many instruments are used. Breitung et al. (2022) show that the one-step estimator of Arellano and Bond (1991) is very comparable to other popular choices (Ahn and Schmidt 1995; Blundell and Bond 1998), so the GMM estimator used in this section is the one-step one.17 Different combinations of N and T are experimented with: ( N , T ) = ( 100 , 10 ) , ( 50 , 20 ) , ( 50 , 50 ) , ( 25 , 40 ) , ( 20 , 50 ) , and ( 10 , 100 ) . Regardless of N and T, one can always use the estimator in this paper, but in reality, in a situation like N = 25 , T = 40 , to conduct inference, there is typically no convincing evidence for one to favor one asymptotic regime over the other. So in what follows, in addition to the bias and root mean squared error (RMSE), all computed over 10,000 simulations, also reported are the empirical sizes using standard errors constructed under different asymptotics. In particular, RMM(N) and RMM r (N) denote the empirical sizes from t-ratios using the large-N standard errors (see Section 3.2 and Section 4.1) for RMM and RMM r , respectively. Likewise, RMM(T) refers to the empirical size from the t-ratio when the large-T standard errors from (17) (or (22)) are used for RMM. Furthermore, RMM r ( N T ) means that the feasible variance (32) (or its version under a unit root) is used for RMM r , which is valid if both N and T are large.
On the other hand, RMM r (T) does not mean that a different t-ratio is used; instead, it signals that, for the RMM r t-ratio when the feasible variance is used, the N / ( N 1 ) t N 1 approximation developed by Hansen (2007) under large T is used for conducting hypothesis testing.
Included for comparison are the bias-corrected estimators of Bun and Carree (2006) (BC for short) and Juodis (2013) (BCJ for short).18 As Juodis (2013) points out, BC is consistent under homoskedasticity and cross-sectional heteroskedasticity, but invalid under temporal heteroskedasticity. BCJ, on the other hand, is robust to both forms of heteroskedasticity. Neither Bun and Carree (2006) nor Juodis (2013) derives the asymptotic distributions of their bias-corrected estimators, so only their bias and RMSE results are reported in this section. Included also is the half-panel jackknife (HPJ) estimator of Chudik et al. (2018), where errors can be heteroskedastic across both i and t.
In the experiments, β 0 = ( 1 , 1 ) and three sets of parameter configurations of ϕ 0 are used: ( 0.3 , 0.3 , 0.2 ) , ( 0.3 , − 0.2 , − 0.1 ) , and ( 0.3 , 0.6 , 0.1 ) . The first configuration reflects a situation where the degree of time-series correlation is relatively strong in the sense that the cumulative partial effect of a past shock, measured by powers of ϕ 01 + ϕ 02 + ϕ 03 , can be high. The second set corresponds to a case of zero past effect, and the last one is a unit-root case where the past effect never dies out. To save space, only results related to ϕ 01 + ϕ 02 + ϕ 03 under the first set of parameter configurations ( 0.3 , 0.3 , 0.2 ) are reported, whereas results under the other two parameter configurations are contained in Appendix F.19
Table 1 reports the bias and RMSE, both multiplied by 100, and empirical rejection rate of the two-sided 5 % t-test related to ϕ 01 + ϕ 02 + ϕ 03 when u i t is homoskedastic. One sees clearly the superb performance of both RMM and RMM r , with the smallest bias and lowest RMSE. GMM performs reasonably well, but its bias and RMSE peak at ( N , T ) = ( 50 , 20 ) . The WG estimator performs the worst when T is small but improves as T gets relatively larger, but is still more biased even when T = 100 compared with RMM or RMM r . BC and BCJ report larger bias and higher RMSE than GMM on many occasions, though there are also cases where they are much better than GMM. Note that BC and BCJ bias-correct the WG estimator, which is in fact consistent under large T. Table A1 in Appendix F reveals that when there is a unit root, BC and BCJ give a much larger bias than WG, even when T is relatively large. So, in this case, the action of bias-correction gives more biased estimates. HPJ is the most biased and possesses the highest RMSE among all the consistent estimators at ( N , T ) = ( 100 , 10 ) , namely, when the panel is relatively short. Its performance improves quickly as T increases, though still quite a bit below RMM/RMM r .
Now, consider the size performance of the associated two-sided 5 % test from the different estimators. From Table 1, one sees severe size distortions of the WG-based inference when T is relatively small, but its empirical size is close to the nominal size when T = 100 . The GMM-based inference performs really poorly. The t-test based on HPJ severely over-rejects when ( N , T ) = ( 100 , 10 ) , but its size distortion goes down quickly when the panel has longer spans. The large-N-based inferences from RMM and RMM r , namely, RMM(N) and RMM r (N), give empirical sizes close to 5 % when N is relatively large. On the other hand, RMM(T) delivers a reasonably good size performance even when T is relatively small. Results from RMM r ( N T ) and RMM r (N) are mixed: when they work, they may perform slightly worse than RMM(N) and RMM(T), but their performances get worse when N is small and T is large. Finally, the size result from RMM r (T), namely, using the N / ( N 1 ) t N 1 approximation for the asymptotic distribution of the t-ratio, is good in almost all cases. Recall that the N / ( N 1 ) t N 1 approximation is valid under homoskedasticity. These results are not surprising, given that homoskedastic errors are simulated and that, for the robust estimator, the feasible standard errors based on (32) (or its version under unit root) require both N and T to be large. In fact, when ( N , T ) = ( 50 , 50 ) , the size distortion from RMM r ( N T ) is the smallest compared with other ( N , T ) combinations with relatively smaller N or T.
Table 2 reports results under cross-sectional heteroskedasticity. Relative performances of these different estimators largely stay the same. Notably, BC and BCJ perform really poorly when ( N , T ) = ( 100 , 10 ) , though they are designed to bias-correct the WG estimator under large N and fixed T. In contrast to the homoskedastic case, HPJ gives a smaller bias and lower RMSE than RMM/RMM r when ( N , T ) = ( 100 , 10 ) , but usually performs worse in other cases. In terms of hypothesis testing, when N is relatively large, both RMM(N) and RMM r (N) work well; when T is relatively large, both RMM(T) and RMM r (T) provide empirical rejection rates close to the nominal size. Recall that the RMM estimator designed under homoskedasticity is also valid under cross-sectional heteroskedasticity, and so is the related inference procedure. There is pronounced size distortion from HPJ for the unit root case (see Table A2 in Appendix F), but its size performance is reasonably good under the other two parameter configurations when T is not relatively small.
When there is time-series heteroskedasticity, the robust estimator usually dominates all the other estimators in terms of bias and RMSE, as demonstrated in Table 3 (and Table A3 in Appendix F). In terms of hypothesis testing, the large-N RMM r t-ratio delivers very good size performance when N is relatively large, whereas the N / ( N 1 ) t N 1 approximation (RMM r (T)) provides good results when T is relatively large. The large-N-large-T RMM r -based inference reports upward size distortions in almost all cases.
Finally, Table 4 provides results under double heteroskedasticity, namely, when both cross-sectional heteroskedasticity and time-series heteroskedasticity are present. RMM r is the least biased, except when ( N , T ) = ( 100 , 10 ) , where HPJ is slightly better. On the other hand, RMM, which ignores temporal heteroskedasticity, reports very small bias on many occasions, especially under the second and third parameter configurations (see Table A4 in Appendix F). In terms of RMSE, BCJ and RMM r are comparable. In terms of the size performance of the associated t-test, the story is very similar to that reported when there is temporal heteroskedasticity only, namely, under large N, RMM r (N) is most trustworthy and RMM r (T) is the one under large T. GMM has the most size distortions in almost all cases, and HPJ gives severe size distortions when there is a unit root.
Summarizing all the simulation results, one can learn the following. (i) When there is no heteroskedasticity, the proposed estimator can be safely used, either the one based on (7) or the robust one based on (25), regardless of N and T. When heteroskedasticity in the time dimension is present, the robust version is the best choice. (ii) The presence or absence of a unit root has no substantial impact on its performance. (iii) In terms of inference, when N is relatively large or is of comparable size relative to T, the large-N-based inference from RMM r has very good size performance, regardless of heteroskedasticity. When T is relatively large, the large-T-based inference from RMM is reliable under homoskedasticity and cross-sectional heteroskedasticity. (iv) When T is large, the N / ( N 1 ) t N 1 approximation for the t-ratio from RMM r with the feasible robust variance usually has good size performance, regardless of heteroskedasticity.

6. Conclusions and Directions of Future Research

This paper proposes an estimation strategy that does not rely on instrumental variables. One can view the estimation strategy as using the endogenous lagged dependent variables as their own instruments and then constructing recentered moment conditions by explicitly exploiting the correlation between the endogenous variables and error term in the model. The asymptotic properties of the new estimator are thoroughly investigated under various conditions that relate to the sizes of cross-sectional units and time periods, heterogeneity in the error variance, and the issue of whether a unit root is present. In general, the asymptotic distribution does not require both N and T to be large. Under large T, it resembles the familiar OLS result in traditional regression analysis and its asymptotic variance achieves the efficiency bound under homoskedasticity. Similar to time series autoregressions, the convergence rate of the estimator of the autoregressive parameters is different when there is a unit root under large T, but the standard t-test procedure carries through in hypothesis testing. Monte Carlo simulations demonstrate that it possesses good finite-sample properties in various situations. All the theoretical results and Monte Carlo evidence in this paper suggest two directions for future research.

6.1. Cross-Sectional Correlation

Note that cross-sectional correlation is not explicitly discussed in this paper. If there is a weak cross-sectional correlation (in the sense that the covariance matrix of ( u 1 t , , u N t ) has a bounded norm as N ), then the estimation strategy proposed in this paper still holds; see the discussion in Appendix E. However, both N and T need to be large. This form of cross-sectional correlation may cover situations where correlation arises from spatial contiguity as in the spatial econometrics literature. Nevertheless, when there are dominating units (Pesaran and Yang 2021) whose errors are always correlated with those from other units, the assumption of weak correlation is violated. Another situation of violation is when cross-sectional correlation is due to common factors. Under homoskedasticity, for the case of DP(1), De Vos and Everaert (2021) augment the model by the cross-sectional averages of the right-hand side regressors, which can be interpreted as proxies for the unknown factors. Then, they derive the asymptotic bias of the common correlated effects pooled (CCEP) estimator of the main parameter vector under large N and fixed T and design an estimator by matching the CCEP estimator with its asymptotic bias.20 One could have directly focused on the endogeneity of the defactored lagged dependent variables, namely, ( I T P ) w i , where P is the projection matrix on the proxies, consisting of cross-sectional averages (including 1 T ). The exact expectation of i = 1 N w i ( I T P ) u i is not easy to derive, but one can follow De Vos and Everaert (2021) to approximate it and construct the resulting estimator. It is left for future research to extend the robust estimator to cases where there is a strong cross-sectional correlation in addition to heteroskedasticity in higher-order DP with the possibility of a unit root.

6.2. Inference under Heteroskedasticity for Long Panels

It has been observed from Section 5 that all the other estimators in the experiments perform much worse in terms of bias under cross-sectional and temporal heteroskedasticity than the robust estimator when T is not very small. On the other hand, non-negligible size distortions using the feasible variance (32) are also documented and one may need to resort to the approximation of Hansen (2007) for inference purposes, though one has yet to show rigorously that the approximation is still valid in the presence of heteroskedasticity. The re-sampling approach (Kapetanios 2008) may not be applicable under small N and large T. In addition, if one re-samples the data across i (when N is large and T is small) or over t (when T is large and N is small), there is the issue of heteroskedasticity or temporal correlation that one needs to take into account. Designing a practical and reliable inference procedure for the robust estimator in long panels is a second avenue for future research.

Funding

This research received no external funding.

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Acknowledgments

The author thanks three anonymous referees, Mohitosh Kejriwal, Joon Park, Keli Xu, Xuewen Yu, Xiaoyan Zhou, and seminar participants at Fudan University and Indiana University (Bloomington) for their helpful comments.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Some Preliminary Results

Note that Φ p ( ϕ ) is a lower triangular Toeplitz matrix with non-zero diagonals, so it is always invertible and its inverse is also lower triangular Toeplitz. From Linz (1985, p. 172), one has
Φ p 1 ( ϕ ) = 1 b 1 1 b 2 b 1 1 · · · · · b T 1 · b 1 1 ,
where b t can be obtained from the recursion, b t = s = 0 t 1 ϕ t s b s , t = 1 , , T 1 , b 0 = 1 and ϕ t s = 0 for t s > p . This defines a p-th order difference equation for the series { b t } t = 0 T 1 . The solution of the p-th order homogeneous equation b t ϕ 1 b t 1 ϕ p b t p = 0 is
b t = r = 1 p a r λ r t ,
where the λ r ’s are the inverses of the roots of 1 ϕ 1 z ϕ p z p = 0 , and the a r ’s are constants determined by the first p initial conditions, where, without loss of generality, it is assumed that all the λ r ’s are distinct.21
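The recursion for b t and the Toeplitz structure of Φ p 1 ( ϕ ) in (A1) can be verified numerically; the sketch below (illustrative names) computes the b t and checks them against the first column of the matrix inverse:

```python
import numpy as np

def b_coeffs(phi, T):
    # b_0 = 1, b_t = sum_{j=1}^{min(t,p)} phi_j b_{t-j}: the coefficients of
    # the expansion of 1 / (1 - phi_1 z - ... - phi_p z^p)
    p = len(phi)
    b = [1.0]
    for t in range(1, T):
        b.append(sum(phi[j - 1] * b[t - j] for j in range(1, min(t, p) + 1)))
    return np.array(b)

def phi_matrix(phi, T):
    # Phi_p(phi) = I - phi_1 L - ... - phi_p L^p
    Phi = np.eye(T)
    for j, ph in enumerate(phi, start=1):
        Phi -= ph * np.eye(T, k=-j)
    return Phi
```

Since Φ p 1 ( ϕ ) is lower-triangular Toeplitz, its entire inverse is pinned down by its first column ( 1 , b 1 , , b T 1 ) .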
For any 0 , let the ( i , j ) -th element of the T × T matrix Φ p 1 L be denoted by b i j = O ( 1 ) with the understanding that b i j = 0 whenever i j < 0 . The ( i , j ) -th element of the T × T matrix Φ p 1 L Σ k is then b i j σ k j 2 .
When the DP(p) is dynamically stable, the i-th element of 1 Φ p 1 L is the sum of the i-th column of Φ p 1 L , given by
j = i + T b j i = r = 1 p a r j = i + T λ r j i = r = 1 p a r 1 λ r T + 1 i 1 λ r , i T + 1 .
The i-th element of 1 Φ p 1 L Σ k is
j = i + T b j i σ k i 2 = σ k i 2 r = 1 p a r 1 λ r T + 1 i 1 λ r i T + 1 .
The i-th element of Φ p 1 L 1 is the sum of the i-th row of Φ p 1 L , given by
j = 1 i b i j = r = 1 p a r j = 1 i λ r i j = r = 1 p a r 1 λ r i 1 λ r , i .
The i-th element of Φ p 1 L Σ k 1 is
j = 1 i b i j σ k j 2 = r = 1 p a r j = 1 i σ k j 2 λ r i j .
The ( i , j ) -th element of ( Φ p 1 L s ) Φ p 1 L is
r 1 = 1 p r 2 = 1 p a r 1 a r 2 j 1 = 1 T λ r 1 j 1 i s λ r 2 j 1 j = r 1 = 1 p r 2 = 1 p a r 1 a r 2 w i , j , s , , r 1 , r 2 ,
where
w i , j , s , , r 1 , r 2 = λ r 2 i + s j λ r 1 T + 1 i s λ r 2 T + 1 j 1 λ r 1 λ r 2 j + i + s T + 1 λ r 1 j + i s λ r 1 T + 1 i s λ r 2 T + 1 j 1 λ r 1 λ r 2 i + s < j + T + 1 .
The ( i , j ) -th element of ( Φ p 1 L s Σ k ) Φ p 1 L Σ k is
σ k i 2 σ k j 2 r 1 = 1 p r 2 = 1 p a r 1 a r 2 j 1 = 1 T λ r 1 j 1 i s λ r 2 j 1 j = σ k i 2 σ k j 2 r 1 = 1 p r 2 = 1 p a r 1 a r 2 w i , j , s , , r 1 , r 2 .
When there is a unit root, let λ 1 = 1 and | λ r |   <   1 be all distinct, r = 2 , , p . Now, the ( i , j ) -th element of Φ p 1 L is b i j with b i j = a 1 + r = 2 p a r λ r i j . Note that the i-th element of 1 Φ p 1 L is
j = i + T b j i = r = 1 p a r j = i + T λ r j i = a 1 ( T + 1 i ) + r = 2 p a r 1 λ r T + 1 i 1 λ r , i T + 1 .
The i-th element of Φ p 1 L 1 is
j = 1 i b i j = r = 1 p a r j = 1 i λ r i j = a 1 ( i ) + r = 2 p a r 1 λ r i 1 λ r , i .
The ( i , j ) -th element of ( Φ p 1 L s ) Φ p 1 L is, when i + s j + ,
r 1 = 1 p r 2 = 1 p a r 1 a r 2 j 1 = i + s T λ r 1 j 1 i s λ r 2 j 1 j = a 1 2 ( T + 1 i s ) + a 1 r 2 = 2 p a r 2 j 1 = i + s T λ r 2 j 1 j + a 1 r 1 = 2 p a r 1 j 1 = i + s T λ r 1 j 1 i s + r 1 = 2 p r 2 = 2 p a r 1 a r 2 j 1 = i + s T λ r 1 j 1 i s λ r 2 j 1 j = a 1 2 ( T + 1 i s ) + a 1 r = 2 p a r ( 1 + λ r i + s j ) ( 1 λ r T + 1 i s ) 1 λ r + r 1 = 2 p r 2 = 2 p a r 1 a r 2 w i , j , s , , r 1 , r 2 ,
where w i , j , s , , r 1 , r 2 is given by (A4), and when i + s < j + ,
r 1 = 1 p r 2 = 1 p a r 1 a r 2 j 1 = j + t T λ r 1 j 1 i s λ r 2 j 1 j = a 1 2 ( T + 1 j ) + a 1 r 2 = 2 p a r 2 j 1 = j + t T λ r 2 j 1 j + a 1 r 1 = 2 p a r 1 j 1 = j + t T λ r 1 j 1 i s + r 1 = 2 p r 2 = 2 p a r 1 a r 2 j 1 = j + t T λ r 1 j 1 i s λ r 2 j 1 j = a 1 2 ( T + 1 j ) + a 1 r = 2 p a r ( 1 + λ r j + t i s ) ( 1 λ r T + 1 j ) 1 λ r + r 1 = 2 p r 2 = 2 p a r 1 a r 2 w i , j , s , , r 1 , r 2 .

Appendix B. Discussion of Several Related Estimators

It is worth discussing several closely related estimators. Bao and Yu (2023) motivate their estimator by matching the inconsistent WG estimator θ ^ W G = ( W A W ) 1 W Ay = θ 0 + ( W A W ) 1 W Au , when T is finite, with its approximate analytical expectation. More specifically, their indirect inference estimator is based on solving a random sample binding function, namely, θ ^ I I = arg θ { θ ^ W G = θ + ( W A W ) 1 E [ W Au ( θ ) ] } , where A u ( θ ) = A ( y W θ ) and E [ W Au ( θ ) ] is an analytical function of θ (and σ 2 ). Since, by definition, θ ^ W G θ 0 = ( W A W ) 1 W Au , their estimation strategy numerically amounts to matching W Au with E ( W Au ) , as considered in this paper, provided that ( W A W ) 1 is nonsingular. Under time-series heteroskedasticity, they introduce a robust estimator that matches E ^ ( W Au ) with E ( W Au ) , where E ^ ( W Au ) is a function of θ 0 and W but not involving the variance parameters. Their estimator is robust in the sense that the moment conditions are valid under time-series heteroskedasticity, and the variance parameters σ t 2 = Var ( u i t ) are not estimated explicitly.
Under temporal heteroskedasticity, Alvarez and Arellano (2022) consider explicitly estimating σ t 2 , such that the whole set of moment conditions includes both first-order conditions (of the log-likelihood function) with respect to θ and those with respect to the variance parameters. When there is no temporal heteroskedasticity, the estimator of Alvarez and Arellano (2022) is essentially based on moment conditions matching W Au with E ( W Au ) (without the variance parameter concentrated out) plus an additional one for the variance. Under cross-sectional heteroskedasticity, namely, Var ( u i t ) = σ i 2 , Breitung et al. (2022) construct their estimator by matching the “numerator” of the profile score function (associated with the parameter vector ϕ ) with its expectation. This, together with the exogeneity condition on X , constitutes the moment condition for their estimator. As discussed in the main text, it turns out that the estimator arising from matching W Au with E ( W Au ) under homoskedasticity is, in fact, robust to pure cross-sectional heteroskedasticity, and the estimator of Bao and Yu (2023) is equivalent to that of Breitung et al. (2022). However, neither estimator explicitly allows for heteroskedasticity in both cross-sectional and time-series dimensions. Breitung et al. (2022) mention this extension under stringent restrictions, but do not proceed to formally investigate the properties of the resulting estimator.
Bun and Carree (2006) and Juodis (2013) consider both forms of heteroskedasticity. In their bias-correction procedure, the variance parameters are estimated and then plugged into the bias expression to bias-correct the WG estimator. However, they focus on short panels and do not derive the asymptotic distribution of their bias-corrected estimator.
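To illustrate the inconsistency of the WG estimator that motivates these matching strategies, the following sketch simulates a pure DP(1) panel with fixed effects and shows the WG estimate falling well below the true coefficient, roughly in line with Nickell's first-order bias approximation ( 1 + ρ ) / ( T 1 ) . The simulation design (standard normal errors, stationary initial condition) and all function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def simulate_panel(N, T, rho, rng):
    """Simulate y_it = alpha_i + rho * y_i,t-1 + u_it with iid N(0,1) errors,
    drawing y_i0 from the stationary distribution given alpha_i."""
    alpha = rng.standard_normal(N)
    y = np.zeros((N, T + 1))
    y[:, 0] = alpha / (1 - rho) + rng.standard_normal(N) / np.sqrt(1 - rho**2)
    for t in range(1, T + 1):
        y[:, t] = alpha + rho * y[:, t - 1] + rng.standard_normal(N)
    return y

def wg_estimator(y):
    """Within-group (fixed-effects) estimator of rho for a pure AR(1) panel."""
    lag, cur = y[:, :-1], y[:, 1:]
    lag_d = lag - lag.mean(axis=1, keepdims=True)   # within transformation
    cur_d = cur - cur.mean(axis=1, keepdims=True)
    return (lag_d * cur_d).sum() / (lag_d**2).sum()

rng = np.random.default_rng(0)
rho0, N, T = 0.5, 2000, 6
est = np.mean([wg_estimator(simulate_panel(N, T, rho0, rng)) for _ in range(20)])
# Nickell's first-order approximation of the plim: rho0 - (1 + rho0)/(T - 1)
print(est, rho0 - (1 + rho0) / (T - 1))
```

The estimate is biased downward by roughly ( 1 + ρ ) / ( T 1 ) , the bias that the recentered-moment approach removes by matching W Au with its expectation.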

Appendix C. Lemmas and Proofs

For ease of presentation, let C = M Φ p 1 L , C s = C C s , D = Dg ( C ) , B = C M Ψ M , C s , s = M Φ p 1 Φ s L s 1 , and C ( s p ) , s = M Φ p 1 Φ ( s p ) L s 1 . Recall that Φ p = Φ p ( ϕ 0 ) = I ϕ 01 L ϕ 0 p L p , Φ ( s p ) = Φ s ( ϕ 0 ) Φ p ( ϕ 0 ) , Ψ = ( T 2 ) 1 T D ( T 2 ) 1 ( T 1 ) 1 tr ( D ) I . Denote Σ i ( 3 ) = Dg ( E ( u i 1 3 ) , , E ( u i T 3 ) ) and Σ i ( 4 ) = Dg ( E ( u i 1 4 ) 3 σ i 1 4 , , E ( u i T 4 ) 3 σ i T 4 ) when there is heteroskedasticity. dg ( · ) creates a column vector consisting of the diagonal elements of its argument, in order, and ⊙ is the matrix Hadamard (element-by-element) product operator.
Lemma A1.
(i) Under Assumption 5 or 6, u A u = O P ( N T ) . (ii) Under Assumptions 1–5, Var ( N T g N T ) = ( N T ) 1 Var ( W A u ) σ 4 2 ( T 1 ) / T + γ 2 ( T 1 ) 2 / T 2 h h . (iii) Under Assumptions 1–4 and 6,
Var ( N T g N T ) = 1 N T i = 1 N c i , 1 , 1 c i , 1 , p d i , 1 c i , p , 1 c i , p , p d i , p d i , 1 d i , p X i M Σ i M X i ,
where c i , , s = Cov ( y i , ( ) M u i u i M Ψ M u i , y i , ( s ) M u i u i M Ψ s M u i ) is given by the sum of the 15 terms in (A11) and d i , = Cov ( X i M u i , y i , ( ) M u i u i M Ψ M u i ) is given by (A12).
Proof. 
(i): Using moments of quadratic forms (e.g., Bao and Ullah 2010), one has E ( u A u ) = σ 2 N tr ( M ) = σ 2 N ( T 1 ) under Assumption 5 and E ( u A u ) = i = 1 N tr ( M Σ i ) under Assumption 6. Note that M is uniformly (in T) bounded in row and column sums and Σ i has O ( 1 ) elements. Thus, using Lemma A2 of Bao et al. (2020), tr ( M Σ i ) = O ( T ) , leading to i = 1 N tr ( M Σ i ) = O ( N T ) . Further, Var ( u A u ) = σ 4 N [ 2 tr ( M ) + γ 2 tr ( M M ) ] = σ 4 N 2 ( T 1 ) + γ 2 ( T 1 ) 2 / T = O ( N T ) under Assumption 5 and Var ( u A u ) = i = 1 N [ tr ( Σ i ( 4 ) M M ) + 2 tr ( Σ i M Σ i M ) ] under Assumption 6, where it is obvious that tr ( Σ i ( 4 ) M M ) = O ( T ) . Using Lemma A2 of Bao et al. (2020) again, one can claim tr ( Σ i M Σ i M ) = O ( T ) . So, in either case, one has Var ( u A u ) = O ( N T ) and one can claim that u A u = O P ( N T ) . (ii): See Supplementary Appendix D of Bao and Yu (2023). (iii): With g N T = ( N T ) 1 i = 1 N ( y i , ( 1 ) M u i u i M Ψ 1 M u i , , y i , ( p ) M u i u i M Ψ p M u i , u i M X i ) , one can derive Var ( N T g N T ) such that its top-left p × p block contains in its ( , s )-th position ( N T ) 1 i = 1 N Cov ( y i , ( ) M u i u i M Ψ M u i , y i , ( s ) M u i u i M Ψ s M u i ) , , s = 1 , , p , its lower-left k × p block consists of k × 1 columns like ( N T ) 1 i = 1 N Cov ( X i M u i , y i , ( ) M u i u i M Ψ M u i ) , = 1 , , p , and its lower-right k × k block is ( N T ) 1 i = 1 N X i M Σ i M X i . Substituting (4) into (25), one has, for = 1 , , p ,
y i , ( ) M u i u i M Ψ M u i = u i B u i + α i u i C 1 + u i C X i β 0 + j = 0 1 u i C j , j e 1 y i , j + j = p 1 u i C ( j p ) , j e 1 y i , j .
Thus, Cov ( y i , ( ) M u i u i M Ψ M u i , y i , ( s ) M u i u i M Ψ s M u i ) has the following 15 terms,
tr ( Σ i ( 4 ) B B s ) + tr [ Σ i B Σ i ( B s + B s ) ] , Var ( α i ) 1 C Σ i C s 1 , β 0 X i C Σ i C s X i β 0 , j 1 = 1 1 j 2 = 1 s 1 E ( y i , j 1 y i , j 2 ) e 1 C j 1 , j 1 Σ i C j 2 , s j 2 e 1 , j 1 = p 1 j 2 = s p 1 E ( y i , j 1 y i , j 2 ) e 1 C ( j 1 p ) , j 1 Σ i C ( j 2 p ) , s j 2 e 1 , E ( α i ) [ dg ( Σ i ( 3 ) B ) C s 1 + dg ( Σ i ( 3 ) B s ) C 1 ] , dg ( Σ i ( 3 ) B ) C s X i β 0 + dg ( Σ i ( 3 ) B s ) C X i β 0 , j = 0 s 1 E ( y i , j ) dg ( Σ i ( 3 ) B ) u i C j , s j e 1 + j = 0 1 E ( y i , j ) dg ( Σ i ( 3 ) B s ) u i C j , j e 1 , j = s p 1 E ( y i , j ) dg ( Σ i ( 3 ) B ) u i C ( j p ) , s j e 1 + j = p 1 E ( y i , j ) dg ( Σ i ( 3 ) B s ) u i C ( j p ) , j e 1 , E ( α i ) ( 1 C Σ i C s X i β 0 + 1 C s Σ i C X i β 0 ) , j = 0 s 1 E ( y i , j α i ) 1 C Σ i C j , s j e 1 + j = 0 1 E ( y i , j α i ) 1 C s Σ i C j , j e 1 , j = s p 1 E ( y i , j α i ) 1 C Σ i C ( j p ) , s j e 1 + j = p 1 E ( y i , j α i ) 1 C s Σ i C ( j p ) , j e 1 , j = 0 s 1 E ( y i , j ) β 0 X i C Σ i C j , s j e 1 + j = 0 1 E ( y i , j ) β 0 X i C s Σ i C j , j e 1 , j = s p 1 E ( y i , j ) β 0 X i C Σ i C ( j p ) , s j e 1 + j = p 1 E ( y i , j ) β 0 X i C s Σ i C ( j p ) , j e 1 , j 1 = 0 1 j 2 = s p 1 E ( y i , j 1 y i , j 2 ) e 1 C j 1 , j 1 Σ i C ( j 2 p ) , s j 2 e 1 + j 1 = 0 s 1 j 2 = p 1 E ( y i , j 1 y i , j 2 ) e 1 C j 1 , s j 1 Σ i C ( j 2 p ) , j 2 e 1 ,
and
Cov ( X i M u i , y i , ( ) M u i u i M Ψ M u i ) = X i M dg ( Σ i ( 3 ) B ) + E ( α i ) X i M Σ i C 1 + X i M Σ i C X i β 0 + j = 0 1 E ( y i , j ) X i M Σ i C j , j e 1 + j = p 1 E ( y i , j ) X i M Σ i C ( j p ) , j e 1
in view of results on moments of quadratic forms (Bao and Ullah 2010; Ullah 2004). ☐
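The quadratic-form moments invoked in the proof of Lemma A1 can be checked numerically. The sketch below (illustrative only; the centered-exponential error distribution, with σ 2 = 1 , skewness γ 1 = 2 , and excess kurtosis γ 2 = 6 , and the fixed vector b are arbitrary choices) verifies, for the within-transformation matrix M = I 1 1 / T , that E ( u M u ) = σ 2 tr ( M ) , Var ( u M u ) = σ 4 [ 2 tr ( M M ) + γ 2 tr ( M ⊙ M ) ] , and the skewness covariance Cov ( b u , u M u ) = γ 1 σ 3 b dg ( M ) used later in Lemma A4.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_draws = 8, 400_000
M = np.eye(T) - np.ones((T, T)) / T          # within-transformation matrix

# Centered exponential errors: sigma^2 = 1, skewness g1 = 2, excess kurtosis g2 = 6
u = rng.exponential(1.0, size=(n_draws, T)) - 1.0
sigma2, g1, g2 = 1.0, 2.0, 6.0

q = np.einsum('ij,jk,ik->i', u, M, u)        # u' M u for each draw
b = np.arange(1, T + 1, dtype=float)         # arbitrary fixed vector

# E(u'Mu) = sigma^2 tr(M) = sigma^2 (T - 1)
mean_theory = sigma2 * (T - 1)
# Var(u'Mu) = sigma^4 [2 tr(MM) + g2 tr(M . M)] = sigma^4 [2(T-1) + g2 (T-1)^2 / T]
var_theory = 2 * (T - 1) + g2 * (T - 1) ** 2 / T
# Cov(b'u, u'Mu) = g1 sigma^3 b' dg(M), the skewness term appearing in Lemma A4
cov_theory = g1 * b @ np.diag(M)

print(q.mean(), mean_theory)
print(q.var(), var_theory)
print(np.cov(u @ b, q)[0, 1], cov_theory)
```

All three Monte Carlo moments match the analytical formulas up to simulation error.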
Lemma A2.
For the stable case, namely, | λ r |   <   1 for all r = 1 , , p , the following hold for , s = 1 , , T : (i) 1 Φ p 1 L 1 = O ( T ) ; (ii) 1 Φ p 1 L Φ p 1 L s 1 = O ( T ) ; (iii) 1 C s 1 = O ( 1 ) and d C 1 = O ( 1 ) , where d = ( d 1 , , d T ) 0 consists of O ( 1 ) elements; (iv) tr ( C C s ) = O ( 1 ) ; (v) tr ( C C s ) = O ( T 1 ) ; (vi) tr ( C s ) = O ( T ) ; (vii) all the non-zero elements of dg ( I N C ) are O ( T 1 ) ; (viii) 1 C s C s 1 = O ( 1 ) ; (ix) tr ( C s C s ) = O ( T ) ; (x) tr ( C s C s ) = O ( T ) ; (xi) tr ( C s C s ) = O ( T ) .
Proof. 
These results follow by substituting (A2), (A3), and (A4) into the various terms involved.
(i):
Using (A2), one has
1 Φ p 1 L 1 = i = 1 T + 1 r = 1 p a r 1 λ r T + 1 i 1 λ r = r = 1 p a r ( T ) 1 λ r λ r 1 λ r T 1 λ r 2 = T r = 1 p a r 1 λ r + O ( 1 ) .
(ii):
Using (A2) and (A3), one has
1 Φ p 1 L Φ p 1 L s 1 = i = s T + 1 r = 1 p a r 1 λ r T + 1 i 1 λ r r = 1 p a r 1 λ r i s 1 λ r = r 1 = 1 p r 2 = 1 p a r 1 a r 2 i = s T + 1 1 λ r 1 T + 1 i 1 λ r 1 1 λ r 2 i s 1 λ r 2 = T r 1 = 1 p r 2 = 1 p a r 1 a r 2 1 λ r 1 1 λ r 2 + O ( 1 ) .
(iii):
Without loss of generality, assume s . First consider d = 1 . Using (A3), one has
1 ( Φ p 1 L ) Φ p 1 L s 1 = i = T r = 1 p a r 1 λ r i 1 λ r r = 1 p a r 1 λ r i s 1 λ r = r 1 = 1 p r 2 = 1 p a r 1 a r 2 i = T 1 λ r 1 i 1 λ r 1 1 λ r 2 i s 1 λ r 2 = T r 1 = 1 p r 2 = 1 p a r 1 a r 2 1 λ r 1 1 λ r 2 + O ( 1 ) .
Further,
1 T 1 Φ p 1 L s 1 · 1 Φ p 1 L 1 = 1 T T r = 1 p a r 1 λ r + O ( 1 ) T r = 1 p a r 1 λ r + O ( 1 ) = T r 1 = 1 p r 2 = 1 p a r 1 a r 2 1 λ r 1 1 λ r 2 + O ( 1 ) .
It follows that
1 C s 1 = 1 ( Φ p 1 L ) Φ p 1 L s 1 1 T 1 Φ p 1 L s 1 · 1 Φ p 1 L 1 = O ( 1 ) .
For = s , 1 C 1 = 1 ( Φ p 1 L ) M Φ p 1 L 1 is the sum of squares of the elements of C 1 . This implies that, at most, some finite number of elements of C 1 can be O ( 1 ) , and all other elements are O ( T 1 ) . Thus, for any non-zero T × 1 vector d of O ( 1 ) elements, d C 1 = O ( 1 ) .
(iv):
In light of the facts that T 1 1 Φ p 1 L s Φ p 1 L 1 = O ( 1 ) and T 1 1 Φ p 1 L Φ p 1 L s 1 = O ( 1 ) from (ii), T 2 1 Φ p 1 L s 1 · 1 Φ p 1 L 1 = O ( T 1 ) from (iii), and tr ( Φ p 1 L Φ p 1 L s ) = 0 since both Φ p 1 L and Φ p 1 L s are strictly lower triangular,
tr ( C C s ) = tr ( Φ p 1 L M Φ p 1 L s ) 1 T 1 Φ p 1 L M Φ p 1 L s 1 = tr ( Φ p 1 L Φ p 1 L s ) 1 T 1 Φ p 1 L s Φ p 1 L 1 1 T 1 Φ p 1 L Φ p 1 L s 1 + 1 T 2 1 Φ p 1 L s 1 · 1 Φ p 1 L 1 = O ( 1 ) .
(v):
Note that both Φ p 1 L and Φ p 1 L s are strictly lower triangular. It then follows that tr ( C C s ) = T 2 tr ( 1 1 Φ p 1 L 1 1 Φ p 1 L s ) , where the ( i , i ) -th element of 1 1 Φ p 1 L is the i-th element of 1 Φ p 1 L . Therefore, using (A2) and assuming s ,
tr ( C C s ) = 1 T 2 1 Φ p 1 L ( Φ p 1 L s ) 1 = 1 T 2 r 1 = 1 p r 2 = 1 p a r 1 a r 2 i = 1 T + 1 1 λ r 1 T + 1 i 1 λ r 1 1 λ r 2 T + 1 i s 1 λ r 2 = 1 T 2 T r 1 = 1 p r 2 = 1 p a r 1 a r 2 1 λ r 1 1 λ r 2 + O ( 1 ) = O ( T 1 ) .
(vi):
Note that tr ( C s ) = tr ( ( Φ p 1 L ) Φ p 1 L s ) T 1 1 Φ p 1 L s ( Φ p 1 L ) 1 , where, from (v), T 1 1 Φ p 1 L s ( Φ p 1 L ) 1 = O ( 1 ) . Thus, using (A4) and assuming s ,
tr ( C s ) = r 1 = 1 p r 2 = 1 p a r 1 a r 2 i = 1 T w i , i , , s , r 1 , r 2 + O ( 1 ) = r 1 = 1 p r 2 = 1 p a r 1 a r 2 i = 1 T λ r 2 s λ r 1 T + 1 i λ r 2 T + 1 i s 1 λ r 1 λ r 2 + O ( 1 ) = T r 1 = 1 p r 2 = 1 p a r 1 a r 2 λ r 2 s 1 λ r 1 λ r 2 + O ( 1 ) .
(vii):
Since Φ p 1 L is strictly lower triangular, dg ( I N C ) = T 1 dg ( I N 1 1 Φ p 1 L ) . From (A2), the diagonal elements of 1 1 Φ p 1 L are all O ( 1 ) . Thus, one can claim that all the non-zero elements of dg ( I N C ) are O ( T 1 ) .
(viii):
From (iii), at most, some finite number of elements of C 1 can be O ( 1 ) and all other elements are O ( T 1 ) . Also, all the elements of Φ p 1 L s ( Φ p 1 L s ) are O ( 1 ) . Thus, 1 C s C s 1 = 1 ( Φ p 1 L ) M Φ p 1 L s ( Φ p 1 L s ) M Φ p 1 L 1 = O ( 1 ) .
(ix):
The ( i , j ) -th element of C s = ( Φ p 1 L ) Φ p 1 L s T 1 ( Φ p 1 L ) 1 1 Φ p 1 L s , in light of (A2) and (A4), is
r 1 = 1 p r 2 = 1 p a r 1 a r 2 w i , j , , s , r 1 , r 2 1 T 1 λ r 1 T i + 1 1 λ r 1 1 λ r 2 T j s + 1 1 λ r 2 = O ( 1 ) .
Therefore, tr ( C s C s ) = O ( T ) .
(x):
From the previous part, the ( i , j ) -th element of C s is r 1 = 1 p r 2 = 1 p a r 1 a r 2 w i , j , , s , r 1 , r 2 + O ( T 1 ) . Thus, the i-th diagonal element of C s C s is
j = 1 T r 1 = 1 p r 2 = 1 p a r 1 a r 2 w i , j , , s , r 1 , r 2 r 1 = 1 p r 2 = 1 p a r 1 a r 2 w i , j , s , , r 1 , r 2 + O ( T 1 ) = r 1 = 1 p r 2 = 1 p r 3 = 1 p r 4 = 1 p a r 1 a r 2 a r 3 a r 4 j = 1 T w i , j , , s , r 1 , r 2 w i , j , s , , r 3 , r 4 + O ( T 1 ) = O ( 1 ) .
Therefore, tr ( C s C s ) = O ( T ) .
(xi):
By similar reasoning as in (x), one can show that the i-th diagonal element of C s C s is O ( 1 ) , and then tr ( C s C s ) = O ( T ) .
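The stable-case orders in Lemma A2 can be illustrated numerically for p = 1 , where Φ p 1 L has the closed form ( I λ L ) 1 L and a 1 = 1 . The sketch below (illustrative; the choice λ = 0.5 and the grid of T values are arbitrary) checks, with = s , that the ratio in (i) approaches 1 / ( 1 λ ) , that the quantities in (iii) and (iv) stay bounded as T grows, and that the trace in (vi) grows linearly in T.

```python
import numpy as np

def F_matrix(T, lam):
    """Phi_p^{-1} L for an AR(1) panel (p = 1): (I - lam*L)^{-1} L, L = lag matrix."""
    L = np.eye(T, k=-1)
    return np.linalg.solve(np.eye(T) - lam * L, L)

lam, res = 0.5, {}
for T in (200, 800):
    F = F_matrix(T, lam)
    M = np.eye(T) - np.ones((T, T)) / T
    C = M @ F
    one = np.ones(T)
    res[T] = {
        'i':   one @ F @ one / T,      # (i): 1'F1 = O(T); ratio -> 1/(1 - lam)
        'iii': one @ C.T @ C @ one,    # (iii): 1'C'C1 = O(1), stabilizes in T
        'iv':  np.trace(C @ C),        # (iv): tr(C C) = O(1), stabilizes in T
        'vi':  np.trace(C.T @ C) / T,  # (vi): tr(C'C) = O(T); ratio stabilizes
    }
for T, d in res.items():
    print(T, d)
```

Doubling T twice leaves (iii) and (iv) essentially unchanged while (vi) scales with T, matching the stated orders.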
Lemma A3.
For the unit-root case, namely, λ 1 = 1 and | λ r |   <   1 , r = 2 , , p , the following hold for , s = 1 , , T : (i) 1 Φ p 1 L 1 = O ( T 2 ) and d C 1 = O ( T 2 ) , where d = ( d 1 , , d T ) 0 consists of O ( 1 ) elements; (ii) 1 Φ p 1 L Φ p 1 L s 1 = O ( T 3 ) ; (iii) 1 C s 1 = O ( T 3 ) ; (iv) tr ( C C s ) = O ( T 2 ) ; (v) tr ( C C s ) = O ( T ) ; (vi) tr ( C s ) = O ( T 2 ) ; (vii) all the non-zero elements of dg ( I N C ) are O ( 1 ) . (viii) 1 C s C s 1 = O ( T 5 ) ; (ix) tr ( C s C s ) = O ( T 3 ) ; (x) tr ( C s C s ) = O ( T 4 ) ; (xi) tr ( C s C s ) = O ( T 4 ) .
Proof. 
These results follow by substituting (A5), (A6), (A7), and (A8) into the various terms involved.
(i):
Using (A5), one has
1 Φ p 1 L 1 = a 1 i = 1 T + 1 ( T + 1 i ) + r = 2 p a r i = 1 T + 1 1 λ r T + 1 i 1 λ r = a 1 2 ( T ) ( T + 1 ) + r = 2 p a r ( T ) ( 1 λ r ) λ r ( 1 λ r T ) ( 1 λ r ) 2 = a 1 2 T 2 + a 1 2 ( 1 2 ) T + T r = 2 p a r 1 λ r + O ( 1 ) .
Similarly, using (A6)
d Φ p 1 L = a 1 i = T d i ( i ) + r = 2 p a r i = T d i 1 λ r i 1 λ r = a 1 i = 1 T i d i a 1 i = 1 T d i + O ( T ) = a 1 i = 1 T i d i + O ( T ) = O ( T 2 )
and thus, by denoting d ¯ = T 1 i = 1 T d i ,
d M Φ p 1 L 1 = d Φ p 1 L 1 d ¯ 1 Φ p 1 L 1 + O ( T ) = a 1 i = 1 T i ( d i d ¯ ) = O ( T 2 ) ,
which may retain a term of order O ( T 2 ) , unless all the elements of d are the same. So, in general, d M Φ p 1 L 1 = O ( T 2 ) .
(ii):
Using (A5) and (A6),
1 Φ p 1 L Φ p 1 L s 1 = i = s T + 1 a 1 ( T + 1 i ) + r = 2 p a r 1 λ r T + 1 i 1 λ r a 1 ( i s ) + r = 2 p a r 1 λ r i s 1 λ r = a 1 2 i = s T + 1 ( T + 1 i ) ( i s ) + a 1 r = 2 p a r i = s T + 1 ( i s ) 1 λ r T + 1 i 1 λ r + a 1 r = 2 p a r i = s T + 1 ( T + 1 i ) 1 λ r i s 1 λ r + T r 1 = 2 p r 2 = 2 p a r 1 a r 2 1 λ r 1 1 λ r 2 + O ( 1 ) = a 1 2 6 ( T s ) ( T s + 1 ) ( T s + 2 ) + a 1 r = 2 p a r T 2 1 λ r + T [ 1 3 λ r 2 ( 1 λ r ) ( + s ) ] ( 1 λ r ) 2 + T r 1 = 2 p r 2 = 2 p a r 1 a r 2 1 λ r 1 1 λ r 2 + O ( 1 ) = a 1 2 6 T 3 + a 1 2 ( 1 s ) 2 T 2 + a 1 T 2 r = 2 p a r 1 λ r + O ( T ) .
(iii):
Without loss of generality, assume s . First, note that
1 ( Φ p 1 L ) Φ p 1 L s 1 = i = T a 1 ( i ) + r = 2 p a r 1 λ r i 1 λ r a 1 ( i s ) + r = 2 p a r 1 λ r i s 1 λ r = a 1 2 i = T ( i ) ( i s ) + a 1 r = 2 p a r i = T ( i ) 1 λ r i s 1 λ r + ( i s ) 1 λ r i 1 λ r + O ( T ) = a 1 2 3 T 3 + a 1 2 ( 1 s ) 2 T 2 + a 1 T 2 r = 2 p a r 1 λ r + O ( T ) .
Further, from (i),
1 T 1 Φ p 1 L s 1 · 1 Φ p 1 L 1 = 1 T a 1 2 T 2 + a 1 2 ( 1 2 s ) T + T r = 2 p a r 1 λ r a 1 2 T 2 + a 1 2 ( 1 2 ) T + T r = 2 p a r 1 λ r + O ( T ) = a 1 2 4 T 3 + a 1 2 ( 1 s ) 2 T 2 + a 1 T 2 r = 2 p a r 1 λ r + O ( T ) .
It follows that
1 C s 1 = 1 ( Φ p 1 L ) Φ p 1 L s 1 1 T 1 Φ p 1 L s 1 · 1 Φ p 1 L 1 = a 1 2 12 T 3 + O ( T ) .
(iv):
Using results from (ii) and (iii),
tr ( C C s ) = 1 T 2 1 Φ p 1 L s 1 · 1 Φ p 1 L 1 1 T 1 Φ p 1 L s Φ p 1 L 1 1 T 1 Φ p 1 L Φ p 1 L s 1 = a 1 2 4 T 2 + a 1 2 2 T ( 1 s ) + a 1 T r = 2 p a r 1 λ r a 1 2 3 T 2 + a 1 2 T ( 1 s ) + 2 a 1 T r = 2 p a r 1 λ r + O ( 1 ) = a 1 2 12 T 2 a 1 2 ( 1 s ) 2 T a 1 T r = 2 p a r 1 λ r + O ( 1 ) .
(v):
Assuming s ,
tr ( C C s ) = 1 T 2 1 Φ p 1 L ( Φ p 1 L s ) 1 = 1 T 2 i = 1 T + 1 s a 1 ( T + 1 i ) + O ( 1 ) a 1 ( T + 1 i s ) + O ( 1 ) = a 1 2 3 T + O ( 1 ) .
(vi):
Assuming s ,
tr ( C s ) = tr ( ( Φ p 1 L ) Φ p 1 L s ) 1 T 1 Φ p 1 L s ( Φ p 1 L ) 1 = i = 1 T + 1 a 1 2 ( T + 1 i ) + a 1 r = 2 p a r ( 1 + λ r s ) ( 1 λ r T + 1 i ) 1 λ r + r 1 = 2 p r 2 = 2 p a r 1 a r 2 w i , j , , s , r 1 , r 2 1 T i = 1 T + 1 a 1 ( T + 1 i s ) + r = 2 p a r ( 1 λ r T + 1 i s ) 1 λ r [ a 1 ( T + 1 i ) + r = 2 p a r ( 1 λ r T + 1 i ) 1 λ r ] = a 1 2 2 T 2 a 1 2 3 T 2 + O ( T ) = a 1 2 6 T 2 + O ( T ) .
(vii):
From (A5), the diagonal elements of 1 1 Φ p 1 L are all O ( T ) . Thus, one can claim that all the non-zero elements of dg ( I N C ) = T 1 dg ( I N 1 1 Φ p 1 L ) are O ( 1 ) .
(viii):
By substitution,
1 C s C s 1 = 1 ( Φ p 1 L ) M Φ p 1 L s ( Φ p 1 L s ) M Φ p 1 L 1 = 1 ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) Φ p 1 L 1 1 T 1 ( Φ p 1 L ) 1 · 1 Φ p 1 L s ( Φ p 1 L s ) Φ p 1 L 1 1 T 1 ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) 1 · 1 Φ p 1 L 1 + 1 T 2 1 ( Φ p 1 L ) 1 · 1 Φ p 1 L s ( Φ p 1 L s ) 1 · 1 Φ p 1 L 1 = 1 ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) Φ p 1 L 1 2 T 1 Φ p 1 L 1 · 1 ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) 1 + 1 T 2 1 Φ p 1 L s ( Φ p 1 L s ) 1 · ( 1 Φ p 1 L 1 ) 2 .
Assume s . Using (A7) and (A8),
1 ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) Φ p 1 L 1 = i = 1 T j = 1 i + s a 1 2 ( T + 1 i ) + j = i + s + 1 T + 1 j a 1 2 ( T + 1 j s ) + O ( 1 ) 2 = 2 a 1 4 15 T 5 + O ( T 4 ) .
Using (i), (A5), (A7), and (A8),
2 T 1 Φ p 1 L 1 · 1 ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) 1 = 2 T a 1 2 T 2 + O ( T ) × i = 1 T + 1 j = 1 i + s a 1 2 ( T + 1 i ) + j = i + s + 1 T + 1 j a 1 2 ( T + 1 j s ) + O ( 1 ) × a 1 ( T + 1 i s ) + O ( 1 ) = 5 a 1 4 24 T 5 + O ( T 4 ) .
Using (i) and (v),
1 T 2 1 Φ p 1 L s ( Φ p 1 L s ) 1 · ( 1 Φ p 1 L 1 ) 2 = a 1 2 3 T + O ( 1 ) a 1 2 T 2 + O ( T ) 2 = a 1 4 12 T 5 + O ( T 4 ) .
Therefore,
1 C s C s 1 = 2 a 1 4 15 T 5 5 a 1 4 24 T 5 + a 1 4 12 T 5 + O ( T 4 ) = a 1 4 120 T 5 + O ( T 4 ) .
(ix):
By substituting C s = ( Φ p 1 L ) Φ p 1 L s T 1 ( Φ p 1 L ) 1 1 Φ p 1 L s ,
tr ( C s C s ) = tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) Φ p 1 L s ) 2 T tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) 1 1 Φ p 1 L s ) + 1 T 2 tr ( ( Φ p 1 L ) 1 1 Φ p 1 L s ( Φ p 1 L ) 1 1 Φ p 1 L s ) .
Assume s . Using (A7),
tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) Φ p 1 L s ) = i = 1 T + 1 a 1 2 ( T + 1 i ) + O ( 1 ) 2 = a 1 4 3 T 3 + O ( T 2 ) .
Using (A5), the i-th diagonal element of ( Φ p 1 L ) 1 1 Φ p 1 L s is, i T + 1 ,
a 1 ( T + 1 i ) + r = 2 p a r 1 λ r T + 1 i 1 λ r a 1 ( T + 1 i s ) + r = 2 p a r 1 λ r T + 1 i s 1 λ r .
Thus,
2 T tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) 1 1 Φ p 1 L s ) = 2 T i = 1 T + 1 a 1 2 ( T + 1 i ) + O ( 1 ) × a 1 ( T + 1 i ) + O ( 1 ) a 1 ( T + 1 i s ) + O ( 1 ) = a 1 4 2 T 3 + O ( T 2 ) ,
and
1 T 2 tr ( ( Φ p 1 L ) 1 1 Φ p 1 L s ( Φ p 1 L ) 1 1 Φ p 1 L s ) = 1 T 2 i = 1 T + 1 a 1 ( T + 1 i ) + O ( 1 ) a 1 ( T + 1 i s ) + O ( 1 ) 2 = a 1 4 5 T 3 + O ( T 2 ) .
Therefore,
tr ( C s C s ) = a 1 4 3 T 3 a 1 4 2 T 3 + a 1 4 5 T 3 + O ( T 2 ) = a 1 4 30 T 3 + O ( T 2 ) .
(x):
By substitution,
tr ( C s C s ) = tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) Φ p 1 L ) 1 T tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) 1 1 Φ p 1 L ) 1 T tr ( ( Φ p 1 L ) 1 1 Φ p 1 L s ( Φ p 1 L s ) Φ p 1 L ) + 1 T 2 tr ( ( Φ p 1 L ) 1 1 Φ p 1 L s ( Φ p 1 L s ) 1 1 Φ p 1 L ) = tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) Φ p 1 L ) 2 T 1 Φ p 1 L ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) 1 + 1 T 2 1 Φ p 1 L s ( Φ p 1 L s ) 1 · 1 Φ p 1 L ( Φ p 1 L ) 1 .
Assume s . Using (A7),
tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) Φ p 1 L ) = i = 1 T j = 1 i + s a 1 2 ( T + 1 i ) + O ( 1 ) 2 + j = i + s + 1 T a 1 2 ( T + 1 j s ) + O ( 1 ) 2 = a 1 4 6 T 4 + O ( T 3 ) .
Using (v),
1 T ( 1 Φ p 1 L s ( Φ p 1 L s ) 1 ) · 1 T 1 Φ p 1 L ( Φ p 1 L ) 1 = a 1 2 3 T 2 + O ( T ) a 1 2 3 T 2 + O ( T ) = a 1 4 9 T 4 + O ( T 3 ) .
Using (A5), (A7), and (A8),
2 T 1 Φ p 1 L ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L s ) 1 = 2 T i = 1 T + 1 j = 1 i + s a 1 ( T + 1 i ) a 1 2 ( T + 1 i ) a 1 ( T + 1 j s ) + 2 T i = 1 T + 1 j = i + s + 1 T + 1 s [ a 1 ( T + 1 i ) ] [ a 1 2 ( T + 1 j s ) ] [ a 1 ( T + 1 j s ) ] + O ( T 3 ) = 4 a 1 4 15 T 4 + O ( T 3 ) .
Therefore,
tr ( C s C s ) = a 1 4 6 T 4 + a 1 4 9 T 4 4 a 1 4 15 T 4 + O ( T 3 ) = a 1 4 90 T 4 + O ( T 3 ) .
(xi):
Similarly,
tr ( C s C s ) = tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) Φ p 1 L s ) 2 T 1 Φ p 1 L s ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) 1 + 1 T 2 ( 1 Φ p 1 L ( Φ p 1 L s ) 1 ) 2 .
Assume s . If = s , from (x), tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) Φ p 1 L s ) = a 1 4 T 4 / 6 + O ( T 3 ) , in view of (A7) and (A8). Assume = s + 1 . Note that the ( i , j )-th element of ( Φ p 1 L ) Φ p 1 L s has leading term a 1 2 ( T + 1 i ) if i j 1 and T + 1 j s = T + 2 j if i < j 1 . The ( j , i ) -th element of ( Φ p 1 L ) Φ p 1 L s has leading term a 1 2 ( T + 1 j ) if j i 1 and T + 1 i s = T + 2 i if j < i 1 . Then,
tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) Φ p 1 L s ) = i = 1 T + 1 j = 1 i 2 a 1 2 ( T + 1 i ) a 1 2 ( T + 2 i ) + j = i + 2 T + 1 i = 1 j 2 a 1 2 ( T + 2 j ) a 1 2 ( T + 1 j ) + O ( T 3 ) = a 1 4 6 T 4 + O ( T 3 ) .
Similarly, for any s , tr ( ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) Φ p 1 L s ) = a 1 4 T 4 / 6 + O ( T 3 ) . Using (v), one has
1 T 2 ( 1 Φ p 1 L ( Φ p 1 L s ) 1 ) 2 = a 1 2 3 T 2 + O ( T ) 2 = a 1 4 9 T 4 + O ( T 3 ) .
Using (A5), (A7), (A8), and following similarly the proof in the previous part,
2 T 1 Φ p 1 L s ( Φ p 1 L ) Φ p 1 L s ( Φ p 1 L ) 1 = 4 a 1 4 15 T 4 + O ( T 3 ) .
Therefore,
tr ( C s C s ) = a 1 4 6 T 4 + a 1 4 9 T 4 4 a 1 4 15 T 4 + O ( T 3 ) = a 1 4 90 T 4 + O ( T 3 ) .
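The unit-root orders in Lemma A3 can likewise be checked numerically for p = 1 , where λ 1 = 1 , a 1 = 1 , and Φ p 1 L = ( I L ) 1 L is the strictly lower triangular matrix of ones. The sketch below (illustrative; the grid of T values is arbitrary) verifies, with = s , the limiting constants a 1 / 2 in (i), a 1 2 / 12 in (iii), and a 1 4 / 120 in (viii).

```python
import numpy as np

res = {}
for T in (200, 800):
    L = np.eye(T, k=-1)
    F = np.linalg.solve(np.eye(T) - L, L)    # Phi^{-1} L at the unit root
    M = np.eye(T) - np.ones((T, T)) / T
    C = M @ F
    one = np.ones(T)
    v = C @ one                              # the vector C1 (p = 1, a1 = 1)
    res[T] = {
        'i':    one @ F @ one / T**2,        # (i): 1'F1 = O(T^2); ratio -> a1/2
        'iii':  one @ C.T @ C @ one / T**3,  # (iii): 1'C'C1 / T^3 -> a1^2/12
        'viii': (C.T @ v) @ (C.T @ v) / T**5,  # (viii): 1'C'CC'C1 / T^5 -> a1^4/120
    }
for T, d in res.items():
    print(T, d)
```

The three ratios converge to 1/2, 1/12, and 1/120, confirming the orders O ( T 2 ) , O ( T 3 ) , and O ( T 5 ) and their leading coefficients.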
Lemma A4.
Under the conditions of Theorem 2,
Var ( ( N T ) 1 / 2 W Au ) = 1 N T σ 4 N Γ 1 + σ 2 Γ 2 σ 2 F 1 σ 2 F 1 σ 2 X AX + O ( T 1 ) ,
where Γ 1 and Γ 2 have tr ( C s ) and β 0 X ( I N C s ) X β 0 , respectively, in their ( , s )-positions, , s = 1 , , p , and F 1 has X ( I N C ) X β 0 as its ℓ-th column, = 1 , , p .
Proof. 
Following Bao and Yu (2023), write Var ( W Au ) = Var ( W ¯ Au ) + Var ( W ˜ Au ) + Cov ( W ¯ Au , W ˜ Au ) + Cov ( W ˜ Au , W ¯ Au ) , where W ¯ = E ( W ) and W ˜ = W W ¯ (and this kind of notation to denote the fixed and random components of a random term is used henceforth). Var ( W ˜ Au ) has a non-zero top-left p × p block with Cov ( y ˜ ( ) Au , y ˜ ( s ) Au ) populating its ( , s ) -th position, , s = 1 , , p . In particular, Cov ( y ˜ ( ) Au , y ˜ ( s ) Au ) is equal to
σ 4 N [ tr ( C C s ) + tr ( C s ) + γ 2 tr ( C C s ) ] + σ 2 1 C s 1 E ( α ˜ α ˜ ) + σ 2 j 1 = 0 s 1 j 2 = 0 1 e 1 Φ ( j 1 , j 2 ) ( s 1 j 1 , 1 j 2 ) e 1 E ( y ˜ j 1 y ˜ j 2 ) + σ 2 j 1 = s p 1 j 2 = p 1 e 1 Φ ( ( j 1 p ) , ( j 2 p ) ) ( s 1 j 1 , 1 j 2 ) e 1 E ( y ˜ j 1 y ˜ j 2 ) + σ 2 j 2 = 0 1 1 Φ ( 0 , j 2 ) ( s , 1 j 2 ) e 1 E ( α ˜ y ˜ j 2 ) + σ 2 j 2 = p 1 1 Φ ( 0 , ( j 2 p ) ) ( s , 1 j 2 ) e 1 E ( α ˜ y ˜ j 2 ) + σ 2 j 1 = 0 s 1 1 Φ ( 0 , j 1 ) ( , s 1 j 1 ) e 1 E ( α ˜ y ˜ j 2 ) + σ 2 j 1 = 0 s 1 j 2 = p 1 e 1 Φ ( j 1 , ( j 2 p ) ) ( s 1 j 1 , 1 j 2 ) e 1 E ( y ˜ j 1 y ˜ j 2 ) + σ 2 j 1 = s p 1 1 Φ ( 0 , ( j 1 p ) ) ( , s 1 j 1 ) e 1 E ( α ˜ y ˜ j 1 ) + σ 2 j 1 = s p 1 j 2 = 0 1 e 1 Φ ( ( j 1 p ) , ( j 2 p ) ) ( s 1 j 1 , 1 j 2 ) e 1 E ( y ˜ j 1 y ˜ j 2 ) ,
where Φ ( r , j ) ( , s ) = Φ r L Φ p 1 M Φ p 1 L s Φ j . (Recall that Φ ( p ) = Φ Φ p . Thus, for example, Φ ( r , ( j p ) ) ( , s ) = Φ r L Φ p 1 M Φ p 1 L s Φ ( j p ) .) From Lemma A2, in the first term, tr ( C C s ) = O ( 1 ) , tr ( C s ) = O ( T ) , and tr ( C C s ) = O ( 1 ) . The second term σ 2 1 C s 1 E ( α ˜ α ˜ ) is O ( N ) . Terms like e 1 Φ ( j 1 , j 2 ) ( s 1 j 1 , 1 j 2 ) e 1 , e 1 Φ ( j 1 , ( j 2 p ) ) ( s 1 j 1 , 1 j 2 ) e 1 , and e 1 Φ ( ( j 1 p ) , ( j 2 p ) ) ( s 1 j 1 , 1 j 2 ) e 1 pick up just one element of the relevant matrices in the quadratic forms in e 1 and thus are O ( 1 ) , whereas terms like 1 Φ ( 0 , j 2 ) ( s , 1 j 2 ) e 1 , 1 Φ ( 0 , ( j 2 p ) ) ( s , 1 j 2 ) e 1 , 1 Φ ( 0 , j 1 ) ( , s 1 j 1 ) e 1 , and 1 Φ ( 0 , ( j 1 p ) ) ( , s 1 j 1 ) e 1 pick up the sum of the first column of the relevant matrices in the quadratic forms in e 1 , which are again O ( 1 ) . Thus, Var ( W ˜ Au ) is dominated by elements like σ 4 N tr ( C s ) = O ( N T ) . Further,
Cov ( W ¯ Au , W ˜ Au ) = E y ¯ ( 1 ) Au u A y ˜ ( 1 ) y ¯ ( 1 ) Au u A y ˜ ( p ) 0 k y ¯ ( p ) Au u A y ˜ ( 1 ) y ¯ ( p ) Au u A y ˜ ( p ) 0 k X Au u A y ˜ ( 1 ) X Au u A y ˜ ( p ) O k ,
where
E ( y ¯ ( ) Au u A y ˜ ( s ) ) = γ 1 σ 3 y ¯ ( ) A dg ( I N C s ) = O ( N ) , E ( X Au u A y ˜ ( ) ) = γ 1 σ 3 X A dg ( I N C ) = O ( N ) .
Here, γ 1 denotes the skewness coefficient of the distribution of u i t . Thus,
Var ( ( N T ) 1 / 2 W A u ) = 1 N T [ Var ( W ¯ Au ) + Var ( W ˜ Au ) + Cov ( W ¯ Au , W ˜ Au ) + Cov ( W ˜ Au , W ¯ Au ) ] = σ 2 1 N T W ¯ A W ¯ + σ 4 1 T tr ( C 11 ) tr ( C 1 p ) 0 k tr ( C p 1 ) tr ( C p p ) 0 k 0 k 0 k O k + O ( T 1 ) ,
where tr ( C s ) = O ( T ) from Lemma A2, and
1 N T W ¯ A W ¯ = 1 N T y ¯ ( 1 ) A y ¯ ( 1 ) y ¯ ( 1 ) A y ¯ ( p ) y ¯ ( 1 ) A X y ¯ ( p ) A y ¯ ( 1 ) y ¯ ( p ) A y ¯ ( p ) y ¯ ( p ) A X X A y ¯ ( 1 ) X A y ¯ ( p ) X AX .
By substituting (5) and using Lemma A2 again, one has
1 N T y ¯ ( ) A y ¯ ( s ) = 1 N T β 0 X ( I N C s ) X β 0 + O ( T 1 ) , 1 N T X A y ¯ ( ) = 1 N T X ( I N C ) X β 0 + O ( T 1 ) .
Substituting these terms into Var ( ( N T ) 1 / 2 W A u ) yields the result. ☐
Lemma A5.
Under the conditions of Theorem 2, ( N T ) 1 / 2 [ W Au E ( W Au ) ] = O p ( 1 ) with E ( W Au ) = O ( N ) and ( N T ) 1 W AW = O p ( 1 ) . Further, plim T ( N T ) 1 W AW exists.
Proof. 
Using Lemmas A2 and A4, one has E ( W Au ) = O ( N ) and Var ( ( N T ) 1 / 2 W A u ) = O ( 1 ) + O ( T 1 ) . Write W AW = W ¯ A W ¯ + W ¯ A W ˜ + W ˜ A W ¯ + W ˜ A W ˜ . From Lemma A4, W ¯ A W ¯ = O ( N T ) . By substitution, W ˜ A W ˜ has y ˜ ( ) A y ˜ ( s ) , , s = 1 , , p , populating its top-left p × p block and zero elsewhere, where, by some tedious algebra, E ( y ˜ ( ) A y ˜ ( s ) ) has its leading term N σ 2 tr ( C s ) = O ( N T ) and Var ( y ˜ ( ) A y ˜ ( s ) ) has its leading term σ 4 [ N tr ( C s C s + C s C s ) + N γ 2 tr ( C s C s ) ] = O ( N T ) by using Lemma A2. So, one can claim that ( N T ) 1 W AW = O p ( 1 ) . The lower-right block of ( N T ) 1 W AW is ( N T ) 1 X A X , which, by Assumption 3, has a probability limit. The top-left block of ( N T ) 1 W AW has elements ( N T ) 1 y ( ) A y ( s ) , , s = 1 , , p , which are dominated by ( N T ) 1 [ ( I N Φ p 1 L ) u + ( I N Φ p 1 L ) X β 0 ] A [ ( I N Φ p 1 L s ) u + ( I N Φ p 1 L s ) X β 0 ] as T . Note that they, in turn, consist of terms like ( N T ) 1 u ( I N C s ) u , ( N T ) 1 β 0 X ( I N C s ) u , and ( N T ) 1 β 0 X ( I N C s ) X β 0 . From Lemma A2, C is uniformly (in T) bounded in row and column sums, and so are C s = C C s and I N C s . Immediately, as T , ( N T ) 1 β 0 X ( I N C s ) X β 0 exists, so long as ( N T ) 1 X A X exists. Theorem A.1 of Kelejian and Prucha (2010) implies that ( N T ) 1 u ( I N C s ) u and ( N T ) 1 β 0 X ( I N C s ) u are asymptotically normal as T (and converge in probability to their respective means). Thus, one can claim that plim T ( N T ) 1 W AW exists. ☐
Lemma A6.
Under the conditions of Theorem 2, plim T ( N T ) 1 W AW = lim T σ 2
Ω 1 , N T , where Ω 1 , N T is given by (13).
Proof. 
plim T ( N T ) 1 W AW = lim T ( N T ) 1 W ¯ A W ¯ + plim T ( N T ) 1 W ˜ A W ˜ , in which ( N T ) 1 W ¯ A W ¯ is given in the proof of Lemma A4 and plim T ( N T ) 1 W ˜ A W ˜ has non-zero elements lim T σ 2 T 1 tr ( C s ) , , s = 1 , , p , populating its top-left p × p block. ☐
Lemma A7.
Under the conditions of Theorem 2, as T , Ω 1 , N T 1 / 2 N T g N T d N ( 0 , I ) , where g N T = g N T ( θ 0 ) and Ω 1 , N T are given by (7) and (13), respectively.
Proof. 
Recall that E ( W Au ) = E ( u A u ) h , where h = O ( T 1 ) . Also, Var ( u A u ) = σ 4 N ( T 1 ) ( 2 + γ 2 ( T 1 ) / T ) , so E ( W Au ) + u A u h = O p ( N / T ) . From Lemma A5, W A u E ( W Au ) = O P ( N T ) . Thus, one can write N T g N T = ( N T ) 1 / 2 [ W A u E ( W Au ) + E ( W Au ) + u A u h ] = ( N T ) 1 / 2 [ W A u E ( W Au ) ] + o p ( 1 ) as T . By substituting (5) into W A u , one has the top p × 1 block of W A u consisting of, = 1 , , p ,
u ( I N C ) u + u ( I N C ) X β 0 + u ( I N C 1 ) α + s = 0 1 u ( I N C s , s e 1 ) y s + s = p 1 u ( I N C ( s p ) , s e 1 ) y ( s ) ,
where C s , s = M Φ p 1 Φ s L s 1 and C ( s p ) , s = M Φ p 1 Φ ( s p ) L s 1 . From Lemma A2, u ( I N C 1 ) α , s = 0 t 1 u ( I N C s , s e 1 ) y ( s ) , and s = t p 1 u ( I N C ( s p ) , s e 1 ) y ( s ) are all of order O p ( N ) , u ( I N C ) X β 0 = O p ( N T ) , and u ( I N C ) u E ( u ( I N C p ) u ) = O p ( N T ) . Therefore, as T ,
N T g N T = 1 N T u ( I N C 1 ) u E ( u ( I N C 1 ) u ) + u ( I N C 1 ) X β 0 u ( I N C p ) u E ( u ( I N C p ) u ) + u ( I N C p ) X β 0 X Au + o p ( 1 ) .
From Lemma A2, C t is uniformly (in T) bounded in row and column sums. It follows that ( I N C ) are uniformly (in N T ) bounded in row and column sums. Obviously, A = I N M also has this property. Thus, if T , regardless of N, one can invoke Theorem A.1 of Kelejian and Prucha (2010) to obtain the asymptotic distribution result. ☐
Lemma A8.
Under the conditions of Theorem 4, as T , Ω 1 , N T 1 / 2 N T g N T d N ( 0 , I ) , where g N T = g N T ( θ 0 ) and Ω 1 , N T are given by (25) and (29), respectively.
Proof. 
Recall that Ψ is diagonal with O ( T 1 ) elements. Since M Σ i M is uniformly (in T) bounded in row and column sums, one has E ( u i M Ψ M u i ) = tr ( M Σ i M Ψ ) = O ( 1 ) from Lemma A2 of Bao et al. (2020). Likewise, tr ( Σ i M Ψ M Σ i M Ψ M ) = tr ( M Σ i M Ψ M Σ i M Ψ ) is O ( T 1 ) and each element of M Ψ M is O ( T 1 ) from a similar proof as in Lemma A3 of Bao et al. (2020). Then it follows that Var ( u i M Ψ M u i ) = tr ( Σ i ( 4 ) M Ψ M M Ψ M ) + 2 tr ( Σ M Ψ M Σ M Ψ M ) = O ( T 1 ) and u i M Ψ M u i = O ( 1 ) + O P ( T 1 / 2 ) . Next, y i , ( ) M u i = u i C u i + α i u i C 1 + u i C X i β 0 + j = 0 1 u i C j , j e 1 y i , j + j = p 1 u i C ( j p ) , j e 1 y i , j . Term by term, E ( u i C u i ) = tr ( Σ i C ) = tr ( Σ i D ) = O ( 1 ) since the diagonal elements of D are O ( T 1 ) . Following similar steps as in the proof of Lemma A2, since pre- or post-multiplying Φ p 1 L by Σ i does not change the uniform boundedness of Φ p 1 L , one can claim 1 Φ p 1 L Σ i 1 , 1 Φ p 1 L s Σ i Φ p 1 L Σ i 1 , 1 Φ p 1 L Σ i ( Φ p 1 L s ) Σ i 1 , 1 Σ i Φ p 1 L Σ i ( Φ p 1 L s ) 1 , and tr ( ( Φ p 1 L s ) Σ i Φ p 1 L Σ i ) are O ( T ) . Then, by substituting C s = M Φ p 1 L s , C = M Φ p 1 L , and M = I T 1 1 1 , one has tr ( C s Σ i C Σ i ) = O ( 1 ) and tr ( C s Σ i C Σ i ) = tr ( ( Φ p 1 L s ) Σ i Φ p 1 L Σ i ) + O ( 1 ) = O ( T ) . This leads to Var ( u i C u i ) = tr ( Σ i ( 4 ) C C ) + tr [ ( Σ i C Σ i ( C + C ) ] = O ( 1 ) + O ( T ) . So, one can claim that u i C u i = O ( 1 ) + O P ( T 1 / 2 ) . From the proof of (iii) in Lemma A2, M Φ p 1 L 1 has, at most, a finite number of O ( 1 ) elements, and all remaining elements are O ( T 1 ) . It is obvious that Σ i 1 / 2 M Φ p 1 L 1 shares the same properties. Then, α i u i C 1 = O P ( 1 ) and 1 C s Σ i C 1 = O ( 1 ) . 
Note that Cov ( u i C X i β 0 , u i C s X i β 0 ) = β 0 X i C Σ i C s X i β 0 = β 0 X i ( Φ p 1 L ) M Σ i 1 / 2 Σ i 1 / 2 M Φ p 1 L s X i β 0 and it has the same magnitude as β 0 X i ( Φ p 1 L ) M Φ p 1 L s X i β 0 in view of Lemma A2 of Bao et al. (2020). So, u i C X i β 0 = O P ( T 1 / 2 ) . With all these results, one can claim that the variance of y i , ( ) M u i u i M Ψ M u i is dominated by that of u i C u i + u i C X i β 0 . The asymptotic distribution result then follows from Theorem A.1 of Kelejian and Prucha (2010). ☐
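The order claims for u i M Ψ M u i in Lemma A8 rest only on Ψ being diagonal with O ( T 1 ) entries. The sketch below illustrates this generic mechanism with a stand-in diagonal matrix Ψ = Dg ( C ) for a stable AR(1) with λ = 0.5 (whose diagonal entries are O ( T 1 ) by Lemma A2(vii)) rather than the paper's exact recentered Ψ , using the exact normal-error formulas E ( u A u ) = tr ( A ) and Var ( u A u ) = 2 tr ( A A ) for A = M Ψ M ; both the stand-in Ψ and the normality assumption are illustrative choices.

```python
import numpy as np

def stats(T, lam=0.5):
    """Exact mean and variance of u' M Psi M u for u ~ N(0, I_T), Psi = Dg(C)."""
    L = np.eye(T, k=-1)
    F = np.linalg.solve(np.eye(T) - lam * L, L)   # Phi^{-1} L, stable AR(1)
    M = np.eye(T) - np.ones((T, T)) / T
    Psi = np.diag(np.diag(M @ F))                 # diagonal with O(1/T) entries
    A = M @ Psi @ M
    return np.trace(A), 2 * np.trace(A @ A)       # mean and variance (normal u)

m50, v50 = stats(50)
m400, v400 = stats(400)
print(m50, m400)             # the means stay O(1) as T grows
print(50 * v50, 400 * v400)  # T * Var stabilizes, so Var = O(1/T)
```

The mean settles at an O ( 1 ) constant while T times the variance stabilizes, consistent with u i M Ψ M u i = O ( 1 ) + O P ( T 1 / 2 ) .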
Lemma A9.
Under the conditions of Theorem 4, plim T ( N T ) 1 G N T = plim T ( N T ) 1 W A W , where G N T is given by (26).
Proof. 
Both Ψ and Ψ s are diagonal with O ( T 1 ) elements, so the terms in ( N T ) 1 G N T are dominated by ( N T ) 1 y i , ( ) M W i , = 1 , , p , and ( N T ) 1 X i M W i . ☐
Lemma A10.
Under the conditions of Theorem 4, plim T ( N T ) 1 i = 1 N W i M Σ i M W i = lim T   Ω 1 , N T , where Ω 1 , N T is given by (29).
Proof. 
This follows similarly from the proof of Lemma A6. ☐
Lemma A11.
Under Assumptions 1–5 and that DP ( p ) has a unit root,
Var ( Υ 1 / 2 W Au ) = Υ 1 / 2 σ 2 Γ 2 + Γ 3 F 1 + F 2 F 1 + F 2 X A X Υ 1 / 2 + o ( 1 ) ,
where Γ 2 and Γ 3 have β 0 X ( I N C s ) X β 0 and 1 C s 1 E ( α α ) , respectively, in their ( , s )-positions, , s = 1 , , p , and F 1 and F 2 have X ( I N C ) X β 0 and X ( I N C 1 ) E ( α ) , respectively, as their ℓ-th columns, = 1 , , p .
Proof. 
Following the proof of Lemma A4 and using Lemma A3, one sees that now the leading term appearing in Var ( W ˜ Au ) is σ 2 1 C s 1 E ( α ˜ α ˜ ) = O ( N T 3 ) and the leading terms in W ¯ A W ¯ are β 0 X ( I N C s ) X β 0 = O ( N T 3 ) and 1 C s 1 α ¯ α ¯ = O ( N T 3 ) . The remaining terms (e.g., X ( I N C ) X β 0 and X ( I N C 1 ) α ¯ ) are, at most, O ( N T 2 ) . Thus, the leading terms in Var ( W Au ) are β 0 X ( I N C s ) X β 0 + 1 C s 1 [ α ¯ α ¯ + E ( α ˜ α ˜ ) ] = β 0 X ( I N C s ) X β 0 + 1 C s 1 E ( α α ) = O ( N T 3 ) that appear in its top-left block. ☐
Lemma A12.
Under Assumptions 1–5 and that DP ( p ) has a unit root, Υ 1 / 2 [ W Au E ( W Au ) ] = O p ( 1 ) and Υ 1 / 2 W AW Υ 1 / 2 = O p ( 1 ) .
Proof. 
The orders of magnitudes of terms in Υ 1 / 2 [ W Au E ( W Au ) ] are obvious, given the expressions of E ( W Au ) in (6) and Var ( Υ 1 / 2 W Au ) from Lemma A11. Use again W AW = W ¯ A W ¯ + W ¯ A W ˜ + W ˜ A W ¯ + W ˜ A W ˜ . The top-left, lower-left, and lower-right blocks of W ¯ A W ¯ are O ( N T 3 ) , O ( N T 2 ) , and O ( N T ) , respectively. The top-left block of W ¯ A W ˜ consists of y ¯ ( ) A y ˜ ( s ) with mean 0 and variance y ¯ ( ) A Var ( y ˜ s ) A y ¯ ( ) , whose leading term is
( I N Φ p 1 L ) X β 0 + ( I N Φ p 1 L 1 ) α ¯ A × ( I N Φ p 1 L s ) E ( u u ) ( I N L s Φ p 1 ) + ( I N Φ p 1 L s 1 ) E ( α ˜ α ˜ ) ( I N 1 L s Φ p 1 ) × A ( I N Φ p 1 L s ) X β 0 + ( I N Φ p 1 L s 1 ) α ¯ .
Using Lemma A3, one can verify the leading term (involving ( I N Φ p 1 L s 1 ) E ( α ˜ α ˜ ) ( I N 1 L s Φ p 1 ) ) in the expansion of the above product is O ( N T 6 ) . So the top-left block of W ¯ A W ˜ is O p ( N T 6 ) . The lower-left block of W ¯ A W ˜ consists of X A y ˜ ( ) with mean 0 k and variance X A Var ( y ˜ ( ) ) A X , whose leading term is X ( I N C C ) X + X ( I N C 1 ) E ( α ˜ α ˜ ) ( I N 1 C ) X , where, in view of Lemma A3, X ( I N C C ) X = O ( N T 3 ) and X ( I N C 1 ) E ( α ˜ α ˜ ) ( I N 1 C ) X = O ( N 2 T 4 ) . So the lower-left block of W ¯ A W ˜ is O p ( N T 2 ) . The top-left block of W ˜ A W ˜ consists of elements like y ˜ ( ) M y ˜ ( s ) , where E ( y ˜ ( ) M y ˜ ( s ) ) has its leading term 1 C s 1 E ( α ˜ α ˜ ) = O ( N T 3 ) and Var ( y ˜ ( ) M y ˜ ( s ) ) has its leading terms σ 2 1 C s C s 1 E ( α ˜ α ˜ ) = O ( N T 5 ) and σ 2 1 C s C s 1 E ( α ˜ α ˜ ) = O ( N T 5 ) . So the top-left block of W ˜ A W ˜ is O ( N T 3 ) + O p ( N T 5 ) . ☐
Lemma A13.
Under the conditions of Theorem 3, plim T G N T = plim T Υ 1 / 2 W A W Υ 1 / 2 , where G N T = G N T ( θ 0 ) , G N T ( θ ) = Υ 1 / 2 { [ W A ( y W θ ) E ( W A ( ( y W θ ) ) ] / θ } Υ 1 / 2 .
Proof. 
Lemma A12, together with Lemma A3, gives
plim T G N T = plim T Υ 1 / 2 W A W Υ 1 / 2 2 plim T Υ 1 / 2 O ( 1 ) 0 k O ( N T ) + O p ( N T 3 ) O p ( N T ) Υ 1 / 2 + plim T Υ 1 / 2 O p ( N T ) O ( T ) O p × k O k × p O k Υ 1 / 2 = plim T Υ 1 / 2 W A W Υ 1 / 2 .
Lemma A14.
Under the conditions of Theorem 3, plim T → ∞ σ 2 Υ 1 / 2 W AW Υ 1 / 2 = lim T → ∞ Ω 1 , N T ( α ) , where Ω 1 , N T ( α ) is given by (19).
Proof. 
From the proof of Lemma A12, Υ 1 / 2 W AW Υ 1 / 2 is dominated by Υ 1 / 2 ( W ¯ A W ¯ + W ˜ A W ˜ ) Υ 1 / 2 when the fixed effects are deterministic. Its top-left block consists of the leading terms ( N T 3 ) 1 σ 2 [ β 0 X ( I N C s ) X β 0 + 1 C s 1 α α ] , the lower-left (or top-right) block consists of the leading terms ( N T 2 ) 1 σ 2 [ X ( I N C ) X β 0 + X ( I N C 1 ) α ] (or its transpose), and the lower-right block is ( N T ) 1 σ 2 ( X A X ) . Under the conditions of Theorem 3, in view of Lemma A3, for a given N, all these terms have well-defined limits as T → ∞ . If N also diverges as T → ∞ , write Υ 1 / 2 W AW Υ 1 / 2 = N 1 i = 1 N W i M W i , where W i denotes the properly scaled version of W i (in particular, the first p columns of W i are multiplied by T 3 / 2 and the last k columns by T 1 / 2 ). One can check that (the dominating part of) each element of the ( p + k ) × ( p + k ) summand is uniformly (in T) bounded for each i, so Theorem 1 of Phillips and Moon (1999) applies: the sequential convergence and the joint convergence are the same, provided that the limit is defined when N → ∞ . With the matrix Ω 1 , N T ( α ) defined accordingly, the result follows. ☐
Lemma A15.
Under the conditions of Theorem 3, Ω 1 , N T 1 / 2 ( α ) Υ 1 / 2 g N T d N ( 0 , I ) as T , where g N T = Υ 1 [ W A u E ( W Au ) ] and Ω 1 , N T ( α ) is given by (19).
Proof. 
By substituting (5) and using Lemma A3,
Υ 1 / 2 [ W A u E ( W Au ) ] = ( N T 3 ) 1 / 2 [ u ( I N C 1 ) X β 0 + u ( I N C 1 1 ) α ] + O p ( T 1 / 2 ) ( N T 3 ) 1 / 2 [ u ( I N C p ) X β 0 + u ( I N C p 1 ) α ] + O p ( T 1 / 2 ) ( N T ) 1 / 2 X A u ,
where the O p ( T 1 / 2 ) terms come from ( N T 3 ) 1 / 2 [ u ( I N C ) u E ( u ( I N C ) u ) ] , since Var ( u ( I N C ) u ) = O ( N T 2 ) , ℓ = 1 , … , p . Consider a term like ( N T 3 ) 1 / 2 u ( I N C ) X β 0 = ( N T 3 ) 1 / 2 i = 1 N u i C X i β 0 , ℓ = 1 , … , p . Let ζ ℓ , i = C X i β 0 ≠ 0 . Then, T 3 / 2 u i C X i β 0 is the sum of T random variables, namely, T 3 / 2 t = 1 T u i t ζ ℓ , i t . Note that the u i t ζ ℓ , i t ’s are independent across t and Var ( T 3 / 2 t = 1 T u i t ζ ℓ , i t ) = T 3 σ 2 β 0 X i C X i β 0 = O ( 1 ) > 0 .22 With Assumption 5, Lyapunov’s central limit theorem can be used, and it can be claimed that for each i, T 3 / 2 u i C X i β 0 is asymptotically normal as T → ∞ . When summing over i, by the same logic, for any N, one can claim that ( N T 3 ) 1 / 2 u ( I N C ) X β 0 is asymptotically normal with Var ( ( N T 3 ) 1 / 2 u ( I N C ) X β 0 ) = ( N T 3 ) 1 β 0 X ( I N C ) X β 0 = O ( 1 ) > 0 . If N also diverges, the sequential asymptotic distribution follows straightforwardly, provided that ( N T 3 ) 1 β 0 X ( I N C ) X β 0 is well defined in the limit. For the joint asymptotic distribution (when T and N diverge at the same time, denoted by ( T , N ) → ∞ ), define ξ i , N , T = T 3 / 2 u i C X i β 0 / i = 1 N Var ( T 3 / 2 u i C X i β 0 ) , where i = 1 N Var ( T 3 / 2 u i C X i β 0 ) is obviously T 3 σ 2 β 0 X ( I N C ) X β 0 . Let 1 ( · ) be the indicator function. Then, for any ε > 0 ,
i = 1 N E ξ i , N , T 2 1 | ξ i , N , T | > ε i = 1 N E ξ i , N , T 2 1 β 0 X i C u i u i C X i β 0 > ε 2 σ 2 β 0 X ( I N C ) X β 0 N T 3 σ 2 β 0 X ( I N C ) X β 0 × max i E β 0 X i C u i u i C X i β 0 T 3 1 β 0 X i C u i u i C X i β 0 T 3 > ε 2 σ 2 β 0 X ( I N C ) X β 0 T 3 ,
where N T 3 / [ σ 2 β 0 X ( I N C ) X β 0 ] = O ( 1 ) , β 0 X i C u i u i C X i β 0 / T 3 is uniformly (in T) integrable, and β 0 X ( I N C ) X β 0 / T 3 → ∞ when ( T , N ) → ∞ . In view of Theorem 2 of Phillips and Moon (1999), the joint asymptotic distribution is the same. So for any N, one can claim that ( N T 3 ) 1 / 2 u ( I N C ) X β 0 is asymptotically normal as T → ∞ , provided that the asymptotic variance matrix is always defined. By the same reasoning, ( N T 3 ) 1 / 2 u ( I N C 1 ) α and ( N T ) 1 / 2 X A u , as well as their linear combinations, have similar properties. So in the end one can claim that Υ 1 / 2 [ W A u E ( W Au ) ] is asymptotically normal. The variance of Υ 1 / 2 W A u is Υ 1 / 2 Var ( W Au ) Υ 1 / 2 = Υ 1 / 2 Ω 1 , N T ( α ) Υ 1 / 2 + o ( 1 ) in view of Lemma A11. Thus, one has
[ Υ 1 / 2 Ω 1 , N T ( α ) Υ 1 / 2 ] 1 / 2 Υ 1 / 2 W A u E ( W Au ) d N ( 0 , I ) .
Recall that E ( W Au ) = E ( u A u ) h , where h has O ( 1 ) elements in its top p positions and 0’s in its lower k positions, and Var ( u A u ) = σ 4 N ( T 1 ) ( 2 + γ 2 ( T 1 ) / T ) . So, E ( W Au ) + u A u h has its top p × 1 block consisting of O p ( N T ) terms and bottom k × 1 block of zeros. From Lemma A12, W A u E ( W Au ) has O p ( N T 3 ) elements in its top p positions and O p ( N T ) in its bottom k positions. Thus, one can write Υ 1 / 2 g N T = Υ 1 / 2 [ W A u E ( W Au ) + E ( W Au ) + u A u h ] = Υ 1 / 2 [ W A u E ( W Au ) ] + o p ( 1 ) . Combining all the results, one has the asymptotic distribution of Υ 1 / 2 g N T . ☐
Lemma A16.
Under the conditions of Theorem 5,
Ω 1 , N T 1 / 2 ( α ) Υ 1 / 2 g N T d N ( 0 , I ) ,
where g N T = g N T ( θ 0 ) , g N T ( θ ) is Υ 1 [ N T g N T ( θ ) ] with g N T ( θ ) given by (25), and Ω 1 , N T ( α ) is given by (34).
Proof. 
Note that Ψ is diagonal with O ( 1 ) elements. Since M Σ i M is uniformly (in T) bounded in row and column sums, one has E ( u i M Ψ M u i ) = tr ( M Σ i M Ψ ) = O ( T ) in view of Lemma A2 of Bao et al. (2020). Similarly, tr ( Σ i M Ψ M Σ i M Ψ M ) = tr ( M Σ i M Ψ M Σ i M Ψ ) is O ( T ) , and each element of M Ψ M is O ( 1 ) , following a similar proof to that of Lemma A3 of Bao et al. (2020). Then, it follows that Var ( u i M Ψ M u i ) = tr ( Σ i ( 4 ) M Ψ M M Ψ M ) + 2 tr ( Σ M Ψ M Σ M Ψ M ) = O ( T ) and u i M Ψ M u i = O ( T ) + O P ( T 1 / 2 ) . On the other hand, y i , ( ) M u i = u i C u i + α i u i C 1 + u i C X i β 0 + j = 0 t 1 u i C j , j e 1 y i , j + j = p 1 u i C ( j p ) , j e 1 y i , j . Term by term, E ( u i C u i ) = tr ( Σ i C ) = tr ( Σ i D ) = O ( T ) since the diagonal elements of D are O ( 1 ) . One can claim that 1 Φ p 1 L Σ i 1 and tr ( ( Φ p 1 L s ) Σ i Φ p 1 L Σ i ) are O ( T 2 ) by following similar steps as in the proof of Lemma A3. Also, 1 Φ p 1 L s Σ i Φ p 1 L Σ i 1 , 1 Φ p 1 L Σ i ( Φ p 1 L s ) Σ i 1 , and 1 Σ i Φ p 1 L Σ i ( Φ p 1 L s ) 1 are O ( T 3 ) . This leads to Var ( u i C u i ) = tr ( Σ i ( 4 ) C C ) + tr [ Σ i C Σ i ( C + C ) ] = O ( T ) + O ( T 2 ) . So one can claim that u i C u i = O ( T ) + O P ( T ) . From the proof of (iii) in Lemma A3, M Φ p 1 L 1 has, at most, a finite number of O ( T 3 ) elements, and all remaining elements are O ( T 2 ) . Then, it is obvious that Σ i 1 / 2 M Φ p 1 L 1 shares the same properties. Therefore, 1 C s Σ i C 1 = 1 ( Φ p 1 L s ) M Σ i 1 / 2 Σ i 1 / 2 M Φ p 1 L 1 = O ( T 3 ) and one can claim that α i u i C 1 = O P ( T 3 / 2 ) . Note that Cov ( u i C X i β 0 , u i C s X i β 0 ) = β 0 X i C Σ i C s X i β 0 = β 0 X i ( Φ p 1 L ) M Σ i 1 / 2 Σ i 1 / 2 M Φ p 1 L s X i β 0 , and, in view of Lemma A2 of Bao et al. (2020), it has the same magnitude as β 0 X i ( Φ p 1 L ) M Φ p 1 L s X i β 0 . So, u i C X i β 0 = O P ( T 3 / 2 ) . 
With all these results, one can claim that y i , ( ) M u i − u i M Ψ M u i is dominated by α i u i C 1 + u i C X i β 0 . Thus,
Υ 1 / 2 g N T = ( N T 3 ) 1 / 2 [ u ( I N C 1 ) X β 0 + u ( I N C 1 1 ) α ] + O p ( T 1 / 2 ) ( N T 3 ) 1 / 2 [ u ( I N C p ) X β 0 + u ( I N C p 1 ) α ] + O p ( T 1 / 2 ) ( N T ) 1 / 2 X A u .
Following similar steps in the proof of Lemma A15, one can arrive at the desired asymptotic distribution. ☐
Lemma A17.
Under the conditions of Theorem 5,
plim T G N T = plim T Υ 1 / 2 W A W Υ 1 / 2 ,
where G N T = G N T ( θ 0 ) and G N T ( θ ) is Υ 1 / 2 [ N T G N T ( θ ) Υ 1 / 2 ] with G N T ( θ ) given by (26).
Proof. 
Given that Ψ is diagonal with O ( 1 ) elements, it follows that M Ψ M is uniformly (in T) bounded in row and column sums. Recall W i = ( y i , ( 1 ) , … , y i , ( p ) , X i ) . Substituting (4), one has the first three leading terms in u i M Ψ M y i , ( s ) , namely, u i M Ψ M Φ p 1 L u i , α i u i M Ψ M Φ p 1 1 , and u i M Ψ M Φ p 1 X i β 0 . One can show that u i M Ψ M Φ p 1 L u i = O P ( T ) , α i u i M Ψ M Φ p 1 1 = O P ( T 3 / 2 ) , and u i M Ψ M Φ p 1 X i β 0 = O P ( T 3 / 2 ) by following the steps in the proof of Lemma A16. Also, u i M Ψ M X i = O P ( T 1 / 2 ) . So, u i M Ψ M W i = O P ( T 3 / 2 ) . Now consider y i , ( ) M W i = ( y i , ( ) M y i , ( 1 ) , … , y i , ( ) M y i , ( p ) , y i , ( ) M X i ) . Following the steps in the proof of Lemma A12, one can claim that the leading terms in y i , ( ) M W i are O P ( T 3 ) , which dominate u i M Ψ M W i . Next, consider terms like u i M Ψ s M u i that also appear in (26). The diagonal elements of M Φ p 1 L Φ p 1 L s are T 1 Φ p 1 L Φ p 1 L s 1 . In view of Lemma A3(ii), all the elements of the second part in (27) are O ( 1 ) . The i-th element of Φ p 1 L Φ p 1 L s 1 is
j = s i + ( a 1 + r = 2 p a r λ r i j ) a 1 ( j s ) + r = 2 p a r 1 λ r j s 1 λ r .
So all the elements of the first part in (27) are, at most, O ( T ) . Then, u i M Ψ s M u i is dominated by u i M Dg ( T 1 Φ p 1 L Φ p 1 L s 1 ) M u i , which has O ( T 2 ) mean and O ( T 3 ) variance. Given all these results, one can claim that the top p × ( p + k ) block of G N T is dominated by terms like ( N T 3 ) 1 i = 1 N ( y i , ( ) M W i ) = O p ( 1 ) . The bottom k × ( p + k ) block of G N T is ( N T ) 1 i = 1 N ( X i M W i ) . Again, by substituting (4), one can verify that X i M W i = O p ( T ) . In summary,
G N T = Υ 1 / 2 i = 1 N y i , ( 1 ) M W i y i , ( p ) M W i X i M W i Υ 1 / 2 + o p ( 1 )
and the result follows immediately. ☐
Lemma A18.
Under the conditions of Theorem 5,
plim T Υ 1 / 2 i = 1 N W i M Σ i M W i Υ 1 / 2 = lim T Ω 1 , N T ( α ) ,
where Ω 1 , N T ( α ) is given by (34).
Proof. 
This follows similarly from the proof of Lemma A14. ☐
Proof of Theorem 2.
This follows from Lemmas A4 and A7. ☐
Proof of Theorem 3.
This follows from Lemmas A11 and A15. ☐
Proof of Theorem 4.
This follows from Lemma A18. ☐
Proof of Theorem 5.
This follows from Lemma A16. ☐

Appendix D. Asymptotic Distribution of ( N T ) 1 i = 1 N W i M v ^ i v ^ i M W i When N Is Fixed

If N is fixed, all four terms in the expansion of ( N T ) 1 i = 1 N W i M v ^ i v ^ i M W i are O p ( 1 ) . From the proof of Lemma A8 in Appendix C, the top p × 1 block of W i M u i is dominated by u i C u i + u i C X i β 0 , ℓ = 1 , … , p . For Ω 1 , N T defined by (29), suppose one writes Ω 1 , N T = N 1 i = 1 N Ω 1 , T i ; then, as T → ∞ , T 1 / 2 W i M u i →d N ( 0 , lim T → ∞ Ω 1 , T i ) . So, ( N T ) 1 i = 1 N W i M u i u i M W i →p N 1 i = 1 N Ω 1 , T i 1 / 2 b i b i Ω 1 , T i 1 / 2 , where b i N ( 0 , I ) is a ( p + k ) -dimensional normal random vector and E ( b i b j ) = O for i ≠ j . The lower-right block of W i M W i is X i M X i and its top-left p × p block is dominated by u i L Φ p 1 M Φ p 1 L s u i + 2 u i L Φ p 1 M Φ p 1 L s X i β 0 + β 0 X i L Φ p 1 M Φ p 1 L s X i β 0 , ℓ , s = 1 , … , p , from the proofs of Lemmas A5 and A9.
Assuming that T 1 u i L Φ p 1 M Φ p 1 L s u i and T 1 β 0 X i L Φ p 1 M Φ p 1 L s X i β 0 both have properly defined probability limits as T → ∞ , one may write T 1 W i M W i →p Q i . Further, ( N T ) 1 i = 1 N W i M W i ( θ ^ θ 0 ) u i M W i →p N 1 i = 1 N Q i Q 1 b b i Ω 1 , T i 1 / 2 , where Q = j = 1 N Q j and b = j = 1 N Ω 1 , T j 1 / 2 b j , in light of the proof of Theorem 4 in Hansen (2007). Similarly, one also has ( N T ) 1 i = 1 N W i M W i ( θ ^ θ 0 ) ( θ ^ θ 0 ) W i M W i →p N 1 i = 1 N Q i Q 1 b b Q 1 Q i . Thus, ( N T ) 1 i = 1 N W i M v ^ i v ^ i M W i is not consistent for estimating Ω 1 , N T , though it has a limiting distribution as T → ∞ with N fixed.
In the special case when Q i and Ω 1 , T i stay the same across i, the limiting distribution of ( N T ) 1 i = 1 N W i M v ^ i v ^ i M W i is proportional to Ω 1 , N T , and Corollary 4.1 of Hansen (2007) suggests that the t-statistic converges to √ ( N / ( N − 1 ) ) t N − 1 , where t N − 1 is the t distribution with N − 1 degrees of freedom.23 In the presence of (unconditional) cross-sectional heteroskedasticity, it is unlikely that Q i and Ω 1 , T i do not change with i. But under homoskedasticity (with possible conditional temporal heteroskedasticity) or temporal heteroskedasticity, the moment condition (25) and the asymptotic distribution (31) are still valid, and thus, one can use the √ ( N / ( N − 1 ) ) t N − 1 approximation to conduct the t-test. As Hansen (2007) points out, if N → ∞ , then √ ( N / ( N − 1 ) ) t N − 1 converges to the standard normal distribution, so this approximation can also be used under large N.
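To see why this approximation merges into the standard normal, one can check the second moment of the scaled t variate (a standard t-distribution fact, using Var(t_ν) = ν/(ν − 2) for ν > 2, stated here only as a sanity check, not as an additional result of the paper):

```latex
\operatorname{Var}\!\Big(\sqrt{\tfrac{N}{N-1}}\; t_{N-1}\Big)
  \;=\; \frac{N}{N-1}\cdot\frac{N-1}{N-3}
  \;=\; \frac{N}{N-3} \;\longrightarrow\; 1
  \qquad (N \to \infty,\; N > 3),
```

so the finite-N correction inflates the usual normal critical values only mildly once N is moderately large.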

Appendix E. Cross-Sectional Correlation

Now consider the more general case where Cov ( u i , u j ) = Σ i j = Dg ( σ i j , 1 2 , … , σ i j , T 2 ) . (When i = j , Σ i i = Σ i .) That is, cross-sectional correlation (and heteroskedasticity) may exist, but temporal correlation is ruled out. In this situation, both N and T need to be large.
When there is no unit root, from Lemma A8, one sees that, for θ in a neighborhood of θ 0 , g i ( θ ) is dominated by W i M u i . (The top p × 1 block of g i ( θ ) in this neighborhood is dominated by u i C u i + u i C X i β .) So,
1 N T i = 1 N t = 1 T g i t ( θ ) = 1 N T i = 1 N t = 1 T w i t ( u i t u ¯ i ) + o p ( 1 ) = 1 T t = 1 T 1 N i = 1 N w i t u i t 1 N i = 1 N u ¯ i 1 T t = 1 T w i t + o p ( 1 ) = 1 T t = 1 T 1 N i = 1 N w i t u i t + o p ( 1 ) = 1 T t = 1 T N 1 / 2 W t u t + o p ( 1 ) ,
where W t = ( w 1 t , … , w N t ) and u t = ( u 1 t , … , u N t ) . The second-to-last line follows because T 1 / 2 t = 1 T w i t = O p ( 1 ) and u ¯ i = O p ( T 1 / 2 ) . Note that { N 1 / 2 W t u t } t = 1 T forms a martingale difference sequence, and so does { N 1 W t u t u t W t N 1 W t Σ t W t } t = 1 T , where Σ t = E ( u t u t ) = Var ( u t ) .
lim T 1 T t = 1 T E ( N 1 W t u t u t W t ) = plim N , T 1 N T t = 1 T W t Σ t W t
provided that plim N , T ( N T ) 1 t = 1 T W t Σ t W t exists and is positive definite. A sufficient condition is that the positive definite N × N matrix Σ t is bounded in the norm as N . Then,
1 N T i = 1 N t = 1 T g i t d N 0 , plim N , T 1 N T t = 1 T W t Σ t W t
and accordingly, in view of the proof of Lemma A9, one has
N T ( θ ^ θ 0 ) d N 0 , plim N , T 1 N T W A W 1 plim N , T 1 N T t = 1 T W t Σ t W t plim N , T 1 N T W A W 1 .
In practice, plim N , T ( N T ) 1 t = 1 T W t Σ t W t can be estimated by
1 N T t = 1 T W t ( ϵ ^ t ϵ ^ ¯ ) ( ϵ ^ t ϵ ^ ¯ ) W t
where ϵ ^ ¯ = T 1 t = 1 T ϵ ^ t and ϵ ^ t = y t W t θ ^ .
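In code, the estimator in the preceding display can be sketched as follows (a minimal illustration, not the paper's implementation; the function name and array layout are my own assumptions):

```python
import numpy as np

def cross_sectional_var_estimator(W, eps_hat):
    """Sample analogue of plim (NT)^{-1} sum_t W_t' Sigma_t W_t, computed as
    (NT)^{-1} sum_t W_t'(eps_t - eps_bar)(eps_t - eps_bar)' W_t.

    W:       (T, N, m) array; W[t] stacks the N rows w_it' at time t
    eps_hat: (T, N) array of residuals, eps_hat[t] = y_t - W_t theta_hat
    """
    T, N, m = W.shape
    eps_bar = eps_hat.mean(axis=0)        # time average of residuals, per unit
    e = eps_hat - eps_bar                 # demeaned residuals
    S = np.zeros((m, m))
    for t in range(T):
        v = W[t].T @ e[t]                 # W_t'(eps_t - eps_bar), length m
        S += np.outer(v, v)
    return S / (N * T)
```

The outer-product form keeps the estimate symmetric and positive semidefinite by construction, which matters when it is plugged into the sandwich variance above.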
When there is a unit root, from Lemma A16, g i ( θ ) is dominated by W i M u i , where its top p × 1 block is dominated by elements like α i u i C 1 + u i C X i β , ℓ = 1 , … , p , which are O P ( T 3 / 2 ) , and its lower k × 1 block is X i M u i , which is O P ( T 1 / 2 ) . Proceeding similarly, one can write
Υ 1 / 2 i = 1 N t = 1 T g i t ( θ ) = Υ 1 / 2 t = 1 T W t u t + o p ( 1 )
and
lim T Υ 1 / 2 t = 1 T E ( W t u t u t W t ) Υ 1 / 2 = plim N , T Υ 1 / 2 t = 1 T W t Σ t W t Υ 1 / 2 .
Finally, in view of the proof of Lemma A17, one has
Υ 1 / 2 ( θ ^ θ 0 ) d N 0 , V ,
where V = ( plim N , T → ∞ Υ 1 / 2 W A W Υ 1 / 2 ) 1 ( plim N , T → ∞ Υ 1 / 2 ( t = 1 T W t Σ t W t ) Υ 1 / 2 ) ( plim N , T → ∞ Υ 1 / 2 W A W Υ 1 / 2 ) 1 .

Appendix F. Additional Simulation Results

This section provides simulation results for ϕ 0 = ( 0.3 , 0.2 , 0.1 ) and ( 0.3 , 0.6 , 0.1 ) in the DP(3) model, as specified by (36) with the error term (37). The experimental design is exactly the same as that in Section 5 in the main text. Recall that all the results pertain to the sum of the autoregressive parameters, namely, ϕ 01 + ϕ 02 + ϕ 03 .
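For readers replicating these experiments, a minimal homoskedastic DP(3) generator might look as follows (an illustrative sketch only: the function name, burn-in, zero start-up values, omitted exogenous covariates, and standard normal errors are my assumptions, not the exact design of (36)–(37)):

```python
import numpy as np

def simulate_dp3(N, T, phi, alpha_sd=1.0, burn=50, seed=0):
    """Simulate a balanced DP(3) panel
        y_it = phi1*y_{i,t-1} + phi2*y_{i,t-2} + phi3*y_{i,t-3} + alpha_i + u_it
    with i.i.d. N(0,1) errors (homoskedastic case), fixed effects alpha_i,
    zero start-up values, and a burn-in period. Exogenous covariates are
    omitted for brevity. Returns an (N, T) array of y."""
    rng = np.random.default_rng(seed)
    p = len(phi)
    alpha = alpha_sd * rng.standard_normal(N)
    total = burn + T
    y = np.zeros((N, total + p))                 # first p columns are start-up zeros
    u = rng.standard_normal((N, total))
    for t in range(total):
        lags = sum(phi[j] * y[:, t + p - 1 - j] for j in range(p))
        y[:, t + p] = lags + alpha + u[:, t]
    return y[:, p + burn:]                       # drop start-up and burn-in columns
```

Note that the second design, ϕ0 = (0.3, 0.6, 0.1), sums to one, so it generates the unit-root case; the burn-in then removes nothing stationary but the recursion is still well defined.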
Table A1. Additional simulation results under homoskedasticity.
| ϕ0 | Statistic | Estimator | (100, 10) | (50, 20) | (50, 50) | (25, 40) | (20, 50) | (10, 100) |
|---|---|---|---|---|---|---|---|---|
| (0.3, 0.2, 0.1) | Bias (×100) | WG | −5.54 | −2.56 | −0.98 | −1.28 | −1.05 | −0.52 |
| | | GMM | −1.72 | −2.57 | −0.98 | −1.28 | −1.05 | −1.08 |
| | | BC | −0.79 | −0.28 | −0.07 | −0.15 | −0.14 | −0.06 |
| | | BCJ | −0.77 | −0.28 | −0.07 | −0.15 | −0.14 | −0.06 |
| | | HPJ | 2.94 | 0.46 | 0.06 | 0.08 | 0.03 | 0.03 |
| | | RMM | 0.03 | −0.01 | −0.00 | −0.05 | −0.07 | −0.04 |
| | | RMMr | 0.03 | −0.01 | −0.00 | −0.05 | −0.07 | −0.04 |
| | RMSE (×100) | WG | 6.13 | 3.47 | 1.72 | 2.61 | 2.47 | 2.28 |
| | | GMM | 3.30 | 3.49 | 1.72 | 2.61 | 2.47 | 3.37 |
| | | BC | 2.78 | 2.40 | 1.42 | 2.29 | 2.25 | 2.22 |
| | | BCJ | 2.78 | 2.40 | 1.42 | 2.29 | 2.25 | 2.22 |
| | | HPJ | 4.58 | 2.69 | 1.47 | 2.41 | 2.34 | 2.27 |
| | | RMM | 2.67 | 2.37 | 1.42 | 2.29 | 2.25 | 2.22 |
| | | RMMr | 2.68 | 2.37 | 1.42 | 2.29 | 2.25 | 2.22 |
| | Size (5%) | WG | 57.32 | 19.30 | 10.39 | 8.80 | 7.44 | 5.68 |
| | | WG(h) | 57.70 | 20.91 | 11.33 | 11.19 | 9.92 | 10.31 |
| | | GMM | 9.25 | 19.14 | 10.37 | 8.73 | 7.39 | 6.16 |
| | | GMM(h) | 10.05 | 21.64 | 11.73 | 12.00 | 10.81 | 12.63 |
| | | HPJ | 14.31 | 4.83 | 4.51 | 4.85 | 4.77 | 4.77 |
| | | RMM(N) | 6.04 | 6.03 | 5.63 | 6.98 | 7.08 | 9.83 |
| | | RMM(T) | 6.07 | 5.36 | 4.96 | 5.20 | 5.03 | 4.96 |
| | | RMMr(N) | 6.07 | 6.00 | 5.63 | 7.00 | 7.10 | 9.83 |
| | | RMMr(NT) | 6.68 | 6.37 | 5.76 | 7.08 | 7.11 | 9.88 |
| | | RMMr(T) | 6.10 | 5.17 | 4.78 | 5.03 | 4.44 | 4.20 |
| (0.3, 0.6, 0.1) | Bias (×100) | WG | −1.88 | −0.61 | −0.13 | −0.20 | −0.14 | −0.05 |
| | | GMM | −0.46 | −0.58 | −0.13 | −0.20 | −0.14 | −0.15 |
| | | BC | 10.76 | 9.60 | 4.44 | 5.47 | 4.44 | 2.23 |
| | | BCJ | 10.77 | 9.60 | 4.44 | 5.47 | 4.44 | 2.23 |
| | | HPJ | 1.99 | 0.72 | 0.16 | 0.25 | 0.18 | 0.06 |
| | | RMM | 0.03 | 0.01 | −0.00 | 0.00 | 0.00 | −0.00 |
| | | RMMr | 0.03 | 0.01 | −0.00 | 0.00 | 0.00 | −0.00 |
| | RMSE (×100) | WG | 2.13 | 0.81 | 0.20 | 0.36 | 0.28 | 0.14 |
| | | GMM | 1.11 | 0.79 | 0.20 | 0.36 | 0.28 | 0.39 |
| | | BC | 11.07 | 9.72 | 4.45 | 5.51 | 4.46 | 2.24 |
| | | BCJ | 11.09 | 9.72 | 4.45 | 5.51 | 4.46 | 2.24 |
| | | HPJ | 2.77 | 1.22 | 0.32 | 0.59 | 0.48 | 0.26 |
| | | RMM | 1.00 | 0.53 | 0.14 | 0.29 | 0.24 | 0.13 |
| | | RMMr | 1.00 | 0.53 | 0.14 | 0.29 | 0.24 | 0.13 |
| | Size (5%) | WG | 48.44 | 22.53 | 15.58 | 11.65 | 9.62 | 6.56 |
| | | WG(h) | 47.04 | 23.68 | 17.82 | 16.22 | 15.05 | 17.45 |
| | | GMM | 7.46 | 20.72 | 15.54 | 11.53 | 9.54 | 6.76 |
| | | GMM(h) | 8.42 | 22.74 | 18.40 | 17.11 | 15.97 | 18.56 |
| | | HPJ | 23.33 | 14.43 | 11.96 | 9.48 | 9.07 | 7.54 |
| | | RMM(N) | 6.40 | 7.23 | 7.01 | 9.58 | 10.84 | 15.17 |
| | | RMM(T) | 5.57 | 5.41 | 5.13 | 5.67 | 5.55 | 5.03 |
| | | RMMr(N) | 6.31 | 7.11 | 6.90 | 9.54 | 10.87 | 15.10 |
| | | RMMr(NT) | 6.32 | 7.05 | 6.92 | 9.51 | 10.71 | 14.94 |
| | | RMMr(T) | 5.78 | 5.88 | 5.95 | 7.07 | 7.18 | 7.95 |
Note: See Table 1 in the main text.
Table A2. Additional simulation results under cross-sectional heteroskedasticity.
| ϕ0 | Statistic | Estimator | (100, 10) | (50, 20) | (50, 50) | (25, 40) | (20, 50) | (10, 100) |
|---|---|---|---|---|---|---|---|---|
| (0.3, 0.2, 0.1) | Bias (×100) | WG | −21.40 | −9.10 | −3.51 | −3.74 | −2.73 | −1.03 |
| | | GMM | −8.69 | −9.16 | −3.51 | −3.74 | −2.73 | −2.17 |
| | | BC | −2.55 | −0.62 | −0.11 | −0.18 | −0.10 | −0.04 |
| | | BCJ | −2.57 | −0.62 | −0.11 | −0.18 | −0.10 | −0.04 |
| | | HPJ | 5.93 | 0.96 | 0.14 | 0.21 | 0.18 | 0.06 |
| | | RMM | −0.20 | −0.12 | −0.03 | −0.10 | −0.05 | −0.03 |
| | | RMMr | −0.23 | −0.12 | −0.03 | −0.10 | −0.05 | −0.03 |
| | RMSE (×100) | WG | 22.14 | 10.36 | 4.61 | 5.64 | 4.88 | 3.49 |
| | | GMM | 10.99 | 10.43 | 4.61 | 5.64 | 4.88 | 5.29 |
| | | BC | 6.73 | 5.17 | 3.04 | 4.29 | 4.10 | 3.35 |
| | | BCJ | 6.74 | 5.17 | 3.04 | 4.29 | 4.10 | 3.35 |
| | | HPJ | 10.38 | 6.02 | 3.20 | 4.57 | 4.32 | 3.44 |
| | | RMM | 6.47 | 5.20 | 3.04 | 4.30 | 4.10 | 3.35 |
| | | RMMr | 6.49 | 5.20 | 3.04 | 4.30 | 4.10 | 3.35 |
| | Size (5%) | WG | 98.46 | 53.54 | 27.99 | 18.00 | 13.32 | 7.25 |
| | | WG(h) | 96.99 | 48.94 | 23.76 | 18.29 | 14.45 | 11.71 |
| | | GMM | 31.94 | 53.17 | 27.93 | 17.87 | 13.16 | 8.14 |
| | | GMM(h) | 27.36 | 49.58 | 24.47 | 18.94 | 15.54 | 14.90 |
| | | HPJ | 15.22 | 6.96 | 5.58 | 5.27 | 5.75 | 5.15 |
| | | RMM(N) | 6.34 | 7.02 | 6.49 | 7.66 | 8.19 | 10.44 |
| | | RMM(T) | 13.04 | 9.60 | 8.49 | 7.54 | 7.64 | 6.21 |
| | | RMMr(N) | 6.35 | 7.03 | 6.44 | 7.63 | 8.19 | 10.46 |
| | | RMMr(NT) | 9.26 | 7.83 | 6.83 | 7.90 | 8.39 | 10.57 |
| | | RMMr(T) | 8.54 | 6.70 | 5.69 | 5.69 | 5.81 | 4.28 |
| (0.3, 0.6, 0.1) | Bias (×100) | WG | −26.46 | −6.50 | −1.52 | −1.28 | −0.74 | −0.14 |
| | | GMM | −12.14 | −6.35 | −1.52 | −1.28 | −0.74 | −0.46 |
| | | BC | −8.34 | −0.64 | −0.08 | 0.51 | 0.72 | 1.13 |
| | | BCJ | −8.36 | −0.64 | −0.08 | 0.51 | 0.72 | 1.13 |
| | | HPJ | 8.07 | 5.02 | 1.58 | 1.39 | 0.87 | 0.19 |
| | | RMM | 0.83 | 0.10 | 0.01 | 0.01 | 0.00 | 0.00 |
| | | RMMr | 0.92 | 0.10 | 0.01 | 0.01 | 0.00 | 0.00 |
| | RMSE (×100) | WG | 26.82 | 6.80 | 1.62 | 1.50 | 0.94 | 0.27 |
| | | GMM | 13.06 | 6.66 | 1.62 | 1.50 | 0.94 | 0.80 |
| | | BC | 10.11 | 2.21 | 0.61 | 1.00 | 0.98 | 1.19 |
| | | BCJ | 10.13 | 2.21 | 0.61 | 1.00 | 0.99 | 1.19 |
| | | HPJ | 10.44 | 5.91 | 1.83 | 1.93 | 1.33 | 0.47 |
| | | RMM | 7.33 | 2.14 | 0.55 | 0.77 | 0.57 | 0.23 |
| | | RMMr | 7.54 | 2.17 | 0.55 | 0.77 | 0.57 | 0.23 |
| | Size (5%) | WG | 100.00 | 96.90 | 87.55 | 44.27 | 28.21 | 9.49 |
| | | WG(h) | 100.00 | 93.41 | 81.93 | 43.97 | 31.50 | 18.86 |
| | | GMM | 78.51 | 95.93 | 87.51 | 44.06 | 28.05 | 11.75 |
| | | GMM(h) | 74.54 | 92.19 | 82.46 | 45.32 | 33.42 | 23.42 |
| | | HPJ | 21.06 | 38.18 | 45.46 | 21.51 | 16.91 | 9.83 |
| | | RMM(N) | 5.09 | 6.69 | 6.91 | 9.00 | 9.73 | 14.41 |
| | | RMM(T) | 31.87 | 12.73 | 8.57 | 7.25 | 6.68 | 5.60 |
| | | RMMr(N) | 5.23 | 6.63 | 6.87 | 8.84 | 9.65 | 14.43 |
| | | RMMr(NT) | 18.50 | 8.12 | 6.84 | 8.32 | 9.05 | 13.66 |
| | | RMMr(T) | 17.54 | 6.89 | 5.73 | 5.81 | 6.04 | 6.79 |
Note: See Table 1 in the main text.
Table A3. Additional simulation results under temporal heteroskedasticity.
| ϕ0 | Statistic | Estimator | (100, 10) | (50, 20) | (50, 50) | (25, 40) | (20, 50) | (10, 100) |
|---|---|---|---|---|---|---|---|---|
| (0.3, 0.2, 0.1) | Bias (×100) | WG | −27.50 | −10.98 | −3.88 | −5.11 | −4.06 | −1.90 |
| | | GMM | −12.93 | −11.09 | −3.88 | −5.11 | −4.06 | −4.14 |
| | | BC | −3.32 | −0.94 | −0.17 | −0.39 | −0.35 | −0.14 |
| | | BCJ | −3.38 | −0.92 | −0.16 | −0.38 | −0.35 | −0.14 |
| | | HPJ | 7.66 | 1.11 | −0.01 | −0.07 | −0.10 | 0.01 |
| | | RMM | 1.16 | −0.41 | −0.11 | −0.29 | −0.29 | −0.12 |
| | | RMMr | −0.13 | −0.07 | −0.06 | −0.20 | −0.24 | −0.11 |
| | RMSE (×100) | WG | 28.15 | 12.05 | 4.90 | 6.99 | 6.24 | 4.95 |
| | | GMM | 15.14 | 12.17 | 4.90 | 6.99 | 6.24 | 7.87 |
| | | BC | 7.50 | 5.24 | 3.04 | 4.88 | 4.83 | 4.60 |
| | | BCJ | 7.51 | 5.24 | 3.04 | 4.88 | 4.83 | 4.60 |
| | | HPJ | 12.82 | 6.03 | 3.47 | 5.54 | 5.46 | 5.07 |
| | | RMM | 7.51 | 5.25 | 3.04 | 4.89 | 4.83 | 4.61 |
| | | RMMr | 6.87 | 5.26 | 3.04 | 4.89 | 4.83 | 4.61 |
| | Size (5%) | WG | 99.73 | 62.19 | 29.44 | 21.88 | 16.53 | 8.85 |
| | | WG(h) | 99.63 | 62.05 | 27.23 | 22.19 | 17.38 | 12.23 |
| | | GMM | 41.75 | 62.62 | 29.34 | 21.71 | 16.39 | 15.73 |
| | | GMM(h) | 38.22 | 63.06 | 27.86 | 23.54 | 18.14 | 17.61 |
| | | HPJ | 20.65 | 6.23 | 7.50 | 7.20 | 7.21 | 7.14 |
| | | RMM(N) | 5.92 | 6.69 | 5.92 | 7.57 | 7.53 | 9.54 |
| | | RMM(T) | 12.64 | 7.52 | 7.27 | 7.22 | 7.27 | 7.12 |
| | | RMMr(N) | 6.32 | 6.64 | 5.92 | 7.51 | 7.49 | 9.56 |
| | | RMMr(NT) | 8.25 | 8.04 | 6.29 | 8.02 | 7.89 | 9.62 |
| | | RMMr(T) | 7.66 | 6.69 | 5.29 | 5.53 | 5.21 | 4.26 |
| (0.3, 0.6, 0.1) | Bias (×100) | WG | −21.28 | −10.40 | −2.46 | −3.77 | −2.55 | −0.73 |
| | | GMM | −10.85 | −10.54 | −2.46 | −3.77 | −2.55 | −2.75 |
| | | BC | −2.65 | −3.11 | −0.54 | −1.45 | −1.03 | −0.38 |
| | | BCJ | −3.07 | −3.09 | −0.53 | −1.45 | −1.03 | −0.38 |
| | | HPJ | 13.02 | 3.82 | 1.29 | 1.55 | 1.36 | 0.68 |
| | | RMM | 16.75 | −0.38 | −0.49 | −0.69 | −0.47 | −0.12 |
| | | RMMr | 0.13 | 0.28 | 0.04 | 0.18 | 0.15 | 0.03 |
| | RMSE (×100) | WG | 21.70 | 10.64 | 2.55 | 4.00 | 2.78 | 0.94 |
| | | GMM | 11.97 | 10.80 | 2.55 | 4.00 | 2.78 | 3.22 |
| | | BC | 5.29 | 4.24 | 1.00 | 2.21 | 1.68 | 0.78 |
| | | BCJ | 5.36 | 4.23 | 1.01 | 2.21 | 1.68 | 0.78 |
| | | HPJ | 14.69 | 5.40 | 1.70 | 2.70 | 2.28 | 1.18 |
| | | RMM | 17.83 | 2.76 | 0.83 | 1.53 | 1.22 | 0.61 |
| | | RMMr | 4.43 | 2.82 | 0.77 | 1.69 | 1.37 | 0.66 |
| | Size (5%) | WG | 99.99 | 99.99 | 99.27 | 92.78 | 79.90 | 33.50 |
| | | WG(h) | 99.93 | 99.87 | 97.76 | 88.17 | 73.30 | 35.91 |
| | | GMM | 63.43 | 99.98 | 99.26 | 92.75 | 79.81 | 60.53 |
| | | GMM(h) | 62.19 | 99.86 | 97.90 | 89.19 | 75.31 | 57.29 |
| | | HPJ | 51.16 | 21.70 | 27.82 | 15.92 | 16.39 | 14.10 |
| | | RMM(N) | 1.91 | 7.99 | 15.13 | 14.12 | 13.71 | 15.72 |
| | | RMM(T) | 91.57 | 13.60 | 19.76 | 15.29 | 14.46 | 11.30 |
| | | RMMr(N) | 5.78 | 5.65 | 6.46 | 7.07 | 8.21 | 14.10 |
| | | RMMr(NT) | 6.73 | 8.68 | 9.03 | 10.52 | 11.09 | 13.70 |
| | | RMMr(T) | 6.22 | 7.22 | 7.59 | 7.74 | 7.47 | 6.15 |
Note: See Table 1 in the main text.
Table A4. Additional simulation results under double heteroskedasticity.
| ϕ0 | Statistic | Estimator | (100, 10) | (50, 20) | (50, 50) | (25, 40) | (20, 50) | (10, 100) |
|---|---|---|---|---|---|---|---|---|
| (0.3, 0.2, 0.1) | Bias (×100) | WG | −18.86 | −9.24 | −3.45 | −4.69 | −3.66 | −1.94 |
| | | GMM | −6.89 | −9.11 | −3.45 | −4.69 | −3.66 | −4.05 |
| | | BC | −1.98 | −0.68 | −0.12 | −0.36 | −0.19 | −0.22 |
| | | BCJ | −1.93 | −0.67 | −0.12 | −0.36 | −0.19 | −0.22 |
| | | HPJ | 6.07 | 0.84 | 0.13 | 0.05 | 0.14 | −0.05 |
| | | RMM | −0.94 | −0.36 | −0.06 | −0.27 | −0.13 | −0.21 |
| | | RMMr | −0.12 | −0.12 | −0.04 | −0.21 | −0.10 | −0.20 |
| | RMSE (×100) | WG | 19.45 | 10.31 | 4.43 | 6.53 | 5.81 | 4.84 |
| | | GMM | 8.80 | 10.20 | 4.43 | 6.53 | 5.81 | 7.59 |
| | | BC | 5.48 | 4.77 | 2.83 | 4.64 | 4.59 | 4.47 |
| | | BCJ | 5.47 | 4.77 | 2.83 | 4.64 | 4.59 | 4.47 |
| | | HPJ | 9.41 | 5.68 | 3.07 | 5.19 | 5.09 | 4.81 |
| | | RMM | 5.29 | 4.77 | 2.83 | 4.65 | 4.59 | 4.47 |
| | | RMMr | 5.29 | 4.78 | 2.83 | 4.65 | 4.59 | 4.47 |
| | Size (5%) | WG | 98.41 | 55.29 | 26.25 | 20.08 | 14.82 | 8.54 |
| | | WG(h) | 98.01 | 54.78 | 25.58 | 21.80 | 16.32 | 13.04 |
| | | GMM | 24.19 | 53.47 | 26.17 | 19.93 | 14.72 | 14.17 |
| | | GMM(h) | 24.61 | 53.78 | 26.07 | 22.97 | 17.41 | 17.73 |
| | | HPJ | 18.04 | 7.34 | 6.19 | 7.18 | 6.85 | 6.44 |
| | | RMM(N) | 6.52 | 6.64 | 6.35 | 7.50 | 7.45 | 10.04 |
| | | RMM(T) | 8.87 | 7.36 | 6.71 | 7.06 | 6.88 | 6.61 |
| | | RMMr(N) | 6.11 | 6.69 | 6.36 | 7.53 | 7.42 | 10.08 |
| | | RMMr(NT) | 7.80 | 7.75 | 6.68 | 7.97 | 7.76 | 10.22 |
| | | RMMr(T) | 7.23 | 6.64 | 5.56 | 5.57 | 5.04 | 4.53 |
| (0.3, 0.6, 0.1) | Bias (×100) | WG | −23.86 | −9.12 | −1.80 | −3.13 | −2.14 | −0.66 |
| | | GMM | −9.27 | −8.78 | −1.80 | −3.13 | −2.14 | −2.54 |
| | | BC | −6.98 | −2.36 | −0.19 | −0.93 | −0.66 | −0.30 |
| | | BCJ | −6.90 | −2.34 | −0.18 | −0.93 | −0.66 | −0.30 |
| | | HPJ | 6.55 | 4.26 | 1.56 | 2.05 | 1.62 | 0.69 |
| | | RMM | −3.84 | −1.65 | −0.27 | −0.58 | −0.37 | −0.10 |
| | | RMMr | 0.93 | 0.37 | 0.01 | 0.10 | 0.09 | 0.02 |
| | RMSE (×100) | WG | 24.19 | 9.40 | 1.90 | 3.38 | 2.38 | 0.87 |
| | | GMM | 10.09 | 9.07 | 1.90 | 3.38 | 2.38 | 3.02 |
| | | BC | 8.54 | 3.61 | 0.72 | 1.75 | 1.37 | 0.70 |
| | | BCJ | 8.52 | 3.62 | 0.72 | 1.75 | 1.37 | 0.70 |
| | | HPJ | 8.88 | 5.57 | 1.82 | 2.88 | 2.34 | 1.15 |
| | | RMM | 6.29 | 2.97 | 0.65 | 1.41 | 1.13 | 0.57 |
| | | RMMr | 6.69 | 3.21 | 0.63 | 1.52 | 1.23 | 0.61 |
| | Size (5%) | WG | 100.00 | 99.82 | 94.38 | 86.12 | 70.85 | 30.99 |
| | | WG(h) | 100.00 | 99.24 | 89.24 | 77.05 | 61.59 | 33.59 |
| | | GMM | 68.36 | 99.60 | 94.38 | 86.02 | 70.68 | 56.04 |
| | | GMM(h) | 65.89 | 98.79 | 89.51 | 78.65 | 63.37 | 51.80 |
| | | HPJ | 18.99 | 26.15 | 41.93 | 21.92 | 21.22 | 14.66 |
| | | RMM(N) | 18.24 | 14.35 | 10.19 | 12.04 | 12.00 | 15.49 |
| | | RMM(T) | 37.89 | 25.46 | 14.43 | 16.25 | 14.44 | 10.97 |
| | | RMMr(N) | 4.64 | 4.49 | 6.85 | 7.35 | 8.27 | 14.36 |
| | | RMMr(NT) | 17.18 | 12.61 | 8.11 | 10.66 | 10.85 | 13.78 |
| | | RMMr(T) | 16.05 | 10.80 | 6.87 | 7.72 | 7.03 | 6.30 |
Note: See Table 1 in the main text.

Notes

1
No attempt is made here to provide an exhaustive list of published papers in this literature. Readers can refer to Okui (2021) for a comprehensive literature review.
2
As one referee points out, many papers in the literature distinguish between exogeneity and endogeneity depending on whether a regressor is correlated with the idiosyncratic error, while allowing it to be correlated with the individual fixed effects. As such, the (random) regressors X in Assumption 3 to be introduced are exogenous. The word “endogeneity” or “endogenous” in this paper exclusively refers to the lagged dependent variables.
3
One referee suggests that, in view of the numerical equivalence, the estimator in this paper should be named the implicit indirect inference estimator. Nonetheless, given the different motivations, the suggested name is not adopted in this paper.
4
They mention the possibility of both cross-sectional and temporal heteroskedasticity under more stringent conditions, but they do not rigorously derive the properties of the resulting estimator.
5
The higher-order DP in Juodis (2013) is presented as a panel first-order vector autoregression (VAR(1)).
6
Of course, if one assumes that α i are fixed constants of bounded magnitudes, then Assumption 1 should be dropped. When they are random, the i.i.d. assumption could be relaxed, so long as the relevant probability limits pertaining to the recentered moments and gradient are well defined.
7
In this case, Assumptions 1 and 2 do not apply. One could also follow Han and Phillips (2010) to assume that the fixed effects disappear when there is a unit root, but an extension along this line is not pursued in this paper. In practice, the fixed effects are not estimated anyway, so treating them as random or not does not affect the estimation strategy to be presented in this paper.
8
The derivation of (4) is analogous to rewriting lagged time-series vectors from an autoregressive process of order p: y t = ϕ 1 y t 1 + ⋯ + ϕ p y t p + u t , t = 1 , … , T , where y 0 , y 1 , … , y t p + 1 are given. With obvious notation, y = ϕ 1 y ( 1 ) + ⋯ + ϕ p y ( p ) + u . Note that y ( ) = L y ( + 1 ) + e 1 y ( 1 ) , ℓ = 1 , … , p , where y ( 0 ) = y . Substituting y ( 1 ) = L y + e 1 y 0 , y ( 2 ) = L y ( 1 ) + e 1 y 1 = L 2 y + L e 1 y 0 + e 1 y 1 , and so on to the right-hand side of y = ϕ 1 y ( 1 ) + ⋯ + ϕ p y ( p ) + u , one can solve for y and hence y ( 1 ) , given by y ( 1 ) = Φ p 1 ( L u + e 1 y 0 + s = 1 p 1 Φ ( s p ) L s e 1 y s ) at the true parameter vector. With y ( 1 ) , the expression of y ( 2 ) follows from y ( 2 ) = L y ( 1 ) + e 1 y 1 . By successive substitutions, y ( ) = Φ p 1 ( L u + s = 0 1 Φ s L 1 s e 1 y s + s = p 1 Φ ( s p ) L 1 s e 1 y s ) .
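For p = 1, the closed form y(−1) = Φ1^{−1}(Lu + e1·y0) can be verified numerically (a standalone check with made-up numbers, not code from the paper):

```python
import numpy as np

T, phi, y0 = 6, 0.5, 1.0
u = np.array([0.3, -0.2, 0.1, 0.4, -0.1, 0.2])   # fixed innovations

Lmat = np.eye(T, k=-1)           # lag matrix: ones on the first subdiagonal
e1 = np.zeros(T); e1[0] = 1.0
Phi1 = np.eye(T) - phi * Lmat    # Phi_1 = I - phi*L

# Recursive AR(1): y_t = phi*y_{t-1} + u_t, t = 1, ..., T, with y_0 given
y = np.empty(T)
prev = y0
for t in range(T):
    y[t] = phi * prev + u[t]
    prev = y[t]

y_lag_direct = np.concatenate(([y0], y[:-1]))             # (y_0, y_1, ..., y_{T-1})'
y_lag_closed = np.linalg.solve(Phi1, Lmat @ u + e1 * y0)  # Phi_1^{-1}(L u + e_1 y_0)
```

The two vectors coincide because L and Φ1 commute (both are polynomials in L), which is the algebra behind the successive substitutions above.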
9
At the time of writing, the author was not aware of the work of Breitung et al. (2022). In addition to differences in the allowable heteroskedasticity, unit root, and asymptotic regimes, their approach is motivated by correcting the profile score from a normal likelihood function, but the estimator in this paper explicitly uses the endogeneity of the within-group transformed lagged dependent variables to construct the recentered moment conditions E ( g N T ( θ 0 ) ) = 0 in this and the next sections.
10
Var ( N T g N T ) = [ Var ( W A u ) + h h Var ( u A u ) + Cov ( W A u , u A u ) h + h Cov ( u A u , W A u ) ] / N T . Further, the ( i , j )-th element of h h Var ( u A u ) is equal to the negative of the ( i , j )-th element of Cov ( W A u , u A u ) h and Var ( u A u ) / N T = σ 4 [ 2 ( T 1 ) / T + γ 2 ( T 1 ) 2 / T 2 ] . These results lead to the variance expression (10). See Bao and Yu (2023) for the detailed derivation.
11
Dovonon et al. (2020) point out that there may exist situations where global identification holds but first-order local identification fails. They provide such an example based on the special case of a unit-root DP(1) with no exogenous covariates, namely, p = 1 , k = 0 , and ϕ 0 = 1 , where a GMM estimator is used. When T = 4 and E [ ( α i ( 1 ϕ 0 ) y i 1 ) 2 ] = 0 , they show that the Jacobian is a null vector, and thus, the GMM is not able to first-order identify the parameter, though global identification and second-order local identification still hold. For the estimator proposed in this paper, it can be shown that when ϕ 0 = 1 , the (unscaled) moment condition at ϕ 0 = 1 becomes y ( 1 ) A u + u A u / 2 and its derivative at ϕ 0 = 1 becomes y ( 1 ) A y ( 1 ) y ( 1 ) A u + ( T 2 ) u A u / 6 , where E ( y ( 1 ) A y ( 1 ) ) = N ( T 2 1 ) σ 2 / 6 + ( T 3 T ) E ( α α ) / 12 , E ( y ( 1 ) A u ) = N σ 2 ( 1 T ) / 2 , and E ( u A u ) = N ( T 1 ) σ 2 . If further α = 0 , then the Jacobian is equal to 0, and thus, the first-order local identification condition fails in this special case. This is also recognized by Dhaene and Jochmans (2016) (see their Corollary 4.1) when they design their adjusted profile likelihood estimator. For a general DP(p), under some rare circumstances, it may happen that there are multiple zeros when one solves the adjusted profile score function; for local identification, Dhaene and Jochmans (2016) recommend numerical search starting from the WG estimator. Results from numerical grid search in Dhaene and Jochmans (2016) and Bao and Yu (2023) suggest that the issue of multiple zeros may not be of practical concern.
12
In Kelejian and Prucha (2010), the linear form in u involves a vector of non-stochastic elements. Here, X may be random. Checking their proof, which relies on results from their earlier work (Kelejian and Prucha 2001, Theorem 1), one can see that as long as X is strictly exogenous, then the sigma-field that defines the martingale difference array in the proof of Theorem 1 in Kelejian and Prucha (2001) can be extended and the result continues to hold. In the case of random X , one can replace X AX with E ( X AX ) in various variance expressions. Further note that, in view of footnote 13 in Kelejian and Prucha (2001), one can think of their k n as N T and their n as T.
13
If it is also the case that N → ∞ , the sequential ( ( T , N ) seq ) and joint ( ( T , N ) ) asymptotic distributions may be different (Phillips and Moon 1999). Under the assumptions in this paper, for the stable case, Theorem 1 of Kelejian and Prucha (2001) essentially states no difference under the two asymptotic regimes. For the unit-root case, Appendix C (Lemma A15) shows that the two asymptotic regimes deliver the same asymptotic distribution. Recall that G N T = O p ( 1 ) + [ O ( T 2 ) + O p ( N 1 / 2 T 3 / 2 ) ] + O p ( T 1 ) , where the O p ( 1 ) term is ( N T ) 1 W A W .
14
σ 2 is defined as lim N N 1 i = 1 N σ i 2 in Breitung et al. (2022).
15
Recall from Note 11, that if further α = 0 , the method in this paper cannot identify ϕ 0 . This is also in line with Theorem 4 of Hahn and Kuersteiner (2002) that shows that their bias correction is not expected to work under this special case. Checking their proof (see their Lemma 12), one can interpret their recentered WG estimator in this special case as arising from some (unscaled) moment condition, which is y A u + 3 u ( I N L Φ 1 M Φ 1 L ) u / ( T + 1 ) at the true parameter value and has exact expectation 0 when ϕ 0 = 1 . From Note 11, for the RMM estimator in this special case, the (unscaled) moment condition at ϕ 0 = 1 becomes y ( 1 ) A u + u A u / 2 . So, if one designs a (unscaled) moment condition y ( 1 ) A ( y ϕ y ( 1 ) ) + ( y ϕ y ( 1 ) ) A ( y ϕ y ( 1 ) ) / 2 , which is valid at ϕ 0 = 1 but not valid at ϕ 0 1 , then its derivative, evaluated at ϕ 0 = 1 , is dominated by y ( 1 ) A y ( 1 ) . Using results from Appendix C, one can show plim N , T y ( 1 ) A y ( 1 ) / ( N T 2 ) = σ 2 / 6 , which is the same as Lemma 11 of Hahn and Kuersteiner (2002). Further, following similarly the proof of Lemma 12 of Hahn and Kuersteiner (2002), one has ( N T 2 ) 1 / 2 ( y ( 1 ) A u + u A u / 2 ) d N ( 0 , σ 4 / 12 ) as N , T . Correspondingly, N T 2 ( ϕ ^ 1 ) d N ( 0 , 3 ) as N , T . Recall that ϕ ^ here is solved from y ( 1 ) A ( y ϕ y ( 1 ) ) + ( y ϕ y ( 1 ) ) A ( y ϕ y ( 1 ) ) / 2 = 0 . Theorem 4 of Hahn and Kuersteiner (2002) indicates that if one were using the true parameter value in this case, one would have recentered θ ^ θ 0 by 3 / ( T + 1 ) instead of the general bias formula ( 1 + θ ^ ) / T (in their notation, where θ ^ is the WG estimator). Similarly here, one would have used the moment condition y ( 1 ) A ( y ϕ y ( 1 ) ) + ( y ϕ y ( 1 ) ) A ( y ϕ y ( 1 ) ) / 2 instead of the general one y ( 1 ) A u ( ϕ ) + u ( ϕ ) A u ( ϕ ) 1 Φ 1 ( ϕ ) L 1 / [ T ( T 1 ) ] .
16. In particular, additional simulation results are available under four non-normal distributions that are also considered in Bao and Yu (2023): uniform on [0, 1], Student's t with five degrees of freedom, log-normal lnN(0, 1), and an equal-probability mixture of N(−3, 1) and N(3, 1).
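The four non-normal error distributions can be drawn and standardized to mean zero and unit variance as in the following sketch; the seed, sample size, and sample-based standardization are illustrative, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def standardize(x):
    # Center and scale so the draws have sample mean 0 and variance 1.
    return (x - x.mean()) / x.std()

draws = {
    "uniform":   standardize(rng.uniform(0.0, 1.0, n)),
    "student_t": standardize(rng.standard_t(5, n)),
    "lognormal": standardize(rng.lognormal(0.0, 1.0, n)),
    # Equal-probability mixture of N(-3, 1) and N(3, 1).
    "mixture":   standardize(rng.normal(np.where(rng.random(n) < 0.5, -3.0, 3.0), 1.0)),
}
```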
17. The number of instruments for the GMM estimator of Arellano and Bond (1991) is of order O(T²). To prevent instrument proliferation, the total number of instruments from lagged y is capped at T₀(T₀ − 1)/2, where T₀ = min(50, T), so that when T > 50 only the first q = T₀(T₀ − 1)/2 columns of the instrument matrix are retained.
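The capping rule in the footnote can be sketched as follows; the function name and `T_cap` parameter are illustrative:

```python
def ab_instrument_count(T, T_cap=50):
    # Uncapped, the lagged-y instrument count grows at rate O(T^2);
    # cap it at T0*(T0 - 1)/2 with T0 = min(T_cap, T), as described above.
    T0 = min(T_cap, T)
    return T0 * (T0 - 1) // 2
```

For example, `ab_instrument_count(10)` gives 45, while any `T > 50` hits the cap of 1225.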
18. Bun and Carree (2006) consider DP(1) only. The BC estimator is based on the panel VAR(1) representation of DP(p) in Juodis (2013).
19. The complete results for each individual element of θ₀ (including β₀) under each parameter configuration are available upon request; they lead to conclusions similar to those reported in this section.
20. Even though the estimator itself is consistent under large N and fixed T, De Vos and Everaert (2021) assume both to be large to derive its asymptotic distribution. For practical inference, they suggest resampling the cross-sectional units and then using the empirical distribution of the estimates from the bootstrapped samples to approximate the asymptotic distribution.
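A generic sketch of such a cross-sectional (unit-level) bootstrap follows; this is not De Vos and Everaert's exact procedure, and the function and argument names are illustrative:

```python
import numpy as np

def unit_bootstrap(units, estimator, B=999, seed=0):
    # Resample entire cross-sectional units with replacement and re-estimate,
    # giving an empirical approximation to the estimator's distribution.
    rng = np.random.default_rng(seed)
    N = len(units)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)
        reps[b] = estimator([units[i] for i in idx])
    return reps
```

Resampling whole units (rather than individual observations) preserves the within-unit time-series dependence.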
21. If there are repeated roots, the exact expressions of the various terms in the lemmas to follow are different, but their orders of magnitude stay the same. This is because, for instance, for |λ| < 1, ∑_{t=1}^{T} λ^t and ∑_{t=1}^{T} t^j λ^t, where j is a positive finite integer, are of the same magnitude as T → ∞.
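The boundedness claim is easy to check numerically: for |λ| < 1, doubling T leaves both partial sums essentially unchanged. A minimal sketch (λ = 0.9 and the truncation points are illustrative):

```python
import numpy as np

def weighted_geometric_sum(lam, j, T):
    # Partial sum of t^j * lam^t for t = 1, ..., T.
    t = np.arange(1, T + 1, dtype=float)
    return float(np.sum(t**j * lam**t))

# For |lam| < 1 the sums settle down quickly, with or without the
# polynomial weight t^j, so both are of the same magnitude as T grows.
for j in (0, 2):
    print(j, weighted_geometric_sum(0.9, j, 200), weighted_geometric_sum(0.9, j, 400))
```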
22. T^{−3}σ^{−2}β₀′X_i′CX_iβ₀ is essentially a quadratic form in the idempotent matrix M, which is positive unless Φ_p^{−1}LX_iβ₀ is a constant vector.
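The positivity property of a quadratic form in an idempotent demeaning matrix can be illustrated directly: x′Mx = ∑(x_t − x̄)² ≥ 0, with equality exactly when x is constant. A small numerical sketch (the vectors are arbitrary examples):

```python
import numpy as np

T = 6
M = np.eye(T) - np.ones((T, T)) / T   # within-group demeaning matrix; idempotent

x = np.array([1.0, 2.0, 4.0, 3.0, 0.0, 5.0])   # non-constant vector
c = np.full(T, 2.5)                            # constant vector

print(x @ M @ x)   # strictly positive
print(c @ M @ c)   # zero up to rounding
```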
23. The matrix ∑_{i=1}^{N} W_i′Mv̂_iv̂_i′MW_i in this case may be adjusted by N/(N − 1). Further, the F statistic for testing q linear restrictions based on (31), with the variance matrix estimated by (32) (and ∑_{i=1}^{N} W_i′Mv̂_iv̂_i′MW_i possibly adjusted), converges to [Nq/(N − q)]F_{q,N−q}, where F_{q,N−q} denotes an F distribution with q numerator and N − q denominator degrees of freedom.
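The "middle" matrix of this cluster-robust variance estimator can be sketched as below; the function name is illustrative, and the construction simply accumulates the per-unit outer products ∑_i W_i′Mv̂_iv̂_i′MW_i with the optional N/(N − 1) adjustment mentioned above:

```python
import numpy as np

def clustered_middle_matrix(W_list, vhat_list, M, adjust=True):
    # sum_i W_i' M vhat_i vhat_i' M W_i, optionally scaled by N/(N - 1).
    N = len(W_list)
    a_list = [W.T @ (M @ v) for W, v in zip(W_list, vhat_list)]
    S = sum(np.outer(a, a) for a in a_list)
    return S * N / (N - 1) if adjust else S
```

By construction the result is symmetric and positive semidefinite, since it is a sum of outer products.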

References

  1. Ahn, Seung C., and Peter Schmidt. 1995. Efficient estimation of models for dynamic panel data. Journal of Econometrics 68: 5–27.
  2. Alvarez, Javier, and Manuel Arellano. 2003. The time series and cross-section asymptotics of dynamic panel data estimators. Econometrica 71: 1121–59.
  3. Alvarez, Javier, and Manuel Arellano. 2022. Robust likelihood estimation of dynamic panel data models. Journal of Econometrics 226: 21–61.
  4. Anderson, Theodore W., and Cheng Hsiao. 1981. Estimation of dynamic models with error components. Journal of the American Statistical Association 76: 598–606.
  5. Arellano, Manuel, and Stephen Bond. 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. The Review of Economic Studies 58: 277–97.
  6. Bao, Yong, and Aman Ullah. 2010. Expectation of quadratic forms in normal and nonnormal variables with applications. Journal of Statistical Planning and Inference 140: 1193–205.
  7. Bao, Yong, and Xuewen Yu. 2023. Indirect inference estimation of dynamic panel data models. Journal of Econometrics 235: 1027–53.
  8. Bao, Yong, Xiaotian Liu, and Lihong Yang. 2020. Indirect inference estimation of spatial autoregressions. Econometrics 8: 34.
  9. Blundell, Richard, and Richard J. Smith. 1991. Initial conditions and efficient estimation in dynamic panel data models. Annals of Economics and Statistics 20/21: 109–23.
  10. Blundell, Richard, and Stephen Bond. 1998. Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87: 115–43.
  11. Breitung, Jörg, Sebastian Kripfganz, and Kazuhiko Hayakawa. 2022. Bias-corrected method of moments estimators for dynamic panel data models. Econometrics and Statistics 24: 116–32.
  12. Bun, Maurice J. G., and Martin A. Carree. 2005. Bias-corrected estimation in dynamic panel data models. Journal of Business & Economic Statistics 23: 200–10.
  13. Bun, Maurice J. G., and Martin A. Carree. 2006. Bias-corrected estimation in dynamic panel data models with heteroscedasticity. Economics Letters 92: 220–27.
  14. Chudik, Alexander, M. Hashem Pesaran, and Jui-Chung Yang. 2018. Half-panel jackknife fixed-effects estimation of linear panels with weakly exogenous regressors. Journal of Applied Econometrics 33: 816–36.
  15. De Vos, Ignace, and Gerdie Everaert. 2021. Bias-corrected common correlated effects pooled estimation in dynamic panels. Journal of Business & Economic Statistics 39: 294–306.
  16. Dhaene, Geert, and Koen Jochmans. 2015. Split-panel jackknife estimation of fixed-effect models. The Review of Economic Studies 82: 991–1030.
  17. Dhaene, Geert, and Koen Jochmans. 2016. Likelihood inference in an autoregression with fixed effects. Econometric Theory 32: 1178–215.
  18. Dovonon, Prosper, Alastair R. Hall, and Frank Kleibergen. 2020. Inference in second-order identified models. Journal of Econometrics 218: 346–72.
  19. Everaert, Gerdie, and Lorenzo Pozzi. 2007. Bootstrap-based bias correction for dynamic panels. Journal of Economic Dynamics and Control 31: 1160–84.
  20. Gospodinov, Nikolay, Ivana Komunjer, and Serena Ng. 2017. Simulated minimum distance estimation of dynamic models with errors-in-variables. Journal of Econometrics 200: 181–93.
  21. Gouriéroux, Christian, Peter C. B. Phillips, and Jun Yu. 2010. Indirect inference for dynamic panel models. Journal of Econometrics 157: 68–77.
  22. Hahn, Jinyong, and Guido Kuersteiner. 2002. Asymptotically unbiased inference for a dynamic panel model with fixed effects when both n and T are large. Econometrica 70: 1639–57.
  23. Han, Chirok, and Peter C. B. Phillips. 2010. GMM estimation for dynamic panels with fixed effects and strong instruments at unity. Econometric Theory 26: 119–51.
  24. Hansen, Christian B. 2007. Asymptotic properties of a robust variance matrix estimator for panel data when T is large. Journal of Econometrics 141: 597–620.
  25. Hayakawa, Kazuhiko. 2009. A simple efficient instrumental variable estimator for panel AR(p) models when both N and T are large. Econometric Theory 25: 873–90.
  26. Holtz-Eakin, Douglas, Whitney Newey, and Harvey Rosen. 1988. Estimating vector autoregressions with panel data. Econometrica 56: 1371–95.
  27. Hsiao, Cheng, M. Hashem Pesaran, and A. Kamil Tahmiscioglu. 2002. Maximum likelihood estimation of fixed effects dynamic panel data models covering short time periods. Journal of Econometrics 109: 107–50.
  28. Juodis, Artūras. 2013. A note on bias-corrected estimation in dynamic panel data models. Economics Letters 118: 435–38.
  29. Kapetanios, George. 2008. A bootstrap procedure for panel data sets with many cross-sectional units. The Econometrics Journal 11: 377–95.
  30. Kelejian, Harry H., and Ingmar R. Prucha. 2001. On the asymptotic distribution of the Moran I test statistic with applications. Journal of Econometrics 104: 219–57.
  31. Kelejian, Harry H., and Ingmar R. Prucha. 2010. Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. Journal of Econometrics 157: 53–67.
  32. Kiviet, Jan F. 1995. On bias, inconsistency, and efficiency of various estimators in dynamic panel data models. Journal of Econometrics 68: 53–78.
  33. Kruiniger, Hugo. 2018. A Further Look at Modified ML Estimation of the Panel AR(1) Model with Fixed Effects and Arbitrary Initial Conditions. MPRA Paper 88623. Munich: University Library of Munich.
  34. Lancaster, Tony. 2002. Orthogonal parameters and panel data. The Review of Economic Studies 69: 647–66.
  35. Linz, Peter. 1985. Analytical and Numerical Methods for Volterra Equations. Philadelphia: SIAM.
  36. Nickell, Stephen. 1981. Biases in dynamic models with fixed effects. Econometrica 49: 1417–26.
  37. Okui, Ryo. 2021. Linear dynamic panel data models. In Handbook of Research Methods and Applications in Empirical Microeconomics. Edited by Nigar Hashimzade and Michael A. Thornton. Cheltenham: Edward Elgar Publishing, pp. 2–22.
  38. Pesaran, M. Hashem, and Cynthia Fan Yang. 2021. Estimation and inference in spatial models with dominant units. Journal of Econometrics 221: 591–615.
  39. Phillips, Peter C. B., and Hyungsik R. Moon. 1999. Linear regression limit theory for nonstationary panel data. Econometrica 67: 1057–111.
  40. Ullah, Aman. 2004. Finite Sample Econometrics. New York: Oxford University Press.
  41. White, Halbert. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817–38.
Table 1. Simulation results: φ₀ = (0.3, 0.3, 0.2) under homoskedasticity.

(N, T)           (100,10)  (50,20)  (50,50)  (25,40)  (20,50)  (10,100)
Bias (×100)
  WG               −5.75    −2.26    −0.66    −0.91    −0.70    −0.33
  GMM              −1.59    −2.20    −0.66    −0.91    −0.70    −0.75
  BC                0.50     2.15     0.92     1.23     0.89     0.26
  BCJ               0.44     2.15     0.92     1.23     0.89     0.26
  HPJ               3.45     1.10     0.23     0.35     0.21     0.06
  RMM              −0.01    −0.02    −0.01    −0.03    −0.04    −0.04
  RMMr             −0.01    −0.02    −0.01    −0.03    −0.04    −0.04
RMSE (×100)
  WG                6.06     2.60     0.88     1.34     1.15     0.87
  GMM               2.48     2.55     0.88     1.34     1.15     1.52
  BC                2.33     2.71     1.11     1.66     1.33     0.86
  BCJ               2.31     2.70     1.11     1.66     1.33     0.86
  HPJ               4.65     2.23     0.75     1.32     1.17     0.91
  RMM               1.93     1.31     0.57     0.98     0.92     0.81
  RMMr              1.95     1.31     0.57     0.98     0.92     0.81
Size (5%)
  WG               88.54    44.09    20.95    15.80    12.31     6.98
  WG(h)            86.78    44.01    22.13    18.65    15.52    12.47
  GMM              13.32    42.10    20.93    15.72    12.21     8.64
  GMM(h)           14.37    43.12    22.56    19.39    16.32    16.76
  HPJ              21.36     8.96     4.51     4.68     4.22     3.87
  RMM(N)            5.70     6.45     6.13     7.49     8.17    10.52
  RMM(T)            6.45     6.25     5.50     5.70     5.56     5.34
  RMMr(N)           5.74     6.50     6.14     7.47     8.13    10.52
  RMMr(NT)          6.65     6.98     6.62     7.84     8.41    10.52
  RMMr(T)           6.01     5.98     5.23     5.56     5.53     4.40

Note: WG is the within-group estimator; GMM is the one-step estimator of Arellano and Bond (1991); BC is the bias-corrected estimator of Bun and Carree (2006); BCJ is the bias-corrected estimator of Juodis (2013); HPJ is the half-panel jackknife estimator of Chudik et al. (2018); RMM and RMMr are, respectively, the recentered method of moments estimator in this paper and its robust version. For the size performance, WG(h) and GMM(h) are based on standard errors clustered at the individual level, RMM(N) and RMMr(N) are based on the large-N standard errors, RMM(T) is based on the large-T standard error, RMMr(NT) is based on the standard error that is valid under large N and large T, and RMMr(T) is based on the N/(N − 1) t_{N−1} approximation. Reported are the bias and RMSE (both scaled up by 100), out of 10,000 simulations, of the estimated φ₀₁ + φ₀₂ + φ₀₃, and the empirical size of the 5% two-sided t-test of the hypothesis that the sum of the autoregressive parameters equals its true value. The DGP follows (36) and (37).
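The bias, RMSE, and empirical-size entries in Tables 1–4 can be reproduced from a vector of simulation estimates and their standard errors; a minimal sketch (the helper name and the 1.96 normal critical value are illustrative):

```python
import numpy as np

def table_entries(estimates, std_errors, true_value, crit=1.96):
    # Bias and RMSE (both scaled by 100) and the empirical size of the
    # nominal 5% two-sided t-test, mirroring the layout of the tables.
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = 100.0 * (est.mean() - true_value)
    rmse = 100.0 * np.sqrt(np.mean((est - true_value) ** 2))
    size = 100.0 * np.mean(np.abs((est - true_value) / se) > crit)
    return bias, rmse, size
```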
Table 2. Simulation results: φ₀ = (0.3, 0.3, 0.2) under cross-sectional heteroskedasticity.

(N, T)           (100,10)  (50,20)  (50,50)  (25,40)  (20,50)  (10,100)
Bias (×100)
  WG              −39.47   −14.10    −4.51    −4.18    −2.78    −0.84
  GMM             −21.83   −14.08    −4.51    −4.18    −2.78    −1.89
  BC              −10.21    −3.01    −1.13    −0.69    −0.35    −0.03
  BCJ             −10.24    −3.01    −1.13    −0.69    −0.35    −0.03
  HPJ               0.69     3.07     1.12     1.26     0.82     0.15
  RMM               1.18    −0.10    −0.10    −0.13    −0.12    −0.09
  RMMr              1.00    −0.08    −0.10    −0.13    −0.12    −0.09
RMSE (×100)
  WG               39.87    14.57     4.82     4.74     3.41     1.57
  GMM              23.05    14.57     4.82     4.74     3.41     2.89
  BC               12.56     5.09     2.03     2.35     2.00     1.33
  BCJ              12.59     5.09     2.03     2.35     2.00     1.33
  HPJ               9.66     6.70     2.59     3.40     2.77     1.54
  RMM              11.14     4.57     1.77     2.32     2.01     1.34
  RMMr             11.08     4.61     1.77     2.32     2.01     1.34
Size (5%)
  WG              100.00    99.25    84.05    52.52    33.41     9.73
  WG(h)           100.00    97.68    78.30    50.04    34.99    15.40
  GMM              89.51    99.14    84.02    52.28    33.22    14.19
  GMM(h)           85.70    97.55    78.96    51.70    36.68    23.15
  HPJ               8.00    13.17    10.04     8.42     7.02     4.20
  RMM(N)            6.47     5.78     6.84     7.65     8.48    10.35
  RMM(T)           42.66    16.90    10.09     8.36     7.60     6.09
  RMMr(N)           6.37     5.64     6.83     7.67     8.42    10.38
  RMMr(NT)         24.67    11.40     8.46     8.35     9.07    10.22
  RMMr(T)          23.20     9.63     7.09     6.17     5.92     4.27

Note: See Table 1.
Table 3. Simulation results: φ₀ = (0.3, 0.3, 0.2) under temporal heteroskedasticity.

(N, T)           (100,10)  (50,20)  (50,50)  (25,40)  (20,50)  (10,100)
Bias (×100)
  WG              −34.39   −19.57    −5.78    −8.18    −6.01    −2.37
  GMM             −22.39   −20.05    −5.78    −8.18    −6.01    −6.23
  BC               −3.76    −5.74    −2.06    −3.18    −2.27    −0.64
  BCJ              −4.46    −5.71    −2.06    −3.17    −2.27    −0.64
  HPJ               9.37    −3.21     0.74     0.17     0.63     0.43
  RMM              17.84    −2.01    −0.62    −1.16    −0.82    −0.27
  RMMr              0.22     0.22    −0.11    −0.24    −0.30    −0.20
RMSE (×100)
  WG               34.84    19.92     6.04     8.70     6.63     3.31
  GMM              23.86    20.41     6.04     8.70     6.63     7.41
  BC                7.45     7.56     2.73     4.49     3.63     2.38
  BCJ               7.68     7.55     2.73     4.49     3.63     2.38
  HPJ              13.59     6.98     2.59     4.37     3.98     2.86
  RMM              18.62     5.27     1.94     3.41     3.03     2.36
  RMMr              6.84     5.39     1.91     3.44     3.05     2.36
Size (5%)
  WG              100.00   100.00    94.76    85.08    64.10    18.74
  WG(h)           100.00    99.97    92.51    81.63    62.22    23.52
  GMM              84.78   100.00    94.76    84.94    63.95    45.46
  GMM(h)           81.94    99.96    92.85    82.73    64.03    45.42
  HPJ              24.21    13.71     9.41     8.92     8.36     6.40
  RMM(N)            0.32    10.69     8.00     9.49     8.96    11.01
  RMM(T)           92.02    17.16     9.67    10.68     8.74     7.00
  RMMr(N)           5.57     4.94     6.36     6.92     7.71    10.93
  RMMr(NT)          9.26    12.81     9.64    11.75    10.44    11.41
  RMMr(T)           8.61    11.24     8.18     8.15     7.07     4.90

Note: See Table 1.
Table 4. Simulation results: φ₀ = (0.3, 0.3, 0.2) under double heteroskedasticity.

(N, T)           (100,10)  (50,20)  (50,50)  (25,40)  (20,50)  (10,100)
Bias (×100)
  WG              −37.33   −16.83    −4.58    −6.95    −5.18    −2.29
  GMM             −17.49   −16.50    −4.58    −6.95    −5.18    −5.89
  BC               −9.98    −4.70    −1.22    −2.31    −1.66    −0.63
  BCJ              −9.81    −4.66    −1.22    −2.30    −1.66    −0.63
  HPJ              −0.37     1.46     1.27     1.47     1.21     0.42
  RMM              −6.25    −2.71    −0.31    −0.88    −0.58    −0.30
  RMMr              0.91     0.13    −0.08    −0.27    −0.24    −0.25
RMSE (×100)
  WG               37.70    17.24     4.85     7.52     5.83     3.18
  GMM              18.59    16.92     4.85     7.52     5.83     7.04
  BC               11.84     6.48     2.02     3.73     3.15     2.28
  BCJ              11.72     6.47     2.01     3.73     3.15     2.28
  HPJ               8.91     6.35     2.57     4.46     3.92     2.75
  RMM              10.11     5.23     1.68     3.13     2.82     2.26
  RMMr             10.01     5.60     1.68     3.16     2.84     2.26
Size (5%)
  WG              100.00    99.94    87.27    77.04    56.65    18.48
  WG(h)           100.00    99.71    83.81    72.77    55.02    24.20
  GMM              83.66    99.91    87.22    76.84    56.45    42.85
  GMM(h)           81.15    99.61    84.34    73.94    56.53    44.42
  HPJ               9.01    12.02    12.82    11.46    10.67     6.20
  RMM(N)           19.97    11.99     6.61     8.48     8.68    10.59
  RMM(T)           45.32    22.73     8.55    10.64     9.04     6.57
  RMMr(N)           6.18     4.62     6.28     7.32     8.36    10.53
  RMMr(NT)         23.10    15.80     8.83    10.76    10.67    10.88
  RMMr(T)          21.93    14.08     7.32     7.95     7.13     4.70

Note: See Table 1.