Article

A New Matrix Statistic for the Hausman Endogeneity Test under Heteroskedasticity

by
Alecos Papadopoulos
Department of Economics, Athens University of Economics and Business, TK 10434 Athens, Greece
Econometrics 2023, 11(4), 23; https://doi.org/10.3390/econometrics11040023
Submission received: 10 August 2023 / Revised: 28 September 2023 / Accepted: 2 October 2023 / Published: 10 October 2023

Abstract

We derive a new matrix statistic for the Hausman test for endogeneity in cross-sectional Instrumental Variables estimation that incorporates heteroskedasticity in a natural way and does not use a generalized inverse. A Monte Carlo study examines the performance of the statistic for different heteroskedasticity-robust variance estimators and different skedastic situations. We find that the test statistic performs well as regards empirical size in almost all cases; however, as regards empirical power, how one corrects for heteroskedasticity matters. We also compare its performance with that of the Wald statistic from the augmented regression setup that is often used for the endogeneity test, and we find that the choice between them may depend on the desired significance level of the test.

1. Introduction

The Hausman family of specification tests was introduced by Hausman (1978) and it has seen unabated use in econometrics ever since. Amini et al. (2012) detail its wide reach and different implementations for panel data, while in a cross-sectional setting, the test has often been used to test for regressor endogeneity.
In the cross-sectional setting, the test statistic is formally based on a “vector of contrasts”, the difference of two estimators, where, under the null hypothesis, both estimators are consistent and one is also efficient, while, under the alternative, only one is consistent. This form of the test uses a variance expression that is often singular, requiring a generalized inverse. To bypass the singularity of the variance matrix, an “augmented regression” approach has been developed, linking the test with its precursors (Durbin 1954; Wu 1973, 1974).^1
The efficiency of one of the estimators under the null hypothesis has a very convenient consequence: the variance of the difference of the two estimators equals the difference of their variances; thus, we do not have to compute covariances. However, when heteroskedasticity is present (and it should be expected regularly in cross-sectional studies), this helpful simplification is no longer valid. Adkins et al. (2012) have examined this endogeneity test in great detail under heteroskedasticity, and they take the augmented regression route to formulate the various test variants that they implement.
In this study, we push a known result in the literature to its conclusion and arrive at a new matrix Hausman statistic for an endogeneity test. This new statistic is a useful additional tool for the following reasons: it handles heteroskedasticity in a natural way; it may be more familiar to researchers who are accustomed to working with matrix algebra and matrix forms; and, compared to the original form of the Hausman statistic, it does not use generalized inverses. In fact, if the matrix involved is not invertible, this reflects the existence of perfect collinearity between some instruments and some endogenous regressors, which invalidates the instruments. Finally, in the Monte Carlo simulations that we present, it performed better than the “augmented regression” test in terms of power when the test is executed at the 10% significance level.

2. The Matrix Hausman Statistic for Testing Endogeneity

We follow the notation of Adkins et al. (2012). We consider the linear regression model $y = X\beta + u$. The vectors $y$ and $u$ are $n\times 1$, and $u$ is assumed to be zero-mean. The regressor matrix is partitioned as $X = [X_1\ X_2]$: $X_1$ is an $n\times K_1$ submatrix of regressors thought to be endogenous (so, not orthogonal to the error term), while $X_2$ is an $n\times K_2$ submatrix of exogenous regressors (or “internal instruments”). The unknown of interest is the vector $\beta$. We have available $\Lambda_1 \geq K_1$ “external instruments” collected in the matrix $Z_1$, and the full instrument matrix is $Z = [Z_1\ X_2]$. We write the orthogonal projection matrix $P_z = Z(Z'Z)^{-1}Z'$ and the residual-maker (or annihilator) matrix $M_x = I_n - P_x$, with $I_n$ being the $n\times n$ identity matrix; the subscript on $P$ and $M$ indicates which collection of variables we use in each case. These matrices are symmetric and idempotent. We write $\hat u = M_x y$ for the residuals from the Ordinary Least Squares (OLS) regression, and $P_z X \equiv \hat X = [\hat X_1\ X_2]$ for the linear projection of $X$ on the columns of $Z$ (the “fitted values”). Note that $P_z X_2 = X_2$, because $X_2$ belongs to the column space of $Z$.
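To fix ideas, here is a minimal numpy sketch (with simulated placeholder data; names and dimensions are ours, purely illustrative) that constructs these projection matrices and verifies the properties just stated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K1, K2, Lam1 = 200, 2, 2, 3         # sample size and block dimensions (illustrative)
X1 = rng.normal(size=(n, K1))          # "suspected endogenous" regressors
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, K2 - 1))])  # exogenous, incl. constant
Z1 = rng.normal(size=(n, Lam1))        # external instruments
X = np.column_stack([X1, X2])          # X = [X1 X2]
Z = np.column_stack([Z1, X2])          # Z = [Z1 X2]

def proj(A):
    """Orthogonal projection matrix onto the column space of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

Pz = proj(Z)
Mx = np.eye(n) - proj(X)

assert np.allclose(Pz, Pz.T) and np.allclose(Pz @ Pz, Pz)   # symmetric, idempotent
assert np.allclose(Mx, Mx.T) and np.allclose(Mx @ Mx, Mx)
assert np.allclose(Pz @ X2, X2)        # X2 lies in the column space of Z
```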
The OLS estimator of $\beta$ is $\hat\beta_{OLS} = (X'X)^{-1}X'y$, while the benchmark Instrumental Variables (IV) estimator when the instruments outnumber the endogenous regressors is $\hat\beta_{IV} = (\hat X'\hat X)^{-1}\hat X'y$ (two-stage least squares). The basic expression of the Hausman statistic under homoskedasticity of the error term (with the OLS estimate $\hat\sigma_u^2$ of the error variance) is

$$(\hat\sigma_u^2)^{-1}\,\big(\hat\beta_{IV}-\hat\beta_{OLS}\big)'\Big[\big(\hat X'\hat X\big)^{-1}-\big(X'X\big)^{-1}\Big]^{-1}\big(\hat\beta_{IV}-\hat\beta_{OLS}\big). \tag{1}$$
This is the statistic in which we may encounter trouble inverting the middle matrix, which, moreover, may not even be positive definite in finite samples. This may render the test inapplicable or necessitate the use of a generalized inverse instead.
To bypass this issue, while simultaneously accounting for heteroskedasticity, we start by noting that the core of the statistic for the Hausman test is the difference
$$\begin{aligned}
\hat\beta_{IV}-\hat\beta_{OLS} &= \big(\hat X'\hat X\big)^{-1}\hat X'y - \big(X'X\big)^{-1}X'y = \big(X'P_zX\big)^{-1}X'P_zy - \big(X'X\big)^{-1}X'y\\
&= \big(X'P_zX\big)^{-1}\Big[X'P_zy - X'P_zX\big(X'X\big)^{-1}X'y\Big]\\
&= \big(X'P_zX\big)^{-1}X'P_z\Big[I_n - X\big(X'X\big)^{-1}X'\Big]y\\
&= \big(X'P_zX\big)^{-1}X'P_zM_xy = \big(\hat X'\hat X\big)^{-1}\hat X'\hat u.
\end{aligned}\tag{2}$$
We have used the fact that $P_z$ and $M_x$ are symmetric and idempotent, and that $M_xy = M_xu = \hat u$. Result (2) is known in the literature: for example, Greene (2012, p. 276) arrives at it but does not go further, while Adkins et al. (2012) actually start with it (their Equation (1)) but then proceed via the augmented regression approach. Later in their paper, when they re-purpose the “weak vs. strong instruments” test of Hahn et al. (2011), they “directly estimate the asymptotic covariance matrix of the contrast”, but the expression they give is inconveniently long, since this variance can be nicely compacted, as we will show. To our knowledge, the result in Equation (2) has not been pursued to the very end for the construction of a Hausman statistic and test, and this is what we do here.
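Continuing the sketch above (same illustrative objects), the end result of this chain can be confirmed numerically; note that (2) is an exact algebraic identity, holding for any realization of $y$:

```python
# Verify Equation (2): beta_IV - beta_OLS = (Xhat'Xhat)^{-1} Xhat' u_hat
beta_true = rng.normal(size=K1 + K2)
y = X @ beta_true + rng.normal(size=n)            # any y works: (2) is an identity

Xhat = Pz @ X                                     # projected regressors [Xhat1 X2]
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)
u_hat = y - X @ b_ols                             # OLS residuals, u_hat = Mx y

rhs = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ u_hat)
assert np.allclose(b_iv - b_ols, rhs)             # Equation (2) holds exactly
```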
The null hypothesis of the Hausman test is that the two estimators converge to the same probability limit (plim):
$$H_0:\ \operatorname{plim}\big(\hat\beta_{IV}-\hat\beta_{OLS}\big) = 0.$$
To examine this hypothesis we consider the limiting distribution of the scaled difference and its variance, which, under the null hypothesis and given Equation (2), is
$$\operatorname{Avar}\Big[\sqrt n\,\big(\hat\beta_{IV}-\hat\beta_{OLS}\big)\Big] \equiv V = \operatorname{plim}\big(n^{-1}\hat X'\hat X\big)^{-1}\,S\,\operatorname{plim}\big(n^{-1}\hat X'\hat X\big)^{-1}. \tag{3}$$
The middle matrix is $S = \operatorname{plim}\big(n^{-1}\hat X'\hat u\hat u'\hat X\big)$. We can then formulate a theoretical statistic for the endogeneity test,

$$q = n\,\big(\hat\beta_{IV}-\hat\beta_{OLS}\big)'\,V^-\big(\hat\beta_{IV}-\hat\beta_{OLS}\big)\ \overset{d}{\underset{H_0}{\longrightarrow}}\ \chi^2_{K_1}. \tag{4}$$
Here, $V^-$ denotes a generalized inverse of $V$.^2 Combining Equations (2) and (3) with (4), we arrive at the following statistic, feasible to compute for some consistent estimator $\hat S$:
$$\hat q = n^{-1}\,\hat u'\hat X\big(\hat X'\hat X\big)^{-1}\Big[\big(\hat X'\hat X\big)^{-1}\hat S\big(\hat X'\hat X\big)^{-1}\Big]^-\big(\hat X'\hat X\big)^{-1}\hat X'\hat u. \tag{5}$$
We show in Appendix A that a generalized inverse of the middle matrix is $(\hat X'\hat X)\hat S^+(\hat X'\hat X)$, where $\hat S^+$ denotes the Moore–Penrose generalized inverse. Inserting this into the expression for $\hat q$, we can simplify to

$$\hat q = n^{-1}\,\hat u'\hat X\,\hat S^+\hat X'\hat u. \tag{6}$$
Next, because X ^ includes the submatrix X 2 , which is by construction orthogonal to the OLS residuals u ^ , we obtain that
$$S = \begin{bmatrix} Q_{K_1\times K_1} & 0_{K_1\times K_2}\\ 0_{K_2\times K_1} & 0_{K_2\times K_2}\end{bmatrix},\qquad Q = \operatorname{plim}\big(n^{-1}\hat X_1'\hat u\hat u'\hat X_1\big). \tag{7}$$
We show in Appendix B that
$$S^+ = \begin{bmatrix} Q^{-1}_{K_1\times K_1} & 0_{K_1\times K_2}\\ 0_{K_2\times K_1} & 0_{K_2\times K_2}\end{bmatrix}. \tag{8}$$
We have managed to eliminate the generalized inverse and to use a proper inverse.^3 What remains is to find a consistent estimator of the matrix $Q$. Decomposing the OLS residuals, we have
$$\hat X_1'\hat u\hat u'\hat X_1 = \hat X_1'M_xuu'M_x\hat X_1.$$
This matrix expression, in which the outer product of the error vector is sandwiched, may look familiar to those acquainted with the heteroskedasticity-robust estimation literature, and one could expect that we can now use the squared residuals in place of $uu'$ to estimate $Q$. However, there is an issue: the matrix $M_x$ is $n\times n$, growing in both dimensions as the sample size increases, so it is not clear that the related proof strategy of White (1980) is applicable here. Nevertheless, by drilling down further, we arrive, in Appendix C, at an expression that contains only matrix products of finite dimensions. Thus, we can indeed apply this substitution, which provides a consistent estimator of $Q$ as
$$\hat Q = n^{-1}\,\hat X_1'M_x\hat\Omega_0M_x\hat X_1,\qquad \hat\Omega_0 = \operatorname{diag}\{\hat u_i^2\}. \tag{9}$$
This indeed looks like a “White” estimator of a heteroskedastic covariance matrix. The expression is valid under the formal assumptions stated in White (1980), which we do not repeat here for brevity.
Equation (9) also allows us to conclude that the matrix $Q$ must be invertible; otherwise, at least one component of the instrument matrix is not valid.
This can be shown in the following way. Let $x_{1j}$, $j = 1,\dots,K_1$, be a column of $X_1$, the submatrix of endogenous regressors. If $P_zx_{1j} = \hat x_{1j} = x_{1j}$, we will have $M_x\hat x_{1j} = 0$, so $M_x\hat X_1$ will have a column of zeros and $Q$ will be singular. However, $P_zx_{1j} = x_{1j}$ implies that $x_{1j}$ belongs to the column space of the instrument matrix $Z$, meaning that it is an exact linear combination of the columns of $Z$. If this were the case, at least one of the instruments would necessarily be correlated with the error term, and so $Z$ too would suffer from endogeneity: if $x_{1j} = \sum_{\ell=1}^{\Lambda_1+K_2}a_\ell z_\ell$ while $E(x_{1j}u)\neq 0$, we will have $E\big(u\sum_{\ell=1}^{\Lambda_1+K_2}a_\ell z_\ell\big)\neq 0$.
Therefore, using a proper inverse here also serves to alert us, should the matrix $Q$ prove non-invertible, that such an exact linear dependence exists between instruments and endogenous regressors. In such a case, executing the endogeneity test using a generalized inverse would be wrong; one first has to somehow correct the instrument matrix to restore the validity of the instruments.^4
Lastly, using these results in the expression for $\hat q$, and again the fact that $\hat u'\hat X = [\hat u'\hat X_1 : 0]$, we arrive at the final expression for the heteroskedasticity-robust matrix Hausman statistic (where we have also canceled out the $n^{-1}$ factors),

$$\hat q_{het} = \hat u'\hat X_1\Big[\hat X_1'M_x\hat\Omega_0M_x\hat X_1\Big]^{-1}\hat X_1'\hat u\ \overset{d}{\underset{H_0}{\longrightarrow}}\ \chi^2_{K_1},\qquad \hat\Omega_0 = \operatorname{diag}\{\hat u_i^2\}. \tag{10}$$
Computing this statistic first requires running the OLS estimation of the original model to obtain the residuals $\hat u$ and their squares for $\hat\Omega_0$, and then using the matrices $P_z$ and $X_1$ (since $\hat X_1 = P_zX_1$), as well as the matrix $M_x$. The matrices $P_z$ and $M_x$ are of dimension $n\times n$; thus, for very large samples, they may be taxing for the software (although they will be used just once in an actual applied study; it is simulation studies that may be considerably slowed down by them). If one wishes to avoid them: to obtain $\hat X_1$, we can run OLS regressions of the columns of $X_1$ on $Z$; and to compute $M_x\hat X_1$, we can run regressions of $\hat X_1$ on $X$ and take the resulting residual series.
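As a sketch of this regression-based route (continuing the illustrative objects above; the helper `fitted` is ours), the statistic of Equation (10) can be assembled without ever forming the $n\times n$ matrices:

```python
from scipy import stats

def fitted(A, B):
    """Fitted values from an OLS regression of (each column of) B on A, i.e., P_A B."""
    coef, *_ = np.linalg.lstsq(A, B, rcond=None)
    return A @ coef

Xhat1 = fitted(Z, X1)                  # Xhat1 = Pz X1: regress the columns of X1 on Z
MxXhat1 = Xhat1 - fitted(X, Xhat1)     # Mx Xhat1: residuals of Xhat1 regressed on X
u_hat = y - fitted(X, y)               # OLS residuals of the original model

v = Xhat1.T @ u_hat                                   # Xhat1' u_hat (a K1-vector)
middle = MxXhat1.T @ (MxXhat1 * (u_hat**2)[:, None])  # Xhat1' Mx Omega0 Mx Xhat1
q_het = v @ np.linalg.solve(middle, v)                # Equation (10)
p_value = stats.chi2.sf(q_het, df=K1)                 # chi-square with K1 d.f. under H0
```

A singular `middle` matrix here would signal the exact linear dependence between instruments and endogenous regressors discussed above.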
The statistic can also be used under the assumption of homoskedasticity, in which case it becomes

$$\hat q_{hom} = (\hat\sigma_u^2)^{-1}\,\hat u'\hat X_1\Big[\hat X_1'M_x\hat X_1\Big]^{-1}\hat X_1'\hat u\ \overset{d}{\underset{H_0}{\longrightarrow}}\ \chi^2_{K_1}. \tag{11}$$
To increase power, one should use the error variance estimator from the OLS regression.
Equations (10) and (11) are the main theoretical contribution of this study. We have exploited the expression for the vector of contrasts in terms of projected regressors and OLS residuals, and we have arrived at a matrix Hausman statistic that incorporates possible heteroskedasticity in a natural way, has a compact form, does not use generalized inverses, and guards against invalid instruments.
In the next section, we present results from a simulation study that examines the performance of this matrix Hausman statistic, looking also into the variants that have been proposed for $\hat\Omega$ in an attempt to improve finite-sample performance. A recent overview and Monte Carlo study of these “HCx” estimators for heteroskedastic variance matrices can be found in MacKinnon (2013).

3. Monte Carlo Study

3.1. Description

We constructed a data generation process (DGP) with a constant term, one exogenous variable, two “suspected” endogenous variables and three external valid instruments. We considered a case where the DGP includes an unobservable covariate uncorrelated with the regressors (so here, OLS is consistent even if this variable is not included in the regressor matrix), and one where it is correlated (so there is endogeneity and OLS is inconsistent). The first case serves to examine the empirical size of the test, while the second provides information about the power of the test. The technical details of the Monte Carlo study are presented in Appendix D.
We created four scenarios as regards heteroskedasticity of the error term: homoskedasticity; heteroskedasticity with the error variance randomly changing per observation, independently of the regressors; “group-wise” heteroskedasticity, where the error variance takes only three distinct values with equal probability, again independently of the regressors; and, finally, a “random-coefficients” model, which leads to the error variance being a function of the regressors without this affecting mean-independence. We considered sample sizes $n = 50, 75, 100, 200$, and in each case we executed 10,000 repetitions. In all cases, we initialized the random number generator with the same seed. This has two consequences: first, for a given sample size, all scenarios have identical series for the observable variables, and they differ only with respect to the endogeneity/heteroskedasticity aspect. Second, for each scenario, as we increase the sample size, the previously generated values are fully part of the larger sample. In this way, we mimic the accumulation of data rather than the availability of independent larger data sets.
As regards the statistic, we used both its homoskedastic variant (i.e., assuming, correctly or not, that the true error is homoskedastic), as well as the four best-known alternatives for the estimation of heteroskedastic variance matrices, HC$x$, $x = 0, 1, 2, 3$, as these are defined in MacKinnon (2013). Writing $h_{ii}$ for the $i$-th diagonal element of the projection matrix $P_x$, we have

$$\text{HC0}:\ \hat\Omega_0=\operatorname{diag}\{\hat u_i^2\},\qquad \text{HC1}:\ \hat\Omega_1=\frac{n}{n-k}\operatorname{diag}\{\hat u_i^2\},\qquad \text{HC2}:\ \hat\Omega_2=\operatorname{diag}\big\{\hat u_i^2/(1-h_{ii})\big\},\qquad \text{HC3}:\ \hat\Omega_3=\operatorname{diag}\big\{\hat u_i^2/(1-h_{ii})^2\big\}.$$

Note that $k$ is the number of regressors in each case. For our matrix statistic, the number of regressors is $k = K_1 + K_2$, where $K_1 = K_2 = 2$.
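For reference, here is a small helper of our own making that produces the diagonal weights of each $\hat\Omega_x$; the $h_{ii}$ are computed from a thin QR factorization rather than by forming the $n\times n$ projection matrix:

```python
def hc_weights(X, u_hat, variant="HC0"):
    """Diagonal entries of the HCx estimator Omega_hat (MacKinnon 2013 definitions)."""
    n, k = X.shape
    Q, _ = np.linalg.qr(X)             # thin QR: diag(P_x) = row sums of Q**2
    h = np.sum(Q**2, axis=1)
    u2 = u_hat**2
    if variant == "HC0":
        return u2
    if variant == "HC1":
        return u2 * n / (n - k)
    if variant == "HC2":
        return u2 / (1.0 - h)
    if variant == "HC3":
        return u2 / (1.0 - h)**2
    raise ValueError(f"unknown variant: {variant}")
```

Using, say, `hc_weights(X, u_hat, "HC2")` in place of `u_hat**2` when forming the middle matrix of Equation (10) yields the corresponding variant of the statistic.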

3.2. Comparative Performance of the Variants of the Matrix Hausman Statistic

Here, we assess how the statistic performs in terms of empirical size and power as we change the heteroskedasticity correction. We do not compare it with other forms of the Hausman test, because we first want to determine whether its performance is acceptable (empirical size close to nominal, power rising fast with the sample size). If it performs acceptably, then a case arises for comparing it with other forms of the Hausman test. We present the results in Table 1, which relates to testing at the 5% significance level.
We have the following main observations. First, the behavior of the matrix Hausman statistic as regards empirical size is rather stable, across different true skedastic scenarios as well as across the different HCx ways of incorporating the possible heteroskedasticity. In fact, it performs acceptably in relation to the size of the test even if we ignore the possible presence of heteroskedasticity and use (11) instead of (10). Second, across the various HCx variants, the empirical size falls monotonically as we increase the strength of the finite-sample correction that we apply. Results for testing at the 10% significance level (available upon request) show similar behavior in relation to empirical size.
As regards empirical power, the choice of the heteroskedastic variant for $\hat\Omega$ matters even more for small sample sizes. Power also deteriorates monotonically and visibly as the finite-sample correction strengthens, and the highest power is achieved when we use the homoskedastic variant (where the test is slightly conservative).^5 Overall, for testing at the 5% significance level, the prudent thing to do when applying this statistic appears to be to use its HC0 heteroskedastic formula. When testing at the 10% significance level, power increases visibly: for example, under conditional heteroskedasticity, the power at 10% for sample sizes $n = 50, 75$ tends to be higher by a factor of 1.2 to almost 1.9, i.e., up to almost double the power at the 5% significance level, all else being equal. For the 10% significance level, therefore, it appears best to use the homoskedastic variant of the matrix statistic, Equation (11).
The finding that ignoring heteroskedasticity while it exists may lead to better-performing tests should not be surprising for small samples: to account for heteroskedasticity, we use additional estimated quantities (the individual OLS residuals), and this should be expected to affect statistical power negatively in the context of a small sample.

3.3. Comparison with the Wald Statistic from the Augmented Regression Approach

Using the exact same simulated data sets, we have also computed the Wald statistic from the augmented regression setup for testing endogeneity. Here, we first regress the suspected endogenous regressors $X_1$ on the full instrument matrix $Z$, obtain the residuals $M_zX_1$, and include these residuals in an augmented regressor matrix $X_A = [X_1 : X_2 : M_zX_1]$.^6 We then run an OLS regression of the dependent variable on $X_A$ and compute a Wald test for the coefficients of the regressors in the submatrix $M_zX_1$.^7
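A sketch of this procedure under the same illustrative setup (the robust covariance shown is the standard sandwich form, and the helper names are ours):

```python
def wald_endogeneity(y, X1, X2, Z, variant="HC3"):
    """Wald test for the coefficients on Mz X1 in the augmented regression."""
    MzX1 = X1 - fitted(Z, X1)                 # residuals from regressing X1 on Z
    XA = np.column_stack([X1, X2, MzX1])      # augmented regressor matrix
    b = np.linalg.lstsq(XA, y, rcond=None)[0]
    e = y - XA @ b                            # augmented-regression residuals
    w = hc_weights(XA, e, variant)            # note: k and h_ii now refer to XA
    XtX_inv = np.linalg.inv(XA.T @ XA)
    V = XtX_inv @ (XA.T @ (XA * w[:, None])) @ XtX_inv   # robust covariance of b
    sel = slice(XA.shape[1] - X1.shape[1], XA.shape[1])  # coefficients on Mz X1
    stat = b[sel] @ np.linalg.solve(V[sel, sel], b[sel])
    return stat, stats.chi2.sf(stat, df=X1.shape[1])
```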
In the interest of space, we do not report the full results here. The Wald statistic clearly has a size problem in these small samples: it tends to over-reject the correct null hypothesis, sometimes with an empirical size nearly double the nominal one (for both the 5% and 10% nominal significance levels). Only with the HC3 heteroskedasticity correction does the over-rejection fall below one percentage point across skedastic scenarios.^8 In Table 2, we report the performance of this statistic for testing at the 5% significance level, and we repeat the performance metrics of our matrix statistic with the HC0 formula from Table 1.
The Wald statistic appears to have an advantage as regards power, even though some of this advantage would be lost in correcting the slightly oversized test. However, the picture changes if we want to test at the 10% significance level: here it is our matrix statistic (which, moreover, assumes homoskedasticity) that has the advantage in terms of power, as shown in Table 3.
Overall, neither statistic dominates the other, and for sample sizes larger than $n = 200$ the two are essentially equivalent in terms of size and power. For smaller samples, the desired significance level of the test can guide the choice between them.

Funding

This research received no external funding.

Data Availability Statement

No data were used for this study.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

We argue that a generalized inverse of $(\hat X'\hat X)^{-1}\hat S(\hat X'\hat X)^{-1}$ is $(\hat X'\hat X)\hat S^+(\hat X'\hat X)$.
A generalized inverse $A^-$ of a matrix $A$ satisfies $AA^-A = A$. Setting for compactness $(\hat X'\hat X)^{-1} \equiv C^{-1}$ and $A = C^{-1}\hat SC^{-1}$, our candidate generalized inverse is $A^- = C\hat S^+C$. We have

$$AA^-A = \big[C^{-1}\hat SC^{-1}\big]\big[C\hat S^+C\big]\big[C^{-1}\hat SC^{-1}\big] = C^{-1}\hat S\hat S^+\hat SC^{-1} = C^{-1}\hat SC^{-1} = A,$$

which is what we wanted to show. Here, $\hat S\hat S^+\hat S = \hat S$ holds because the Moore–Penrose inverse satisfies this condition, among others.
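A quick numeric check of this argument (illustrative dimensions; $\hat S$ is made singular on purpose, so that a proper inverse does not exist):

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(4, 4))
C = C @ C.T + 4 * np.eye(4)            # an invertible, X'X-like matrix
B = rng.normal(size=(4, 2))
S = B @ B.T                            # rank 2, hence a singular 4 x 4 matrix
Cinv = np.linalg.inv(C)

A = Cinv @ S @ Cinv
A_g = C @ np.linalg.pinv(S) @ C        # the candidate generalized inverse
assert np.allclose(A @ A_g @ A, A)     # A A^- A = A, as claimed
```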

Appendix B

We argue that
$$S = \begin{bmatrix} Q_{K_1\times K_1} & 0_{K_1\times K_2}\\ 0_{K_2\times K_1} & 0_{K_2\times K_2}\end{bmatrix}\ \Rightarrow\ S^+ = \begin{bmatrix} Q^{-1}_{K_1\times K_1} & 0_{K_1\times K_2}\\ 0_{K_2\times K_1} & 0_{K_2\times K_2}\end{bmatrix}.$$

In order for a matrix $A^+$ to be the unique Moore–Penrose pseudo-inverse of a matrix $A$, it must satisfy four conditions:
  • $AA^+A = A$
  • $A^+AA^+ = A^+$
  • $(AA^+)' = AA^+$
  • $(A^+A)' = A^+A$.
Note that $Q$ is symmetric. For condition 1, we have

$$\begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix} = \begin{bmatrix} I_{K_1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix} = \begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix}.$$

For condition 2, we have

$$\begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix} = \begin{bmatrix} I_{K_1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix} = \begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix}.$$

For condition 3, we have

$$\left(\begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix}\right)' = \begin{bmatrix} I_{K_1} & 0\\ 0 & 0\end{bmatrix} = \begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix},$$

and for condition 4, analogously,

$$\left(\begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix}\right)' = \begin{bmatrix} I_{K_1} & 0\\ 0 & 0\end{bmatrix} = \begin{bmatrix} Q^{-1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix} Q & 0\\ 0 & 0\end{bmatrix}.$$
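Again, a short numeric confirmation (illustrative sizes), comparing the claimed block form against numpy's Moore–Penrose routine:

```python
import numpy as np

rng = np.random.default_rng(2)
K1, K2 = 2, 2
Q = rng.normal(size=(K1, K1))
Q = Q @ Q.T + np.eye(K1)                                   # symmetric positive definite
S = np.block([[Q, np.zeros((K1, K2))],
              [np.zeros((K2, K1)), np.zeros((K2, K2))]])   # [[Q, 0], [0, 0]]
S_plus = np.block([[np.linalg.inv(Q), np.zeros((K1, K2))],
                   [np.zeros((K2, K1)), np.zeros((K2, K2))]])
assert np.allclose(np.linalg.pinv(S), S_plus)              # matches the claimed form
```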

Appendix C

We want to find a consistent estimator for

$$Q = \operatorname{plim}\big(n^{-1}\hat X_1'\hat u\hat u'\hat X_1\big).$$

We have

$$\hat X_1'\hat u = X_1'P_z\hat u = X_1'Z(Z'Z)^{-1}Z'\hat u = X_1'Z(Z'Z)^{-1}\begin{bmatrix} Z_1'\\ X_2'\end{bmatrix}\hat u.$$

However, $X_2'\hat u = 0$; thus, carrying out the multiplications and including the sample size as a scaling factor, we arrive at

$$Q = \operatorname{plim}\ \big(n^{-1}X_1'Z\big)\big(n^{-1}Z'Z\big)^{-1}\begin{bmatrix} n^{-1}Z_1'\hat u\hat u'Z_1 & 0\\ 0 & 0\end{bmatrix}\big(n^{-1}Z'Z\big)^{-1}\big(n^{-1}Z'X_1\big).$$

The standard regularity conditions are assumed to hold, so the matrices that sandwich the middle one are well defined and converge to finite probability limits. Focusing on the middle one, we have

$$\hat u\hat u' = M_xuu'M_x = (I_n - P_x)uu'(I_n - P_x) = uu' - uu'P_x - P_xuu' + P_xuu'P_x.$$

Further, $Z_1'P_x = Z_1'X(X'X)^{-1}X'$ and $P_xZ_1 = X(X'X)^{-1}X'Z_1$; thus, adding scaling factors again, we have

$$n^{-1}Z_1'\hat u\hat u'Z_1 = n^{-1}Z_1'uu'Z_1 - \big(n^{-1}Z_1'uu'X\big)\big(n^{-1}X'X\big)^{-1}\big(n^{-1}X'Z_1\big) - \big(n^{-1}Z_1'X\big)\big(n^{-1}X'X\big)^{-1}\big(n^{-1}X'uu'Z_1\big) + \big(n^{-1}Z_1'X\big)\big(n^{-1}X'X\big)^{-1}\big(n^{-1}X'uu'X\big)\big(n^{-1}X'X\big)^{-1}\big(n^{-1}X'Z_1\big).$$

Under the regularity conditions and the assumptions in White (1980), the probability limits of all these matrix products can be consistently estimated if, in place of $uu'$, we use $\hat\Omega_0 = \operatorname{diag}\{\hat u_i^2\}$. Reverting back, this means that

$$n^{-1}Z_1'M_x\hat\Omega_0M_xZ_1\ \overset{p}{\longrightarrow}\ \operatorname{plim}\ n^{-1}Z_1'M_xuu'M_xZ_1 = \operatorname{plim}\ n^{-1}Z_1'\hat u\hat u'Z_1.$$

Thus, we have obtained that a consistent estimator of the matrix $Q$ is (now eliminating the redundant scaling factors)

$$\hat Q = X_1'Z(Z'Z)^{-1}\begin{bmatrix} n^{-1}Z_1'M_x\hat\Omega_0M_xZ_1 & 0\\ 0 & 0\end{bmatrix}(Z'Z)^{-1}Z'X_1\ \overset{p}{\longrightarrow}\ Q.$$

This can be compacted. Consider the following matrix, suitably bracketed:

$$\hat X_1'M_x\hat\Omega_0M_x\hat X_1 = X_1'P_zM_x\hat\Omega_0M_xP_zX_1 = X_1'Z(Z'Z)^{-1}\big[Z'M_x\hat\Omega_0M_xZ\big](Z'Z)^{-1}Z'X_1.$$

The outer terms are identical to the outer terms of $\hat Q$. The middle term $Z'M_x\hat\Omega_0M_xZ$ can be decomposed as

$$Z'M_x\hat\Omega_0M_xZ = \begin{bmatrix} Z_1'\\ X_2'\end{bmatrix}M_x\hat\Omega_0M_x\begin{bmatrix} Z_1 & X_2\end{bmatrix} = \begin{bmatrix} Z_1'M_x\hat\Omega_0M_xZ_1 & Z_1'M_x\hat\Omega_0M_xX_2\\ X_2'M_x\hat\Omega_0M_xZ_1 & X_2'M_x\hat\Omega_0M_xX_2\end{bmatrix}.$$

However, $M_xX_2 = 0$ and $X_2'M_x = 0$, so

$$Z'M_x\hat\Omega_0M_xZ = \begin{bmatrix} Z_1'M_x\hat\Omega_0M_xZ_1 & 0\\ 0 & 0\end{bmatrix}.$$

This is identical to the middle component of $\hat Q$ (up to the scaling factor); thus, we arrive at

$$\hat Q = n^{-1}X_1'Z(Z'Z)^{-1}\big[Z'M_x\hat\Omega_0M_xZ\big](Z'Z)^{-1}Z'X_1 = n^{-1}X_1'P_zM_x\hat\Omega_0M_xP_zX_1 = n^{-1}\hat X_1'M_x\hat\Omega_0M_x\hat X_1.$$
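Continuing the simulated objects from the sketches of Section 2, the equality of the long-form and compact estimators can be confirmed numerically (here, the $n\times n$ matrices are formed purely for verification):

```python
# Long form vs. compact form of Q_hat, which must agree exactly
Omega0 = np.diag(u_hat**2)
ZZinv = np.linalg.inv(Z.T @ Z)
long_form = X1.T @ Z @ ZZinv @ (Z.T @ Mx @ Omega0 @ Mx @ Z) @ ZZinv @ Z.T @ X1 / n
compact = MxXhat1.T @ Omega0 @ MxXhat1 / n    # n^{-1} Xhat1' Mx Omega0 Mx Xhat1
assert np.allclose(long_form, compact)
```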

Appendix D

We present here the details of the Monte Carlo (MC) study whose results we report in the main text. The study was conducted using the software “gretl”. For the random number generator, we have used the seed 1930021000.
Table A1 contains the random variables that we have used as building blocks.

Table A1. Building blocks of the MC simulation.

| Symbol | Distribution | Description |
|---|---|---|
| $U_1$ | $F(20, 15)$ | Snedecor's F-distribution with d.f. 20 (num.) and 15 (denom.) |
| $U_3$ | $P(1)$ | Poisson with mean equal to 1 |
| $U_4$ | $\chi^2(3)$ | Chi-square with 3 d.f. |
| $U_5$ | $N(-1, 2)$ | Normal with mean equal to −1 and st.dev. 2 |
| $U_6$ | $t(6)$ | Student's t with 6 d.f. |
| $U_7$ | $U(-2, 2)$ | Continuous Uniform on (−2, 2) |
| $U_8$ | $U(0, 2)$ | Continuous Uniform on (0, 2) |
| $U_9$ | $U_d(0, 2)$ | Discrete Uniform on {0, 1, 2} |
We have generated one “unobservable” that creates endogeneity and one that does not, two regressors that become endogenous when the correlated unobservable is used in the data generation process, one exogenous regressor, and three instruments. Table A2 contains the generating expressions.
Table A2. Variables in the MC simulation.

| Symbol | Expression | Status |
|---|---|---|
| $L_1$ | $0.7U_6 + U_7$ | Latent, correlated |
| $L_2$ | $U_7$ | Latent, uncorrelated |
| $X_{11}$ | $U_1 + U_3 + U_6$ | Endogenous given $L_1$ |
| $X_{12}$ | $0.5U_3 + U_5 - 0.5U_6$ | Endogenous given $L_1$ |
| $X_2$ | $U_1 + U_5$ | Exogenous |
| $Z_{11}$ | $U_3 - U_1$ | Instrument |
| $Z_{12}$ | $|U_5|$ | Instrument |
| $Z_{13}$ | $U_3 - U_5$ | Instrument |
We note that, even though $U_5$ is a Normal random variable, the instrument $Z_{12} = |U_5|$ is relevant (correlated with the endogenous variables) because $U_5$ has a non-zero mean: its absolute value is then a Folded Normal and remains correlated with $U_5$ (in contrast, if $U_5$ were a zero-mean Normal, and consequently $Z_{12}$ a Half Normal, their covariance would be zero).
As regards the scenarios of homoskedasticity, random heteroskedasticity, and groupwise heteroskedasticity, the error term (including the unobservable variable) was generated as shown in Table A3 ($s = 1$ implies that the correlated latent variable $L_1$ was used).

Table A3. Error terms in the MC simulation.

| Symbol | Expression | Model |
|---|---|---|
| $u_{[1,s]}$ | $N(0, 2) + 3L_s,\ s = 1, 2$ | Homoskedasticity |
| $u_{[2,s]}$ | $N(0, 1 + U_8) + 3L_s,\ s = 1, 2$ | Random heteroskedasticity |
| $u_{[3,s]}$ | $N(0, 1 + U_9) + 3L_s,\ s = 1, 2$ | Groupwise heteroskedasticity |
For these three setups, the dependent variable was generated as

$$Y_{[t,s]} = 1 - 5X_2 + 2X_{11} + 1.5X_{12} + u_{[t,s]},\qquad t = 1, 2, 3,\ s = 1, 2.$$
So, for example, $Y_{[2,2]}$ is the situation where we have random heteroskedasticity and no endogeneity; thus, it was used to assess the empirical size of the test for this specific heteroskedastic scenario.
For the conditional heteroskedasticity scheme, we used a random-coefficients model, with the mean values of the random coefficients equal to the coefficients specified above, together with the homoskedastic error term $u_{[1,s]}$, namely

$$Y_{[4,s]} = N(1, 0.2) - N(5, 1)\cdot X_2 + N(2, 0.4)\cdot X_{11} + N(1.5, 0.3)\cdot X_{12} + u_{[1,s]},\qquad s = 1, 2.$$

Decomposed, this leads to

$$Y_{[4,s]} = 1 - 5X_2 + 2X_{11} + 1.5X_{12} + u_{[4,s]},\qquad u_{[4,s]} = N(0, 0.2) - N(0, 1)\cdot X_2 + N(0, 0.4)\cdot X_{11} + N(0, 0.3)\cdot X_{12} + u_{[1,s]},\ s = 1, 2.$$
Since all conditional heteroskedasticity factors are scaled by independent zero-mean Normals, no additional source of endogeneity is created.
The general matrix Hausman statistic is

$$\hat q_{het} = \hat u'\hat X_1\Big[\hat X_1'M_x\hat\Omega M_x\hat X_1\Big]^{-1}\hat X_1'\hat u.$$

The OLS regressions regressed $Y_{[t,s]}$ on a constant and $X = (X_2 : X_{11} : X_{12})$; $\hat u$ denotes the OLS residuals, which were also used in computing $\hat\Omega$. $M_x = I_n - P_x$, where $P_x$ is the projection matrix of $X$; its diagonal elements $h_{ii}$ were used for the variants of $\hat\Omega$. $X_1 = (X_{11} : X_{12})$ and $\hat X_1 = P_zX_1$, where $P_z$ is the projection matrix of a constant and $(X_2 : Z_{11} : Z_{12} : Z_{13})$. For the homoskedastic variant of the statistic, the estimated OLS error variance $\hat\sigma_u^2$ was used instead of $\hat\Omega$.
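The study itself was implemented in gretl; purely as an illustration, here is a compact Python sketch of one replication of this DGP for scenario $t = 2$, $s = 1$ (random heteroskedasticity with endogeneity), where the minus signs follow our reconstruction of the tables above:

```python
import numpy as np

def one_draw(n, rng):
    """One simulated sample: random heteroskedasticity (t = 2) with endogeneity (s = 1)."""
    U1 = rng.f(20, 15, n); U3 = rng.poisson(1, n); U5 = rng.normal(-1, 2, n)
    U6 = rng.standard_t(6, n); U7 = rng.uniform(-2, 2, n); U8 = rng.uniform(0, 2, n)
    L1 = 0.7 * U6 + U7                      # latent variable correlated with X11, X12
    X11 = U1 + U3 + U6                      # endogenous given L1
    X12 = 0.5 * U3 + U5 - 0.5 * U6          # endogenous given L1
    X2 = U1 + U5                            # exogenous
    Z11, Z12, Z13 = U3 - U1, np.abs(U5), U3 - U5   # external instruments
    u = rng.normal(0, 1 + U8) + 3 * L1      # heteroskedastic error plus endogeneity
    y = 1 - 5 * X2 + 2 * X11 + 1.5 * X12 + u
    X = np.column_stack([np.ones(n), X2, X11, X12])
    Z = np.column_stack([np.ones(n), X2, Z11, Z12, Z13])
    return y, X, Z, np.column_stack([X11, X12])
```

Each replication would then pass `y`, `X`, `Z`, and the endogenous block to the statistic computations sketched in Section 2.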

Notes

^1 Sometimes it is also called the “artificial regression” or “control function” approach.
^2 In the literature, the test is presented using the Moore–Penrose pseudo-inverse $V^+$, most likely because its uniqueness avoids the need to choose among alternatives in an ad hoc manner, as well as the uncertainty of obtaining possibly different results from different generalized inverses in finite samples. Regardless, the limiting distributional result holds for any generalized inverse; see Hausman and Taylor (1981).
^3 The need for a generalized inverse in the original formulation of the test is treated as “cumbersome” in the literature, see for example Greene (2012, p. 276) and Wooldridge (2002, p. 119), and it is also put forth as an argument to favor the use of the augmented regression test.
^4 The “augmented regression” test also guards against this possibility, since it uses the residuals from regressing each endogenous variable on the instruments. If exact linear dependence exists, the related series of residuals will be a series of zeros.
^5 This monotonic fall of power, as we “intensify” the degree to which we attempt to correct the heteroskedasticity estimator for finite-sample performance, is in accord with what MacKinnon (2013, pp. 456–57) found.
^6 In case there is an issue with the validity of the instruments, as discussed earlier, the augmented regression method would produce at least one series of zero residuals.
^7 So, as regards the heteroskedasticity corrector HC1, the number of regressors in the augmented regression setup is $k = 2K_1 + K_2$, while for HC2 and HC3, the diagonal elements $h_{ii}$ are those of a projection matrix that includes these additional variables.
^8 MacKinnon (2013, pp. 449–52) also found in his simulations that the HC3 variant performs best as regards empirical size in small samples.

References

  1. Adkins, Lee C., Randall C. Campbell, Viera Chmelarova, and R. Carter Hill. 2012. The Hausman test, and some alternatives, with heteroskedastic data. In Essays in Honor of Jerry Hausman. Advances in Econometrics, vol. 29. Leeds: Emerald Group Publishing Ltd.
  2. Amini, Shahram, Michael S. Delgado, Daniel J. Henderson, and Christopher F. Parmeter. 2012. Fixed vs. random: The Hausman test four decades later. In Essays in Honor of Jerry Hausman. Advances in Econometrics, vol. 29. Leeds: Emerald Group Publishing Ltd.
  3. Durbin, James. 1954. Errors in variables. Revue de l'Institut International de Statistique 22: 23–32.
  4. Greene, William H. 2012. Econometric Analysis, 7th ed. Harlow: Pearson Education Ltd.
  5. Hahn, Jinyong, John C. Ham, and Hyungsik Roger Moon. 2011. The Hausman test and weak instruments. Journal of Econometrics 160: 289–99.
  6. Hausman, Jerry A. 1978. Specification tests in econometrics. Econometrica 46: 1251–71.
  7. Hausman, Jerry A., and William E. Taylor. 1981. A generalized specification test. Economics Letters 8: 239–45.
  8. MacKinnon, James G. 2013. Thirty years of heteroskedasticity-robust inference. In Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis. Edited by Xiaohong Chen and Norman R. Swanson. New York: Springer, pp. 437–61.
  9. White, Halbert. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817–38.
  10. Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge: MIT Press.
  11. Wu, De-Min. 1973. Alternative tests of independence between stochastic regressors and disturbances. Econometrica 41: 733–50.
  12. Wu, De-Min. 1974. Alternative tests of independence between stochastic regressors and disturbances: Finite sample results. Econometrica 42: 529–46.
Table 1. Monte Carlo simulation study. Empirical size and power of the matrix Hausman statistic. Nominal size: 5%. Entries are empirical Size / Power, in percent, per skedastic scenario.

| n | Robust Estimation | Homoskedasticity | Random | Group-Wise | Conditional |
|---|---|---|---|---|---|
| 50 | Homoskedastic | 4.69 / 49.05 | 4.77 / 48.43 | 4.86 / 47.23 | 5.50 / 38.57 |
| 50 | HC0 | 5.54 / 41.09 | 5.55 / 40.67 | 5.45 / 40.11 | 5.81 / 32.04 |
| 50 | HC1 | 4.30 / 35.06 | 4.23 / 34.64 | 3.96 / 34.22 | 4.03 / 26.53 |
| 50 | HC2 | 4.03 / 33.00 | 4.00 / 32.22 | 3.67 / 32.11 | 3.59 / 24.30 |
| 50 | HC3 | 2.77 / 24.93 | 2.70 / 24.46 | 2.23 / 24.10 | 2.18 / 17.36 |
| 75 | Homoskedastic | 4.50 / 71.74 | 4.61 / 71.37 | 4.78 / 70.51 | 5.53 / 57.97 |
| 75 | HC0 | 5.21 / 64.98 | 5.30 / 64.36 | 5.35 / 63.60 | 5.61 / 51.21 |
| 75 | HC1 | 4.37 / 61.44 | 4.43 / 60.94 | 4.39 / 60.08 | 4.53 / 47.26 |
| 75 | HC2 | 4.15 / 59.41 | 4.09 / 58.97 | 4.16 / 58.06 | 4.13 / 45.04 |
| 75 | HC3 | 3.18 / 53.43 | 3.12 / 52.77 | 3.15 / 51.41 | 2.91 / 38.67 |
| 100 | Homoskedastic | 4.62 / 86.16 | 4.70 / 85.31 | 4.80 / 84.96 | 5.76 / 73.35 |
| 100 | HC0 | 5.02 / 81.33 | 5.23 / 80.72 | 5.17 / 80.25 | 5.05 / 66.79 |
| 100 | HC1 | 4.51 / 79.55 | 4.34 / 78.59 | 4.53 / 78.39 | 4.47 / 64.38 |
| 100 | HC2 | 4.34 / 78.23 | 4.16 / 77.38 | 4.24 / 77.04 | 4.15 / 62.46 |
| 100 | HC3 | 3.43 / 74.23 | 3.36 / 73.25 | 3.58 / 72.65 | 3.32 / 57.35 |
| 200 | Homoskedastic | 4.90 / 99.50 | 4.85 / 99.45 | 4.47 / 99.31 | 6.08 / 96.79 |
| 200 | HC0 | 5.19 / 99.13 | 5.00 / 99.02 | 4.86 / 98.95 | 5.26 / 94.74 |
| 200 | HC1 | 4.86 / 99.06 | 4.81 / 98.98 | 4.50 / 98.87 | 4.97 / 94.34 |
| 200 | HC2 | 4.70 / 98.94 | 4.75 / 98.85 | 4.36 / 98.73 | 4.79 / 93.77 |
| 200 | HC3 | 4.27 / 98.53 | 4.29 / 98.43 | 3.95 / 98.32 | 4.28 / 92.67 |
Table 2. Comparison in empirical size and power of the matrix Hausman statistic vs. the Wald statistic from the augmented regression setup. Nominal size: 5%. Entries are empirical Size / Power, in percent, per skedastic scenario.

| n | Statistic | Homoskedasticity | Random | Group-Wise | Conditional |
|---|---|---|---|---|---|
| 50 | $\hat q_{het}$-HC0 | 5.54 / 41.09 | 5.55 / 40.67 | 5.45 / 40.11 | 5.81 / 32.04 |
| 50 | Wald-HC3 | 5.76 / 45.17 | 5.74 / 44.84 | 5.76 / 44.21 | 5.64 / 34.93 |
| 75 | $\hat q_{het}$-HC0 | 5.21 / 64.98 | 5.30 / 64.36 | 5.35 / 63.60 | 5.61 / 51.21 |
| 75 | Wald-HC3 | 5.57 / 68.42 | 5.45 / 67.76 | 5.53 / 67.53 | 5.43 / 52.96 |
| 100 | $\hat q_{het}$-HC0 | 5.02 / 81.33 | 5.23 / 80.72 | 5.17 / 80.25 | 5.05 / 66.79 |
| 100 | Wald-HC3 | 5.40 / 83.62 | 5.29 / 82.63 | 5.39 / 82.56 | 5.31 / 67.84 |
| 200 | $\hat q_{het}$-HC0 | 5.19 / 99.13 | 5.00 / 99.02 | 4.86 / 98.95 | 5.26 / 94.74 |
| 200 | Wald-HC3 | 5.36 / 99.38 | 5.18 / 99.31 | 4.87 / 99.16 | 5.39 / 94.80 |
Table 3. Comparison in empirical size and power of the matrix Hausman statistic vs. the Wald statistic from the augmented regression setup. Nominal size: 10%. Entries are empirical Size / Power, in percent, per skedastic scenario.

| n | Statistic | Homoskedasticity | Random | Group-Wise | Conditional |
|---|---|---|---|---|---|
| 50 | $\hat q_{hom}$ | 9.72 / 62.89 | 9.63 / 62.31 | 10.14 / 60.37 | 10.60 / 51.81 |
| 50 | Wald-HC3 | 10.16 / 57.35 | 10.15 / 56.43 | 10.44 / 55.14 | 9.91 / 45.69 |
| 75 | $\hat q_{hom}$ | 9.79 / 82.04 | 9.79 / 81.41 | 9.95 / 80.93 | 11.05 / 70.08 |
| 75 | Wald-HC3 | 10.11 / 77.89 | 10.09 / 77.51 | 10.18 / 77.27 | 9.95 / 64.43 |
| 100 | $\hat q_{hom}$ | 9.77 / 92.14 | 9.74 / 91.68 | 10.26 / 91.34 | 11.23 / 82.55 |
| 100 | Wald-HC3 | 10.07 / 90.45 | 10.02 / 89.91 | 10.26 / 89.51 | 10.10 / 78.26 |
| 200 | $\hat q_{hom}$ | 9.96 / 99.80 | 9.81 / 99.74 | 9.96 / 99.75 | 11.60 / 98.45 |
| 200 | Wald-HC3 | 9.90 / 99.77 | 9.93 / 99.66 | 10.33 / 99.62 | 10.23 / 97.33 |