1. Introduction
This paper studies testing for cross-sectional correlation in panel data when serial correlation is also present in the disturbances. It does so for the case of strictly exogenous regressors. Cross-sectional correlation could be due to unknown common shocks, spatial effects or interactions within social networks. Ignoring cross-sectional correlation in panels can have serious consequences. As with serial correlation in time series, cross-sectional correlation leads to efficiency loss for least squares and invalidates inference. In some cases, it results in inconsistent estimation; see Lee [1] and Andrews [2]. Testing the cross-sectional correlation of panel residuals is therefore important.
One could test for a specific form of correlation in the error, such as spatial correlation; see Anselin and Bera [3] for cross-sectional data and Baltagi et al. [4] for panel data, to mention a few. Alternatively, one could test for correlation without imposing any structure on the form of correlation among the disturbances. The null hypothesis, in that case, is the diagonality of the covariance or correlation matrix of the N-dimensional disturbance vector, which is usually assumed to be independent over time, for $t = 1, \ldots, T$. When N is fixed and T is large, the traditional multivariate statistics techniques, including log-likelihood ratio and Lagrange multiplier tests, are applicable; see, for example, Breusch and Pagan [5], who propose a Lagrange Multiplier (LM) test based on the average of the squared pair-wise correlation coefficients of the least squares residuals.
However, as N becomes large because of the growing availability of comprehensive databases in macro and finance, this so-called “high dimensional” phenomenon brings challenges to classical statistical inference. As shown in the Random Matrix Theory (RMT) literature, the sample covariance and correlation matrices are ill-conditioned since their eigenvectors are not consistent with their population counterparts; see Johnstone [6] and Jiang [7]. New approaches have been considered in the statistics literature for testing the diagonality of the sample covariance or correlation matrices; see Ledoit and Wolf [8], Schott [9] and Chen et al. [10], to mention a few.
The above tests for raw data cannot be used directly to test cross-sectional correlation in panel data regressions since the disturbances are not observable. Noise caused by substituting residuals for the actual disturbances may accumulate due to large dimensions, and this in turn may lead to biased inference. The bias for cross-sectional correlation tests in large panels depends on the model specification, the estimation method and the sample sizes N and T, among other things. For example, Pesaran et al. [11] consider an LM test and correct its bias in a large heterogeneous panel data model; Baltagi et al. [12] extend Schott’s test [9] to a fixed effects panel data model and correct the bias caused by estimating the disturbances with fixed effects residuals in a homogeneous panel data model. Following Ledoit and Wolf [8], Baltagi et al. [13] propose a bias-adjusted test for testing the null of sphericity in the fixed effects homogeneous panel data model. However, this method does not test cross-sectional correlation directly: rejection of the null could be due to cross-sectional correlation or heteroscedasticity or both. A general test for cross-sectional correlation was proposed by Pesaran [14]. His CD (Cross-sectional Dependence) test statistic is based on the average of pair-wise correlation coefficients. The test is exactly centered at zero under the null and does not need bias correction. Pesaran [15] extends his test statistic to test the null of weak cross-sectional correlation and derives its asymptotic distribution using joint limits. This test is robust to many model specifications and has many applications. Recent surveys for cross-sectional correlation or dependence tests in large panels are provided by Moscone and Tosetti [16], Sarafidis and Wansbeek [17] and Chudik and Pesaran [18].
The asymptotics and bias-correction of existing tests for cross-sectional correlation in large panels are carried out under rather restrictive assumptions. For instance, the errors are assumed to be normally distributed, restrictions are imposed on the relative rates at which N and T grow, and so on. One fundamental restriction is that the errors are independent over time. In fact, the presence of serial correlation in panel data applications is likely to be the rule rather than the exception, especially for macro applications and when T is large. Ignoring serial correlation does not affect the consistency of estimates, but it leads to incorrect inference. In RMT, when the disturbance vectors are independent over time and N is large, the Limiting Spectral Distribution (LSD) of the corresponding sample covariance matrix is the Marchenko-Pastur (M-P) law; see Bai and Silverstein [19]. Existing correlation among these disturbances may cause a deviation of the LSD from the M-P law. Indeed, Bai and Zhou [20] show that the LSD of the sample covariance matrix with correlations in columns is different from the M-P law. Gao et al. [21] show similar results for the sample correlation matrix. Therefore, the cross-sectional correlation tests, which heavily depend on the assumption of independence over time, could lead to misleading inference if there is serial correlation in the disturbances.
To better understand the effects of potential serial correlation on the existing tests of cross-sectional correlation, let us assume that N independent random vectors, one for each cross-sectional unit $i = 1, \ldots, N$ and each with mean zero, are observable, and consider the correlation coefficient between any pair of them. If all of the elements of each vector are independent and identically spherically distributed, Muirhead [22] shows that the squared correlation coefficient has an expectation of order $1/T$. When N is fixed, the sum of all $N(N-1)/2$ distinct squared correlation coefficients will therefore be small as $T \to \infty$. In Section 3, we show that if all of the elements of each vector follow a multiple Moving Average model of order one (MA(1)), then this expectation contains an extra term that depends on the MA(1) parameter. As $(N, T) \to \infty$, the extra term can accumulate and lead to extra bias for the existing LM type tests in panels. Although Pesaran's CD statistic is centered at zero, it may still encounter size distortions because serial correlation is ignored.
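The effect described in this paragraph can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes Gaussian innovations, known zero means and a hypothetical MA(1) parameter of 0.8, and the function name `mean_squared_correlation` is purely illustrative. It compares the Monte Carlo mean of the squared correlation between two independent T-dimensional vectors with $1/T$, first for independent entries and then for MA(1) entries.

```python
import numpy as np

def mean_squared_correlation(T, theta=0.0, reps=20000, seed=0):
    """Monte Carlo mean of rho^2 for two independent zero-mean T-vectors.

    Each vector follows v_t = e_t + theta * e_{t-1}; theta = 0 gives
    independent entries. theta is a hypothetical illustration value.
    """
    rng = np.random.default_rng(seed)
    # Draw T + 1 innovations per vector so the MA(1) filter has a start-up value.
    e = rng.standard_normal((reps, 2, T + 1))
    v = e[:, :, 1:] + theta * e[:, :, :-1]
    # Uncentered correlation, since the means are assumed known and equal to zero.
    num = np.sum(v[:, 0, :] * v[:, 1, :], axis=1)
    den = np.linalg.norm(v[:, 0, :], axis=1) * np.linalg.norm(v[:, 1, :], axis=1)
    return float(np.mean((num / den) ** 2))

T = 50
iid_mean = mean_squared_correlation(T, theta=0.0)
ma1_mean = mean_squared_correlation(T, theta=0.8)
print(f"1/T = {1 / T:.4f}, independent: {iid_mean:.4f}, MA(1): {ma1_mean:.4f}")
```

Under these assumptions the independent case should sit close to $1/T$, while the MA(1) case should be visibly larger, which is the kind of extra term that accumulates over the $N(N-1)/2$ pairs.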
This paper proposes a modification of Pesaran’s CD test of cross-sectional correlation when the error terms are serially correlated in large panel data models. First, using results from RMT, we study the first two moments of the test statistic and propose an unbiased and consistent estimate of the variance with unknown serial correlation under the null. Second, we derive the limiting distribution of the test under the asymptotic framework with $(N, T) \to \infty$ simultaneously in any order without any distributional assumption. We also discuss its local power properties under a multi-factor alternative. Monte Carlo simulations are conducted to study the performance of our test statistic in finite samples. The results confirm our theoretical findings.
The plan for the paper is as follows. The next section introduces the model and notation, existing LM type tests and the Cross-sectional Dependence (CD) test. It then presents our assumptions and the proposed modified Pesaran’s CD test statistic. Section 3 derives the asymptotics of this test statistic. Section 4 reports the results of the Monte Carlo experiments. Section 5 provides some concluding remarks. All of the mathematical proofs are provided in the Appendix.
Throughout the paper, we adopt the following notation. For a square matrix B, $\mathrm{tr}(B)$ is the trace of B; $\|B\| = (\mathrm{tr}(B'B))^{1/2}$ denotes the Frobenius norm of a matrix or the Euclidean norm of a vector B; $\xrightarrow{d}$ denotes convergence in distribution; and $\xrightarrow{p}$ denotes convergence in probability. We use $(N, T) \to \infty$ to denote the joint convergence of N and T when N and T pass to infinity simultaneously. K is a generic positive number not depending on N or T.
2. Model and Tests
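Consider the following heterogeneous panel data model
$$y_{it} = x_{it}'\beta_i + v_{it}, \quad i = 1, \ldots, N; \; t = 1, \ldots, T, \tag{1}$$
where i and t index the cross-section dimension and time dimension, respectively; $y_{it}$ is the dependent variable, and $x_{it}$ is a $k \times 1$ vector of exogenous regressors. The individual coefficients $\beta_i$ are defined on a compact set and allowed to vary across i. The null hypothesis of no cross-sectional correlation is
$$H_0: \mathrm{Cov}(v_{it}, v_{jt}) = 0 \quad \text{for all } t \text{ and } i \neq j, \tag{2}$$
or equivalently
$$H_0: \rho_{ij} = 0 \quad \text{for } i \neq j,$$
where $\rho_{ij}$ is the pair-wise correlation coefficient of the disturbances defined by
$$\rho_{ij} = \frac{\sum_{t=1}^{T} v_{it} v_{jt}}{\left(\sum_{t=1}^{T} v_{it}^2\right)^{1/2} \left(\sum_{t=1}^{T} v_{jt}^2\right)^{1/2}}.$$
Under the alternative, there exists at least one $\rho_{ij} \neq 0$ for some $i \neq j$. For the panel regression model (1), the disturbances are unobservable. In this case, the test statistic is based on the residual-based correlation coefficients $\hat{\rho}_{ij}$. Specifically,
$$\hat{\rho}_{ij} = \frac{\sum_{t=1}^{T} e_{it} e_{jt}}{\left(\sum_{t=1}^{T} e_{it}^2\right)^{1/2} \left(\sum_{t=1}^{T} e_{jt}^2\right)^{1/2}},$$
where $e_{it}$ are the Ordinary Least Squares (OLS) residuals using T observations for each $i = 1, \ldots, N$. These OLS residuals are given by
$$e_{it} = y_{it} - x_{it}'\hat{\beta}_i, \tag{4}$$
with $\hat{\beta}_i$ being the OLS estimates of $\beta_i$ from (1) for $i = 1, \ldots, N$. Let $M_i = I_T - P_{X_i}$, where $P_{X_i} = X_i (X_i'X_i)^{-1} X_i'$ and $X_i$ is a $T \times k$ matrix of regressors with the t-th row being the $1 \times k$ vector $x_{it}'$. We also define $y_i = (y_{i1}, \ldots, y_{iT})'$, $v_i = (v_{i1}, \ldots, v_{iT})'$ and $e_i = (e_{i1}, \ldots, e_{iT})'$ for $i = 1, \ldots, N$. The OLS residuals can be rewritten in vector form as $e_i = M_i v_i$, and the residual-based pair-wise correlation coefficients can be rewritten as $\hat{\rho}_{ij} = e_i'e_j / (\|e_i\| \, \|e_j\|)$, for any $i \neq j$.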
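The residual-based pair-wise correlation coefficients described above can be computed as in the following minimal sketch. It is an illustration under stated assumptions rather than the authors' code: the function name `residual_correlations`, the array layout and the simulated data are all hypothetical, and each unit is fitted by a separate OLS regression so that the slopes can differ across units.

```python
import numpy as np

def residual_correlations(y, X):
    """Pairwise correlations of unit-by-unit OLS residuals.

    y : (N, T) array of dependent variables.
    X : (N, T, k) array of regressors (include a column of ones for an intercept).
    Returns an (N, N) matrix with entries e_i'e_j / (||e_i|| ||e_j||).
    """
    N, T = y.shape
    E = np.empty((N, T))
    for i in range(N):
        # Separate OLS fit for each unit, allowing heterogeneous coefficients.
        beta_i, *_ = np.linalg.lstsq(X[i], y[i], rcond=None)
        E[i] = y[i] - X[i] @ beta_i
    norms = np.linalg.norm(E, axis=1)
    return (E @ E.T) / np.outer(norms, norms)

# Hypothetical example data: 5 units, 40 periods, intercept plus one regressor.
rng = np.random.default_rng(1)
N, T = 5, 40
X = np.stack([np.column_stack([np.ones(T), rng.standard_normal(T)]) for _ in range(N)])
beta = rng.standard_normal((N, 2))
y = np.einsum("itk,ik->it", X, beta) + rng.standard_normal((N, T))
R = residual_correlations(y, X)
```

The off-diagonal entries of `R` are the quantities on which the LM-type and CD-type statistics are built.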
2.1. LM and CD Tests
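For N fixed and $T \to \infty$, Breusch and Pagan [5] propose an LM test to test the null of no cross-sectional correlation in (2) without imposing any structure on this correlation. Writing $\hat{\rho}_{ij}$ for the residual-based pair-wise correlation coefficient, it is given by
$$LM_{BP} = T \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\rho}_{ij}^2.$$
$LM_{BP}$ is asymptotically distributed as a Chi-squared distribution with $N(N-1)/2$ degrees of freedom under the null. However, for a typical micro-panel dataset, N is larger than T, and the Breusch-Pagan LM test statistic is not valid under this “large N, small T” setup. In fact, Pesaran [14] proposes a scaled version of this LM test as follows
$$LM_{P} = \sqrt{\frac{1}{N(N-1)}} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left(T \hat{\rho}_{ij}^2 - 1\right).$$
Pesaran [14] shows that $LM_{P}$ is distributed as $N(0, 1)$ with $T \to \infty$ first, then $N \to \infty$ under the null. However,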
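the scaled LM statistic is not correctly centered at zero with fixed T and large N. Hence, Pesaran et al. [11] propose a bias-adjusted version of this LM test, denoted by $LM_{BC}$,
$$LM_{BC} = \sqrt{\frac{2}{N(N-1)}} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{(T-k)\hat{\rho}_{ij}^2 - \mu_{Tij}}{\upsilon_{Tij}},$$
where $\hat{\rho}_{ij}$ is the residual-based pair-wise correlation coefficient and k is the number of regressors. They show that the exact mean and variance of $(T-k)\hat{\rho}_{ij}^2$ are given by
$$\mu_{Tij} = \frac{1}{T-k}\,\mathrm{tr}(M_i M_j)$$
and
$$\upsilon_{Tij}^2 = \left[\mathrm{tr}(M_i M_j)\right]^2 a_{1T} + 2\,\mathrm{tr}\left[(M_i M_j)^2\right] a_{2T},$$
where $M_i = I_T - X_i (X_i'X_i)^{-1} X_i'$, $a_{1T} = a_{2T} - 1/(T-k)^2$ and $a_{2T}$ is given by
$$a_{2T} = 3\left[\frac{(T-k-8)(T-k+2) + 24}{(T-k+2)(T-k-2)(T-k-4)}\right]^2.$$
Pesaran et al. [11] show that $LM_{BC}$ is asymptotically distributed as $N(0, 1)$ under the null (2) and the normality assumption of the disturbances as $T \to \infty$ followed by $N \to \infty$.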
Alternatively, Pesaran [14] proposes a test based on the average of pair-wise correlation coefficients rather than their squares. The test statistic is given by
$$CD_{P} = \sqrt{\frac{2T}{N(N-1)}} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\rho}_{ij}. \tag{10}$$
Pesaran [15] shows that this test is asymptotically distributed as $N(0, 1)$ with $(N, T) \to \infty$. He also extends this to test the null of weak cross-sectional correlation.
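To make the three simplest statistics of this subsection concrete, the sketch below computes the Breusch-Pagan LM statistic, Pesaran's scaled LM statistic and Pesaran's CD statistic from a given matrix of pair-wise residual correlations. The formulas are the standard textbook forms from the cited literature; the function name `lm_and_cd` and the small example matrix are hypothetical, and the bias-adjusted LM test is deliberately left out to keep the example short.

```python
import numpy as np

def lm_and_cd(R, T):
    """Breusch-Pagan LM, scaled LM and Pesaran CD statistics.

    R : (N, N) matrix of pairwise residual correlations.
    T : number of time periods used to compute the correlations.
    """
    N = R.shape[0]
    rho = R[np.triu_indices(N, k=1)]  # the N(N-1)/2 distinct pairs with i < j
    lm_bp = T * np.sum(rho ** 2)
    lm_scaled = np.sqrt(1.0 / (N * (N - 1))) * np.sum(T * rho ** 2 - 1.0)
    cd = np.sqrt(2.0 * T / (N * (N - 1))) * np.sum(rho)
    return lm_bp, lm_scaled, cd

# Hypothetical 3 x 3 correlation matrix, for illustration only.
R = np.array([[1.0, 0.2, -0.1],
              [0.2, 1.0, 0.3],
              [-0.1, 0.3, 1.0]])
lm_bp, lm_scaled, cd = lm_and_cd(R, T=50)
print(lm_bp, lm_scaled, cd)
```

Because the CD statistic sums the correlations rather than their squares, positive and negative entries can offset each other, which is why it is centered at zero under the null while the LM-type statistics need recentering.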
2.2. Assumptions and the Modified CD Test Statistic
So far, all of the methods surveyed above for testing cross-sectional correlation in panel data models assume that the disturbances are independent over time. Ignoring serial correlation usually results in efficiency loss and biased inference. In fact, we show in Section 3 that the existence of serial correlation leads to extra bias in the LM-type tests. Pesaran's CD test in (10) is still centered at zero under serial correlation, but its variance is affected. As a result, we also expect size distortions in this test. To correct for this, we consider a modification of this test statistic that accounts for an unknown form of serial correlation in the disturbances. First, we introduce the assumptions needed:
Assumption 1. Define and We also assume that , for where is a random vector with mean vector zero and covariance matrix Let denote the t-th entry of for any has a uniformly bounded fourth moment, and there exists a finite constant Δ
, such that Following Bai and Zhou [20], the disturbances are generated bywhere for are random vectors across time, and is a sequence of numbers satisfying Assumption 1 allows the error term
to be correlated over time. The condition imposed on the coefficient sequence excludes long memory-type strong dependence. We need bounded moment conditions to ensure large $(N, T)$ asymptotics for panel data models with serial correlation. The conditions in Assumption 1 are quite mild; they are satisfied by many parametric weak dependence processes, such as stationary and invertible finite-order Auto-Regressive and Moving Average (ARMA) models. Under Assumption 1, the covariance matrix of each
is
where Σ is a $T \times T$ symmetric positive definite matrix. The random vector
can be written as
where
The generic covariance matrix
of each
captures the serial correlation. Bai and Zhou [
20] use this representation and show that
tr
is bounded for any fixed positive integer
κ. More specifically, consider a multiple Moving Average model of order one (MA(1))
where
and
,
and
are defined in Assumption 1. For this case,
, where
One can also verify that for (
11), we have the following generic representation,
We use this representation throughout the paper for convenience.
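As a hedged illustration of the MA(1) case, the sketch below builds the covariance matrix of a univariate MA(1) process and checks numerically that the trace of its powers, divided by T, settles down as T grows. The unit innovation variance, the normalization by T and the value of the MA(1) parameter are assumptions of this sketch, and the function names are hypothetical.

```python
import numpy as np

def ma1_covariance(T, theta):
    """T x T covariance matrix of v_t = e_t + theta * e_{t-1} with Var(e_t) = 1.

    The matrix is tridiagonal: 1 + theta^2 on the diagonal and theta on the
    first off-diagonals.
    """
    sigma = (1.0 + theta ** 2) * np.eye(T)
    off = theta * np.ones(T - 1)
    return sigma + np.diag(off, k=1) + np.diag(off, k=-1)

def normalized_trace_power(T, theta, kappa):
    """Return tr(Sigma^kappa) / T for the MA(1) covariance matrix."""
    sigma = ma1_covariance(T, theta)
    return float(np.trace(np.linalg.matrix_power(sigma, kappa)) / T)

theta = 0.5  # hypothetical MA(1) parameter
traces = {T: normalized_trace_power(T, theta, kappa=2) for T in (50, 100, 200)}
print(traces)
```

For $\kappa = 2$ the normalized trace equals $(1+\theta^2)^2 + 2\theta^2 (T-1)/T$, so it converges to a finite limit, in line with the boundedness property quoted from Bai and Zhou [20].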
Assumption 2. The regressors, $x_{it}$, are strictly exogenous, such that $E(v_{it} \mid X_i) = 0$ for all i and t, where $v_{it}$ is the disturbance and $X_i$ is the $T \times k$ matrix of regressors of unit i, and $X_i'X_i$ is a positive definite matrix.

Assumption 3. $T > k + 1$, and the OLS residuals, defined by (4), are not all zeros with probability approaching one.

Assumptions 2 and 3 are standard for model (1); see Pesaran [14] and Pesaran et al. [11]. We impose the assumption that the regressors are strictly exogenous. We do not impose any restrictions on the distribution of the errors or on the relative convergence speed of $(N, T)$. This framework is quite general, whereas LM-type tests usually impose the normality assumption and restrictions on the relative speed of N and T. Under these assumptions, the OLS estimates for model (1) are consistent, but inefficient. We focus on the sum of pair-wise residual correlation coefficients used in Pesaran’s CD test [14]. In the next section, we derive the first two moments of this test statistic and later derive its limiting distribution under this general unknown form of serial correlation over time.
4. Monte Carlo Simulations
This section conducts Monte Carlo simulations to examine the empirical size and power of the proposed modified CD test defined in (22) in heterogeneous panel data regression models. We also look at the performance of the LM test and Pesaran's CD test defined by (9) and (10), respectively, for comparison purposes. We consider four scenarios: (1) the errors are independent over time, with no serial correlation; (2) the errors follow a moving average model of order one (MA(1)) over time; (3) the errors follow an Auto-Regressive model of order one (AR(1)) over time; (4) the errors follow an Auto-Regressive and Moving Average model of order (1,1) (ARMA(1,1)) over time. Finally, we provide small sample evidence on the power performance of the modified CD test against factor and first-order spatial auto-regressive (SAR(1)) alternatives, which are popular in economics for modeling cross-sectional correlation.
4.1. Experimental Design
Following Pesaran et al. [
11], our experiments use the following data-generating process
where
IID
IID
is a strictly exogenous regressor, and we set
and
IID
with
IID
, for
The error terms of (
26) are generated using the following four data generating processes
where
;
IID
and
∼IID
We further set
and
. To check the robustness of the tests to non-normal distributions,
are generated from a Normal
and a Chi-squared distribution
To examine the empirical power of the tests, we consider two different cross-sectional correlation alternatives: factor and spatial models. The factor model is generated by
where
IID
and
IID
In this case,
replaces
in (
26) for the power studies.
is generated by the four scenarios defined by (28)–(31), respectively. For the spatial model, we consider a first-order spatial auto-regressive model (SAR(1)),
where
and
are defined by (28)–(31), respectively.
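The four error scenarios and the factor alternative can be sketched in code as follows. This is only an illustration of the general design: the AR and MA parameters, the factor loadings, the Gaussian innovations and the burn-in length are hypothetical choices rather than the values used in the experiments, and the function names `simulate_errors` and `add_factor` are not from the paper.

```python
import numpy as np

def simulate_errors(N, T, scenario, theta=0.5, rho=0.5, rng=None, burn=50):
    """Generate an (N, T) array of errors that are independent across units.

    scenario : "iid", "ma1", "ar1" or "arma11".
    theta, rho : hypothetical MA and AR parameters (not the paper's values).
    """
    if scenario not in ("iid", "ma1", "ar1", "arma11"):
        raise ValueError(f"unknown scenario: {scenario}")
    rng = np.random.default_rng() if rng is None else rng
    ma = theta if scenario in ("ma1", "arma11") else 0.0
    ar = rho if scenario in ("ar1", "arma11") else 0.0
    e = rng.standard_normal((N, T + burn + 1))
    v = np.zeros((N, T + burn + 1))
    for t in range(1, T + burn + 1):
        # ARMA(1,1) recursion; the other scenarios set ar and/or ma to zero.
        v[:, t] = ar * v[:, t - 1] + e[:, t] + ma * e[:, t - 1]
    return v[:, -T:]  # drop the burn-in so the AR part is close to stationary

def add_factor(v, gamma_scale=0.5, rng=None):
    """Single-factor alternative: gamma_i * f_t + v_it with hypothetical loadings."""
    rng = np.random.default_rng() if rng is None else rng
    N, T = v.shape
    gamma = gamma_scale * rng.standard_normal(N)
    f = rng.standard_normal(T)
    return np.outer(gamma, f) + v

rng = np.random.default_rng(42)
errors = {s: simulate_errors(20, 30, s, rng=rng) for s in ("iid", "ma1", "ar1", "arma11")}
u_factor = add_factor(errors["ma1"], rng=rng)
```

Under the null the rows of each array are independent across units; adding the common factor introduces cross-sectional correlation while leaving the serial correlation of the idiosyncratic part unchanged.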
The experiments are conducted for a range of values of N and T. For each pair of (N, T), we run 2000 replications. To obtain the empirical size, we conduct the proposed modified CD test and Pesaran's CD test at the two-sided 5% nominal significance level and the LM test at the positive one-sided 5% nominal significance level.
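The size calculation described above amounts to counting how often a statistic exceeds its critical value across replications. The sketch below does this for Pesaran's CD statistic only (the proposed modified statistic is not reproduced here), under simplifying assumptions that are not the paper's design: an intercept-only regression for each unit, Gaussian errors, a hypothetical MA(1) parameter of 0.8 and a single illustrative pair (N, T) = (20, 30).

```python
import numpy as np

def cd_statistic(E):
    """Pesaran's CD statistic from an (N, T) array of residuals."""
    N, T = E.shape
    norms = np.linalg.norm(E, axis=1)
    R = (E @ E.T) / np.outer(norms, norms)
    rho = R[np.triu_indices(N, k=1)]
    return np.sqrt(2.0 * T / (N * (N - 1))) * np.sum(rho)

def empirical_size(N, T, theta=0.0, reps=2000, crit=1.96, seed=0):
    """Share of replications with |CD| above the two-sided 5% critical value.

    Errors are independent across units and follow v_t = e_t + theta * e_{t-1}
    over time, so the null of no cross-sectional correlation holds.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        e = rng.standard_normal((N, T + 1))
        v = e[:, 1:] + theta * e[:, :-1]
        # Residuals from regressing each unit on an intercept only.
        E = v - v.mean(axis=1, keepdims=True)
        rejections += int(abs(cd_statistic(E)) > crit)
    return rejections / reps

size_iid = empirical_size(N=20, T=30)
size_ma1 = empirical_size(N=20, T=30, theta=0.8)
print(f"independent errors: {size_iid:.3f}, MA(1) errors: {size_ma1:.3f}")
```

In this simplified setting the rejection rate should be near the nominal 5% with independent errors and noticeably higher with MA(1) errors, which illustrates why an uncorrected CD statistic can be over-sized when serial correlation is ignored.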
4.2. Simulation Results
Table 1 reports the empirical size of the proposed modified CD test, the LM test and Pesaran's CD test for normal and chi-squared distributed errors. The error terms are assumed to be independent over time. The results show that all of the tests have the correct size for the different (N, T) combinations under both normal and chi-squared scenarios. These results are consistent with the theoretical findings. The only exceptions are for small N or T equal to 10, especially for the LM test.
Table 2 reports the empirical size of the three tests with MA(1) error terms defined by (29). The results show that the proposed modified CD test has the correct size for all (N, T), but Pesaran's CD test has size distortions for different (N, T) combinations because the disturbances are MA(1) over time. For example, under the normality scenario, the size of Pesaran's CD test
is
for
and
; it becomes
when
T grows to
The LM test suffers serious size distortions because of the extra bias caused by ignoring serial correlation. From Table 2, the empirical size of the LM test
is
as
N or
T becomes larger than 30.
Table 3 and Table 4 report the empirical size of the tests with AR(1) and ARMA(1,1) errors under the two distributions: normal and chi-squared. Note that the proposed modified CD test is over-sized in Table 4 for the chi-squared case when
However, it has the correct size as
T gets larger than
In contrast, the LM test has serious size issues, rejecting
of the time, and Pesaran's CD test is over-sized by as much as
. Overall, in comparison with Pesaran's CD test and the LM test, the proposed modified CD test controls for size distortions when serial correlation in the disturbances is present and is not much affected when serial correlation is not present.
Table 5 summarizes the size-adjusted power of the proposed modified CD test with MA(1), AR(1) and ARMA(1,1) errors under the factor model alternative. The results show that the test performs reasonably well under the two distribution scenarios, especially for large N and T. Table 6 confirms the power properties of the proposed test for MA(1), AR(1) and ARMA(1,1) errors under the SAR(1) alternative, especially for large N and T.