1. Introduction
Cross-sectional dependence has attracted considerable attention among economists recently.
It is well-known that ignoring cross-sectional dependence may lead to inconsistent estimators and misleading inference. A popular and successful way to capture cross-sectional dependence is through common factors.
Common factor models assume a finite number of unobserved factors, typically the result of economy-wide shocks whose impact on a population unit may depend on the characteristics of that unit. Possible common factors include macroeconomic, technological, legal/institutional, political, environmental, health and sociological shocks, among others. The applied literature has considered, for example, technological shocks (such as new procedures, drugs and surgical techniques) affecting the relationship between countries’ healthcare attainments and their per capita health expenditures and educational levels (e.g., [20]); cross-country cross-industry analysis of returns to R&D, which are affected both by global shocks, such as the recent financial crisis, and by local shocks, such as spillovers between a limited group of industries or countries (e.g., [21]); and the analysis of transnational terrorism, where common factors may arise from common terrorist training camps, common grievances and demonstration effects (cf. [22]).
Typically, common factor models allow for a small and known number of unobserved factors. Although such an approach is convincing in empirical macro models, in microeconometric models, it is often more reasonable to think of a potentially large, possibly unknown (and maybe infinite) number of factors that can influence individuals’ behaviour. For instance, in studies of individual earnings, there are many individual-level observables and unobservables that affect income, as well as several common factors, such as region, family, male/female ratio, race composition, education, age composition, and so on (cf. [7]). The number of common factors may increase as we collect more cross-sectional observations, or there may be an infinite number of unobserved factors (see, e.g., [23]).
The purpose of this paper is to study a nonparametric regression model for cross-sectional data in the presence of common shocks that are very general in nature. The common shocks can be of infinite dimension with flexible impact on different population units. For example, common shocks could take the form of a nonlinear random function of observable or unobservable individual characteristics with the effect on the i-th observation varying continuously across i depending on the value of the characteristic. We focus on nonparametric models because there may be little guidance (or justification) in practice for selecting a particular functional form for the regression function.
There has been important recent work on nonparametric models with a finite number of common factors (e.g., [5,15,16]). These papers consider common shocks that enter the regression function additively, with disturbances modelled as linear functions of mutually-independent unobserved common factors and individual-specific factor loadings. We, in contrast, allow the regression function to be non-separable in the common shocks, and we do not require the mutual independence assumption. In other words, we allow for an unknown large, potentially infinite, number of factors that can influence individuals’ outcomes and that may interact with observable and unobservable individual characteristics in extremely rich and flexible ways. To the best of our knowledge, this is the first paper that allows for such a flexible framework.
We consider this flexible setting because we are interested in investigating how general the regression function and the common shocks can be while still allowing for meaningful nonparametric estimates. We focus on the Nadaraya-Watson kernel estimator and study the effects of general common shocks on its asymptotic properties. Asymptotic results for kernel estimators are typically obtained by manipulating conditional densities of random variables. However, if the common shocks are too general, conditional densities do not necessarily exist. Doob [24] (pp. 623–624) and Halmos [25] (Section 48) present some examples of non-existence. If conditional densities do not exist, then what we would expect to be the probability limit of the kernel estimator in the present context is either meaningless or difficult to interpret.
The idea here is to let the common shocks be as general as possible and to work with well-defined conditional densities that adhere as closely as possible to the standard kernel literature. To do so, we appeal to the disintegration theory for conditional distributions that can be found in Pollard [28], Dellacherie and Meyer [29] and Hoffmann-Jorgensen [30]. We find that an important sufficient condition to guarantee the existence of conditional densities is that the common shocks must belong to a separable metric space equipped with the Borel σ-field. We conclude that the sufficient conditions are mild and not very restrictive in practice.
Given the existence of conditional densities, we adjust the standard assumptions of the kernel literature to the present case. We show that the Nadaraya-Watson kernel estimator converges in probability to the Kolmogorov conditional expectation given the sigma-field generated by the common shocks. The optimal rate of convergence is the same as the rate obtained when the observations are i.i.d. The asymptotic distribution is mixed normal with weights depending on the common shocks. It is obtained by exploiting a martingale difference sequence central limit theorem. We find that inference depends on how the common shocks affect the regression variables. A dichotomy similar to that of Andrews [8] is present here: if the dependent variable is mean independent of the common shocks given the explanatory variables, the usual t-test has the correct size; but if it is not, the t statistic diverges to infinity in probability under the null hypothesis.
The closest paper in the literature to ours is that of Andrews [8], who considers a linear regression model in the presence of general common shocks. He shows that the least-squares estimator converges in probability to the Kolmogorov conditional expectation given the σ-field generated by the common shocks. The random probability limit is a well-defined object because the Kolmogorov conditional expectation always exists. Andrews, therefore, does not need to guarantee the existence of conditional densities. Extending his results to a nonparametric model is important because parametric models may be misspecified. We show that the price to be paid is that mild restrictions must then be imposed on the nature of the common shocks.
The nonparametric version of the standard factor model is a special case of the model considered here. For this class of models, we show that, even though the kernel regression converges in probability to a random object measurable with respect to the common shocks, it is possible to identify and estimate the slope of the regression function. However, its location (e.g., the intercept in a linear model) is not identified even if we normalize common shocks to have a zero mean. To identify and estimate location, the dependent variable must be mean independent of the common shocks given the regressors.
Common factor models are typically applied to panel data sets (e.g., [14,17,18,19]). We view the present paper as a first step towards nonparametric panel data models that may incorporate a more general and flexible common factors structure. Indeed, in a companion paper, Souza-Rodrigues [27] develops a two-step nonparametric estimator that requires a “large-N, large-T” dataset for a generalized regression model based on the identification results of Berry and Haile [31]. The estimator applies equally to datasets with a large number of individuals in different groups and a large number of groups. The empirical application in Souza-Rodrigues [27] considers the impact of hospital volumes of surgical procedures on individual health status (e.g., mortality rate). Group-level observables (i.e., hospital volume of surgeries) may be correlated with group-level unobservables (hospital unobserved quality), which, in turn, may be indexed by individual characteristics (since an unobserved hospital characteristic that is helpful for patients with some demographic characteristics may not be as helpful for other patients). The strategy proposed by Souza-Rodrigues [27] is to run a nonparametric regression of individual outcomes on individual observables within each group (hospital) in the first step. This is a nonparametric regression with common shocks, where the common shocks are the group-level observables and unobservables. Because the group-level unobservables may be a (random) function of individual characteristics, it is important to allow for this possibility, as we do here.
The results of the present paper can be incorporated in other nonlinear panel data settings.
The present paper also relates to the literature on spatial dependence. Typically, in this literature, common shocks are presumed to have predominantly local effects, and the dependence is modelled as a function of an exogenously-given spatial or economic distance, with some form of stationary mixing condition analogous to time series data. Recent nonparametric versions of spatial models have been considered by Martins-Filho and Yao [37] and Gerolimetto and Magrini [38], among others. Although the present paper can incorporate common shocks with differential local effects (e.g., assuming that individual factor loadings include geographic location), we do not allow individual outcomes to depend on the characteristics of other individuals. We therefore view spatial dependence models as complementary to ours.
Robinson [44] provides an alternative way of modelling cross-sectional dependence. He considers a nonparametric kernel regression in which the disturbances are represented by a (possibly infinite) sum of independent random variables with unknown weights. The structure of the disturbances is sufficiently rich to cover spatial dependence models, but, since it does not require known economic distances, it can accommodate stronger forms of dependence than mixing conditions. Robinson [44] investigates the properties of kernel estimators, and Robinson and Lee [45] study the properties of sieve estimators within this framework. The present paper can accommodate disturbances of the type represented by Robinson [44], but with a vector of common shocks in place of the vector of independent random variables. We do not require the vector of common shocks to be independent random variables, and we allow for potentially correlated random weights in the summation term for the disturbances. However, the restrictions we need to impose on the common shocks differ from the assumptions in Robinson [44]. Furthermore, we require an i.i.d. sampling scheme that is assumed neither by Robinson [44] nor by the spatial dependence literature. Our model is therefore neither more general than, nor a special case of, Robinson’s model.
The paper is organized as follows: Section 2 presents the regression model and discusses sufficient conditions to guarantee the existence of conditional densities. Section 3 establishes the asymptotic properties of the Nadaraya-Watson kernel regression estimator and discusses its implications. Section 4 concludes. The Appendix presents the disintegration theory and briefly discusses the role of separability of common shocks in the existence of conditional densities. The Supplemental Material presents results for the kernel density estimator, contains all relevant proofs and discusses the probabilistic framework adapted from Andrews [8] that justifies the approach taken here.
2. Regression Model and Conditional Densities
The dataset is
, where
(
) and
(
). Consider the model:
where
(
, with
) is a vector of individual-specific random variables;
is the common shock; and
is the idiosyncratic error. Some components of
may be observable (in which case, it may be incorporated in
) or it may be completely unobservable. We allow the common shock
to be either a random vector (possibly infinite-dimensional) or a random function of
. In the latter case, the common shocks may affect individuals differently. As usual, we use upper-case letters to denote random quantities and lower-case letters to denote realizations.
The standard parametric factor model is a special case of our model and is typically written as:
where
is the vector of individual-specific factor loadings;
is the vector of unobserved common factors;
is the idiosyncratic error that is independent of
and has zero mean; and
is the vector with the parameters of interest. Cross-sectional dependence in the disturbances is generated by the term
. The standard model can also accommodate cross-sectional dependence on regressors
. For example, consider the expanded vectors
and
and take
and
. Note that if
, then
and
are correlated even when
and
are independent of each other (e.g., [
8,
10,
11]). The nonparametric version of (
2) takes
, with the same structure for the disturbances
.
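To fix ideas, the cross-sectional dependence generated by the factor-loading term in the standard model can be illustrated with a small simulation. The design below (three Gaussian factors, i.i.d. loadings with unit mean, a scalar idiosyncratic error) is purely illustrative and not part of the model in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def disturbance_correlation(n_reps=4000):
    """Correlation between the composite disturbances u_1 and u_2 of two units,
    across repeated draws of the common factors F (hypothetical design)."""
    u1 = np.empty(n_reps)
    u2 = np.empty(n_reps)
    for r in range(n_reps):
        F = rng.normal(size=3)                    # unobserved common factors
        lam = rng.normal(loc=1.0, size=(2, 3))    # i.i.d. factor loadings for 2 units
        eps = rng.normal(size=2)                  # idiosyncratic errors
        u = lam @ F + eps                         # u_i = lam_i' F + eps_i
        u1[r], u2[r] = u[0], u[1]
    return np.corrcoef(u1, u2)[0, 1]

rho = disturbance_correlation()   # positive (about 3/7 in this design): dependence via F alone
```

Even though loadings and idiosyncratic errors are i.i.d. across units, the shared draw of F makes the disturbances cross-sectionally correlated.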
The standard factor model (
2) is a special case of our model (
1) with the regression function given by the linear and additively separable
and the common shock function given by
. We therefore generalize the standard model in the following ways: (i) we let the regression function
be nonparametric; (ii) we allow the regressors
to freely interact with the common shock
; and (iii) we let the common shock be a general function of individual-specific factor loadings
(subject to the restrictions discussed below). Furthermore, factor models typically impose independence between
and
and assume that
is a mutually independent vector, while we do not need to impose these independence assumptions. We, however, do not consider a fully-non-separable model; we maintain the additive separability assumption in the idiosyncratic error
.
Robinson [
44] also considers a nonparametric version of (
2), but with another structure for
. He considers the model:
where
are scalar unknown functions;
are independent random variables with zero mean and unit variance; and
are unknown fixed weights.
The present paper is comparable to Robinson [
44] when the following holds:
, with
,
and
. Unlike Robinson [
44], we allow for (potentially correlated) random coefficients
and do not restrict
to be independent variables with zero mean and unit variances. The restrictions we need to impose on the function
are discussed below and are of a different nature than the assumptions used by Robinson [
44].
Data Generation
Denote the vector , where . Define the measurable space , where is the Borel sigma-field. The random elements are defined on , where is the product space and is the product Borel sigma-field on . We suppose the common shocks across observations are captured by the σ-field generated by C, denoted by . We impose the following assumption:
Assumption 1 The sequence is i.i.d. conditional on the σ-field .
As shown by Andrews [8], this assumption is valid when the units are drawn randomly from the population. One difference between the present paper and Andrews [8] is that he states the existence of some σ-field such that the data are i.i.d. conditional on it, without specifying a priori how this σ-field is constructed, while we impose more structure and state explicitly how the σ-field is generated. Andrews’ framework is, therefore, more general than ours in this respect. Note that neither the spatial dependence models nor Robinson’s [44] approach requires random sampling.
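Assumption 1 can be made concrete with a simulation: draw the common shock once, then draw the units i.i.d. given that shock. The design (a scalar additive shock and the regression function sin(x)) is hypothetical, chosen only to illustrate the sampling scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_outcomes(n, c=None):
    """Draw Y_1, ..., Y_n i.i.d. conditional on a scalar common shock c
    (hypothetical design: Y_i = sin(X_i) + c + eps_i). If c is None,
    the shock itself is drawn first, as under Assumption 1."""
    if c is None:
        c = rng.normal()
    X = rng.normal(size=n)
    return np.sin(X) + c + 0.1 * rng.normal(size=n)

# Conditional on c, units are independent; unconditionally, they are
# exchangeable but dependent, because every unit shares the same draw of c.
cond = np.array([draw_outcomes(2, c=0.5) for _ in range(4000)])
uncond = np.array([draw_outcomes(2) for _ in range(4000)])
corr_cond = np.corrcoef(cond[:, 0], cond[:, 1])[0, 1]
corr_uncond = np.corrcoef(uncond[:, 0], uncond[:, 1])[0, 1]
```

Holding c fixed, the sample correlation between two units is near zero; integrating over c, it is strongly positive.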
Existence of Conditional Densities
Because the Nadaraya-Watson kernel estimator requires the existence of conditional densities, we now discuss the existence problem.
To guarantee the existence of conditional densities that allow for very general common shocks, we make use of the disintegration theory. Disintegration of a probability measure is a collection of regular conditional probabilities, each satisfying (i) a concentration property (i.e., conditional on an event, the probability of its complement is zero) and (ii) a decomposition property (i.e., the probability of an event is a weighted sum of the conditional probability measures, also known as the law of total probability).
The reader unfamiliar with disintegration theory might want to read the Appendix (or the references cited there) before proceeding.
Define the sub-vector , for . We want to guarantee the existence of the conditional density of given C. By Assumption 1, the probability distribution of , denoted by , is exchangeable on . Call the marginal distribution of under . We impose the following:
Assumption 2 (i) is a metric space.
(ii)
λ is a sigma-finite Radon measure on
.
(iii) C maps into . is a separable metric space, and is the Borel σ-field.
(iv) μ is a sigma-finite measure on . Let the measure induced by C and λ on be absolutely continuous with respect to μ.
(v) Let , for any , be absolutely continuous with respect to λ. Denote its Radon-Nikodym density by .
Assumption 2(iii) requires
to be a separable metric space. This is trivially satisfied when
belongs to a finite-dimensional Euclidean space. However, if
C is an infinite dimensional vector of random variables, we need restrictions, such as
, for some
, where
is the space of sequences with finite
-norm, and we need to rule out the case
, because
is non-separable. Similarly, if
C is a random function of
, it must belong to spaces, such as the
space for
, or the space of bounded and continuous functions defined on a closed bounded subset of
and equipped with the sup-norm, or the Hölder space, etc. However, it cannot belong to the space of bounded functions with the sup-norm,
, because it is not separable. See the discussion about the role of separability for existence of conditional densities in the
Appendix.
The restrictions in Assumption 2 are mild and sufficient to guarantee the existence of conditional densities of
given
C for any
. The reason for sufficiency is the following: first, Assumptions 2(i)–(iv) are sufficient for the sigma-finite Radon measure
λ to have a
-disintegration; i.e., they guarantee the existence of a collection of measures, denoted by
, that satisfy the aforementioned concentration and decomposition properties (but note that
do not have to be probability measures; see Definition 3 and Theorem 1 in the
Appendix).
Second, if the disintegration
exists and the probability measure
on
is absolutely continuous with respect to
λ with density
(Assumption 2(v)), then two implications follow (see Theorem 2 in the
Appendix): (i) the probability distribution of
C induced by
(i.e., the image measure
) is absolutely continuous with respect to
μ with density:
and (ii) the probability measure
has a conditional distribution given
C, denoted by the collection
, where
is defined by having density:
with respect to
for
-almost all
. The conditional density
is therefore similar to elementary conditional densities: it is the ratio of the joint density
and the marginal
. However, it does not require
C to belong to a finite-dimensional Euclidean space.
Because C is common to all i, the equality follows for all . In addition, for all and for Q-almost all by Assumption 1. We state this result as a lemma:
Lemma 1. Let Assumptions 1 and 2 hold. Then, there exist conditional densities of given C, for all , defined by: for Q-almost all , where .
Example 1. Suppose is scalar and is the separable Hilbert space . Take a basis for and represent the common shock by , where for . Note that one can define , in which case the random coefficients are not independent of each other. More important for us is to note that selecting a function in is equivalent to selecting the infinite-dimensional vector in . Let be the Borel σ-field on and be the Borel σ-field on . Because the spaces and are homeomorphic, their topologies are equivalent, and so, and are equivalent. As a result, the event on is equivalent to the (potentially more intuitive) event on . In addition, conditioning on is equivalent to conditioning on . We have, therefore, and: for any measurable set A.
Example 1 intends to translate properties of conditional probabilities given an element in some abstract space of functions into properties in (hopefully) more concrete spaces defined by random vectors. Example 1, however, does not apply when
is not a Hilbert space. Although we may approximate any of the separable metric spaces by other simpler spaces, the conditioning argument does not hold without running into problems, such as the Borel paradox (see, e.g., [
47]). For instance, take
to be the set of bounded and continuous functions,
. It is separable, and any
can be well approximated by a polynomial of order
, say
with some coefficients
. Because we can take
J such that
, for some
, the probability of the event
is close to the probability of the event
. However, the topology of
is not the same as the topology of the Euclidean
for any finite
J. Therefore, the Borel
σ-field on
is different from the Borel
σ-field on any
. Conditioning on different
σ-fields delivers different conditional probability distributions, and so, we are not guaranteed to have
close to
for all measurable sets
A. We can still obtain the existence of conditional densities, but we cannot derive conclusions based on some approximation
for
, no matter how large
J is.
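The correspondence in Example 1 between a random function and its coefficient sequence can be sketched numerically. The snippet below (an illustration only: a cosine basis for L2[0,1], a truncation at J terms, and made-up decaying coefficients) shows one realization of a common shock function and recovers its coefficients by inner products:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 2001)

def phi(j):
    """Orthonormal cosine basis for L2[0,1]: phi_0 = 1, phi_j = sqrt(2) cos(j*pi*x)."""
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(j * np.pi * x)

def inner(f, g):
    """Trapezoidal approximation of the L2[0,1] inner product."""
    y = f * g
    return np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0

# One realization of the common shock, truncated at J terms: C = sum_j g_j * phi_j.
J = 8
g = rng.normal(size=J) / (1.0 + np.arange(J))   # decaying coefficients, an element of l2
C = sum(g[j] * phi(j) for j in range(J))

# Conditioning on the function C carries the same information as conditioning on
# the coefficient vector g, which can be read back off by inner products.
g_hat = np.array([inner(C, phi(j)) for j in range(J)])
```

The recovered coefficients match the generating ones up to numerical integration error, illustrating why fixing the function and fixing its coefficient sequence are interchangeable in the separable Hilbert-space case.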
3. Regression Estimator
Next, we consider the properties of the Nadaraya-Watson kernel regression estimator:
where
(·) is the kernel function and
is the bandwidth. As previously mentioned, the objective here is to work as closely as possible to the standard kernel literature. The assumptions we impose are therefore similar to the standard assumptions (see Pagan and Ullah [48]), but with the population density and regression function replaced by the corresponding conditional functions and with the extra “Q-almost all c” qualifiers added. For brevity, we relegate the properties of the kernel density estimator to the Supplemental Material.
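For concreteness, a minimal implementation of the Nadaraya-Watson estimator is sketched below. The Gaussian kernel, the bandwidth value, the design and the regression function sin(x) are illustrative choices, not taken from the text:

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """Nadaraya-Watson estimate of the regression function at x0:
    a kernel-weighted average of the Y_i, with bandwidth h."""
    K = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel weights
    return np.sum(K * Y) / np.sum(K)

rng = np.random.default_rng(3)
X = rng.uniform(-2.0, 2.0, size=5000)
Y = np.sin(X) + 0.1 * rng.normal(size=5000)
m_hat = nadaraya_watson(1.0, X, Y, h=0.15)   # close to sin(1) in this design
```

In this common-shock-free design the estimate is close to the population regression function at the evaluation point; the sections below discuss what changes when common shocks are present.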
We maintain Assumptions 1 and 2 from now on. In addition, we impose the following conditions:
Condition 1. Let K be the class of all Borel measurable nonnegative bounded real-valued functions , such that: (i) ; (ii) ; (iii) as ; (iv) ; (v) ; and (vi) .
Condition 2. For Q-almost all , the conditional density is continuous at any point .
Condition 3. (i) as and (ii) as .
Condition 4. For Q-almost all c, (i) is twice continuously differentiable with respect to x in some neighbourhood of and (ii) the second-order derivatives of with respect to x are bounded in this neighbourhood.
Condition 5. For Q-almost all c, the point is in the interior of the support of X conditional on and , for some finite ξ.
Condition 6. The kernel K is a symmetric function satisfying .
Condition 7. (i) a.s.; and (ii) let , and assume a.s..
Condition 8. For Q-almost all c, the function is twice continuously differentiable with respect to x in some neighbourhood of .
Conditions 1–5 suffice to obtain the asymptotic properties of the kernel density estimator (consistency, rate of convergence and asymptotic distribution; see the Supplemental Material). Condition 6 is standard in the literature.
Condition 7(i) implies
. In the standard factor model, this translates into:
Note that
is a random object because
C has not been fixed. Typically in the literature,
is assumed to be independent of
, in which case
where
is an unknown constant. Unlike the standard model, here, we allow
J to be infinite (as long as
C belongs to an appropriate separable metric space); we do not require
to be independent of
, and we allow for more complicated interactions between
X and
C.
Condition 7(ii) allows for conditional heteroskedasticity; and Condition 8 is used to apply Q-almost sure Taylor expansions similar to what is usually done in the kernel literature.
Remark 1. Condition 8 requires to be twice continuously differentiable in x for almost all c. To fix ideas, consider the following case: let , and . Conditioned on the event , we have that: while conditioning only on the event , we obtain the random object: Therefore, to satisfy Condition 8, we need to be twice continuously differentiable with respect to both the first and second arguments, and we also need to be twice continuously differentiable with respect to x.
To obtain the consistency of , we first show that the kernel density estimator converges in probability to the conditional density . Then, we prove that the mean-squared error of conditional on converges to zero in probability. Finally, consistency follows by the dominated convergence theorem. We then show that the rate of convergence is the same as the rate of convergence without common shocks. The pointwise asymptotic distribution is obtained using the martingale difference sequence central limit theorem.
Proposition 1. Let denote the conditional expectation given , . Let Assumptions 1 and 2 and Conditions 1–8 hold. Then:
- 1. as
- 2.
- 3. Suppose also that and , for some . Define . Then, (i) as : and (ii) if, in addition, as , then: as .
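The random probability limit in Proposition 1.1 can be illustrated with a simulation. In the hypothetical design below (a scalar shock entering additively, with mean zero), the unconditional regression of Y on X is sin(x), but given the realized shock c the conditional regression is sin(x) + c, and it is the latter that the kernel estimate tracks:

```python
import numpy as np

rng = np.random.default_rng(4)

def nw(x0, X, Y, h):
    # Nadaraya-Watson estimate with a Gaussian kernel (illustrative sketch).
    K = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(K * Y) / np.sum(K)

# Hypothetical design: Y_i = sin(X_i) + c + eps_i with E[c] = 0, so that
# E[Y | X = x] = sin(x) while E[Y | X = x, C = c] = sin(x) + c.
c = 0.8                                        # the realized common shock
X = rng.uniform(-2.0, 2.0, size=8000)
Y = np.sin(X) + c + 0.1 * rng.normal(size=8000)
m_hat = nw(0.5, X, Y, h=0.15)                  # tracks sin(0.5) + c, not sin(0.5)
```

No matter how large the sample, a single cross-section carries only one draw of c, so the estimator settles on the shock-dependent (random) limit rather than on the unconditional regression function.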
Proposition 1.1 shows that the kernel regression estimator converges in probability to the random object
. In general,
is different from the conditional expectation
; the equality
only holds when
Y is mean independent of
C given
X. To see how this difference may affect the interpretation of potential estimands, take the standard factor model as an example.
In this case,
is given by (
9), while
is given by:
If we assume, as is usually done, that is independent of , we have that . If there is no cross-sectional dependence on regressors resulting from the common shocks, then . In addition, if we normalize for all j, then , while . Because Y is not mean independent of C given X, .
Although we cannot estimate consistently, it is possible to identify and estimate β by noting that , for . Similarly, for nonparametric factor models, , one can identify and estimate the slope of . However, the presence of the common shocks prevents the identification of the intercept α in the linear model (and the identification of the location of in the nonparametric model) even if we normalize for all j.
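The point that the slope is identified while the location is not can be illustrated numerically. In the sketch below (a linear factor model with an additive scalar shock; all numbers are made up), differences of kernel fits across two evaluation points recover the slope for any realization of the shock, while the level of the fit moves one-for-one with the shock:

```python
import numpy as np

rng = np.random.default_rng(5)

def nw(x0, X, Y, h):
    # Nadaraya-Watson estimate with a Gaussian kernel (illustrative sketch).
    K = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(K * Y) / np.sum(K)

def fit_two_points(c, alpha=1.0, beta=2.0, n=8000, h=0.15):
    """Kernel fits at x = 0 and x = 1 for the hypothetical linear model
    Y_i = alpha + beta * X_i + c + eps_i under one realized shock c."""
    X = rng.uniform(-2.0, 2.0, size=n)
    Y = alpha + beta * X + c + 0.1 * rng.normal(size=n)
    return nw(0.0, X, Y, h), nw(1.0, X, Y, h)

m0_a, m1_a = fit_two_points(c=0.7)
m0_b, m1_b = fit_two_points(c=-1.3)
slope_a = m1_a - m0_a    # both differences recover beta = 2 ...
slope_b = m1_b - m0_b
shift = m0_a - m0_b      # ... but the level shifts with c (here by 0.7 - (-1.3) = 2)
```

The slope estimate is invariant to the shock realization, but the intercept absorbs the shock and cannot be separated from it using a single cross-section.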
Remark 2. The nonparametric factor model with , and , for all j, has a structure similar to the one proposed by Robinson [44]. Yet, while Robinson [44] shows that the kernel regression estimator converges in probability to , we obtain convergence to . An important distinction comes from the assumption on the sampling process. Because we have exchangeable data given the common shocks (Assumption 1), the conditions we impose are not sufficient to “get rid of” C in the limit. Robinson [44], in contrast, does not impose the conditional i.i.d. sampling process.
Returning to the standard factor model, if we assume now the presence of cross-sectional dependence on regressors captured by, say,
with
, then
and:
Again,
Y is not mean independent of
C given
X, so
, but it is still possible to identify
β in the parametric model and the slope of
in the nonparametric version.
In the standard factor model, Y is mean independent of C given X only if the common shocks have no direct effect on Y. This is the case when . When this is true, , and the kernel regression converges in probability to , even when there is cross-sectional dependence on X. In this case, we identify both parameters α and β in the linear model and in the nonparametric model. Note that assuming for all j is not an innocuous normalization, but a substantive assumption.
Remark 3. The last case is similar to Andrews [8]. Let and , where is mutually independent and (see Andrews’ Assumption SF1). Imposing is similar to imposing Andrews’ Condition SF3. Assuming Condition 7(i) together with mutual independence is similar to Andrews’ Condition SF2.
Proposition 1.2 shows that the rate of convergence of the kernel regression in the presence of common shocks is the same as the rate of convergence without common shocks.
Proposition 1.3 presents the asymptotic distribution of the kernel regression estimator. It shows that even when , the common shocks affect the asymptotic distribution of the kernel regression because they may impact both the conditional variance of Y and the conditional density of X. This result is similar to that of Andrews [8], Robinson [44] and others.
Remark 4. A consequence of Proposition 1.3 is that inference results depend on whether Y is mean independent of C given X. To test a null hypothesis, say, against , the corresponding t statistic is: The usual two-sided t test with significance level α rejects the null if , where is the α quantile of the standard normal distribution. If Y is mean independent of C given X, then as . Otherwise, we have as .
Remark 5. The bandwidth can be chosen by minimizing the approximated integrated mean squared error () conditional on . The bandwidth must be a -measurable random variable, . In the Supplemental Material, we show that , and one might expect both plug-in and cross-validation estimators to be consistent. The usual concerns in the literature about how to select the bandwidth are present here, but for brevity, we do not investigate the topic further. We only emphasize that the bandwidth choice based on the unconditional is infeasible because it is impossible to estimate the distribution of C (and integrate that out) using a single cross-sectional dataset.
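The dichotomy in Remark 4 can be illustrated numerically. The sketch below uses a textbook kernel variance approximation for the studentization (not the paper's exact formula) and a hypothetical additive shock; when the shock enters Y directly, the t statistic computed against the shock-free regression value explodes, while it stays moderate in the mean-independent case:

```python
import numpy as np

rng = np.random.default_rng(6)

def nw_t_stat(x0, X, Y, h, m0):
    """t statistic for H0: m(x0) = m0, with the standard approximation
    Var(m_hat) ~ sigma^2(x0) * sum(K^2) / (sum K)^2 (a sketch only)."""
    K = np.exp(-0.5 * ((X - x0) / h) ** 2)
    m_hat = np.sum(K * Y) / np.sum(K)
    sigma2 = np.sum(K * (Y - m_hat) ** 2) / np.sum(K)   # local variance estimate
    se = np.sqrt(sigma2 * np.sum(K ** 2)) / np.sum(K)
    return (m_hat - m0) / se

n, h = 20000, 0.1
X = rng.uniform(-2.0, 2.0, size=n)
eps = 0.5 * rng.normal(size=n)
t_clean = nw_t_stat(0.0, X, np.sin(X) + eps, h, m0=0.0)          # Y mean indep. of C
t_shock = nw_t_stat(0.0, X, np.sin(X) + 0.6 + eps, h, m0=0.0)    # shock shifts Y
```

In the first case the statistic behaves like a standard normal draw; in the second, it is of order sqrt(nh) times the shock and diverges as the sample grows, mirroring the size distortion described in the text.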