This paper considers a nonparametric regression model for cross-sectional data in the presence of common shocks. Common shocks are allowed to be very general in nature; they do not need to be finite dimensional with a known (small) number of factors. I investigate the properties of the Nadaraya-Watson kernel estimator and determine how general the common shocks can be while still obtaining meaningful kernel estimates. Restrictions on the common shocks are necessary because kernel estimators typically manipulate conditional densities, and conditional densities do not necessarily exist in the present case. By appealing to disintegration theory, I provide sufficient conditions for the existence of such conditional densities and show that the estimator converges in probability to the Kolmogorov conditional expectation given the sigma-field generated by the common shocks. I also establish the rate of convergence and the asymptotic distribution of the kernel estimator.
Keywords: nonparametric regression; common shocks; cross-sectional dependence; disintegration theory
JEL Classification: C13, C14, C21
1. Introduction
Cross-sectional dependence has attracted considerable attention among economists recently.1 It is well-known that ignoring cross-sectional dependence may lead to inconsistent estimators and misleading inference. A popular and successful way to capture cross-sectional dependence is through common factors.2 Common factor models assume a finite number of unobserved factors that may be the result of economy-wide shocks with impacts on population units that may depend on the characteristics of the unit. Possible common factors include macroeconomic, technological, legal/institutional, political, environmental, health and sociological shocks, among others. The applied literature has considered, for example, technological shocks (such as new procedures, drugs and surgical techniques) affecting the relationship between countries’ healthcare attainments and their per capita health expenditures and educational levels (e.g., ); cross-country cross-industry analysis of returns to R&D, which are affected both by global shocks, such as the recent financial crisis, and by local shocks, such as spillovers between a limited group of industries or countries (e.g., ); and the analysis of transnational terrorism, where common factors may arise from common terrorist training camps, common grievances and demonstration effects (cf. ).
Typically, common factor models allow for a small and known number of unobserved factors. Although such an approach is convincing in empirical macro models, in microeconometric models it is often more reasonable to think of a potentially large, possibly unknown (and maybe infinite) number of factors that can influence individuals’ behaviour. For instance, in studies of individual earnings, there are many individual-level observables and unobservables that affect income, as well as several common factors, such as region, family, male/female ratio, race composition, education, age composition, and so on (cf. ). The number of common factors may increase as we collect more cross-sectional observations, or there may be an infinite number of unobserved factors (see, e.g., ).
The purpose of this paper is to study a nonparametric regression model for cross-sectional data in the presence of common shocks that are very general in nature. The common shocks can be of infinite dimension with flexible impact on different population units. For example, common shocks could take the form of a nonlinear random function of observable or unobservable individual characteristics with the effect on the i-th observation varying continuously across i depending on the value of the characteristic. We focus on nonparametric models because there may be little guidance (or justification) in practice for selecting a particular functional form for the regression function.
There has been important recent work on nonparametric models with a large but finite number of common factors (e.g., [5,15,16]). These papers consider common shocks that enter the regression function additively, with disturbances modelled as linear functions of mutually independent unobserved common factors and individual-specific factor loadings. We, in contrast, allow the regression function to be non-separable in the common shocks, and we do not require the mutual independence assumption. In other words, we allow for an unknown, large and potentially infinite number of factors that can influence individuals’ outcomes and that may interact with observable and unobservable individual characteristics in rich and flexible ways. To the best of our knowledge, this is the first paper that allows for such a flexible framework.
We consider this flexible setting because we are interested in investigating how general the regression function and the common shocks can be while still allowing for meaningful nonparametric estimates. We focus on the Nadaraya-Watson kernel estimator and study the effects of general common shocks on its asymptotic properties. Asymptotic results for kernel estimators are typically obtained by manipulating conditional densities of random variables. However, if the common shocks are too general, conditional densities do not necessarily exist. Doob  (pp. 623–624) and Halmos  (Section 48) present some examples of non-existence. If conditional densities do not exist, then what we would expect to be the probability limit of the kernel estimator in the present context is either meaningless or difficult to interpret.3
The idea here is to let the common shocks be as general as possible and to work with well-defined conditional densities that adhere as closely as possible to the standard kernel literature. To do so, we appeal to the disintegration theory for conditional distributions that can be found in Pollard , Dellacherie and Meyer  and Hoffmann-Jorgensen . We find that an important sufficient condition to guarantee the existence of conditional densities is that the common shocks must belong to a separable metric space equipped with the Borel σ-field. We conclude that the sufficient conditions are mild and not very restrictive in practice.4
Given the existence of conditional densities, we adjust the standard assumptions of the kernel literature to the present case. We show that the Nadaraya-Watson kernel estimator converges in probability to the Kolmogorov conditional expectation given the sigma-field generated by the common shocks. The optimal rate of convergence is the same as the rate obtained when the observations are i.i.d. The asymptotic distribution is mixed normal with weights depending on the common shocks. It is obtained by applying a martingale difference sequence central limit theorem. We find that inference depends on how the common shocks affect the regression variables. A dichotomy similar to that of Andrews  is present here: if the dependent variable is mean independent of the common shocks given the explanatory variables, the usual t-test has the correct size; but if the dependent variable is not mean independent, the t statistic diverges to infinity in probability under the null hypothesis.
The closest paper in the literature to ours is that of Andrews , who considers a linear regression model in the presence of general common shocks. He shows that the least-squares estimator converges in probability to Kolmogorov conditional expectations given the σ-field generated by the common shocks. The random probability limit is a well-defined object because the Kolmogorov conditional expectation always exists. Andrews, therefore, does not need to guarantee the existence of conditional densities. Extending his results to a nonparametric model is important because parametric models may be misspecified. We show that the price to be paid is that mild restrictions must then be imposed on the nature of the common shocks.
The nonparametric version of the standard factor model is a special case of the model considered here. For this class of models, we show that, even though the kernel regression converges in probability to a random object measurable with respect to the common shocks, it is possible to identify and estimate the slope of the regression function. However, its location (e.g., the intercept in a linear model) is not identified even if we normalize common shocks to have a zero mean. To identify and estimate location, the dependent variable must be mean independent of the common shocks given the regressors.
Common factor models are typically applied to panel data sets (e.g., [14,17,18,19]). We view the present paper as a first step towards nonparametric panel data models that may incorporate a more general and flexible common factor structure. Indeed, in a companion paper, Souza-Rodrigues  develops a two-step nonparametric estimator that requires a “large-N, large-T” dataset for a generalized regression model based on the identification results of Berry and Haile . The estimator applies equally to datasets with a large number of individuals in different groups and a large number of groups. The empirical application in Souza-Rodrigues  considers the impact of hospital volumes of surgical procedures on individual health status (e.g., mortality rate).5 Group-level observables (i.e., hospital volume of surgeries) may be correlated with group-level unobservables (hospital unobserved quality), which, in turn, may be indexed by individual characteristics (since an unobserved hospital characteristic that is helpful for patients with some demographic characteristics may not be as helpful for other patients). The strategy proposed by Souza-Rodrigues  is to run, in the first step, a nonparametric regression of individual outcomes on individual observables within each group (hospital). This is a nonparametric regression with common shocks, where the common shocks are the group-level observables and unobservables. Because the group-level unobservables may be a (random) function of individual characteristics, it is important to allow for this possibility, as we do here.6 The results of the present paper can be incorporated in other nonlinear panel data settings.
The present paper also relates to the literature on spatial dependence.7 Typically, in this literature, common shocks are presumed to have predominantly local effects, and the dependence is modelled as a function of an exogenously-given spatial or economic distance, with some form of stationary mixing condition analogous to those used for time series data. Recent nonparametric versions of spatial models have been considered by Martins-Filho and Yao  and Gerolimetto and Magrini , among others. Although the present paper can incorporate common shocks with differential local effects (e.g., assuming that individual factor loadings include geographic location), we do not allow individual outcomes to depend on the characteristics of other individuals. We therefore view spatial dependence models as complementary to ours.
Robinson  provides an alternative way of modelling cross-sectional dependence. He considers a nonparametric kernel regression in which the disturbances are represented by a (possibly infinite) sum of independent random variables with unknown weights. The structure in the disturbances is sufficiently rich to cover spatial dependence models, but, since it does not require known economic distances, it can accommodate stronger forms of dependence than mixing conditions. Robinson  investigates the properties of kernel estimators, and Robinson and Lee  study the properties of sieve estimators within this framework. The present paper can accommodate disturbances of the type represented by Robinson , but with a vector of common shocks in place of the vector of independent random variables. We do not require the vector of common shocks to be independent random variables, and we allow for potentially correlated random weights in the summation term for the disturbances. However, the restrictions we need to impose on the common shocks differ from the assumptions in Robinson . Furthermore, we require an i.i.d. sampling scheme (conditional on the common shocks), which is assumed neither by Robinson  nor by the spatial dependence literature. Our model is therefore neither more general than Robinson’s model nor a special case of it.
The paper is organized as follows: Section 2 presents the regression model and discusses sufficient conditions to guarantee the existence of conditional densities. Section 3 establishes the asymptotic properties of the Nadaraya-Watson kernel regression estimator and discusses its implications. Section 4 concludes. The Appendix presents the disintegration theory and briefly discusses the role of separability of common shocks in the existence of conditional densities. The Supplemental Material presents results for the kernel density estimator, contains all relevant proofs and discusses the probabilistic framework adapted from Andrews  that justifies the approach taken here.
2. Regression Model and Conditional Densities
The dataset is , where () and (). Consider the model:
where (, with ) is a vector of individual-specific random variables; is the common shock; and is the idiosyncratic error. Some components of may be observable (in which case, it may be incorporated in ) or it may be completely unobservable. We allow the common shock to be either a random vector (possibly infinite-dimensional) or a random function of . In the latter case, the common shocks may affect individuals differently. As usual, we use upper-case letters to denote random quantities and lower-case letters to denote realizations.
The standard parametric factor model is a special case of our model and is typically written as:
where is the vector of individual-specific factor loadings; is the vector of unobserved common factors; is the idiosyncratic error that is independent of and has zero mean; and is the vector with the parameters of interest. Cross-sectional dependence in the disturbances is generated by the term . The standard model can also accommodate cross-sectional dependence on regressors . For example, consider the expanded vectors and and take and . Note that if , then and are correlated even when and are independent of each other (e.g., [8,10,11]). The nonparametric version of (2) takes , with the same structure for the disturbances .8
The standard factor model (2) is a special case of our model (1) with the regression function given by the linear and additively separable and the common shock function given by . We therefore generalize the standard model in the following ways: (i) we let the regression function be nonparametric; (ii) we allow the regressors to freely interact with the common shock ; and (iii) we let the common shock be a general function of individual-specific factor loadings (subject to the restrictions discussed below). Furthermore, factor models typically impose independence between and and assume that is a mutually independent vector, while we do not need to impose these independence assumptions. We, however, do not consider a fully-non-separable model; we maintain the additive separability assumption in the idiosyncratic error .
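To make the source of cross-sectional dependence in the standard factor model concrete, the following is a minimal simulation sketch. It is not taken from the paper: the number of factors, the normal distributions, the nonzero mean of the loadings and the helper name `simulate_disturbances` are all illustrative assumptions. It draws disturbances of the form "loadings times common factors plus idiosyncratic error" for two units and compares their correlation across many draws of the common factors with their correlation when the common factors are held fixed.

```python
import numpy as np

def simulate_disturbances(n_units, common_factors, rng):
    """Draw loadings' * factors + idiosyncratic error for n_units units.

    All distributions below are illustrative assumptions, not the paper's.
    """
    n_factors = common_factors.shape[0]
    # Individual-specific loadings with nonzero mean, independent across units.
    loadings = rng.normal(loc=1.0, scale=1.0, size=(n_units, n_factors))
    idiosyncratic = rng.normal(scale=0.5, size=n_units)
    return loadings @ common_factors + idiosyncratic

rng = np.random.default_rng(0)
n_replications, n_factors = 5000, 3

# Each replication is one "economy": a single draw of the common factors
# shared by two units whose loadings and errors are independent.
draws = np.empty((n_replications, 2))
for r in range(n_replications):
    common_factors = rng.normal(size=n_factors)
    draws[r] = simulate_disturbances(2, common_factors, rng)

# Across economies, the two units' disturbances are correlated.
unconditional_corr = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]

# Holding the common factors fixed, the same two units are uncorrelated.
fixed_factors = np.array([0.5, -1.0, 2.0])
fixed = np.array([simulate_disturbances(2, fixed_factors, rng)
                  for _ in range(n_replications)])
conditional_corr = np.corrcoef(fixed[:, 0], fixed[:, 1])[0, 1]

print(f"correlation across draws of the common factors: {unconditional_corr:.2f}")
print(f"correlation with the common factors held fixed: {conditional_corr:.2f}")
```

In this particular design the unconditional correlation arises only because the loadings have a nonzero mean; conditional on the common factors, the units behave as independent draws, which is the kind of structure the conditional i.i.d. assumption below formalizes.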
Robinson  also considers a nonparametric version of (2), but with another structure for . He considers the model:
where are scalar unknown functions; are independent random variables with zero mean and unit variance; and are unknown fixed weights.9 The present paper compares to Robinson  when the following holds: , with , and . Unlike Robinson , we allow for (potentially correlated) random coefficients and do not restrict to be independent variables with zero mean and unit variances. The restrictions we need to impose on the function are discussed below and are of a different nature than the assumptions used by Robinson .
Denote the vector , where . Define the measurable space , where is the Borel sigma-field. The random elements are defined on , where is the product space and is the product Borel sigma-field on . We suppose the common shocks across observations are captured by the σ-field generated by C, denoted by . We impose the following assumption:
Assumption 1. The sequence is i.i.d. conditional on the σ-field .
As shown by Andrews , this assumption is valid when the units are drawn randomly from the population. One difference between the present paper and Andrews  is that he states the existence of some σ-field, such that the data are i.i.d. conditional on it without specifying a priori how this σ-field is constructed, while we impose more structure and state explicitly how the σ-field is generated. Andrews’ framework is, therefore, more general than ours in this respect. Note that neither the spatial dependence models nor Robinson’s  approach require random sampling.10
Existence of Conditional Densities
Because we make use of the Nadaraya-Watson kernel estimator and because the kernel estimator requires the existence of conditional densities, we now discuss the existence problem.
To guarantee the existence of conditional densities that allow for very general common shocks, we make use of the disintegration theory. Disintegration of a probability measure is a collection of regular conditional probabilities, each satisfying (i) a concentration property (i.e., conditional on an event, the probability of its complement is zero) and (ii) a decomposition property (i.e., the probability of an event is a weighted sum of the conditional probability measures, also known as the law of total probability).11 The reader unfamiliar with disintegration theory might want to read the Appendix (or the references cited there) before proceeding.
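For readers who prefer a concrete case, the two defining properties can be checked by hand when everything is discrete. The sketch below is only an elementary illustration under assumed numbers (the 3 x 2 joint probability table and the chosen event are hypothetical); it says nothing about the infinite-dimensional case that motivates the paper, where existence is the delicate issue.

```python
import numpy as np

# joint[x, c] = P(X = x, C = c) on a 3 x 2 grid (hypothetical numbers).
joint = np.array([[0.10, 0.20],
                  [0.30, 0.05],
                  [0.10, 0.25]])
marginal_c = joint.sum(axis=0)  # distribution Q of the conditioning variable C

def conditional_measure(c):
    """Return P_c as a measure on the whole product space {(x, c')}."""
    p_c = np.zeros_like(joint)
    p_c[:, c] = joint[:, c] / marginal_c[c]
    return p_c

# Concentration: P_c assigns zero mass to the event {C != c}.
mass_off_fibre = [conditional_measure(c)[:, 1 - c].sum() for c in range(2)]

# Decomposition: P(A) equals the Q-weighted sum of P_c(A) for any event A.
event = np.array([[True, False],
                  [True, True],
                  [False, True]])
direct_probability = joint[event].sum()
decomposed_probability = sum(
    conditional_measure(c)[event].sum() * marginal_c[c] for c in range(2)
)

print(mass_off_fibre, direct_probability, decomposed_probability)
```

Each `conditional_measure(c)` is a probability measure on the full space that lives entirely on the slice where C equals c, and averaging these measures with weights given by the distribution of C recovers the original probability of any event.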
Define the sub-vector , for . We want to guarantee the existence of the conditional density of given C. By Assumption 1, the probability distribution of , denoted by , is exchangeable on . Call the marginal distribution of under . We impose the following:
(iii) C maps into . is a separable metric space, and is the Borel σ-field.
(iv) μ is a sigma-finite measure on . Let the measure induced by C and λ on be absolutely continuous with respect to μ.
(v) Let , for any , be absolutely continuous with respect to λ. Denote its Radon-Nikodym density by .
Assumption 2(iii) requires to be a separable metric space. This is trivially satisfied when belongs to a finite-dimensional Euclidean space. However, if C is an infinite dimensional vector of random variables, we need restrictions, such as , for some , where is the space of sequences with finite -norm, and we need to rule out the case , because is non-separable. Similarly, if C is a random function of , it must belong to spaces, such as the space for , or the space of bounded and continuous functions defined on a closed bounded subset of and equipped with the sup-norm, or the Hölder space, etc. However, it cannot belong to the space of bounded functions with the sup-norm, , because it is not separable. See the discussion about the role of separability for existence of conditional densities in the Appendix.13
The restrictions in Assumption 2 are mild and sufficient to guarantee the existence of conditional densities of given C for any . The reason for sufficiency is the following: first, Assumptions 2(i)–(iv) are sufficient for the sigma-finite Radon measure λ to have a -disintegration; i.e., they guarantee the existence of a collection of measures, denoted by , that satisfy the aforementioned concentration and decomposition properties (but note that do not have to be probability measures; see Definition 3 and Theorem 1 in the Appendix).
Second, if the disintegration exists and the probability measure on is absolutely continuous with respect to λ with density (Assumption 2(v)), then two implications follow (see Theorem 2 in the Appendix): (i) the probability distribution of C induced by (i.e., the image measure ) is absolutely continuous with respect to μ with density:
and (ii) the probability measure has a conditional distribution given C, denoted by the collection , where is defined by having density:
with respect to for -almost all . The conditional density is therefore similar to elementary conditional densities: it is the ratio of the joint density and the marginal . However, it does not require C to belong to a finite-dimensional Euclidean space.
Because C is common to all i, the equality follows for all . In addition, for all and for Q-almost all by Assumption 1. We state this result as a lemma:
Let Assumptions 1 and 2 hold. Then, there exist conditional densities of given C, for all , defined by:
Example 1. Suppose is scalar and is the separable Hilbert space . Take a basis for and represent the common shock by , where for . Note that one can define , in which case the random coefficients are not independent of each other. More important for us is to note that selecting a function in is equivalent to selecting the infinite dimensional vector in . Let be the Borel σ-field on and be the Borel σ-field on . Because the spaces and are homeomorphic, their topologies are equivalent, and so, and are equivalent.15 As a result, the event on is equivalent to the (potentially more intuitive) event on . In addition, conditioning on is equivalent to conditioning on . We have, therefore, and:
Example 1 is intended to translate properties of conditional probabilities given an element in some abstract space of functions into properties in (hopefully) more concrete spaces defined by random vectors. Example 1, however, does not apply when is not a Hilbert space. Although we may approximate any of the separable metric spaces by other simpler spaces, the conditioning argument does not hold without running into problems, such as the Borel paradox (see, e.g., ). For instance, take to be the set of bounded and continuous functions, . It is separable, and any can be well approximated by a polynomial of order , say with some coefficients . Because we can take J such that , for some , the probability of the event is close to the probability of the event . However, the topology of is not the same as the topology of the Euclidean for any finite J. Therefore, the Borel σ-field on is different from the Borel σ-field on any . Conditioning on different σ-fields delivers different conditional probability distributions, and so, we are not guaranteed to have close to for all measurable sets A. We can still obtain the existence of conditional densities, but we cannot derive conclusions based on some approximation for , no matter how large J is.
3. Regression Estimator
Next, we consider the properties of the Nadaraya-Watson kernel regression estimator:
where (·) is the kernel function and is the bandwidth. As previously mentioned, the objective here is to work as closely as possible to the standard kernel literature. The assumptions we impose are therefore similar to the standard assumptions (see Pagan and Ullah ), but with the population density and regression function substituted for the corresponding conditional functions and with the extra “Q-almost all c” qualifiers added. For brevity, we relegate the properties of the kernel density estimator to the Supplemental Material.
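As a point of reference, the textbook form of the Nadaraya-Watson estimator is a kernel-weighted average of the dependent variable, with weights given by the kernel evaluated at the scaled distance between each regressor value and the evaluation point. The following is a minimal sketch for a scalar regressor. The Gaussian kernel and the function names `gaussian_kernel` and `nadaraya_watson` are assumptions made here for convenience; they are not the paper's notation or implementation.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as an illustrative kernel choice."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x0, x, y, h, kernel=gaussian_kernel):
    """Kernel-weighted average of y at the evaluation point(s) x0.

    x and y are one-dimensional arrays of equal length; h is the bandwidth.
    """
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    # weights[j, i] = K((x_i - x0_j) / h)
    weights = kernel((x[None, :] - x0[:, None]) / h)
    return (weights @ y) / weights.sum(axis=1)

# Small usage example on a symmetric design with a linear regression function.
x = np.linspace(-1.0, 1.0, 201)
y = 2.0 * x + 1.0
estimate_at_zero = nadaraya_watson(0.0, x, y, h=0.2)[0]
print(estimate_at_zero)
```

Because the design is symmetric around zero and the kernel is symmetric, the weighted average of the linear term cancels at the origin, so the estimate there equals the intercept of the line up to floating-point error.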
We maintain Assumptions 1 and 2 from now on. In addition, we impose the following conditions:
Condition 1. Let K be the class of all Borel measurable nonnegative bounded real-valued functions , such that: (i) ; (ii) ; (iii) as ; (iv) ; (v) ; and (vi) .
Condition 2. For Q-almost all , the conditional density is continuous at any point .
Condition 3. (i) as and (ii) as .
Condition 4. For Q-almost all c, (i) is twice continuously differentiable with respect to x in some neighbourhood of and (ii) the second-order derivatives of with respect to x are bounded in this neighbourhood.
Condition 5. For Q-almost all c, the point is in the interior of the support of X conditional on and , for some finite ξ.
Condition 6. The kernel K is a symmetric function satisfying .
Condition 7. (i) a.s.; and (ii) let , and assume a.s.
Condition 8. For Q-almost all c, the function is twice continuously differentiable with respect to x in some neighbourhood of .
Conditions 1–5 suffice to obtain the asymptotic properties of the kernel density estimator (consistency, rate of convergence and asymptotic distribution; see the Supplemental Material). Condition 6 is standard in the literature.
Condition 7(i) implies . In the standard factor model, this translates into:
Note that is a random object because C has not been fixed. Typically in the literature, is assumed to be independent of , in which case where is an unknown constant. Unlike the standard model, here, we allow J to be infinite (as long as C belongs to an appropriate separable metric space); we do not require to be independent of , and we allow for more complicated interactions between X and C.
Condition 7(ii) allows for conditional heteroskedasticity; and Condition 8 is used to apply Q-almost sure Taylor expansions similar to what is usually done in the kernel literature.
Condition 8 requires to be twice continuously differentiable in x for almost all c. To fix ideas, consider the following case: let , and . Conditioned on the event , we have that:
while conditioning only on the event , we obtain the random object:
Therefore, to satisfy Condition 8, we need to be twice continuously differentiable with respect to both the first and second arguments, and we also need to be twice continuously differentiable with respect to x.17
To obtain the consistency of , we first show that the kernel density converges in probability to the conditional density . Then, we prove that the mean-squared error of conditional on converges to zero in probability. Finally, consistency follows by the dominated convergence theorem. We then show that the rate of convergence is the same as the rate of convergence without common shocks. The pointwise asymptotic distribution is obtained using the martingale difference sequence central limit theorem.
Let denote the conditional expectation given , . Let Assumptions 1 and 2 and Conditions 1–8 hold. Then:
Suppose also that and , for some . Define . Then, (i) as :
and (ii) if, in addition, as , then:
Proposition 1.1 shows that the kernel regression estimator converges in probability to the random object . In general, is different from the conditional expectation ; the equality only holds when Y is mean independent of C given X. To see how this difference may affect the interpretation of potential estimands, take the standard factor model as an example.18 In this case, is given by (9), while is given by:
If we assume, as is usually done, that is independent of , we have that . If there is no cross-sectional dependence on regressors resulting from the common shocks, then . In addition, if we normalize for all j, then , while . Because Y is not mean independent of C given X, .
Although we cannot estimate consistently, it is possible to identify and estimate β by noting that , for . Similarly, for nonparametric factor models, , one can identify and estimate the slope of . However, the presence of the common shocks prevents the identification of the intercept α in the linear model (and the identification of the location of in the nonparametric model) even if we normalize for all j.
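The identification argument can be illustrated numerically. The sketch below is a hedged simulation under assumptions that are not the paper's: a scalar uniform regressor, a single common shock whose realized value is fixed at 1.5, loadings with mean one that are independent of the regressor, a Gaussian kernel, and hypothetical parameter values `alpha = 1` and `beta = 2`. With one cross-section, the kernel estimate of the level absorbs the realized common shock, while differences of the estimate across evaluation points recover the slope.

```python
import numpy as np

def nw_estimate(x0, x, y, h):
    """Nadaraya-Watson estimate at a scalar point x0 with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(1)
n, h = 20_000, 0.2
alpha, beta = 1.0, 2.0     # hypothetical intercept and slope
realized_shock = 1.5       # one realization of a (zero-mean) common shock

x = rng.uniform(-2.0, 2.0, size=n)
loadings = rng.normal(loc=1.0, scale=0.5, size=n)  # independent of x, mean 1
u = rng.normal(scale=0.5, size=n)
y = alpha + beta * x + loadings * realized_shock + u

# Level at x = 0: close to alpha + E[loading] * realized_shock, not to alpha.
level_at_zero = nw_estimate(0.0, x, y, h)

# Difference between two points one unit apart: close to beta.
slope = nw_estimate(0.5, x, y, h) - nw_estimate(-0.5, x, y, h)

print(f"estimated level at 0: {level_at_zero:.2f} (alpha = {alpha})")
print(f"estimated slope:      {slope:.2f} (beta = {beta})")
```

In this design the level estimate settles near 2.5 rather than near the intercept of 1, because the realized common shock shifts every unit's conditional mean by the same amount; that common shift cancels when two evaluation points are differenced.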
The nonparametric factor model with , and , for all j, has a structure similar to the one proposed by Robinson . Yet, while Robinson  shows that the kernel regression estimator converges in probability to , we obtain convergence to . An important distinction comes from the assumption on the sampling process. Because we have exchangeable data given the common shocks (Assumption 1), the conditions we impose are not sufficient to “get rid of” C in the limit. Robinson , in contrast, does not impose the conditional i.i.d. sampling process.
Returning to the standard factor model, if we assume now the presence of cross-sectional dependence on regressors captured by, say, with , then and:
Again, Y is not mean independent of C given X, so , but it is still possible to identify β in the parametric model and the slope of in the nonparametric version.19
In the standard factor model, Y is mean independent of C given X only if the common shocks have no direct effect on Y. This is the case when . When this is true, , and the kernel regression converges in probability to , even when there is cross-sectional dependence on X. In this case, we identify both parameters α and β in the linear model and in the nonparametric model. Note that assuming for all j is not an innocuous normalization, but a substantive assumption.
The last case is similar to Andrews . Let and , where is mutually independent and (see Andrews’ Assumption SF1). Imposing is similar to imposing Andrews’ Condition SF3. Assuming Condition 7(i) together with mutual independence is similar to Andrews’ Condition SF2.
Proposition 1.2 shows that the rate of convergence of the kernel regression in the presence of common shocks is the same as the rate of convergence without common shocks.
Proposition 1.3 presents the asymptotic distribution of the kernel regression estimator. It shows that even when , the common shocks affect the asymptotic distribution of the kernel regression because they may impact both the conditional variance of Y and the conditional density of X. This result is similar to that of Andrews , Robinson  and others.
A consequence of Proposition 1.3 is that inference results depend on whether Y is mean independent of C given X. To test a null hypothesis, say, against , the corresponding t statistic is:
The usual two-sided t test with significance level α rejects the null if , where is the α quantile of the standard normal distribution. If Y is mean independent of C given X, then as . Otherwise, we have as .20
The bandwidth can be chosen by minimizing the approximated integrated mean squared error () conditional on . The bandwidth must be a -measurable random variable, . In the Supplemental Material, we show that , and one might expect both plug-in and cross-validation estimators to be consistent. The usual concerns in the literature about how to select the bandwidth are present here, but for brevity, we do not investigate the topic further. We only emphasize that the bandwidth choice based on the unconditional is infeasible because it is impossible to estimate the distribution of C (and integrate that out) using a single cross-sectional dataset.
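As an illustration of a data-driven choice computed from a single cross-section, the following sketch implements ordinary leave-one-out cross-validation for a Nadaraya-Watson fit with a Gaussian kernel. It is a generic textbook procedure, not the paper's method: the function names, the simulated data and the bandwidth grid are hypothetical, and the paper only conjectures that cross-validation is consistent in this setting.

```python
import numpy as np

def loo_cv_score(h, x, y):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    np.fill_diagonal(w, 0.0)  # drop observation i when predicting y_i
    fitted = (w @ y) / w.sum(axis=1)
    return np.mean((y - fitted) ** 2)

def select_bandwidth(x, y, grid):
    """Return the grid point with the smallest leave-one-out score."""
    scores = np.array([loo_cv_score(h, x, y) for h in grid])
    return grid[int(np.argmin(scores))], scores

rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, size=300)
y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=300)

grid = np.array([0.05, 0.1, 0.2, 0.4, 0.8, 5.0])
h_star, scores = select_bandwidth(x, y, grid)
print(h_star, scores.round(3))
```

Because the criterion uses only the observed sample, the selected bandwidth depends on the data generated under the realized common shocks, in line with the requirement that the bandwidth be chosen conditionally rather than by integrating over the distribution of the common shocks.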
4. Conclusions
In this paper, we investigate a nonparametric regression estimator for cross-sectional data in the presence of very general, potentially infinite-dimensional, common shocks. In a companion paper (Souza-Rodrigues ), we extend the results to a “large-N, large-T” panel data framework for a nonlinear generalized regression model. We plan to investigate extensions to finite-T panel data models in the future.
Supplementary File 1
Acknowledgments
I am grateful to Donald Andrews, Xiaohong Chen, Philip Haile, Steven Berry, Tai Otsu, Yuichi Kitamura, Ed Vytlacil, Peter Phillips, Marfisa Queiroz, two anonymous referees, and the participants of the Econometrics Lunch at Yale. Financial support from the Charles V. Hickox Fellowship at Yale University, the Yale University Fellowship, and the Kernan Brothers Environmental Fellowship at Harvard University is gratefully acknowledged. All errors are mine.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Disintegration Theory
We follow the discussion in Pollard  (Chapter 5 and Appendix F).21 Throughout this section, let the measurable space be , and let C be a measurable map from into . Let λ be a sigma-finite measure on and μ be a sigma-finite measure on . The definition of conditional distributions given in Pollard  (p. 113) is:
Definition 1. Let P be a probability measure on , and let Q be the probability distribution of C induced by P. A family of probability measures on is called the conditional probability distribution of P given C if:
(1) , for Q-almost all ;
and, for each nonnegative measurable function f on Ω:
(2) the map is -measurable; and
(3) the equality holds.
The conditional probability distribution is a family of probability measures satisfying (i) a concentration property (), (ii) a measurability condition (Property 2) and (iii) a decomposition property (Property 3). Unfortunately, the conditional probability distribution may not exist. The Kolmogorov conditional expectation, on the other hand, always exists. For completeness, we state its definition below (Pollard , p. 126):
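Let f be an integrable random variable on Ω and P be a probability measure on $(\Omega, \mathcal{F})$. For each sub-sigma-field $\mathcal{G} \subseteq \mathcal{F}$, the conditional expectation $E(f \mid \mathcal{G})$ is the $\mathcal{G}$-measurable random variable defined on Ω, such that, for all sets $G \in \mathcal{G}$, with indicator functions $1_G$,
$$E(f \, 1_G) = E\big( E(f \mid \mathcal{G}) \, 1_G \big).$$
The random variable $E(f \mid \mathcal{G})$ is called the conditional expectation of f given the sub-sigma-field $\mathcal{G}$ and it is unique up to P-equivalence.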
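If the conditional probability distribution of P given C exists, say $\{P_c\}$, then the $\sigma(C)$-measurable function defined by:
$$\omega \mapsto \int f \, dP_{C(\omega)}$$
is a version of the Kolmogorov conditional expectation of f given the sigma-field $\sigma(C)$ generated by C.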
The problem with the Kolmogorov conditional expectation is that each of its usual properties (mainly, being a linear increasing functional of f satisfying the monotone convergence property) holds only Q-almost everywhere, possibly with uncountably many negligible sets on which these properties fail. The accumulation of these null sets may lead to paradoxes when one is trying to compute the conditional expectation (see, e.g., ). To avoid these difficulties, topological assumptions are invoked to guarantee the existence of conditional probability distributions $\{P_c\}$, such that all of the properties of the Kolmogorov conditional expectation are satisfied except on countably many Q-negligible sets. By collecting these countably many Q-negligible sets into a single Q-null set, we avoid the problems and paradoxes coming from an accumulation of uncountably many null sets. Under the topological assumptions, the family $\{P_c\}$ is a version of the Kolmogorov conditional expectation that does not run into such difficulties and, as a by-product, guarantees the existence of conditional densities. The conditional densities may then be (carefully) manipulated, preserving the intuition we have for the cases where the conditioning event has positive probability.
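On a finite sample space, none of the null-set subtleties arise, and the definitions above can be checked by direct computation. The sketch below uses a hypothetical six-point sample space, hypothetical probabilities and a hypothetical map C (stored as `shock`); none of these come from the paper. It builds the conditional distributions by restricting P to each level set of C and renormalizing, and then checks the decomposition property and Kolmogorov's defining equality.

```python
import numpy as np

# Hypothetical finite sample space with six outcomes and unequal probabilities.
omega = np.arange(6)
prob = np.array([0.10, 0.20, 0.10, 0.25, 0.05, 0.30])
shock = omega % 3                      # the map C, taking three possible values
f = omega.astype(float) ** 2           # a nonnegative function on the sample space


def conditional_distributions(prob, shock):
    """Return {c: (Q{c}, P_c)}, where P_c is P restricted to {C = c} and renormalized."""
    result = {}
    for c in np.unique(shock):
        on_fiber = shock == c
        q_c = prob[on_fiber].sum()
        p_c = np.where(on_fiber, prob, 0.0) / q_c
        result[int(c)] = (float(q_c), p_c)
    return result


cond = conditional_distributions(prob, shock)

# Decomposition: the integral of f under P equals the Q-average of the integrals under P_c.
total = float(prob @ f)
decomposed = sum(q_c * float(p_c @ f) for q_c, p_c in cond.values())

# A version of the conditional expectation: omega -> integral of f under P_{C(omega)}.
cond_exp = np.array([cond[int(c)][1] @ f for c in shock])

# Kolmogorov's defining equality, checked on the generating sets G = {C = c}.
gaps = [abs(prob @ (f * (shock == c)) - prob @ (cond_exp * (shock == c))) for c in cond]

print(total, decomposed, max(gaps))
```

Because every level set of C has positive probability here, the conditional distributions are unique; the difficulties discussed in the text appear only when the conditioning events have probability zero.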
The existence of a conditional probability distribution follows from a general decomposition called disintegration. The definition of disintegration given in Chang and Pollard (p. 292) is:
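The measure λ has a disintegration $\Lambda = \{\lambda_c : c \in \mathcal{C}\}$ with respect to C and μ (or a (C, μ)-disintegration) if:
(1) $\lambda_c$ is a sigma-finite measure on $\mathcal{F}$ concentrated on $\{C = c\}$, that is, $\lambda_c\{\omega : C(\omega) \neq c\} = 0$ for μ-almost all c;
and for each nonnegative measurable function f on Ω:
(2) the map $c \mapsto \int f \, d\lambda_c$ is $\mathcal{B}$-measurable; and
(3) the equality $\int f \, d\lambda = \int \left( \int f \, d\lambda_c \right) \mu(dc)$ holds.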
From the definitions, it is clear that a (C, μ)-disintegration Λ can be a conditional probability distribution $\{P_c\}$ (take λ = P and μ = Q, with each member of Λ a probability measure). Yet, it is useful to let Λ be a collection of sigma-finite measures, so that we can define conditional densities with respect to dominating measures.
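Based on this disintegration, $\Lambda = \{\lambda_c\}$, we can define a new measure $\mu \otimes \Lambda$ on the product space $\Omega \times \mathcal{C}$, by the iterated integral:
$$(\mu \otimes \Lambda)(A) = \int \left( \int 1_A(\omega, c) \, \lambda_c(d\omega) \right) \mu(dc),$$
for all sets $A \in \mathcal{F} \otimes \mathcal{B}$. The measure $\mu \otimes \Lambda$ has to be well-defined to satisfy Condition 3 of Definition 3.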
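As a concrete illustration of Definition 3, consider the familiar Euclidean case. The choices below (the plane as sample space, the coordinate projection as the map C, and Lebesgue measures for λ and μ) are assumptions made only for this example; they are not part of the general setup. In this case, the three conditions reduce to Tonelli's theorem.

```latex
% Illustrative special case (assumed for this example only):
%   \Omega = \mathbb{R}^2 with points \omega = (x, c),  C(x, c) = c,
%   \lambda = Lebesgue measure on \mathbb{R}^2,  \mu = Lebesgue measure on \mathbb{R}.
\begin{align*}
\lambda_c(A) &= \int_{\mathbb{R}} 1_A(x, c)\, dx
  && \text{(Lebesgue measure on the line } \mathbb{R} \times \{c\}\text{)}, \\
\lambda_c\{\omega : C(\omega) \neq c\} &= 0
  && \text{(Condition 1: concentration on } \{C = c\}\text{)}, \\
\int f\, d\lambda_c &= \int_{\mathbb{R}} f(x, c)\, dx
  && \text{(Condition 2: measurable in } c \text{, by Tonelli)}, \\
\int f\, d\lambda &= \int_{\mathbb{R}} \left( \int_{\mathbb{R}} f(x, c)\, dx \right) dc
  = \int \left( \int f\, d\lambda_c \right) \mu(dc)
  && \text{(Condition 3: decomposition, by Tonelli)}.
\end{align*}
```

Each $\lambda_c$ is sigma-finite but not finite, which shows why it is convenient to allow Λ to be a family of sigma-finite measures rather than probability measures.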
The existence of a disintegration is guaranteed by the following theorem (Theorem 6 in Pollard, Appendix F).
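(Existence of disintegration) Let λ be a sigma-finite Radon measure on the Borel sigma-field $\mathcal{F}$ of a metric space Ω. Let μ be a sigma-finite measure on $\mathcal{B}$ that dominates the image measure of λ under C (i.e., the measure on $\mathcal{B}$ induced by the map C and the measure λ). If the set:
$$\{(\omega, c) \in \Omega \times \mathcal{C} : C(\omega) = c\}$$
is $\mathcal{F} \otimes \mathcal{B}$-measurable, then λ has a (C, μ)-disintegration, $\Lambda = \{\lambda_c\}$, uniquely determined up to μ-equivalence (i.e., if $\Lambda^* = \{\lambda_c^*\}$ is another (C, μ)-disintegration, then $\mu\{c : \lambda_c \neq \lambda_c^*\} = 0$).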
To guarantee the existence of the (C, μ)-disintegration, we need, therefore, to restrict: (i) Ω to be a metric space with the Borel sigma-field $\mathcal{F}$; (ii) λ to be a sigma-finite Radon measure; and (iii) the set $\{(\omega, c) : C(\omega) = c\}$ to be measurable. Depending on the problem at hand, it may be reasonable to assume (i) directly. To see the importance of the requirements (ii) and (iii), we briefly describe how the proof works. We then finally discuss the existence of conditional densities.
First, assume that Ω is a compact metric space, and let $\mathcal{K}$ be a compact paving, that is, a class of compact sets in Ω that is closed under finite unions and intersections. One can show that $\mathcal{K}$ can be chosen to be countable when Ω is compact. The proof carefully constructs, for each c, a finitely additive measure $\lambda_c$ on $\mathcal{K}$, so that the desired “measure-like” properties of the disintegration (Definition 3) hold for μ-almost all c. Because $\mathcal{K}$ is countable, all of the desired properties of $\lambda_c$ hold, except on countably many negligible sets, which can be collected into a single negligible set. It is shown, then, that there exists a unique extension of $\lambda_c$ to a countably additive measure defined on a sigma-field containing $\mathcal{K}$ (see , Appendix A). The extension is (inner) approximated by compact sets. By construction, all of the desirable properties hold for the extension of $\lambda_c$ and for all $c \notin N$, where N is a single set with $\mu N = 0$. The proof then shows that $c \mapsto \lambda_c B$ is $\mathcal{B}$-measurable for all Borel sets $B \subseteq \Omega$. Finally, the argument is extended to the case in which Ω is not compact, but the measure λ concentrates all of its mass on a disjoint union of countably many compact Borel sets; i.e., the measure λ is a sigma-finite Radon measure. Intuitively, the proof exploits compact approximations as a way to obtain countable additivity from finite additivity and to collect the negligible sets into a single null set N.
Pachl shows that λ being a sigma-finite Radon measure (Requirement (ii)) is a necessary condition for the existence of a disintegration. Therefore, even when Ω is not compact (or not separable), λ must have separable support.22
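The third requirement, the $\mathcal{F} \otimes \mathcal{B}$-measurability of the set $\{(\omega, c) : C(\omega) = c\}$, is also necessary because the measure of its complement:
$$\int \lambda_c\{\omega : C(\omega) \neq c\} \, \mu(dc)$$
is well-defined only if $\{(\omega, c) : C(\omega) = c\} \in \mathcal{F} \otimes \mathcal{B}$. The condition is not innocuous: it is well known that this set may not be $\mathcal{F} \otimes \mathcal{B}$-measurable even when C is measurable. The $\mathcal{F} \otimes \mathcal{B}$-measurability can be obtained if the σ-field $\mathcal{B}$ is countably generated and contains all of the singleton sets (see , p. 344). In particular, if $\mathcal{B}$ is the Borel σ-field on the separable metric space $\mathcal{C}$, these conditions are satisfied (see , p. 103).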
A separable with the Borel σ-field is sufficient, but not necessary, for the -measurability of the . It is possible, but not trivial to obtain such a result for non-separable spaces. Hansell  provides very abstract (and somewhat difficult to interpret) sufficient conditions for the -measurability when is not separable. Yet, even if the -measurability holds for a non-separable , the Radon measure λ puts all mass on a separable subset of . To see why, let G be a countable union of compact sets on Ω, such that , where is the complement of G. The map defined by is such that λ concentrates all mass in the set . If C is Borel measurable, the set is separable and, so, is (see Bogachev , Corollary 6.10.17)). The image measure of C under λ therefore puts all mass on a separable subset of when is non-separable. Therefore, although does not have to be separable to obtain the existence of disintegration, it seems difficult to get away from separability in this context.
The next theorem provides the conditions under which conditional densities exist (Theorem 12 in Pollard , Chapter 5).
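(Conditional densities) Let P be a probability measure on $(\Omega, \mathcal{F})$ with density $p(\omega)$ with respect to the sigma-finite measure λ. Let λ have a (C, μ)-disintegration $\Lambda = \{\lambda_c\}$. Then:
(1) The image measure Q (i.e., the probability distribution of C induced by P) is absolutely continuous with respect to μ, with density $q(c) = \int p(\omega) \, \lambda_c(d\omega)$.
(2) The set $\{c : q(c) = 0 \text{ or } q(c) = \infty\}$ has zero Q-measure.
(3) The probability measure P has conditional distribution $\{P_c\}$ given C, where $P_c$ is defined by having density:
$$p(\omega \mid c) = \frac{p(\omega)}{q(c)} \qquad (12)$$
with respect to $\lambda_c$, for Q-almost all c.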
The formula in (12) is the general version of the conditional density as the ratio of the joint density to the marginal density, but it does not require C to take values in a Euclidean space. To guarantee the existence of the conditional density, we therefore need the existence of the (C, μ)-disintegration Λ. For a more detailed discussion, see [28,29,30,50].
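In the Euclidean case, formula (12) can be checked numerically. The sketch below is an illustration under assumptions that are not in the paper: the joint density is a standard bivariate normal with correlation `rho`, the conditioning variable is the second coordinate, and the inner integral is approximated by a Riemann sum on a grid. The ratio of the joint density to the integrated (marginal) density should then agree with the textbook conditional density, which is normal with mean `rho * c` and variance `1 - rho**2`.

```python
import numpy as np


def joint_density(x, c, rho):
    """Standard bivariate normal density with correlation rho (assumed example)."""
    norm_const = 1.0 / (2.0 * np.pi * np.sqrt(1.0 - rho**2))
    quad = (x**2 - 2.0 * rho * x * c + c**2) / (1.0 - rho**2)
    return norm_const * np.exp(-0.5 * quad)


def normal_pdf(x, mean, var):
    """Univariate normal density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def conditional_density_on_grid(x_grid, c, rho):
    """Ratio of the joint density to its integral over x, as in formula (12)."""
    p = joint_density(x_grid, c, rho)
    dx = x_grid[1] - x_grid[0]
    q = float(np.sum(p) * dx)      # Riemann-sum approximation of the marginal density at c
    return p / q, q


rho, c = 0.6, 0.8
x_grid = np.linspace(-10.0, 10.0, 4001)
dx = x_grid[1] - x_grid[0]

cond, q = conditional_density_on_grid(x_grid, c, rho)
exact = normal_pdf(x_grid, rho * c, 1.0 - rho**2)

max_err = float(np.max(np.abs(cond - exact)))
marginal_err = abs(q - float(normal_pdf(c, 0.0, 1.0)))
print(max_err, marginal_err)
```

The same ratio construction applies when the conditioning variable is not Euclidean; what changes is that the inner integral is taken with respect to the disintegrating measure rather than Lebesgue measure on a line.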
1.See, for example, Arbia , the proceedings of the 2008 Cowles Summer Conference , the special issue of the Journal of Econometrics (“Analysis of Spatially Dependent Data,” 2007, 140(1), edited by Baltagi, Kelejian and Prucha), and the special issue of Econometrics (“Spatial Econometrics,” 2015, edited by Arbia and Lee). For recent surveys, see [3,4,5].
3.Formally, the probability limit of the kernel estimator for a nonstationary process can be obtained using the concept of local time, as in Wang and Phillips . However, the probability limit of the kernel regression estimator may not be measurable with respect to the conditioning variables, including the common shocks. This is a particularly important problem when we extend the results to panel data models, as in Souza-Rodrigues .
4.Although separability is not a necessary condition, it seems difficult to avoid it if we are to obtain the existence of conditional densities; see the discussion about the role of separability in the Appendix. Note that several separable metric spaces satisfying the sufficient conditions are available, but careful interpretation is needed in particular cases. For instance, suppose that an infinite-dimensional common shock can be well-approximated by a finite dimensional object. Because the sigma-field generated by the common shock may be different from the sigma-field generated by the approximating object, the conditional expectations given the common shock and given the finite-dimensional object are different. Ignoring this difference leads to problems such as the Borel paradox.
5.The motivation for this application is that numerous studies have documented an inverse relationship between hospital volumes of operations and mortality rates (see ). This suggests that thousands of deaths per year could have been prevented if hospitals with inadequate experience (i.e., with low volume of operations) had performed fewer surgical procedures. The evidence, however, is weak for most operations. Furthermore, existing papers have estimated parametric models that may be misspecified and have not considered the potential correlation between hospital volume of operations and hospital unobserved quality.
6.The second step runs a nonparametric instrumental variable regression across groups (hospitals) of the predicted outcome obtained in the first step on the group-level observables. It separates the impacts of group-level observables (hospital volume of surgeries) and unobservables (hospital unobserved quality).
8.In a panel data setting, one typically allows for time-varying regressors , but restricts , so that it does not vary over time, and the common shock C, so that it does not vary across individuals. Fixed-effect panel data models let and be correlated.
9.Note that this approach does not require known economic distances, but can readily accommodate them by taking , and by making some assumptions regarding how depends on the distance .
10.When Andrews  specializes to factor structure models, he imposes more restrictions on the common shocks, which makes his approach more similar to ours.
11.A regular conditional probability is a family of probability distributions indexed by x, such that (i) for a fixed x, it is a probability measure and (ii) for a fixed measurable set A, it is a measurable function of x.
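12.The measure λ is Radon if (i) $\lambda K < \infty$ for each compact K and (ii) $\lambda B = \sup\{\lambda K : K \subseteq B,\ K \text{ compact}\}$ for each Borel set B.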
13.It is possible to characterize all of the objects in Assumption 2 when . First, we have that (i) is a separable metric space provided that is a separable metric space, as well, and (ii) the Borel σ-field on equals the product Borel σ-field , where we denote the Borel σ-field on (see , Proposition 1.5). Second, let be the projection of onto the coordinate space , i.e., . Then, (i) the sub-sigma field is contained in and (ii) because , for all ; the sigma-field generated by C is . Furthermore, if we define the sigma-finite Radon λ on to be the product measure , where is defined on and μ on , then the measure induced by C and λ on equals μ, and so, is (trivially) absolutely continuous with respect to μ. Finally, we have to assume both ν and μ are sigma-finite Radon, so that λ is sigma-finite Radon on , as well.
14.Note that we can manipulate the conditional density (6) on as is usually done. Fix and think of as a copy of embedded into the product space. For a fixed , take the measure living on to coincide with the Lebesgue measure on . If ·) is a vector-valued function with , then:
15.Any infinite-dimensional separable Hilbert space, say , is isometrically isomorphic to a suitable , where the cardinality of the set I is the cardinality of an arbitrary Hilbertian basis for , i.e., there exists a linear operator , such that , where , is the norm on and is the -norm.
16.Conditioning on the event is only one possibility. For some , we could condition either on the event , or on the event , where the randomness of the event comes from , or on , where the randomness comes from .
17.It should be clear that it is not possible to separately identify from in this example.
18.Recall that the nonparametric version of the factor model takes , with . The parametric model imposes .
19.Note that if we were able to estimate the conditional expectation instead of , it would be impossible to separate from , and so, we would not be able to identify β.
20.In the Supplemental Material, we provide conditions under which the kernel density estimator is consistent: . For the variance , we can take to be:
The first term on the right-hand side converges in probability to using the same arguments as in Proposition 1. The second term on the right-hand side converges in probability to by the Slutsky theorem. Therefore, . Next, note that:
The first term on the RHS converges in distribution to by Proposition 1.3(ii). The second term on the RHS is such that: (a) , for some finite ξ, with probability approaching one because for Q-almost all c (see the Supplemental Material). If (b) is finite Q-almost surely (implying is finite with probability approaching one); and if (c) with positive probability; then, the second term on the RHS diverges in probability to . As a result, as under the null.
21.Dellacherie and Meyer , Hoffmann-Jorgensen  (Chapter 6 and Section 10.11), Pachl , and Chang and Pollard  are also important references.
22.Formally, the necessary condition is that λ must be approximated by a compact paving that is closed under countable unions.
G. Arbia. Spatial Econometrics: Statistical Foundations and Applications to Regional Convergence. Berlin, Germany: Springer-Verlag, 2006.
D.W.K. Andrews. “Handling Dependence: Temporal, Cross-Sectional and Spatial.” In Proceedings of the Cowles Summer Conference, New Haven, CT, USA, 22–23 June 2009.
A. Chudik, and M.H. Pesaran. “Large panel data models with cross-sectional dependence: A survey.” In The Oxford Handbook on Panel Data. Edited by B. Baltagi. New York, NY, USA: Oxford University Press, 2015, pp. 3–45.
Q.H. Xu, Z.W. Cai, and Y. Fang. “Panel data models with cross-sectional dependence: A selective review.” Appl. Math.-J. Chin. Univ. 31 (2016): 127–147.
P.C.B. Phillips, and D. Sul. “Dynamic panel estimation and homogeneity testing under cross section dependence.” Econom. J. 6 (2003): 217–259.
P.C.B. Phillips, and D. Sul. “Bias in dynamic panel estimation with fixed effects, incidental trends and cross section dependence.” J. Econom. 137 (2007): 162–188.
D.W.K. Andrews. “Cross-section regression with common shocks.” Econometrica 73 (2005): 1551–1585.
J. Bai, and S. Ng. “Evaluating latent and observed factors in macroeconomics and finance.” J. Econom. 131 (2006): 507–537.
M.H. Pesaran. “Estimation and inference in large heterogeneous panels with a multifactor error structure.” Econometrica 74 (2006): 967–1012.
J. Bai. “Panel data models with interactive fixed effects.” Econometrica 77 (2009): 1229–1279.
M.H. Pesaran, and E. Tosetti. “Large panels with common factors and spatial correlation.” J. Econom. 161 (2011): 182–202.
L. Su, and S. Jin. “Sieve estimation of panel data models with cross section dependence.” J. Econom. 169 (2012): 34–47. In Special Issue “Recent Advances in Panel Data, Nonlinear and Nonparametric Models: A Festschrift in Honor of Peter C.B. Phillips”.
X. Huang. “Nonparametric Estimation in Large Panels with Cross-Sectional Dependence.” Econom. Rev. 32 (2013): 754–777.
G.M. Kuersteiner, and I.R. Prucha. “Limit theory for panel data models with cross sectional dependence and sequential exogeneity.” J. Econom. 174 (2013): 107–126.
A. Chudik, and M.H. Pesaran. “Common correlated effects estimation of heterogeneous dynamic panel data models with weakly exogenous regressors.” J. Econom. 188 (2015): 393–420.
G. Forchini, and B. Peng. “A conditional approach to panel data models with common shocks.” Econometrics 4 (2016): 4.
D. Evans, A. Tandon, C. Murray, and J. Lauer. The comparative efficiency of national health systems in producing health: An analysis of 191 countries. GPE Discussion Paper No. 29; Geneva, Switzerland: World Health Organization, 2000.
M. Eberhardt, C. Helmers, and H. Strauss. “Do spillovers matter when estimating private returns to R&D?” Rev. Econ. Stat. 95 (2013): 436–448.
K. Gaibulloev, T. Sandler, and D. Sul. “Common drivers of transnational terrorism: Principal component analysis.” Econ. Inq. 51 (2013): 707–721.
J. Altonji, T. Conley, T.E. Elder, and C.R. Taber. Methods for Using Selection on Observed Variables to Address Selection on Unobserved Variables. New Haven, CT, USA: Yale University, 2010.
J.L. Doob. Stochastic Processes. New York, NY, USA: Wiley, 1953.
P.R. Halmos. Measure Theory. New York, NY, USA: Van Nostrand, 1950 (July 1969 reprinting).
Q. Wang, and P.C.B. Phillips. “Asymptotic Theory for Local Time Density Estimation and Nonparametric Cointegration Regression.” Econom. Theory 25 (2009): 710–738.
E.A. Souza-Rodrigues. “Nonparametric estimation of generalized regression model with group effects.” Unpublished paper, University of Toronto, Toronto, ON, Canada, 2014.
D. Pollard. A User’s Guide to Measure Theoretic Probability. New York, NY, USA: Cambridge University Press, 2002.
C. Dellacherie, and P.A. Meyer. Probabilities and Potential. Amsterdam, The Netherlands: North-Holland, 1978.
J. Hoffmann-Jorgensen. Probability with a View Towards Statistics. New York, NY, USA: Chapman and Hall, 1994, Volume 2.
S.T. Berry, and P.A. Haile. “Identification of a nonparametric generalized regression model with group effects.” Discussion paper. New Haven, CT, USA: Yale University, 2009.
J.F. Finks, N.H. Osborne, and J.D. Birkmeyer. “Trends in hospital volume and operative mortality for high-risk surgery.” N. Engl. J. Med. 364 (2011): 2128–2137.
L. Anselin. Spatial Econometric Methods and Models. Boston, MA, USA: Springer, 1988.
T.G. Conley. “GMM estimation with cross-sectional dependence.” J. Econom. 92 (1999): 1–45.
H.H. Kelejian, and I.R. Prucha. “A generalized moments estimator for the autoregressive parameter in a spatial model.” Int. Econ. Rev. 40 (1999): 509–533.
L.F. Lee. “Consistency and efficiency of least squares estimation for mixed regressive, spatial autoregressive models.” Econom. Theory 18 (2002): 252–277.
C. Martins-Filho, and F. Yao. “Nonparametric regression estimation with general parametric error covariance.” J. Multivar. Anal. 100 (2009): 309–333.
M. Gerolimetto, and S. Magrini. Nonparametric Regression with Spatially Dependent Data. Venice, Italy: Dipartimento di Scienze Economiche, Università Ca’ Foscari Venezia, 2009.
L. Lee, and J. Yu. “Efficient GMM estimation of spatial dynamic panel data models with fixed effects.” J. Econom. 180 (2014): 174–197.
L. Su, and Z. Yang. “QML estimation of dynamic panel data models with spatial errors.” J. Econom. 185 (2015): 230–258.
C.A. Bester, T.G. Conley, C.B. Hansen, and T.J. Vogelsang. “Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators.” Econom. Theory 32 (2016): 154–186.
G. Arbia. “Spatial Econometrics: A Rapidly Evolving Discipline.” Econometrics 4 (2016): 18.
L. Lee, and J. Yu. “Some recent developments in spatial panel data models.” Reg. Sci. Urban Econ. 40 (2010): 255–271.
P.M. Robinson. “Asymptotic theory for nonparametric regression with spatial data.” J. Econom. 165 (2011): 5–19.
P.M. Robinson, and J. Lee. “Series estimation under cross-sectional dependence.” J. Econom. 190 (2016): 1–17.
G.B. Folland. Real Analysis: Modern Techniques and Their Applications, 2nd ed. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts; New York, NY, USA: Wiley Interscience, 1999.
M.M. Rao. “Paradoxes in conditional expectation.” J. Multivar. Anal. 27 (1988): 434–446.
A.R. Pagan, and A. Ullah. Nonparametric Econometrics. New York, NY, USA: Cambridge University Press, 1999.
J. Pachl. “Disintegration and compact measures.” Mathematica Scandinavica 43 (1978): 157–168.
J.T. Chang, and D. Pollard. “Conditioning as disintegration.” Statistica Neerlandica 51 (1997): 287–317.
R.W. Hansell. “Sums, products and continuity of Borel maps in nonseparable metric spaces.” Proc. Am. Math. Soc. 104 (1988): 465–471.