Nonparametric Regression with Common Shocks

This paper considers a nonparametric regression model for cross-sectional data in the presence of common shocks. Common shocks are allowed to be very general in nature; they do not need to be finite dimensional with a known (small) number of factors. I investigate the properties of the Nadaraya-Watson kernel estimator and determine how general the common shocks can be while still obtaining meaningful kernel estimates. Restrictions on the common shocks are necessary because kernel estimators typically manipulate conditional densities, and conditional densities do not necessarily exist in the present case. By appealing to disintegration theory, I provide sufficient conditions for the existence of such conditional densities and show that the estimator converges in probability to the Kolmogorov conditional expectation given the sigma-field generated by the common shocks. I also establish the rate of convergence and the asymptotic distribution of the kernel estimator.


Introduction
Cross-sectional dependence has attracted considerable attention among economists recently.¹ It is well known that ignoring cross-sectional dependence may lead to inconsistent estimators and misleading inference. A popular and successful way to capture cross-sectional dependence is through common factors.² Common factor models assume a finite number of unobserved factors that may be the result of economy-wide shocks with impacts on population units that may depend on the characteristics of the unit. Possible common factors include macroeconomic, technological, legal/institutional, political, environmental, health and sociological shocks, among others. The applied literature has considered, for example, technological shocks (such as new procedures, drugs and surgical techniques) affecting the relationship between countries' healthcare attainments and their per capita health expenditures and educational levels (e.g., [20]); cross-country cross-industry analysis of returns to R&D, which are affected both by global shocks, such as the recent financial crisis, and by local shocks, such as spillovers between a limited group of industries or countries (e.g., [21]); and the analysis of transnational terrorism, where common factors may arise from common terrorist training camps, common grievances and demonstration effects (cf. [22]).
¹ See, for example, Arbia [1], the proceedings of the 2008 Cowles Summer Conference [2], the special issue of the Journal of Econometrics ("Analysis of Spatially Dependent Data," 2007, 140 (1), edited by Baltagi, Kelejian and Prucha), and the special issue of Econometrics ("Spatial Econometrics," 2015, edited by Arbia and Lee). For recent surveys, see [3-5].
Typically, common factor models allow for a small and known number of unobserved factors. Although such an approach is convincing in empirical macro models, in microeconometric models, it is often more reasonable to think of a potentially large, possibly unknown (and maybe infinite) number of factors that can influence individuals' behaviour. For instance, in studies of individual earnings, there are many individual-level observables and unobservables that affect income, as well as several common factors, such as region, family, male/female ratio, race composition, education, age composition, and so on (cf. [7]). The number of common factors may increase as we collect more cross-sectional observations, or there may be an infinite number of unobserved factors (see, e.g., [23]).
The purpose of this paper is to study a nonparametric regression model for cross-sectional data in the presence of common shocks that are very general in nature. The common shocks can be of infinite dimension with flexible impact on different population units. For example, common shocks could take the form of a nonlinear random function of observable or unobservable individual characteristics, with the effect on the i-th observation varying continuously across i depending on the value of the characteristic. We focus on nonparametric models because there may be little guidance (or justification) in practice for selecting a particular functional form for the regression function.
There has been important recent work on nonparametric models with finitely many common factors (e.g., [5,15,16]). These papers consider common shocks that enter the regression function additively and with disturbances that are modelled as linear functions of mutually-independent unobserved common factors and individual-specific factor loadings. We, in contrast, allow the regression function to be non-separable in the common shocks, and we do not require the mutual independence assumption. In other words, we allow for an unknown large, potentially infinite, number of factors that can influence individuals' outcomes and that may interact with observable and unobservable individual characteristics in extremely rich and flexible ways. To the best of our knowledge, this is the first paper that allows for such a flexible framework.
We consider this flexible setting because we are interested in investigating how general the regression function and the common shocks can be while still allowing for meaningful nonparametric estimates. We focus on the Nadaraya-Watson kernel estimator and study the effects of general common shocks on its asymptotic properties. Asymptotic results for kernel estimators are typically obtained by manipulating conditional densities of random variables. However, if the common shocks are too general, conditional densities do not necessarily exist. Doob [24] (pp. 623-624) and Halmos [25] (Section 48) present some examples of non-existence. If conditional densities do not exist, then what we would expect to be the probability limit of the kernel estimator in the present context is either meaningless or difficult to interpret.³ The idea here is to let the common shocks be as general as possible and to work with well-defined conditional densities that adhere as closely as possible to the standard kernel literature. To do so, we appeal to the disintegration theory for conditional distributions that can be found in Pollard [28], Dellacherie and Meyer [29] and Hoffmann-Jorgensen [30]. We find that an important sufficient condition to guarantee the existence of conditional densities is that the common shocks must belong to a separable metric space equipped with the Borel σ-field. We conclude that the sufficient conditions are mild and not very restrictive in practice.⁴
³ Formally, the probability limit of the kernel estimator for a nonstationary process can be obtained using the concept of local time, as in Wang and Phillips [26]. However, the probability limit of the kernel regression estimator may not be measurable with respect to the conditioning variables, including the common shocks. This is a particularly important problem when we extend the results to panel data models, as in Souza-Rodrigues [27].
⁴ Although separability is not a necessary condition, it seems difficult to avoid it if we are to obtain the existence of conditional densities; see the discussion about the role of separability in the Appendix. Note that several separable metric spaces satisfying the sufficient conditions are available, but careful interpretation is needed in particular cases. For instance, suppose that an infinite-dimensional common shock can be well-approximated by a finite-dimensional object. Because the sigma-field generated by the common shock may be different from the sigma-field generated by the approximating object, the conditional expectations given the common shock and given the finite-dimensional object are different. Ignoring this difference leads to problems such as the Borel paradox.
Given the existence of conditional densities, we adjust the standard assumptions of the kernel literature to the present case. We show that the Nadaraya-Watson kernel estimator converges in probability to the Kolmogorov conditional expectation given the sigma-field generated by the common shocks. The optimal rate of convergence is the same as the rate obtained when the observations are i.i.d. The asymptotic distribution is mixed normal with weights depending on the common shocks. It is obtained by exploiting a martingale difference sequence central limit theorem. We find that inference depends on how the common shocks affect the regression variables. A dichotomy similar to that of Andrews [8] is present here: if the dependent variable is mean independent of the common shocks given the explanatory variables, the usual t-test has the correct size; but if the dependent variable is not mean independent, the t statistic diverges to infinity in probability under the null hypothesis.
The closest paper in the literature to ours is that of Andrews [8], who considers a linear regression model in the presence of general common shocks. He shows that the least-squares estimator converges in probability to Kolmogorov conditional expectations given the σ-field generated by the common shocks. The random probability limit is a well-defined object because the Kolmogorov conditional expectation always exists. Andrews, therefore, does not need to guarantee the existence of conditional densities. Extending his results to a nonparametric model is important because parametric models may be misspecified. We show that the price to be paid is that mild restrictions must then be imposed on the nature of the common shocks.
The nonparametric version of the standard factor model is a special case of the model considered here. For this class of models, we show that, even though the kernel regression converges in probability to a random object measurable with respect to the common shocks, it is possible to identify and estimate the slope of the regression function. However, its location (e.g., the intercept in a linear model) is not identified even if we normalize common shocks to have a zero mean. To identify and estimate location, the dependent variable must be mean independent of the common shocks given the regressors.
Common factor models are typically applied to panel data sets (e.g., [14,17-19]). We view the present paper as a first step towards nonparametric panel data models that may incorporate a more general and flexible common factors structure. Indeed, in a companion paper, Souza-Rodrigues [27] develops a two-step nonparametric estimator that requires a "large-N, large-T" dataset for a generalized regression model based on the identification results of Berry and Haile [31]. The estimator applies equally to datasets with a large number of individuals in different groups and a large number of groups. The empirical application in Souza-Rodrigues [27] considers the impact of hospital volumes of surgical procedures on individual health status (e.g., mortality rate).⁵ Group-level observables (i.e., hospital volume of surgeries) may be correlated with group-level unobservables (hospital unobserved quality), which, in turn, may be indexed by individual characteristics (since an unobserved hospital characteristic that is helpful for patients with some demographic characteristics may not be as helpful for other patients). The strategy proposed by Souza-Rodrigues [27] is to run a nonparametric regression of individual outcomes on individual observables within each group (hospital) in the first step. It is a nonparametric regression with common shocks where the common shocks are the group-level observables and unobservables. Because the group-level unobservables may be a (random) function of individual characteristics, it is important to allow for this possibility, as we do here.⁶ The results of the present paper can be incorporated in other nonlinear panel data settings.
⁵ The motivation for this application is that numerous studies have documented an inverse relationship between hospital volumes of operations and mortality rates (see [32]). This suggests that thousands of deaths per year could have been prevented if hospitals with inadequate experience (i.e., with low volume of operations) had performed fewer surgical procedures. The evidence, however, is weak for most operations. Furthermore, existing papers have estimated parametric models that may be misspecified and have not considered the potential correlation between hospital volume of operations and hospital unobserved quality.
The present paper also relates to the literature on spatial dependence.⁷ Typically, in this literature, common shocks are presumed to have predominantly local effects, and the dependence is modelled as a function of an exogenously-given spatial or economic distance, with some form of stationary mixing condition analogous to that used for time series data. Recent nonparametric versions of spatial models have been considered by Martins-Filho and Yao [37] and Gerolimetto and Magrini [38], among others. Although the present paper can incorporate common shocks with differential local effects (e.g., assuming that individual factor loadings include geographic location), we do not allow individual outcomes to depend on the characteristics of other individuals. We therefore view spatial dependence models as complementary to ours.
Robinson [44] provides an alternative way of modelling cross-sectional dependence. He considers a nonparametric kernel regression in which the disturbances are represented by a (possibly infinite) sum of independent random variables with unknown weights. The structure in the disturbances is sufficiently rich to cover spatial dependence models, but, since it does not require known economic distances, it can accommodate stronger forms of dependence than mixing conditions. Robinson [44] investigates the properties of kernel estimators, and Robinson and Lee [45] study the properties of sieve estimators within this framework. The present paper can accommodate disturbances of the type represented by Robinson [44], but with a vector of common shocks in place of the vector of independent random variables. We do not require the vector of common shocks to be independent random variables, and we allow for potentially correlated random weights in the summation term for the disturbances. However, the restrictions we need to impose on the common shocks differ from the assumptions in Robinson [44]. Furthermore, we require i.i.d. sampling schemes that are assumed neither by Robinson [44] nor by the spatial dependence literature. Our model is therefore neither more general than, nor a special case of, Robinson's model.
The paper is organized as follows: Section 2 presents the regression model and discusses sufficient conditions to guarantee the existence of conditional densities. Section 3 establishes the asymptotic properties of the Nadaraya-Watson kernel regression estimator and discusses its implications. Section 4 concludes. The Appendix presents the disintegration theory and briefly discusses the role of separability of common shocks in the existence of conditional densities. The Supplemental Material presents results for the kernel density estimator, contains all relevant proofs and discusses the probabilistic framework adapted from Andrews [8] that justifies the approach taken here.

Regression Model and Conditional Densities
The dataset is $\{(Y_i, X_i) : i = 1, \ldots, n\}$, where $Y_i \in \mathcal{Y} \subseteq \mathbb{R}$ and $X_i \in \mathcal{X} \subseteq \mathbb{R}^k$. Consider the model:
$$Y_i = m(X_i, C(S_i)) + \varepsilon_i, \tag{1}$$
where $S_i \in \mathcal{S} \subseteq \mathbb{R}^{d_s}$, with $d_s \in \mathbb{N}$, is a vector of individual-specific random variables; $C(\cdot) \in \mathcal{C}$ is the common shock; and $\varepsilon_i$ is the idiosyncratic error. Some components of $S_i$ may be observable (in which case they may be incorporated in $X_i$), or $S_i$ may be completely unobservable. We allow the common shock $C(\cdot)$ to be either a random vector (possibly infinite-dimensional) or a random function of $S_i$. In the latter case, the common shocks may affect individuals differently. As usual, we use upper-case letters to denote random quantities and lower-case letters to denote realizations.
⁶ The second step runs a nonparametric instrumental variable regression across groups (hospitals) of the predicted outcome obtained in the first step on the group-level observables. It separates the impacts of group-level observables (hospital volume of surgeries) and unobservables (hospital unobserved quality).
The standard parametric factor model is a special case of our model and is typically written as:
$$Y_i = \alpha + X_i'\beta + U_i, \qquad U_i = S_i'C + \varepsilon_i, \tag{2}$$
where $S_i = (S_{i1}, \ldots, S_{iJ})$ is the vector of individual-specific factor loadings; $C = (C_1, \ldots, C_J)$ is the vector of unobserved common factors; $\varepsilon_i$ is the idiosyncratic error that is independent of $(X_i, S_i, C)$ and has zero mean; and $(\alpha, \beta)$ is the vector with the parameters of interest. Cross-sectional dependence in the disturbances is generated by the term $S_i'C$. The standard model can also accommodate cross-sectional dependence on regressors $X_i$. For example, consider the expanded vectors $C = (C^1, C^2)$ and $S_i = (S_i^1, S_i^2)$ and take [...]; then $X_i$ and $U_i$ are correlated even when $S_i^1$ and $S_i^2$ are independent of each other (e.g., [8,10,11]). The nonparametric version of (2) takes $Y_i = m_1(X_i) + U_i$, with the same structure for the disturbances $U_i$.⁸
The standard factor model (2) is a special case of our model (1) with the regression function given by the linear and additively separable $m(X_i, C(S_i)) = \alpha + X_i'\beta + C(S_i)$ and the common shock function given by $C(S_i) = \sum_{j=1}^{J} S_{ij} C_j$. We therefore generalize the standard model in the following ways: (i) we let the regression function $m(\cdot)$ be nonparametric; (ii) we allow the regressors $X_i$ to freely interact with the common shock $C(\cdot)$; and (iii) we let the common shock be a general function of individual-specific factor loadings $S_i$ (subject to the restrictions discussed below). Furthermore, factor models typically impose independence between $S_i$ and $(X_i, C)$ and assume that $C = (C_1, \ldots, C_J)$ is a mutually independent vector, while we do not need to impose these independence assumptions. We do not, however, consider a fully non-separable model; we maintain the additive separability assumption in the idiosyncratic error $\varepsilon_i$.
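To make the role of the term $S_i'C$ concrete, the following is a minimal simulation sketch of the standard factor model (2) for one fixed realization of the common factors. The function name `simulate_factor_model`, the parameter values, the scalar regressor and the normal distributions with loading means $b_j = 1$ are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_factor_model(n, alpha, beta, C, rng):
    """Draw (Y, X, U) from Y_i = alpha + X_i*beta + S_i'C + eps_i.

    C is held fixed: it is one realization of the common factors,
    shared by all n cross-sectional units (hypothetical design).
    """
    J = C.shape[0]
    X = rng.normal(size=n)                    # scalar regressor (assumption)
    S = rng.normal(loc=1.0, size=(n, J))      # loadings with mean b_j = 1 (assumption)
    eps = rng.normal(size=n)                  # idiosyncratic error, mean zero
    U = S @ C + eps                           # disturbance with common component
    Y = alpha + beta * X + U
    return Y, X, U

rng = np.random.default_rng(0)
C = np.array([2.0, -0.5])                     # one draw of the common factors
Y, X, U = simulate_factor_model(5000, 1.0, 2.0, C, rng)

# Conditional on C, the disturbances are centred at sum_j b_j * C_j
# (here 1*2.0 + 1*(-0.5) = 1.5) rather than at zero.
print(U.mean())
```

In this sketch every unit's disturbance is shifted by the same realized factors, which is the source of the cross-sectional dependence described above.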
Robinson [44] also considers a nonparametric version of (2), but with another structure for $U_i$. He considers the model [...], where the $\sigma_i$ are scalar unknown functions; the $e_j$'s are independent random variables with zero mean and unit variance; and the $b_{ij}$'s are unknown fixed weights.⁹ The present paper compares to Robinson [44] when the following holds: [...] Unlike Robinson [44], we allow for (potentially correlated) random coefficients $b_{ij}$ and do not restrict the $e_j$'s to be independent variables with zero mean and unit variance. The restrictions we need to impose on the function $C(S_i)$ are discussed below and are of a different nature than the assumptions used by Robinson [44].

Denote the vector
Define the measurable space $(\mathcal{W}, \mathcal{A})$, where $\mathcal{A}$ is the Borel sigma-field. The random elements $\{W_i : i \ge 1\}$ are defined on $(\mathcal{W}^N, \mathcal{A}^N)$, where $\mathcal{W}^N$ is the product space and $\mathcal{A}^N$ is the product Borel sigma-field on $\mathcal{W}^N$. We suppose the common shocks across observations are captured by the σ-field generated by $C$, denoted by $\sigma(C) \subset \mathcal{A}^N$.
⁸ In a panel data setting, one typically allows for time-varying regressors $X_{it}$, but restricts $S_i$, so that it does not vary over time, and the common shock $C$, so that it does not vary across individuals. Fixed-effect panel data models let $X_{it}$ and $S_i$ be correlated.
⁹ Note that this approach does not require known economic distances, but can readily accommodate them by taking $\sum_{j=1}^{n} b_{ij} e_j$, $e = (e_1, \ldots, e_n)$, and by making some assumptions regarding how $b_{ij}$ depends on the distance $|i - j|$.
We impose the following assumption:

Assumption 1 The sequence {W
As shown by Andrews [8], this assumption is valid when the units are drawn randomly from the population. One difference between the present paper and Andrews [8] is that he states the existence of some σ-field such that the data are i.i.d. conditional on it, without specifying a priori how this σ-field is constructed, while we impose more structure and state explicitly how the σ-field is generated. Andrews' framework is, therefore, more general than ours in this respect. Note that neither the spatial dependence models nor Robinson's [44] approach require random sampling.¹⁰

Existence of Conditional Densities
Because we make use of the Nadaraya-Watson kernel estimator and because the kernel estimator requires the existence of conditional densities, we now discuss the existence problem.
To guarantee the existence of conditional densities that allow for very general common shocks, we make use of the disintegration theory. A disintegration of a probability measure is a collection of regular conditional probabilities, each satisfying (i) a concentration property (i.e., conditional on an event, the probability of its complement is zero) and (ii) a decomposition property (i.e., the probability of an event is a weighted sum of the conditional probability measures, also known as the law of total probability).¹¹ The reader unfamiliar with disintegration theory might want to read the Appendix (or the references cited there) before proceeding.
Define the sub-vector [...]. We want to guarantee the existence of the conditional density of $Z_i$ given $C$. By Assumption 1, the probability distribution of $\{W_i : i \ge 1\}$, denoted by $P^N$, is exchangeable on $(\mathcal{W}^N, \mathcal{A}^N)$. Call $P_i$ the marginal distribution of $W_i$ under $P^N$. We impose the following: [...]
(ii) $\lambda$ is a sigma-finite Radon measure on $(\mathcal{W}, \mathcal{A})$.¹²
(iii) $C$ maps $(\mathcal{W}, \mathcal{A})$ into $(\mathcal{C}, \mathcal{B})$, where $\mathcal{C}$ is a separable metric space and $\mathcal{B}$ is the Borel σ-field.
(iv) $\mu$ is a sigma-finite measure on $(\mathcal{C}, \mathcal{B})$. Let the measure $\lambda C^{-1}$ induced by $C$ and $\lambda$ on $(\mathcal{C}, \mathcal{B})$ be absolutely continuous with respect to $\mu$.
(v) Let $P_i$, for any $i \ge 1$, be absolutely continuous with respect to $\lambda$. Denote its Radon-Nikodym density by $f_i(z, c)$.
Assumption 2(iii) requires $\mathcal{C}$ to be a separable metric space. This is trivially satisfied when $C$ belongs to a finite-dimensional Euclidean space. However, if $C$ is an infinite-dimensional vector of random variables, we need restrictions, such as $\mathcal{C} = \ell^p$, for some $1 \le p < \infty$, where $\ell^p$ is the space of sequences with finite $\|\cdot\|_p$-norm, and we need to rule out the case $\mathcal{C} = \ell^\infty$, because $\ell^\infty$ is non-separable. Similarly, if $C$ is a random function of $S_i$, it must belong to spaces such as the $L^p(\mathcal{S})$ space for $1 \le p < \infty$, or the space of bounded and continuous functions defined on a closed bounded subset of $\mathcal{S}$ and equipped with the sup-norm, or the Hölder space, etc. However, it cannot belong to the space of bounded functions with the sup-norm, $L^\infty(\mathcal{S})$, because it is not separable. See the discussion about the role of separability for existence of conditional densities in the Appendix.¹³ The restrictions in Assumption 2 are mild and sufficient to guarantee the existence of conditional densities of $Z_i$ given $C$ for any $i \ge 1$.
The reason for sufficiency is the following: first, Assumptions 2(i)-(iv) are sufficient for the sigma-finite Radon measure $\lambda$ to have a $(C, \mu)$-disintegration; i.e., they guarantee the existence of a collection of measures, denoted by $\Lambda = \{\lambda_c : c \in \mathcal{C}\}$, that satisfy the aforementioned concentration and decomposition properties (but note that the $\lambda_c$'s do not have to be probability measures; see Definition 3 and Theorem 1 in the Appendix).
Second, if the disintegration $\Lambda = \{\lambda_c : c \in \mathcal{C}\}$ exists and the probability measure $P_i$ on $(\mathcal{W}, \mathcal{A})$ is absolutely continuous with respect to $\lambda$ with density $f_i(z, c)$ (Assumption 2(v)), then two implications follow (see Theorem 2 in the Appendix): (i) the probability distribution of $C$ induced by $P_i$ (i.e., the image measure $Q_i = P_i C^{-1}$) is absolutely continuous with respect to $\mu$ with density:
$$q_i(c) = \int f_i(\tilde{z}, c)\, d\lambda_c(\tilde{z}), \tag{4}$$
and (ii) the probability measure $P_i$ has a conditional distribution given $C$, denoted by the collection $\mathcal{P}_i = \{P_{i,c} : c \in \mathcal{C}\}$, where $P_{i,c}$ is defined by having density:
$$f_i(z|c) = \frac{f_i(z, c)}{q_i(c)} \tag{5}$$
with respect to $\lambda_c$ for $Q_i$-almost all $c \in \mathcal{C}$. The conditional density $f_i(z|c)$ is therefore similar to elementary conditional densities: it is the ratio of the joint density $f_i(z, c)$ and the marginal $q_i(c)$. However, it does not require $C$ to belong to a finite-dimensional Euclidean space.
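As a sanity check on the ratio formula, it may help to work through the most elementary finite-dimensional case. The Gaussian specification below is purely an illustrative assumption (it is not an example from the paper): $Z$ and $C$ are scalar, $\lambda$ is Lebesgue measure on $\mathbb{R}^2$, $\mu$ is Lebesgue measure on $\mathbb{R}$, and each $\lambda_c$ is Lebesgue measure on the slice $\mathbb{R} \times \{c\}$.

```latex
% Illustrative assumption: C ~ N(0,1) and Z | C = c ~ N(c, 1).
% \varphi denotes the standard normal density.
\begin{align*}
f(z, c) &= \varphi(z - c)\,\varphi(c), \\
q(c) &= \int_{\mathbb{R}} \varphi(\tilde{z} - c)\,\varphi(c)\, d\tilde{z} = \varphi(c), \\
f(z \mid c) &= \frac{f(z, c)}{q(c)} = \varphi(z - c).
\end{align*}
```

In this special case the disintegration reproduces the textbook conditional density; the point of the general theory is that the same ratio remains well defined when $C$ takes values in an infinite-dimensional separable metric space.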
Because $C$ is common to all $i$, the equality $Q = Q_i$ follows for all $i \ge 1$. In addition, $f_i(z|c) = f_j(z|c)$ for all $i \ne j$ and for $Q$-almost all $c \in \mathcal{C}$ by Assumption 1. We state this result as a lemma:
Lemma 1. Let Assumptions 1 and 2 hold. Then, there exist conditional densities of $Z_i$ given $C$, for all $i \ge 1$, defined by:
$$f(z|c) = \frac{f_1(z, c)}{q(c)} \tag{6}$$
for $Q$-almost all $c \in \mathcal{C}$, where $q(c) \equiv \int f_1(\tilde{z}, c)\, d\lambda_c(\tilde{z})$.¹⁴
Example 1. Suppose $S_i$ is scalar and $\mathcal{C}$ is the separable Hilbert space $L^2(\mathcal{S})$. Take a basis $\{\varphi_j\}_{j=1}^{\infty}$ for $L^2(\mathcal{S})$ and represent the common shock by $C(S_i) = \sum_{j=1}^{\infty} C_j \varphi_j(S_i)$, where $C_j \in \mathbb{R}$ for $j \ge 1$. Note that one can define $S_{ij} = \varphi_j(S_i)$, in which case the random coefficients are not independent of each other. More important for us is to note that selecting a function in $\mathcal{C}$ is equivalent to selecting the infinite-dimensional vector $\{C_j\}_{j=1}^{\infty}$ in $\ell^2$. Let $\mathcal{B}(L^2)$ be the Borel σ-field on $L^2$ and $\mathcal{B}(\ell^2)$ be the Borel σ-field on $\ell^2$. Because the spaces $L^2$ and $\ell^2$ are homeomorphic, their topologies are equivalent, and so, $\mathcal{B}(L^2)$ and $\mathcal{B}(\ell^2)$ are equivalent.¹⁵ As a result, the event $\{C(\cdot) = c\}$ on $L^2$ is equivalent to the (potentially more intuitive) event $\{(C_1, C_2, \ldots) = (c_1, c_2, \ldots)\}$ on $\ell^2$. In addition, conditioning on $\{C(\cdot) = c\}$ is equivalent to conditioning on $\{(C_1, C_2, \ldots) = (c_1, c_2, \ldots)\}$. We have, therefore, $f(z|c(\cdot)) = f(z|c_1, c_2, \ldots)$ and: [...] for any measurable set $A$.¹⁶
Example 1 intends to translate properties of conditional probabilities given an element in some abstract space of functions into properties in (hopefully) more concrete spaces defined by random vectors. Example 1, however, does not apply when $\mathcal{C}$ is not a Hilbert space. Although we may approximate any of the separable metric spaces by other simpler spaces, the conditioning argument does not hold without running into problems, such as the Borel paradox (see, e.g., [47]). For instance, take $\mathcal{C}$ to be the set of bounded and continuous functions, $(BC(\mathcal{S}), \|\cdot\|_\infty)$. It is separable, and any $c \in \mathcal{C}$ can be well approximated by a polynomial of order $J < \infty$, say $p_J(\cdot)$ with some coefficients $\{b_j\}_{j=1}^{J}$. Because we can take $J$ such that $\|c - p_J\|_\infty < \varepsilon$, for some $\varepsilon > 0$, the probability of the event $\{C(\cdot) = c\}$ is close to the probability of the event $\{\{B_j\}_{j=1}^{J} = \{b_j\}_{j=1}^{J}\}$. However, the topology of $(BC(\mathcal{S}), \|\cdot\|_\infty)$ is not the same as the topology of the Euclidean $\mathbb{R}^J$ for any finite $J$. Therefore, the Borel σ-field on $BC(\mathcal{S})$ is different from the Borel σ-field on any $\mathbb{R}^J$. Conditioning on different σ-fields delivers different conditional probability distributions, and so, we are not guaranteed to have $\Pr(Z \in A \,|\, C(\cdot) = c)$ close to $\Pr(Z \in A \,|\, \cap_{j=1}^{J} \{B_j = b_j\})$ for all measurable sets $A$. We can still obtain the existence of conditional densities, but we cannot derive conclusions based on some approximation $\sum_{j=1}^{J} b_j p_j(S_i)$ for $C(S_i)$, no matter how large $J$ is.
[...] equals the product Borel σ-field $\mathcal{A}_Z \otimes \mathcal{B}$, where we denote by $\mathcal{A}_Z$ the Borel σ-field on $\mathcal{Z}$ (see [46], Proposition 1.5). Second, let $\pi_c$ be the projection of $\mathcal{W}$ onto the coordinate space $\mathcal{C}$, i.e., $\pi_c : \mathcal{W} \to \mathcal{C}$. Then, (i) the sub-sigma-field $\pi_c^{-1}(\mathcal{B})$ is contained in $\mathcal{A}$ and (ii) because $C(w) = \pi_c(w)$, for all $w \in \mathcal{W}$, the sigma-field generated by [...] Furthermore, if we define the sigma-finite Radon measure $\lambda$ on $(\mathcal{W}, \mathcal{A})$ to be the product measure $\lambda = \nu \otimes \mu$, where $\nu$ is defined on $(\mathcal{Z}, \mathcal{A}_Z)$ and $\mu$ on $(\mathcal{C}, \mathcal{B})$, then the measure $\lambda C^{-1}$ induced by $C$ and $\lambda$ on $(\mathcal{C}, \mathcal{B})$ equals $\mu$, and so, $\lambda C^{-1}$ is (trivially) absolutely continuous with respect to $\mu$. Finally, we have to assume both $\nu$ and $\mu$ are sigma-finite Radon, so that $\lambda$ is sigma-finite Radon on $\mathcal{A}$, as well.
¹⁴ Note that we can manipulate the conditional density (6) on $\mathcal{Z} \otimes \mathcal{C}$ as is usually done. Fix $C = c$ and think of $\mathcal{Z} \otimes \{c\}$ as a copy of $\mathcal{Z}$ embedded into the product space. For a fixed $c \in \mathcal{C}$, take the measure $\lambda_c$ living on $\mathcal{Z} \otimes \{c\}$ to coincide with the Lebesgue measure on $\mathcal{Z}$. If $r(\cdot)$ is a vector-valued function with $E\|r(Z)\| < \infty$, then: [...]

Regression Estimator
Next, we consider the properties of the Nadaraya-Watson kernel regression estimator:
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{X_i - x}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h_n}\right)}, \tag{8}$$
where $K(\cdot)$ is the kernel function and $h_n$ is the bandwidth. As previously mentioned, the objective here is to work as closely as possible to the standard kernel literature. The assumptions we impose are therefore similar to the standard assumptions (see Pagan and Ullah [48]), but with the population density and regression function replaced by the corresponding conditional functions and with the extra "Q-almost all c" qualifiers added. For brevity, we relegate the properties of the kernel density estimator to the Supplemental Material. We maintain Assumptions 1 and 2 from now on.
¹⁵ Any infinite-dimensional separable Hilbert space, say $H$, is isometrically isomorphic to a suitable $\ell^2(I)$, where the cardinality of the set $I$ is the cardinality of an arbitrary Hilbertian basis for $H$, i.e., there exists a linear operator $L : H \to \ell^2(I)$, such that $\|Lh\|_2 = \|h\|_H$, where $h \in H$, $\|\cdot\|_H$ is the norm on $H$ and $\|\cdot\|_2$ is the $\ell^2$-norm.
¹⁶ Conditioning on the event $\{(C_1, C_2, \ldots) = (c_1, c_2, \ldots)\}$ is only one possibility. For some $a \in \mathbb{R}$, we could condition either on the event $\{C(S_i) = a\} = \{\sum_{j=1}^{\infty} C_j \varphi_j(S_i) = a\}$, or on the event $\{c(S_i) = a\} = \{\sum_{j=1}^{\infty} c_j \varphi_j(S_i) = a\}$, where the randomness of the event comes from $S_i$, or on $\{C(s) = a\} = \{\sum_{j=1}^{\infty} C_j \varphi_j(s) = a\}$, where the randomness comes from $(C_1, C_2, \ldots)$.
In addition, we impose the following conditions:
Condition 1. Let $\mathcal{K}$ be the class of all Borel measurable nonnegative bounded real-valued functions $K(u)$, such that: [...]
Condition 2. For $Q$-almost all $c \in \mathcal{C}$, the conditional density $f(x|c)$ is continuous at any point $x_0$.
Condition 4. For $Q$-almost all $c$, (i) $f(x|c)$ is twice continuously differentiable with respect to $x$ in some neighbourhood of $x_0$ and (ii) the second-order derivatives of $f(x|c)$ with respect to $x$ are bounded in this neighbourhood.
Condition 5. For $Q$-almost all $c$, the point $x_0$ is in the interior of the support of $X$ conditional on $\{C = c\}$ and $f(x_0|c) \ge \xi > 0$, for some finite $\xi$.
Condition 6. The kernel $K$ is a symmetric function satisfying $\int u K(u)\, du = 0$.

Condition 8. For $Q$-almost all $c$, the function $m(x, c)$ is twice continuously differentiable with respect to $x$ in some neighbourhood of $x_0$.
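The Nadaraya-Watson estimator is simple to compute. The sketch below is one possible implementation for a scalar regressor; the function names `gaussian_kernel` and `nadaraya_watson`, the choice of a Gaussian kernel (which is nonnegative, bounded and symmetric) and the simulated data in the usage lines are assumptions made for illustration and are not prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: nonnegative, bounded and symmetric."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x0, X, Y, h, kernel=gaussian_kernel):
    """Kernel-weighted local average of Y at each evaluation point in x0.

    X, Y : 1-d arrays of length n (scalar regressor, scalar outcome).
    h    : bandwidth h_n > 0.
    """
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    # weights[k, i] = K((X_i - x0_k) / h)
    weights = kernel((X[None, :] - x0[:, None]) / h)
    return (weights @ Y) / weights.sum(axis=1)

# Usage on simulated data (hypothetical design).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=2000)
Y = np.sin(X) + 0.1 * rng.normal(size=2000)
print(nadaraya_watson([0.0, 1.0], X, Y, h=0.2))
```

Nothing in the computation refers to the common shocks; they affect only what the estimator converges to, which is why the conditions are stated for $Q$-almost all $c$.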
Conditions 1-5 suffice to obtain the asymptotic properties of the kernel density estimator (consistency, rate of convergence and asymptotic distribution; see the Supplemental Material). Condition 6 is standard in the literature. Condition [...] In the standard factor model, this translates into:
$$m(x, C) = \alpha + x'\beta + \sum_{j=1}^{J} E[S_{ij} \,|\, X_i = x, C]\, C_j. \tag{9}$$
Note that $m(x, C)$ is a random object because $C$ has not been fixed. Typically in the literature, $S_i$ is assumed to be independent of $(X_i, C)$, in which case $E[S_{ij} \,|\, X_i = x, C] = E[S_{ij}] = b_{ij}$, where $b_{ij}$ is an unknown constant. Unlike the standard model, here, we allow $J$ to be infinite (as long as $C$ belongs to an appropriate separable metric space); we do not require $S_i$ to be independent of $(X_i, C)$, and we allow for more complicated interactions between $X$ and $C$. Condition 7(ii) allows for conditional heteroskedasticity, and Condition 8 is used to apply $Q$-almost sure Taylor expansions similar to what is usually done in the kernel literature.
Remark 1. Condition 8 requires $m(x, c)$ to be twice continuously differentiable in $x$ for almost all $c$. To fix ideas, consider the following case: let [...], while conditioning only on the event $\{X_i = x\}$, we obtain the random object: [...] Therefore, to satisfy Condition 8, we need $E[Y_i \,|\, X_i, C(X_i)]$ to be twice continuously differentiable with respect to both the first and second arguments, and we also need $C(\cdot)$ to be twice continuously differentiable with respect to $x$.¹⁷
To obtain the consistency of $\hat{m}(x)$, we first show that the kernel density estimator converges in probability to the conditional density $f(x|C)$. Then, we prove that the mean-squared error of $\hat{m}(x)$ conditional on $\sigma(C)$ converges to zero in probability. Finally, consistency follows by the dominated convergence theorem. We then show that the rate of convergence is the same as the rate of convergence without common shocks. The pointwise asymptotic distribution is obtained using the martingale difference sequence central limit theorem.
Proposition 1.1 shows that the kernel regression estimator converges in probability to the random object $m(x, C)$; convergence to $m(x)$ only holds when $Y$ is mean independent of $C$ given $X$. To see how this difference may affect the interpretation of potential estimands, take the standard factor model as an example.$^{18}$ In this case, $m(x, C)$ is given by (9), while $m(x)$ is given by:

If we assume, as is usually done, that $S_i$ is independent of $(X_i, C)$, we have that

If there is no cross-sectional dependence on regressors resulting from the common

Although we cannot estimate $m(x)$ consistently, it is possible to identify and estimate $\beta$ by noting that $m(x_1, C) - m(x_2, C) = (x_1 - x_2)\beta$, for $x_1 \neq x_2$. Similarly, for nonparametric factor models, $Y_i = m_1(X_i) + U_i$, one can identify and estimate the slope of $m_1(x)$. However, the presence of the common shocks $\sum_{j=1}^{J} b_{ij} C_j$ prevents the identification of the intercept $\alpha$ in the linear model (and the identification of the location of $m_1(x)$ in the nonparametric model) even if we normalize $E[C_j] = 0$ for all $j$.
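The distinction between the level and the slope can be illustrated with a small simulation. The sketch below is only an illustration under assumptions that are not in the paper: a single common shock ($J = 1$) with a constant loading `b`, a scalar regressor, Gaussian errors, a Gaussian kernel, and hypothetical parameter values. It draws one cross-section at a fixed realized shock `c` and compares the Nadaraya-Watson estimate with $m(x, C)$, with $m(x)$, and with the slope $\beta$.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, h):
    """Nadaraya-Watson estimate of the regression function at x0."""
    weights = [gaussian_kernel((x - x0) / h) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def simulate_cross_section(n, alpha, beta, b, c, rng):
    """One cross-section in which every unit shares the realized shock c."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [alpha + beta * x + b * c + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)
alpha, beta, b = 1.0, 2.0, 1.5   # hypothetical parameter values
c = 0.8                          # hypothetical realized shock; E[C] = 0 is assumed
n = 20000
h = n ** (-1.0 / 5.0)            # rate n^{-1/(4+k)} with k = 1 regressor

xs, ys = simulate_cross_section(n, alpha, beta, b, c, rng)
x1, x2 = 0.5, -0.5
m1 = nadaraya_watson(x1, xs, ys, h)
m2 = nadaraya_watson(x2, xs, ys, h)

level_conditional = alpha + beta * x1 + b * c   # m(x1, C) at the realized shock
level_unconditional = alpha + beta * x1         # m(x1) when E[C] = 0
slope_estimate = (m1 - m2) / (x1 - x2)          # the shock cancels in the difference
```

In this design the estimate `m1` tracks `level_conditional` rather than `level_unconditional`, while `slope_estimate` stays close to `beta` because the term `b * c` is common to both evaluation points.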
Remark 2. The nonparametric factor model with $J = \infty$, $E[S_{ij} \mid X = x, C] = b_{ij}$ and $E[C_j] = 0$, for all $j$, has a structure similar to the one proposed by Robinson [44]. Yet, while Robinson [44] shows that the kernel regression estimator converges in probability to $m(x)$, we obtain convergence to $m(x, C)$. An important distinction comes from the assumption on the sampling process. Because we have exchangeable data given the common shocks (Assumption 1), the conditions we impose are not sufficient to "get rid of" $C$ in the limit. Robinson [44], in contrast, does not impose the conditional i.i.d. sampling process.
Returning to the standard factor model, if we assume now the presence of cross-sectional dependence on regressors captured by, say,

Again, $Y$ is not mean independent of $C$ given $X$, so $\hat{m}(x) \xrightarrow{p} m(x, C) \neq m(x)$, but it is still possible to identify $\beta$ in the parametric model and the slope of $m_1(x)$ in the nonparametric version.$^{19}$

In the standard factor model, $Y$ is mean independent of $C$ given $X$ only if the common shocks have no direct effect on $Y$. This is the case when $E[S_{ij}] = b_{ij} = 0$. When this is true, $m(x, C) = m(x)$, and the kernel regression estimator converges in probability to $m(x)$, even when there is cross-sectional dependence on $X$. In this case, we identify both parameters $\alpha$ and $\beta$ in the linear model and $m_1(x)$ in the nonparametric model. Note that assuming $E[S_{ij}] = 0$ for all $j$ is not an innocuous normalization, but a substantive assumption.
Remark 3. The last case is similar to Andrews [8].
Proposition 1.2 shows that the rate of convergence of the kernel regression in the presence of common shocks is the same as the rate of convergence without common shocks.
Proposition 1.3 presents the asymptotic distribution of the kernel regression estimator. It shows that even when $\hat{m}(x) \xrightarrow{p} m(x, C) = m(x)$, the common shocks affect the asymptotic distribution of the kernel regression because they may impact both the conditional variance of $Y$ and the conditional density of $X$. This result is similar to that of Andrews [8], Robinson [44] and others.
Remark 4. A consequence of Proposition 1.3 is that inference results depend on whether $Y$ is mean independent of $C$ given $X$. To test a null hypothesis, say, $H_0 : m(x) = m_0(x)$ against $H_1 : m(x) \neq m_0(x)$, the corresponding $t$ statistic is:

The usual two-sided $t$ test with significance level $\alpha$ rejects the null if , where $z_\alpha$ is the $\alpha$ quantile of the standard normal distribution. If $Y$ is mean independent of $C$ given $X$, then

Remark 5. The bandwidth can be chosen by minimizing the approximated integrated mean squared error (AMISE) conditional on $\sigma(C)$. The bandwidth must be a $\sigma(C)$-measurable random variable, $h_n(C)$. In the Supplemental Material, we show that $h_n(C) = O_p\left(n^{-1/(4+k)}\right)$, and one might expect both plug-in and cross-validation estimators to be consistent. The usual concerns in the literature about how to select the bandwidth are present here, but for brevity, we do not investigate the topic further. We only emphasize that the bandwidth choice based on the unconditional AMISE is infeasible because it is impossible to estimate the distribution of $C$ (and integrate that out) using a single cross-sectional dataset.
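As a rough illustration of a data-driven choice of this kind, the sketch below runs a generic leave-one-out cross-validation over a grid of bandwidths of the form `scale * n^{-1/(4+k)}`. It is not a procedure proposed or analyzed in the paper: the grid of scales, the Gaussian kernel, the simulated design and all function names are assumptions made for the example. Since every observation in the cross-section shares the same realized shock, the criterion is computed from data generated at that realization only.

```python
import math
import random


def loo_cv_score(xs, ys, h):
    """Mean leave-one-out squared prediction error of a Nadaraya-Watson fit."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        num = 0.0
        den = 0.0
        for j in range(n):
            if j == i:
                continue
            # Gaussian kernel weight; the normalizing constant cancels in num / den.
            w = math.exp(-0.5 * ((xs[j] - xs[i]) / h) ** 2)
            num += w * ys[j]
            den += w
        if den == 0.0:
            return math.inf  # no neighbour receives weight at this bandwidth
        total += (ys[i] - num / den) ** 2
    return total / n


def candidate_bandwidths(n, k, scales):
    """Grid of bandwidths scale * n^{-1/(4+k)}; the scales are hypothetical."""
    rate = n ** (-1.0 / (4 + k))
    return [s * rate for s in scales]


def select_bandwidth(xs, ys, k=1, scales=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Pick the grid point with the smallest leave-one-out score."""
    grid = candidate_bandwidths(len(xs), k, scales)
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h))


rng = random.Random(1)
c = rng.gauss(0.0, 1.0)  # one realized common shock shared by all units
xs = [rng.uniform(-2.0, 2.0) for _ in range(200)]
ys = [math.sin(2.0 * x) + 1.5 * c + rng.gauss(0.0, 0.3) for x in xs]
h_cv = select_bandwidth(xs, ys)
```

Restricting the search to multiples of $n^{-1/(4+k)}$ keeps the selected bandwidth at the stated rate by construction; whether such a selector is consistent in the presence of common shocks is left open in the text.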

Conclusions
In this paper, we investigate a nonparametric regression estimator for cross-sectional data in the presence of very general, potentially infinite-dimensional, common shocks. In a companion paper (Souza-Rodrigues [27]), we extend the results to a "large-N, large-T" panel data framework for a nonlinear generalized regression model. We plan to investigate extensions to finite-T panel data models in the future.
we can take $\hat{\sigma}^2(x)$ to be:

The first term on the right-hand side converges in probability to $E[Y_i^2 \mid X_i = x, C]$ using the same arguments as in Proposition 1. The second term on the right-hand side converges in probability to $[m(x, C)]^2$ by the Slutsky theorem. Therefore, $\hat{\sigma}^2(x) \xrightarrow{p} \sigma^2(x, C)$. Next, note that:

The first term on the RHS converges in distribution to $N(0, 1)$ by Proposition 1.3(ii). The second term on the RHS is such that: (a) $\hat{f}(x) \geq \xi > 0$, for some finite $\xi$, with probability approaching one because $f(x \mid c) \geq \xi > 0$ for $Q$-almost all $c$ (see the Supplemental Material). If (b) $\sigma^2(x, C)$ is finite $Q$-almost surely (implying $\hat{\sigma}^2(x)$ is finite with probability approaching one); and if (c) $m(x, C) \neq m_0(x)$ with positive probability; then, the second term on the RHS diverges in probability to $\pm\infty$. As a result, $|T_n| \to \infty$ as $n \to \infty$ under the null.
many $Q$-negligible sets. By collecting all of these countably many $Q$-negligible sets into a single $Q$-null set, we avoid the problems and paradoxes coming from an accumulation of uncountably many null sets. Under the topological assumptions, the family $P = \{P_c : c \in C\}$ is a version of the Kolmogorov conditional expectation that does not run into such difficulties and, as a by-product, guarantees the existence of conditional densities. The conditional densities may then be (carefully) manipulated, preserving the intuition we have for the cases where the conditioning event has positive probability. The existence of a conditional probability distribution follows from a general decomposition called disintegration. The definition of disintegration given in Chang and Pollard [50]

$\lambda_c$ is a sigma-finite measure on $F$ concentrated on $\{C = c\}$, that is, $\lambda_c\{C \neq c\} = 0$ for $\mu$-almost all $c$; and for each nonnegative measurable function $f$ on $\Omega$:

2.
the equality f (ω

From the definitions, it is clear that a $(C, \mu)$-disintegration $\Lambda = \{\lambda_c : c \in C\}$ can be a conditional probability distribution $P = \{P_c : c \in C\}$. Yet, it is useful to let $\Lambda$ be a collection of sigma-finite measures, so that we can define conditional densities with respect to dominating measures.
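Based on this disintegration, we can define a new measure $\mu \otimes \Lambda$ on the product $(\Omega \times C, F \otimes B)$ by the iterated integral:

$$(\mu \otimes \Lambda)(A) = \int\!\!\int I_A \, d\lambda_c(\omega) \, d\mu(c)$$

for all sets $A \in F \otimes B$. The measure $\mu \otimes \Lambda$ has to be well-defined to satisfy Condition 3 of Definition 3.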
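A finite toy case may help fix ideas. In the sketch below, which is purely illustrative, $\Omega$ has four points, $C$ takes two values, $\mu$ is counting measure, and $\lambda_c$ is taken to be the restriction of $\lambda$ to the fibre $\{C = c\}$; all names and numbers are hypothetical. The code evaluates the iterated sum that plays the role of the iterated integral and compares it with $\lambda\{\omega : (\omega, C(\omega)) \in A\}$.

```python
from fractions import Fraction

# Hypothetical finite sample space, measure lambda, and map C.
omega = ["a", "b", "c", "d"]
lam = {"a": Fraction(1, 2), "b": Fraction(1, 4),
       "c": Fraction(1, 8), "d": Fraction(1, 8)}
C = {"a": 0, "b": 0, "c": 1, "d": 1}
values = sorted(set(C.values()))
mu = {c: Fraction(1) for c in values}  # counting measure on the range of C


def lambda_c(c):
    """Restriction of lam to the fibre {C = c}; zero mass elsewhere."""
    return {w: (lam[w] if C[w] == c else Fraction(0)) for w in omega}


def product_measure(A):
    """Iterated sum: sum over c of mu(c) * sum over w of 1_A(w, c) * lambda_c(w)."""
    return sum(
        mu[c] * sum(lambda_c(c)[w] for w in omega if (w, c) in A)
        for c in values
    )


def graph_measure(A):
    """lam of the set of points w whose graph point (w, C(w)) lies in A."""
    return sum(lam[w] for w in omega if (w, C[w]) in A)


# A subset of the product space, including the off-graph point ("b", 1).
A = {("a", 0), ("b", 1), ("c", 1)}
lhs = product_measure(A)
rhs = graph_measure(A)
```

Both quantities equal 5/8 here: the off-graph point `("b", 1)` contributes nothing because `lambda_c(1)` puts no mass on `"b"`, which is the finite analogue of $\lambda_c$ being concentrated on $\{C = c\}$.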
The existence of disintegration is guaranteed by the following theorem (Theorem 6 in Pollard [28], Appendix F).
Theorem 1. (Existence of disintegration) Let $\lambda$ be a sigma-finite Radon measure on the Borel sigma-field $F$ of a metric space $\Omega$. Let $\mu$ be a sigma-finite measure on $B$ that dominates the image measure $\lambda C^{-1}$ (i.e., the measure on $B$ induced by the map $C$ and the measure $\lambda$). If the set:

To guarantee the existence of the $(C, \mu)$-disintegration, we need, therefore, to restrict: (i) $\Omega$ to be a metric space with the Borel sigma-field $F$; (ii) $\lambda$ to be a sigma-finite Radon measure; and (iii) the set $\mathrm{graph}(C) \equiv \{(\omega, c) \in \Omega \times C : C(\omega) = c\}$ to be $F \otimes B$-measurable. Depending on the problem at hand, it may be reasonable to assume (i) directly. To see the importance of requirements (ii) and (iii), we briefly describe how the proof works. We then finally discuss the existence of conditional densities.
First, assume that $\Omega$ is a compact metric space, and let $K_0$ be a compact paving. A compact paving is a class of compact sets in $\Omega$ that is closed under finite unions and intersections. One can show that $K_0$ is countable when $\Omega$ is compact. The proof carefully constructs a finitely additive measure $\lambda_c : K_0 \to \mathbb{R}_+$, for some $c \in C$, so that the desired "measure-like" properties of the disintegration (Definition 3) hold for $\mu$-almost all $c$. Because $K_0$ is countable, all of the desired properties of $\lambda_c$ hold, except on countably many negligible sets, which can be collected into a single negligible set. It is shown, then, that there exists a unique extension of $\lambda_c$ to a countably additive measure defined on a sigma-field containing $K_0$ (see [28], Appendix A). The extension is (inner) approximated by compact sets. By construction, all of the desirable properties hold for the extension of $\lambda_c$ and for all $c \notin N$, where $N$ is a single set with $\mu(N) = 0$. The proof then shows that $c \mapsto \lambda_c(A)$ is $B$-measurable for all Borel sets $A \in F$. Finally, the argument is extended to the case in which $\Omega$ is not compact, but the measure $\lambda$ concentrates all of the mass on a disjoint union of countably many compact Borel sets; i.e., the measure $\lambda$ is a sigma-finite Radon measure. Intuitively, the proof exploits compact approximations as a way to obtain countable additivity from finite additivity and to collect the negligible sets into a single null set $N$.
Pachl [49] shows that a sigma-finite Radon $\lambda$ (Requirement (ii)) is a necessary condition for the existence of disintegration. Therefore, even when $\Omega$ is not compact (or not separable), $\lambda$ must have separable support.$^{22}$

The third requirement, the $F \otimes B$-measurability of the set $\mathrm{graph}(C)$, is also necessary because the measure $(\mu \otimes \Lambda)(A) = \int\!\!\int I_A \, d\lambda_c(\omega) \, d\mu(c) = \lambda\{\omega \in \Omega : (\omega, C(\omega)) \in A\}$ is well-defined only if $A \in F \otimes B$. The condition is not innocuous: it is well known that $\mathrm{graph}(C)$ may not be $F \otimes B$-measurable even when $C$ is measurable. The $F \otimes B$-measurability can be obtained if the σ-field $B$ is countably generated and contains all of the singleton sets $\{c\}$ (see [28], p. 344).
In particular, if B is the Borel σ-field on the separable metric space C, these conditions are satisfied (see [28], p. 103).
A separable $C$ with the Borel σ-field $B$ is sufficient, but not necessary, for the $F \otimes B$-measurability of $\mathrm{graph}(C)$. It is possible, but not trivial, to obtain such a result for non-separable spaces. Hansell [51] provides very abstract (and somewhat difficult to interpret) sufficient conditions for the $F \otimes B$-measurability when $C$ is not separable. Yet, even if the $F \otimes B$-measurability holds for a non-separable $C$, the Radon measure $\lambda$ puts all mass on a separable subset of $C$. To see why, let $G$ be a countable union of compact sets on $\Omega$, such that $\lambda(G^c) = 0$, where $G^c$ is the complement of $G$. The map $g : \Omega \to \Omega \times C$ defined by $g(\omega) = (\omega, C(\omega))$ is such that $\lambda$ concentrates all mass in the set $g(G)$. If $C$ is Borel measurable, the set $g(G) \subset \Omega \times C$ is separable, and so is $C(G)$ (see Bogachev [52], Corollary 6.10.17). The image measure of $C$ under $\lambda$ therefore puts all mass on a separable subset of $C$ when $C$ is non-separable. Therefore, although $C$ does not have to be separable to obtain the existence of disintegration, it seems difficult to get away from separability in this context.
The image measure $Q = P C^{-1}$ (i.e., the probability distribution of $C$ induced by $P$) is absolutely continuous with respect to $\mu$, with density $q(c) \equiv \int f(\omega) \, d\lambda_c(\omega)$.

3. The probability measure $P$ has conditional distribution $\{P_c : c \in C\}$ given $C$, where $P_c$ is defined by having density:

$$f(\omega \mid c) \equiv \frac{f(\omega)}{q(c)} \, \mathbb{1}\{0 < q(c) < \infty\} \qquad (12)$$

with respect to $\lambda_c$, for $Q$-almost all $c \in C$.
The formula in (12) is the general version of the conditional density as the ratio of the joint density to the marginal density, but not requiring $C$ to belong to a Euclidean space. To guarantee the existence of the conditional density, we therefore need the existence of the $(C, \mu)$-disintegration $\Lambda = \{\lambda_c : c \in C\}$. For a more detailed discussion, see [28][29][30]50].
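To see how (12) reduces to the familiar ratio in the Euclidean case, consider the following worked special case. The setting is an assumption made only for illustration: $\Omega = \mathbb{R}^2$ with generic point $\omega = (x, c)$, the map $C(x, c) = c$, and $P$ having density $f$ with respect to Lebesgue measure.

```latex
% Illustrative special case (assumed): \Omega = \mathbb{R}^2, C(x, c) = c.
\begin{align*}
\lambda &= \text{Lebesgue measure on } \mathbb{R}^2, \qquad
\mu = \text{Lebesgue measure on } \mathbb{R}, \\
\lambda_c &= \text{Lebesgue measure on the line } \{(x, c) : x \in \mathbb{R}\}, \\
q(c) &= \int f(\omega)\, d\lambda_c(\omega) = \int_{\mathbb{R}} f(x, c)\, dx, \\
f(\omega \mid c) &= \frac{f(x, c)}{\int_{\mathbb{R}} f(x', c)\, dx'}\,
  \mathbb{1}\{0 < q(c) < \infty\}.
\end{align*}
```

Here $q(c)$ is the usual marginal density of $C$, and the indicator handles the values of $c$ at which the marginal density vanishes or is infinite, which form a $Q$-null set.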