## 1. Introduction

Since the seminal papers by Blinder (1973) and Oaxaca (1973), many studies have used what is known as the Oaxaca–Blinder (OB) decomposition to analyze outcome differences between two well-defined groups. Such differences are characterized as functions of differences in characteristics (composition effect) and differences in the coefficients associated with those characteristics (wage structure effect). Subsequent research provided refinements that extended the OB decomposition analysis to non-linear functions and to distributional statistics other than the mean, as well as strategies to identify the model when some of the underlying assumptions do not hold (see Fortin et al. (2011) for a review of other methodological extensions).

While the OB decomposition can be directly applied to scenarios with naturally discrete groups (e.g., union and non-union workers, men and women, whites and nonwhites), the application of OB-type decompositions to cases with continuous or quasi-continuous groups is not standard. Ñopo (2008) and Ulrick (2012) have proposed extensions to the standard OB decomposition that allow for a continuous group variable, using ad hoc parametric approximations. These strategies can be biased if the selected functional form is incorrect, and neither strategy deals with a scenario where individuals self-select into groups based on unobservables (endogenous membership).

The purpose of this paper is to extend the OB decomposition to allow for a continuous group variable using a semiparametric approach known as varying coefficient models (Hastie and Tibshirani 1993). The strategy accounts for endogenous self-selection into groups by drawing on a generalization of the Heckman selection model that uses generalized inverse Mills ratios (GIMR), or generalized residuals (Heckman 1979; Lee 1978; Li and Racine 2007; Vella 1998), to address the problem. As discussed in Wooldridge (2015), the use of GIMR is equivalent to using a control function approach when addressing endogeneity.

A thorough search of the relevant literature yielded only two other papers that discuss the estimation of varying coefficient models with this type of endogeneity. Centorrino and Racine (2017) propose a strategy that uses instrumental variables and the method of moments to address endogeneity, and estimate the varying coefficient models using sieve estimators. More recently, Delgado et al. (2019) developed an estimator based on a control function approach, using a combination of spline regressions for the estimation of the first stage residuals and kernel regressions for the identification of the coefficients in the model. The strategy proposed here is closer to Delgado et al. (2019), in that generalized residuals from a first stage auxiliary regression (the generalized inverse Mills ratios) are included in the main model before it is estimated using local linear kernel regression methods.

The strategy presented could be used for analyzing heterogeneous dose-treatment effects under endogeneity, using an OB decomposition framework. In addition, under the assumption that all other control variables are exogenous, the proposed strategy can also be used to identify the parameters of the model of interest and analyze the heterogeneity of the impact of characteristics across the continuous group variable. For example, Centorrino and Racine (2017) re-explore the impact of race, experience and place of residence on wages when looking at individuals with different levels of education (equivalent to the continuous grouping variable). Delgado et al. (2019) illustrate their methodology by analyzing the demand for gasoline in the US using household income as the grouping variable. Other applications may include the analysis of smoking and smoking intensity on wages (Hotchkiss and Pitts 2013), training duration on employment probabilities (Kluve et al. 2012), or, as will be shown in the illustration section, the impact of Body Mass Index (BMI) on wages (Cawley 2004).

The subsequent sections of the paper are structured as follows. Section 2 describes the basic Oaxaca–Blinder decomposition analysis in the presence of self-selection/endogenous membership. Section 3 introduces the use of the Generalized Inverse Mills Ratio (GIMR) when individuals self-select into a continuous group. Section 4 describes the estimation of varying coefficient models, the selection of bandwidths and the estimation of standard errors. Section 5 provides Monte-Carlo Simulations showing the performance of the proposed strategy. Section 6 provides an example of the implementation of the methodology, revisiting the wage penalty of obesity based on the research of Cawley (2004). Section 7 concludes the paper.

## 2. The OB Decomposition with Selection: Basics

In the standard OB approach, the goal is to analyze how differences in observed characteristics, and in the returns to these characteristics, explain average differences in outcomes between two groups. For the appropriate identification of the OB decomposition, the strategy requires that potential outcomes can be estimated using two well-specified linear models with exogenous membership into each group. This ensures that the distribution of the errors is orthogonal to the group membership.

In many instances, however, the assumption of membership exogeneity is likely to be violated if individuals self-select to be part of a specific group (i.e., part of the treatment group). When this happens, the conditional distribution of the errors is no longer independent of the group membership, ruling out the identification strategy of the standard decomposition approach. A strategy commonly used to address this problem is the implementation of a Heckman selection model. As described in Heckman (1979), endogenous selection can be considered as an omitted variable problem that can be corrected by modeling the selection process and using this information to identify the parameters of the model of interest. This strategy requires the estimation of a three-equation model that is described as follows:

where ${D}_{i}^{\ast}$ is the latent propensity of an individual $i$ to be part of group B, $X$ is a set of exogenous variables uncorrelated with ${\mu}_{A}$ and ${\mu}_{B}$, and $Z$ is a vector of variables related to individuals' membership that may include variables not included in $X$. If we assume that (${\mu}_{A,i},{\mu}_{B,i},{\epsilon}_{i}$) are jointly distributed as multivariate normal:

the model can be estimated using full information maximum likelihood (FIML) or a two-step procedure (heckit). The latter involves including estimates of the selection correction terms, the inverse Mills ratio (IMR), in the main outcome model based on the information from the selection equation. For this setup, the IMR ($\lambda$) is defined as follows:

where $\varphi (.)$ stands for the standard normal density function, and $\mathsf{\Phi}(.)$ for the standard normal cumulative distribution function.
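For reference, one common textbook way of writing the switching regression setup and the selection correction term described above is sketched below. The notation ($Y_{A,i}$, $Y_{B,i}$, ${\beta}_{A}$, ${\beta}_{B}$, the convention that $D_i = 1$ denotes group B, and the normalization of the variance of ${\epsilon}_{i}$ to one) is an assumption chosen to match the surrounding definitions, and may differ from the paper's own display equations.

$$
\begin{aligned}
D_i^{\ast} &= Z_i\gamma + \epsilon_i, \qquad D_i = 1\left(D_i^{\ast} > 0\right),\\
Y_{A,i} &= X_i\beta_A + \mu_{A,i} \quad \text{if } D_i = 0,\\
Y_{B,i} &= X_i\beta_B + \mu_{B,i} \quad \text{if } D_i = 1,\\
\lambda_i &=
\begin{cases}
\dfrac{\varphi(Z_i\gamma)}{\mathsf{\Phi}(Z_i\gamma)} & \text{if } D_i = 1,\\[2ex]
-\dfrac{\varphi(Z_i\gamma)}{1-\mathsf{\Phi}(Z_i\gamma)} & \text{if } D_i = 0.
\end{cases}
\end{aligned}
$$

Under joint normality, the conditional mean of each outcome error given membership is proportional to $\lambda_i$, which is why adding it as a regressor absorbs the selection bias.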

The parameters $\gamma $ can be obtained by estimating the selection equation in (1) using a probit model, while consistent estimates for the outcome equations can be obtained using ordinary least squares (OLS) by including the corresponding IMR as explanatory variables:

In this setting, an estimate of the outcome gap after adjusting for selection can be written as follows:

which can be used to implement any variation of the standard OB decomposition based on assumptions about the counterfactual wage structure. As described in Fortin et al. (2011), outcome differences accounted for by differences in the coefficients (structure effect) can be interpreted as the treatment effect of membership, after adjusting for differences in observed characteristics and endogenous selection. In addition, under the exogeneity assumption for the explanatory variables $X$, the detailed decomposition can be used to analyze the heterogeneity of the contribution of individual characteristics to the outcome gap.
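The two-step procedure and the selection-adjusted decomposition described in this section can be sketched in Python as follows. This is a minimal illustration on simulated data, not the paper's implementation: the data-generating process, the helper names (`fit_probit`, `ols`), the use of a single excluded variable `z`, and the choice of group A coefficients as the counterfactual structure are all assumptions made for the example.

```python
import math
import numpy as np

def norm_pdf(x):
    return np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def fit_probit(Z, d, iters=25):
    """Probit coefficients by Fisher scoring (hypothetical helper)."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(iters):
        idx = Z @ gamma
        p = np.clip(norm_cdf(idx), 1e-10, 1 - 1e-10)
        pdf = norm_pdf(idx)
        score = Z.T @ (pdf * (d - p) / (p * (1 - p)))
        info = (Z * (pdf**2 / (p * (1 - p)))[:, None]).T @ Z
        gamma = gamma + np.linalg.solve(info, score)
    return gamma

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulated data (assumed design): the same error mu enters both outcome equations.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
z = rng.normal(size=n)  # variable excluded from the outcome equations
eps, mu = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n).T
d = (0.5 * x + z + eps > 0).astype(float)  # 1 = group B, 0 = group A
y = np.where(d == 1, 1.0 + 1.5 * x, 0.5 + 1.0 * x) + mu

# Step 1: probit selection equation and inverse Mills ratios.
Z = np.column_stack([np.ones(n), x, z])
idx = Z @ fit_probit(Z, d)
imr = np.where(d == 1, norm_pdf(idx) / norm_cdf(idx),
               -norm_pdf(idx) / (1 - norm_cdf(idx)))

# Step 2: group-specific OLS with the IMR as an additional regressor.
X = np.column_stack([np.ones(n), x])
a, b = d == 0, d == 1
beta_a = ols(np.column_stack([X[a], imr[a]]), y[a])[:2]
beta_b = ols(np.column_stack([X[b], imr[b]]), y[b])[:2]
xbar_a, xbar_b = X[a].mean(axis=0), X[b].mean(axis=0)

# Selection-adjusted gap and a two-fold decomposition.
adjusted_gap = xbar_b @ beta_b - xbar_a @ beta_a
composition = (xbar_b - xbar_a) @ beta_a
structure = xbar_b @ (beta_b - beta_a)
print(adjusted_gap, composition, structure)
```

By construction, `composition + structure` equals `adjusted_gap`. Using group B coefficients as the reference instead gives the other standard variant of the decomposition.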

## 3. Generalized Sample Selection

The model described above assumes that the only information known about the selection process is that individuals are members of one of two groups (A or B). As discussed in Vella (1998), $D$ may contain additional information that can be used to obtain a better approximation of the selection correction term, even if the interest remains in analyzing differences between two groups.

As before, consider a model where the continuous characteristic ${D}_{i}$ is observed for each individual and indexes their membership along a continuum of groups. This information can be used to broadly classify individuals into Groups A and B (dichotomization of the groups). The selection process and outcome equations can be described as follows:

with ${\mu}_{A,i},{\mu}_{B,i},{\epsilon}_{i}$ following a joint normal distribution as defined previously, with some arbitrary threshold $c$ to define membership, and with the third equation in (7) representing the equation, or equations, that describe the endogenous selection process. This model reverts to the standard switching regression model if a dichotomous transformation $1\left({D}_{i}>c\right)$ is used as described in the previous section. However, if further variation in ${D}_{i}$ is observed, other methods can be used to exploit this information.

Many authors have proposed alternatives for the estimation of these types of selection models where more information about the endogenous membership is available, using both parametric and semiparametric strategies (see Li and Racine (2007, sec. 10.3), and Vella (1998)). In general, following the approach proposed by Heckman (1979), these methodologies suggest that to obtain consistent estimators for the parameters $\beta $, one should include an approximation of the selection bias term as a control in the main regression model. This paper concentrates on three methodologies that assume the overall distribution of $D$ is observed, but can be easily adapted to scenarios where $D$ is partially observed.

Vella (1998) discusses the estimation of models such as the one described above and suggests that a feasible strategy is to estimate the selection process as a tobit model if $D$ has a censored distribution. In this case, assuming $D$ is censored at zero, the corresponding IMR (selection correction term) is defined as:

These are often called generalized residuals, and are referred to here as generalized inverse Mills ratios (GIMR). It should be noted that when $D$ is not censored, the selection equation can be estimated using standard OLS and the IMR are simply the OLS residuals. Alternatively, this equation can be modified if $D$ is censored at different points of its distribution. Including these residuals in the main model is equivalent to the control function approach described in Wooldridge (2015). The control function approach is also a common strategy for dealing with endogeneity in linear and nonlinear parametric frameworks, and in nonparametric frameworks (see Li and Racine (2007, chp. 17), Henderson and Parmeter (2015, chp. 10), and Wooldridge (2015)).
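The uncensored case mentioned above, where the correction term is simply the first-stage OLS residual, can be illustrated with a short simulation. This is a sketch under assumed values: the data-generating process, the single instrument `z`, and the constant coefficient on `d` are hypothetical choices made only to show how including the residual removes the endogeneity bias.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 20000
z = rng.normal(size=n)               # instrument, excluded from the outcome equation
eps = rng.normal(size=n)             # shock in the membership equation
mu = 0.8 * eps + rng.normal(size=n)  # outcome error, correlated with eps
d = 1.0 + z + eps                    # uncensored continuous membership variable
y = 2.0 + 0.5 * d + mu               # true coefficient on d is 0.5

ones = np.ones(n)

# First stage: OLS of d on z; the residuals play the role of the GIMR.
Z = np.column_stack([ones, z])
gimr = d - Z @ ols(Z, d)

# Outcome model without and with the control function term.
naive = ols(np.column_stack([ones, d]), y)
cf = ols(np.column_stack([ones, d, gimr]), y)
print("naive slope:", naive[1])
print("control function slope:", cf[1], "coefficient on GIMR:", cf[2])
```

In this design the naive slope absorbs the correlation between `d` and `mu`, while the control function slope is centered on the true value and the coefficient on the residual estimates the covariance between the two errors.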

As Vella (1998) and Li and Racine (2007) describe, using the correction term in Equation (8) provides estimates that are more stable and efficient than those obtained using the standard IMR (which assumes dichotomous grouping). However, an instrumental variable is required to identify the coefficients of the selection correction term and of the grouping variable $D$ (intensity), if it were to be included in the model specification.

An alternative method described in Vella (1998) is one where the selection process corresponds to a setting with discrete but ordered selection rules. If we assume that $\tilde{D}$ is a discretized transformation of $D$ (i.e., ${\tilde{D}}_{i}=k \text{ if } {D}_{i}\in \left\{l{l}_{k},u{l}_{k}\right\} \text{ for } k\in \left[0,1,\dots ,J\right]$), and that ${\tilde{D}}_{k,i}^{\ast}$ is the latent propensity of an individual $i$ to be part of group $\tilde{D}=k$, then the selection process can be written as:

Note that Equation (9) is a different way of writing the selection model described in Vella (1998), where all coefficients in ${\gamma}_{k}$ are permitted to vary. Additionally, note that all latent propensities are affected by the same shock (${\epsilon}_{i}$). Under the parallel lines assumption (Williams 2016), an ordered probit (O-probit) can be used to estimate this model, where only the constant is allowed to vary across models.

As described in Chernozhukov et al. (2013), a more flexible alternative for the estimation of the selection model is to allow all parameters in ${\gamma}_{k}$ to vary across all points of the distribution of $D$. This can be done using independent models (Foresi and Peracchi 1995), or using simultaneous models such as the generalized ordered probit model (Terza 1985). Both alternatives impose a greater computational burden and may produce unrealistic predicted probabilities as the number of groups ($J$) increases.

As described in Vella (1998), similar to the binary group case, the outcome equations can be consistently estimated using OLS by simply including a selection correction term, which for the selection rule described by Equation (9) takes the following form:

where ${\lambda}_{i}^{\ast}$ is the GIMR. Here, the term $E\left({\mu}_{k,i}|{\tilde{D}}_{i},{Z}_{i}\right)$ is only an approximation of the correction term $E\left({\mu}_{k,i}|{D}_{i},{Z}_{i}\right)$, as it can be considered the expected value of the correction term over all values of ${D}_{i}$ within the group ${\tilde{D}}_{i}$. Any approximation bias would disappear $\left(E\left({\mu}_{k,i}|{\tilde{D}}_{i},{Z}_{i}\right)-E\left({\mu}_{k,i}|{D}_{i},{Z}_{i}\right)\to 0\right)$ as the sample size increases to infinity ($N\to \infty $) and the bandwidth within each category tends to zero $\left(u{l}_{k}-l{l}_{k}\to 0\right)$. If no instrumental variables are used in the selection equation, the GIMR will be close to a linear function of the estimated latent index, and the estimator will be poorly identified (Chiburis and Lokshin 2007). This strategy can be easily adapted to scenarios where ${D}_{i}$ is partially observed due to censoring; however, a drawback is that it requires choosing the number of groups used to reclassify the original data.
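To make the ordered-selection correction term concrete, the sketch below computes the generalized residual of an ordered probit with a standard normal shock, that is, the conditional mean of the shock given that the latent index falls between two cut points. The function name `ordered_gimr`, the cut points, and the fitted index value are hypothetical; the formula is the usual ordered probit generalized residual, assumed here to correspond to the correction term discussed above.

```python
import math

def norm_pdf(x):
    if math.isinf(x):
        return 0.0
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    if math.isinf(x):
        return 1.0 if x > 0 else 0.0
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_gimr(index, lower, upper):
    """E[eps | lower < index + eps <= upper] for eps ~ N(0, 1)."""
    a, b = lower - index, upper - index
    return (norm_pdf(a) - norm_pdf(b)) / (norm_cdf(b) - norm_cdf(a))

# Hypothetical cut points defining four ordered groups, and a fitted index Z*gamma.
cuts = [-math.inf, -1.0, 0.0, 1.0, math.inf]
index = 0.3
for k in range(len(cuts) - 1):
    print(k, ordered_gimr(index, cuts[k], cuts[k + 1]))
```

With a single threshold and an unbounded upper limit, the expression collapses to the standard inverse Mills ratio of the binary case, and the probability-weighted average of the terms across groups is zero.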

Drawing on the literature on distributional regressions (Chernozhukov et al. 2013), the last alternative suggested here is to use global distributional regressions to characterize the cumulative distribution of the outcome $F(D|z)$. This can be done using a fractional probit model that takes the form:

Empirically, this model can be estimated by substituting $P(d\le {D}_{i}|x)$ with the sample unconditional cumulative distribution $\widehat{F}\left({D}_{i}\right)=\frac{1}{n}\sum_{j=1}^{n}1({d}_{j}<{D}_{i})$, or some other approximation.
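The sample unconditional cumulative distribution used as the dependent variable can be computed as in the sketch below. It follows the strict-inequality definition given in the text; the function name `empirical_cdf` and the example values are illustrative assumptions.

```python
from bisect import bisect_left

def empirical_cdf(values):
    """F_hat(D_i) = (1/n) * #{j : d_j < D_i}, using a strict inequality."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left returns how many sorted elements are strictly below v.
    return [bisect_left(ordered, v) / n for v in values]

d = [2.0, 0.5, 3.5, 0.5, 1.0]
print(empirical_cdf(d))
```

Observations at the minimum receive a value of exactly zero, which has no finite probit index. A rescaled rank may therefore be preferable in practice, although that choice is an assumption here and not something the text specifies.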

In this case, the corresponding GIMR takes the form:

Once the corresponding selection correction terms have been estimated, they can be used to estimate the parameters for the models of interest (Equation (7)) and the selectivity corrected average wage gaps. These elements can then be used to implement an OB decomposition in the standard way using Equation (6). In this framework, the structure effect can be interpreted as the average treatment effect.

As will be shown through Monte-Carlo Simulations, all these methodologies can be used for identification of the main parameters of the model, but the correct identification of the constant in the original model will depend on the shape of the distribution of the membership variable and the method used to estimate the generalized inverse Mills ratios.

## 5. Monte-Carlo Simulations

To assess the performance of the proposed methodology and its finite sample properties, I simulate 1000 samples of size n = 500, 1000, 2500 and 5000 from the following scheme:

where $\mathcal{N}$ represents a joint normal distribution. The endogenous membership is defined by:

with three separate specifications used for the varying coefficient:

where $\varphi $ and $\mathsf{\Phi}$ are the standard normal probability density and cumulative distribution functions, respectively. These functional forms were chosen to generate nonlinearities that could not be captured using polynomial approximations. Finally, to add heterogeneity in the degree of endogeneity across $d$, the outcome of interest is defined as:

After each sample is simulated, the model is estimated with the procedure described in Section 3, estimating the cross-validated bandwidths for each simulated sample, and estimating bootstrapped standard errors using 199 repetitions.

Table 1 provides a summary of the results, showing the bias, standard errors from the simulations, average bootstrapped standard errors, and the 95% coverage and bias-corrected coverage using normal-based confidence intervals. For the results in Table 1, OLS-GIMR is used to correct for endogenous selection. Table 2 provides a similar exercise, using simulations with samples of size n = 5000, but applying the GIMR from the ordered probit and fractional probit regression models. In all cases, Table 1 and Table 2 report the average estimates for the coefficients at selected points in the distribution of $d$, with the bottom and top values (−3 and 5) representing the 2.5th and 97.5th percentiles of the distribution of $d$.
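The summary statistics reported in the tables can be computed from the simulation output along the following lines. This is a sketch: the function name `summarize` and the example inputs are hypothetical, and the bias-corrected coverage is assumed here to mean coverage of intervals recentered by the average bias across replications, which may differ from the paper's exact definition.

```python
import math
from statistics import mean, stdev

def summarize(estimates, boot_ses, truth, z=1.96):
    """Summary statistics for one coefficient at one evaluation point."""
    bias = mean(estimates) - truth

    def coverage(shift):
        # Share of normal-based intervals (est - shift) +/- z * se that contain the truth.
        hits = [abs(est - shift - truth) <= z * se
                for est, se in zip(estimates, boot_ses)]
        return sum(hits) / len(hits)

    return {
        "bias": bias,
        "sim_se": stdev(estimates),      # standard deviation across replications
        "avg_boot_se": mean(boot_ses),   # average bootstrapped standard error
        "coverage": coverage(0.0),
        "bc_coverage": coverage(bias),   # intervals recentered by the average bias
    }

# Hypothetical output from four replications with a true coefficient of 1.0.
result = summarize([1.3, 1.25, 1.35, 1.3], [0.1, 0.1, 0.1, 0.1], truth=1.0)
print(result)
```

In this toy input the estimates are tightly clustered but biased, so the raw intervals miss the true value while the recentered intervals cover it, which is the pattern that separates the two coverage measures.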

The simulations suggest that the proposed estimator performs reasonably well in finite samples. Akin to other applications of semiparametric analysis, the estimator presents the largest bias at the boundaries of the distribution, but also around points where the second derivative of the coefficient with respect to $d$ (${\partial}^{2}{\beta}_{k}\left(d\right)/\partial {d}^{2}$) is large. This bias disappears when larger samples and smaller bandwidths are used.
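The link between curvature and bias can be seen in a minimal local linear varying coefficient estimator, sketched below. It is an illustration and not the paper's estimator: the Gaussian kernel, the fixed bandwidth `h`, the function name `local_linear_vc`, and the coefficient functions are assumptions, and no selection correction or bandwidth selection is included.

```python
import numpy as np

def local_linear_vc(y, X, d, d0, h):
    """Local linear estimate of beta(d0) in y = X @ beta(d) + e, Gaussian kernel."""
    u = d - d0
    w = np.exp(-0.5 * (u / h) ** 2)                # kernel weights around d0
    design = np.column_stack([X, X * u[:, None]])  # levels and slopes of beta(.) at d0
    sw = np.sqrt(w)
    coef = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)[0]
    return coef[: X.shape[1]]                      # first block estimates beta(d0)

rng = np.random.default_rng(2)
n = 2000
d = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.column_stack([np.sin(d), 1.0 + 0.5 * d**2])  # hypothetical coefficient functions
y = (X * beta).sum(axis=1) + 0.1 * rng.normal(size=n)

# True values at d0 = 0 are [0, 1]; the second coefficient has nonzero curvature there.
est0 = local_linear_vc(y, X, d, d0=0.0, h=0.3)
print(est0)
```

Because the fit is linear in $d$ within each window, coefficient functions that are locally linear are recovered without smoothing bias, while curvature introduces a bias that shrinks with the bandwidth.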

The bootstrap procedure used to correct the standard errors produces estimates that slightly understate the simulated standard errors. For the simulations with sample size n = 500, the average bootstrapped standard errors understate the simulated standard errors by 5% on average. For the simulations with sample size n = 5000, bootstrapped standard errors understate the simulated standard errors by 2.5% on average. Looking at the raw coverage, except for areas with large bias, the estimator obtains coverage between 90% and 95%, even for the simulations with the smallest sample size. After correcting for the average bias, the coverage is above 94% for the majority of the cases. Finally, comparing the performance of the different estimators of the GIMR (Table 2), all strategies perform similarly well, with only minor differences in coverage. Additional simulations presented in the appendix show that the choice of the GIMR estimator matters if $d$ has a bounded distribution.

## 7. Conclusions

In this paper, I have presented a methodology for the implementation of the Oaxaca–Blinder decomposition when the grouping variable is continuous and there is endogenous selection into groups. This methodology uses a semiparametric approach known as varying coefficient models (Hastie and Tibshirani 1993), which has the advantage of providing a more flexible specification for the parameterization of the coefficients, compared to the models proposed by Ñopo (2008) and Ulrick (2012). Specifically, this paper describes the use of kernel local linear regressions for the estimation of such models.

The use of the generalized inverse Mills ratios, also known as generalized residuals, allows for a feasible strategy to control for the endogenous selection based on the continuous grouping variable. This methodology is similar to the one proposed in Delgado et al. (2019), suggesting a similar control function approach to address endogeneity from the semiparametric component of the regression. While I do not discuss the theoretical properties of the estimator, the Monte-Carlo Simulation exercises suggest that the proposed strategy provides a simple but powerful approach to obtain consistent estimators of the outcome model parameters. This suggests that the proposed estimator can be used alongside the methodologies proposed by Centorrino and Racine (2017) and Delgado et al. (2019). A more formal analysis of the theoretical properties of the proposed estimator is left for future research.

This methodology may prove useful for the analysis of endogenous treatment effects with varying treatment intensity, when heterogeneous effects are present. In addition, it can also be used for analyzing the heterogeneity of the impact of other exogenous variables conditional on a grouping variable of interest.

In the illustration example, I revisit the results from Cawley (2004) to evaluate the causal effect of BMI on wages. The application of the semiparametric OB decomposition shows that the association between BMI and wages is nonlinear, and that the negative impact of BMI on wages varies considerably compared to the effect described in Cawley (2004). Furthermore, it shows that for men, BMI also has a statistically significant and negative association with wages, which was not captured previously because of the weak but positive impact that BMI has on wages for men with low BMI.