An Exponential Endogenous Switching Regression with Correlated Random Coefficients

Abstract: This paper presents a method for estimating the average treatment effect (ATE) in an exponential endogenous switching model in which the coefficients of the covariates in the structural equation are random and correlated with the binary treatment variable. The estimating equations are derived under some mild identifying assumptions. We find that the ATE is identified even though each coefficient in the structural model may not be. Tests for the endogeneity of the treatment and for model selection are provided. Monte Carlo simulations show that, in large samples, the proposed estimator has a smaller bias and a larger variance than methods that do not take the random coefficients into account. The method is applied to Oregon health insurance data.


Introduction
Conventional instrumental variable methods fail to consistently estimate average partial effects (APE) when individual heterogeneity enters the model in a non-additive way (Heckman and Vytlacil 1998; Card 2001; Browning and Lechene 2003; Browning and Carro 2007; Imbens 2007). One of the simplest models allowing for non-additive heterogeneity is the correlated random coefficient (CRC) model, in which the heterogeneity interacts with the covariates to create random coefficients. This is in fact a generalization of additive heterogeneity models, since the sum of the error and the constant in a linear regression equation is a random intercept. This paper provides the identification and estimation of the average treatment effect (ATE) for an exponential model that has two regimes induced by an endogenous treatment. The covariates in the model have random coefficients that are correlated with the treatment. The conditional expectation is specified as an exponential function because of the limited nature of the dependent variable. We also provide tests for the endogeneity of the treatment and for model selection. One important feature of this model is that it uncovers another mechanism of individual heterogeneity, working through the heterogeneous and endogenous marginal effects of the covariates. This advantage comes from the parametric modelling not attempted by others who used non-parametric approaches, such as Imbens (1994, 1995), Abadie (2003) and Heckman and Vytlacil (2005).
The CRC models are closely related, although not identical, to heterogeneous treatment effects models. There are two strands of studies on this subject. The first is by Heckman and Vytlacil (1998), who considered random, i.e., individual-specific, coefficients on the explanatory variables. The second is due to Angrist and Imbens (1994) under Rubin's counterfactual setting, where the heterogeneous treatment effect, i.e., the individual-specific difference of outcomes between two regimes, appears as the coefficient on the binary regime indicator. These two types of models end up looking very similar in that the individual-specific heterogeneity is a non-constant coefficient. The differences are that the former is not Rubin's causal model and that it may have random coefficients on continuous variables. The conventional 2SLS fails to identify the parameters of interest in both models: for the first model, the mean of the random coefficient, and, for the second, the mean of the treatment effects. Nevertheless, for the second model, Imbens (1994, 1995) showed that the 2SLS estimates another meaningful parameter, the local average treatment effect (LATE), or a weighted average thereof. The identification of the LATE was made possible largely by the counterfactual two-regime setting, which the first model does not allow. The model that I deal with in this article is a synthesis of these two models: it has random coefficients on the continuous covariates as well as the two counterfactual regimes under Rubin's causal model. I will show that the ATE can be identified and estimated.
When the variable of interest is continuous and its coefficient is correlated with the variable itself, the parameter of interest is the average of the random coefficient. The error term then includes the interaction between the random part of the coefficient and the variable itself. In such cases, the conventional instrumental variable assumptions are not sufficient for consistent estimation of the parameters of interest. To solve this problem, Garen (1984) suggests a control function method using some distributional assumptions, Heckman and Vytlacil (1998) propose a two-step plug-in estimator, and Wooldridge (2003) delineates the additional conditions required for instrumental variables to ensure consistency. More recently, Masten and Torgovitsky (2016) provide identification results when the first-stage equation also has random coefficients.
The main theme of this paper is the identification and estimation of the ATE in the exponential endogenous switching (ES) regression, where the covariates in the structural equations have CRCs. Terza (1998, 2009) considers the model without CRCs. Wooldridge (2015), in a linear setting with a continuous dependent variable, allows the coefficients on all other covariates to be random and correlated with the treatment. In line with this, this study extends Terza (2009)'s exponential ES regression model by allowing the coefficients on the covariates in each regime to be correlated with the binary regime indicator. This is an exponential version of the linear two-regime CRC model. I will show how it can be used to estimate the ATE of health insurance on the number of emergency room visits. Simulations confirm that the exponential CRC model has smaller finite-sample bias when the covariates have CRCs.
The remainder of this paper is organized as follows. Section 2 discusses the two-regime CRC model with a non-negative dependent variable and provides the estimating equation under some mild assumptions. Section 3 relates the estimating equation to the identification of the ATE. Section 4 provides the specification tests for endogeneity and model selection. Section 5 describes the data-generating processes of the Monte Carlo simulations and presents the results. Section 6 provides an empirical application analysing the effect of Oregon's health insurance on the number of emergency room visits. Section 7 concludes.

Exponential Regime Switching CRC Model
The variable of interest w is binary, i.e., w ∈ {0, 1}, which calls for an analysis of the ATE under a counterfactual framework. In this setting, the first-stage binary selection equation is assumed to follow the probit model, as in Heckman (1976) and Terza (2009). For a continuous w, the average partial effects (APE) are identified (Wooldridge 2015), although the average of the random coefficient on w is not. Similarly, I will show that, in the exponential regression model, the ATE is identified while the average of the coefficient on w is not. One could instead apply a log transformation to the dependent variable, but then the dependent variable must never be zero, which is problematic when it is a count variable, as in the example in Section 6. The exponential conditional expectation is more flexible in this sense and is therefore the preferred model. Let a_w and b_w denote individual heterogeneities that may affect the outcome y_w. Here, a_w is the random intercept and b_w the vector of random coefficients in regime w. The regime subscript w will often be omitted unless necessary. Let b_j denote the random coefficient on the j-th explanatory variable, so that b = (b_1, b_2, …, b_K)′. We make the following assumptions, where the equation describes Regime 1 when w = 1 and Regime 0 otherwise.
We assume that the conditional expectation is correctly specified. In the estimation later on, the correct specification guarantees consistency despite an incorrect log-likelihood function (Gourieroux et al. 1984). Once the regime is determined, the regime indicator w is redundant conditional on a_w and b_w, because they summarize all the information about the determined regime. The vector z denotes all the available exogenous variables, which include x. The exogeneity of z means that it is independent of a_w and b_w.
The assumption is that there must be some exogenous variables that are excluded from the conditional expectation function. Additionally, once the unobserved heterogeneity in a particular regime is accounted for, the heterogeneity for the other regime carries no additional information. Using the exponential conditional expectation, the model can be written as follows. The random intercepts a_w are scalars, x is a 1 × K vector of covariates, and b_w are K × 1 random coefficient vectors. The heterogeneity b_w enters as random coefficients. Although we do not attempt to identify the APEs of the covariates per se, their presence hampers the identification of the ATE, which is our target. For random coefficients, we use the notation σ²(x) ≡ var(x) and σ(x, y) ≡ cov(x, y).

Assumption 2. (Probit) The regime choice equation in (2) follows the probit model, i.e., v ∼ N(0, 1), and γ is reparametrized accordingly. Additionally, z is independent of v.
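The display equations for the outcome and selection equations appear to have been lost in extraction. A hedged reconstruction, consistent with the exponential conditional mean and the linear-index probit described in the surrounding text (not a verbatim copy of the paper's equations), would be:

```latex
E[y_w \mid x, a_w, b_w] = \exp(a_w + x b_w), \qquad w \in \{0, 1\}, \tag{1}
```
```latex
w = \mathbf{1}[z\gamma + v > 0], \qquad v \sim N(0, 1). \tag{2}
```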
Although Terza (1998) initially derives the correction term under the assumption of multivariate normality between v and the error in the structural equation, Terza (2009) later shows that the normality of v alone is in fact sufficient for deriving the correction terms. We absorb the dependence by using linear projections and avoid invoking multivariate normality. This parametric probit assumes homogeneous instrument effects on the endogenous variable. Heterogeneous instrument effects, as discussed by Masten and Torgovitsky (2016), are not attempted here; their approach could relax not only the homogeneous instrument effects assumption, but also the distributional assumption on v at the same time. We leave this as a future task.
Let u_b, e_b and σ(u_b, v) denote K × 1 vectors whose elements are u_{b_j}, e_{b_j} and σ(u_{b_j}, v), respectively.
The random coefficients in (1) can be written as follows, and, for the errors in these equations, their orthogonal projections on v are given below for each j. In addition to the orthogonality between v and (e_a, e_b), which is implied by the linear projection, we need a slightly stronger assumption, as below.
We use the independence of v from e_a and e_b, which is stronger than the orthogonality guaranteed by the projection theorem. Even if e_a and e_b are each independent of v, their sum e_a + xe_b may not be; the second part of the assumption precludes this. It says that the σ-field of v and that of e_a + xe_b are independent conditional on z. Additionally conditioning on w = 1 is equivalent to conditioning on v > −zγ. When an arbitrary event in the σ-field of v is independent of the σ-field of e_a + xe_b, its intersection with {v > −zγ} is also independent, because {v > −zγ} is an event in the σ-field of v. Therefore, v ⊥⊥ e_a + xe_b | z, w is implied by Assumption 3(ii). One can now establish the following lemma: Lemma 1. Under Assumptions 2 and 3, the following holds for Regime 1, and for Regime 0.

Proof. See Appendix A.
Let the composite structural error, denoted ε here, and v follow a bivariate normal distribution with correlation ρ. Terza (1998, 2009) directly uses this bivariate normality to solve the conditional expectation; i.e., E[exp(ε) | z, v] = exp(ρσv + σ²(1 − ρ²)/2) is derived under the bivariate normal assumption between ε and v. In the derivation above, I instead use the linear projections of u_a and u_b on v. This gives an estimating equation that can be used directly to identify the ATE. The structural parameters, such as β, are not separately identified, but the coefficients on x and x² from the estimating equation derived by the linear projection coincide with the coefficients of the estimating equation for the ATE.
Let us now derive an estimating equation for the model in (2). Note that, by Assumptions 1 and 2, the following holds. The above equation separately identifies only σ²(u_{b_j}), σ(u_a, v) and σ(u_{b_j}, v); the average semi-elasticity, i.e., β, is not identified because of the nuisance parameters σ(u_a, u_{b_j}). The equation can be estimated by a two-step procedure using the probit model for the first-stage selection equation. Given the estimated index zγ̂, the equation can be estimated either by non-linear least squares or by quasi-maximum likelihood under a Poisson distribution. It can also be estimated by a simultaneous single-step method. See Keay (2018) for a detailed discussion of the estimation procedures and their asymptotic distributions. Since the structural parameters α and β are not separately identified, our interest should lie in the ATEs. This is discussed in the next section.
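As a concrete sketch, the two-step procedure can be illustrated on simulated data: a probit first stage followed by a Poisson quasi-maximum likelihood second stage with a generated regressor. The data-generating process, all parameter values, and the use of a single inverse-Mills-ratio correction are simplifying assumptions for illustration; the paper's actual correction term also involves x inside the normal cdf.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)
x = rng.uniform(0.0, 1.0, n)
v = rng.standard_normal(n)
w = (0.8 * z + v > 0).astype(float)

# Outcome with an exponential conditional mean; v entering the regime-1 mean
# makes the treatment endogenous (all parameter values are hypothetical).
mu = np.exp(0.2 + 0.5 * x + 0.3 * v * w)
y = rng.poisson(mu).astype(float)

# Step 1: probit MLE of w on z (the first-stage selection equation).
def probit_nll(g):
    p = np.clip(norm.cdf(g[0] + g[1] * z), 1e-10, 1 - 1e-10)
    return -np.sum(w * np.log(p) + (1.0 - w) * np.log(1.0 - p))

ghat = minimize(probit_nll, np.zeros(2), method="BFGS").x
idx = ghat[0] + ghat[1] * z

# Step 2: Poisson QMLE on the regime-1 subsample, with the inverse Mills
# ratio lambda(idx) as a generated regressor (a control-function correction).
lam = norm.pdf(idx) / norm.cdf(idx)
s = w == 1.0

def poisson_qnll(t):
    lin = t[0] + t[1] * x[s] + t[2] * lam[s]
    lin = np.clip(lin, -30.0, 30.0)  # guard against overflow in line searches
    return np.sum(np.exp(lin) - y[s] * lin)

that = minimize(poisson_qnll, np.zeros(3), method="BFGS").x
print(that)
```

The second-stage objective is the negative Poisson quasi-log-likelihood (up to the term not involving parameters), so consistency only requires the conditional mean to be correct, in the spirit of Gourieroux et al. (1984).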

Estimating the Average Treatment Effects
The ATE is defined as E[y_1 − y_0]. One of the easiest ways to identify it is via the law of iterated expectations, i.e., E[y_1 − y_0] = E{E[y_1 − y_0 | x]}, as long as the expectation conditional on x can be derived. To this end, the following lemma is useful: Lemma 2. Under Assumption 3, the following result holds. Note the similarity of the resulting equation with the estimating equation in (7): the arguments of the exponential functions are identical. The structural parameters β are not separately identified, but the composite coefficients are estimated by the procedures applied to Equation (7). Let Ê[y_1 − y_0 | x_i] be Equation (9) with the coefficients replaced by their estimates. The ATE estimator is the sample average of Ê[y_1 − y_0 | x_i] over the values of x_i. Consistency is guaranteed by Assumption 1. We now establish the asymptotic normality of the ATE estimator. To this end, define θ as the vector of the estimated coefficients. The asymptotic variance of θ̂ is simply A⁻¹BA⁻¹, where A and B are the Hessian and information matrices. For the asymptotic distribution of the ATE estimator, we have the following proposition, stated in terms of Equation (9). Let G_1 and G_0 denote the expectations of the derivatives of g_1(x, θ) and g_0(x, θ) with respect to θ. Under the usual regularity conditions for a maximum likelihood estimator, the asymptotic distribution of the ATE estimator is as follows.

Proof. See Appendix A.
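The plug-in ATE estimator described above can be sketched in a few lines. The coefficient vectors and the quadratic design row (1, x, x²) below are hypothetical placeholders standing in for the composite estimates from Equation (9), not values from the paper:

```python
import numpy as np

# Hypothetical fitted coefficients for the two regimes' exponential means on
# the design row (1, x, x^2); placeholders, not estimates from the paper.
theta1 = np.array([0.20, 0.050, 0.0010])
theta0 = np.array([0.10, 0.020, 0.0005])

x = np.linspace(15.0, 50.0, 1000)  # covariate support as in the simulations
X = np.column_stack([np.ones_like(x), x, x**2])

# Plug-in ATE: average the difference of the two exponential predictions
# over the observed covariate values x_i.
ate_hat = np.mean(np.exp(X @ theta1) - np.exp(X @ theta0))
print(ate_hat)
```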
We have seen that the above identification relies on the normality of v. Is there any way to relax the normality assumption while keeping the linear-index binary choice setting in (2)? The estimating equations given in Lemma 1 can be estimated semiparametrically. For E[exp{(σ(u_a, v) + xσ(u_b, v))v} | z, w = 1], where w = 1 is equivalent to v > −zγ, the expectation is conditional on z and zγ, and it can be estimated semiparametrically in terms of these variables using two-step series or spline methods (Newey 2009). To identify the ATE, however, the expectation conditional on x alone has to be found, as in Equation (9). The fact that w = 1 is not in the conditioning set implies that v is integrated over the entire real line. The corresponding value of zγ would have to be infinite, for which no data are available. It becomes clear from this that the role of the normality assumption is to supply information for such an extreme region of the domain. Although Blundell and Powell (2004) provide a semiparametric control function approach, their method is not applicable when the endogenous explanatory variable, the binary regime indicator in this case, is not continuous.
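The proposition's variance formula can be illustrated numerically with a delta-method sketch, assuming the asymptotic variance combines the gradient term (G_1 − G_0)′V(G_1 − G_0) with the cross-sectional variance of g_1 − g_0 additively. All numbers below are placeholders, not estimates from the paper:

```python
import numpy as np

# Hypothetical inputs: avar of theta_hat and the expected gradients of the
# two regime functions g1 and g0 with respect to theta.
V = np.array([[0.04, 0.01],
              [0.01, 0.09]])
G1 = np.array([1.2, 0.8])
G0 = np.array([0.7, 0.3])
n = 5000
var_T = 2.5  # placeholder var of g1(x_i) - g0(x_i) across the covariates

d = G1 - G0
avar = d @ V @ d + var_T       # per-observation asymptotic variance of the ATE
se_ate = np.sqrt(avar / n)     # standard error of the sample-average estimator
print(se_ate)
```

The var_T term plays the role of the extra covariate-driven component discussed in Section 6, which inflates the ATE standard errors relative to the other parameters.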

Tests for Endogeneity of Treatment
In the previous section, we derived the estimating Equation (7) and the ATE conditional on covariates, (9). No restrictions were imposed in these equations, and all the coefficients were allowed to be correlated with the treatment. Estimating the unrestricted model, in which the number of identifiable composite parameters is no less than (K + 4)(K + 1)/2, may cause difficulty in numerical optimization. We may be able to estimate a simpler model if some of the random coefficients are uncorrelated. The Lagrange multiplier (LM) test can be used for this purpose. The variable addition test (VAT) is also available; it is asymptotically equivalent to the LM test but easier to apply. This test constructs a conditional mean by adding appropriately defined variables, creating a likelihood function whose score under the restriction is the same as the one used in the LM test (for a detailed proof, see Wooldridge 2014). In practice, the test is performed as a Wald test of the significance of the coefficients on the added variables. Consider the estimating Equation (9): inside the exponential function, we have each covariate, the square of each covariate and their cross products. Let the vector of these functions of covariates be denoted by x̃, and rewrite the estimating equation for Regime 1 accordingly. As already mentioned, θ_2 is a scalar, θ_3 is a K × 1 vector and θ_1 is a (K + 2)(K + 1)/2 × 1 vector. The estimating equation for Regime 0 is the same as above, except for the negative sign inside Φ(·) in the correction function.
To perform the VAT, we need to construct a conditional mean function whose score is identical to the one from Equation (10) under the restriction. Let the restriction, or null hypothesis, H_0, be that there are no correlations between the selection error v and any of the random coefficients in the structural equations. The LM test can then be applied without estimating the complicated unrestricted model. The VAT facilitates this LM test through an auxiliary regression for Regime 1, where λ(·) is the inverse Mills ratio; the auxiliary regression for Regime 0 has −zγ instead of zγ. It is easily verified that the likelihood and score functions derived from (11) under the restrictions are the same as those from (10). Therefore, score tests of the significance of the coefficients from the auxiliary regression are equivalent to the LM tests based on (10). Thus, Wald tests for θ_2 and θ_3 in (11) give a simple and asymptotically equivalent way to test the null without estimating the unrestricted model. In the actual test, λ(zγ̂) has to be estimated through the first-stage regression.
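Once the auxiliary regression has been estimated, the Wald test on the added-variable coefficients is a standard quadratic form. A minimal sketch with K = 1, using placeholder estimates and a placeholder covariance matrix (the real ones would come from the auxiliary Poisson regression with the inverse-Mills-ratio terms):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical estimates from the auxiliary regression, ordered as
# [const, x, lambda, x * lambda]; the last two are the added variables.
theta_hat = np.array([0.90, 0.30, -0.12, 0.04])
V = np.diag([0.0100, 0.0040, 0.0025, 0.0009])  # placeholder covariance

# Restriction matrix selecting the coefficients on the added variables.
R = np.zeros((2, 4))
R[0, 2] = 1.0
R[1, 3] = 1.0

r = R @ theta_hat
W = r @ np.linalg.solve(R @ V @ R.T, r)  # Wald statistic, chi2(2) under H0
pval = 1.0 - chi2.cdf(W, df=2)
print(W, pval)
```

Rejecting H_0 here would indicate that at least one random coefficient is correlated with the selection error, so the unrestricted CRC model should be estimated.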

Model Selection Test
What we have considered above is whether or not the random coefficients are correlated with the selection error. Even when the null hypothesis is not rejected, it does not warrant reverting to the exponential ES model of Terza (2009), in which all the coefficients are constant. Both the null and alternative hypotheses in the VAT above assume the presence of random coefficients.
One may want to run a test for model selection between the current model and the one with constant coefficients. In the latter model, all the individual heterogeneity lies in the additive error. Since u_{b_j} = 0, all the variance and covariance terms involving u_b vanish. From Equation (7), the null hypothesis is the zero restriction on the coefficients of x_j² and of the interactions x_j x_r inside the exponential functions, and of x_j inside the correction functions.

Monte Carlo Simulations
There are three main purposes to the simulations. First, we compare three ATE estimators: the exponential endogenous switching regression with CRC (E-CRC, the main topic of this article), the one without CRC (E-nCRC), and the linear version of the endogenous switching regression with CRC (L-CRC) proposed by Wooldridge (2015). This shows the advantage of the E-CRC model for estimating the ATE. Second, the simulations assess the impact of violating the probit assumption on which the estimating equation of the previous section was derived. Third, they check the reliability of the proposed estimator when the conditional mean is linear rather than exponential. In the following simulations, E-CRC and E-nCRC are estimated by the two-step Poisson quasi-maximum likelihood method.

Data-Generating Processes
The following data-generating processes (DGPs) are inspired by the Oregon experiment discussed in the next section; we designed them to resemble its data set. DGPs 1 and 2 are based on the following specifications: x ∼ uniform[15, 50] and z equals 0 or 1 with probability 1/2 each. Additionally, the outcomes are generated as follows, where y_1 and y_0 follow Poisson distributions with the means specified above. The above DGP implies that the ATE equals one.
There are two sessions of simulations for this population. DGP 1 deals with the ideal case where the errors follow a multivariate normal distribution. The joint distribution of the errors (u_a0, u_b0, u_a1, u_b1, v) in DGP 1 is multivariate normal with equicorrelation for any pair of errors. The variance of v is set equal to one, and the variances of the other errors are all equal to 1/100². The coefficient on x and the error variances might seem too small at first sight; since the coefficient on x is interpreted as a semi-elasticity in exponential models, the values are chosen so that its standard deviation is 1/100, or 1%. Therefore, in Regime 1, the mean of the coefficient is 2% and its standard deviation is 1%. The values of the correlation ρ are 0.3 and 0.5. This is the case in which the proposed estimator should perform best, since the actual DGP conforms to the assumptions on which the estimating equation is based.
An important assumption for Lemma 1 is that v follows a normal distribution, which brings the normal cumulative distribution function Φ(·) into the correction terms. To check the robustness of the estimator to that assumption, DGP 2 generates errors that follow a skewed distribution. Each error is constructed as (X − 5)/√10, where X ∼ χ²(5), so that it has zero mean and unit variance; it is then divided by 100 to match the variance in DGP 1. The errors are constructed to have the specified correlations. Lastly, DGP 3 uses the same x, z and w, except that x is demeaned, i.e., x ∼ uniform[−17.5, 17.5]. The dependent variables for the two regimes follow linear equations in which the joint distribution of (u_a0, u_b0, u_a1, u_b1, v) is the same as in DGP 1 and the structural errors (ε_1, ε_0) follow independent standard normal distributions. This is to compare the relative performance of the linear and exponential models when the population is linear.
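A sketch of DGP 1 in code may be helpful. Only the Regime-1 coefficient mean of 2%, the 1% error standard deviations, var(v) = 1, the equicorrelation structure, and the distributions of x and z are taken from the text; the selection-index coefficients, the intercepts, and the Regime-0 coefficient mean are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 5000, 0.3

# Equicorrelated joint normal for (u_a0, u_b0, u_a1, u_b1, v):
# var(v) = 1 and the other variances 1/100^2, as in DGP 1.
sd = np.array([0.01, 0.01, 0.01, 0.01, 1.0])
C = np.full((5, 5), rho)
np.fill_diagonal(C, 1.0)
u = rng.multivariate_normal(np.zeros(5), np.outer(sd, sd) * C, size=n)

x = rng.uniform(15.0, 50.0, n)            # x ~ uniform[15, 50]
z = rng.integers(0, 2, n).astype(float)   # z is 0 or 1 with probability 1/2
w = (0.5 * z - 0.25 + u[:, 4] > 0).astype(float)  # hypothetical selection index

# Regime means: in Regime 1 the coefficient on x has mean 2% and sd 1%;
# the Regime-0 coefficient mean of 1% and the zero intercepts are guesses.
mu1 = np.exp((0.02 + u[:, 3]) * x + u[:, 2])
mu0 = np.exp((0.01 + u[:, 1]) * x + u[:, 0])
y = np.where(w == 1.0, rng.poisson(mu1), rng.poisson(mu0))
```

Repeating this draw and applying the estimators of Section 2 to (y, w, x, z) would reproduce the structure, though not the exact numbers, of the Monte Carlo exercise.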

Simulation Results
The simulation results for estimating the ATE under DGPs 1, 2 and 3 are given in Tables 1-3, respectively. The left and right panels of each table show the results for ρ = 0.3 and 0.5. Each panel lists and compares the three estimators for sample sizes of 5000 and 10,000. Mc. st. dev., RMSE, MAD and IR stand for Monte Carlo standard deviation, root mean squared error, mean absolute deviation and interdecile range, respectively. The E-CRC is expected to perform best when the errors follow a multivariate normal distribution. In Table 1, the biases of E-CRC are smaller than those of L-CRC, while E-CRC has larger Monte Carlo variances than L-CRC under both correlations; this suggests a trade-off between consistency and efficiency. One can also see in Table 1 that E-CRC has smaller bias but is less efficient than E-nCRC. These larger variances are simply due to the complexity of the model, and we again face the trade-off between consistency and efficiency. Table 2 shows the effect of the non-normal errors of DGP 2, which may bias the exponential models because the correction terms were derived under the normality of v. The comparison of L-CRC and E-CRC in Table 1 is echoed in Table 2: E-CRC has larger variances but smaller biases than L-CRC. However, E-CRC has both larger biases and larger variances than E-nCRC. The simpler model performs relatively better under error misspecification.
DGP 3 in Table 3 checks whether the exponential regression is generally as reliable as the linear one when the dependent variable is non-negative. The overall results in Table 3 do not reveal any substantial difference from Tables 1 and 2. The performance of the linear model under the linear DGP 3 is no better than that of the exponential model in terms of bias: the biases of the exponential models are in most cases smaller than those of the linear models. The larger variances of the exponential models are also found here, as in DGPs 1 and 2. The true motivation for the DGP 3 simulations is the possibility of an incorrect functional form and its impact on the ATE estimation; however, no such adverse effect of the exponential regression was found in the simulations. The exponential CRC model is not necessarily worse than the linear model even under a linear population. The CRCs generate quadratic terms in the estimating equations. For instance, a linear function a + bx can be well fitted by an exponential function with quadratic terms, because ln(a + bx) can be well approximated by a quadratic. Thus, the quadratic terms, which would not be present in a non-CRC setting, help make the model more robust than the linear model. We therefore expect that the exponential CRC model may be well suited for prediction as well as for ATE estimation.

An Application: Oregon's Health Insurance Experiment
One of the motivations of expanding Medicaid to those currently not qualified is that it can shift treatment from costly emergency department care to more effective primary care. Health insurance is presumed to reduce the use of emergency departments by encouraging patients to use preventive care. A way to learn about the effect of expanding Medicaid would be to randomly offer coverage to those without it and to compare the outcomes between the treatment and the control groups.
The state of Oregon initiated such an experiment in 2008 through a publicly announced health insurance policy (Finkelstein et al. 2012; Baicker et al. 2004; Taubman et al. 2014). Oregon initially allowed residents without health insurance to register for a lottery, which selected winners and losers. Only the winners were then given a chance to apply for the Oregon version of Medicaid, known as the Oregon Health Plan (OHP). OHP coverage was not automatic for the winners, however: it was provided only to applicants who were US citizens or legal immigrants, aged 19-64, and who satisfied some additional criteria. These selected people had not previously been eligible for Medicaid based on the federal poverty income level and had been without insurance for at least the previous six months.
We are mostly interested in the effect of health insurance on the number of emergency room (ER) visits. Taubman et al. (2014) use the following regression model, where i denotes an individual and h denotes a household; there may be multiple individuals from the same household in the data set. vafter, vbefore and famsize are the numbers of ER visits after and before the OHP and the number of family members in the household. The instrument is binary and indicates whether an individual is a lottery winner. If an individual obtains OHP coverage, OHP = 1, and he or she is part of Regime 1; if not, part of Regime 0.
There are a couple of issues with Taubman et al. (2014). First, they implicitly assume that vbefore is exogenous. Although potentially inaccurate, Keay (2019) and Fourkan et al. (2021) find, using a copula regression in a slightly different setting, that the endogeneity of vbefore is of little concern. Second, they run 2SLS without paying enough attention to the fact that vafter and vbefore are censored at 22 and 17, respectively. Developing a censored-data version of the model is beyond the scope of the current research; here, we alleviate the issue by excluding observations with censored covariates. Reducing the support of the covariates does not affect consistency in general; better still, it removes a source of bias by eliminating incorrect covariate values. The observations at the censoring point of 17 must be dropped. In the data set, 1.2% of observations have a vbefore of over 10, and a few such large values either make the maximum likelihood method unstable or cause enormous standard errors. For a reliable comparison, I consider the subsamples with vbefore ≤ 9, 7 and 5, which make up 98.8%, 98.0% and 96.2% of the original data set and contain 36%, 29% and 21% of the original number of y-censored points, respectively.
We ran regressions of (17) and its exponential regime-switching versions. The dependent variable is a count, so exponential regression models such as E-nCRC and E-CRC are natural alternatives to the conventional linear models. In particular, E-CRC allows for a random coefficient on vbefore. For a person who is more willing to take up the insurance (measured by a higher v), current ER visiting behaviour may differ from that in the past. This may weaken the link between current and past visits, pulling the coefficient b^1_vbefore closer to zero; the covariance between them can then be negative. If the covariance is non-zero, the ATE estimator without CRC can be biased, which is why the E-CRC model is estimated. Table 4 shows the descriptive statistics for the variables in the data set. Table 5 compares the OLS, 2SLS, E-nCRC and E-CRC estimates. Panels A, B and C display the results for the subsamples with vbefore ≤ 9, 7 and 5. The row OHP (ATE) displays the coefficient estimates on OHP from OLS and 2SLS and the ATE estimates from E-nCRC and E-CRC. R1 and R0 stand for the regimes with OHP = 1 and 0, respectively; the covariate coefficients for OLS and 2SLS are all displayed under R1. b^1_vbefore is the random coefficient on vbefore, where the superscript and subscript indicate the regime and the covariate. The 2SLS estimates, which are weighted averages of the LATEs for each covariate (Angrist and Imbens 1994), are all greater than the OLS estimates across all the subsamples, and they all indicate that the effect of insurance is significantly positive. In Panels B and C, the ES estimators give ATEs not far from the LATE. Their standard errors are larger because the model is more complex; the ATE estimates are insignificant in all the samples. The 2SLS shows the treatment effect only for compliers, i.e., those who get insurance because they are eligible.
This subpopulation excludes the always-takers and never-takers. In the data, there are at least 2200 always-takers, who got insurance although they were not eligible, and at least 5897 never-takers, who did not get insurance although they were eligible; together, they make up about one-third of the observations in the data set. The 2SLS estimate says that the effect of the insurance is positive only for the compliers. E-nCRC and E-CRC indicate that the average treatment effect for the whole population is not statistically different from zero. This might be because the positive effect for the compliers is diluted by negative effects for the never- and always-takers. Another reason is that the standard errors are larger: the ATE estimates, unlike the other parameters, are affected by the randomness of the covariates, and the additional term E[T]² in the asymptotic variance of the ATE increases the standard errors.
The estimation results show that the ATE (or the coefficient on OHP) estimates are smaller when those with unusually large vbefore are eliminated. Although not significant, the ATE estimate for E-CRC in Panel C is even negative. In the data, 2.6% of the individuals visited the ER more than five times, and dropping them changes the magnitude of the IV estimates by more than 10%. This indicates that Taubman et al. (2014)'s results are exaggerated by a handful of frequent visitors who may not come from the common distribution. The t-statistics are also lower in the smaller samples, indicating that the evidence for a positive effect of the OHP is weaker there.
The presence of CRC is detected: the correlation between the coefficient b^1_vbefore and the selection error v is significantly negative. The results show the presence of CRC only on vbefore, and particularly in Regime 1. The CRC on vbefore is not taken into account by the other models and might have caused biases. Some of those who changed their ER visiting behaviour after obtaining insurance ended up visiting less, which could have helped reduce the ATE estimates. Note: OHP shows the OLS and 2SLS estimates of the coefficient on OHP, and the ATE estimates of E-nCRC and E-CRC. cov(b^1_vbefore, v) and cov(b^1_famsize, v) indicate the covariances between the selection error and the coefficients on vbefore and famsize in Regime 1. The numbers report the estimates with standard errors in parentheses. *, ** and *** indicate significance at the 10%, 5% and 1% levels, respectively. Panels A, B and C display the results for the subsamples with vbefore ≤ 9, 7 and 5, which make up 98.8%, 98.0% and 96.2% of the original data set.

Conclusions
We have considered an endogenous switching regression for a count dependent variable in which a random coefficient is correlated with the switching variable. The count dependent variable requires non-linear modelling via the exponential conditional mean function. Although the structural parameters cannot all be identified, identification of the ATE has been achieved. The simulations show that this ATE estimator, based on the non-linear two-regime CRC model, performs reasonably well in larger samples. We also find that the exponential model performs well as long as the dependent variable is non-negative, even when the population conditional mean is linear. The empirical application has shown that the CRC model allows us to study the nature of self-selection more fully. Relaxing the probit assumption would be an obvious extension: by introducing the first-stage equation in a more general form, it would be possible to relax the distributional assumption and the homogeneity of the instrument effects at the same time.
where the third equality comes from v ⊥⊥ e_a + x e_b | z, w, which is implied by Assumption 3(ii), and the fourth from Assumption 3(i). The last equality is shown by Terza (1998). The argument of the exponential function in the last line can be written as follows, where σ²(·, ·) is a squared covariance. Now consider the last term in the last line. Since e_a + x e_b is a sum of multiple random variables, it can be thought of as approximately normally distributed. The results on the log-normal distribution can then be applied.³ Collecting these terms along with the second term above, we obtain the stated result.
Proof for Lemma 2. Note that

E(u_a + x u_b | x) = E(u_a | x) + E(x u_b | x) = 0,
var(u_a + x u_b | x) = var(u_a | x) + var(x u_b | x) + 2 cov(u_a, x u_b | x).

For any fixed value of x, the distribution of u_a + x u_b is approximately normal by the central limit theorem. Since the mean and variance of the approximating normal distribution are already obtained, the mean of the corresponding log-normal variable follows immediately.
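The log-normal moment invoked in the last step can be written out explicitly; this is the standard result for the mean of a log-normal variable, stated here in the paper's notation with the zero conditional mean from above:

```latex
% If w \mid x \sim N(\mu, \sigma^2) (approximately), then E[e^w \mid x] = e^{\mu + \sigma^2/2}.
\begin{aligned}
E\!\left[e^{u_a + x u_b} \mid x\right]
  &\approx \exp\!\Big(\underbrace{E(u_a + x u_b \mid x)}_{=\,0}
      + \tfrac{1}{2}\,\operatorname{var}(u_a + x u_b \mid x)\Big) \\
  &= \exp\!\Big(\tfrac{1}{2}\big[\operatorname{var}(u_a \mid x)
      + \operatorname{var}(x u_b \mid x)
      + 2\operatorname{cov}(u_a,\, x u_b \mid x)\big]\Big).
\end{aligned}
```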
Notes 1 One can use the copula method to generate a multivariate distribution whose marginals are scaled χ²(5) distributions. However, the copula method was not used here, since we only want pairs of errors with some predetermined correlation without further specifying the dependence structure. Dependence can also be created by using some common standard normal random variables from which two different chi-squared random variables are constructed. We obtained the desired correlation by changing the "parameters" through trial and error. Codes will be provided upon request.
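As an illustration, the shared-normals construction described in this note can be sketched as follows. The variable names and the number of shared normals are illustrative (not from the paper's code); the `shared` count plays the role of the "parameter" tuned by trial and error, since for χ²(5) marginals the implied correlation is shared/5:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
df = 5       # degrees of freedom of each chi-squared error
shared = 3   # number of underlying normals common to both errors

# Underlying standard normals: `shared` common draws plus
# independent draws completing each chi-squared sum.
z_common = rng.standard_normal((n, shared))
z1 = rng.standard_normal((n, df - shared))
z2 = rng.standard_normal((n, df - shared))

# Each error is chi-squared(5): the sum of 5 squared N(0,1) draws,
# `shared` of which are common to both, inducing positive correlation.
e1 = (np.concatenate([z_common, z1], axis=1) ** 2).sum(axis=1)
e2 = (np.concatenate([z_common, z2], axis=1) ** 2).sum(axis=1)

corr = np.corrcoef(e1, e2)[0, 1]  # approximately shared/df = 0.6
```

Each χ²(1) component has variance 2, so cov(e1, e2) = 2·shared and var(e1) = var(e2) = 2·df, giving corr = shared/df; non-integer targets would require the trial-and-error scaling the note mentions.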

2 The intercept values are sufficiently large to prevent negative dependent variables. In 10,000 replications, there are at most one or two negative dependent variables; they were dropped and the effect is negligible.

3 Terza (2009) assumes the joint normality of (v, u_a, u_b). In contrast, my model does not assume joint normality. By linear projection, u_a = ρv + e_a, where e_a is orthogonal to (or independent of) v. The only requirement is the normality of e_a and e_b. Moreover, the CLT makes the approximation more accurate as the number of covariates increases. My model presupposes the existence of covariates with CRC; therefore, the more appropriate the model is, the more accurate the approximation is. Finally, according to the simulation results for χ² errors in Tables 1 and 2, in large samples the performance of the CRC model is better than that of the linear model. So, again, when the model is appropriate, it outperforms the preexisting linear method.