Response-Based Sampling for Binary Choice Models With Sample Selection

Sample selection models attempt to correct for non-randomly selected data in a two-model hierarchy where, on the first level, a binary selection equation determines whether a particular observation will be available for the second level (outcome equation). If the non-random selection mechanism induced by the selection equation is ignored, the coefficient estimates in the outcome equation may be severely biased. When the selection mechanism leads to many censored observations, few data are available for the estimation of the outcome equation parameters, giving rise to computational difficulties. In this context, the main reference is Greene (2008), who extends the results obtained by Manski and Lerman (1977) and develops an estimator which requires knowledge of the true proportion of occurrences in the outcome equation. We develop a method that exploits the advantages of response-based sampling schemes in the context of binary response models with sample selection, relaxing this assumption. Estimation is based on a weighted version of Heckman's likelihood, where the weights take into account the sampling design. In a simulation study, we found that, for the outcome equation, the results obtained with our estimator are comparable to Greene's in terms of mean square error. Moreover, in a real data application, it is preferable in terms of the percentage of correct predictions.


Introduction
Most empirical work in the social sciences is based on observational data that are incomplete. There are many types of selection mechanisms that result in a non-random sample. Some of them are due to the sample design, while others depend on the behavior of the units being sampled, such as non-response or attrition.
In the first case, data are usually missing for all the variables of interest; for example, in estimating a saving function for all the families of a given country, a bias would arise if only families whose household head shows certain characteristics were sampled. However, when the causes of missingness are suitably exogenous, using a sub-sample has no serious consequences. In the second case, instead, there is a self-selection of the sample units, and data availability on a key variable depends on the behavior of the units for another variable. The classical example is the estimation of the wage offer equation for people of working age, where we want to estimate the expected wage of an individual using a set of exogenous characteristics (gender, age, education, etc.). This equation, by definition, should be valid for people of working age, independently of their working conditions at the time of the survey. On the contrary, we can only observe the wage offer for employed individuals; in regressing wages on their characteristics, we are not making inferences for the population as a whole. In other words, people in employment are a selected sample of the population and their wages are higher than those the unemployed would have received. Hence, the results will tend to be biased (sample selection bias). To avoid this bias, we should take into account the selection mechanism due to the individual's decision to take a job and then receive a wage.
As is well known, Heckman (1979) proposed a useful framework for handling estimation when the sample is subject to a selection mechanism. In the original framework, the dependent variable in the outcome equation (the wage equation in the above example) is continuous and can be explained by a linear regression model with a normal random component. In addition to the outcome equation, a selection equation describes the selection rule by means of a binary choice model (probit). The original Heckman model was extended in many directions and a survey would be beyond the scope of this paper, but the interested reader can refer to Vella (1998) and Lee (2003). For our purposes, the relevant framework is the one where both the outcome and the selection equations are defined as binary choice models (Dubin and Rivers 1989). The applications are countless and span every field of applied research: from political science (Dubin and Rivers 1989; Grier et al. 1994; Jacobs and O'Brien 1998) to health economics (Van de Ven and Van Praag 1981), transport (Ingram 1999; Kayser 2000), and credit scoring (Boyes et al. 1989; Greene 2008), to give a very brief and non-exhaustive list.
In our work, we focus on the problem that arises when the selection mechanism is particularly severe and gives rise to a large amount of censored observations. This situation might occur for example when the event of interest is infrequent or fragmented or no frames are available for standard sampling procedures. Relevant examples are the surveys on elusive populations (such as working children, homeless persons, illegal immigrants, tax evaders, and drug users); in these populations, by virtue of their characteristics or difficulties in obtaining the required information, adequate samples cannot be defined, drawn or implemented using the standard procedures of random sampling. In this case, no or very partial frames are available for sampling, or many units are not available or willing to participate in the survey. Consequently, random sampling is either not feasible or inefficient, because it is very costly (due to the high number of total units to be sampled to obtain a sufficient number of uncensored observations).
One solution is given by the response-based sampling scheme, also known as case-control setting. In this design (see Hosmer and Lemeshow 2013), samples of a fixed size are randomly chosen from the two strata identified by the dependent variable.
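As a concrete illustration, a response-based (case-control) draw of fixed size from each stratum can be sketched as follows; this is a minimal sketch, and the population structure and the key `"A"` are hypothetical names of ours, not from the paper.

```python
import random

def case_control_sample(population, n_cases, n_controls, key="A"):
    """Draw fixed numbers of units at random from the two strata
    identified by the binary variable stored under `key`."""
    cases = [u for u in population if u[key] == 1]
    controls = [u for u in population if u[key] == 0]
    return random.sample(cases, n_cases) + random.sample(controls, n_controls)

# Toy population in which roughly 5% of units are cases.
random.seed(0)
pop = [{"id": i, "A": int(random.random() < 0.05)} for i in range(10_000)]
sample = case_control_sample(pop, n_cases=100, n_controls=100)
```

Unlike simple random sampling, the number of cases in the sample is fixed by design, which is what makes the scheme attractive when cases are rare.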
In the more specific context of binary choice models with response-based sampling and sample selection bias, a solution is given by Greene (2008), who extended the work of Manski and Lerman (1977). In both proposals, however, the estimator requires the knowledge of the true proportion of occurrences in the outcome equation. In some applications, this requirement can be a serious limitation.
Our work fits into this last framework and aims to overcome these shortcomings. The remainder of the paper is organized as follows. In Section 2, we start by describing the methodological background and then we illustrate the proposed estimation procedure. In Section 3, a simulation study compares our approach with Greene's. In Section 4, the proposed method is applied to real data on credit scoring. In Section 5, a discussion and some concluding remarks are provided.

Sample Selection
Let us first introduce some notation and briefly illustrate the sample selection framework with a binary choice model for both the selection and the outcome equations (Dubin and Rivers 1989).
Let Y*_i and A*_i be two latent (unobservable) variables characterizing the outcome and the selection equations, respectively. The model, in its general form, is:

Y*_i = X_1i β + ε_1i    (1a)
A*_i = X_2i θ + ε_2i    (1b)

where X_i = (X_1i, X_2i) is a vector of exogenous variables (namely, X_1i for Y_i and X_2i for A_i), containing all the relevant covariates, and β and θ are the vectors of regression coefficients. Let us define Y_i and A_i as two observable variables such that:

Y_i = 1 if Y*_i > 0, and Y_i = 0 otherwise    (2)
A_i = 1 if A*_i > 0, and A_i = 0 otherwise, with Y_i observed only if A_i = 1    (3)

The p.d.f. of Y_i and A_i is Bernoulli, with probability of success depending on the parameters β and θ, respectively.
Model (3) defines the mechanism which governs the censoring process: we can observe Y i if and only if A i = 1. On the contrary, if A i = 0, Y i will be missing.
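To make the censoring rule concrete, the following sketch simulates the two-equation latent-variable model with correlated disturbances and records Y_i only for uncensored units. The coefficient values, the correlation, and all function names are illustrative choices of ours, not taken from the paper.

```python
import math
import random

random.seed(1)
rho = 0.5                      # corr(eps1, eps2); illustrative value
beta0, beta1 = 0.5, 1.0        # outcome-equation coefficients (illustrative)
theta0, theta1 = -1.0, 1.0     # selection-equation coefficients (illustrative)

def draw_unit():
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    e1 = random.gauss(0, 1)
    # Build eps2 correlated with eps1: e2 = rho*e1 + sqrt(1 - rho^2)*z
    e2 = rho * e1 + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
    y_star = beta0 + beta1 * x1 + e1      # latent outcome (cf. (1a))
    a_star = theta0 + theta1 * x2 + e2    # latent selection (cf. (1b))
    a = int(a_star > 0)
    y = int(y_star > 0) if a == 1 else None   # Y observed only when A = 1
    return {"x1": x1, "x2": x2, "A": a, "Y": y}

units = [draw_unit() for _ in range(5000)]
```

In the generated data, every censored unit (A = 0) has a missing Y, exactly the pattern the estimation method must cope with.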
In the general case, if we were to estimate the parameters of Equation (1a) without considering the selection process in (1b), a bias would arise. This is because the processes represented by the two equations are related, i.e., corr(ε_1, ε_2) = ρ is not null (see, for example, Cameron and Trivedi (2005) for further details).
The likelihood function for model (1a)-(1b) is:

L(η) = ∏_{i: a_i=0} [1 − Aπ(X_2i)] ∏_{i: a_i=1} Aπ(X_2i) Yπ(X_1i)^{y_i} [1 − Yπ(X_1i)]^{1−y_i}    (4)

where η = (β, θ, ρ) is the vector of parameters to be estimated, y_i = 0, 1, the function Aπ(·) gives the probability that an observation is uncensored, and Yπ(·) gives the probability that Y_i = 1 given that the observation is uncensored.

Response-Based Sampling
Let us now briefly review the main characteristics of the response-based sampling scheme relevant to our method. In this context, samples of fixed size are randomly chosen from the two strata identified by the dependent variable A (note that, for our purposes, in the context of sample selection, it corresponds to the dependent variable of the selection equation). In particular, n_A units are drawn at random from the N_A cases and n_Ā units from the N_Ā controls.
The likelihood function is the product of the stratum-specific likelihoods, and depends on the probability that the individual is in the sample and on the joint density of the covariates (Hosmer and Lemeshow 2013):

L = ∏_{i=1}^{n} P(Y_i = y_i, A_i = a_i | X_i, S_i = 1) f(X_i | S_i = 1)    (5)

where S_i is a binary variable which takes value 1 if the i-th individual is in the sample and 0 otherwise.

The Weighted Endogenous Sampling Likelihood
The estimator, first proposed by Manski and Lerman (1977), is designed for a binary response model under a response-based sampling framework and is called the Weighted Endogenous Sampling Maximum Likelihood (WESML) estimator, because it assigns weights to the likelihood function. The weights are given by the ratio between the proportion of individuals in the population for which A = 1 and the corresponding proportion in the sample. Greene (2008) extended the work of Manski and Lerman to the context of a binary response model with selection bias and response-based sampling; in that work, the goal was to estimate the probability of loan default P(Y = 1|X) from a sample of individuals whose credit card application was accepted (A = 1).

The WESML Estimator Corrected for Selection Bias
The corresponding likelihood is:

ln L_W = Σ_{i: a_i=0} w_Ā ln[1 − Aπ(X_2i)] + Σ_{i: a_i=1} w_{y_i A} ln{Aπ(X_2i) Yπ(X_1i)^{y_i} [1 − Yπ(X_1i)]^{1−y_i}},
with w_Ā = (N_Ā/N)/(n_Ā/n) and w_{y_i A} = (N_{y_i A}/N)/(n_{y_i A}/n),

where again y_i = 0, 1 and, according to Manski and Lerman, the weights are given by the ratio of two proportions: population-level quantities in the numerator and the corresponding sample quantities in the denominator. More precisely, in Greene's application, N_Ā/N represents the fraction of non-cardholders in the population and n_Ā/n is its analogue in the sample; coherently, N_{yA}/N is the prevalence of defaults (y = 1) and non-defaults (y = 0) in the population, while n_{yA}/n is the sample counterpart.
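In code, the weight construction described above can be sketched as follows; the function name and the counts in the example are ours, chosen only for illustration.

```python
def wesml_weights(N, N_Abar, N_1A, N_0A, n, n_Abar, n_1A, n_0A):
    """WESML-style weights: population share over sample share for each
    stratum (censored units; uncensored units with Y = 1 and Y = 0).
    Note that the population counts N_1A and N_0A must be known."""
    w_Abar = (N_Abar / N) / (n_Abar / n)
    w_1A = (N_1A / N) / (n_1A / n)
    w_0A = (N_0A / N) / (n_0A / n)
    return w_Abar, w_1A, w_0A

# Hypothetical counts: 90% of the population is censored, but the
# response-based sample contains censored and uncensored units in equal parts.
w_Abar, w_1A, w_0A = wesml_weights(
    N=1_000_000, N_Abar=900_000, N_1A=20_000, N_0A=80_000,
    n=2_000, n_Abar=1_000, n_1A=200, n_0A=800)
```

With these counts, censored units are under-represented in the sample and receive a weight above one, while uncensored units are over-represented and are down-weighted.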
It is important to note that Greene's estimator (GE) requires knowledge of the population proportion not only for the controls (i.e., N_Ā/N), but also for the response variable (namely, N_1A/N and N_0A/N). This requirement can be an insurmountable obstacle in some applications.

The Sample-Selection Response-Based-Sampling Likelihood
In the following, we provide our main result, which is the likelihood function in the framework of interest, i.e., a sample selection mechanism with a severe censoring process, assuming that the population prevalences N_0A/N_A and N_1A/N_A are unknown. The full proof is given in the Appendix. We make the following very general and non-restrictive assumptions: 1. We have a set of fully informative and exogenous covariates X_i = (X_1i, X_2i). 2. Conditional on the covariates, the probability that an observation is uncensored does not depend on its value, i.e., P(A_i = 1 | X_i, Y_i) = P(A_i = 1 | X_i). 3. The set of covariates X_1i, specific for Y_i, and the set X_2i, specific for A_i, may have common elements but they cannot fully overlap. In particular, we assume that there is at least one covariate in the selection equation which is not in the outcome equation. Assumption 1 means that no correlation exists between the covariates and the error terms in Equations (1a) and (1b). Assumption 2 is justified because, as the covariates are informative, all the information brought by Y_i is contained in X_i. Assumption 3 is necessary for parameter identification (exclusion condition).
Under the conditions stated, the likelihood function for a binary-choice model with sample-selection response-based sampling is:

L(η) ∝ ∏_{i: a_i=0} (N/N_Ā) [1 − Aπ(X_2i)] ∏_{i: a_i=1} (n_A N)/(N_A n_{y_i A}) Aπ(X_2i) Yπ(X_1i)^{y_i} [1 − Yπ(X_1i)]^{1−y_i} ∏_{i=1}^{n} f(X_i | S_i = 1)    (6)

where Yπ(X_1i) is the probability of observing Y = 1 given that the observation is uncensored and n_{yA} is the number of uncensored units in the sample having Y = y, with y = 0, 1. Moreover, as previously said, Aπ(X_2i) is the probability that an observation is uncensored, n_A is the number of units sampled from the N_A uncensored observations and n_Ā is the number of units sampled from the N_Ā censored observations. It is easy to see that the likelihood (6) is a weighted version of (4), where the weights simply take into account the sampling scheme. In addition, note that, in the maximization process, the term f(X_i | S_i = 1) is not influential, as it does not contain any information on the vector of parameters η, and that the only quantities required at the population level are N_A and N.
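One way to evaluate such a weighted likelihood numerically is sketched below. This is our reading of the Appendix substitutions, with hypothetical function names (`a_prob`, `y_prob` stand for working models of Aπ(·) and Yπ(·)), and it drops the f(X_i|S_i = 1) factor, which carries no information on η.

```python
import math

def ssrs_loglik(units, a_prob, y_prob, n_A, n_1A, n_0A, N_A, N):
    """Log of the weighted likelihood, omitting the f(X|S=1) factor.
    a_prob(u) and y_prob(u) return P(A=1|X) and P(Y=1|A=1,X) under the
    working model for unit u."""
    N_Abar = N - N_A        # only N_A and N are needed at population level
    ll = 0.0
    for u in units:
        if u["A"] == 0:
            # Censored contribution: weight N/N_Abar times 1 - Api(X)
            ll += math.log(N / N_Abar) + math.log(1.0 - a_prob(u))
        else:
            # Uncensored contribution: weight n_A*N/(N_A*n_yA) times the
            # Bernoulli term Api(X) * Ypi(X)^y * (1 - Ypi(X))^(1-y)
            n_yA = n_1A if u["Y"] == 1 else n_0A
            p_y = y_prob(u) if u["Y"] == 1 else 1.0 - y_prob(u)
            ll += math.log(n_A * N / (N_A * n_yA)) + math.log(a_prob(u) * p_y)
    return ll

# Tiny example with constant working probabilities, just to exercise the code.
units = [{"A": 0, "Y": None}, {"A": 1, "Y": 1}]
ll = ssrs_loglik(units, a_prob=lambda u: 0.3, y_prob=lambda u: 0.6,
                 n_A=1, n_1A=1, n_0A=1, N_A=100, N=1000)
```

In practice, `a_prob` and `y_prob` would depend on the parameters to be estimated, and this function would be passed to a numerical optimizer.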
The estimator for η obtained by maximizing this likelihood will be referred to as Sample Selection Response-based Sampling (SSRS hereafter).

Simulation Results
In this section, we will compare the results obtained using the SSRS and GE estimators through a Monte Carlo experiment.
The generating model consists of two latent-variable equations of the form (1a) and (1b) (Equations (7a) and (7b)), where all Xs are independently generated from a univariate standard normal. Note that, in Equation (7b), the intercept θ_0 governs the proportion of uncensored observations in the population (i.e., N_A/N). In particular, in the following, we consider three (approximate) proportions: 4% (ensured by θ_0 = −2), 15% (θ_0 = −1.43) and 30% (θ_0 = −0.72). Even though in the derivation of Equation (6) we did not assume any probability model for Yπ(X_1i) and Aπ(X_2i), we had to specify one for the Monte Carlo experiment. More precisely, we set the disturbances (ε_1i, ε_2i) to be standard bivariate normal with correlation ρ (Equation (8)). From Equation (8) and referring to the probabilities Aπ(X_2) and Yπ(X_1) in (6), it follows:

Aπ(X_2i) = Φ(X_2i θ)    (9)
P(Y_i = 1, A_i = 1 | X_i) = Φ_2(X_1i β, X_2i θ; ρ)    (10)
Yπ(X_1i) = Φ_2(X_1i β, X_2i θ; ρ) / Φ(X_2i θ)    (11)

where Φ and Φ_2 are the c.d.f.s of the univariate and the bivariate normal, respectively. For each censoring scheme (N_A/N), we performed 500 replications for each ρ ∈ [−0.8, 0.8] with a step of 0.1 and for three different sampling proportions of cases and controls (i.e., pc = n_A/n ∈ {0.25, 0.5, 0.75}). From a population of size N = 1,000,000, we drew samples of different sizes to evaluate the performance of the two estimators. More precisely, we drew samples of size ranging from n = 2000 to n = 10,000 in steps of 500.
In commenting on the results, we begin by comparing the empirical densities of the SSRS and GE estimators for n = 2000 and n = 10,000, for each censoring scheme. Rather than overwhelm the reader with statistics, we prefer to present the information in summary graphical form.
As regards β_1, for N_A/N = 0.04 (see Figure 1), we note that the patterns are similar for pc = 0.25, whereas the behavior differs more as pc increases. Furthermore, the SSRS estimator is almost always more concentrated, except for a few combinations of ρ and pc.
As expected, when N_A/N increases, the discrepancies between the SSRS and GE estimators fade away (see Figure 2).
Analogous considerations apply for β_2 (see Figures 3 and 4). In summary, Figures 1-4 show that the SSRS and GE estimators have comparable behavior, even though the former uses less information at the population level. It should be underlined that a direct comparison between the two sets of coefficient estimates may not always be appropriate. In fact, as noted by Mroz and Zayats (2008), a better comparison would be based on the relative effect, that is, on the ratio of the coefficients. For this reason, we computed the average ratios θ̂_2/θ̂_1 and β̂_2/β̂_1 in the simulations, for each combination of sample size, pc and N_A/N. The results, available upon request, are very close to the true ratios, θ_2/θ_1 = −0.63 and β_2/β_1 = −1.2, for both the GE and SSRS estimators, and they show once again that GE slightly outperforms SSRS in the selection equation, while the reverse seems to happen for the outcome equation.
Coming to the comparison of the estimated MSEs, as shown in Figures 5 and 6, the overall results obtained by the SSRS estimator are quite similar to Greene's. Specifically, it seems that SSRS gives better results in estimating the parameters of the outcome equation (which are, generally speaking, those of greater interest), especially in the case of a severe censoring mechanism (N_A/N = 0.04). The situation is reversed for the selection equation.
When N_A/N = 0.30, the two estimators are substantially equivalent for the selection equation, while SSRS continues to slightly outperform GE for the outcome equation (except for pc = 0.25).
As expected, when N_A/N increases, the differences between the MSEs of the two estimators decrease in absolute value. Everything we said holds for both sample sizes. (For the sake of brevity, we do not report the results for N_A/N = 0.15 and for all values of ρ, but they are available upon request.) Finally, in Figure 7, we present the trend of the MSEs as the sample size increases, to get a first idea of the possible consistency of our estimator. Clearly, the property of consistency can only be proven analytically. The evaluation has been made only for the SSRS estimator, as GE is known to be consistent (Greene 2008; Manski and Lerman 1977). As usual, we consider the three sampling percentages and we compute the MSE for β̂_1, β̂_2, θ̂_1 and θ̂_2. Once again, to save space, we present only the results for β̂_1; the others are available upon request. As suggested by a referee, we ran further simulations to study the behavior of the two estimators when the true generating model of the disturbances is not normal, while the likelihood function is still based on the probabilities specified in Equations (9)-(11) (normal distribution). In particular, the generating model considered here is a bivariate skew-t; we then simulated 500 samples of size n = 2000 and n = 10,000, with ρ ∈ {0, ±0.4, ±0.8}, pc ∈ {0.25, 0.50, 0.75} and N_A/N ∈ {0.04, 0.15, 0.30}. The behavior of the two estimators remains comparable in terms of bias and mean square error. In particular, comparing the mean square errors, GE outperforms SSRS in the estimation of the selection equation; the situation is often reversed for the outcome equation. Furthermore, to compare the robustness of the two estimators to distribution misspecification, we computed the ratio between the bias obtained when the true generating model is a skew-t and the bias obtained when the model is correctly specified: the closer the ratio is to one (in absolute value), the more robust the estimator.
It emerges that the SSRS estimator is more robust than GE, especially for the selection equation parameters. All results are available upon request.

Application on Real Data: Estimation of Credit Scoring
In the following, we consider the problem of estimating the risk of loan default for credit-card holders. The population of interest is that of loan applicants, whose requests are approved or rejected according to their loan default risk.
The idea is to use the model to assign a default probability to a random individual who applies for a loan, but the only information about default probabilities comes from previous loan recipients. The problem is that the probability of default for the overall population of applicants is not necessarily the same as for those who have already received a credit card (in fact, we expect the probability for the whole population to be greater than that of cardholders). To avoid this kind of selection bias, we need to consider the selection mechanism explicitly, that is, a selection equation which explains the cardholder status.
The dataset considered here contains a subset of the covariates used in Greene (2008). For n = 13,444 applicants, the following characteristics are measured: cardholder status (CH), taking 1 if the application for a credit card was accepted and 0 if not; default status (D), taking 1 if defaulted and 0 if not (observed only when CH = 1, that is, for 10,499 observations); age in years plus twelfths of a year (Age); number of dependents (Adepcnt); months living at current address (Acadmos); number of major derogatory reports (Mjrg); number of minor derogatory reports (Mndrg); owner or tenant (Ownrent), taking 1 if the applicant owns his/her home and 0 if renting; monthly income in US dollars divided by 100 (Income); self-employment status (Selfempl), taking 1 if self-employed and 0 if not; and ratio of monthly credit card expenditure to yearly income (ExpInc).
Using these covariates, we specified the selection equation for the cardholder status (CH) and the outcome equation for the default status (D), and we estimated the parameters by the SSRS and GE estimators. The results, shown in Table 1, are quite similar for the outcome equation (the one of greater interest, because it gives the credit scoring), where the signs are coherent with expectations.
To evaluate the predictive performance of the two models, we computed the confusion matrices (see Table 2). From these matrices, it is easy to obtain the percentages of correct predictions for both estimation procedures. In particular, the overall percentages are 0.724 and 0.317 for SSRS and GE, respectively, showing a strong dominance of the former estimation method over the latter. This dominance is even more evident in the percentages conditioned on the default status: for non-defaulters, the percentage is 0.969 for SSRS and 0.160 for GE, while, for defaulters, it is 0.043 for SSRS and 0 for GE. On the other hand, GE outperforms SSRS in the prediction of non-cardholder status: the values are 0.164 for SSRS and 0.933 for GE. (The data are available in Table 7.3 at http://people.stern.nyu.edu/wgreene/Text/econometricanalysis.htm.)
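The percentages of correct predictions quoted above can be read off a confusion matrix with a helper such as the following; this is a generic sketch of ours, not the code used in the paper.

```python
def correct_prediction_rates(y_true, y_pred):
    """Overall and class-conditional fractions of correct predictions,
    as read off a 2x2 confusion matrix of 0/1 outcomes."""
    pairs = list(zip(y_true, y_pred))
    overall = sum(t == p for t, p in pairs) / len(pairs)
    by_class = {}
    for c in (0, 1):
        cls = [(t, p) for t, p in pairs if t == c]
        by_class[c] = sum(t == p for t, p in cls) / len(cls) if cls else float("nan")
    return overall, by_class

# Toy example: three non-defaulters and one defaulter.
overall, by_class = correct_prediction_rates([0, 0, 0, 1], [0, 0, 1, 1])
```

The class-conditional rates matter here because, with rare defaults, a model can score a high overall rate while predicting almost no defaulters correctly.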

Conclusions
In this paper, we propose a method for estimating the regression coefficients in binary response models with sample selection, when a censoring mechanism intervenes to make a vast majority of units unobservable. We derive the likelihood function analytically, taking advantage of the response-based sampling framework and find that it is a weighted version of Heckman's likelihood.
A simulation study highlights that the finite-sample performance of the point estimators is very satisfactory, even when compared to a similar estimator (Greene 2008), which in turn assumes knowledge of the population proportion of occurrences in the outcome equation. Specifically, our estimator slightly outperforms Greene's in estimating the parameters of the outcome equation (which are, generally speaking, those of greater interest), especially in the case of a severe censoring mechanism (N_A/N = 0.04). Under this censoring scenario, the situation is reversed for the selection equation. When the censoring is less severe, the two estimators are substantially equivalent for the selection equation, while in general ours still outperforms Greene's for the outcome equation.
An analogous result is obtained in an empirical analysis aimed to estimate the risk of loan default: our estimator performs better in the prediction of the dependent variable of the outcome equation (i.e., the default status), while Greene's does better in the selection equation (i.e., the prediction of the cardholder status).
The fact that the best results refer to the outcome equation is an advantage, since it is generally the one of greater interest. The main result is that we do not require knowledge of the true prevalence of occurrences; that is, our estimator can be successfully used when N_1A and N_0A are unknown.
Future research will be devoted to the problems arising from measurement errors, either in the dependent variable or in the covariates of the outcome equation. In the first case, the problem is obviously simpler and a modified version of the likelihood can take into account the bias arising from this kind of error. In the second case, an endogeneity issue arises and it is necessary to consider further information, for example, instrumental variables.

Acknowledgments:
This work has been funded by Sapienza University of Rome, grant 000041-17-RDB-ATENEO2016-AREZZO. The authors wish to thank Judith Turnbull for her precious help with English corrections and suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

GE      Greene's estimator
SSRS    sample-selection response-based sampling estimator
MSE     mean square error

Appendix A. Derivation of the Likelihood Function
Making use of the Bayes rule, we can rewrite the likelihood function (5) for our sample as:

L(η) = ∏_{i=1}^{n} [P(S_i = 1 | X_i, Y_i, A_i = a_i) P(Y_i = y_i, A_i = a_i | X_i)] / [P(Y_i = y_i, A_i = a_i | S_i = 1) P(S_i = 1)] f(X_i | S_i = 1)    (A1)

The ratio on the right-hand side represents the contribution to the likelihood jointly given by Y_i and A_i. However, since Y_i is observable only when A_i = 1, the contribution to the likelihood when A_i = 0 is only given by:

L_i = [P(S_i = 1 | X_i, A_i = 0) P(A_i = 0 | X_i)] / [P(A_i = 0 | S_i = 1) P(S_i = 1)] f(X_i | S_i = 1)    (A2)

where, in the second ratio of Equation (A2), we used the Bayes rule; furthermore, the probability of being in the sample does not depend on the covariates. More precisely, P(S_i = 1 | X_i, Y_i, A_i = a_i) = P(S_i = 1 | A_i = a_i), with a_i = 0, 1. For the probabilities in Equation (A2), we put: • P(S_i = 1 | A_i = 0) = n_Ā/N_Ā; • P(A_i = 0 | X_i) = 1 − P(A_i = 1 | X_i) = 1 − Aπ(X_2i); • P(A_i = 0 | S_i = 1) = n_Ā/n; • P(S_i = 1) = n/N.
On the other hand, when A_i = 1, we have:

L_i = [P(S_i = 1 | A_i = 1) P(Y_i = y_i | A_i = 1, X_i) P(A_i = 1 | X_i)] / [P(Y_i = y_i, A_i = 1 | S_i = 1) P(S_i = 1)] f(X_i | S_i = 1)    (A3)

where y_i = 0, 1 and P(Y_i = y_i, A_i = 1 | S_i = 1) = n_{y_i A}/n; hence, n_{y_i A} becomes n_1A or n_0A, according to the Y_i value, and indicates the number of individuals in the sample for which we observe, respectively, Y_i = 1 or Y_i = 0. For the probabilities in Equation (A3), we put: • P(A_i = 1 | X_i) = Aπ(X_2i), as before; • P(Y_i = y_i | A_i = 1, X_i) can be specified as a binary response model for Y_i in the covariates X_1i, so that P(Y_i = 1 | A_i = 1, X_i) = Yπ(X_1i), while P(Y_i = 0 | A_i = 1, X_i) = 1 − Yπ(X_1i); and • P(S_i = 1 | A_i = 1) = n_A/N_A.
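As a numerical sanity check on these substitutions, note that for a censored unit the ratio (n_Ā/N_Ā) / [(n_Ā/n)(n/N)] collapses to N/N_Ā, so the censored-stratum weight involves only quantities that are known once N and N_A are given. The counts below are hypothetical, chosen only to exercise the algebra exactly.

```python
from fractions import Fraction

# Hypothetical counts: population of N = 1000 with N_Abar = 900 censored
# units; sample of n = 200 containing n_Abar = 100 censored units.
N, N_Abar, n, n_Abar = 1000, 900, 200, 100

# P(S=1|A=0) / [P(A=0|S=1) * P(S=1)], as in the substitutions above.
weight = Fraction(n_Abar, N_Abar) / (Fraction(n_Abar, n) * Fraction(n, N))

assert weight == Fraction(N, N_Abar)  # the n_Abar and n terms cancel out
```

Exact rational arithmetic makes the cancellation visible without floating-point noise.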