Article

A Concentrated, Nonlinear Information-Theoretic Estimator for the Sample Selection Model

by 1,* and 2
1 Department of Economics and the Info-Metrics Institute, American University, Kreeger Hall 104, 4400 Massachusetts Ave., NW, Washington, DC 20016-8029, USA
2 Centro de Finanzas, IESA, Caracas, Venezuela
* Author to whom correspondence should be addressed.
Entropy 2010, 12(6), 1569-1580; https://doi.org/10.3390/e12061569
Received: 16 April 2010 / Revised: 3 June 2010 / Accepted: 11 June 2010 / Published: 14 June 2010

Abstract

This paper develops a semi-parametric, Information-Theoretic method for estimating the parameters of nonlinear data generated under a sample selection process. Treating the sample selection as a set of inequalities makes this model inherently nonlinear. This estimator (i) allows for a whole class of different priors, and (ii) is constructed as an unconstrained, concentrated model. The estimator is easy to apply and works well with small or complex data. We provide a number of explicit analytical examples for different prior structures and an empirical example.
Keywords: concentrated model; inequalities; information; maximum entropy; priors; sample selection

1. Introduction and Basic Model

The sample selection problem appears often in empirical studies of labor supply, individuals' wages and other topics. For small sample sizes the existing parametric [1] and semi-parametric estimators [2,3,4,5] have difficulties. Recently, [6], henceforth GMP, developed a semi-parametric, Information-Theoretic (IT) estimator for the sample-selection problem that performs well when the sample is small. This estimator is based on the IT generalized maximum entropy (GME) approach of [7] and [8]. GMP used a large number of sampling experiments to investigate and compare the small-sample behavior of their estimator relative to other estimators. GMP concluded that their IT estimator is the most stable estimator, while the likelihood estimators predicted better within the sample for large enough samples. Their IT estimator outperformed the Ahn and Powell [5] (AP) estimator in most cases and in all small samples. Another set of experiments within the nonlinear framework appears in [9].
GMP specified their IT-GME model with bounds on the parameters and with finite and discrete support. Though their IT-GME estimator performs relatively well, it retains some of the basic shortcomings of the GME approach: it has finite and bounded supports for both signal and noise, it is not flexible enough to incorporate infinitely large bounds and continuous support spaces, and it is constructed as a constrained optimization estimator. The objective here is to extend the estimator discussed in GMP in three directions. First, we allow unbounded support spaces for all parameters. Second, we accommodate a whole class of (discrete and continuous) priors. Third, we construct our estimator as an unconstrained, concentrated model.

1.1. The Basic Sample Selection Model

For simplicity, we follow a common labor model discussed in [10]. Suppose individual $h$ ($h = 1, \ldots, N$) values staying (working) at home at wage $y_{1h}^{*}$ and can earn $y_{2h}^{*}$ in the marketplace. If $y_{2h}^{*} > y_{1h}^{*}$, the individual works in the marketplace, $y_{1h} = 1$, and we observe the market value, $y_{2h} = y_{2h}^{*}$. Otherwise, $y_{1h} = 0$ and $y_{2h} = 0$.
The individual's value at home or in the marketplace depends (linearly) on demographic characteristics (x):
$$ y_{1h}^{*} = x_{1h}^{t} \beta_1 + \varepsilon_{1h} \qquad (1) $$
$$ y_{2h}^{*} = x_{2h}^{t} \beta_2 + \varepsilon_{2h} \qquad (2) $$
where $x_{1h}$ and $x_{2h}$ are $K_1$- and $K_2$-dimensional vectors, $\beta_1$ and $\beta_2$ are $K_1$- and $K_2$-dimensional vectors of unknowns, and “$t$” stands for “transpose”. This model can be expressed as
$$ y_{1h} = \begin{cases} 1 & \text{if } y_{2h}^{*} > y_{1h}^{*} \\ 0 & \text{if } y_{2h}^{*} \le y_{1h}^{*} \end{cases} \qquad (3) $$
$$ y_{2h} = \begin{cases} x_{2h}^{t} \beta_2 + \varepsilon_{2h} & \text{if } y_{2h}^{*} > y_{1h}^{*} \\ 0 & \text{if } y_{2h}^{*} \le y_{1h}^{*} \end{cases} \qquad (4) $$
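A minimal simulation of this data-generating process may help clarify the censoring rule in (3)-(4). The covariate design and coefficient values below are purely hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Hypothetical covariates and coefficients for the latent equations (1)-(2).
x1 = np.column_stack([np.ones(N), rng.normal(size=N)])
x2 = np.column_stack([np.ones(N), rng.normal(size=N)])
beta1 = np.array([0.5, 1.0])
beta2 = np.array([1.0, 0.8])

# Latent home and market values with iid normal noise.
y1_star = x1 @ beta1 + rng.normal(size=N)
y2_star = x2 @ beta2 + rng.normal(size=N)

# Censoring rule (3)-(4): the market value is observed only when y2* > y1*;
# otherwise both the participation indicator and the outcome are zero.
y1 = (y2_star > y1_star).astype(int)
y2 = np.where(y1 == 1, y2_star, 0.0)
```

The observed sample consists of (y1, y2, x1, x2) only; the latent values are shown merely to generate the data.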
Our objective is to estimate β1 and β2. Typically the researcher is interested primarily in β2.
Unlike the more traditional models, GMP constructed their model as a solution to a constrained optimization problem such that the information represented by the set of censored equations (3)-(4) enters the estimation as inequalities:
$$ x_{2h}^{t} \beta_2 + \varepsilon_{2h} = y_{2h}, \quad \text{if } y_{1h} = 1 \qquad (5) $$
$$ x_{2h}^{t} \beta_2 + \varepsilon_{2h} > x_{1h}^{t} \beta_1 + \varepsilon_{1h}, \quad \text{if } y_{1h} = 1 \qquad (6) $$
$$ x_{2h}^{t} \beta_2 + \varepsilon_{2h} \le x_{1h}^{t} \beta_1 + \varepsilon_{1h}, \quad \text{if } y_{1h} = 0 \qquad (7) $$
In our formulation, we use inequalities as well to represent all available information in the set of censored equations.

2. The Information-Theoretic Estimator

Rewrite equations (1)-(2) as finding $\gamma_1$ and $\gamma_2$ in $y_1^{*} = A_1 \gamma_1$ and $y_2^{*} = A_2 \gamma_2$, where the dependent variable is censored and where $\gamma_1 = \binom{\beta_1}{\varepsilon_1}$, $A_1 = [X_1 \; I]$, $\gamma_2 = \binom{\beta_2}{\varepsilon_2}$ and $A_2 = [X_2 \; I]$. We formulate the censored model (5)-(7) in the following way.
Let the constraint sets be $C_i = C_{i,s} \times C_{i,n}$ ($i = 1, 2$). For each $i$, $C_{i,s}$ is an auxiliary closed, convex set used to model the a-priori constraints on the $\beta$’s. Similarly, the closed convex set $C_{i,n}$ is part of the specification of the “physical” nature of the noise and contains all possible realizations of $\varepsilon$. We view the coordinates $\varsigma_i$ in $C_{i,s}$ and $\nu_i$ in $C_{i,n}$ as values of random variables distributed according to some probability measure $dP_i(\xi_i) \equiv dP_i(\varsigma_i, \nu_i)$ such that their expectations (E) are
$$ \beta_i = E_{P_i}[\varsigma_i] \quad \text{and} \quad \varepsilon_i = E_{P_i}[\nu_i] \qquad (8) $$
We note that the qualifier “prior” assigned to the $Q$ probability measures is not the traditional Bayesian view. Rather, the $Q_{i,s}$ are just mathematical constructs that transform the estimation problem into a variational problem. The $Q_{i,n}$, however, could be viewed as the probability measures describing the statistical nature of the noise. The estimation of the noise involves a tilting of the prior measure.
Given some (any) prior measures $dQ_i(\xi_i) \equiv dQ_i(\varsigma_i, \nu_i) = dQ_{i,s}(\varsigma_i)\, dQ_{i,n}(\nu_i)$, we search for densities $\rho_i(\xi_i)$ such that $dP_i = \rho_i(\xi_i)\, dQ_i(\xi_i)$ satisfies the system (3)-(4). This yields the parameter estimator $\beta^{*}$ and the estimated residuals $\varepsilon^{*}$.
Next, let $S(P_i, Q_i)$ denote the differential entropy divergence between the prior $Q_i$ and the post-data (posterior) $P_i$. This is just the continuous version of the Kullback-Leibler information divergence, also known as relative entropy (see [11,12,13]). Since the data are naturally divided into observed and unobserved parts, we split the index set $\{1, 2, \ldots, N\}$ into two subsets, $J$ and $J^{c}$. Next, rewrite the data (3)-(4), $y_1^{*} = A_1 \gamma_1$ and $y_2^{*} = A_2 \gamma_2$, as
$$ y_1^{*} = \binom{y_1}{\bar{y}_1} = \binom{B_1}{\bar{B}_1} E_{P_1}[\xi_1] \quad \text{and} \quad y_2^{*} = \binom{y_2}{\bar{y}_2} = \binom{B_2}{\bar{B}_2} E_{P_2}[\xi_2] \qquad (9) $$
where the matrices $B_1$ and $B_2$ consist of the rows of the matrices $A_i$ ($i = 1, 2$) labeled by the indices for which observations are available. For the indices in $J$ the values $y_2$ are observed and $y_{2h}^{*} > y_{1h}^{*}$, whereas for the indices in $J^{c}$ all we know is that $\bar{y}_{1h}^{*} \ge \bar{y}_{2h}^{*}$.
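The row splitting just described can be sketched in a few lines; the helper name and the toy matrix below are illustrative only, assuming a 0/1 participation indicator marks the rows in $J$:

```python
import numpy as np

def split_rows(A, selected):
    """Split the rows of A into the observed block B (indices in J, where
    selected == 1) and the censored block B_bar (indices in J^c)."""
    selected = np.asarray(selected, dtype=bool)
    return A[selected], A[~selected]

# Tiny illustration with A = [X | I] as in the text.
X = np.arange(12.0).reshape(4, 3)
A = np.hstack([X, np.eye(4)])
y1 = np.array([1, 0, 1, 0])          # hypothetical participation indicator
B, B_bar = split_rows(A, y1)
```

Stacking B over B_bar merely reorders the rows of A, so no information is lost in the split.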
Our “Basic (Primal) Problem” is the solution to
$$ \sup_{(P_1, P_2)} \left\{ S(P_1, Q_1) + S(P_2, Q_2) \;\middle|\; y_2 = B_2 E_{P_2}[\xi_2],\; B_2 E_{P_2}[\xi_2] > B_1 E_{P_1}[\xi_1],\; \bar{B}_2 E_{P_2}[\xi_2] \le \bar{B}_1 E_{P_1}[\xi_1] \right\} \qquad (10) $$
where the inequalities between vectors are taken to be component wise.
Next, we formulate the problem as a concentrated (unconstrained) entropy problem. To do so, we view the basic primal problem as a two stage problem, call it an “equivalent primal problem.” In the equivalent model the first stage consists of the standard Generalized Entropy problem (the equality portion of the model) for which a dual can be easily formulated.
The Equivalent primal problem is a solution to the two stage optimization problem:
$$ \sup_{\eta} \left\{ \sup_{(P_1, P_2)} \left\{ S(P_1, Q_1) + S(P_2, Q_2) \;\middle|\; y_2 = B_2 E_{P_2}[\xi_2],\; \eta_1 = B_1 E_{P_1}[\xi_1],\; \bar{\eta}_2 = \bar{B}_2 E_{P_2}[\xi_2],\; \bar{\eta}_1 = \bar{B}_1 E_{P_1}[\xi_1] \right\} \;\middle|\; y_2 > \eta_1,\; \bar{\eta}_2 \le \bar{\eta}_1 \right\} \qquad (11) $$
Theorem 2.1.
The equivalent primal problem (11) is equivalent to the following (dual) problem
$$ \inf \left\{ \ln Z(\lambda_1, \bar{\lambda}, \lambda_2, \bar{\lambda}) + \langle \lambda_1 + \lambda_2, y_2 \rangle \;\middle|\; \lambda_1 \in \mathbb{R}_{+}^{|J|},\; \lambda_2 \in \mathbb{R}^{|J|} \text{ and } \bar{\lambda} \in \mathbb{R}_{+}^{N - |J|} \right\} \qquad (12) $$
where $\langle a, b \rangle$ denotes the Euclidean scalar (inner) product of the vectors $a$ and $b$,
$$ \begin{aligned} Z(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) &= \int_{C_1 \times C_2} e^{-\langle \lambda_1, B_1 \xi_1 \rangle} e^{-\langle \bar{\lambda}_1, \bar{B}_1 \xi_1 \rangle} e^{-\langle \lambda_2, B_2 \xi_2 \rangle} e^{-\langle \bar{\lambda}_2, \bar{B}_2 \xi_2 \rangle}\, dQ_1(\xi_1)\, dQ_2(\xi_2) \\ &= \int_{C_1} e^{-\langle \lambda_1, B_1 \xi_1 \rangle} e^{-\langle \bar{\lambda}_1, \bar{B}_1 \xi_1 \rangle}\, dQ_1(\xi_1) \int_{C_2} e^{-\langle \lambda_2, B_2 \xi_2 \rangle} e^{-\langle \bar{\lambda}_2, \bar{B}_2 \xi_2 \rangle}\, dQ_2(\xi_2) \\ &= Z_1(\lambda_1, \bar{\lambda}_1)\, Z_2(\lambda_2, \bar{\lambda}_2) \end{aligned} $$
and $(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2)$ are the four sets of Lagrange multipliers associated with (11). To carry out the procedure specified in (12), first set $\bar{\lambda}_1 = \bar{\lambda}_2 = \bar{\lambda}$ and then carry out the minimization.
Proof: 
To confirm the uniqueness of the solution to problem (12), observe that the function
$$ \ell(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) = \ln Z(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) + \langle \lambda_1, \eta_1 \rangle + \langle \bar{\lambda}_1, \bar{\eta}_1 \rangle + \langle \lambda_2, y_2 \rangle + \langle \bar{\lambda}_2, \bar{\eta}_2 \rangle $$
is strictly convex on its domain $\Psi(Q) = \{ \lambda = (\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) \mid Z(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) < \infty \}$, and that if $\ell(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) \to \infty$ as $\lambda = (\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) \to \partial \Psi$, then problem (12) has a unique solution, where $\partial \Psi$ is the boundary of the set $\Psi$. This is always true in the cases we consider here. A simple example in which it does not hold is $\ell(\lambda) = e^{\lambda} + \lambda y$ with $\mathbb{R}$ as domain; this has no minimum for positive $y$.
Solving (12) yields λ 1 * , λ ¯ * , λ 2 * , which in turn yields the optimal maximum entropy (posterior) density.
$$ \rho^{*}(\xi_1, \xi_2) = \frac{e^{-\langle \lambda_1^{*}, B_1 \xi_1 \rangle - \langle \bar{\lambda}, \bar{B}_1 \xi_1 \rangle}\; e^{-\langle \lambda_2^{*}, B_2 \xi_2 \rangle - \langle \bar{\lambda}, \bar{B}_2 \xi_2 \rangle}}{Z_1(\lambda_1^{*}, \bar{\lambda})\, Z_2(\lambda_2^{*}, \bar{\lambda})} = \rho_1^{*}(\xi_1)\, \rho_2^{*}(\xi_2) $$
This density is naturally factored into a product of the maximum entropy densities of the two sets of equations. Therefore, $\xi_1$ and $\xi_2$ are independent with respect to the reconstructed measure $dP^{*}(\xi_1, \xi_2) = \rho_1^{*}(\xi_1)\, \rho_2^{*}(\xi_2)\, dQ_1(\xi_1)\, dQ_2(\xi_2)$, just as they are with respect to the original priors. Once $P^{*}$ is found, we follow (8), or (9), to get
$$ \binom{\beta_i^{*}}{\varepsilon_i^{*}} = E_{P_i^{*}}[\xi_i] = \int_{C_i} \xi_i\, \rho_i^{*}(\xi_i)\, dQ_i(\xi_i), \quad i = 1, 2 \qquad (13) $$
With that generic formulation, we show below three analytic examples that cover a wide range of possible priors and support spaces for β and ε.

3. Large Sample Properties

Denote by $\beta_{iN}^{*}$ the estimator of the true $\beta_i$ when the sample size is $N$. Throughout this section we add a subscript $N$ to all quantities introduced in Section 2 to remind us that the size of the data set is $N$. We show that $\beta_{iN}^{*} \to \beta_i$ and $\sqrt{N}(\beta_{iN}^{*} - \beta_i) \to N(0, V_i)$ as $N \to \infty$ in some appropriate way. The proof is similar in logic to Proposition 3.2 in [14]. We assume:
Assumption 3.1.
For every sample size $N$, the minimizers of (12) are all in the interior of their domains: $\lambda_1^{*} \in \operatorname{int}(\mathbb{R}_{+}^{|J|})$ and $\bar{\lambda}_1^{*} \in \operatorname{int}(\mathbb{R}_{+}^{N - |J|})$, where “int” stands for interior.
Assumption 3.2.
Let $\frac{1}{N} X_i^{t} X_i = \frac{1}{N}(B_i^{t} B_i + \bar{B}_i^{t} \bar{B}_i) = \frac{J}{N} \left( \frac{1}{J} B_i^{t} B_i \right) + \frac{N - J}{N} \left( \frac{1}{N - J} \bar{B}_i^{t} \bar{B}_i \right)$. Then, assume there exists $\alpha \in (0, 1)$ such that (i) $N \to \infty$ and $J \to \infty$ in such a way that $(J/N) \to \alpha$, and (ii) there exist two matrices $W_i^{o}$ and $W_i^{u}$ such that $\frac{1}{J} B_i^{t} B_i \to W_i^{o}$ and $\frac{1}{N - J} \bar{B}_i^{t} \bar{B}_i \to W_i^{u}$.
Note that $\frac{1}{N} X_i^{t} X_i \to W_i = \alpha W_i^{o} + (1 - \alpha) W_i^{u}$.
Proposition 3.1.
(Convergence in distribution.) Under Assumptions 3.1 and 3.2
a)
$\beta_{iN}^{*} \xrightarrow{D} \beta_i$ as $N \to \infty$, for $i = 1, 2$;
b)
$\sqrt{N}\,(\beta_{iN}^{*} - \beta_i) \xrightarrow{D} N(0, V_i)$ as $N \to \infty$,
where $\xrightarrow{D}$ stands for convergence in distribution and $V_i = \Sigma_i W_i^{-1} \Sigma_i$, with $\Sigma_i$ the covariance matrix of $\varsigma_i$ with respect to $dQ_i(\varsigma_i, \nu_i)$.
The approximate finite sample variance is $\sigma_i^{*2} = \frac{1}{N - K_i}\, \varepsilon_i^{*t} \varepsilon_i^{*}$ for $i = 1, 2$, with $\varepsilon_i^{*} = E_{P_i^{*}}[\nu_i]$ as shown in (8) or, similarly, in (12).
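The approximate finite-sample variance is just the degrees-of-freedom corrected average of the squared estimated residuals; a minimal sketch (the residual values below are hypothetical):

```python
import numpy as np

def approx_variance(eps_star, K):
    """sigma*^2 = eps*' eps* / (N - K): the degrees-of-freedom corrected
    mean of the squared estimated residuals."""
    eps_star = np.asarray(eps_star, dtype=float)
    N = eps_star.size
    return float(eps_star @ eps_star) / (N - K)

# Hypothetical residual vector with N = 5 observations and K = 2 regressors.
sigma2 = approx_variance([1.0, -1.0, 2.0, 0.0, -2.0], K=2)   # 10 / 3
```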

4. Analytic Examples

We discuss three examples, corresponding to priors under which the $\beta$’s are unbounded (Normal), bounded below (Gamma), or bounded below and above (Bernoulli). Under the normal priors, the minimum described in (12) can be computed explicitly. In the other cases, a numerical computation is necessary.

4.1. Normal Priors

Let the constraint space be $C = C_s \times C_n = \mathbb{R}^{K} \times \mathbb{R}^{N}$. Using the traditional view and centering the support spaces at zero, the prior (a product of two normal distributions) is
$$ dQ(\xi) = \frac{\exp\left( -\langle \xi, \Sigma^{-2} \xi \rangle / 2 \right)}{(2\pi)^{(N + K)/2} (\det \Sigma^{2})^{1/2}}\, d\xi $$
The covariance $\Sigma^{2}$ has two diagonal blocks: one $K \times K$ and one $N \times N$. Without loss of generality, we assume that these two blocks are $\sigma_s^{2} I_K$ and $\sigma_n^{2} I_N$, respectively. Our basic model holds for the general block-diagonal covariance structure $\Sigma^{2} = \begin{pmatrix} \Sigma_s^{2} & 0 \\ 0 & \Sigma_n^{2} \end{pmatrix}$.
Formulating these priors within our model yields
$$ \ln Z_1(\lambda_1, \bar{\lambda}) = \frac{1}{2} \left\{ \langle \lambda_1, B_1 \Sigma_1^{2} B_1^{t} \lambda_1 \rangle + 2 \langle \bar{\lambda}, \bar{B}_1 \Sigma_1^{2} B_1^{t} \lambda_1 \rangle + \langle \bar{\lambda}, \bar{B}_1 \Sigma_1^{2} \bar{B}_1^{t} \bar{\lambda} \rangle \right\} $$
$$ \ln Z_2(\lambda_2, \bar{\lambda}) = \frac{1}{2} \left\{ \langle \lambda_2, B_2 \Sigma_2^{2} B_2^{t} \lambda_2 \rangle + 2 \langle \bar{\lambda}, \bar{B}_2 \Sigma_2^{2} B_2^{t} \lambda_2 \rangle + \langle \bar{\lambda}, \bar{B}_2 \Sigma_2^{2} \bar{B}_2^{t} \bar{\lambda} \rangle \right\} $$
where $\Sigma_1^{2}$ is a diagonal $(K + N) \times (K + N)$ matrix, the first block being a $K \times K$ matrix with diagonal entries $\sigma_{1,s}^{2}$ and the second an $N \times N$ matrix with diagonal entries $\sigma_{1,n}^{2}$. That is, the priors on the signal and noise spaces are iid normal random variables. Thus, problem (12) consists of finding the minimum of
$$ \ln Z(\lambda_1, \bar{\lambda}, \lambda_2, \bar{\lambda}) + \langle \lambda_1 + \lambda_2, y_2 \rangle = \langle \lambda_1 + \lambda_2, y_2 \rangle + \frac{1}{2} \left\{ \langle \lambda_1, B_1 \Sigma_1^{2} B_1^{t} \lambda_1 \rangle + 2 \langle \bar{\lambda}, \bar{B}_1 \Sigma_1^{2} B_1^{t} \lambda_1 \rangle + \langle \bar{\lambda}, \bar{B}_1 \Sigma_1^{2} \bar{B}_1^{t} \bar{\lambda} \rangle \right\} + \frac{1}{2} \left\{ \langle \lambda_2, B_2 \Sigma_2^{2} B_2^{t} \lambda_2 \rangle + 2 \langle \bar{\lambda}, \bar{B}_2 \Sigma_2^{2} B_2^{t} \lambda_2 \rangle + \langle \bar{\lambda}, \bar{B}_2 \Sigma_2^{2} \bar{B}_2^{t} \bar{\lambda} \rangle \right\} $$
over the set described in (12).
To verify that the minimizer of (12) occurs in the interior of the constraint set, we look at the first order conditions
$$ B_1 \Sigma_1^{2} B_1^{t} \lambda_1 + B_1 \Sigma_1^{2} \bar{B}_1^{t} \bar{\lambda} + y_2 = 0 $$
$$ B_2 \Sigma_2^{2} B_2^{t} \lambda_2 + B_2 \Sigma_2^{2} \bar{B}_2^{t} \bar{\lambda} + y_2 = 0 $$
$$ \bar{B}_1 \Sigma_1^{2} \bar{B}_1^{t} \bar{\lambda} + \bar{B}_1 \Sigma_1^{2} B_1^{t} \lambda_1 + \bar{B}_2 \Sigma_2^{2} \bar{B}_2^{t} \bar{\lambda} + \bar{B}_2 \Sigma_2^{2} B_2^{t} \lambda_2 = 0 $$
If the solution of this linear system lies inside the constraint set of (12), it is feasible and provides the minimizer.
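Because the objective is quadratic under normal priors, the first-order conditions form a symmetric linear system in $(\lambda_1, \lambda_2, \bar{\lambda})$. The sketch below assembles and solves that system with NumPy; the dimensions, prior variances, and data are hypothetical, and an interior solution (Assumption 3.1) is assumed so the sign constraints of (12) are ignored:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: N observations, K regressors, J_obs = |J| uncensored.
N, K, J_obs = 8, 2, 5
X1 = rng.normal(size=(N, K))
X2 = rng.normal(size=(N, K))
y2 = rng.normal(size=J_obs)          # observed outcomes on J
s2_s, s2_n = 1.0, 0.5                # prior variances sigma_s^2, sigma_n^2

def split(X):
    """A_i = [X_i | I]; B_i holds the observed rows, B_bar_i the censored ones."""
    A = np.hstack([X, np.eye(N)])
    return A[:J_obs], A[J_obs:]

B1, Bb1 = split(X1)
B2, Bb2 = split(X2)
S = np.diag(np.concatenate([np.full(K, s2_s), np.full(N, s2_n)]))  # Sigma^2

# Blocks of the (symmetric, positive-definite) Hessian of the quadratic dual.
M11, M22 = B1 @ S @ B1.T, B2 @ S @ B2.T
G1, G2 = B1 @ S @ Bb1.T, B2 @ S @ Bb2.T
Mbar = Bb1 @ S @ Bb1.T + Bb2 @ S @ Bb2.T
H = np.block([
    [M11,                      np.zeros((J_obs, J_obs)), G1],
    [np.zeros((J_obs, J_obs)), M22,                      G2],
    [G1.T,                     G2.T,                     Mbar],
])
rhs = np.concatenate([-y2, -y2, np.zeros(N - J_obs)])

lam = np.linalg.solve(H, rhs)        # stacked (lambda1, lambda2, lambda_bar)
lam1, lam2, lam_bar = lam[:J_obs], lam[J_obs:2 * J_obs], lam[2 * J_obs:]
```

If the solved multipliers satisfy the sign constraints of (12) componentwise, this interior point is the minimizer; otherwise a bound-constrained solver would be needed.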
Once the system is solved for λ 1 * , λ ¯ 1 * , λ 2 * , the estimated densities are
$$ dP_i^{*}(\xi_i) = \frac{e^{-\| \Sigma_i^{-1} \xi_i + \Sigma_i h_i^{*} \|^{2} / 2}}{(2\pi)^{(N + K_i)/2} (\det \Sigma_i^{2})^{1/2}}\, d\xi_i $$
which, as expected, are normally distributed. Defining
$$ h_i = B_i^{t} \lambda_i^{*} + \bar{B}_i^{t} \bar{\lambda}_i^{*} = A_i^{t} \mu_i = \begin{pmatrix} X_i^{t} \\ I \end{pmatrix} \mu_i, \qquad \mu_i = \begin{pmatrix} \lambda_i^{*} \\ \bar{\lambda}_i^{*} \end{pmatrix} $$
(recall that $\bar{\lambda}_2^{*} = \bar{\lambda}_1^{*} = \bar{\lambda}$ due to the constraints) we use (13) to get
$$ \begin{pmatrix} \beta_i^{*} \\ \varepsilon_i^{*} \end{pmatrix} = E_{P_i^{*}}[\xi_i] = -\Sigma_i^{2} A_i^{t} \mu_i = -\begin{pmatrix} \sigma_{i,s}^{2} & 0 \\ 0 & \sigma_{i,n}^{2} \end{pmatrix} A_i^{t} \mu_i = \begin{pmatrix} -\sigma_{i,s}^{2} X_i^{t} \mu_i \\ -\sigma_{i,n}^{2} \mu_i \end{pmatrix} $$
for $i = 1, 2$, where $\sigma_{i,s}^{2}$ and $\sigma_{i,n}^{2}$ are (diagonal) matrices.

4.2. Gamma Priors

Let the $\beta$’s be bounded below by 0. This can easily be generalized by an appropriate shift of the support of the distributions. To show the generality of our model, we let the prior on the noise be normal, thereby showing that one can use different priors for the signal and the noise.
The signal and noise constraint spaces are, respectively, $C_s = \mathbb{R}_{+}^{K} = [0, \infty)^{K}$ and $C_n = \mathbb{R}^{N}$. The prior is
$$ dQ(\xi) = \left( \prod_{j=1}^{K} \frac{a^{b}\, \varsigma_j^{b-1}\, e^{-a \varsigma_j}}{\Gamma(b)}\, d\varsigma_j \right) \frac{e^{-\langle \nu, \Sigma_n^{-2} \nu \rangle / 2}}{(2\pi)^{N/2} (\det \Sigma_n^{2})^{1/2}}\, d\nu $$
Before specifying the concentrated entropy function, we study the matrix $A_1$, defined as $A_1 = (X_1 \; I) = \binom{B_1}{\bar{B}_1} = \begin{pmatrix} D_1 & I_1 \\ \bar{D}_1 & \bar{I}_1 \end{pmatrix}$. Note that $\binom{D_1}{\bar{D}_1}$ splits $X_1$ and $\binom{I_1}{\bar{I}_1}$ splits the $N \times N$ identity matrix to match the splitting of $X$. The concentrated entropy function is
$$ \ln Z_1(\lambda_1, \bar{\lambda}) = \sum_{j=1}^{K} b \ln \left( \frac{a}{a + (D_1^{t} \lambda_1 + \bar{D}_1^{t} \bar{\lambda})_j} \right) + \frac{\sigma_{1,n}^{2}}{2} \left( \| \lambda_1 \|^{2} + \| \bar{\lambda} \|^{2} \right) $$
(Note that $D_1^{t} \lambda_1 + \bar{D}_1^{t} \bar{\lambda} = X_1^{t} \mu_1$ and $D_2^{t} \lambda_2 + \bar{D}_2^{t} \bar{\lambda} = X_2^{t} \mu_2$.) A similar expression holds for $\ln Z_2(\lambda_2, \bar{\lambda})$.
The problem (12) consists of minimizing
$$ \ln Z(\lambda_1, \bar{\lambda}, \lambda_2, \bar{\lambda}) + \langle \lambda_1 + \lambda_2, y_2 \rangle = \sum_{j=1}^{K_1} b \ln \left( \frac{a}{a + (D_1^{t} \lambda_1 + \bar{D}_1^{t} \bar{\lambda})_j} \right) + \sum_{j=1}^{K_2} b \ln \left( \frac{a}{a + (D_2^{t} \lambda_2 + \bar{D}_2^{t} \bar{\lambda})_j} \right) + \frac{\sigma_{1,n}^{2}}{2} \left( \| \lambda_1 \|^{2} + \| \bar{\lambda} \|^{2} \right) + \frac{\sigma_{2,n}^{2}}{2} \left( \| \lambda_2 \|^{2} + \| \bar{\lambda} \|^{2} \right) + \langle \lambda_1 + \lambda_2, y_2 \rangle $$
Once λ 1 * , λ ¯ * , λ 2 * are found, the optimal density is
$$ dP_i^{*}(\xi_i) = \left( \prod_{j=1}^{K_i} \frac{\left( a + (X_i^{t} \mu_i)_j \right)^{b}\, \varsigma_j^{b-1}\, e^{-\left( (X_i^{t} \mu_i)_j + a \right) \varsigma_j}}{\Gamma(b)}\, d\varsigma_j \right) \frac{e^{-\| \nu + \sigma_{i,n}^{2} \mu_i^{*} \|^{2} / 2 \sigma_{i,n}^{2}}}{(2\pi \sigma_{i,n}^{2})^{N_i / 2}}\, d\nu $$
The estimated parameters are
$$ (\beta_i^{*})_j = E_{P_i^{*}}[(\varsigma_i)_j] = \frac{b}{a + (X_i^{t} \mu_i)_j}, \quad \text{for } j = 1, \ldots, K_i \text{ and } i = 1, 2 $$
The realized residuals are
$$ (\varepsilon_i^{*})_l = E_{P_i^{*}}[(\nu_i)_l] = -\sigma_{i,n}^{2} (\mu_i^{*})_l, \quad \text{for } l = 1, \ldots, N_i \text{ and } i = 1, 2 $$
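The Gamma-case posterior mean for the signal can be evaluated directly once the multipliers are in hand. In this sketch every input (design matrix, multipliers, and the Gamma parameters a and b) is hypothetical; with no tilt ($\mu = 0$) the estimate collapses to the Gamma prior mean $b/a$, which provides a simple sanity check:

```python
import numpy as np

def gamma_posterior_means(X, mu, a, b):
    """beta*_j = b / (a + (X' mu)_j): posterior mean of a Gamma(b, rate a)
    signal prior exponentially tilted by the estimated multipliers mu."""
    return b / (a + X.T @ mu)

# Hypothetical check: with mu = 0 there is no tilt, so every coefficient
# estimate equals the prior mean b / a = 1.5.
beta0 = gamma_posterior_means(np.ones((4, 2)), np.zeros(4), a=2.0, b=3.0)
```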

4.3. Bernoulli Priors

This example represents another extreme case, where the $\beta$’s are assumed bounded. For simplicity, assume that we know that all $\beta$’s lie in the interval $[a, b]$, which makes $C_s = [a, b]^{K_i}$ the choice for all of the constraints on the signal space. For the noise component, we follow the previous formulation of normal priors.
With this background, the prior measure used is
d Q ( ξ ) = ( j = 1 K 1 2 ( δ a ( d ς j ) + δ b ( d ς j ) ) ) e v , Σ n 2 v / 2 ( 2 π ) N / 2 ( det Σ n 2 ) d v
The concentrated entropies are
$$ \ln Z_i(\lambda_i, \bar{\lambda}) = \sum_{j=1}^{K_i} \ln \frac{1}{2} \left( e^{-g_{i,j} a} + e^{-g_{i,j} b} \right) + \frac{\sigma_n^{2}}{2} \| \mu_i \|^{2}, \quad i = 1, 2 $$
where $g_i = D_i^{t} \lambda_i + \bar{D}_i^{t} \bar{\lambda}_i$ and $\mu_i = (\lambda_i, \bar{\lambda}_i)^{t}$. Recall that $\bar{\lambda}_2 = \bar{\lambda}_1$. In this case, the function to be minimized is
$$ \ln Z(\lambda_1, \bar{\lambda}, \lambda_2, \bar{\lambda}) + \langle \lambda_1 + \lambda_2, y_2 \rangle = \sum_{j=1}^{K_1} \ln \frac{1}{2} \left( e^{-g_{1,j} a} + e^{-g_{1,j} b} \right) + \sum_{j=1}^{K_2} \ln \frac{1}{2} \left( e^{-g_{2,j} a} + e^{-g_{2,j} b} \right) + \frac{\sigma_{1,n}^{2}}{2} \left( \| \lambda_1 \|^{2} + \| \bar{\lambda} \|^{2} \right) + \frac{\sigma_{2,n}^{2}}{2} \left( \| \lambda_2 \|^{2} + \| \bar{\lambda} \|^{2} \right) + \langle \lambda_1 + \lambda_2, y_2 \rangle $$
which is minimized over the region described in (12). Again, the optimal solution (the minimizing $\lambda_1^{*}, \bar{\lambda}^{*}, \lambda_2^{*}$) is found numerically. The estimated post-data density is
$$ dP_i^{*}(\xi) = \left( \prod_{j=1}^{K_i} \left( p_{i,j}\, \delta_a(d\varsigma_j) + q_{i,j}\, \delta_b(d\varsigma_j) \right) \right) \frac{e^{-\| \nu + \sigma_{i,n}^{2} \mu_i^{*} \|^{2} / 2 \sigma_{i,n}^{2}}}{(2\pi \sigma_{i,n}^{2})^{N_i / 2}}\, d\nu $$
for i = 1,2 and where
$$ p_{i,j} = \frac{e^{-a g_{i,j}}}{e^{-a g_{i,j}} + e^{-b g_{i,j}}} = 1 - q_{i,j} $$
from which the estimated parameters and residuals are given by
$$ \beta_{i,j}^{*} = \frac{a\, e^{-a g_{i,j}} + b\, e^{-b g_{i,j}}}{e^{-a g_{i,j}} + e^{-b g_{i,j}}} \quad \text{and} \quad \varepsilon_i^{*} = -\sigma_{i,n}^{2} \mu_i^{*} $$
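The Bernoulli-case weights and posterior means are a two-point softmax and can be sketched in a few lines. The exponent sign follows the exponential-tilt convention assumed above, the inputs (g, a, b) are hypothetical, and a max-subtraction guards against overflow for large multipliers:

```python
import numpy as np

def bernoulli_posterior_means(g, a, b):
    """Two-point posterior on {a, b}: p weights the atom at a, q = 1 - p the
    atom at b, and beta* = a*p + b*q is the posterior mean."""
    m = np.maximum(-a * g, -b * g)       # stabilizer: largest exponent
    ea = np.exp(-a * g - m)
    eb = np.exp(-b * g - m)
    p = ea / (ea + eb)
    return p, a * p + b * (1.0 - p)

# Hypothetical check: when g = 0 both atoms get equal weight, so beta* is
# the midpoint (a + b) / 2; for a = -1, b = 1 that midpoint is 0.
p0, beta0 = bernoulli_posterior_means(np.zeros(3), a=-1.0, b=1.0)
```

As the multipliers grow, the weights concentrate on one bound, so the estimate never leaves [a, b].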

5. Empirical Example

We illustrate the applicability of our approach using an empirical application consisting of a small data set. The objective here is to demonstrate that our IT estimator is easy to apply and can be used with many different priors. The small sample performance of the IT-GME version of that estimator (uniform discrete priors) and detailed comparisons with other competing estimators are already shown in GMP and fall outside the objectives of this note. The empirical example is based on one of the examples analyzed in GMP, with data drawn from the March 1996 Current Population Survey. We estimated the wage-participation model for the subset of respondents in the labor market. Workers who are self-employed are excluded from the sample. Since the normal maximum likelihood estimator did not converge for these data [15], only results for the OLS, Heckman two-step, a semi-parametric estimator with a nonparametric selection mechanism due to [5], AP, and the different IT models developed here are reported [16]. To make our results comparable across the IT estimators, we use the empirical standard deviations in all three cases and use supports between –100 and 100 for the IT-GME (uniform discrete priors) and the IT-Bernoulli case. In both the IT-Normal and IT-Bernoulli cases the priors used for the noise components are normal (as shown in Section 4). Under these very similar specifications, we would expect all three IT examples to yield comparable estimates. Naturally, there are many other priors to choose from, but the objective here is just to show the flexibility and applicability of our approach.
We analyze a sample of 151 Native American females, of whom 65 are in the labor force. The wage equation covariates include years of education, a dummy for currently enrolled in school, potential experience (age − education − 6) and potential experience squared, a dummy for rural location, and a dummy for central city location. The covariates in the selection equation include all the variables in the wage equation plus the amount of welfare payments received in the previous year, a dummy equal to one for married, and the number of children. We use these three exclusion restrictions to identify the wage equation in the parametric and nonparametric two-step approaches.
Table 1. Estimates of the Native American wage equation (151 individuals; 65 in labor force).

|                    | OLS    | 2-Step  | AP     | IT-GME | IT-Normal | IT-Bernoulli |
|--------------------|--------|---------|--------|--------|-----------|--------------|
| Constant           | 1.073  | 1.771   | NA     | 1.038  | 1.049     | 1.068        |
| Education          | 0.055  | 0.043   | 0.044  | 0.054  | 0.056     | 0.055        |
| Experience         | 0.038  | 0.023   | 0.038  | 0.038  | 0.038     | 0.038        |
| Experience Squared | –0.001 | –0.0005 | –0.001 | –0.001 | –0.001    | –0.001       |
| Rural              | 0.214  | 0.268   | 0.332  | 0.210  | 0.215     | 0.214        |
| Central City       | –0.170 | –0.091  | –0.171 | –0.186 | –0.166    | –0.169       |
| Enrolled in School | –0.290 | –0.471  | –0.190 | –0.301 | –0.283    | –0.288       |
| λ                  |        | –0.461  |        |        |           |              |
| R²                 | 0.355  | 0.376   | NA     | 0.343  | 0.355     | 0.354        |
| MSPE               | 0.157  | 0.135   | NA     | 0.147  | 0.144     | 0.144        |

Notes: Bold numbers reflect estimates significantly different from zero at the 10% level.
Table 1 presents the estimated coefficients for the wage equation. The R² and Mean Squared Prediction Error (MSPE) for each model are presented as well. All IT estimators outperform the other estimators in terms of predicting selection [17]. The estimated return to education is about 5% across all estimation methods, but it is statistically significantly different from zero only for the OLS and the IT estimators. Though all estimators yield parameter estimates of the same magnitude and sign, only the OLS and the three reported IT estimates are statistically significantly different from zero in most cases.

6. Conclusion

In this short paper we develop a simple-to-apply, information-theoretic method for analyzing nonlinear data with a sample selection problem. Rather than using a likelihood approach or a semi-parametric approach, we generalized further the IT-GME model of Golan, Moretti and Perloff (2004). Our model (i) allows for bounded and unbounded supports on all the unknown parameters, (ii) allows us to use a whole class of priors (continuous or discrete), (iii) is specified as a nonlinear concentrated entropy model, and (iv) is easy to apply. Like GMP, our model works well even with small data sets. This is shown in our empirical example. The extensions developed here mark a significant improvement on the GMP model and other IT, generalized entropy models.
A detailed set of sampling experiments comparing our IT method with all other competitors, under different data processes, will be done in future work.

Acknowledgement

We thank Enrico Moretti and Jeff Perloff for their comments on earlier versions of this work.

References and Notes

  1. Heckman, J. Sample selection bias as a specification error. Econometrica 1979, 47, 153–161. [Google Scholar] [CrossRef]
  2. Manski, C.F. Semiparametric analysis of discrete response: Asymptotic properties of the maximum score estimator. J. Econom. 1985, 27, 313–333. [Google Scholar] [CrossRef]
  3. Cosslett, S.R. Distribution-free maximum likelihood estimator of the binary choice model. Econometrica 1981, 51, 765–782. [Google Scholar] [CrossRef]
  4. Han, A.K. Non-parametric analysis of a generalized regression model: The maximum rank correlation estimator. J. Econom. 1987, 35, 303–316. [Google Scholar] [CrossRef]
  5. Ahn, H.; Powell, J.L. Semiparametric estimation of censored selection models with a nonparametric selection mechanism. J. Econom. 1993, 58, 3–29. [Google Scholar] [CrossRef]
  6. Golan, A.; Moretti, E.; Perloff, J.M. A small sample estimation of the sample selection model. Econom. Rev. 2004, 23, 71–91. [Google Scholar] [CrossRef]
  7. Golan, A.; Judge, G.; Miller, D. Maximum Entropy Econometrics: Robust Estimation with Limited Data; John Wiley & Sons: New York, NY, USA, 1996. [Google Scholar]
  8. Golan, A.; Judge, G.; Perloff, J.M. Recovering information from censored and ordered multinomial response data. J. Econom. 1997, 79, 23–51. [Google Scholar] [CrossRef]
  9. Golan, A. An information theoretic approach for estimating nonlinear dynamic models. Nonlinear Dynamics Econom. 2003, 7, 2. [Google Scholar] [CrossRef]
  10. Maddala, G.S. Limited-Dependent and Qualitative Variables in Econometrics; Cambridge University Press: Cambridge, UK, 1983. [Google Scholar]
  11. Kullback, S. Information Theory and Statistics; John Wiley & Sons: New York, NY, USA, 1959. [Google Scholar]
  12. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Statist. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  13. Gokhale, D.V.; Kullback, S. The Information in Contingency Tables; Marcel Dekker: New York, NY, USA, 1978. [Google Scholar]
  14. Golan, A.; Gzyl, H. An information theoretic estimator for the linear model. Working paper 2008. [Google Scholar]
  15. Due to the small size of the data and the large proportion of censored observations, none of the maximum likelihood methods would converge with all standard software.
  16. For discussion of the data, detailed analyses and discussion of the different estimators as well as a detailed discussion of the AP [5] application, see GMP [6]. The 2-step and AP estimates are taken from that paper, Table 8.
  17. To keep the Table simple, and since these specific results are not of interest here, they are not presented.

Appendix

Proof of Theorem 2.1.
Proof: 
Recall (11)
$$ \sup_{\eta} \left\{ \sup_{(P_1, P_2)} \left\{ S(P_1, Q_1) + S(P_2, Q_2) \;\middle|\; y_2 = B_2 E_{P_2}[\xi_2],\; \eta_1 = B_1 E_{P_1}[\xi_1],\; \bar{\eta}_2 = \bar{B}_2 E_{P_2}[\xi_2],\; \bar{\eta}_1 = \bar{B}_1 E_{P_1}[\xi_1] \right\} \;\middle|\; y_2 > \eta_1,\; \bar{\eta}_2 \le \bar{\eta}_1 \right\} \qquad (A.1) $$
First, note that the inner problem in (A.1) is equivalent to
$$ \ell(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) = \ln Z(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) + \langle \lambda_1, \eta_1 \rangle + \langle \bar{\lambda}_1, \bar{\eta}_1 \rangle + \langle \lambda_2, y_2 \rangle + \langle \bar{\lambda}_2, \bar{\eta}_2 \rangle $$
where λ i and λ ¯ i (i=1, 2) are the Lagrange multipliers associated with the data (9) and Z is the normalization factor of dP1dP2:
$$ \begin{aligned} Z(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) &= \int_{C_1 \times C_2} e^{-\langle \lambda_1, B_1 \xi_1 \rangle} e^{-\langle \bar{\lambda}_1, \bar{B}_1 \xi_1 \rangle} e^{-\langle \lambda_2, B_2 \xi_2 \rangle} e^{-\langle \bar{\lambda}_2, \bar{B}_2 \xi_2 \rangle}\, dQ_1(\xi_1)\, dQ_2(\xi_2) \\ &= \int_{C_1} e^{-\langle \lambda_1, B_1 \xi_1 \rangle} e^{-\langle \bar{\lambda}_1, \bar{B}_1 \xi_1 \rangle}\, dQ_1(\xi_1) \int_{C_2} e^{-\langle \lambda_2, B_2 \xi_2 \rangle} e^{-\langle \bar{\lambda}_2, \bar{B}_2 \xi_2 \rangle}\, dQ_2(\xi_2) \\ &= Z_1(\lambda_1, \bar{\lambda}_1)\, Z_2(\lambda_2, \bar{\lambda}_2) \end{aligned} $$
Note that the inner sup is over (P1, P2), and the outer sup is over the η’s in the region indicated within the { } . The basic idea here is to replace the inequalities appearing in problem (10) with equalities. Next, the dual-unconstrained model of this inner primal problem is the solution to
$$ \inf_{\lambda} \left\{ \ell(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) \;\middle|\; \lambda_i \in \mathbb{R}^{|J|} \text{ and } \bar{\lambda}_i \in \mathbb{R}^{N - |J|} \text{ for } i = 1, 2 \right\} $$
where $|J|$ is the number of observations for which $y_{2h}^{*} > y_{1h}^{*}$. With this step, the equivalent dual model of the primal problem (A.1) is
$$ \sup \left\{ \inf \left\{ \ln Z(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) + \langle \lambda_1, \eta_1 \rangle + \langle \bar{\lambda}_1, \bar{\eta}_1 \rangle + \langle \lambda_2, y_2 \rangle + \langle \bar{\lambda}_2, \bar{\eta}_2 \rangle \;\middle|\; \lambda_i \in \mathbb{R}^{|J|},\; \bar{\lambda}_i \in \mathbb{R}^{N - |J|},\; i = 1, 2 \right\} \;\middle|\; y_2 > \eta_1,\; \bar{\eta}_2 \le \bar{\eta}_1 \right\} \qquad (A.2) $$
Next, we rewrite the constraints of the outer problem. The constraint $y_2 > \eta_1$ is rewritten as $\eta_1 = y_2 - \zeta$, $\zeta > 0$, and the constraint $\bar{\eta}_2 \le \bar{\eta}_1$ is written as $\bar{\eta}_2 \in \mathbb{R}^{N - |J|}$, $\bar{\eta}_1 = \bar{\eta}_2 + \bar{\zeta}$, $\bar{\zeta} \ge 0$. Model (A.2) becomes
$$ \sup \left\{ \inf \left\{ \ln Z(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) + \langle \lambda_1 + \lambda_2, y_2 \rangle - \langle \lambda_1, \zeta \rangle + \langle \bar{\lambda}_1 + \bar{\lambda}_2, \bar{\eta}_1 \rangle - \langle \bar{\lambda}_2, \bar{\zeta} \rangle \;\middle|\; \lambda_i \in \mathbb{R}^{|J|},\; \bar{\lambda}_i \in \mathbb{R}^{N - |J|} \right\} \;\middle|\; \zeta > 0,\; \bar{\eta}_1,\; \bar{\zeta} \ge 0 \right\} $$
Next, exchanging the sup and the inf operations we get
$$ \inf \left\{ \ln Z(\lambda_1, \bar{\lambda}_1, \lambda_2, \bar{\lambda}_2) + \langle \lambda_1 + \lambda_2, y_2 \rangle + \sup \left\{ -\langle \lambda_1, \zeta \rangle + \langle \bar{\lambda}_1 + \bar{\lambda}_2, \bar{\eta}_1 \rangle - \langle \bar{\lambda}_2, \bar{\zeta} \rangle \;\middle|\; \zeta > 0,\; \bar{\eta}_1,\; \bar{\zeta} \ge 0 \right\} \;\middle|\; \lambda_i \in \mathbb{R}^{|J|},\; \bar{\lambda}_i \in \mathbb{R}^{N - |J|} \right\} $$
To compute the inner supremum, note that
$$ \sup_{\zeta > 0} \langle \lambda_1, -\zeta \rangle = \begin{cases} 0 & \text{if } \lambda_1 \in \mathbb{R}_{+}^{|J|} \\ +\infty & \text{otherwise} \end{cases} = I_{\mathbb{R}_{+}^{|J|}}(\lambda_1) $$
where $\mathbb{R}_{+}^{d}$ denotes the non-negative orthant in $\mathbb{R}^{d}$; the right-hand side of the last identity is usually written as $I_{\mathbb{R}_{+}^{d}}(\lambda)$, where $I_A(x)$ is defined as $I_A(x) = 0$ if $x \in A$ and $I_A(x) = +\infty$ if $x \notin A$. The difference between the first and the second problem is that in the first the supremum is attained only when $\lambda = 0$, whereas in the second it is attained on the boundary of $\mathbb{R}_{+}^{d}$. Similarly,
$$ \sup_{\bar{\zeta} \ge 0} \langle \bar{\lambda}_2, -\bar{\zeta} \rangle = I_{\mathbb{R}_{+}^{N - |J|}}(\bar{\lambda}_2) $$
Noting that $\bar{\lambda}_1 = \bar{\lambda}_2 \equiv \bar{\lambda}$, our MinMax problem (A.2) reduces to finding
$$ \inf \left\{ \ln Z(\lambda_1, \bar{\lambda}, \lambda_2, \bar{\lambda}) + \langle \lambda_1 + \lambda_2, y_2 \rangle + I_{\mathbb{R}_{+}^{|J|}}(\lambda_1) + I_{\mathbb{R}_{+}^{N - |J|}}(\bar{\lambda}) \;\middle|\; \lambda_1, \lambda_2 \in \mathbb{R}^{|J|},\; \bar{\lambda} \in \mathbb{R}^{N - |J|} \right\} $$
which simplifies to
$$ \inf \left\{ \ln Z(\lambda_1, \bar{\lambda}, \lambda_2, \bar{\lambda}) + \langle \lambda_1 + \lambda_2, y_2 \rangle \;\middle|\; \lambda_1 \in \mathbb{R}_{+}^{|J|},\; \lambda_2 \in \mathbb{R}^{|J|} \text{ and } \bar{\lambda} \in \mathbb{R}_{+}^{N - |J|} \right\} $$
which is (12).
                         ☐