
Efficiency of Average Treatment Effect Estimation When the True Propensity Is Parametric

Department of Economics, Michigan State University, 486 W. Circle Dr., East Lansing, MI 48824, USA
Econometrics 2019, 7(2), 25; https://doi.org/10.3390/econometrics7020025
Received: 8 March 2019 / Revised: 15 May 2019 / Accepted: 28 May 2019 / Published: 31 May 2019

Abstract

It is well known that efficient estimation of average treatment effects can be obtained by the method of inverse propensity score weighting, using the estimated propensity score, even when the true one is known. When the true propensity score is unknown but parametric, it is conjectured from the literature that we still need nonparametric propensity score estimation to achieve the efficiency. We formalize this argument and further identify the source of the efficiency loss arising from parametric estimation of the propensity score. We also provide an intuition of why this overfitting is necessary. Our finding suggests that, even when we know that the true propensity score belongs to a parametric class, we still need to estimate the propensity score by a nonparametric method in applications.
Keywords: average treatment effect; efficiency bound; propensity score; sieve MLE

1. Introduction

Estimating the treatment effects of a binary treatment or a policy has been one of the most important topics in evaluation studies. In estimating treatment effects, a subject's selection into a treatment may contaminate the estimate, and two approaches are popularly used in the literature to remove the bias due to this sample selection. One is the regression-based control function method (see, e.g., Rubin (1973); Hahn (1998); and Imbens (2004)) and the other is the matching method (see, e.g., Rubin and Thomas (1996); Heckman et al. (1998); and Abadie and Imbens (2002, 2006)). When there are many covariates or pre-treatment variables that govern this selection, the matching method may be less practical. In this case, following Rosenbaum and Rubin (1983, 1984), we can control for the sample selection bias using the propensity score, which reduces the dimensionality problem.
Although adjusting for sub-population differences in the propensity score removes the bias, the resulting treatment effect estimators may not all be efficient. Hahn (1998) shows that, using a nonparametric series estimation of the propensity score, we can achieve the efficiency bound. Hirano et al. (2003) also develop an efficient estimator of average treatment effects using the logit series estimation of the propensity score, overcoming some practical limitations of Hahn (1998)'s series estimator (see also Li et al. (2009)).
Based on these studies, empirical researchers are encouraged to estimate treatment effects using the imputation method of inverse weighting by the estimated propensity score. However, a nonparametric method of estimating the propensity score may require a large data set, especially when covariates or pre-treatment variables are high dimensional. For this reason, many empirical researchers estimate the propensity score parametrically using the probit or logit specification, on the view that these parametric models are still good approximations to the true propensity score. In the statistics literature, Rosenbaum (1987); Rubin and Thomas (1996); and Robins et al. (1995) also show that using parametric estimates of the propensity score can improve the efficiency of the treatment effect estimation.
However, from the existing literature (Hahn (1998); Hirano et al. (2003); Kang and Schafer (2007); and Tan (2007)), we can infer that, even when the true propensity score is parametric and the parametric estimator is consistent, we still need to estimate the propensity score nonparametrically to achieve full efficiency. The first contribution of this paper is to formalize this efficiency argument and confirm that parametric estimation of the propensity score yields an inefficient estimator of the average treatment effect if some or all of the covariates are continuous.$^{1}$ The second and more interesting contribution is to identify the source of this inefficiency and formally characterize the efficiency loss due to parametric estimation of the propensity score.
Regarding our results, we find that a nonparametric sieve estimation of the propensity score has two roles in the efficient estimation of average treatment effects. First, it approximates the true propensity score, and second, it approximates the conditional expectation of the derivative of the moment function for the treatment effect with respect to the propensity score. We show that parametric estimation of the propensity score accomplishes the first role when the true propensity score is indeed parametric, but cannot achieve the second role if some of the covariates are continuous. In other words, consistent estimation of the propensity score alone is not enough to obtain efficient estimation of average treatment effects.
This finding also suggests that the performance of the treatment effect estimator in finite samples may depend not only on how precisely the propensity score is estimated, but also on how well the conditional expectation of the derivative of the moment condition is approximated by the same sieve basis functions or regressors used to estimate the propensity score. We note that the literature has focused on the former, while the latter has been somewhat ignored. Moreover, because these two objects are quite different in nature, a sieve approximation targeted solely at the propensity score does not necessarily approximate the conditional expectation of the derivative of the moment function well in finite samples.
The rest of the paper is organized as follows. Section 2 outlines the average treatment effect estimation using the inverse propensity score weighting. Section 3 examines the role of the nonparametric propensity score estimation when the true one is parametric. We also provide an illustrative example. We conclude in Section 4. Some technical details are provided in Appendix A.

2. Estimation of Average Treatment Effect

In this section, we review estimation of average treatment effects using inverse propensity score weighting in a standard setting. For this purpose, suppose we have a random sample of $n$ individuals, some of whom received a treatment while others did not. Let $T_i$ denote the treatment status, with $T_i = 1$ if individual $i$ receives the treatment and $T_i = 0$ otherwise. Using the same notation as Rubin (1973), denote by $Y_i(0)$ the potential outcome for individual $i$ under control and by $Y_i(1)$ the outcome under treatment. We observe $T_i$, $X_i$, and $Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)$, where $X_i$ is a vector of observable covariates of the individual. Here, we have a fundamental missing data problem, since for each individual we observe only either $Y_i(1)$ or $Y_i(0)$, but not both, depending on the treatment status.
The parameter of interest is the population average treatment effect defined as
$$\tau^* = E[Y(1) - Y(0)].$$
If we had both $Y_i(1)$ and $Y_i(0)$ for all individuals, we could simply estimate the average treatment effect using its sample analogue, but this is not feasible due to the missing data problem. One important way to circumvent this missing data problem in the literature is the imputation method based on the propensity score, motivated by Rosenbaum and Rubin (1983, 1984). The propensity score of an individual whose observable characteristics $X_i$ equal $x$ is defined by
$$p^*(x) = \Pr(T_i = 1 \mid X_i = x) \quad \text{or} \quad E[T_i \mid X_i = x].$$
According to Rosenbaum and Rubin, if (i) there exist covariates $X_i$ such that the treatment status $T_i$ is ignorable given $X_i$ and (ii) $0 < p^*(x) < 1$ for all $x \in \mathcal{X} \equiv \mathrm{Supp}(X)$, then $T_i$ and $(Y_i(0), Y_i(1))$ are independent of each other given the propensity score. This implies that
$$\tau^* = E\Big[ E[Y_i \mid T_i = 1, p^*(X_i)] - E[Y_i \mid T_i = 0, p^*(X_i)] \Big]. \tag{1}$$
This allows us to estimate the treatment effect using a sample analogue of Equation (1). To be precise, define
$$\hat{\beta}_1(x) = \frac{\hat{E}[T_i Y_i \mid X_i = x]}{\hat{E}[T_i \mid X_i = x]} \quad \text{and} \quad \hat{\beta}_0(x) = \frac{\hat{E}[(1 - T_i) Y_i \mid X_i = x]}{1 - \hat{E}[T_i \mid X_i = x]},$$
where the $\hat{E}[\cdot \mid \cdot]$'s denote suitable conditional mean function estimators. Then, we can construct complete data using the imputation such that $\hat{Y}_i(1) \equiv T_i Y_i + (1 - T_i)\hat{\beta}_1(X_i)$ and $\hat{Y}_i(0) \equiv T_i \hat{\beta}_0(X_i) + (1 - T_i) Y_i(0)$, and we can estimate the average treatment effect as $\hat{\tau}_1 = \frac{1}{n}\sum_{i=1}^{n} (\hat{Y}_i(1) - \hat{Y}_i(0))$ or alternatively as $\hat{\tau}_2 = \frac{1}{n}\sum_{i=1}^{n} (\hat{\beta}_1(X_i) - \hat{\beta}_0(X_i))$. These nonparametric imputation methods were proposed by Hahn (1998), who further shows that these treatment effect estimators achieve the semiparametric efficiency bound.$^{2}$
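As a minimal sketch of the two imputation estimators, assume a single discrete covariate so that the conditional means can be estimated by cell averages. Hahn (1998) uses series estimators; the cell-average choice and the function name `imputation_ate_discrete` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def imputation_ate_discrete(y, t, x):
    """Imputation estimators tau_1 and tau_2 with cell-average conditional means.

    Assumes x is discrete and every cell contains both treated and control units.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    x = np.asarray(x)
    beta1 = np.empty_like(y)
    beta0 = np.empty_like(y)
    for v in np.unique(x):
        cell = x == v
        e_t = np.mean(t[cell])                       # E^[T | X = v]
        beta1[cell] = np.mean(t[cell] * y[cell]) / e_t
        beta0[cell] = np.mean((1 - t[cell]) * y[cell]) / (1 - e_t)
    y1_hat = t * y + (1 - t) * beta1                 # imputed Y(1)
    y0_hat = t * beta0 + (1 - t) * y                 # imputed Y(0)
    tau1 = float(np.mean(y1_hat - y0_hat))
    tau2 = float(np.mean(beta1 - beta0))
    return tau1, tau2
```

With cell averages the two estimators coincide, because within each cell both reduce to the difference between the treated and control sample means weighted by the cell share.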
Hirano et al. (2003) propose an alternative estimator for which the propensity score is estimated using a logit series estimation, and the propensity score is given by $p^*(x) = \frac{\exp(h_0(x))}{1 + \exp(h_0(x))}$ for some unknown function $h_0(x)$. In the logit series estimation, we approximate $h_0(x)$ using linear sieves, and the estimated propensity score is given by $\hat{p}_L(x) = \frac{\exp(\hat{h}_n(x))}{1 + \exp(\hat{h}_n(x))}$, where $\hat{h}_n(x)$ denotes the sieve Maximum Likelihood (ML) estimator.$^{3}$ The proposed treatment effect estimator is given by $\hat{\tau}_3 = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_i Y_i}{\hat{p}_L(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{p}_L(X_i)}\right]$. This estimator also achieves the semiparametric efficiency bound, and improves over Hahn (1998)'s estimator in two practical ways. First, we do not need to estimate the conditional mean functions $\hat{E}[T_i Y_i \mid X_i]$ and $\hat{E}[(1 - T_i) Y_i \mid X_i]$. Second, the estimated propensity score lies between zero and one by construction.
Estimation of average treatment effects using the estimated propensity score with a general link function that includes the logit or probit specification was proposed by Kim (2013). We will use this general setting to argue that the inefficiency of the treatment effect estimate with the estimated parametric propensity score is not specific to a particular functional form assumption like logit or probit. To obtain a sieve ML estimator for the propensity score with a general link function, we assume the true function $h_0$ belongs to a class of bounded and smooth functions such as a Hölder ball, and let $p^*(x) = F(h_0(x))$ for a known link function $F(\cdot)$.$^{4}$
Then, based on a triangular sequence of orthonormal basis functions such as polynomials or splines, we construct a tensor-product sieve space $\mathcal{H}_n$ as
$$\mathcal{H}_n = \left\{ h(X) \;\middle|\; h(X) = R^{K(n)}(X)'\pi \ \text{for all } \pi \text{ satisfying } \|h\|_{\Lambda^{\gamma_1}} \le c_1 \right\},$$
where $\|\cdot\|_{\Lambda^{\gamma_1}}$ denotes a Hölder norm, and we let $K(n) \to \infty$ as $n \to \infty$. The sieve ML estimator is obtained by solving
$$\hat{h}_n = \operatorname*{argmax}_{h \in \mathcal{H}_n} \frac{1}{n}\sum_{i=1}^{n} \log\left\{ F(h(X_i))^{T_i} \left[1 - F(h(X_i))\right]^{1 - T_i} \right\} \tag{2}$$
or equivalently $\hat{\pi}_K = \operatorname*{argmax}_{\pi:\, R^K(X)'\pi \in \mathcal{H}_n} \frac{1}{n}\sum_{i=1}^{n} \log\left\{ F(R^K(X_i)'\pi)^{T_i} \left[1 - F(R^K(X_i)'\pi)\right]^{1 - T_i} \right\}$ such that $\hat{h}_n(x) = R^K(x)'\hat{\pi}_K$, and the resulting propensity score estimator becomes $\hat{p}(x) = F(\hat{h}_n(x))$.
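The sieve ML step can be sketched for the logit link as follows. This is a simplified illustration under stated assumptions: a scalar covariate, a raw (not orthonormalized) polynomial basis, plain Newton iterations, and no enforcement of the Hölder-norm constraint that defines the sieve space. The function names are hypothetical.

```python
import numpy as np

def poly_basis(x, n_terms):
    """Polynomial basis R^K(x) = (1, x, ..., x^(K-1)) for a scalar covariate."""
    return np.vander(np.asarray(x, dtype=float), N=n_terms, increasing=True)

def logit_series_fit(x, t, n_terms, n_iter=50, tol=1e-10):
    """Maximize the Bernoulli log-likelihood with F = logistic CDF over pi."""
    R = poly_basis(x, n_terms)
    t = np.asarray(t, dtype=float)
    pi = np.zeros(n_terms)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(R @ pi)))
        score = R.T @ (t - p)                          # gradient of the log-likelihood
        info = (R * (p * (1.0 - p))[:, None]).T @ R    # negative Hessian
        step = np.linalg.solve(info, score)
        pi = pi + step
        if np.max(np.abs(step)) < tol:
            break
    return pi

def fitted_propensity(x, pi):
    """p^(x) = F(R^K(x)' pi^) for the logit link."""
    return 1.0 / (1.0 + np.exp(-(poly_basis(x, len(pi)) @ pi)))
```

In practice the number of terms would grow with the sample size; here it is simply an argument, and choosing it is left outside the sketch.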
Finally, using the estimated propensity score, we estimate the average treatment effect as
$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n} \left[ \frac{Y_i T_i}{\hat{p}(X_i)} - \frac{Y_i (1 - T_i)}{1 - \hat{p}(X_i)} \right].$$
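The inverse propensity score weighting estimator is a one-line computation once fitted propensity scores are available. The sketch below assumes the scores are supplied as an array strictly between zero and one; the function name `ipw_ate` is illustrative.

```python
import numpy as np

def ipw_ate(y, t, p_hat):
    """Inverse propensity score weighting estimate of the average treatment effect."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    if np.any((p_hat <= 0) | (p_hat >= 1)):
        raise ValueError("propensity scores must lie strictly between 0 and 1")
    return float(np.mean(y * t / p_hat - y * (1 - t) / (1 - p_hat)))
```

For example, `ipw_ate([1, 2, 3, 4], [1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])` averages the terms 2, -4, 6, -8.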
Define $\mu_t(x) \equiv E[Y(t) \mid X = x]$ and $\sigma_t^2(x) \equiv \mathrm{Var}[Y(t) \mid X = x]$. For the general class of $F(\cdot)$, as long as the function is continuous and monotonic in $h$, Kim (2013) shows that this treatment effect estimator achieves the semiparametric efficiency bound such that
$$\sqrt{n}(\hat{\tau} - \tau^*) \xrightarrow{d} N(0, V), \tag{3}$$
where $\tau(X) = E[Y(1) - Y(0) \mid X]$ and
$$V = E\left[ \left( \frac{Y T}{p^*(X)} - \frac{Y (1 - T)}{1 - p^*(X)} - \tau^* - \left( \frac{\mu_1(X)}{p^*(X)} + \frac{\mu_0(X)}{1 - p^*(X)} \right)(T - p^*(X)) \right)^2 \right] = E\left[ (\tau(X) - \tau^*)^2 + \frac{\sigma_1^2(X)}{p^*(X)} + \frac{\sigma_0^2(X)}{1 - p^*(X)} \right],$$
which is identical to the efficiency bound derived by Hahn (1998). This efficiency result with the general link function is obtained, as in Hirano et al. (2003), following the influence function approach of Newey (1994). To see this, define
$$\begin{aligned} \psi(Z_i, \tau, p(X_i)) &= \frac{Y_i T_i}{p(X_i)} - \frac{Y_i (1 - T_i)}{1 - p(X_i)} - \tau, \\ \psi_p(Z_i, \tau, p(X_i)) &= -\left( \frac{Y_i T_i}{p(X_i)^2} + \frac{Y_i (1 - T_i)}{(1 - p(X_i))^2} \right), \\ s_p(X_i) &= E[\psi_p(Z_i, \tau^*, p^*(X_i)) \mid X_i], \end{aligned} \tag{4}$$
where $\psi_p(\cdot)$ denotes the derivative of the moment function for the treatment effect, $\psi(\cdot)$, with respect to the propensity score $p(\cdot)$, and $s_p(\cdot)$ denotes its conditional expectation at the true parameter values. The asymptotic variance result of Equation (3) is obtained by showing that the estimator is asymptotically linear with an influence function that decomposes into two terms:
$$\sqrt{n}(\hat{\tau} - \tau^*) - \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left\{ \psi(Z_i, \tau^*, p^*(X_i)) + s_p(X_i)(T_i - p^*(X_i)) \right\} = o_p(1). \tag{5}$$
The first term in Equation (5) is the influence function when we know the true propensity score $p^*(\cdot)$, and the second term represents the contribution of the estimated propensity score to the asymptotic distribution of $\hat{\tau}$. It follows that the asymptotic variance $V$ in Equation (3) is equal to
$$V = \mathrm{Var}\left( \psi(Z_i, \tau^*, p^*(X_i)) + s_p(X_i)(T_i - p^*(X_i)) \right),$$
which yields the result.
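To make the two-term influence function concrete, the following sketch evaluates both the unadjusted moment function and the adjusted influence values on a sample. It assumes the true propensity score and conditional means are supplied by the user, which is an illustrative simplification; the function name is hypothetical.

```python
import numpy as np

def influence_values(y, t, p, mu1, mu0, tau):
    """Return (psi, psi + s_p * (T - p)) evaluated at the supplied p, mu1, mu0, tau."""
    y, t, p, mu1, mu0 = (np.asarray(a, dtype=float) for a in (y, t, p, mu1, mu0))
    psi = y * t / p - y * (1 - t) / (1 - p) - tau
    s_p = -(mu1 / p + mu0 / (1 - p))      # conditional mean of d(psi)/dp given X
    return psi, psi + s_p * (t - p)
```

A degenerate case shows the role of the second term. If $Y(1) = 1$ and $Y(0) = 0$ for everyone and $p^* = 0.5$, the unadjusted values are $\pm 1$ while the adjusted values are identically zero, matching $V = 0$ from the variance formula.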

3. Efficient Estimation When the True Propensity Score Is Parametric

As discussed in the previous section, the efficiency of the treatment effect estimator depends on whether the estimator has the asymptotically linear representation in Equation (5). When the propensity score is estimated using a nonparametric sieve ML, we achieve this representation and hence the efficiency bound. Here, we ask whether we can achieve this asymptotically linear representation if the true propensity score is parametric and is estimated under the correct parametric specification. We confirm that, in this case, the semiparametric efficiency bound is not achieved, as can be inferred from the existing literature. This suggests that, even though we know the true propensity score belongs to a parametric class, we still need to estimate the propensity score by a nonparametric method.
Our intuition behind this result is that the nonparametric sieve estimation of the propensity score plays two roles in the estimation of the treatment effect. First, it approximates the true propensity score, and second, it approximates the conditional expectation of the derivative of the moment condition for the treatment effect with respect to the propensity score. For the purpose of illustration, without loss of generality, suppose $p^*(x) = \Phi(x'\pi_0)$, where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function (CDF), so the true propensity score is a probit model. We can then estimate $\pi_0$ by MLE, denoted by $\hat{\pi}$, and obtain the parametric convergence rate such that $\sqrt{n}\,\|\hat{\pi} - \pi_0\| = O_p(1)$ and hence $\sup_{x \in \mathcal{X}} |\hat{p}(x) - p^*(x)| = O_p(n^{-1/2})$ with $\hat{p}(x) = \Phi(x'\hat{\pi})$.
For ease of notation without losing the main idea, we consider the special case that $Y(0) = 0$ with probability one. Define $\beta_0 = E[Y(1)]$ as the average outcome of interest, where $Y(1)$ is missing at random conditional on the covariates $X$. We estimate the average outcome as $\hat{\beta} = \frac{1}{n}\sum_{i=1}^{n} \frac{Y_i T_i}{\hat{p}(X_i)}$. For this estimator, following Equation (5), if we can obtain the asymptotically linear representation
$$\sqrt{n}(\hat{\beta} - \beta_0) - \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left\{ \frac{Y_i T_i}{p^*(X_i)} - \beta_0 - \frac{\mu_1(X_i)}{p^*(X_i)}(T_i - p^*(X_i)) \right\} = o_p(1), \tag{6}$$
then we will achieve the efficiency bound. To see whether this asymptotically linear representation is attainable with parametric estimation of the propensity score, we decompose $\sqrt{n}(\hat{\beta} - \beta_0)$ as
\begin{align}
\sqrt{n}(\hat{\beta} - \beta_0) ={}& \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[ \frac{T_i Y_i}{\hat{p}(X_i)} - \frac{T_i Y_i}{p^*(X_i)} + \frac{T_i Y_i}{p^*(X_i)^2}(\hat{p}(X_i) - p^*(X_i)) \right] \tag{7} \\
&+ \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[ -\frac{T_i Y_i}{p^*(X_i)^2}(\hat{p}(X_i) - p^*(X_i)) + \int_{\mathcal{X}} \frac{\mu_1(x)}{p^*(x)}(\hat{p}(x) - p^*(x))\, dF_0(x) \right] \tag{8} \\
&- \sqrt{n}\int_{\mathcal{X}} \frac{\mu_1(x)}{p^*(x)}(\hat{p}(x) - p^*(x))\, dF_0(x) - \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \delta^*(X_i) \frac{T_i - p^*(X_i)}{\sqrt{p^*(X_i)(1 - p^*(X_i))}} \tag{9} \\
&+ \frac{1}{\sqrt{n}}\sum_{i=1}^{n} (\delta^*(X_i) - \delta_0(X_i)) \frac{T_i - p^*(X_i)}{\sqrt{p^*(X_i)(1 - p^*(X_i))}} \tag{10} \\
&+ \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[ \frac{T_i Y_i}{p^*(X_i)} - \beta_0 + \delta_0(X_i) \frac{T_i - p^*(X_i)}{\sqrt{p^*(X_i)(1 - p^*(X_i))}} \right], \tag{11}
\end{align}
where $p^*(x) = \Phi(x'\pi_0)$, $\hat{p}(x) = \Phi(x'\hat{\pi})$, $F_0(\cdot)$ denotes the distribution function of $X$, $W = E\left[ \frac{\phi(X_i'\pi_0)^2}{p^*(X_i)(1 - p^*(X_i))} X_i X_i' \right]$ with $\phi(\cdot)$ being the standard normal density function, and
$$\delta^*(x) = -\int_{\mathcal{X}} \frac{\mu_1(z)}{p^*(z)} \phi(z'\pi_0) z'\, dF_0(z)\; W^{-1} \frac{\phi(x'\pi_0)\, x}{\sqrt{p^*(x)(1 - p^*(x))}}, \qquad \delta_0(x) = -\frac{\mu_1(x)}{p^*(x)} \sqrt{p^*(x)(1 - p^*(x))}.$$
If we can show that all terms (7)–(10) are $o_p(1)$, we then obtain the desired result of Equation (6). Following the steps in Hirano et al. (2003) or Kim (2013), it is straightforward to bound the terms (7)–(9) as $o_p(1)$. We focus on the term (10), from which we derive our main finding.
By inspecting $\delta^*(x)$ and $\delta_0(x)$, we see that $\delta^*(x)$ is the linear projection of $\delta_0(x)$ on $\frac{\phi(x'\pi_0)\, x}{\sqrt{p^*(x)(1 - p^*(x))}}$. In other words,
$$\delta^*(x) - \delta_0(x) = \theta_0 \frac{\phi(x'\pi_0)\, x}{\sqrt{p^*(x)(1 - p^*(x))}} - \delta_0(x), \tag{12}$$
where the projection coefficient is given by $\theta_0 \equiv -\int_{\mathcal{X}} \frac{\mu_1(z)}{p^*(z)} \phi(z'\pi_0) z'\, dF_0(z)\; W^{-1}$. Therefore, unless $\delta_0(x)$ is indeed linear in $\frac{\phi(x'\pi_0)\, x}{\sqrt{p^*(x)(1 - p^*(x))}}$,$^{5}$ we will have $\inf_{x \in \mathcal{X}} |\delta^*(x) - \delta_0(x)| > C > 0$ for some positive constant $C$. It follows that
$$\mathrm{Var}\left( (\delta^*(X_i) - \delta_0(X_i)) \frac{T_i - p^*(X_i)}{\sqrt{p^*(X_i)(1 - p^*(X_i))}} \right) = E\left[ (\delta^*(X_i) - \delta_0(X_i))^2 \right] > C. \tag{13}$$
Therefore, the term (10) remains $O_p(1)$ and contributes to the asymptotic distribution of the treatment effect estimator. In other words, the asymptotically linear representation of Equation (6) is not obtained with parametric estimation of the propensity score in general, even when the true propensity score is parametric. In the Appendix, we derive the asymptotic variance of the treatment effect estimator with the estimated parametric propensity score, and characterize the efficiency loss due to parametric estimation of the propensity score. In particular, we show that this efficiency loss is exactly given by Equation (13).
The key difference is that, in the treatment effect estimation using the nonparametric sieve estimation of the propensity score like Equation (2), it can be shown that when $\delta_0(x)$ is $t$-times continuously differentiable, we have
$$\sup_{x \in \mathcal{X}} |\delta^*(x) - \delta_0(x)| = O(K^{-t/d_x}), \tag{14}$$
where $K$ denotes the number of approximating sieve terms used in $\delta^*(x)$ (see Hirano et al. (2003) or Kim (2013)). Therefore, we can bound the term (10) as $o_p(1)$ for some large enough $K$. This is because Equation (12) becomes
$$\delta^*(x) - \delta_0(x) = \theta_0 \frac{\phi(R^K(x)'\pi_0)\, R^K(x)}{\sqrt{p^*(x)(1 - p^*(x))}} - \delta_0(x)$$
when the sieve estimation like Equation (2) is used to estimate the propensity score, where $R^K(x)$ denotes a vector of approximating basis functions, and hence the bound (14) is obtained from approximation results for sieves over a class of smooth functions such as a Hölder class (see, e.g., Chen (2007)). We note, however, that because $p^*(x)$ and $\delta_0(x)$ are quite different in nature, the sieve approximation used to estimate the propensity score does not necessarily approximate the latter well in finite samples, which may contribute to the inefficiency of the treatment effect estimation.
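The contrast between the parametric and the sieve case can be illustrated numerically by computing the projection residual $E[(\delta^*(X) - \delta_0(X))^2]$ for a probit propensity score. The sketch below assumes one standard normal covariate plus an intercept, $\mu_1(z) = 1 + z$, probit coefficients $(0.2, 0.5)$, and a Gauss-Hermite quadrature approximation of the expectation. All of these are illustrative choices, not values from the paper.

```python
import math
import numpy as np

def efficiency_loss(degree, pi0=(0.2, 0.5), n_nodes=30):
    """Quadrature approximation of E[(delta* - delta0)^2] for a polynomial basis.

    degree=1 uses the basis (1, z), i.e. the correctly specified probit index;
    higher degrees mimic a polynomial sieve that nests the true index.
    """
    z, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    w = w / w.sum()                                    # weights for Z ~ N(0, 1)
    u = pi0[0] + pi0[1] * z                            # true probit index
    p = np.array([0.5 * math.erfc(-v / math.sqrt(2)) for v in u])   # Phi(u)
    q = np.array([0.5 * math.erfc(v / math.sqrt(2)) for v in u])    # 1 - Phi(u)
    phi = np.exp(-0.5 * u ** 2) / math.sqrt(2 * math.pi)
    mu1 = 1.0 + z                                      # assumed E[Y(1) | Z = z]
    delta0 = -(mu1 / p) * np.sqrt(p * q)
    basis = np.vander(z, N=degree + 1, increasing=True)
    regressors = basis * (phi / np.sqrt(p * q))[:, None]
    sw = np.sqrt(w)
    # Weighted least squares gives the linear projection of delta0 on the regressors.
    coef, *_ = np.linalg.lstsq(regressors * sw[:, None], delta0 * sw, rcond=None)
    resid = regressors @ coef - delta0
    return float(np.sum(w * resid ** 2))
```

Calling `efficiency_loss(1)` gives the loss under the parametric probit fit in this design, while `[efficiency_loss(d) for d in (1, 2, 3, 4)]` shows how the residual shrinks as basis terms are added, since the projection spaces are nested.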
Finally, by inspecting Equations (4) and (5) for the case $Y(0) = 0$ along with Equation (10), note that the term $\delta_0(x)$ is related to the conditional expectation of the derivative of the moment function with respect to the propensity score. This implies that the nonparametric sieve estimation of the propensity score plays two roles in the estimation of the treatment effect. It first approximates the true propensity score, and second, it approximates the conditional expectation of the derivative of the moment condition with respect to the propensity score. The parametric propensity score estimation can accomplish the first role, if the true one is parametric, but cannot achieve the second when some of the covariates are continuous.
The asymptotic variance of a treatment effect estimator using parametric estimation of the propensity score can also depend on which parametric estimator is being used in practice. In this regard, given a parametric model of the propensity score, one can directly derive the asymptotic variance of the treatment effect estimator using the estimated parametric propensity score by combining two moments as a sequential estimation problem (see, e.g., Newey (1984)). The first moment is given by, e.g., the first order condition of the population ML objective function of the propensity score estimation such as the logit or probit ML, and the second moment is given by the moment condition to estimate the treatment effect, $E[\psi(Z_i, \tau, p(X_i))] = 0$, defined in Equation (4). We can then directly compare the asymptotic variance of the treatment effect estimator resulting from a specific parametric estimator of the propensity score to the semiparametric efficiency bound, instead of deriving the inefficiency term from Equation (13). This joint moments approach for parametric estimation of the propensity score also allows us to explicitly derive the efficiency loss due to a specific parametric estimator of the propensity score, and hence compare different parametric models of the propensity score in terms of efficiency.$^{6}$

3.1. Reconsidering the Simple Example in Hirano et al. (2003)

Hirano et al. (2003) present a simple example with a binary covariate, illustrating that, by weighting by the inverse of the propensity score estimate rather than the true one, we can improve the efficiency and indeed achieve the efficiency bound. Here, we reproduce the example and provide an intuition for why the efficiency bound is achieved in this case, in view of the results from the previous section. Consider a simple problem of estimating the population average of a variable $Y$, $\beta_0 = E[Y]$, given a random sample of size $n$ of the triple $(T_i, X_i, T_i \cdot Y_i)$. Therefore, $T_i$ and $X_i$ are observed for all units in the sample, but $Y_i$ is only observed if $T_i = 1$. Denote $\mu(x) = E[Y \mid X = x]$ and $\sigma^2(x) = \mathrm{Var}(Y \mid X = x)$.
Now let $N_{tx}$ denote the number of observations with $T_i = t$ and $X_i = x$, for $t, x \in \{0, 1\}$. Further assume that the true selection probability is $p(x) = \pi_0 + x(\pi_1 - \pi_0)$.$^{7}$ The estimated selection probability is then
$$\hat{p}(x) = \begin{cases} N_{10} / (N_{00} + N_{10}) & \text{if } x = 0, \\ N_{11} / (N_{01} + N_{11}) & \text{if } x = 1. \end{cases} \tag{15}$$
The true weights estimator is given by $\hat{\beta}_{tw} = \frac{1}{n}\sum_{i=1}^{n} \frac{Y_i T_i}{p(X_i)}$, while the estimated weights estimator is $\hat{\beta}_{ew} = \frac{1}{n}\sum_{i=1}^{n} \frac{Y_i T_i}{\hat{p}(X_i)}$. Hirano et al. (2003) show that $\hat{\beta}_{ew}$ is more efficient than $\hat{\beta}_{tw}$, and that $\hat{\beta}_{ew}$ achieves the efficiency bound. Interestingly, one can easily see that $\hat{p}(x)$ in Equation (15) is a nonparametric estimator of $p(x)$, and is also a parametric MLE of $p(x)$, since we can write $\hat{p}(x) = \hat{\pi}_0 + x(\hat{\pi}_1 - \hat{\pi}_0)$ with $\hat{\pi}_0 = N_{10} / (N_{00} + N_{10})$ and $\hat{\pi}_1 = N_{11} / (N_{01} + N_{11})$.
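A small Monte Carlo sketch can make the comparison between the true weights and estimated weights estimators tangible. The design below ($\pi_0 = 0.3$, $\pi_1 = 0.7$, $\Pr(X = 1) = 0.5$, $\mu(x) = 5 + x$, unit error variance, $n = 200$) is an illustrative assumption and is not taken from Hirano et al. (2003) or from this paper.

```python
import numpy as np

def simulate_true_vs_estimated_weights(n=200, reps=2000, seed=0):
    """Return replicated draws of (beta_tw, beta_ew) in a binary-covariate design."""
    rng = np.random.default_rng(seed)
    pi0, pi1 = 0.3, 0.7                               # assumed selection probabilities
    x = rng.random((reps, n)) < 0.5                   # binary covariate
    p_true = np.where(x, pi1, pi0)
    t = rng.random((reps, n)) < p_true                # treatment / observation indicator
    y = 5.0 + x + rng.normal(size=(reps, n))          # Y with mu(x) = 5 + x
    ty = t * y                                        # only T * Y is observed
    beta_tw = (ty / p_true).mean(axis=1)
    # Cell-frequency propensity estimates N_1x / (N_0x + N_1x), per replication.
    p1_hat = (t & x).sum(axis=1) / x.sum(axis=1)
    p0_hat = (t & ~x).sum(axis=1) / (~x).sum(axis=1)
    p_hat = np.where(x, p1_hat[:, None], p0_hat[:, None])
    beta_ew = (ty / p_hat).mean(axis=1)
    return beta_tw, beta_ew
```

Comparing `beta_tw.var()` with `beta_ew.var()` across replications shows the variance reduction from using the estimated weights in this design; both estimators are centered near $\beta_0 = 5.5$.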
In this example, for the corresponding terms of $\delta^*(x)$ and $\delta_0(x)$ in Equation (10), we show below that indeed
$$\delta^*(x) - \delta_0(x) = 0 \tag{16}$$
for all $x$, and hence the efficiency bound is achieved for the estimator $\hat{\beta}_{ew}$ because the asymptotically linear representation like Equation (6) is obtained (i.e., the term (14) is simply equal to zero in this case). To derive the result, consider the following terms corresponding to $\delta^*(x)$ and $\delta_0(x)$ in Equation (10) for the stochastic expansion of $\hat{\beta}_{ew}$. Let
$$\begin{aligned} \hat{W} &= \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbf{X}_i \mathbf{X}_i'}{p(X_i)(1 - p(X_i))}, \qquad W = E\left[ \frac{\mathbf{X}_i \mathbf{X}_i'}{p(X_i)(1 - p(X_i))} \right], \qquad \mathbf{X}_i \equiv (1 - X_i, X_i)', \\ \delta^*(x) &= -\sum_{z \in \{0,1\}} \frac{\mu(z)}{p(z)} (1 - z, z)\, q(z) \cdot W^{-1} \frac{(1 - x, x)'}{\sqrt{p(x)(1 - p(x))}}, \\ \delta_0(x) &= -\frac{\mu(x)}{p(x)} \sqrt{p(x)(1 - p(x))}, \end{aligned}$$
where $q(\cdot)$ denotes the probability mass of $X$.
By investigating $\delta^*(x)$ and $\delta_0(x)$, we can see that $\delta^*(x)$ is the linear projection of $\delta_0(x)$ on $\frac{(1 - x, x)'}{\sqrt{p(x)(1 - p(x))}}$. In other words, $\delta^*(x) = \frac{1 - x}{\sqrt{p(x)(1 - p(x))}}\theta_0 + \frac{x}{\sqrt{p(x)(1 - p(x))}}\theta_1$ for some constants $\theta_0$ and $\theta_1$ that are determined by the linear projection. Note that we have
$$\delta^*(x) - \delta_0(x) = \frac{(1 - x)\theta_0 + x\theta_1}{\sqrt{p(x)(1 - p(x))}} + \frac{\mu(x)}{p(x)} \sqrt{p(x)(1 - p(x))} = 0$$
if $\theta_x = -\mu(x)(1 - p(x))$ for $x \in \{0, 1\}$. Indeed, from the definition of $\delta^*(x)$, we find
$$\begin{aligned} (\theta_0, \theta_1) &= -\sum_{z \in \{0,1\}} \frac{\mu(z)}{p(z)} (1 - z, z)\, q(z) \cdot W^{-1} \\ &= -\left( \frac{\mu(0)}{p(0)} q(0),\ \frac{\mu(1)}{p(1)} q(1) \right) \begin{pmatrix} \frac{q(0)}{p(0)(1 - p(0))} & 0 \\ 0 & \frac{q(1)}{p(1)(1 - p(1))} \end{pmatrix}^{-1} \\ &= -\left( \frac{\mu(0)}{p(0)} q(0),\ \frac{\mu(1)}{p(1)} q(1) \right) \begin{pmatrix} \frac{p(0)(1 - p(0))}{q(0)} & 0 \\ 0 & \frac{p(1)(1 - p(1))}{q(1)} \end{pmatrix} \\ &= -\Big( \mu(0)(1 - p(0)),\ \mu(1)(1 - p(1)) \Big), \end{aligned}$$
and therefore the efficiency result follows.
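The algebra for the binary-covariate case can be checked numerically. The sketch below plugs in arbitrary illustrative values for $p(x)$, $q(x)$, and $\mu(x)$ (assumptions, not values from the paper) and verifies both the projection coefficients and the vanishing gap.

```python
import numpy as np

def binary_projection_gap(p, q, mu):
    """Compute (theta, delta* - delta0) in the binary-covariate example.

    p, q, mu are length-2 arrays indexed by x in {0, 1}.
    """
    p, q, mu = (np.asarray(a, dtype=float) for a in (p, q, mu))
    W = np.diag(q / (p * (1 - p)))                 # E[X X' / (p(1-p))] with X = (1-x, x)'
    theta = -(mu / p * q) @ np.linalg.inv(W)       # projection coefficients
    delta_star = theta / np.sqrt(p * (1 - p))      # delta*(x) picks out theta_x
    delta_0 = -(mu / p) * np.sqrt(p * (1 - p))
    return theta, delta_star - delta_0

theta, gap = binary_projection_gap(p=[0.3, 0.6], q=[0.4, 0.6], mu=[1.0, 2.0])
```

Here `theta` equals $-\mu(x)(1 - p(x))$ elementwise and `gap` is zero up to floating-point error, in line with the derivation.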
This example clearly illustrates why a condition like (16) is crucial to achieve the efficiency bound. It suggests that, when the covariates are multinomial, we can always satisfy a condition like (16), since the parametric ML estimation becomes equivalent to the nonparametric ML estimation. Therefore, we can achieve the efficiency bound. However, when the covariates or a subset of the covariates are continuous, parametric propensity score estimation cannot achieve the efficiency bound even though the true propensity score is parametric. This also suggests that the efficiency loss due to using the parametric propensity score estimator is attributable to the fact that some covariates are continuous.

3.2. Generalization to Estimating the Weighted Average Treatment Effect

We generalize the efficiency comparison between treatment effect estimators using nonparametric or parametric estimation of the propensity score to the weighted average treatment effect, $\tau^*_{wate}$, defined as
$$\tau^*_{wate} \equiv \frac{\int E[Y(1) - Y(0) \mid X = x]\, g(x)\, dF_0(x)}{\int g(x)\, dF_0(x)}$$
for a known weight function $g(x)$. We estimate $\tau^*_{wate}$ using the moment condition
$$\psi(Z_i, \tau_{wate}, p(X_i), g(X_i)) = g(X_i) \left( \frac{Y_i T_i}{p(X_i)} - \frac{Y_i (1 - T_i)}{1 - p(X_i)} - \tau_{wate} \right) \tag{17}$$
that yields the estimator
$$\hat{\tau}_{wate} = \sum_{i=1}^{n} g(X_i) \left[ \frac{Y_i T_i}{\hat{p}(X_i)} - \frac{Y_i (1 - T_i)}{1 - \hat{p}(X_i)} \right] \Big/ \sum_{i=1}^{n} g(X_i) \tag{18}$$
given an estimator of the propensity score $\hat{p}(x)$.
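The weighted estimator is a ratio of two sample sums. A minimal sketch, assuming the weights $g(X_i)$ and fitted propensity scores are passed in as arrays (the function name `ipw_wate` is illustrative):

```python
import numpy as np

def ipw_wate(y, t, p_hat, g):
    """Weighted average treatment effect estimate with known weights g(X_i)."""
    y, t, p_hat, g = (np.asarray(a, dtype=float) for a in (y, t, p_hat, g))
    contrast = y * t / p_hat - y * (1 - t) / (1 - p_hat)
    return float(np.sum(g * contrast) / np.sum(g))
```

Setting `g` to a vector of ones recovers the unweighted inverse propensity score weighting estimator.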
Because the function $g(x)$ is known and only appears as a weight in the moment function (17), following the same line of argument as for the average treatment effect, one can obtain the asymptotically linear representation of $\hat{\tau}_{wate}$ using the nonparametric propensity score estimator (2) in Equation (18) as
$$\sqrt{n}(\hat{\tau}_{wate} - \tau^*_{wate}) - \frac{1}{E[g(X)]} \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left\{ \psi(Z_i, \tau^*_{wate}, p^*(X_i), g(X_i)) + s_p(X_i)(T_i - p^*(X_i)) \right\} = o_p(1),$$
where $\psi_p(Z_i, \tau, p(X_i), g(X_i)) = -g(X_i) \left( \frac{Y_i T_i}{p(X_i)^2} + \frac{Y_i (1 - T_i)}{(1 - p(X_i))^2} \right)$ and $s_p(X_i) = E[\psi_p(\cdot) \mid X_i]$. Therefore, the semiparametric efficiency bound is achieved for the weighted average treatment effect estimator $\hat{\tau}_{wate}$ using the nonparametric propensity score estimator (see Hirano et al. (2003)). On the other hand, for the parametric propensity score estimator, we can derive an inefficiency term similar to Equation (13), the inefficiency term derived for the average treatment effect, as
$$\frac{1}{(E[g(X_i)])^2} E\left[ g(X_i)^2 (\delta^*(X_i) - \delta_0(X_i))^2 \right],$$
and therefore a similar inefficiency result holds for $\hat{\tau}_{wate}$ using the parametric propensity score estimator.
Note that, under the unconfoundedness assumption (Rosenbaum and Rubin (1983, 1984)), with the weight function $g(x)$ being equal to the true propensity score $p^*(x)$, the weighted average treatment effect becomes the average treatment effect for the treated,
$$\tau^*_{treated} \equiv E[Y(1) - Y(0) \mid T = 1].$$
Based on this equivalence, $\tau^*_{treated}$ can be estimated using the moment condition
$$\psi(Z_i, \tau_{treated}, p(X_i), p(X_i)) = p(X_i) \left( \frac{Y_i T_i}{p(X_i)} - \frac{Y_i (1 - T_i)}{1 - p(X_i)} - \tau_{treated} \right)$$
by replacing $g(x)$ with $p(x)$. However, an efficiency comparison between treatment effect estimators for $\tau^*_{treated}$ using the nonparametric or parametric propensity score estimator is more complicated because, in this case, the propensity score has two roles in the moment function. One is the inverse weighting to control for the self-selection and the other is the weighting function in place of $g(x)$. To see this, let $\hat{p}(x)$ and $\hat{p}^*(x)$ denote the nonparametric and the correctly specified parametric estimator of the propensity score, respectively. Then, we can consider three alternative estimators for the average treatment effect for the treated. The first uses the parametric propensity score $\hat{p}^*(x)$ everywhere and solves
$$0 = \sum_{i=1}^{n} \hat{p}^*(X_i) \cdot \left( \frac{Y_i T_i}{\hat{p}^*(X_i)} - \frac{Y_i (1 - T_i)}{1 - \hat{p}^*(X_i)} - \tau_{treated} \right), \tag{19}$$
the second uses the nonparametric propensity score $\hat{p}(x)$ everywhere and solves
$$0 = \sum_{i=1}^{n} \hat{p}(X_i) \cdot \left( \frac{Y_i T_i}{\hat{p}(X_i)} - \frac{Y_i (1 - T_i)}{1 - \hat{p}(X_i)} - \tau_{treated} \right), \tag{20}$$
and the last uses the parametric propensity score $\hat{p}^*(x)$ in place of $g(x)$ while using the nonparametric propensity score $\hat{p}(x)$ for the inverse weighting, and solves
$$0 = \sum_{i=1}^{n} \hat{p}^*(X_i) \cdot \left( \frac{Y_i T_i}{\hat{p}(X_i)} - \frac{Y_i (1 - T_i)}{1 - \hat{p}(X_i)} - \tau_{treated} \right). \tag{21}$$
From the efficiency argument of Hahn (1998) and Hirano et al. (2003) when the true propensity score is known, one can conjecture that the treatment effect estimator that solves Equation (21) will be more efficient than the estimators that solve Equations (19) and (20), respectively. However, in terms of efficiency, the two estimators solving Equations (19) and (20) (or other variations) cannot be uniformly ranked in general, and studying these alternative estimators is beyond the scope of this paper.

4. Conclusions

One can obtain efficient estimation of average treatment effects by the method of inverse propensity score weighting based on the estimated propensity score, rather than the true one, even when the true one is known. From the literature, we can infer that, even when the true propensity score is a parametric function, we still need to estimate the propensity score nonparametrically to achieve efficiency. We formalize this argument and further identify the source of the efficiency loss due to parametric estimation of the propensity score. We also provide an intuition as to why this overfitting is necessary. The idea is that the nonparametric estimation of the propensity score plays two roles in the treatment effect estimation. First, it replaces the true propensity score, and second, it approximates the conditional expectation of the derivative of the moment condition for the treatment effect with respect to the propensity score. The parametric propensity score estimation can achieve the first but cannot achieve the second when some of the covariates are continuous. This also suggests that the finite sample performance of the treatment effect estimator, using the imputation method based on the estimated propensity score, may depend not only on how precisely the propensity score is estimated, but also on how well the conditional expectation of the derivative of the moment condition is approximated by the same approximating sieves or regressors that are used to estimate the propensity score.

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Efficiency Loss Due to Parametric Estimation of the Propensity Score

For ease of notation, we assume $Y(0) = 0$ with probability one, and define $\beta_0 = E[Y(1)]$ as the average outcome of interest. In the main text, we have established the following (see footnote 8):
$$
\sqrt{n}(\hat{\beta} - \beta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \frac{T_i Y_i}{p^*(X_i)} - \beta_0 + \delta_0(X_i) \frac{T_i - p^*(X_i)}{\sqrt{p^*(X_i)(1 - p^*(X_i))}} + \left( \delta^*(X_i) - \delta_0(X_i) \right) \frac{T_i - p^*(X_i)}{\sqrt{p^*(X_i)(1 - p^*(X_i))}} \right\} + o_p(1),
$$
where $\delta^*(x) = -\int_{\mathcal{X}} \frac{\mu_1(z)}{p^*(z)} \phi(z^{\prime}\pi_0) z^{\prime} \, dF_0(z) \, W^{-1} \frac{\phi(x^{\prime}\pi_0) x}{\sqrt{p^*(x)(1 - p^*(x))}}$ and $\delta_0(x) = -\frac{\mu_1(x)}{p^*(x)} \sqrt{p^*(x)(1 - p^*(x))}$. Define
$$
\psi(y, t, x, \beta, p(\cdot)) = \frac{t \cdot y}{p(x)} - \beta, \quad \alpha_0(t, x) = -\frac{\mu_1(x)}{p^*(x)} \left( t - p^*(x) \right), \quad \text{and} \quad c^*(t, x) = \left( \delta^*(x) - \delta_0(x) \right) \frac{t - p^*(x)}{\sqrt{p^*(x)(1 - p^*(x))}}.
$$
The first term $\psi(\cdot)$ is the moment function that would be obtained if we did not estimate the propensity score $p^*(\cdot)$. The second and third terms, $\alpha_0(t, x)$ and $c^*(t, x)$, are the contributions to the asymptotic distribution of $\hat{\beta}$ from estimating $p^*(\cdot)$ with the parametric ML estimator. If we estimated the propensity score by nonparametric ML estimation, even when the true propensity score is parametric, the third term would vanish, since we could replace $\delta^*(x)$ with $\delta_0(x)$ without affecting the asymptotic distribution.
The asymptotic variance of $\hat{\beta}$ is equal to the variance of the sum of $\psi(Y, T, X, \beta_0, p^*(\cdot))$, $\alpha_0(T, X)$, and $c^*(T, X)$. For each component that potentially determines the asymptotic variance, we obtain:
$$
\begin{aligned}
E\left[ \psi(Y, T, X, \beta_0, p^*(\cdot))^2 \right] &= E\left[ \frac{\mu_1(X)^2}{p^*(X)} \right] + E\left[ \frac{\sigma_1^2(X)}{p^*(X)} \right] - \beta_0^2, \\
E\left[ \alpha_0(T, X)^2 \right] &= E\left[ \frac{\mu_1(X)^2}{p^*(X)} \right] - E\left[ \mu_1(X)^2 \right], \\
E\left[ c^*(T, X)^2 \right] &= E\left[ \left( \delta^*(X) - \delta_0(X) \right)^2 \right], \\
E\left[ \psi(Y, T, X, \beta_0, p^*(\cdot)) \, \alpha_0(T, X) \right] &= -E\left[ \frac{\mu_1(X)^2}{p^*(X)} \right] + E\left[ \mu_1(X)^2 \right], \\
E\left[ \left\{ \psi(Y, T, X, \beta_0, p^*(\cdot)) + \alpha_0(T, X) \right\} c^*(T, X) \right] &= 0.
\end{aligned}
$$
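As a sketch of where the second, third, and fifth moments come from, note that $E[(T - p^*(X))^2 \mid X] = p^*(X)(1 - p^*(X))$ for a binary treatment. The steps below rely only on this fact, on the definitions $\alpha_0(t, x) = -\frac{\mu_1(x)}{p^*(x)}(t - p^*(x))$ and $c^*(t, x) = (\delta^*(x) - \delta_0(x)) \frac{t - p^*(x)}{\sqrt{p^*(x)(1 - p^*(x))}}$, and on $E[TY \mid X] = \mu_1(X) p^*(X)$, which is assumed here to follow from the unconfoundedness condition of the main text. Arguments $(X)$ are suppressed inside the expectations for readability.

```latex
\begin{aligned}
E\left[\alpha_0(T,X)^2\right]
  &= E\left[\frac{\mu_1^2}{(p^*)^2}\, E\left[(T-p^*)^2 \mid X\right]\right]
   = E\left[\frac{\mu_1^2 (1-p^*)}{p^*}\right]
   = E\left[\frac{\mu_1^2}{p^*}\right] - E\left[\mu_1^2\right], \\
E\left[c^*(T,X)^2\right]
  &= E\left[(\delta^*-\delta_0)^2\, \frac{E\left[(T-p^*)^2 \mid X\right]}{p^*(1-p^*)}\right]
   = E\left[(\delta^*-\delta_0)^2\right], \\
E\left[(\psi+\alpha_0)(T-p^*) \mid X\right]
  &= \frac{(1-p^*)\,E[TY \mid X]}{p^*} - \frac{\mu_1}{p^*}\, p^*(1-p^*)
   = \mu_1 (1-p^*) - \mu_1 (1-p^*) = 0 .
\end{aligned}
```

The last line implies $E[\{\psi + \alpha_0\} c^*] = 0$, because $c^*(T, X)$ is a function of $X$ multiplied by $T - p^*(X)$.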
Combining these results, we obtain
$$
\sqrt{n}(\hat{\beta} - \beta_0) \xrightarrow{d} N\left( 0, \; E\left[ \left( \mu_1(X) - \beta_0 \right)^2 + \frac{\sigma_1^2(X)}{p^*(X)} \right] + E\left[ \left( \delta^*(X) - \delta_0(X) \right)^2 \right] \right),
$$
where the first term in the asymptotic variance is identical to the efficiency bound derived by Hirano et al. (2003). Therefore, the efficiency loss due to parametric estimation of the propensity score is given by $E[(\delta^*(X) - \delta_0(X))^2]$.
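For readers who want the intermediate arithmetic, the combination of the five moments can be sketched as follows. Here $V$ is shorthand, introduced only for this sketch, for the asymptotic variance, and the last equality uses $E[\mu_1(X)] = \beta_0$. Arguments $(X)$ are again suppressed.

```latex
\begin{aligned}
V &= E[\psi^2] + E[\alpha_0^2] + 2E[\psi\,\alpha_0] + E[(c^*)^2] + 2E[(\psi+\alpha_0)\,c^*] \\
  &= \left\{ E\left[\frac{\mu_1^2}{p^*}\right] + E\left[\frac{\sigma_1^2}{p^*}\right] - \beta_0^2 \right\}
   + \left\{ E\left[\frac{\mu_1^2}{p^*}\right] - E[\mu_1^2] \right\}
   + 2\left\{ -E\left[\frac{\mu_1^2}{p^*}\right] + E[\mu_1^2] \right\}
   + E\left[(\delta^*-\delta_0)^2\right] + 0 \\
  &= E[\mu_1^2] - \beta_0^2 + E\left[\frac{\sigma_1^2}{p^*}\right] + E\left[(\delta^*-\delta_0)^2\right] \\
  &= E\left[(\mu_1-\beta_0)^2 + \frac{\sigma_1^2}{p^*}\right] + E\left[(\delta^*-\delta_0)^2\right].
\end{aligned}
```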

References

  1. Abadie, Alberto, and Guido W. Imbens. 2002. Simple and Bias-Corrected Matching Estimators for Average Treatment Effects. NBER Working Paper. Cambridge: National Bureau of Economic Research, vol. 283.
  2. Abadie, Alberto, and Guido W. Imbens. 2006. Large Sample Properties of Matching Estimators for Average Treatment Effects. Econometrica 74: 235–67.
  3. Chen, Xiaohong. 2007. Large Sample Sieve Estimation of Semi-Nonparametric Models. Handbook of Econometrics 6: 5549–632.
  4. Chen, Xiaohong, and Xiaotong Shen. 1998. Sieve Extremum Estimates for Weakly Dependent Data. Econometrica 66: 289–314.
  5. Hahn, Jinyong. 1998. On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects. Econometrica 66: 315–31.
  6. Heckman, James J., Hidehiko Ichimura, and Petra Todd. 1998. Matching as an Econometric Evaluation Estimator. Review of Economic Studies 65: 605–54.
  7. Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. 2003. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score. Econometrica 71: 1161–89.
  8. Hitomi, Kohtaro, Yoshihiko Nishiyama, and Ryo Okui. 2008. A Puzzling Phenomenon in Semiparametric Estimation Problems with Infinite-Dimensional Nuisance Parameters. Econometric Theory 24: 1717–28.
  9. Hristache, Marian, and Valentin Patilea. 2017. Conditional Moment Models with Data Missing at Random. Biometrika 104: 735–42.
  10. Imbens, Guido W. 2004. Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review. The Review of Economics and Statistics 86: 4–29.
  11. Kang, Joseph D. Y., and Joseph L. Schafer. 2007. Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data. Statistical Science 22: 523–39.
  12. Kim, Kyoo Il. 2013. An Alternative Efficient Estimation of Average Treatment Effects. Journal of Market Economy 42: 1–41.
  13. Li, Qi, Jeffrey S. Racine, and Jeffrey M. Wooldridge. 2009. Efficient Estimation of Average Treatment Effects with Mixed Categorical and Continuous Data. Journal of Business and Economic Statistics 27: 206–23.
  14. Newey, Whitney K. 1984. A Method of Moments Interpretation of Sequential Estimators. Economics Letters 14: 201–6.
  15. Newey, Whitney K. 1994. The Asymptotic Variance of Semiparametric Estimators. Econometrica 62: 1349–82.
  16. Newey, Whitney K. 1997. Convergence Rates and Asymptotic Normality for Series Estimators. Journal of Econometrics 79: 147–68.
  17. Prokhorov, Artem, and Peter Schmidt. 2009. GMM Redundancy Results for General Missing Data Problems. Journal of Econometrics 151: 47–55.
  18. Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. 1995. Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data. Journal of the American Statistical Association 90: 106–21.
  19. Rosenbaum, Paul R. 1987. Model-Based Direct Adjustment. Journal of the American Statistical Association 82: 387–94.
  20. Rosenbaum, Paul R., and Donald B. Rubin. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70: 41–55.
  21. Rosenbaum, Paul R., and Donald B. Rubin. 1984. Reducing Bias in Observational Studies Using Subclassification on the Propensity Score. Journal of the American Statistical Association 79: 516–24.
  22. Rubin, Donald B. 1973. The Use of Matched Sampling and Regression Adjustments to Remove Bias in Observational Studies. Biometrics 29: 185–203.
  23. Rubin, Donald B., and Neal Thomas. 1996. Matching Using Estimated Propensity Scores: Relating Theory to Practice. Biometrics 52: 249–64.
  24. Shen, Xiaotong, and Wing Hung Wong. 1994. Convergence Rate of Sieve Estimates. The Annals of Statistics 22: 580–615.
  25. Tan, Zhiqiang. 2007. Comment: Understanding OR, PS and DR. Statistical Science 22: 560–68.
1
As pointed out by one referee, several studies in the literature are related to our findings and show that using an estimated nuisance parameter rather than its true value improves the efficiency of the estimator of the parameter of interest (see, e.g., Prokhorov and Schmidt (2009); Hitomi et al. (2008); and Hristache and Patilea (2017)). Our work provides a new insight into this problem by illustrating that parametric estimation of the nuisance parameter may not achieve full efficiency.
2
Hahn (1998) proposes to estimate $\hat{E}[T_i Y_i \mid X_i]$, $\hat{E}[(1 - T_i) Y_i \mid X_i]$, and $\hat{E}[T_i \mid X_i]$ using series estimation (e.g., Newey (1997)). The resulting treatment effect estimators, however, are subject to some practical issues, e.g., the propensity score estimate $\hat{E}[T_i \mid X_i]$ may lie outside the interval between zero and one.
3
See, e.g., Shen and Wong (1994) and Chen and Shen (1998) for further details on sieve extremum estimation, which includes sieve ML estimation.
4
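The Hölder space $\Lambda^{\gamma}(\mathcal{X})$ is a space of functions $g : \mathcal{X} \to \mathbb{R}$ such that the first $\underline{\gamma}$ derivatives are bounded, and the $\underline{\gamma}$-th derivatives are Hölder continuous with the exponent $\gamma - \underline{\gamma} \in (0, 1]$, where $\underline{\gamma}$ is the largest integer smaller than $\gamma$. The Hölder space becomes a Banach space when endowed with the Hölder norm:
$$
\| g \|_{\Lambda^{\gamma}} = \sup_{x} | g(x) | + \max_{a_1 + a_2 + \cdots + a_{d_x} = \underline{\gamma}} \; \sup_{x \neq x^{\prime}} \frac{| \nabla^{a} g(x) - \nabla^{a} g(x^{\prime}) |}{\left( \| x - x^{\prime} \|_E \right)^{\gamma - \underline{\gamma}}} < \infty,
$$
where $\nabla^{a} g(x) \equiv \frac{\partial^{a_1 + a_2 + \cdots + a_{d_x}}}{\partial x_1^{a_1} \cdots \partial x_{d_x}^{a_{d_x}}} g(x)$ and $\| \cdot \|_E$ denotes the Euclidean norm. The Hölder ball $\Lambda_c^{\gamma}(\mathcal{X})$ is defined as $\Lambda_c^{\gamma}(\mathcal{X}) \equiv \{ g \in \Lambda^{\gamma}(\mathcal{X}) : \| g \|_{\Lambda^{\gamma}} \leq c < \infty \}$.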
5
When the true propensity score is the logit model instead, this term is replaced by $p^*(x)(1 - p^*(x)) \cdot x$, where $p^*(x) = \exp(x^{\prime}\pi_0) / (1 + \exp(x^{\prime}\pi_0))$.
6
We thank the referee for this useful suggestion.
7
In the original example, we have $\pi_0 = \pi_1 = 1/2$.
8
This is because the terms (7)–(9) are $o_p(1)$ and only the terms (10) and (11) remain in the stochastic expansion.