Efficiency of Average Treatment Effect Estimation When the True Propensity Is Parametric

Abstract: It is well known that efficient estimation of average treatment effects can be obtained by the method of inverse propensity score weighting, using the estimated propensity score, even when the true one is known. When the true propensity score is unknown but parametric, it is conjectured from the literature that we still need nonparametric propensity score estimation to achieve efficiency. We formalize this argument and further identify the source of the efficiency loss arising from parametric estimation of the propensity score. We also provide an intuition for why this overfitting is necessary. Our finding suggests that, even when we know that the true propensity score belongs to a parametric class, we still need to estimate the propensity score by a nonparametric method in applications.


Introduction
Estimating the treatment effect of a binary treatment or a policy has been one of the most important topics in evaluation studies. In estimating treatment effects, a subject's selection into a treatment may contaminate the estimate, and two approaches are popularly used in the literature to remove the bias due to this sample selection. One is the regression-based control function method (see, e.g., Rubin (1973); Hahn (1998); and Imbens (2004)) and the other is the matching method (see, e.g., Rubin and Thomas (1996); Heckman et al. (1998); and Imbens (2002, 2006)). When there are many covariates or pre-treatment variables that govern this selection, the matching method may be less practical. In this case, following Rosenbaum and Rubin (1983, 1984), we can control for the sample selection bias using the propensity score to reduce the dimensionality problem.
Although adjusting for sub-population differences in the propensity score removes the bias, the resulting treatment effect estimators may not be all efficient. Hahn (1998) shows that, using a nonparametric series estimation of the propensity score, we can achieve the efficiency bound. Hirano et al. (2003) also develop an efficient estimation of average treatment effects using the logit series estimation of the propensity score overcoming some practical limitations of Hahn (1998)'s series estimator (see also Li et al. (2009)).
Based on these studies, empirical researchers are encouraged to estimate treatment effects using the imputation method of the inverse weighting of the estimated propensity score. However, a nonparametric method of estimating the propensity score may require a large data set, especially when covariates or pre-treatment variables are high dimensional. For this reason, many empirical researchers estimate the propensity score parametrically using the probit or logit specification, on the premise that these parametric models are still good approximations to the true propensity score. In addition, studies in the statistics literature, such as Rosenbaum (1987); Rubin and Thomas (1996); and Robins et al. (1995), show that using parametric estimates of the propensity score can improve the efficiency of the treatment effect estimation.
However, from the existing literature (Hahn (1998); Hirano et al. (2003); Kang and Schafer (2007); and Tan (2007)), we can infer that, even when the true propensity score is parametric and the parametric estimator is consistent, we still need to estimate the propensity score nonparametrically to achieve full efficiency. The first contribution of this paper is to formalize this efficiency argument and confirm that parametric estimation of the propensity score yields an inefficient estimator of the average treatment effect if some or all of the covariates are continuous.¹ The second, and more interesting, contribution of this paper is to identify the source of this inefficiency and formally characterize the efficiency loss due to parametric estimation of the propensity score.
For our results, we find that a nonparametric sieve estimation of the propensity score has two roles in the efficient estimation of average treatment effects. First, it approximates the true propensity score, and second it approximates the conditional expectation of the derivative of the moment function for the treatment effect with respect to the propensity score. We show that parametric estimation of propensity score accomplishes the first role when the true propensity score is indeed parametric, but cannot achieve the second role, if some of covariates are continuous. In other words, consistent estimation of the propensity score alone is not enough to obtain the efficient estimation of average treatment effects.
This finding also suggests that the performance of the treatment effect estimator in finite samples may depend not only on how precisely the propensity score is estimated, but also how well the conditional expectation of the derivative of the moment condition is approximated by the same sieve basis functions or regressors used to estimate the propensity score. We note that the literature has focused on the former, but the latter has been somewhat ignored. Moreover, because these two objects are quite different in nature, a sieve approximation solely targeted for the propensity score does not necessarily well approximate the conditional expectation of the derivative of the moment function in finite samples.
The rest of the paper is organized as follows. Section 2 outlines the average treatment effect estimation using the inverse propensity score weighting. Section 3 examines the role of the nonparametric propensity score estimation when the true one is parametric. We also provide an illustrative example. We conclude in Section 4. Some technical details are provided in Appendix A.

Estimation of Average Treatment Effect
In this section, we review estimation of average treatment effects using the inverse propensity score weighting in a standard setting. For this purpose, suppose we have a random sample of $n$ individuals, some of whom received a treatment while others did not. Let $T_i$ denote the treatment status, with $T_i = 1$ if individual $i$ receives the treatment and $T_i = 0$ otherwise. Using the same notation as Rubin (1973), denote by $Y_i(0)$ the potential outcome for individual $i$ under control and by $Y_i(1)$ the outcome under treatment. We observe $T_i$, $X_i$, and $Y_i \equiv T_i Y_i(1) + (1 - T_i) Y_i(0)$, where $X_i$ is a vector of observable covariates of the individual. Here, we have a fundamental missing data problem since, depending on the treatment status, we observe only one of $Y_i(1)$ and $Y_i(0)$ for each individual, never both.
The parameter of interest is the population average treatment effect defined as

$$\tau^* \equiv E[Y_i(1) - Y_i(0)].$$

¹ As pointed out by one referee, several studies in the literature related to our findings show that using an estimated nuisance parameter rather than the true value improves the efficiency of the estimate of the parameter of interest (see, e.g., Prokhorov and Schmidt (2009); Hitomi et al. (2008); and Hristache and Patilea (2017)). Our work provides a new insight into this problem by illustrating that parametric estimation of the nuisance parameter may not achieve full efficiency.
If we had both $Y_i(1)$ and $Y_i(0)$ for all individuals, we could simply estimate the average treatment effect using its sample analogue, but this is not feasible due to the missing data problem. One important way to circumvent this missing data problem in the literature is the imputation method based on the propensity score, motivated by Rosenbaum and Rubin (1983, 1984). The propensity score of an individual whose observable characteristics $X_i$ equal $x$ is defined by

$$p^*(x) \equiv \Pr(T_i = 1 \mid X_i = x) = E[T_i \mid X_i = x].$$

According to Rosenbaum and Rubin, if (i) there exist covariates $X_i$ such that the treatment status $T_i$ is ignorable given $X_i$ and (ii) $0 < p^*(x) < 1$ for all $x \in \mathcal{X} \equiv \mathrm{Supp}(X)$, then $T_i$ and $(Y_i(0), Y_i(1))$ are independent of each other given the propensity score. This implies that

$$\tau^* = E\left[\frac{E[T_i Y_i \mid X_i]}{p^*(X_i)} - \frac{E[(1 - T_i) Y_i \mid X_i]}{1 - p^*(X_i)}\right]. \qquad (1)$$

This allows us to estimate the treatment effect using a sample analogue of Equation (1). To be precise, define

$$\hat{Y}_i(1) \equiv \frac{\hat{E}[T_i Y_i \mid X_i]}{\hat{E}[T_i \mid X_i]}, \qquad \hat{Y}_i(0) \equiv \frac{\hat{E}[(1 - T_i) Y_i \mid X_i]}{1 - \hat{E}[T_i \mid X_i]},$$

where the $\hat{E}[\cdot \mid \cdot]$'s denote suitable conditional mean function estimators. Then, we can construct complete data using the imputation such that $(\hat{Y}_i(1), \hat{Y}_i(0))$, $i = 1, \ldots, n$, and we can estimate the average treatment effect as $\frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i(1) - \hat{Y}_i(0))$. These nonparametric imputation methods were proposed by Hahn (1998), who further shows that these treatment effect estimators achieve the semiparametric efficiency bound.²

Hirano et al. (2003) propose an alternative estimator for which the propensity score is estimated using a logit series estimation, where the propensity score is given by $p^*(x) = \exp(h_0(x)) / (1 + \exp(h_0(x)))$ for some unknown function $h_0(x)$. In the logit series estimation, we approximate $h_0(x)$ using linear sieves, and the estimated propensity score is given by $\hat{p}(x) = \exp(\hat{h}_n(x)) / (1 + \exp(\hat{h}_n(x)))$, where $\hat{h}_n(x)$ denotes the sieve Maximum Likelihood (ML) estimator.³ The proposed treatment effect estimator is given by

$$\hat{\tau} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{T_i Y_i}{\hat{p}(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{p}(X_i)} \right).$$

This estimator also achieves the semiparametric efficiency bound, and improves over Hahn (1998)'s estimator in two practical ways. First, we do not need to estimate the conditional mean functions $E[T_i Y_i \mid X_i]$ and $E[(1 - T_i) Y_i \mid X_i]$. Second, the estimated propensity score lies between zero and one by construction.
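The inverse propensity score weighting estimator described above can be illustrated with a short simulation. The following is only a minimal sketch under assumptions that are not in the text: a hypothetical data generating process with a scalar covariate and true average treatment effect equal to one, a three-term polynomial as the linear sieve for $h_0(x)$, and a hand-written Newton-Raphson logit fit (the helper names `fit_logit` and `ipw_ate` are illustrative).

```python
import numpy as np


def fit_logit(R, t, n_iter=50, tol=1e-10):
    """Logit ML by Newton-Raphson; R is the (n, k) matrix of sieve terms."""
    coef = np.zeros(R.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(R @ coef)))
        grad = R.T @ (t - p)                          # score of the log-likelihood
        hess = R.T @ (R * (p * (1.0 - p))[:, None])   # negative Hessian
        step = np.linalg.solve(hess, grad)
        coef += step
        if np.max(np.abs(step)) < tol:
            break
    return coef


def ipw_ate(y, t, p_hat):
    """Sample analogue of E[T Y / p(X) - (1 - T) Y / (1 - p(X))]."""
    return np.mean(t * y / p_hat - (1.0 - t) * y / (1.0 - p_hat))


# Hypothetical design (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-2.0, 2.0, n)
p_true = 1.0 / (1.0 + np.exp(-x))                 # selection depends on x
t = (rng.uniform(size=n) < p_true).astype(float)
y0 = x + rng.normal(size=n)                        # potential outcome under control
y1 = 1.0 + x + rng.normal(size=n)                  # potential outcome under treatment
y = t * y1 + (1.0 - t) * y0                        # only one outcome is observed

R = np.column_stack([np.ones(n), x, x**2])         # small linear sieve for h(x)
p_hat = 1.0 / (1.0 + np.exp(-(R @ fit_logit(R, t))))

tau_hat = ipw_ate(y, t, p_hat)
naive = y[t == 1].mean() - y[t == 0].mean()        # ignores selection on x
print(f"IPW estimate: {tau_hat:.3f}, naive difference in means: {naive:.3f}")
```

In this design the naive difference in means is contaminated by selection on $x$, whereas the weighted estimator should be close to the true effect of one; the fitted propensity score lies strictly between zero and one by construction of the logit link.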
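Estimation of average treatment effects using the estimated propensity score with a general link function that includes the logit or probit specification was proposed by Kim (2013). We will use this general setting to argue that the inefficiency of the treatment effect estimate with the estimated parametric propensity score is not specific to a particular functional form assumption like logit or probit. To obtain a sieve ML estimator for the propensity score with a general link function, we assume the true function $h_0$ belongs to a class of bounded and smooth functions such as a Hölder ball, and let $p^*(x) = F(h_0(x))$ for a known link function $F(\cdot)$.⁴ Then, based on a triangular sequence of orthonormal basis functions such as polynomials or splines, we construct a tensor-product sieve space $\mathcal{H}_n$ as

$$\mathcal{H}_n = \left\{ h : h(x) = R^{K(n)}(x)'\pi, \ \|h\|_{\Lambda^{\gamma_1}} \le c \right\},$$

where $R^{K(n)}(x)$ is the vector of the first $K(n)$ basis functions, $c$ is a finite constant, $\|\cdot\|_{\Lambda^{\gamma_1}}$ denotes a Hölder norm, and we let $K(n) \to \infty$ as $n \to \infty$. The sieve ML estimator is obtained by solving

$$\hat{h}_n = \arg\max_{h \in \mathcal{H}_n} \frac{1}{n} \sum_{i=1}^{n} \left\{ T_i \log F(h(X_i)) + (1 - T_i) \log\left(1 - F(h(X_i))\right) \right\}, \qquad (2)$$

and the resulting propensity score estimator becomes $\hat{p}(x) = F(\hat{h}_n(x))$. Finally, using the estimated propensity score, we estimate the average treatment effect as

$$\hat{\tau} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{T_i Y_i}{\hat{p}(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{p}(X_i)} \right).$$

For the general class of $F(\cdot)$, as long as the function is continuous and monotonic in $h$, Kim (2013) shows that this treatment effect estimator achieves the semiparametric efficiency bound such that

$$\sqrt{n}(\hat{\tau} - \tau^*) \to_d N(0, V), \quad V = E\left[ \frac{\mathrm{Var}(Y_i(1) \mid X_i)}{p^*(X_i)} + \frac{\mathrm{Var}(Y_i(0) \mid X_i)}{1 - p^*(X_i)} + \left( E[Y_i(1) - Y_i(0) \mid X_i] - \tau^* \right)^2 \right], \qquad (3)$$

which is identical to the efficiency bound derived by Hahn (1998).

² Hahn (1998) estimates $E[T_i Y_i \mid X_i]$, $E[(1 - T_i) Y_i \mid X_i]$, and $E[T_i \mid X_i]$ using series estimations (e.g., Newey (1997)). The resulting treatment effect estimators, however, are subject to some practical issues, e.g., the propensity score estimate $\hat{E}[T_i \mid X_i]$ may lie outside the zero and one interval.

⁴ The Hölder space is a space of functions $g \in \Lambda^{\gamma}(\mathcal{X})$, $g : \mathcal{X} \to \mathbb{R}$, such that the first $\underline{\gamma}$ derivatives are bounded, and the $\underline{\gamma}$-th derivatives are Hölder continuous with the exponent $\gamma - \underline{\gamma} \in (0, 1]$, where $\underline{\gamma}$ is the largest integer smaller than $\gamma$. The Hölder space becomes a Banach space when endowed with the Hölder norm:

$$\|g\|_{\Lambda^{\gamma}} = \sup_{x} |g(x)| + \max_{|a| = \underline{\gamma}} \sup_{x \ne \bar{x}} \frac{|\nabla^a g(x) - \nabla^a g(\bar{x})|}{\|x - \bar{x}\|^{\gamma - \underline{\gamma}}} < \infty.$$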
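To make the sieve ML step with a general link function concrete, the following is a rough sketch, not the implementation of Kim (2013). It assumes a probit link, a scalar covariate, a Legendre polynomial basis with four terms, and Fisher scoring as the optimizer; it also ignores the Hölder-norm constraint on the sieve space. All function names and the simulated design are hypothetical.

```python
import math

import numpy as np
from numpy.polynomial.legendre import legvander

_erf = np.vectorize(math.erf)


def norm_cdf(h):
    """Standard normal CDF, used here as the link F."""
    return 0.5 * (1.0 + _erf(h / math.sqrt(2.0)))


def norm_pdf(h):
    """Standard normal density, the derivative f of the link F."""
    return np.exp(-0.5 * h**2) / math.sqrt(2.0 * math.pi)


def sieve_ml(R, t, F, f, n_iter=100, tol=1e-10):
    """Maximize sum_i [t_i log F(h_i) + (1 - t_i) log(1 - F(h_i))] with
    h = R @ coef, using Fisher scoring."""
    coef = np.zeros(R.shape[1])
    for _ in range(n_iter):
        h = R @ coef
        p = np.clip(F(h), 1e-10, 1.0 - 1e-10)
        d = f(h)
        score = R.T @ (d * (t - p) / (p * (1.0 - p)))
        info = R.T @ (R * (d**2 / (p * (1.0 - p)))[:, None])
        step = np.linalg.solve(info, score)
        coef += step
        if np.max(np.abs(step)) < tol:
            break
    return coef


# Hypothetical design: probit propensity score with h_0(x) = 0.8 x.
rng = np.random.default_rng(1)
n = 5_000
x = rng.uniform(-2.0, 2.0, n)
p_true = norm_cdf(0.8 * x)
t = (rng.uniform(size=n) < p_true).astype(float)

K = 4                                   # number of sieve terms
R = legvander(x / 2.0, K - 1)           # Legendre basis on the rescaled support [-1, 1]
coef = sieve_ml(R, t, norm_cdf, norm_pdf)
p_hat = norm_cdf(R @ coef)
mean_abs_err = np.mean(np.abs(p_hat - p_true))
print(f"mean absolute error of the fitted propensity score: {mean_abs_err:.4f}")
```

Swapping `norm_cdf` and `norm_pdf` for a logistic link and its derivative gives the logit series estimator as a special case; in practice $K$ would grow with the sample size.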
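This efficiency result with the general link function is obtained, similarly to Hirano et al. (2003), following the influence function approach of Newey (1994). To see this, with $Z_i \equiv (Y_i, T_i, X_i)$, define

$$\psi(Z_i, \tau, p(X_i)) = \frac{T_i Y_i}{p(X_i)} - \frac{(1 - T_i) Y_i}{1 - p(X_i)} - \tau,$$

$$\psi_p(Z_i, \tau, p(X_i)) = -\frac{T_i Y_i}{p(X_i)^2} - \frac{(1 - T_i) Y_i}{(1 - p(X_i))^2}, \qquad s_p(X_i) = E[\psi_p(Z_i, \tau^*, p^*(X_i)) \mid X_i], \qquad (4)$$

where $\psi_p(\cdot)$ denotes the derivative of the moment function for the treatment effect, $\psi(\cdot)$, with respect to the propensity score $p(\cdot)$, and $s_p(\cdot)$ denotes its conditional expectation at the true parameter values. The asymptotic variance result of Equation (3) is obtained by showing that the estimator is asymptotically linear with an influence function decomposed into two terms:

$$\sqrt{n}(\hat{\tau} - \tau^*) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \psi(Z_i, \tau^*, p^*(X_i)) + s_p(X_i)(T_i - p^*(X_i)) \right\} + o_p(1). \qquad (5)$$

The first term in Equation (5) is the influence function when we know the true propensity score $p^*(\cdot)$, and the second term represents the contribution of the estimated propensity score to the asymptotic distribution of $\hat{\tau}$. It follows that the asymptotic variance $V$ in Equation (3) satisfies $V = E[\{\psi(Z_i, \tau^*, p^*(X_i)) + s_p(X_i)(T_i - p^*(X_i))\}^2]$, which yields the result.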
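The step from the two-term influence function to the efficiency bound can be worked out explicitly. The following derivation is a sketch under notation that is assumed here rather than taken from the text: $\mu_t(x) = E[Y(t) \mid X = x]$ and $\sigma_t^2(x) = \mathrm{Var}(Y(t) \mid X = x)$ for $t = 0, 1$, and the moment function is taken to be the standard inverse weighting moment $\psi = TY/p^*(X) - (1 - T)Y/(1 - p^*(X)) - \tau^*$.

```latex
% Assumed notation (not in the original text):
%   \mu_t(x) = E[Y(t) | X = x],  \sigma_t^2(x) = Var(Y(t) | X = x),  t = 0, 1.
\begin{align*}
s_p(X) &= E[\psi_p \mid X]
        = -\left( \frac{\mu_1(X)}{p^*(X)} + \frac{\mu_0(X)}{1 - p^*(X)} \right), \\
\psi + s_p(X)\,(T - p^*(X))
       &= \frac{T\,(Y - \mu_1(X))}{p^*(X)}
          - \frac{(1 - T)\,(Y - \mu_0(X))}{1 - p^*(X)}
          + \mu_1(X) - \mu_0(X) - \tau^*, \\
V = E\left[ \big( \psi + s_p(X)(T - p^*(X)) \big)^2 \right]
       &= E\left[ \frac{\sigma_1^2(X)}{p^*(X)}
          + \frac{\sigma_0^2(X)}{1 - p^*(X)}
          + \big( \mu_1(X) - \mu_0(X) - \tau^* \big)^2 \right].
\end{align*}
```

The second line collects the terms in $\mu_1$ and $\mu_0$ after expanding $s_p(X)(T - p^*(X))$. The third line uses $T(1 - T) = 0$ and the fact that, under ignorability, the first two terms have conditional mean zero given $X$, so all cross terms vanish.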

Efficient Estimation When the True Propensity Score Is Parametric
As we discuss in the previous section, the efficiency of the treatment effect estimator depends on whether the estimator has the asymptotically linear representation as Equation (5). When the propensity score is estimated using a nonparametric sieve ML, we achieve this representation and hence the efficiency bound. Here, we pose the question of whether we can achieve this asymptotic linear representation if the true propensity is parametric, and is estimated under the correct parametric specification. We confirm that, in this case, the semiparametric efficiency bound is not achieved as can be inferred from the existing literature. This suggests that, even though we know the true propensity score belongs to a parametric class, we still need to estimate the propensity score by a nonparametric method.
Our intuition behind this result is that the nonparametric sieve estimation of the propensity score plays two roles in the estimation of the treatment effect. First, it approximates the true propensity score, and second it approximates the conditional expectation of the derivative of the moment condition for the treatment effect with respect to the propensity score. For the purpose of illustration, without loss of generality, suppose $p^*(x) = \Phi(x'\pi_0)$, where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function (CDF), so the true propensity score is a probit model. We can then estimate $\pi_0$ by ML, denoted by $\hat{\pi}$, and obtain the parametric convergence rate such that $\sqrt{n}(\hat{\pi} - \pi_0) = O_p(1)$ and hence $\sup_{x \in \mathcal{X}} |\hat{p}(x) - p^*(x)| = O_p(n^{-1/2})$, where $\hat{p}(x) = \Phi(x'\hat{\pi})$. For ease of notation without losing the main idea, we consider the special case that $Y(0) = 0$ with probability one. Define $\beta_0 = E[Y(1)]$ as the average outcome of interest, where $Y(1)$ is missing at random conditional on the covariates $X$. We estimate the average outcome as $\hat{\beta} = \frac{1}{n} \sum_{i=1}^{n} \frac{Y_i T_i}{\hat{p}(X_i)}$. For this estimator, following Equation (5), if we can obtain the asymptotic linear representation as

$$\sqrt{n}(\hat{\beta} - \beta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \frac{Y_i T_i}{p^*(X_i)} - \beta_0 - \frac{E[Y_i(1) \mid X_i]}{p^*(X_i)} (T_i - p^*(X_i)) \right\} + o_p(1), \qquad (6)$$

then we will achieve the efficiency bound. To see whether this asymptotic linear representation is attainable with parametric estimation of the propensity score, we decompose the difference between $\sqrt{n}(\hat{\beta} - \beta_0)$ and the leading term of Equation (6) into the four terms (7)-(10), where the term (10) is defined through two functions, $\delta_0(x)$ and $\delta^*(x)$, whose definitions involve $\phi(\cdot)$, the standard normal density function. If we can show that all terms (7)-(10) are $o_p(1)$, we then obtain the desired result of Equation (6). Following the steps in Hirano et al. (2003) or Kim (2013), it is straightforward to bound the terms (7)-(9) as $o_p(1)$. We focus on the term (10), from which we derive our main finding.
In other words, with $\delta^*(x)$ given by the linear projection of $\delta_0(x)$ in Equation (12), where $\theta_0$ denotes the projection coefficient,⁵ we will have $\inf_{x \in \mathcal{X}} |\delta^*(x) - \delta_0(x)| > C > 0$ for some positive constant $C$. It follows that the inefficiency term in Equation (13) is nonzero. Therefore, the term (10) remains $O_p(1)$ and contributes to the asymptotic distribution of the treatment effect estimator. In other words, the asymptotic linear representation of Equation (6) is not obtained with parametric estimation of the propensity score in general, even when the true propensity score is parametric. In the Appendix, we derive the asymptotic variance of the treatment effect estimator with the estimated parametric propensity score, and characterize the efficiency loss due to parametric estimation of the propensity score. In particular, we show that this efficiency loss is exactly given by Equation (13).

⁵ When the true propensity score is the logit model instead, this term is replaced by $p^*(x)(1 - p^*(x)) \cdot x$, where $p^*(x) = \exp(x'\pi_0) / (1 + \exp(x'\pi_0))$.
As the key difference, in the treatment effect estimation using the nonparametric sieve estimation of the propensity score as in Equation (2), it can be shown that, when $\delta_0(x)$ is $t$-times continuously differentiable, $\sup_{x \in \mathcal{X}} |\delta^*(x) - \delta_0(x)|$ satisfies the bound (14), which vanishes as $K$ grows, where $K$ denotes the number of approximating sieve terms used in $\delta^*(x)$ (see Hirano et al. (2003) or Kim (2013)). Therefore, we can bound the term (10) as $o_p(1)$ for some large enough $K$. This is because, when the sieve estimation in Equation (2) is used to estimate the propensity score, the projection in Equation (12) is taken on terms built from $R^K(x)$, where $R^K(x)$ denotes a vector of approximating basis functions, and hence the bound (14) follows from approximation results for sieves over a class of smooth functions such as a Hölder class (see, e.g., Chen (2007)). We note, however, that because $p^*(x)$ and $\delta_0(x)$ are quite different in nature, the sieve approximation used to estimate the propensity score does not necessarily approximate the latter well in finite samples, which may contribute to the inefficiency of the treatment effect estimation. Finally, by inspecting Equations (4) and (5) for the case $Y(0) = 0$ along with Equation (10), note that the term $\delta_0(x)$ is related to the conditional expectation of the derivative of the moment function with respect to the propensity score. This implies that the nonparametric sieve estimation of the propensity score plays two roles in the estimation of the treatment effect. It first approximates the true propensity score, and second approximates the conditional expectation of the derivative of the moment condition with respect to the propensity score. The parametric propensity score estimation can accomplish the first role, if the true one is parametric, but cannot achieve the second when some of the covariates are continuous.
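The argument of this section can be checked numerically in the special case $Y(0) = 0$. The following Monte Carlo is only an illustrative sketch under assumptions not in the text: a hypothetical design with one continuous covariate, a true logit (rather than probit) propensity score that is linear in $x$, and an outcome mean that is quadratic in $x$. It compares $\hat{\beta} = \frac{1}{n} \sum_i T_i Y_i / p(X_i)$ computed with the true propensity score, with a correctly specified parametric logit ML estimate, and with a logit sieve estimate using five Legendre terms.

```python
import numpy as np
from numpy.polynomial.legendre import legvander


def logistic(h):
    return 1.0 / (1.0 + np.exp(-h))


def fit_logit(R, t, n_iter=50, tol=1e-10):
    """Logit ML by Newton-Raphson on the regressor matrix R."""
    coef = np.zeros(R.shape[1])
    for _ in range(n_iter):
        p = logistic(R @ coef)
        step = np.linalg.solve(R.T @ (R * (p * (1.0 - p))[:, None]), R.T @ (t - p))
        coef += step
        if np.max(np.abs(step)) < tol:
            break
    return coef


def simulate(n, reps, seed=0):
    """Return a (reps, 3) array of estimates of beta_0 = E[Y(1)] using
    true, parametric, and sieve propensity scores (in that order)."""
    rng = np.random.default_rng(seed)
    est = np.empty((reps, 3))
    for r in range(reps):
        x = rng.uniform(-2.0, 2.0, n)
        p_true = logistic(x)                      # true propensity: logit, linear in x
        t = (rng.uniform(size=n) < p_true).astype(float)
        y = 3.0 * x**2 + 0.5 * rng.normal(size=n)  # Y(1); Y(0) = 0, so we use t * y
        R_par = legvander(x / 2.0, 1)             # spans {1, x}: correct parametric model
        R_sieve = legvander(x / 2.0, 4)           # five sieve terms
        p_par = logistic(R_par @ fit_logit(R_par, t))
        p_sieve = logistic(R_sieve @ fit_logit(R_sieve, t))
        est[r] = [np.mean(t * y / p) for p in (p_true, p_par, p_sieve)]
    return est


n, reps = 1000, 300
est = simulate(n, reps)
v_true, v_par, v_sieve = n * est.var(axis=0, ddof=1)   # scaled Monte Carlo variances
print(f"n * variance: true weights {v_true:.1f}, "
      f"parametric {v_par:.1f}, sieve {v_sieve:.1f}")
```

In this design the ratio of the outcome mean to the propensity score is far from linear in $x$, so the parametric score cannot span it. The simulated variances should therefore be ordered as the text suggests: the sieve-based estimator is the most precise, the correctly specified parametric estimator improves on the true weights but leaves an efficiency gap, and the true weights perform worst.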
The asymptotic variance of a treatment effect estimator using parametric estimation of the propensity score can also depend on which parametric estimator is being used in practice. In this regard, given a parametric model of the propensity score, one can directly derive the asymptotic variance of the treatment effect estimator using the estimated parametric propensity score by combining two moments as a sequential estimation problem (see, e.g., Newey (1984)). The first moment is given by, e.g., the first order condition of the population ML objective function of the propensity score estimation such as the logit or probit ML, and the second moment is given by the moment condition to estimate the treatment effect, $E[\psi(Z_i, \tau, p(X_i))] = 0$, defined in Equation (4). We can then directly compare the asymptotic variance of the treatment effect estimator resulting from using a specific parametric estimator of the propensity score to the semiparametric efficiency bound, instead of deriving the inefficiency term from Equation (13). This joint moments approach for parametric estimation of the propensity score also allows us to explicitly derive the efficiency loss due to a specific parametric estimator of the propensity score, and hence compare different parametric models of the propensity score in terms of efficiency.⁶

3.1. Reconsidering the Simple Example in Hirano et al. (2003)
Hirano et al. (2003) present a simple example with a binary covariate, illustrating that, by weighting by the inverse of the propensity score estimate rather than the true one, we can improve the efficiency and indeed achieve the efficiency bound. Here, we reproduce the example and provide an intuition for why the efficiency bound is achieved in this case, in view of the results from the previous section. Consider a simple problem of estimating the population average of a variable $Y$, $\beta_0 = E[Y]$, given a random sample of size $n$ of the triple $(T_i, X_i, T_i \cdot Y_i)$.
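Therefore, $T_i$ and $X_i$ are observed for all units in the sample, but $Y_i$ is only observed if $T_i = 1$. Denote $\mu(x) = E[Y \mid X = x]$ and $\sigma^2(x) = \mathrm{Var}(Y \mid X = x)$. Now let $N_{tx}$ denote the number of observations with $T_i = t$ and $X_i = x$, for $t, x \in \{0, 1\}$. Further assume that the true selection probability is $p(x) = \pi_0 + x(\pi_1 - \pi_0)$.⁷ The estimated selection probability is then

$$\hat{p}(x) = \frac{N_{1x}}{N_{0x} + N_{1x}}. \qquad (15)$$

The true weights estimator is given by $\hat{\beta}_{tw} = \frac{1}{n} \sum_{i=1}^{n} \frac{T_i Y_i}{p(X_i)}$, and the estimated weights estimator is given by $\hat{\beta}_{ew} = \frac{1}{n} \sum_{i=1}^{n} \frac{T_i Y_i}{\hat{p}(X_i)}$. Hirano et al. (2003) show that $\hat{\beta}_{ew}$ is more efficient than $\hat{\beta}_{tw}$, and that $\hat{\beta}_{ew}$ achieves the efficiency bound. Interestingly, one can easily see that $\hat{p}(x)$ in Equation (15) is a nonparametric estimator of $p(x)$, and is also a parametric MLE of $p(x)$ since we can write $\hat{p}(x) = \hat{\pi}_0 + x(\hat{\pi}_1 - \hat{\pi}_0)$ with $\hat{\pi}_0 = N_{10}/(N_{00} + N_{10})$ and $\hat{\pi}_1 = N_{11}/(N_{01} + N_{11})$.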
In this example, for the corresponding terms of $\delta^*(x)$ and $\delta_0(x)$ in Equation (10), we show below that indeed

$$\delta^*(x) = \delta_0(x) \qquad (16)$$

for all $x$, and hence the efficiency bound is achieved for the estimator $\hat{\beta}_{ew}$ because the asymptotic linear representation like Equation (6) is obtained (i.e., the term (14) is simply equal to zero in this case). To derive the result, consider the terms corresponding to $\delta^*(x)$ and $\delta_0(x)$ in Equation (10) for the stochastic expansion of $\hat{\beta}_{ew}$, in whose definitions $q(\cdot)$ denotes the probability mass function of $X$. By investigating $\delta^*(x)$ and $\delta_0(x)$, we can see that $\delta^*(x)$ is a linear projection of $\delta_0(x)$. In other words, $\delta^*(x) = \theta_0 + \theta_1 x$ for some constants $\theta_0$ and $\theta_1$ that are determined by the linear projection. Because $x$ takes only the values 0 and 1, $\delta_0(x)$ can itself be written as $\delta_0(0) + (\delta_0(1) - \delta_0(0)) x$, which is linear in $x$, so the projection reproduces $\delta_0(x)$ exactly, and therefore the efficiency result follows. In the original example, we have $\pi_0 = \pi_1 = 1/2$. This example clearly illustrates why a condition like (16) is crucial to achieve the efficiency bound. It suggests that, when the covariates are multinomial, we can always achieve a condition like (16) since the parametric ML estimation becomes equivalent to the nonparametric ML estimation. Therefore, we can achieve the efficiency bound. However, when the covariates or a subset of the covariates are continuous, the parametric propensity score estimation cannot achieve the efficiency bound even though the true propensity score is parametric. This also suggests that the efficiency loss due to using the parametric propensity score estimator is attributable to the fact that some covariates are continuous.
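The binary covariate example is easy to simulate. The sketch below assumes a hypothetical design that is not in the text: $X$ is Bernoulli(1/2), the selection probability is 1/2 in both cells, $\mu(0) = 1$, $\mu(1) = 3$, and $\sigma^2(x) = 1$. It checks that the estimated weights estimator coincides with a weighted average of the observed cell means, and compares its Monte Carlo variance with that of the true weights estimator.

```python
import numpy as np


def draw_sample(rng, n, mu=(1.0, 3.0), sigma=1.0, p=0.5):
    """Draw (X_i, T_i, Y_i); Y_i is treated as observed only when T_i = 1."""
    x = rng.integers(0, 2, n)
    t = (rng.uniform(size=n) < p).astype(float)
    y = np.asarray(mu)[x] + sigma * rng.normal(size=n)
    return x, t, y


def beta_tw(x, t, y, p=0.5):
    """True weights estimator."""
    return np.mean(t * y / p)


def beta_ew(x, t, y):
    """Estimated weights estimator with cell frequencies N_1x / (N_0x + N_1x)."""
    cell_freq = np.array([t[x == k].mean() for k in (0, 1)])
    return np.mean(t * y / cell_freq[x])


def beta_cell_means(x, t, y):
    """Weighted average of observed cell means, with weights N_x / n."""
    return sum((x == k).mean() * y[(x == k) & (t == 1.0)].mean() for k in (0, 1))


rng = np.random.default_rng(0)
n, reps = 400, 2000

# One sample: the estimated weights estimator equals the cell-mean form.
x, t, y = draw_sample(rng, n)
b_ew, b_cell = beta_ew(x, t, y), beta_cell_means(x, t, y)

# Monte Carlo comparison of the two estimators.
draws = np.empty((reps, 2))
for r in range(reps):
    x, t, y = draw_sample(rng, n)
    draws[r] = beta_tw(x, t, y), beta_ew(x, t, y)
n_var_tw, n_var_ew = n * draws.var(axis=0, ddof=1)
print(f"n * variance: true weights {n_var_tw:.2f}, estimated weights {n_var_ew:.2f}")
```

Under this design the asymptotic variances work out to $E[Y^2]/p - \beta_0^2 = 8$ for the true weights and $E[\sigma^2(X)/p] + \mathrm{Var}(\mu(X)) = 3$ for the estimated weights, so the simulated values should be close to these. The estimated weights remove the sampling noise in the realized cell-level treatment frequencies, which is where the gain comes from.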

Generalization to Estimating the Weighted Average Treatment Effect
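We generalize the efficiency comparison between treatment effect estimators using nonparametric or parametric estimation of the propensity score to the weighted average treatment effect, $\tau^*_{wate}$, defined as

$$\tau^*_{wate} \equiv \frac{E[g(X_i)(Y_i(1) - Y_i(0))]}{E[g(X_i)]}$$

for a known weight function $g(x)$. We estimate $\tau^*_{wate}$ using the moment condition

$$E[\psi(Z_i, \tau^*_{wate}, p^*(X_i))] = 0, \qquad \psi(Z_i, \tau, p(X_i)) = g(X_i) \left( \frac{T_i Y_i}{p(X_i)} - \frac{(1 - T_i) Y_i}{1 - p(X_i)} - \tau \right), \qquad (17)$$

that yields the estimator

$$\hat{\tau}_{wate} = \frac{\sum_{i=1}^{n} g(X_i) \left( \frac{T_i Y_i}{\hat{p}(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{p}(X_i)} \right)}{\sum_{i=1}^{n} g(X_i)} \qquad (18)$$

given an estimator of the propensity score $\hat{p}(x)$. Because the function $g(x)$ is known and only appears as a weight in the moment function (17), following the same line of argument as for the average treatment effect, one can obtain the asymptotic linear representation of $\hat{\tau}_{wate}$ using the nonparametric propensity score estimator (2) in Equation (18) as

$$\sqrt{n}(\hat{\tau}_{wate} - \tau^*_{wate}) = \frac{1}{E[g(X_i)]} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \psi(Z_i, \tau^*_{wate}, p^*(X_i)) + s_p(X_i)(T_i - p^*(X_i)) \right\} + o_p(1),$$

where $\psi_p(Z_i, \tau, p(X_i)) = -g(X_i) \left( \frac{T_i Y_i}{p(X_i)^2} + \frac{(1 - T_i) Y_i}{(1 - p(X_i))^2} \right)$ and $s_p(X_i) = E[\psi_p(\cdot) \mid X_i]$. Therefore, the semiparametric efficiency bound is achieved for the weighted average treatment effect estimator $\hat{\tau}_{wate}$ using the nonparametric propensity score estimator (see Hirano et al. (2003)). On the other hand, for the parametric propensity score estimator, we can derive an inefficiency term analogous to Equation (13), the inefficiency term derived for the average treatment effect, and therefore a similar inefficiency result holds for $\hat{\tau}_{wate}$ using the parametric propensity score estimator.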
Note that, under the unconfoundedness assumption (Rosenbaum and Rubin (1983, 1984)), with the weight function $g(x)$ being equal to the true propensity score $p^*(x)$, the weighted average treatment effect becomes the average treatment effect for the treated,

$$\tau^*_{treated} \equiv E[Y_i(1) - Y_i(0) \mid T_i = 1].$$

Based on this equivalence, $\tau^*_{treated}$ can be estimated using the moment condition (17) by replacing $g(x)$ with $p(x)$. However, an efficiency comparison between treatment effect estimators for $\tau^*_{treated}$ using the nonparametric or parametric propensity score estimator is more complicated because, in this case, the propensity score has two roles in the moment function. One is the inverse weighting to control for the self-selection and the other is the weighting function in place of $g(x)$. To see this, let $\hat{p}(x)$ and $\hat{p}^*(x)$ denote the nonparametric and the correctly specified parametric estimator of the propensity score, respectively. Then, we can consider three alternative estimators for the average treatment effect for the treated. The first uses the parametric propensity score $\hat{p}^*(x)$ everywhere and solves

$$\frac{1}{n} \sum_{i=1}^{n} \hat{p}^*(X_i) \left( \frac{T_i Y_i}{\hat{p}^*(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{p}^*(X_i)} - \tau \right) = 0; \qquad (19)$$

the second uses the nonparametric propensity score $\hat{p}(x)$ everywhere and solves

$$\frac{1}{n} \sum_{i=1}^{n} \hat{p}(X_i) \left( \frac{T_i Y_i}{\hat{p}(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{p}(X_i)} - \tau \right) = 0; \qquad (20)$$

and the last uses the parametric propensity score $\hat{p}^*(x)$ in place of $g(x)$ while using the nonparametric propensity score $\hat{p}(x)$ for the inverse weighting, and solves

$$\frac{1}{n} \sum_{i=1}^{n} \hat{p}^*(X_i) \left( \frac{T_i Y_i}{\hat{p}(X_i)} - \frac{(1 - T_i) Y_i}{1 - \hat{p}(X_i)} - \tau \right) = 0. \qquad (21)$$

From the efficiency argument of Hahn (1998) and Hirano et al. (2003) when the true propensity score is known, one can conjecture that the treatment effect estimator that solves Equation (21) will be more efficient than the estimators that solve Equations (19) and (20), respectively. However, in terms of efficiency, the two estimators solving Equations (19) and (20) (or other variations) cannot be uniformly ranked in general, and studying these alternative estimators is beyond the scope of this paper.
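The three alternative estimators for the treated differ only in which propensity score estimate is used as the weight and which is used for the inverse weighting. The following sketch assumes the weighted moment function takes the form $g(X_i)\,(T_i Y_i / p(X_i) - (1 - T_i) Y_i / (1 - p(X_i)) - \tau)$, as in the weighted average treatment effect of Hirano et al. (2003), together with a hypothetical logit design; the helper name `weighted_ipw` and the sieve choice (four Legendre terms) are illustrative, not the author's.

```python
import numpy as np
from numpy.polynomial.legendre import legvander


def logistic(h):
    return 1.0 / (1.0 + np.exp(-h))


def fit_logit(R, t, n_iter=50, tol=1e-10):
    """Logit ML by Newton-Raphson on the regressor matrix R."""
    coef = np.zeros(R.shape[1])
    for _ in range(n_iter):
        p = logistic(R @ coef)
        step = np.linalg.solve(R.T @ (R * (p * (1.0 - p))[:, None]), R.T @ (t - p))
        coef += step
        if np.max(np.abs(step)) < tol:
            break
    return coef


def weighted_ipw(y, t, g, p):
    """Solve sum_i g_i * (t_i y_i / p_i - (1 - t_i) y_i / (1 - p_i) - tau) = 0 for tau."""
    return np.sum(g * (t * y / p - (1.0 - t) * y / (1.0 - p))) / np.sum(g)


# Hypothetical design with heterogeneous effect tau(x) = 1 + x.
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-2.0, 2.0, n)
p_true = logistic(x)
t = (rng.uniform(size=n) < p_true).astype(float)
y = np.where(t == 1.0, 1.0 + 2.0 * x, x) + rng.normal(size=n)

R_par = legvander(x / 2.0, 1)                 # correctly specified parametric logit
R_np = legvander(x / 2.0, 3)                  # sieve logit with four terms
p_par = logistic(R_par @ fit_logit(R_par, t))
p_np = logistic(R_np @ fit_logit(R_np, t))

att_par = weighted_ipw(y, t, g=p_par, p=p_par)   # parametric score everywhere
att_np = weighted_ipw(y, t, g=p_np, p=p_np)      # nonparametric score everywhere
att_mix = weighted_ipw(y, t, g=p_par, p=p_np)    # parametric weight, nonparametric inverse weighting
att_ref = np.sum(p_true * (1.0 + x)) / np.sum(p_true)  # reference value from the true tau(x)
print(f"reference {att_ref:.3f}: parametric {att_par:.3f}, "
      f"nonparametric {att_np:.3f}, mixed {att_mix:.3f}")
```

All three estimates should be close to the reference value in a sample of this size. Ranking them by efficiency would require repeating the exercise over many replications, and, as the text notes, the first two cannot be uniformly ranked in general.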

Conclusions
One can obtain efficient estimation of average treatment effects by the method of inverse propensity score weighting based on the estimated propensity score, rather than the true one, even when the true one is known. From the literature, we can infer that, even when the true propensity score is a parametric function, we still need to estimate the propensity score nonparametrically to achieve the efficiency. We formalize this argument and further identify the source of the efficiency loss due to parametric estimation of the propensity score. We also provide an intuition as to why this overfitting is necessary. The idea is that the nonparametric estimation of the propensity score plays two roles in the treatment effect estimation. It first replaces the true propensity score, and second it approximates the conditional expectation of the derivative of the moment condition for the treatment effect with respect to the propensity score. The parametric propensity score estimation can achieve the first but cannot achieve the second when some of covariates are continuous. This also suggests that the finite sample performance of the treatment effect estimator, using the imputation method based on the estimated propensity score, may depend not only on how precisely the propensity score is estimated, but also how well the conditional expectation of the derivative of the moment condition is approximated by the same approximating sieves or regressors that are used to estimate the propensity score.