## 1. Introduction

Estimating treatment effects of a binary treatment or a policy has been one of the most important topics in evaluation studies. In estimating treatment effects, a subject’s selection into a treatment may contaminate the estimate, and two approaches are popularly used in the literature to remove the bias due to this sample selection. One is the regression-based control function method (see, e.g., Rubin (1973); Hahn (1998); and Imbens (2004)) and the other is the matching method (see, e.g., Rubin and Thomas (1996); Heckman et al. (1998); and Abadie and Imbens (2002, 2006)). When there are many covariates or pre-treatment variables that govern this selection, the matching method may be less practical. In this case, following Rosenbaum and Rubin (1983, 1984), we can control for the sample selection bias using the propensity score to reduce the dimensionality problem.

Although adjusting for sub-population differences in the propensity score removes the bias, the resulting treatment effect estimators may not all be efficient. Hahn (1998) shows that, using a nonparametric series estimation of the propensity score, we can achieve the efficiency bound. Hirano et al. (2003) also develop an efficient estimation of average treatment effects using the logit series estimation of the propensity score, overcoming some practical limitations of the series estimator of Hahn (1998) (see also Li et al. (2009)).

Based on these studies, empirical researchers are encouraged to estimate treatment effects using the imputation method of the inverse weighting of the estimated propensity score. However, a nonparametric method of estimating the propensity score may require a large data set, especially when covariates or pre-treatment variables are high dimensional. For this reason, many empirical researchers estimate the propensity score parametrically using the probit or logit specification, given the idea that these parametric models are still good approximations to the true propensity score. Also, studies in the statistics literature, such as Rosenbaum (1987); Rubin and Thomas (1996); and Robins et al. (1995), show that using parametric estimates of the propensity score can improve the efficiency of the treatment effect estimation.

However, from the existing literature (Hahn (1998); Hirano et al. (2003); Kang and Schafer (2007); and Tan (2007)), we can infer that, even when the true propensity score is parametric and the parametric estimator is consistent, we still need to estimate the propensity score nonparametrically to achieve full efficiency. The first contribution of this paper is to formalize this efficiency argument and confirm that parametric estimation of the propensity score yields an inefficient estimator of the average treatment effect if some or all of the covariates are continuous. The second contribution of this paper, which is more interesting, is to identify the source of this inefficiency, and formally characterize the efficiency loss due to parametric estimation of the propensity score.

In deriving our results, we find that a nonparametric sieve estimation of the propensity score has two roles in the efficient estimation of average treatment effects. First, it approximates the true propensity score, and second, it approximates the conditional expectation of the derivative of the moment function for the treatment effect with respect to the propensity score. We show that parametric estimation of the propensity score accomplishes the first role when the true propensity score is indeed parametric, but cannot achieve the second role if some of the covariates are continuous. In other words, consistent estimation of the propensity score alone is not enough to obtain the efficient estimation of average treatment effects.

This finding also suggests that the performance of the treatment effect estimator in finite samples may depend not only on how precisely the propensity score is estimated, but also on how well the conditional expectation of the derivative of the moment condition is approximated by the same sieve basis functions or regressors used to estimate the propensity score. We note that the literature has focused on the former, while the latter has been somewhat ignored. Moreover, because these two objects are quite different in nature, a sieve approximation solely targeted at the propensity score does not necessarily approximate the conditional expectation of the derivative of the moment function well in finite samples.

The rest of the paper is organized as follows. Section 2 outlines the average treatment effect estimation using the inverse propensity score weighting. Section 3 examines the role of the nonparametric propensity score estimation when the true one is parametric. We also provide an illustrative example. We conclude in Section 4. Some technical details are provided in Appendix A.

## 2. Estimation of Average Treatment Effect

In this section, we review estimation of average treatment effects using the inverse propensity score weighting in a standard setting. For this purpose, suppose we have a random sample of $n$ individuals, some of whom received a treatment while others did not. Let ${T}_{i}$ denote the treatment status with ${T}_{i}=1$ if individual $i$ receives the treatment and ${T}_{i}=0$ otherwise. Using the same notation as Rubin (1973), denote ${Y}_{i}(0)$ as the potential outcome for each individual $i$ under control and ${Y}_{i}(1)$ as the outcome under treatment. We observe ${T}_{i}$, ${X}_{i}$, and ${Y}_{i}={T}_{i}{Y}_{i}(1)+(1-{T}_{i}){Y}_{i}(0)$ where ${X}_{i}$ is a vector of observable covariates of the individual. Here, we have a fundamental missing data problem since, depending on the treatment status, we observe either ${Y}_{i}(1)$ or ${Y}_{i}(0)$ but not both for each individual.

The parameter of interest is the population average treatment effect defined as

$${\tau}^{*}\equiv E\left[{Y}_{i}(1)-{Y}_{i}(0)\right].$$

If we had both ${Y}_{i}(1)$ and ${Y}_{i}(0)$ for all individuals, we could simply estimate the average treatment effect using its sample analogue, but this is not feasible due to the missing data problem. One important way to circumvent this missing data problem in the literature is to use the imputation method based on the propensity score, motivated by Rosenbaum and Rubin (1983, 1984). The propensity score of an individual whose observable characteristics ${X}_{i}$ equal $x$ is defined by

$${p}^{*}(x)\equiv \Pr ({T}_{i}=1|{X}_{i}=x)=E\left[{T}_{i}|{X}_{i}=x\right].$$

According to Rosenbaum and Rubin, if (i) there exist covariates ${X}_{i}$ such that the treatment status ${T}_{i}$ is ignorable given ${X}_{i}$ and (ii) $0<{p}^{*}(x)<1$ for all $x\in \mathcal{X}\equiv \mathrm{Supp}(X)$, then ${T}_{i}$ and $({Y}_{i}(0),{Y}_{i}(1))$ are independent of each other given the propensity score. This implies that

This allows us to estimate the treatment effect using a sample analogue of Equation (1). To be precise, define

where the $\widehat{E}[\cdot |\cdot ]$’s denote suitable conditional mean function estimators. Then, we can construct complete data using the imputation such that ${\widehat{Y}}_{i}(1)\equiv {T}_{i}{Y}_{i}+(1-{T}_{i}){\widehat{\beta}}_{1}({X}_{i})$ and ${\widehat{Y}}_{i}(0)\equiv {T}_{i}{\widehat{\beta}}_{0}({X}_{i})+(1-{T}_{i}){Y}_{i}(0)$, and we can estimate the average treatment effect as ${\widehat{\tau}}_{1}=\frac{1}{n}{\sum}_{i=1}^{n}({\widehat{Y}}_{i}(1)-{\widehat{Y}}_{i}(0))$ or alternatively as ${\widehat{\tau}}_{2}=\frac{1}{n}{\sum}_{i=1}^{n}({\widehat{\beta}}_{1}({X}_{i})-{\widehat{\beta}}_{0}({X}_{i})).$ These nonparametric imputation methods were proposed by Hahn (1998), who further shows that these treatment effect estimators achieve the semiparametric efficiency bound.
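As a minimal sketch of how ${\widehat{\tau}}_{1}$ and ${\widehat{\tau}}_{2}$ are assembled, the following Python snippet assumes a single discrete covariate, so that the conditional mean estimators ${\widehat{\beta}}_{1}(x)$ and ${\widehat{\beta}}_{0}(x)$ can be taken to be within-cell sample means of ${Y}_{i}$ among treated and control units, respectively. The function name `imputation_ate` and the toy data are hypothetical and serve only as an illustration.

```python
import numpy as np

def imputation_ate(y, t, x):
    """Imputation estimators tau_1 and tau_2 with cell-mean regressions.

    Assumption (not from the paper): x is discrete and every cell contains
    at least one treated and one control unit.
    """
    beta1 = np.empty(len(y))
    beta0 = np.empty(len(y))
    for v in np.unique(x):
        cell = x == v
        beta1[cell] = y[cell & (t == 1)].mean()  # mean outcome of treated units in the cell
        beta0[cell] = y[cell & (t == 0)].mean()  # mean outcome of control units in the cell
    y1_hat = t * y + (1 - t) * beta1  # impute Y(1) for control units
    y0_hat = t * beta0 + (1 - t) * y  # impute Y(0) for treated units
    tau1 = np.mean(y1_hat - y0_hat)
    tau2 = np.mean(beta1 - beta0)
    return tau1, tau2

# Hypothetical toy data: six units, binary covariate.
y = np.array([3.0, 5.0, 1.0, 10.0, 4.0, 6.0])
t = np.array([1, 1, 0, 1, 0, 0])
x = np.array([0, 0, 0, 1, 1, 1])
tau1, tau2 = imputation_ate(y, t, x)
print(tau1, tau2)
```

In this saturated discrete design the two estimators coincide; with continuous covariates one would replace the cell means with a series or kernel regression.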

Hirano et al. (2003) propose an alternative estimator for which the propensity score is estimated using a logit series estimation, and the propensity score is given by ${p}^{*}(x)=\frac{\exp ({h}_{0}(x))}{1+\exp ({h}_{0}(x))}$ for some unknown function ${h}_{0}(x)$. In the logit series estimation, we approximate ${h}_{0}(x)$ using linear sieves and the estimated propensity score is given by ${\widehat{p}}_{L}(x)=\frac{\exp ({\widehat{h}}_{n}(x))}{1+\exp ({\widehat{h}}_{n}(x))},$ where ${\widehat{h}}_{n}(x)$ denotes the sieve Maximum Likelihood (ML) estimator. The proposed treatment effect estimator is given by ${\widehat{\tau}}_{3}=\frac{1}{n}{\sum}_{i=1}^{n}\left(\frac{{T}_{i}{Y}_{i}}{{\widehat{p}}_{L}({X}_{i})}-\frac{\left(1-{T}_{i}\right){Y}_{i}}{1-{\widehat{p}}_{L}({X}_{i})}\right)$. This estimator also achieves the semiparametric efficiency bound, and improves over the estimator of Hahn (1998) in two practical ways. First, we do not need to estimate the conditional mean functions $\widehat{E}[{T}_{i}{Y}_{i}|{X}_{i}]$ and $\widehat{E}[(1-{T}_{i}){Y}_{i}|{X}_{i}]$. Second, the estimated propensity score lies between zero and one by construction.
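The inverse-weighting estimator ${\widehat{\tau}}_{3}$ can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the implementation of Hirano et al. (2003): it assumes a scalar covariate, a power-series basis with three terms, and a plain Newton iteration for the logit likelihood. The names `poly_basis`, `fit_logit_series`, `ipw_ate` and the simulated design (true effect equal to 2) are hypothetical.

```python
import numpy as np

def poly_basis(x, K):
    """Power-series basis (1, x, ..., x^(K-1)) for a scalar covariate."""
    return np.column_stack([x ** k for k in range(K)])

def fit_logit_series(R, t, n_iter=25):
    """Logit series ML by Newton's method; returns fitted propensity scores."""
    pi = np.zeros(R.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-R @ pi))
        grad = R.T @ (t - p)                          # score of the logit log-likelihood
        hess = (R * (p * (1.0 - p))[:, None]).T @ R   # negative Hessian
        pi = pi + np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-R @ pi))

def ipw_ate(y, t, p_hat):
    """Inverse propensity score weighting estimator of the average treatment effect."""
    return np.mean(t * y / p_hat - (1.0 - t) * y / (1.0 - p_hat))

# Hypothetical simulated design: h_0(x) = 0.5 x and a constant effect of 2.
rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(-1.0, 1.0, n)
p_star = 1.0 / (1.0 + np.exp(-0.5 * x))
t = (rng.random(n) < p_star).astype(float)
y0 = x + rng.normal(size=n)
y1 = 2.0 + x + rng.normal(size=n)
y = t * y1 + (1.0 - t) * y0

p_hat = fit_logit_series(poly_basis(x, 3), t)
tau_hat = ipw_ate(y, t, p_hat)
print(tau_hat)
```

Because the fitted values come from a logit transform, they lie strictly between zero and one, which is the second practical advantage noted above.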

Estimation of average treatment effects using the estimated propensity score with a general link function that includes the logit or probit specification was proposed by Kim (2013). We will use this general setting to argue that the inefficiency of the treatment effect estimate with the estimated parametric propensity score is not specific to a particular functional form assumption like logit or probit. To obtain a sieve ML estimator for the propensity score with a general link function, we assume the true function ${h}_{0}$ belongs to a class of bounded and smooth functions such as a Hölder ball, and let ${p}^{*}(x)=F({h}_{0}(x))$ for a known link function $F(\cdot )$. Then, based on a triangular sequence of orthonormal basis functions such as polynomials or splines, we construct a tensor-product sieve space ${\mathcal{H}}_{n}$ as

where ${\parallel \cdot \parallel}_{{\mathrm{\Lambda}}^{{\gamma}_{1}}}$ denotes a Hölder norm, and we let $K(n)\to \infty $ as $n\to \infty $. The sieve ML estimator is obtained by solving

or equivalently ${\widehat{\pi}}_{K}={\mathrm{argmax}}_{\pi ,{R}^{K}{(X)}^{\prime}\pi \in {\mathcal{H}}_{n}}\frac{1}{n}{\sum}_{i=1}^{n}\log \left\{F{({R}^{K}{({X}_{i})}^{\prime}\pi )}^{{T}_{i}}{\left(1-F({R}^{K}{({X}_{i})}^{\prime}\pi )\right)}^{1-{T}_{i}}\right\}$ such that ${\widehat{h}}_{n}(x)={R}^{K}{(x)}^{\prime}{\widehat{\pi}}_{K}$, and the resulting propensity score estimator becomes $\widehat{p}(x)=F({\widehat{h}}_{n}(x))$.
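The sieve ML step with a general link function can be sketched as follows. This is a minimal illustration, assuming a scalar covariate, an unconstrained power-series basis in place of the sieve space ${\mathcal{H}}_{n}$ (the Hölder-norm restriction is ignored), and Fisher scoring as the optimizer. The function `sieve_ml`, the helper names, and the simulated probit design are hypothetical and are not taken from Kim (2013).

```python
import math
import numpy as np

def norm_cdf(z):
    """Standard normal CDF, vectorized through math.erf."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    z = np.asarray(z, dtype=float)
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def sieve_ml(R, t, F, f, n_iter=30):
    """Maximize sum_i log{F(R_i'pi)^t_i (1 - F(R_i'pi))^(1 - t_i)} by Fisher scoring.

    F is the link function and f its derivative (both supplied by the user).
    """
    pi = np.zeros(R.shape[1])
    for _ in range(n_iter):
        h = R @ pi
        p = np.clip(F(h), 1e-10, 1.0 - 1e-10)   # guard against division by zero
        wgt = f(h) / (p * (1.0 - p))
        score = R.T @ (wgt * (t - p))            # gradient of the log-likelihood
        info = (R * (wgt * f(h))[:, None]).T @ R  # expected information
        pi = pi + np.linalg.solve(info, score)
    return pi

# Hypothetical probit design: h_0(x) = 0.2 + 0.8 x, approximated with K = 3 terms.
rng = np.random.default_rng(2)
n = 5000
x = rng.uniform(-1.0, 1.0, n)
p_star = norm_cdf(0.2 + 0.8 * x)
t = (rng.random(n) < p_star).astype(float)
R = np.column_stack([np.ones(n), x, x ** 2])

pi_hat = sieve_ml(R, t, norm_cdf, norm_pdf)
p_hat = norm_cdf(R @ pi_hat)
mean_abs_err = float(np.mean(np.abs(p_hat - p_star)))
print(pi_hat, mean_abs_err)
```

Passing the logistic CDF and its derivative instead of `norm_cdf` and `norm_pdf` would give the logit series estimator, which is the sense in which the link function is general here.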

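Finally, using the estimated propensity score, we estimate the average treatment effect as

$$\widehat{\tau}=\frac{1}{n}{\sum}_{i=1}^{n}\left(\frac{{T}_{i}{Y}_{i}}{\widehat{p}({X}_{i})}-\frac{(1-{T}_{i}){Y}_{i}}{1-\widehat{p}({X}_{i})}\right).$$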

Define ${\mu}_{t}(x)\equiv E[Y(t)|X=x]$ and ${\sigma}_{t}^{2}(x)\equiv \mathrm{Var}[Y(t)|X=x]$. For the general class of $F(\cdot )$, as long as the function is continuous and monotonic in $h$, Kim (2013) shows that this treatment effect estimator achieves the semiparametric efficiency bound such that

$$\sqrt{n}(\widehat{\tau}-{\tau}^{*})\stackrel{d}{\to}N(0,V),\qquad (3)$$

where ${\tau}^{*}$ denotes the true average treatment effect, $\tau (X)=E[Y(1)-Y(0)|X]$, and

$$V=E\left[{(\tau (X)-{\tau}^{*})}^{2}+\frac{{\sigma}_{1}^{2}(X)}{{p}^{*}(X)}+\frac{{\sigma}_{0}^{2}(X)}{1-{p}^{*}(X)}\right],$$

which is identical to the efficiency bound derived by Hahn (1998). This efficiency result with the general link function is obtained, as in Hirano et al. (2003), following the influence function approach of Newey (1994). To see this, define

where ${\psi}_{p}(\cdot )$ denotes the derivative of the moment function for the treatment effect, $\psi (\cdot )$, with respect to the propensity score $p(\cdot )$, and ${s}_{p}(\cdot )$ denotes its conditional expectation at the true parameter values. The asymptotic variance result of Equation (3) is obtained by showing that the estimator is asymptotically linear with an influence function decomposed into two terms:

The first term in Equation (5) is the influence function when we know the true propensity score ${p}^{*}(\cdot )$, and the second term represents the contribution of the estimated propensity score to the asymptotic distribution of $\widehat{\tau}$. It follows that the asymptotic variance $V$ in Equation (3) is equal to

which yields the result.

## 3. Efficient Estimation When the True Propensity Score Is Parametric

As discussed in the previous section, the efficiency of the treatment effect estimator depends on whether the estimator has the asymptotically linear representation in Equation (5). When the propensity score is estimated using a nonparametric sieve ML, we achieve this representation and hence the efficiency bound. Here, we pose the question of whether we can achieve this asymptotic linear representation if the true propensity score is parametric and is estimated under the correct parametric specification. We confirm that, in this case, the semiparametric efficiency bound is not achieved, as can be inferred from the existing literature. This suggests that, even though we know the true propensity score belongs to a parametric class, we still need to estimate the propensity score by a nonparametric method.

Our intuition behind this result is that the nonparametric sieve estimation of the propensity score plays two roles in the estimation of the treatment effect. First, it approximates the true propensity score, and second, it approximates the conditional expectation of the derivative of the moment condition for the treatment effect with respect to the propensity score. For the purpose of illustration, without loss of generality, suppose ${p}^{*}(x)=\mathrm{\Phi}({x}^{\prime}{\pi}_{0}),$ where $\mathrm{\Phi}(\cdot )$ denotes the standard normal cumulative distribution function (CDF), so the true propensity score is a probit model. We can then estimate ${\pi}_{0}$ by ML, denoted by $\widehat{\pi}$, and obtain the parametric convergence rate such that $\sqrt{n}\left(\widehat{\pi}-{\pi}_{0}\right)={O}_{p}(1)$ and hence ${\sup}_{x\in \mathcal{X}}|\widehat{p}(x)-{p}^{*}(x)|={O}_{p}({n}^{-1/2})$ with $\widehat{p}(x)=\mathrm{\Phi}({x}^{\prime}\widehat{\pi})$.

For ease of notation without losing the main idea, we consider the special case that $Y(0)=0$ with probability one. Define ${\beta}_{0}=E[Y(1)]$ as the average outcome of interest, where $Y(1)$ is missing at random conditional on the covariates $X$. We estimate the average outcome as $\widehat{\beta}=\frac{1}{n}{\sum}_{i=1}^{n}\frac{{Y}_{i}{T}_{i}}{\widehat{p}({X}_{i})}$. For this estimator, following Equation (5), if we can obtain the asymptotic linear representation as

then we will achieve the efficiency bound. To see whether this asymptotic linear representation is attainable with parametric estimation of the propensity score, we decompose $\sqrt{n}(\widehat{\beta}-{\beta}_{0})$ as

where ${p}^{*}(x)=\mathrm{\Phi}({x}^{\prime}{\pi}_{0})$, $\widehat{p}(x)=\mathrm{\Phi}({x}^{\prime}\widehat{\pi})$, ${F}_{0}(\cdot )$ denotes the distribution function of $X$, $W=E\left[\frac{\varphi {({X}_{i}^{\prime}{\pi}_{0})}^{2}}{{p}^{*}({X}_{i})(1-{p}^{*}({X}_{i}))}{X}_{i}{X}_{i}^{\prime}\right]$ with $\varphi (\cdot )$ being the standard normal density function, and

If we can show that all terms (7)–(10) are ${o}_{p}(1)$, we then obtain the desired result of Equation (6). Following the steps in Hirano et al. (2003) or Kim (2013), it is straightforward to bound the terms (7)–(9) as ${o}_{p}(1)$. We focus on the term (10), from which we derive our main finding.

By inspecting ${\delta}^{*}(x)$ and ${\delta}_{0}(x)$, we see that ${\delta}^{*}(x)$ is the linear projection of ${\delta}_{0}(x)$ on $\frac{\varphi ({x}^{\prime}{\pi}_{0})x}{\sqrt{{p}^{*}(x)(1-{p}^{*}(x))}}$. In other words,

where the projection coefficient is given by ${\theta}_{0}^{\prime}\equiv -{\int}_{\mathcal{X}}\frac{{\mu}_{1}(z)}{{p}^{*}(z)}\varphi ({z}^{\prime}{\pi}_{0}){z}^{\prime}d{F}_{0}(z){W}^{-1}$. Therefore, unless ${\delta}_{0}(x)$ is indeed linear in $\frac{\varphi ({x}^{\prime}{\pi}_{0})x}{\sqrt{{p}^{*}(x)(1-{p}^{*}(x))}}$, we will have ${\inf}_{x\in \mathcal{X}}\left|{\delta}^{*}(x)-{\delta}_{0}(x)\right|>C>0$ for some positive constant $C$. It follows that

Therefore, the term (10) remains ${O}_{p}(1)$ and contributes to the asymptotic distribution of the treatment effect estimator. In other words, the asymptotic linear representation of Equation (6) is not obtained with parametric estimation of the propensity score in general, even when the true propensity score is parametric. In the Appendix, we derive the asymptotic variance of the treatment effect estimator with the estimated parametric propensity score, and characterize the efficiency loss due to parametric estimation of the propensity score. In particular, we show that this efficiency loss is exactly given by Equation (13).
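The projection argument can be checked numerically. The sketch below assumes a probit propensity score with a scalar covariate and takes ${\delta}_{0}(x)=-{\mu}_{1}(x)\sqrt{(1-{p}^{*}(x))/{p}^{*}(x)}$, which is the form implied by the projection coefficient ${\theta}_{0}$ above rather than a definition quoted from the text. The function `projection_gap`, the helper names, the choice ${\mu}_{1}(x)=1+{x}^{2}$, and ${\pi}_{0}=(0,1)^{\prime}$ are all hypothetical. The snippet computes the mean squared distance between ${\delta}_{0}$ and its linear projection ${\delta}^{*}$ for a covariate that is (approximately) uniform on $[-1,1]$ and for a binary covariate.

```python
import math
import numpy as np

def norm_cdf(z):
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def norm_pdf(z):
    z = np.asarray(z, dtype=float)
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def projection_gap(x, w, pi0, mu1):
    """E[(delta_star(X) - delta_0(X))^2] for X with support x and probabilities w.

    delta_0 is taken to be -mu1 * sqrt((1 - p) / p); this form is inferred
    from the projection coefficient in the text and is an assumption here.
    """
    X = np.column_stack([np.ones_like(x), x])          # regressors (1, x)
    idx = X @ pi0
    p = norm_cdf(idx)
    r = (norm_pdf(idx) / np.sqrt(p * (1.0 - p)))[:, None] * X
    delta0 = -mu1(x) * np.sqrt((1.0 - p) / p)
    W = (r * w[:, None]).T @ r                          # E[r r'], the matrix W
    theta = np.linalg.solve(W, (r * w[:, None]).T @ delta0)
    delta_star = r @ theta                              # linear projection of delta_0 on r
    return float(np.sum(w * (delta_star - delta0) ** 2))

mu1 = lambda x: 1.0 + x ** 2
pi0 = np.array([0.0, 1.0])

# Continuous covariate: midpoint grid approximating the uniform law on [-1, 1].
m = 2000
x_cont = -1.0 + (np.arange(m) + 0.5) * (2.0 / m)
w_cont = np.full(m, 1.0 / m)
gap_cont = projection_gap(x_cont, w_cont, pi0, mu1)

# Binary covariate with equal probabilities.
x_bin = np.array([0.0, 1.0])
w_bin = np.array([0.5, 0.5])
gap_bin = projection_gap(x_bin, w_bin, pi0, mu1)

print(gap_cont, gap_bin)
```

Under these assumptions the gap is strictly positive for the continuous covariate, whereas it vanishes up to rounding error for the binary covariate, since two regressors span every function on a two-point support.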

As the key difference, in the treatment effect estimation using the nonparametric sieve estimation of the propensity score like Equation (2), it can be shown that when ${\delta}_{0}(x)$ is $t$-times continuously differentiable, we have

where $K$ denotes the number of approximating sieve terms used in ${\delta}^{*}(x)$ (see Hirano et al. (2003) or Kim (2013)). Therefore, we can bound the term (10) as ${o}_{p}(1)$ for some large enough $K$. This is because Equation (12) becomes

when the sieve estimation like Equation (2) is used to estimate the propensity score, where ${R}^{K}(x)$ denotes a vector of approximating basis functions, and hence the bound (14) is obtained from the approximation theory of sieves for a class of smooth functions such as a Hölder class (see, e.g., Chen (2007)). We, however, note that, because ${p}^{*}(x)$ and ${\delta}_{0}(x)$ are quite different in nature, the sieve approximation used to estimate the propensity score does not necessarily approximate the latter well in finite samples, which may contribute to the inefficiency of the treatment effect estimation.

Finally, by inspecting Equations (4) and (5) for the case $Y(0)=0$ along with Equation (10), note that the term ${\delta}_{0}(x)$ is related to the conditional expectation of the derivative of the moment function with respect to the propensity score. This implies that the nonparametric sieve estimation of the propensity score plays two roles in the estimation of the treatment effect. It first approximates the true propensity score, and second, it approximates the conditional expectation of the derivative of the moment condition with respect to the propensity score. The parametric propensity score estimation can accomplish the first role if the true one is parametric, but cannot achieve the second when some of the covariates are continuous.

The asymptotic variance of a treatment effect estimator using parametric estimation of the propensity score can also depend on which parametric estimator is being used in practice. In this regard, given a parametric model of the propensity score, one can directly derive the asymptotic variance of the treatment effect estimator using the estimated parametric propensity score by combining two moments as a sequential estimation problem (see, e.g., Newey (1984)). The first moment is given by, e.g., the first order condition of the population ML objective function of the propensity score estimation such as the logit or probit ML, and the second moment is given by the moment condition to estimate the treatment effect, $E\left[\psi ({Z}_{i},\tau ,p({X}_{i}))\right]=0$, defined in Equation (4). We can then directly compare the asymptotic variance of the treatment effect estimator resulting from using a specific parametric estimator of the propensity score to the semiparametric efficiency bound, instead of deriving the inefficiency term from Equation (13). This joint moments approach for parametric estimation of the propensity score also allows us to explicitly derive the efficiency loss due to a specific parametric estimator of the propensity score, and hence compare different parametric models of the propensity score in terms of efficiency.

#### 3.1. Reconsidering the Simple Example in Hirano et al. (2003)

Hirano et al. (2003) present a simple example with a binary covariate, illustrating that, by weighting by the inverse of the propensity score estimate rather than the true one, we can improve the efficiency and indeed achieve the efficiency bound. Here, we reproduce the example and provide an intuition for why the efficiency bound is achieved in this case, in view of the results from the previous section. Consider a simple problem of estimating the population average of a variable $Y$, ${\beta}_{0}=E[Y]$, given a random sample of size $n$ of the triple $({T}_{i},{X}_{i},{T}_{i}\cdot {Y}_{i})$. Therefore, ${T}_{i}$ and ${X}_{i}$ are observed for all units in the sample, but ${Y}_{i}$ is only observed if ${T}_{i}=1$. Denote $\mu (x)=E[Y|X=x]$ and ${\sigma}^{2}(x)=\mathrm{Var}(Y|X=x)$.

Now let ${N}_{tx}$ denote the number of observations with ${T}_{i}=t$ and ${X}_{i}=x$, for $t,x\in \{0,1\}$. Further assume that the true selection probability is $p(x)={\pi}_{0}+x({\pi}_{1}-{\pi}_{0})$. The estimated selection probability is then

$$\widehat{p}(x)=\frac{{N}_{1x}}{{N}_{0x}+{N}_{1x}}.\qquad (15)$$

The true weights estimator is given by ${\widehat{\beta}}_{tw}=\frac{1}{n}{\sum}_{i=1}^{n}\frac{{Y}_{i}{T}_{i}}{p({X}_{i})}$ while the estimated weights estimator is then ${\widehat{\beta}}_{ew}=\frac{1}{n}{\sum}_{i=1}^{n}\frac{{Y}_{i}{T}_{i}}{\widehat{p}({X}_{i})}$. Hirano et al. (2003) show that ${\widehat{\beta}}_{ew}$ is more efficient than ${\widehat{\beta}}_{tw}$, and ${\widehat{\beta}}_{ew}$ achieves the efficiency bound. Interestingly, one can easily see that $\widehat{p}(x)$ in Equation (15) is a nonparametric estimator of $p(x)$, and is also a parametric MLE of $p(x)$ since we can write $\widehat{p}(x)={\widehat{\pi}}_{0}+x({\widehat{\pi}}_{1}-{\widehat{\pi}}_{0})$ with ${\widehat{\pi}}_{0}={N}_{10}/({N}_{00}+{N}_{10})$ and ${\widehat{\pi}}_{1}={N}_{11}/({N}_{01}+{N}_{11})$.
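The efficiency ranking of the two estimators in this binary example can be illustrated with a small Monte Carlo experiment. The sketch below is only an illustration under an assumed design: the selection probabilities $p(0)=0.3$ and $p(1)=0.6$, the outcome equation $Y=1+2X+\epsilon $ with standard normal $\epsilon $ (so that ${\beta}_{0}=2$), the sample size, and the function name `one_draw` are hypothetical choices, not values from Hirano et al. (2003).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 2000
p_true = np.array([0.3, 0.6])  # hypothetical p(0) and p(1)

def one_draw():
    """Simulate one sample and return (beta_tw, beta_ew)."""
    x = rng.integers(0, 2, size=n)                  # binary covariate, P(X = 1) = 1/2
    t = (rng.random(n) < p_true[x]).astype(float)   # selection indicator
    y = 1.0 + 2.0 * x + rng.normal(size=n)          # E[Y] = 2
    # Cell frequencies N_1x / (N_0x + N_1x); with n = 500 empty cells are
    # practically impossible in this design, so they are not handled here.
    p_hat = np.array([t[x == k].mean() for k in (0, 1)])
    beta_tw = np.mean(y * t / p_true[x])            # true weights
    beta_ew = np.mean(y * t / p_hat[x])             # estimated weights
    return beta_tw, beta_ew

draws = np.array([one_draw() for _ in range(reps)])
var_tw, var_ew = draws.var(axis=0)
mean_tw, mean_ew = draws.mean(axis=0)
print(mean_tw, mean_ew, var_tw, var_ew)
```

In this design both estimators are centered near 2, and the simulated variance of the estimated weights estimator is noticeably smaller than that of the true weights estimator, in line with the result stated above.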

In this example, for the corresponding terms of ${\delta}^{*}(x)$ and ${\delta}_{0}(x)$ in Equation (10), we show below that indeed

$${\delta}^{*}(x)={\delta}_{0}(x)\qquad (16)$$

for all $x$, and hence the efficiency bound is achieved for the estimator ${\widehat{\beta}}_{ew}$ because the asymptotic linear representation like Equation (6) is obtained (i.e., the term (14) is simply equal to zero in this case). To derive the result, consider the following terms corresponding to ${\delta}^{*}(x)$ and ${\delta}_{0}(x)$ in Equation (10) for the stochastic expansion of ${\widehat{\beta}}_{ew}$. Let

where $q(\cdot )$ denotes the probability mass function of $X$.

By investigating ${\delta}^{*}(x)$ and ${\delta}_{0}(x)$, we can see that ${\delta}^{*}(x)$ is the linear projection of ${\delta}_{0}(x)$ on $\frac{\left(1-x,x\right)}{\sqrt{p(x)(1-p(x))}}$. In other words, ${\delta}^{*}(x)=\frac{1-x}{\sqrt{p(x)(1-p(x))}}{\theta}_{0}+\frac{x}{\sqrt{p(x)(1-p(x))}}{\theta}_{1}$ for some constants ${\theta}_{0}$ and ${\theta}_{1}$ that are determined by the linear projection. Note that we have

if ${\theta}_{x}=-\mu (x)(1-p(x))$ for $x\in \{0,1\}$. Indeed, from the definition of ${\delta}^{*}(x)$, we find

and therefore the efficiency result follows.

This example clearly illustrates why a condition like (16) is crucial to achieve the efficiency bound. It suggests that, when the covariates are multinomial, we can always satisfy a condition like (16) since the parametric ML estimation becomes equivalent to the nonparametric ML estimation. Therefore, we can achieve the efficiency bound. However, when the covariates or a subset of the covariates are continuous, using the parametric propensity score estimation cannot achieve the efficiency bound even though the true one is parametric. This also suggests that the efficiency loss due to using the parametric propensity score estimator is attributable to the fact that some covariates are continuous.

#### 3.2. Generalization to Estimating the Weighted Average Treatment Effect

We generalize the efficiency comparison between treatment effect estimators using nonparametric or parametric estimation of the propensity score to the weighted average treatment effect, ${\tau}_{wate}^{*}$, defined as

for a known weight function $g(x)$. We estimate ${\tau}_{wate}^{*}$ using the moment condition

that yields the estimator as

given an estimator of the propensity score $\widehat{p}(x)$.

Because the function $g(x)$ is known and only appears as a weight in the moment function (17), following the same line of argument as for the average treatment effect, one can obtain the asymptotic linear representation of ${\widehat{\tau}}_{wate}$ using the nonparametric propensity score estimator (2) in Equation (18) as

where ${\psi}_{p}({Z}_{i},\tau ,p({X}_{i}),g({X}_{i}))=-g({X}_{i})\left(\frac{{Y}_{i}{T}_{i}}{p{({X}_{i})}^{2}}+\frac{{Y}_{i}(1-{T}_{i})}{{(1-p({X}_{i}))}^{2}}\right)$ and ${s}_{p}({X}_{i})=E[{\psi}_{p}(\cdot )|{X}_{i}]$. Therefore, the semiparametric efficiency bound is achieved for the weighted average treatment effect estimator ${\widehat{\tau}}_{wate}$ using the nonparametric propensity score estimator (see Hirano et al. (2003)). On the other hand, for the parametric propensity score estimator, we can derive an inefficiency term similar to Equation (13), the inefficiency term derived for the average treatment effect, as

and therefore a similar inefficiency result holds for ${\widehat{\tau}}_{wate}$ using the parametric propensity score estimator.

Note that, under the unconfoundedness assumption (Rosenbaum and Rubin (1983, 1984)), with the weight function $g(x)$ being equal to the true propensity score ${p}^{*}(x)$, the weighted average treatment effect becomes the average treatment effect for the treated,

Based on this equivalence, ${\tau}_{treated}^{*}$ can be estimated using the moment condition

by replacing $g(x)$ with $p(x)$. However, an efficiency comparison between treatment effect estimators for ${\tau}_{treated}^{*}$ using the nonparametric or parametric propensity score estimator is more complicated because, in this case, the propensity score has two roles in the moment function. One is the inverse weighting to control for the self-selection and the other is the weighting function in place of $g(x)$. To see this, let $\widehat{p}(x)$ and ${\widehat{p}}^{*}(x)$ denote the nonparametric and the correctly specified parametric estimator of the propensity score, respectively. Then, we can consider three alternative estimators for the average treatment effect for the treated. One is using the parametric propensity score ${\widehat{p}}^{*}(x)$ everywhere and solving

the second one is using the nonparametric propensity score $\widehat{p}(x)$ everywhere and solving

and the last one is using the parametric propensity score ${\widehat{p}}^{*}(x)$ in place of $g(x)$ while using the nonparametric propensity score $\widehat{p}(x)$ for the inverse weighting and solving

From the efficiency argument of Hahn (1998) and Hirano et al. (2003) when the true propensity score is known, one can conjecture that the treatment effect estimator that solves Equation (21) will be more efficient than the other estimators that solve Equations (19) and (20), respectively. However, in terms of efficiency, the two estimators solving Equations (19) and (20) (or other variations) cannot be uniformly ranked in general, and studying these alternative estimators is beyond the scope of this paper.