Abstract
It is well known that efficient estimation of average treatment effects can be obtained by inverse propensity score weighting using the estimated propensity score, even when the true one is known. When the true propensity score is unknown but parametric, the literature suggests that nonparametric propensity score estimation is still needed to achieve efficiency. We formalize this argument and identify the source of the efficiency loss arising from parametric estimation of the propensity score. We also provide intuition for why this overfitting is necessary. Our finding suggests that, even when the true propensity score is known to belong to a parametric class, it should still be estimated by a nonparametric method in applications.
JEL Classification: C14; C18; C21
1. Introduction
Estimating the effect of a binary treatment or a policy has been one of the most important topics in evaluation studies. In estimating treatment effects, a subject’s selection into treatment may contaminate the estimate, and two approaches are widely used in the literature to remove the bias due to this sample selection. One is the regression-based control function method (see, e.g.,  ();  (); and  ()) and the other is the matching method (see, e.g.,  ();  (); and  (, )). When many covariates or pre-treatment variables govern this selection, the matching method may be less practical. In this case, following Rosenbaum and Rubin (1983, 1984), we can control for the sample selection bias using the propensity score to reduce the dimensionality problem.
Although adjusting for sub-population differences in the propensity score removes the bias, the resulting treatment effect estimators may not all be efficient. Hahn (1998) shows that, using a nonparametric series estimation of the propensity score, we can achieve the efficiency bound. Hirano et al. (2003) also develop an efficient estimator of average treatment effects using the logit series estimation of the propensity score, overcoming some practical limitations of Hahn (1998)’s series estimator (see also  ()).
Based on these studies, empirical researchers are encouraged to estimate treatment effects using the imputation method based on inverse weighting by the estimated propensity score. However, nonparametric estimation of the propensity score may require a large data set, especially when the covariates or pre-treatment variables are high dimensional. For this reason, many empirical researchers estimate the propensity score parametrically using a probit or logit specification, on the grounds that these parametric models are still good approximations to the true propensity score. Studies in the statistics literature such as  ();  (); and  () also show that using parametric estimates of the propensity score can improve the efficiency of treatment effect estimation.
However, from the existing literature ( ();  ();  (); and  ()), we can infer that, even when the true propensity score is parametric and the parametric estimator is consistent, we still need to estimate the propensity score nonparametrically to achieve full efficiency. The first contribution of this paper is to formalize this efficiency argument and confirm that parametric estimation of the propensity score yields an inefficient estimator of the average treatment effect if some or all of the covariates are continuous.1 The second, and more interesting, contribution is to identify the source of this inefficiency and formally characterize the efficiency loss due to parametric estimation of the propensity score.
We find that nonparametric sieve estimation of the propensity score plays two roles in the efficient estimation of average treatment effects. First, it approximates the true propensity score. Second, it approximates the conditional expectation of the derivative of the moment function for the treatment effect with respect to the propensity score. We show that parametric estimation of the propensity score accomplishes the first role when the true propensity score is indeed parametric, but cannot achieve the second if some of the covariates are continuous. In other words, consistent estimation of the propensity score alone is not enough to obtain efficient estimation of average treatment effects.
This finding also suggests that the performance of the treatment effect estimator in finite samples may depend not only on how precisely the propensity score is estimated, but also on how well the conditional expectation of the derivative of the moment condition is approximated by the same sieve basis functions or regressors used to estimate the propensity score. The literature has focused on the former, while the latter has been somewhat ignored. Moreover, because these two objects are quite different in nature, a sieve approximation targeted solely at the propensity score does not necessarily approximate the conditional expectation of the derivative of the moment function well in finite samples.
The rest of the paper is organized as follows. Section 2 outlines the average treatment effect estimation using the inverse propensity score weighting. Section 3 examines the role of the nonparametric propensity score estimation when the true one is parametric. We also provide an illustrative example. We conclude in Section 4. Some technical details are provided in Appendix A.
2. Estimation of Average Treatment Effect
In this section, we review estimation of average treatment effects using inverse propensity score weighting in a standard setting. Suppose we have a random sample of n individuals, some of whom received a treatment while others did not. Let $D_i$ denote the treatment status, with $D_i = 1$ if individual i receives the treatment and $D_i = 0$ otherwise. Using the same notation as  (), denote $Y_{0i}$ as the potential outcome for individual i under control and $Y_{1i}$ as the outcome under treatment. We observe $D_i$, $Y_i \equiv D_i Y_{1i} + (1 - D_i) Y_{0i}$, and $X_i$, where $X_i$ is a vector of observable covariates of the individual. Here, we have a fundamental missing data problem since, depending on the treatment status, we observe either $Y_{0i}$ or $Y_{1i}$ but not both for each individual.
The parameter of interest is the population average treatment effect defined as
$$\tau \equiv E[Y_{1i} - Y_{0i}].$$
If we had both $Y_{0i}$ and $Y_{1i}$ for all individuals, we could simply estimate the average treatment effect using its sample analogue, but this is not feasible due to the missing data problem. One important way to circumvent this missing data problem in the literature is the imputation method based on the propensity score, motivated by Rosenbaum and Rubin (1983, 1984). The propensity score of an individual whose observable characteristics $X_i$ equal $x$ is defined by
$$p(x) \equiv \Pr(D_i = 1 \mid X_i = x) = E[D_i \mid X_i = x].$$
According to Rosenbaum and Rubin, if (i) there exist covariates $X_i$ such that the treatment status $D_i$ is ignorable given $X_i$ and (ii) $0 < p(x) < 1$ for all $x$, then $D_i$ and $(Y_{0i}, Y_{1i})$ are independent of each other given the propensity score. This implies that
      
      
        
      
      
      
      
    This allows us to estimate the treatment effect using a sample analogue of Equation (1). To be precise, define
      
      
        
      
      
      
      
    
where ’s denote suitable conditional mean function estimators. Then, we can construct complete data by imputation such that  and , and we can estimate the average treatment effect as  or alternatively as . These nonparametric imputation methods were proposed by Hahn (1998), who further shows that these treatment effect estimators achieve the semiparametric efficiency bound.2
Hirano et al. (2003) propose an alternative estimator for which the propensity score is estimated using a logit series estimation, and the propensity score is given by  for some unknown function . In the logit series estimation, we approximate  using linear sieves and the estimated propensity score is given by  where  denotes the sieve Maximum Likelihood (ML) estimator.3 The proposed treatment effect estimator is given by . This estimator also achieves the semiparametric efficiency bound, and improves on Hahn (1998)’s estimator in two practical ways. First, we do not need to estimate the conditional mean functions of  and . Second, the estimated propensity score lies between zero and one by construction.
Estimation of average treatment effects using the estimated propensity score with a general link function that includes the logit or probit specification was proposed by  (). We will use this general setting to argue that the inefficiency of the treatment effect estimate with the estimated parametric propensity score is not specific to a particular functional form assumption like logit or probit. To obtain a sieve ML estimator for the propensity score with a general link function, we assume the true function  belongs to a class of bounded and smooth functions such as a Hölder ball, and let  for a known link function .4
Then, based on a triangular sequence of orthonormal basis functions such as polynomials or splines, we construct a tensor-product sieve space  as
      
      
        
      
      
      
      
    
      where  denotes a Hölder norm, and we let  as . The sieve ML estimator is obtained by solving
      
      
        
      
      
      
      
    
      or equivalently  such that , and the resulting propensity score estimator becomes .
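The first-order condition of the sieve ML problem can be illustrated numerically. The sketch below is a minimal pure-Python example under stated assumptions, not the paper's implementation: it assumes a logit link, a scalar covariate, and a power-series basis, and the helper names (`poly_basis`, `fit_logit`, `score_sums`) are hypothetical. For the logit link, the fitted residual $D_i - \hat{p}(X_i)$ is orthogonal in the sample to every basis term that was included, so a richer basis imposes more orthogonality conditions than a two-term parametric fit does. For other links, such as probit, the analogous condition carries an additional weight.

```python
import math
import random
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic CDF."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    k = len(b)
    m = [a[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (m[r][k] - sum(m[r][j] * x[j] for j in range(r + 1, k))) / m[r][r]
    return x


def poly_basis(x: float, k: int) -> List[float]:
    """First k power-series terms (1, x, ..., x**(k-1)) for a scalar covariate."""
    return [x ** j for j in range(k)]


def fit_logit(basis: List[List[float]], d: List[int], iters: int = 50) -> List[float]:
    """Maximise the logit log-likelihood over basis coefficients by Newton steps."""
    k = len(basis[0])
    coef = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for r, di in zip(basis, d):
            p = sigmoid(sum(c * v for c, v in zip(coef, r)))
            for a in range(k):
                grad[a] += (di - p) * r[a]
                for b in range(k):
                    hess[a][b] += p * (1.0 - p) * r[a] * r[b]
        step = solve(hess, grad)
        coef = [c + s for c, s in zip(coef, step)]
        if max(abs(s) for s in step) < 1e-12:
            break
    return coef


def score_sums(basis: List[List[float]], d: List[int], coef: List[float]) -> List[float]:
    """Sample score sum_i (D_i - p_hat_i) * r_k(X_i) for each basis term k."""
    p_hat = [sigmoid(sum(c * v for c, v in zip(coef, r))) for r in basis]
    return [sum((di - pi) * r[j] for di, pi, r in zip(d, p_hat, basis))
            for j in range(len(coef))]


if __name__ == "__main__":
    random.seed(0)
    x = [random.uniform(-1.0, 1.0) for _ in range(400)]
    d = [1 if random.random() < sigmoid(0.3 + xi) else 0 for xi in x]
    for k in (2, 4):  # k = 2 is the parametric logit in (1, x)
        basis = [poly_basis(xi, k) for xi in x]
        print(k, score_sums(basis, d, fit_logit(basis, d)))
```

Each printed score sum should be numerically close to zero. The point of the comparison is only that the fit with four terms satisfies four such conditions while the fit with two terms satisfies two.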
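Finally, using the estimated propensity score, we estimate the average treatment effect as
$$\hat{\tau} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{D_i Y_i}{\hat{p}(X_i)} - \frac{(1 - D_i) Y_i}{1 - \hat{p}(X_i)} \right].$$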
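As a concrete reading of the inverse propensity score weighting estimator, the following minimal sketch computes the sample average of $D_i Y_i / \hat{p}(X_i) - (1 - D_i) Y_i / (1 - \hat{p}(X_i))$ for given fitted propensity scores. The function name `ipw_ate` and the toy data are assumptions made for illustration. How the propensity scores were estimated is left outside the function.

```python
from typing import Sequence


def ipw_ate(y: Sequence[float], d: Sequence[int], p_hat: Sequence[float]) -> float:
    """Inverse-propensity-weighted estimate of the average treatment effect."""
    if not (len(y) == len(d) == len(p_hat)):
        raise ValueError("y, d and p_hat must have the same length")
    total = 0.0
    for yi, di, pi in zip(y, d, p_hat):
        if not 0.0 < pi < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Treated outcomes are weighted by 1/p, control outcomes by 1/(1 - p).
        total += di * yi / pi - (1 - di) * yi / (1.0 - pi)
    return total / len(y)


if __name__ == "__main__":
    y = [2.0, 1.0, 4.0, 3.0]   # observed outcomes
    d = [1, 0, 1, 0]           # treatment indicators
    p_hat = [0.5, 0.5, 0.5, 0.5]
    print(ipw_ate(y, d, p_hat))  # (4 - 2 + 8 - 6) / 4 = 1.0
```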
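Define $\mu_j(x) \equiv E[Y_{ji} \mid X_i = x]$ and $\sigma_j^2(x) \equiv \mathrm{Var}(Y_{ji} \mid X_i = x)$ for $j = 0, 1$. For the general class of link functions, as long as the function is continuous and monotonic in h,  () shows that this treatment effect estimator achieves the semiparametric efficiency bound such that
$$\sqrt{n}\,(\hat{\tau} - \tau) \xrightarrow{d} N(0, V), \tag{3}$$
where $\tau(x) \equiv \mu_1(x) - \mu_0(x)$ and
$$V = E\left[ (\tau(X_i) - \tau)^2 + \frac{\sigma_1^2(X_i)}{p(X_i)} + \frac{\sigma_0^2(X_i)}{1 - p(X_i)} \right],$$
which is identical to the efficiency bound derived by Hahn (1998). This efficiency result with the general link function is obtained, similarly to  (), following the influence function approach of  (). To see this, define
$$m(Y_i, D_i, X_i; \tau, p) \equiv \frac{D_i Y_i}{p(X_i)} - \frac{(1 - D_i) Y_i}{1 - p(X_i)} - \tau, \qquad m_p(Y_i, D_i, X_i; p) \equiv -\frac{D_i Y_i}{p(X_i)^2} - \frac{(1 - D_i) Y_i}{(1 - p(X_i))^2}, \qquad \alpha(X_i) \equiv E[m_p \mid X_i] = -\left( \frac{\mu_1(X_i)}{p(X_i)} + \frac{\mu_0(X_i)}{1 - p(X_i)} \right), \tag{4}$$
where $m_p$ denotes the derivative of the moment function for the treatment effect, $m$, with respect to the propensity score $p$, and $\alpha(X_i)$ denotes its conditional expectation at the true parameter values. The asymptotic variance result of Equation (3) is obtained by showing that the estimator is asymptotically linear with an influence function decomposed into two terms:
$$\sqrt{n}\,(\hat{\tau} - \tau) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ m(Y_i, D_i, X_i; \tau, p) + \alpha(X_i)\,(D_i - p(X_i)) \right\} + o_p(1). \tag{5}$$
The first term in Equation (5) is the influence function when we know the true propensity score $p$, and the second term represents the contribution of the estimated propensity score to the asymptotic distribution of $\hat{\tau}$. It follows that the asymptotic variance V in Equation (3) is equal to
$$V = E\left[ \left\{ m(Y_i, D_i, X_i; \tau, p) + \alpha(X_i)\,(D_i - p(X_i)) \right\}^2 \right],$$
which yields the result.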
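The step from the influence function to the variance bound can be spelled out as follows. This is a sketch in notation that is assumed here for concreteness: $m$ is the inverse-weighting moment function $D_i Y_i / p(X_i) - (1 - D_i) Y_i / (1 - p(X_i)) - \tau$, $\alpha(X_i)$ is the conditional expectation of its derivative with respect to $p$, and $\mu_j(x)$ and $\sigma_j^2(x)$ are the conditional mean and variance of the potential outcome under treatment status $j$. Collecting terms in $D_i$ gives three components that are mutually uncorrelated under ignorability, so the variance is the sum of their second moments.

```latex
\begin{align*}
\psi_i &\equiv m(Y_i, D_i, X_i; \tau, p) + \alpha(X_i)\,(D_i - p(X_i)) \\
       &= \frac{D_i\,(Y_i - \mu_1(X_i))}{p(X_i)}
        - \frac{(1 - D_i)\,(Y_i - \mu_0(X_i))}{1 - p(X_i)}
        + \big(\mu_1(X_i) - \mu_0(X_i) - \tau\big), \\
E[\psi_i^2] &= E\!\left[\frac{\sigma_1^2(X_i)}{p(X_i)}\right]
        + E\!\left[\frac{\sigma_0^2(X_i)}{1 - p(X_i)}\right]
        + E\!\left[\big(\mu_1(X_i) - \mu_0(X_i) - \tau\big)^2\right].
\end{align*}
```

The cross term between the first two components vanishes because $D_i (1 - D_i) = 0$, and both have conditional mean zero given $X_i$, which removes their covariance with the third component.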
3. Efficient Estimation When the True Propensity Score Is Parametric
As discussed in the previous section, the efficiency of the treatment effect estimator depends on whether the estimator has the asymptotically linear representation in Equation (5). When the propensity score is estimated by nonparametric sieve ML, we achieve this representation and hence the efficiency bound. Here, we ask whether this asymptotically linear representation can be achieved when the true propensity score is parametric and is estimated under the correct parametric specification. We confirm that, in this case, the semiparametric efficiency bound is not achieved, as can be inferred from the existing literature. This suggests that, even though we know the true propensity score belongs to a parametric class, we still need to estimate the propensity score by a nonparametric method.
The intuition behind this result is that nonparametric sieve estimation of the propensity score plays two roles in the estimation of the treatment effect. First, it approximates the true propensity score. Second, it approximates the conditional expectation of the derivative of the moment condition for the treatment effect with respect to the propensity score. For the purpose of illustration, without loss of generality, suppose $p(x) = \Phi(x'\theta_0)$ where $\Phi$ denotes the standard normal cumulative distribution function (CDF), so the true propensity score is a probit model. We can then estimate $\theta_0$ by ML, denoted by $\hat{\theta}$, and obtain the parametric convergence rate such that $\hat{\theta} - \theta_0 = O_p(n^{-1/2})$ and hence $\hat{p}(x) - p(x) = O_p(n^{-1/2})$ with $\hat{p}(x) = \Phi(x'\hat{\theta})$.
For ease of notation without losing the main idea, we consider the special case that  with probability one. Define  as the average outcome of interest, where  is missing at random conditional on the covariates X. We estimate the average outcome as . For this estimator, following the Equation (5), if we can obtain the asymptotic linear representation as
      
      
        
      
      
      
      
    
      then we will achieve the efficiency bound. To see whether this asymptotic linear representation is attainable with parametric estimation of the propensity score, we decompose  as
      
      
        
      
      
      
      
    
      
        
      
      
      
      
    
      
        
      
      
      
      
    
      
        
      
      
      
      
    
      
        
      
      
      
      
    
      where , ,  denotes the distribution function of X,  with  being the standard normal density function, and
      
      
        
      
      
      
      
    
      If we can show that all terms (7)–(10) are , we then obtain the desirable result of Equation (6). Following the steps in  () or  (), it is straightforward to bound the terms (7)–(9) as . We focus on the term (10), from which we derive our main finding.
By inspecting  and , we see that  is the linear projection of  on . In other words,
      
      
        
      
      
      
      
    
      where the projection coefficient is given by . Therefore, unless  is indeed linear in ,5 we will have  for some positive constant C. It follows that
      
      
        
      
      
      
      
    
      Therefore, the term (10) remains as  and contributes to the asymptotic distribution of the treatment effect estimator. In other words, the asymptotic linear representation of Equation (6) is not obtained with parametric estimation of the propensity score in general, even when the true propensity score is parametric. In the Appendix, we derive the asymptotic variance of the treatment effect estimator with the estimated parametric propensity score, and characterize the efficiency loss due to parametric estimation of the propensity score. In particular, we show that this efficiency loss is exactly given by Equation (13).
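The comparison discussed above can be explored in a small Monte Carlo experiment. The sketch below is illustrative only and rests on assumptions not taken from the paper: a logit (rather than probit) propensity score $p(x) = 1/(1 + e^{-x})$, a scalar covariate uniform on $[-1, 1]$, an outcome $Y = 1 + X^2 + \varepsilon$ observed only when $D = 1$, and a cubic power series as the sieve. It reports $n$ times the simulated variance of the estimator of the mean outcome under true weights, correctly specified parametric ML weights, and series weights. No particular numerical outcome is claimed here, and with few replications the ranking can be noisy.

```python
import math
import random
from typing import Dict, List


def sigmoid(z: float) -> float:
    """Numerically stable logistic CDF."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    k = len(b)
    m = [a[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (m[r][k] - sum(m[r][j] * x[j] for j in range(r + 1, k))) / m[r][r]
    return x


def fit_logit(basis: List[List[float]], d: List[int], iters: int = 50) -> List[float]:
    """Logit ML over basis coefficients by Newton steps."""
    k = len(basis[0])
    coef = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for r, di in zip(basis, d):
            p = sigmoid(sum(c * v for c, v in zip(coef, r)))
            for a in range(k):
                grad[a] += (di - p) * r[a]
                for b in range(k):
                    hess[a][b] += p * (1.0 - p) * r[a] * r[b]
        step = solve(hess, grad)
        coef = [c + s for c, s in zip(coef, step)]
        if max(abs(s) for s in step) < 1e-12:
            break
    return coef


def ipw_mean(y: List[float], d: List[int], p_hat: List[float]) -> float:
    """(1/n) * sum_i D_i * Y_i / p_hat_i: the special case with a zero control outcome."""
    return sum(di * yi / pi for yi, di, pi in zip(y, d, p_hat)) / len(y)


def simulate(reps: int = 200, n: int = 200, seed: int = 0) -> Dict[str, float]:
    """Return n times the simulated variance of each weighting estimator."""
    rng = random.Random(seed)
    draws: Dict[str, List[float]] = {"true": [], "parametric": [], "series": []}
    for _ in range(reps):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        p = [sigmoid(xi) for xi in x]
        d = [1 if rng.random() < pi else 0 for pi in p]
        y = [1.0 + xi * xi + rng.gauss(0.0, 0.5) for xi in x]
        draws["true"].append(ipw_mean(y, d, p))
        # k = 2 is the correctly specified logit in (1, x); k = 4 adds x^2 and x^3.
        for name, k in (("parametric", 2), ("series", 4)):
            basis = [[xi ** j for j in range(k)] for xi in x]
            coef = fit_logit(basis, d)
            p_hat = [sigmoid(sum(c * v for c, v in zip(coef, r))) for r in basis]
            draws[name].append(ipw_mean(y, d, p_hat))
    out: Dict[str, float] = {}
    for name, v in draws.items():
        mean = sum(v) / reps
        out[name] = n * sum((t - mean) ** 2 for t in v) / (reps - 1)
    return out


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name:>10}: n * variance = {value:.3f}")
```

The design choice of making the conditional mean of the outcome quadratic in the covariate is deliberate: it keeps the ratio of that conditional mean to the propensity score outside the span of the parametric score, which is the situation the argument above is about.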
As the key difference, in the treatment effect estimation using the nonparametric sieve estimation of the propensity score like Equation (2), it can be shown that when  is t-times continuously differentiable, we have
      
      
        
      
      
      
      
    
      where K denotes the number of approximating sieve terms used in  (see  () or  ()). Therefore, we can bound the term (10) as  for some large enough K. This is because Equation (12) becomes
      
      
        
      
      
      
      
    
when the sieve estimation like Equation (2) is used to estimate the propensity score, where  denotes a vector of approximating basis functions, and hence the bound (14) follows from standard approximation results for sieves over a class of smooth functions such as a Hölder class (see, e.g.,  ()). We note, however, that because  and  are quite different in nature, the sieve approximation used to estimate the propensity score does not necessarily approximate the latter well in finite samples, which may contribute to the inefficiency of the treatment effect estimation.
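How quickly a sieve can approximate a smooth target, and how poorly a single linear term can, is easy to check numerically. The sketch below uses assumptions chosen only for illustration: the target is $1/p(x) = 1 + e^{-x}$ for a logit propensity score, the projection is unweighted least squares on an evenly spaced grid over $[-1, 1]$, and the basis is a power series. The function names are hypothetical.

```python
import math
from typing import Callable, List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    k = len(b)
    m = [a[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (m[r][k] - sum(m[r][j] * x[j] for j in range(r + 1, k))) / m[r][r]
    return x


def projection_mse(g: Callable[[float], float], k: int, grid: List[float]) -> float:
    """Mean squared residual of the least-squares projection of g on (1, x, ..., x**(k-1))."""
    basis = [[x ** j for j in range(k)] for x in grid]
    target = [g(x) for x in grid]
    gram = [[sum(r[a] * r[b] for r in basis) for b in range(k)] for a in range(k)]
    rhs = [sum(r[a] * t for r, t in zip(basis, target)) for a in range(k)]
    coef = solve(gram, rhs)
    resid = [t - sum(c * v for c, v in zip(coef, r)) for r, t in zip(basis, target)]
    return sum(e * e for e in resid) / len(grid)


def inverse_logit_propensity(x: float) -> float:
    """1 / p(x) for the logit propensity score p(x) = 1 / (1 + exp(-x))."""
    return 1.0 + math.exp(-x)


if __name__ == "__main__":
    grid = [-1.0 + 2.0 * i / 200 for i in range(201)]
    for k in range(1, 7):
        print(k, projection_mse(inverse_logit_propensity, k, grid))
```

With two terms (a constant and a linear term) the residual stays bounded away from zero, while adding higher-order terms shrinks it rapidly, which mirrors the contrast between the fixed parametric projection and the sieve projection.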
Finally, by inspecting Equations (4) and (5) for the case  along with Equation (10), note that the term  is related to the conditional expectation of the derivative of the moment function with respect to the propensity score. This implies that nonparametric sieve estimation of the propensity score plays two roles in the estimation of the treatment effect. First, it approximates the true propensity score. Second, it approximates the conditional expectation of the derivative of the moment condition with respect to the propensity score. Parametric propensity score estimation can accomplish the first role if the true propensity score is parametric, but cannot achieve the second when some of the covariates are continuous.
The asymptotic variance of a treatment effect estimator using parametric estimation of the propensity score can also depend on which parametric estimator is being used in practice. In this regard, given a parametric model of the propensity score, one can directly derive the asymptotic variance of the treatment effect estimator using the estimated parametric propensity score by combining two moments as a sequential estimation problem (see, e.g.,  ()). The first moment is given by, e.g., the first order condition of the population ML objective function of the propensity score estimation such as the logit or probit ML, and the second moment is given by the moment condition to estimate the treatment effect  defined in Equation (4). We can then directly compare the asymptotic variance of the treatment effect estimator resulting from using a specific parametric estimator of the propensity score to the semiparametric efficiency bound, instead of deriving the inefficiency term from the Equation (13). This joint moments approach for parametric estimation of the propensity score also allows us to explicitly derive the efficiency loss due to a specific parametric estimator of the propensity score, and hence compare different parametric models of the propensity score in terms of efficiency.6
3.1. Reconsidering the Simple Example in Hirano et al. (2003)
Hirano et al. (2003) present a simple example with a binary covariate, illustrating that weighting by the inverse of the estimated propensity score, rather than the true one, improves efficiency and indeed achieves the efficiency bound. Here, we reproduce the example and provide intuition for why the efficiency bound is achieved in this case, in view of the results from the previous section. Consider a simple problem of estimating the population average of a variable Y, , given a random sample of size n of the triple (). Therefore,  and  are observed for all units in the sample, but  is only observed if . Denote  and .
Now let  denote the number of observations with  and , for . Further assume that the true selection probability is .7 The estimated selection probability is then
        
      
        
      
      
      
      
    
The true weights estimator is given by  while the estimated weights estimator is then . Hirano et al. (2003) show that  is more efficient than , and  achieves the efficiency bound. Interestingly, one can easily see that  in Equation (15) is a nonparametric estimator of , and is also a parametric MLE of  since we can write  with  and .
In this example, for the corresponding terms of  and  in Equation (10), we show below that indeed
        
      
        
      
      
      
      
    
        for all x, and hence the efficiency bound is achieved for the estimator  because the asymptotic linear representation like Equation (6) is obtained (i.e., the term (14) is simply equal to zero in this case). To derive the result, consider the following terms corresponding to  and  in Equation (10) for the stochastic expansion of . Let
        
      
        
      
      
      
      
    
        where  denotes the probability mass of X.
By investigating  and , we can see that  is the linear projection of  on . In other words,  for some constants  and  that are determined by the linear projection. Note that we have
        
      
        
      
      
      
      
    
        if  for . Indeed, from the definition of , we find
        
      
        
      
      
      
      
    
        and therefore the efficiency result follows.
This example clearly illustrates why a condition like (16) is crucial to achieving the efficiency bound. It suggests that, when the covariates are multinomial, a condition like (16) can always be satisfied because parametric ML estimation becomes equivalent to nonparametric ML estimation, and therefore the efficiency bound is achieved. However, when the covariates or a subset of the covariates are continuous, parametric propensity score estimation cannot achieve the efficiency bound even though the true propensity score is parametric. This also suggests that the efficiency loss from using the parametric propensity score estimator is attributable to the fact that some covariates are continuous.
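The binary-covariate example can be reproduced on a toy data set. The sketch below is a minimal illustration with made-up numbers and hypothetical function names. It assumes the estimated selection probability is the cell frequency of observed units within each covariate value, and it checks that the estimated weights estimator coincides with a cell-size-weighted average of observed cell means. The true selection probabilities are supplied by the user, and the value 1/2 used below is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List


def estimated_selection_prob(x: List[int], d: List[int]) -> Dict[int, float]:
    """Share of units with D = 1 within each covariate cell."""
    n_cell = Counter(x)
    n_obs = Counter(xi for xi, di in zip(x, d) if di == 1)
    return {v: n_obs[v] / n_cell[v] for v in n_cell}


def weighted_mean(x: List[int], d: List[int], y: List[float],
                  prob: Dict[int, float]) -> float:
    """(1/n) * sum_i D_i * Y_i / prob[X_i]; Y_i is only used when D_i = 1."""
    return sum(yi / prob[xi] for xi, di, yi in zip(x, d, y) if di == 1) / len(x)


def cell_mean_form(x: List[int], d: List[int], y: List[float]) -> float:
    """sum over cells of (cell share) * (mean of observed Y in the cell)."""
    n = len(x)
    total = 0.0
    for v, n_v in Counter(x).items():
        obs = [yi for xi, di, yi in zip(x, d, y) if xi == v and di == 1]
        total += (n_v / n) * sum(obs) / len(obs)
    return total


if __name__ == "__main__":
    x = [0, 0, 0, 0, 1, 1, 1, 1]
    d = [1, 0, 1, 0, 1, 1, 1, 0]
    # Entries of y with d == 0 are placeholders; they are never used.
    y = [2.0, 0.0, 4.0, 0.0, 1.0, 2.0, 3.0, 0.0]
    p_hat = estimated_selection_prob(x, d)          # {0: 0.5, 1: 0.75}
    print(weighted_mean(x, d, y, p_hat))            # 2.5
    print(cell_mean_form(x, d, y))                  # 2.5
    print(weighted_mean(x, d, y, {0: 0.5, 1: 0.5}))  # 3.0 with the assumed true weights
```

The equality of the first two printed values holds for any data with at least one observed unit per cell, because dividing by the cell frequency turns the weighted sum into a within-cell average.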
3.2. Generalization to Estimating the Weighted Average Treatment Effect
We generalize the efficiency comparison between treatment effect estimators using nonparametric or parametric estimation of the propensity score to the weighted average treatment effect, , defined as
        
      
        
      
      
      
      
    
        for a known weight function . We estimate  using the moment condition
        
      
        
      
      
      
      
    
        that yields the estimator as
        
      
        
      
      
      
      
    
        given an estimator of the propensity score .
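A sample analogue of a weighted moment condition of this kind can be sketched as follows. The code assumes the standard form in which the weight multiplies the inverse-weighting moment function, so that solving the sample moment for the parameter gives a ratio of weighted sums. The function name `weighted_ate` and the data are hypothetical, and the paper's own displayed estimator may be written differently.

```python
from typing import Sequence


def weighted_ate(y: Sequence[float], d: Sequence[int],
                 p_hat: Sequence[float], g: Sequence[float]) -> float:
    """Solve sum_i g_i * (D_i Y_i / p_i - (1 - D_i) Y_i / (1 - p_i) - tau) = 0 for tau."""
    num = 0.0
    den = 0.0
    for yi, di, pi, gi in zip(y, d, p_hat, g):
        num += gi * (di * yi / pi - (1 - di) * yi / (1.0 - pi))
        den += gi
    if den == 0.0:
        raise ValueError("weights must not sum to zero")
    return num / den


if __name__ == "__main__":
    y = [2.0, 1.0, 4.0, 3.0]
    d = [1, 0, 1, 0]
    p_hat = [0.5, 0.5, 0.5, 0.5]
    print(weighted_ate(y, d, p_hat, [1.0, 1.0, 1.0, 1.0]))  # 1.0, the unweighted case
    print(weighted_ate(y, d, p_hat, [2.0, 1.0, 1.0, 0.0]))  # 14 / 4 = 3.5
    # Passing g = p_hat gives a version targeted at the treated population.
    print(weighted_ate(y, d, p_hat, p_hat))
```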
Because the function  is known and only appears as a weight in the moment function (17), following the same line of argument for the average treatment effect, one can obtain the asymptotic linear representation of  using the nonparametric propensity score estimator (2) in Equation (18) as
        
      
        
      
      
      
      
    
        where  and . Therefore, the semiparametric efficiency bound is achieved for the weighted average treatment effect estimator  using the nonparametric propensity score estimator (see  ()). On the other hand, for the parametric propensity score estimator, we can derive the inefficiency term similar to Equation (13), the inefficiency term derived for the average treatment effect, as
        
      
        
      
      
      
      
    
        and therefore a similar inefficiency result holds for  using the parametric propensity score estimator.
Note that, under the unconfoundedness assumption ( (, )), with the weight function  being equal to the true propensity score , the weighted average treatment effect becomes the average treatment effect for the treated,
        
      
        
      
      
      
      
    
        Based on this equivalence,  can be estimated using the moment condition
        
      
        
      
      
      
      
    
        by replacing  with . However, an efficiency comparison between treatment effect estimators for  using the nonparametric or parametric propensity score estimator is more complicated because, in this case, the propensity score has two roles in the moment function. One is the inverse weighting to control for the self-selection and the other is the weighting function in place of . To see this, let  and  denote the nonparametric and the correctly specified parametric estimator of the propensity score, respectively. Then, we can consider three alternative estimators for the average treatment effect for the treated. One is using the parametric propensity score  everywhere and solving
        
      
        
      
      
      
      
    
        the second one is using the nonparametric propensity score  everywhere and solving
        
      
        
      
      
      
      
    
        and the last one is using the parametric propensity score  in place of  while using the nonparametric propensity score  for the inverse weighting and solving
        
      
        
      
      
      
      
    
        From the efficiency argument of  () and  () when the true propensity score is known, one can conjecture that the treatment effect estimator that solves Equation (21) will be more efficient than other estimators that solve Equations (19) and (20), respectively. However, in terms of efficiency, the two estimators solving Equations (19) and (20) (or other variations) cannot be uniformly ranked in general, and studying these alternative estimators is beyond the scope of this paper.
4. Conclusions
One can obtain efficient estimation of average treatment effects by inverse propensity score weighting based on the estimated propensity score, rather than the true one, even when the true one is known. From the literature, we can infer that, even when the true propensity score is a parametric function, we still need to estimate the propensity score nonparametrically to achieve efficiency. We formalize this argument and identify the source of the efficiency loss due to parametric estimation of the propensity score. We also provide intuition as to why this overfitting is necessary. The idea is that nonparametric estimation of the propensity score plays two roles in treatment effect estimation. First, it replaces the true propensity score. Second, it approximates the conditional expectation of the derivative of the moment condition for the treatment effect with respect to the propensity score. Parametric propensity score estimation can achieve the first but cannot achieve the second when some of the covariates are continuous. This also suggests that the finite sample performance of the treatment effect estimator, using the imputation method based on the estimated propensity score, may depend not only on how precisely the propensity score is estimated, but also on how well the conditional expectation of the derivative of the moment condition is approximated by the same approximating sieves or regressors that are used to estimate the propensity score.
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Efficiency Loss Due to Parametric Estimation of the Propensity Score
For ease of notation, we assume  with probability one, and define  as the average outcome of interest. In the main text, we have established the following:8
      
        
      
      
      
      
    
        where  and . Define
        
      
        
      
      
      
      
    
        The first term  is the moment function that would be obtained when we do not estimate the propensity score . The second and the third term,  and , are the contribution of estimating  using the parametric ML estimator to the asymptotic distribution of . If we estimated the propensity score using a nonparametric ML estimation, even when the true propensity score is parametric, we would not have the third term since we can replace  with  without affecting the asymptotic distribution.
The asymptotic variance of  is equal to the variance of the sum of , , and . We obtain for each component that potentially determines the asymptotic variance:
      
        
      
      
      
      
    Combining these results, we obtain
        
      
        
      
      
      
      
    
        where the first term in the asymptotic variance is identical to the efficiency bound derived by  (). Therefore, the efficiency loss due to parametric estimation of the propensity score is given by .
References
- Abadie, Alberto, and Guido W. Imbens. 2002. Simple and Bias-Corrected Matching Estimators for Average Treatment Effects. NBER Working Paper. Cambridge: National Bureau of Economic Research, vol. 283.
- Abadie, Alberto, and Guido W. Imbens. 2006. Large Sample Properties of Matching Estimators for Average Treatment Effects. Econometrica 74: 235–67.
- Chen, Xiaohong. 2007. Large Sample Sieve Estimation of Semi-Nonparametric Models. Handbook of Econometrics 6: 5549–632.
- Chen, Xiaohong, and Xiaotong Shen. 1998. Sieve Extremum Estimates for Weakly Dependent Data. Econometrica 66: 289–314.
- Hahn, Jinyong. 1998. On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects. Econometrica 66: 315–31.
- Heckman, James J., Hidehiko Ichimura, and Petra Todd. 1998. Matching as an Econometric Evaluation Estimator. Review of Economic Studies 65: 605–54.
- Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. 2003. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score. Econometrica 71: 1161–89.
- Hitomi, Kohtaro, Yoshihiko Nishiyama, and Ryo Okui. 2008. A Puzzling Phenomenon in Semiparametric Estimation Problems with Infinite-Dimensional Nuisance Parameters. Econometric Theory 24: 1717–28.
- Hristache, Marian, and Valentin Patilea. 2017. Conditional Moment Models with Data Missing at Random. Biometrika 104: 735–42.
- Imbens, Guido W. 2004. Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review. The Review of Economics and Statistics 86: 4–29.
- Kang, Joseph D. Y., and Joseph L. Schafer. 2007. Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data. Statistical Science 22: 523–39.
- Kim, Kyoo Il. 2013. An Alternative Efficient Estimation of Average Treatment Effects. Journal of Market Economy 42: 1–41.
- Li, Qi, Jeffrey S. Racine, and Jeffrey M. Wooldridge. 2009. Efficient Estimation of Average Treatment Effects with Mixed Categorical and Continuous Data. Journal of Business & Economic Statistics 27: 206–23.
- Newey, Whitney K. 1984. A Method of Moments Interpretation of Sequential Estimators. Economics Letters 14: 201–6.
- Newey, Whitney K. 1994. The Asymptotic Variance of Semiparametric Estimators. Econometrica 62: 1349–82.
- Newey, Whitney K. 1997. Convergence Rates and Asymptotic Normality for Series Estimators. Journal of Econometrics 79: 147–68.
- Prokhorov, Artem, and Peter Schmidt. 2009. GMM Redundancy Results for General Missing Data Problems. Journal of Econometrics 151: 47–55.
- Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. 1995. Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data. Journal of the American Statistical Association 90: 106–21.
- Rosenbaum, Paul R. 1987. Model-Based Direct Adjustment. Journal of the American Statistical Association 82: 387–94.
- Rosenbaum, Paul R., and Donald B. Rubin. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70: 41–55.
- Rosenbaum, Paul R., and Donald B. Rubin. 1984. Reducing Bias in Observational Studies Using Subclassification on the Propensity Score. Journal of the American Statistical Association 79: 516–24.
- Rubin, Donald B. 1973. The Use of Matched Sampling and Regression Adjustments to Remove Bias in Observational Studies. Biometrics 29: 185–203.
- Rubin, Donald B., and Neal Thomas. 1996. Matching Using Estimated Propensity Scores: Relating Theory to Practice. Biometrics 52: 249–64.
- Shen, Xiaotong, and Wing Hung Wong. 1994. Convergence Rate of Sieve Estimates. The Annals of Statistics 22: 580–615.
- Tan, Zhiqiang. 2007. Comment: Understanding OR, PS and DR. Statistical Science 22: 560–68.
 
| 1 | As pointed out by one referee, several studies in the literature related to our findings show that using an estimated nuisance parameter rather than the true value improves the efficiency of the estimator of the parameter of interest (see, e.g.,  ();  (); and  ()). Our work provides new insight into this problem by illustrating that parametric estimation of the nuisance parameter may not achieve full efficiency.  | 
| 2 |  () proposes to estimate , , and  using series estimations (e.g.,  ()). The resulting treatment effect estimators, however, are subject to some practical issues, e.g., the propensity score estimate  may lie outside the zero and one interval.  | 
| 3 | See, e.g.,  () and  () for further details on the sieve extremum estimations that include the sieve ML estimation.  | 
| 4 | The Hölder space is a space of functions ,  such that the first  derivatives are bounded, and the -th derivatives are Hölder continuous with the exponent , where  is the largest integer smaller than . The Hölder space becomes a Banach space when endowed with the Hölder norm:
            | 
| 5 | When the true propensity score is the logit model instead, this term is replaced by  where   | 
| 6 | We thank the referee for this useful suggestion.  | 
| 7 | In the original example, we have .  | 
| 8 | This is because the terms (7)–(9) are  and only the terms (10) and (11) remain in the stochastic expansion.  | 
© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).