Article

Non-Asymptotic Bounds of AIPW Estimators for Means with Missingness at Random

Fei Wang and Yuhao Deng
1 College of Science, Minzu University of China, Beijing 100081, China
2 School of Mathematical Sciences, Peking University, Beijing 100871, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(4), 818; https://doi.org/10.3390/math11040818
Submission received: 31 December 2022 / Revised: 1 February 2023 / Accepted: 3 February 2023 / Published: 6 February 2023
(This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics)

Abstract: The augmented inverse probability weighting (AIPW) estimator is well known for its double robustness in missing data and causal inference: if either the propensity score model or the outcome regression model is correctly specified, the estimator is guaranteed to be consistent. Another important property of AIPW is that it can achieve first-order equivalence to the oracle estimator in which all nuisance parameters are known, even if the fitted nuisance models do not converge at the parametric root-n rate. We explore the non-asymptotic properties of the AIPW estimator for inferring the population mean with missingness at random. We also consider inference on the mean outcomes in the observed group and in the unobserved group.

1. Introduction

Missingness is an important issue in statistics. Suppose we are interested in the population mean of an outcome. To estimate the population mean, we randomly draw $n$ independent units to form a sample. It is well known that the empirical average of the outcome in this sample is an unbiased estimator of the population mean. However, if there is missingness, we only have measurements of the outcome on $r \le n$ units. Missingness may depend on the background characteristics of units, so the observed and unobserved units may differ in covariates. As a result, estimating the population mean by the observed sample average is biased. Rubin [1] formalized the concept of "missing at random", meaning that missingness depends on observed values but not on unobserved values. More specifically, the missingness of an outcome depends on the observed covariates rather than on the outcome itself [2]. Under the missing-at-random assumption, the population mean is identifiable.
With missingness at random, the population mean can be estimated by inverse probability weighting [3]. This approach relies on correct specification of the propensity score model and involves estimating the propensity scores. Hirano et al. [4] discussed the consequences of using estimated propensity scores in inverse probability weighting: using estimated propensity scores can lead to a more efficient estimator than using the true propensity scores, but the exact properties of the former are complex. To improve on inverse probability weighting, Robins et al. [5] proposed augmented inverse probability weighting (AIPW), which combines inverse probability weighting with outcome regression. If both the propensity score model and the outcome regression model are known, the AIPW estimator is the most efficient estimator among regular and asymptotically linear estimators [6,7]. The AIPW estimator is doubly robust, in that it is consistent if either the propensity score model or the outcome regression model is correctly specified [8]. Farrell [9] discussed AIPW with high-dimensional covariates.
The asymptotic properties of the AIPW estimator have been well studied in recent years. Chernozhukov et al. [10] showed that if the fitted propensity score and outcome regression models are sup-norm consistent and converge not too slowly (for example, if both models converge at rate $o_p(n^{-1/4})$), then the AIPW estimator with estimated nuisance models achieves first-order equivalence with the oracle estimator in which both models are known: the (empirical) AIPW estimator and the oracle estimator differ by $o_p(n^{-1/2})$. Therefore, the large-sample asymptotic law of the AIPW estimator is the same as that of the oracle one, converging in distribution to a normal distribution. Newey and Robins [11] and Kennedy [12] discussed more general conditions to achieve root-n convergence with estimated nuisance models by cross-fitting (also called sample splitting).
Existing works mainly focus on the asymptotic properties of the AIPW estimator. In this paper, we study the AIPW estimator from a non-asymptotic viewpoint. The main focus is on non-asymptotic error bounds for the mean outcomes estimated by AIPW. Zhang and Chen [13] and Zhang and Wei [14] reviewed a series of concentration inequalities that can be used to bound the tail probability of an estimator. We apply these concentration inequalities to study the behavior of the doubly robust AIPW estimators of the population mean, the mean in the observed group, and the mean in the unobserved group. We first discuss bounded outcomes, and then extend the results to sub-Gaussian outcomes. Furthermore, we conduct a simulation study to compare the non-asymptotic error bounds for bounded outcomes and sub-Gaussian outcomes.

2. Notations

2.1. Missingness at Random

Consider a sample $\mathcal{I} = \{1, \dots, n\}$ randomly drawn from a super-population. For each unit $i \in \mathcal{I}$, let $Z_i$ be the missingness indicator, where $Z_i = 1$ if the outcome $Y_i$ is observed and $Z_i = 0$ if it is not. We collect a vector of covariates $X_i$ for each unit, which is predictive of the missingness propensity or the outcome. The observed variables of the $i$th unit are $O_i = (Z_i, Z_iY_i, X_i)$, for $i = 1, \dots, n$. The observed data $\{O_i\}_{i=1}^n$ are $n$ independent and identically distributed (iid) copies of $O = (Z, ZY, X)$. Define the propensity score
\[ e(x) = P(Z = 1 \mid X = x) \]
and the outcome regression
\[ m(x) = E(Y \mid X = x). \]
In many scenarios in biostatistics, economics and the social sciences, we are interested in the mean outcome in the overall population, $\tau = E(Y)$. We assume missingness at random (MAR), that is, whether a unit is missing is independent of its outcome given the covariates.
Assumption 1 (Missingness at random). $Z \perp\!\!\!\perp Y \mid X$.
Under Assumption 1, $m(x) = E(Y \mid Z = 1, X = x)$. Moreover, we assume that the propensity score is bounded away from 0, so that each unit has a positive probability of being observed.
Assumption 2 (Positivity). $e(X) > \eta > 0$, where $\eta$ is a known constant.

2.2. Augmented Inverse Probability Weighting

The information in $X$ can be summarized by a one-dimensional scalar, the propensity score $e(X)$. Conditioning on the propensity score rather than on all the covariates, we have $Z \perp\!\!\!\perp Y \mid e(X)$ [3]. Inverse probability weighting (Horvitz–Thompson [15]) utilizes this property and identifies the target estimand by
\[ \tau = E\left\{\frac{ZY}{e(X)}\right\}. \]
Another approach to identifying $\tau$ is outcome regression. Since we can estimate the mean outcome among observed units, this information can be generalized to the whole population at each level of the covariates $X$. Therefore, we can simply express $\tau$ as
\[ \tau = E\left[E\{Y \mid Z = 1, X\}\right] = E\{m(X)\}, \]
which is also identifiable from observed data. The inner expectation in the second formula is taken over Y given X in the observable population, and the outer expectation is taken over X in the whole population.
Inverse probability weighting and outcome regression can be combined, by appending an augmentation term, into the following expression, referred to as augmented inverse probability weighting (AIPW),
\[ \tau = E\left\{\frac{ZY}{e(X)} + \left(1 - \frac{Z}{e(X)}\right)m(X)\right\} \qquad (3) \]
\[ \;\; = E\left\{\frac{Z\{Y - m(X)\}}{e(X)} + m(X)\right\}. \qquad (4) \]
Equation (3) can be understood as the ordinary inverse probability weighting plus an augmentation term that corrects for the potential bias of the propensity score. With a little transformation, Equation (4) can be understood as the ordinary outcome regression plus a correction for the potential bias of the outcome regression model. Augmented inverse probability weighting has the well-known property of double robustness, in that the equations above hold if either the propensity score model $e(x)$ or the outcome regression model $m(x)$ is correctly specified. That is,
\[ \tau = E\left\{\frac{ZY}{e(X)} + \left(1 - \frac{Z}{e(X)}\right)m^*(X)\right\} \]
\[ \;\; = E\left\{\frac{Z\{Y - m(X)\}}{e^*(X)} + m(X)\right\} \]
for any functions $e^*(x)$ and $m^*(x)$. Another important property of AIPW is that the term inside the expectation, minus the constant $\tau$, is the efficient influence function of $\tau$, so estimation based on AIPW is the most efficient.
Define the oracle AIPW estimator as
\[ \hat{\tau}^* = \frac{1}{n}\sum_{i=1}^n \left\{\frac{Z_iY_i}{e(X_i)} + \left(1 - \frac{Z_i}{e(X_i)}\right)m(X_i)\right\}. \]
It is an average of $n$ independent random variables. It is easy to show that $\hat{\tau}^*$ is an unbiased and consistent estimator of $\tau$, with $E(\hat{\tau}^*) = \tau$. In practice, the propensity score and outcome regression models are unknown. Suppose we estimate them by $\hat{e}(x)$ and $\hat{m}(x)$. The (empirical) AIPW estimator becomes
\[ \hat{\tau} = \frac{1}{n}\sum_{i=1}^n \left\{\frac{Z_iY_i}{\hat{e}(X_i)} + \left(1 - \frac{Z_i}{\hat{e}(X_i)}\right)\hat{m}(X_i)\right\}. \qquad (8) \]
Since Equation (8) involves fitted models, it is no longer an average of $n$ independent random variables.
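As an illustration, here is a minimal Python sketch of the plug-in AIPW estimator in Equation (8). The logistic and linear working models, and the convention that missing outcomes carry placeholder values, are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_mean(Z, Y, X):
    """Plug-in AIPW estimate of tau = E(Y); Y may hold any placeholder where Z = 0."""
    # Working propensity score model P(Z = 1 | X), fitted on the full sample.
    e_hat = LogisticRegression().fit(X, Z).predict_proba(X)[:, 1]
    # Working outcome regression E(Y | X), fitted on the observed units only.
    m_hat = LinearRegression().fit(X[Z == 1], Y[Z == 1]).predict(X)
    y_filled = np.where(Z == 1, Y, 0.0)  # placeholder values are multiplied by Z anyway
    return np.mean(Z * y_filled / e_hat + (1 - Z / e_hat) * m_hat)
```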

3. Construction of Error Bounds for Bounded Outcomes

Lemma 1 (McDiarmid's inequality [16]). Suppose $O_1, \dots, O_n$ are independent random variables taking values in the set $\mathcal{O}$, and assume $f: \mathcal{O}^n \to \mathbb{R}$ satisfies the bounded difference condition (BDC)
\[ \sup_{o_1, \dots, o_n, o_k' \in \mathcal{O}} \left| f(o_1, \dots, o_n) - f(o_1, \dots, o_{k-1}, o_k', o_{k+1}, \dots, o_n) \right| \le c_k. \]
Then,
\[ P\left(\left| f(O_1, \dots, O_n) - E\{f(O_1, \dots, O_n)\} \right| \ge t\right) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right), \quad t > 0. \]
We assume that $|Y| \le M$, where $M$ is a positive constant. In fact, if $Y$ is not bounded, we can transform $Y$ into a bounded random variable. In the following subsections, we consider the non-asymptotic inference of $\tau = E(Y)$, $\tau_1 = E(Y \mid Z = 1)$ and $\tau_0 = E(Y \mid Z = 0)$, respectively.

3.1. Mean Outcome

We first consider the oracle estimator $\hat{\tau}^*$. To verify the bounded difference condition, suppose we replace the observation of the $k$th unit, $O_k = (Z_k, Z_kY_k, X_k)$, with $O_k' = (Z_k', Z_k'Y_k', X_k')$. Then,
\[
\begin{aligned}
|D^*| &:= \left|\hat{\tau}^*(O_1, \dots, O_k, \dots, O_n) - \hat{\tau}^*(O_1, \dots, O_k', \dots, O_n)\right| \\
&= \frac{1}{n}\left|\frac{Z_kY_k}{e(X_k)} + \left(1 - \frac{Z_k}{e(X_k)}\right)m(X_k) - \frac{Z_k'Y_k'}{e(X_k')} - \left(1 - \frac{Z_k'}{e(X_k')}\right)m(X_k')\right| \\
&\le \frac{2}{n}\left(\frac{2}{\eta} - 1\right)M =: c^*.
\end{aligned}
\]
By McDiarmid's inequality, since $E(\hat{\tau}^*) = \tau$, we have
\[ P\left(\sqrt{n}\,|\hat{\tau}^* - \tau| \ge t\right) \le 2\exp\left(-\frac{2t^2}{n^2 c^{*2}}\right) = 2\exp\left(-\frac{t^2}{2(2/\eta - 1)^2 M^2}\right). \qquad (9) \]
To study the properties of the AIPW estimator (8) with estimated nuisance models, Chernozhukov et al. [10] proposed a cross-fitting approach. Suppose the full sample is randomly divided into two halves, $\mathcal{I}_1$ and $\mathcal{I}_2$, with similar sample sizes $|\mathcal{I}_1|$ and $|\mathcal{I}_2|$. The models $\hat{e}(x)$ and $\hat{m}(x)$ are estimated on one half of the sample and then evaluated on the other half. To be more specific, for each unit with covariates $x_i$ in the $(3-l)$th half, $e(x_i)$ and $m(x_i)$ are estimated by $\hat{e}^{(l)}(x_i)$ and $\hat{m}^{(l)}(x_i)$, which are fitted using the $l$th half of the sample ($l = 1, 2$). The AIPW estimator $\hat{\tau}$ is obtained by averaging the estimated mean outcomes over the two halves of the full sample,
\[ \hat{\tau} = \frac{|\mathcal{I}_1|}{n}\hat{\tau}(\mathcal{I}_1) + \frac{|\mathcal{I}_2|}{n}\hat{\tau}(\mathcal{I}_2), \]
where
\[ \hat{\tau}(\mathcal{I}_l) = \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l}\left\{\frac{Z_iY_i}{\hat{e}^{(3-l)}(X_i)} + \left(1 - \frac{Z_i}{\hat{e}^{(3-l)}(X_i)}\right)\hat{m}^{(3-l)}(X_i)\right\}. \]
As a comparison, the oracle estimator can be decomposed as
\[ \hat{\tau}^* = \frac{|\mathcal{I}_1|}{n}\hat{\tau}^*(\mathcal{I}_1) + \frac{|\mathcal{I}_2|}{n}\hat{\tau}^*(\mathcal{I}_2), \]
where
\[ \hat{\tau}^*(\mathcal{I}_l) = \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l}\left\{\frac{Z_iY_i}{e(X_i)} + \left(1 - \frac{Z_i}{e(X_i)}\right)m(X_i)\right\}. \]
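For concreteness, here is a minimal Python sketch of the cross-fitting construction described above. The two-fold random split and the logistic/linear working models are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_mean_crossfit(Z, Y, X, seed=0):
    """Cross-fitted AIPW estimate of tau = E(Y) using two halves of the sample."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z))
    halves = [idx[: len(Z) // 2], idx[len(Z) // 2 :]]
    total = 0.0
    for l in (0, 1):
        fit, ev = halves[1 - l], halves[l]  # fit nuisances on one half, evaluate on the other
        e_hat = LogisticRegression().fit(X[fit], Z[fit]).predict_proba(X[ev])[:, 1]
        obs = fit[Z[fit] == 1]              # observed units within the fitting half
        m_hat = LinearRegression().fit(X[obs], Y[obs]).predict(X[ev])
        y_ev = np.where(Z[ev] == 1, Y[ev], 0.0)
        total += np.sum(Z[ev] * y_ev / e_hat + (1 - Z[ev] / e_hat) * m_hat)
    return total / len(Z)
```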
We adopt the following assumptions from Chernozhukov et al. [10].
Assumption 3 (Sup-norm consistency). $\sup_x |\hat{m}(x) - m(x)| \overset{p}{\to} 0$ and $\sup_x |\hat{e}(x) - e(x)| \overset{p}{\to} 0$, where the suprema are taken over the support of $X$.
Assumption 4 (Risk decay). $E\{\hat{m}(X) - m(X)\}^2 \cdot E\{\hat{e}(X) - e(X)\}^2 = o(n^{-1})$.
Assumption 3 states that the fitted values of the propensity score and outcome regression models are consistent for the true values at any data point. Assumption 4 states that the convergence rates of the fitted models are not too slow. A wide range of nonparametric estimators satisfy Assumptions 3 and 4. The deviation of $\hat{\tau}$ is bounded as follows.
Theorem 1. Let $\hat{\tau}$ be the AIPW estimator of $\tau = E(Y)$ by cross-fitting. Under Assumptions 1–4, for any $0 < \epsilon < 1$,
\[ |\hat{\tau} - \tau| \le \sqrt{\frac{2(2/\eta - 1)^2 M^2 \log(2/\epsilon)}{n}} + o(n^{-1}) \]
with probability larger than $1 - \epsilon$.
Proof. For $l = 1, 2$,
\[
\begin{aligned}
\hat{\tau}(\mathcal{I}_l) - \hat{\tau}^*(\mathcal{I}_l)
&= \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l} Z_i\left\{\frac{1}{\hat{e}^{(3-l)}(X_i)} - \frac{1}{e(X_i)}\right\}\{Y_i - m(X_i)\} \\
&\quad - \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l}\{\hat{m}^{(3-l)}(X_i) - m(X_i)\}\left\{\frac{Z_i}{e(X_i)} - 1\right\} \\
&\quad - \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l} Z_i\left\{\frac{1}{\hat{e}^{(3-l)}(X_i)} - \frac{1}{e(X_i)}\right\}\{\hat{m}^{(3-l)}(X_i) - m(X_i)\}.
\end{aligned}
\]
Note that the first two terms are sums of cross products of a mean-zero random variable and an independent $o_p(1)$ random variable (since $\mathcal{I}_l$ can be regarded as fixed by conditioning), so the first two terms are $o_p(n^{-1/2})$. Assumption 4 implies that the third term is also $o_p(n^{-1/2})$ by the Cauchy–Schwarz inequality. Therefore, $|\hat{\tau}(\mathcal{I}_l) - \hat{\tau}^*(\mathcal{I}_l)| = o_p(n^{-1/2})$ and thus $\sqrt{n}\,|\hat{\tau} - \hat{\tau}^*| = o_p(1)$. From the concentration inequality (9), we have
\[ P\left(\sqrt{n}\,|\hat{\tau} - \tau| \ge t + o(1)\right) \le 2\exp\left(-\frac{t^2}{2(2/\eta - 1)^2 M^2}\right). \]
Let the right-hand side be $\epsilon$, so
\[ t = \sqrt{2(2/\eta - 1)^2 M^2 \log(2/\epsilon)}. \]
The first-order equivalence of $\hat{\tau}$ and $\hat{\tau}^*$ is important. With high-dimensional covariates, the fitted nuisance models usually cannot converge at the rate $O_p(n^{-1/2})$. In addition, the risk decay assumption allows nonparametric estimation of the models, for example by spline or kernel regression. Provided that the estimated models do not converge too slowly, the AIPW estimator $\hat{\tau}$ enjoys asymptotic and non-asymptotic properties similar to those of the oracle estimator $\hat{\tau}^*$.
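As a quick numerical check of the bound in Theorem 1 (with hypothetical values, not taken from the paper), the leading term can be evaluated directly:

```python
import math

def theorem1_width(n, eta, M, eps):
    """Leading term of the Theorem 1 bound for bounded outcomes, |Y| <= M."""
    return math.sqrt(2 * (2 / eta - 1) ** 2 * M ** 2 * math.log(2 / eps) / n)

# Example: n = 500, propensity scores above eta = 0.5, |Y| <= 0.5, 95% level.
print(theorem1_width(500, 0.5, 0.5, 0.05))  # approximately 0.18
```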

3.2. Mean Outcome in the Observed Group

We can also study the non-asymptotic bound for $\tau_1 = E(Y \mid Z = 1)$. This estimand can be estimated by
\[ \hat{\tau}_1 = \frac{\sum_{i=1}^n Z_iY_i}{\sum_{i=1}^n Z_i}, \qquad (17) \]
where no nuisance models are involved. This empirical average estimator $\hat{\tau}_1$ is unbiased and consistent for $\tau_1$, because
\[
\begin{aligned}
E(\hat{\tau}_1) = E\left(\frac{\sum_{i=1}^n Z_iY_i}{\sum_{i=1}^n Z_i}\right)
&= E\left\{E\left(\frac{\sum_{i=1}^n Z_iY_i}{\sum_{i=1}^n Z_i}\,\Big|\,Z_1, \dots, Z_n\right)\right\}
= E\left\{\frac{\sum_{i=1}^n Z_i\,E(Y_i \mid Z_i)}{\sum_{i=1}^n Z_i}\right\} \\
&= E\left\{\frac{\sum_{i=1}^n Z_i\,E(Y_i \mid Z_i = 1)}{\sum_{i=1}^n Z_i}\right\} = \tau_1.
\end{aligned}
\]
The double robustness of $\hat{\tau}_1$ is trivial because neither the propensity score model nor the outcome regression model is involved.
To verify the bounded difference condition, suppose we replace the observation of the $k$th unit, $O_k = (Z_k, Z_kY_k, X_k)$, with $O_k' = (Z_k', Z_k'Y_k', X_k')$. If $Z_k = Z_k' = 1$,
\[
\begin{aligned}
|D_1| &:= \left|\hat{\tau}_1(O_1, \dots, O_k, \dots, O_n) - \hat{\tau}_1(O_1, \dots, O_k', \dots, O_n)\right| \\
&= \left|\frac{\sum_{i \ne k} Z_iY_i + Y_k}{\sum_{i \ne k} Z_i + 1} - \frac{\sum_{i \ne k} Z_iY_i + Y_k'}{\sum_{i \ne k} Z_i + 1}\right|
= \frac{|Y_k - Y_k'|}{\sum_{i \ne k} Z_i + 1} \le \frac{1}{n\eta}\cdot 2M =: c_1
\end{aligned}
\]
with probability larger than
\[ \delta_n = P\left(\sum_{i=1}^n Z_i \ge n\eta\right) = \sum_{m = [n\eta]+1}^{n}\binom{n}{m}\eta^m(1-\eta)^{n-m}. \]
By the weak law of large numbers, we know that $\delta_n \to 1$ as $n \to \infty$ because $E(Z_i) > \eta$. If $Z_k = Z_k' = 0$,
\[ |D_1| := \left|\hat{\tau}_1(O_1, \dots, O_k, \dots, O_n) - \hat{\tau}_1(O_1, \dots, O_k', \dots, O_n)\right| = \left|\frac{\sum_{i \ne k} Z_iY_i}{\sum_{i \ne k} Z_i} - \frac{\sum_{i \ne k} Z_iY_i}{\sum_{i \ne k} Z_i}\right| = 0. \]
If $Z_k = 1$ and $Z_k' = 0$, using the equality
\[ \frac{c}{d} - \frac{a}{b} = \frac{1}{d}(c - a) - \frac{a}{bd}(d - b) \qquad (19) \]
for any nonzero $a, b, c, d \in \mathbb{R}$,
\[
\begin{aligned}
|D_1| &:= \left|\hat{\tau}_1(O_1, \dots, O_k, \dots, O_n) - \hat{\tau}_1(O_1, \dots, O_k', \dots, O_n)\right| \\
&= \left|\frac{\sum_{i \ne k} Z_iY_i + Y_k}{\sum_{i \ne k} Z_i + 1} - \frac{\sum_{i \ne k} Z_iY_i}{\sum_{i \ne k} Z_i}\right|
= \frac{1}{\sum_{i \ne k} Z_i + 1}\left|Y_k - \frac{\sum_{i \ne k} Z_iY_i}{\sum_{i \ne k} Z_i}\right| \le \frac{1}{n\eta}\cdot 2M = c_1
\end{aligned}
\]
with probability larger than $\delta_n$. The result is similar for the case with $Z_k = 0$ and $Z_k' = 1$. In summary, $|D_1| \le c_1$ with probability larger than $\delta_n$. According to McDiarmid's inequality,
\[ P\left(\sqrt{n}\,|\hat{\tau}_1 - \tau_1| \ge t\right) \le 2\exp\left(-\frac{\eta^2 t^2}{2M^2}\right) + 1 - \delta_n, \quad t > 0. \qquad (20) \]
Theorem 2. Let $\hat{\tau}_1$ be the empirical average estimator of $\tau_1 = E(Y \mid Z = 1)$ in (17). Under Assumption 2, for any $1 - \delta_n < \epsilon < 1$,
\[ |\hat{\tau}_1 - \tau_1| \le \sqrt{\frac{2M^2}{n\eta^2}\log\frac{2}{\epsilon + \delta_n - 1}} \]
with probability larger than $1 - \epsilon$.
Proof. Let the right-hand side of (20) be $\epsilon$, so
\[ t = \sqrt{\frac{2M^2}{\eta^2}\log\frac{2}{\epsilon + \delta_n - 1}}. \]
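The following sketch illustrates how the Theorem 2 bound could be evaluated numerically. Because $\delta_n$ depends on the distribution of $\sum_{i=1}^n Z_i$, the snippet evaluates it under an assumed known marginal observation probability $p = P(Z = 1)$; this choice, and all numeric values, are assumptions for illustration only.

```python
import math

def delta_n(n, eta, p):
    """P(sum of n iid Bernoulli(p) draws >= n*eta), assuming p = P(Z = 1) is known."""
    k0 = math.ceil(n * eta)
    return sum(math.comb(n, m) * p ** m * (1 - p) ** (n - m) for m in range(k0, n + 1))

def theorem2_width(n, eta, M, eps, d_n):
    """Width of the Theorem 2 bound; requires 1 - delta_n < eps < 1."""
    if eps <= 1 - d_n:
        raise ValueError("Theorem 2 requires 1 - delta_n < eps.")
    return math.sqrt(2 * M ** 2 / (n * eta ** 2) * math.log(2 / (eps + d_n - 1)))

# Example with assumed values: n = 500, eta = 0.5, p = 0.7, |Y| <= 0.5, eps = 0.05.
d = delta_n(500, 0.5, 0.7)
print(d, theorem2_width(500, 0.5, 0.5, 0.05, d))  # roughly 1.0 and 0.12
```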

3.3. Mean Outcome in the Unobserved Group

We further assume $e(X) < 1 - \eta$, so that there can be missing units conditioning on $X$. Otherwise, the estimand $\tau_0 = E(Y \mid Z = 0)$ would be meaningless. To estimate $\tau_0$, we must use the information from the observed units because there are no observations of $Y$ in the $Z = 0$ group. The estimand $\tau_0$ involves a covariate shift from the overall population to the $Z = 0$ group, which imposes a weight $\{1 - e(x)\}/P(Z = 0)$. The inverse probability weighting formula for $\tau_0$ is
\[ \tau_0 = E\left\{\frac{ZY}{e(X)}\cdot\frac{1 - e(X)}{P(Z = 0)}\right\}. \]
By appending an augmentation term and estimating $P(Z = 0)$ by its empirical average, the oracle doubly robust estimator of $\tau_0$ is given by [17]
\[ \hat{\tau}_0^* = \frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n \frac{Z_iY_i\{1 - e(X_i)\} - \{Z_i - e(X_i)\}\,m(X_i)}{e(X_i)}. \]
It can be shown that $E(\hat{\tau}_0^*) = \tau_0$ because
\[
\begin{aligned}
E(\hat{\tau}_0^*)
&= E\left\{\frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n \frac{Z_iY_i\{1 - e(X_i)\} - \{Z_i - e(X_i)\}m(X_i)}{e(X_i)}\right\} \\
&= E\left[E\left\{\frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n \frac{Z_iY_i\{1 - e(X_i)\} - \{Z_i - e(X_i)\}m(X_i)}{e(X_i)}\,\Big|\,Z_1, \dots, Z_n, X_1, \dots, X_n\right\}\right] \\
&= E\left\{\frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n \frac{Z_im(X_i)\{1 - e(X_i)\} - \{Z_i - e(X_i)\}m(X_i)}{e(X_i)}\right\} \\
&= E\left\{\frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n (1 - Z_i)m(X_i)\right\}
= E\left[E\left\{\frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n (1 - Z_i)m(X_i)\,\Big|\,Z_1, \dots, Z_n\right\}\right] \\
&= E\left\{\frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n (1 - Z_i)E\{m(X_i) \mid Z_i = 0\}\right\}
= E\left\{\frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n (1 - Z_i)E(Y_i \mid Z_i = 0)\right\} = \tau_0.
\end{aligned}
\]
In fact, $E(\hat{\tau}_0^*) = \tau_0$ if either $e(x)$ or $m(x)$ is correctly specified.
To verify the bounded difference condition, suppose we replace the observation of the $k$th unit, $O_k = (Z_k, Z_kY_k, X_k)$, with $O_k' = (Z_k', Z_k'Y_k', X_k')$. If $Z_k = Z_k'$,
\[
\begin{aligned}
|D_0^*| &:= \left|\hat{\tau}_0^*(O_1, \dots, O_k, \dots, O_n) - \hat{\tau}_0^*(O_1, \dots, O_k', \dots, O_n)\right| \\
&= \frac{1}{\sum_{i=1}^n(1 - Z_i)}\left|\frac{Z_kY_k\{1 - e(X_k)\} - \{Z_k - e(X_k)\}m(X_k)}{e(X_k)} - \frac{Z_k'Y_k'\{1 - e(X_k')\} - \{Z_k' - e(X_k')\}m(X_k')}{e(X_k')}\right| \\
&\le \frac{2}{n\eta}\cdot\frac{2M}{\eta} =: c_0^*
\end{aligned}
\]
with probability larger than
\[ \delta_n = P\left(\sum_{i=1}^n(1 - Z_i) \ge n\eta\right) = \sum_{m=0}^{[n(1-\eta)]}\binom{n}{m}\eta^m(1 - \eta)^{n-m}. \]
By the weak law of large numbers, $\delta_n \to 1$ as $n \to \infty$ because $E(1 - Z_i) > \eta$. If $Z_k \ne Z_k'$, without loss of generality suppose $Z_k = 0$ and $Z_k' = 1$. Using Equation (19),
\[
\begin{aligned}
|D_0^*| &:= \left|\hat{\tau}_0^*(O_1, \dots, O_k, \dots, O_n) - \hat{\tau}_0^*(O_1, \dots, O_k', \dots, O_n)\right| \\
&= \frac{1}{\sum_{i=1}^n(1 - Z_i)}\left|\frac{Z_kY_k\{1 - e(X_k)\} - \{Z_k - e(X_k)\}m(X_k)}{e(X_k)} - \frac{Z_k'Y_k'\{1 - e(X_k')\} - \{Z_k' - e(X_k')\}m(X_k')}{e(X_k')} - \hat{\tau}_0^*\right| \\
&\le \frac{1}{n\eta}\left(\frac{2}{\eta} - 1 + 1\right)M \le c_0^*
\end{aligned}
\]
with probability larger than $\delta_n$. By McDiarmid's inequality,
\[ P\left(\sqrt{n}\,|\hat{\tau}_0^* - \tau_0| \ge t\right) \le 2\exp\left(-\frac{\eta^4 t^2}{8M^2}\right) + 1 - \delta_n. \qquad (26) \]
If the models $e(x)$ and $m(x)$ are unknown, we use cross-fitting to obtain the AIPW estimator
\[ \hat{\tau}_0 = \frac{1}{\sum_{i=1}^n(1 - Z_i)}\sum_{i=1}^n \frac{Z_iY_i\{1 - \hat{e}(X_i)\} - \{Z_i - \hat{e}(X_i)\}\hat{m}(X_i)}{\hat{e}(X_i)}. \qquad (27) \]
This cross-fitted estimator can achieve first-order equivalence with the oracle estimator $\hat{\tau}_0^*$ (which we prove below). In fact, the estimator (27) can be further expressed as
\[ \hat{\tau}_0 = \frac{|\mathcal{I}_1|}{\sum_{i=1}^n(1 - Z_i)}\hat{\tau}_0(\mathcal{I}_1) + \frac{|\mathcal{I}_2|}{\sum_{i=1}^n(1 - Z_i)}\hat{\tau}_0(\mathcal{I}_2), \]
with
\[ \hat{\tau}_0(\mathcal{I}_l) = \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l}\frac{Z_iY_i\{1 - \hat{e}^{(3-l)}(X_i)\} - \{Z_i - \hat{e}^{(3-l)}(X_i)\}\hat{m}^{(3-l)}(X_i)}{\hat{e}^{(3-l)}(X_i)}, \]
where $\hat{m}^{(3-l)}(x)$ and $\hat{e}^{(3-l)}(x)$ are the models fitted on the $(3-l)$th half of the sample ($l = 1, 2$). As a comparison, the oracle estimator can be decomposed as
\[ \hat{\tau}_0^* = \frac{|\mathcal{I}_1|}{\sum_{i=1}^n(1 - Z_i)}\hat{\tau}_0^*(\mathcal{I}_1) + \frac{|\mathcal{I}_2|}{\sum_{i=1}^n(1 - Z_i)}\hat{\tau}_0^*(\mathcal{I}_2), \]
with
\[ \hat{\tau}_0^*(\mathcal{I}_l) = \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l}\frac{Z_iY_i\{1 - e(X_i)\} - \{Z_i - e(X_i)\}m(X_i)}{e(X_i)}. \]
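For illustration, here is a sketch of the cross-fitted estimator of $\tau_0$ in Equation (27), mirroring the earlier cross-fitting sketch; the working models are again placeholder assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_tau0_crossfit(Z, Y, X, seed=0):
    """Cross-fitted doubly robust estimate of tau_0 = E(Y | Z = 0)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z))
    halves = [idx[: len(Z) // 2], idx[len(Z) // 2 :]]
    numerator = 0.0
    for l in (0, 1):
        fit, ev = halves[1 - l], halves[l]
        e_hat = LogisticRegression().fit(X[fit], Z[fit]).predict_proba(X[ev])[:, 1]
        obs = fit[Z[fit] == 1]
        m_hat = LinearRegression().fit(X[obs], Y[obs]).predict(X[ev])
        z_ev = Z[ev]
        y_ev = np.where(z_ev == 1, Y[ev], 0.0)
        numerator += np.sum((z_ev * y_ev * (1 - e_hat) - (z_ev - e_hat) * m_hat) / e_hat)
    return numerator / np.sum(1 - Z)
```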
Theorem 3. Let $\hat{\tau}_0$ be the AIPW estimator of $\tau_0 = E(Y \mid Z = 0)$ by cross-fitting given in (27). Under Assumptions 1–4 and $e(x) < 1 - \eta$, for any $1 - \delta_n < \epsilon < 1$,
\[ |\hat{\tau}_0 - \tau_0| \le \sqrt{\frac{8M^2}{n\eta^4}\log\frac{2}{\epsilon + \delta_n - 1}} + o(n^{-1}) \]
with probability larger than $1 - \epsilon$.
Proof. For $l = 1, 2$,
\[
\begin{aligned}
\hat{\tau}_0(\mathcal{I}_l) - \hat{\tau}_0^*(\mathcal{I}_l)
&= \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l} Z_i\left\{\frac{1}{\hat{e}^{(3-l)}(X_i)} - \frac{1}{e(X_i)}\right\}\{Y_i - m(X_i)\} \\
&\quad - \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l}\{\hat{m}^{(3-l)}(X_i) - m(X_i)\}\left\{\frac{Z_i}{e(X_i)} - 1\right\} \\
&\quad - \frac{1}{|\mathcal{I}_l|}\sum_{i \in \mathcal{I}_l} Z_i\left\{\frac{1}{\hat{e}^{(3-l)}(X_i)} - \frac{1}{e(X_i)}\right\}\{\hat{m}^{(3-l)}(X_i) - m(X_i)\}.
\end{aligned}
\]
Note that the first two terms are sums of cross products of a mean-zero random variable and an independent $o_p(1)$ random variable (since $\mathcal{I}_l$ can be regarded as fixed by conditioning), so the first two terms are $o_p(n^{-1/2})$. Assumption 4 implies that the third term is also $o_p(n^{-1/2})$ by the Cauchy–Schwarz inequality. Therefore, $|\hat{\tau}_0(\mathcal{I}_l) - \hat{\tau}_0^*(\mathcal{I}_l)| = o_p(n^{-1/2})$ and thus $\sqrt{n}\,|\hat{\tau}_0 - \hat{\tau}_0^*| = o_p(1)$. From (26), we have
\[ P\left(\sqrt{n}\,|\hat{\tau}_0 - \tau_0| \ge t + o(1)\right) \le 2\exp\left(-\frac{\eta^4 t^2}{8M^2}\right) + 1 - \delta_n. \]
Let the right-hand side be $\epsilon$, so
\[ t = \sqrt{\frac{8M^2}{\eta^4}\log\frac{2}{\epsilon + \delta_n - 1}}. \]

4. Construction of Error Bounds for Sub-Gaussian Outcomes

Recently, McDiarmid's inequality has been extended to unbounded settings [18]. In this section, we assume that $Y$ is sub-Gaussian, i.e.,
\[ E\left(e^{tY}\right) \le e^{t^2\sigma^2/2}, \quad t \in \mathbb{R}, \]
for some $\sigma^2 > 0$ [13]. Define the sub-Gaussian norm
\[ \|Y\|_{\mathcal{G}} := \sup_{k \ge 1}\left\{\frac{E(Y^{2k})}{(2k-1)!!}\right\}^{1/(2k)} \]
for a random variable $Y$. The random variable $Y$ is sub-Gaussian if $\|Y\|_{\mathcal{G}} < \infty$.
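Since Section 5 estimates sub-Gaussian norms from empirical moments, here is a hypothetical sketch of such an estimate: the expectation in the definition is replaced by an empirical $2k$-th moment and the supremum is truncated at a finite $k_{\max}$; both are implementation assumptions, not specifications from the paper.

```python
import numpy as np

def sub_gaussian_norm(y, k_max=10):
    """Empirical version of ||Y||_G = sup_k {E(Y^{2k}) / (2k-1)!!}^{1/(2k)}."""
    y = np.asarray(y, dtype=float)
    best = 0.0
    double_factorial = 1.0              # accumulates (2k - 1)!!
    for k in range(1, k_max + 1):
        double_factorial *= 2 * k - 1
        moment = np.mean(y ** (2 * k))  # empirical 2k-th moment
        best = max(best, (moment / double_factorial) ** (1.0 / (2 * k)))
    return best
```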
Lemma 2 (Zhang–Lei's inequality [18]). Suppose $O_1, \dots, O_n$ are independent random variables taking values in the set $\mathcal{O}$, and let $f: \mathcal{O}^n \to \mathbb{R}$ be a function. Define
\[ D_{f, O_k}(o) = f(o_1, \dots, o_{k-1}, O_k, o_{k+1}, \dots, o_n) - E\{f(o_1, \dots, o_{k-1}, O_k, o_{k+1}, \dots, o_n)\}. \]
If $\{D_{f, O_i}(o)\}_{i=1}^n$ have finite $\|\cdot\|_{\mathcal{G}}$-norms, then for $t > 0$,
\[ P\left(\left|f(O_1, \dots, O_n) - E\{f(O_1, \dots, O_n)\}\right| \ge t\right) \le 2\exp\left(-\frac{t^2}{16\,\sup_{o \in \mathcal{O}^n}\sum_{i=1}^n \|D_{f, O_i}(o)\|_{\mathcal{G}}^2}\right). \]
Now we assume that the conditional mean outcome $m(X)$ and the residual $Y - m(X)$ are sub-Gaussian, so that $Y$ is also sub-Gaussian. For $\tau = E(Y)$, let $\hat{\tau}^*$ be the oracle AIPW estimator. Then,
\[ D_{\hat{\tau}^*, O_k}(o) = \frac{1}{n}\left\{\frac{Z_kY_k}{e(X_k)} + \left(1 - \frac{Z_k}{e(X_k)}\right)m(X_k) - \tau\right\}. \]
To verify that $D_{\hat{\tau}^*, O_k}(o)$ is sub-Gaussian, it suffices to show that the $2k$-th moment norm of $D_{\hat{\tau}^*, O_k}(o)$ is finite for every $k \ge 1$. By the triangle inequality,
\[
\begin{aligned}
\left\|D_{\hat{\tau}^*, O_k}(o)\right\|_{2k}
&= \frac{1}{n}\left\|\frac{Z_kY_k}{e(X_k)} + \left(1 - \frac{Z_k}{e(X_k)}\right)m(X_k) - \tau\right\|_{2k} \\
&\le \left\|\frac{Z_k\{Y_k - m(X_k)\}}{n\,e(X_k)}\right\|_{2k} + \left\|\frac{m(X_k) - \tau}{n}\right\|_{2k}
\le \left\|\frac{Y_k - m(X_k)}{n\eta}\right\|_{2k} + \left\|\frac{m(X_k) - \tau}{n}\right\|_{2k} < \infty,
\end{aligned}
\]
where $\|W\|_{2k} := \{E|W|^{2k}\}^{1/(2k)}$ for a random variable $W$.
Let $\nu = \|m(X)\|_{\mathcal{G}}$ and $\sigma = \|Y - m(X)\|_{\mathcal{G}}$, so $\|D_{\hat{\tau}^*, O_k}(o)\|_{\mathcal{G}} \le (\sigma/\eta + \nu)/n$. Therefore,
\[ P\left(\sqrt{n}\,|\hat{\tau}^* - \tau| \ge t\right) \le 2\exp\left(-\frac{t^2}{16(\sigma/\eta + \nu)^2}\right). \qquad (35) \]
Considering that the AIPW estimator $\hat{\tau}$ is first-order equivalent to its oracle version $\hat{\tau}^*$, we have the following error bound for $\hat{\tau}$.
Theorem 4. Let $\hat{\tau}$ be the AIPW estimator of $\tau = E(Y)$ by cross-fitting. Under Assumptions 1–4, for any $0 < \epsilon < 1$,
\[ |\hat{\tau} - \tau| \le \sqrt{\frac{16(\sigma/\eta + \nu)^2\log(2/\epsilon)}{n}} + o(n^{-1}) \]
with probability larger than $1 - \epsilon$.
The proof is similar to that of Theorem 1. Since $\hat{\tau}$ and $\hat{\tau}^*$ are first-order equivalent, we can replace the left-hand side of Inequality (35) with $P(\sqrt{n}\,|\hat{\tau} - \tau| \ge t + o(1))$ and then solve for $t$. Furthermore, we can apply Lemma 2 to study the non-asymptotic error bounds for $\hat{\tau}_1$ and $\hat{\tau}_0$, the (doubly robust) estimators of $\tau_1 = E(Y \mid Z = 1)$ and $\tau_0 = E(Y \mid Z = 0)$. Given that $Y$ is sub-Gaussian, $D_{\hat{\tau}_z, O_k}(o)$ is a weighted average of sub-Gaussian random variables with bounded weights ($z = 1, 0$). On the high-probability event that $\sum_{i=1}^n Z_i/n$ is bounded away from 0 and 1, the error bounds for $\hat{\tau}_z$ can then be obtained.

5. Simulation

Suppose there are two independent covariates, $X_1$ and $X_2$, both following the uniform distribution on $[-1, 1]$. Denote $X = (1, X_1, X_2)$. The propensity score is $e(x) = \mathrm{expit}(1 - 0.5X_1 + 0.5X_2)$, where $\mathrm{expit}(x) = 1/\{1 + \exp(-x)\}$. Since $X$ is bounded, the propensity score is bounded away from zero and one. Next, we consider two types of outcomes. The first is binary: $P(Y = 1 \mid X) = \mathrm{expit}(X_1 + X_2)$. The second is continuous: $Y$ follows a normal distribution with mean $X_1 + X_2$ and standard deviation one.
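The paper does not provide simulation code; the following Python sketch reflects our reading of the data-generating process just described.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_data(n, outcome="continuous", seed=0):
    rng = np.random.default_rng(seed)
    X1 = rng.uniform(-1, 1, n)
    X2 = rng.uniform(-1, 1, n)
    e = expit(1 - 0.5 * X1 + 0.5 * X2)         # true propensity score
    Z = rng.binomial(1, e)                      # missingness indicator
    if outcome == "binary":
        Y = rng.binomial(1, expit(X1 + X2)).astype(float)
    else:
        Y = X1 + X2 + rng.normal(0.0, 1.0, n)   # continuous outcome
    Y = np.where(Z == 1, Y, np.nan)             # outcomes are missing when Z = 0
    return Z, Y, np.column_stack([X1, X2])
```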
We set the sample size $n$ from 100 to 1000 in steps of 100. Under each sample size, we generate data 1000 times and calculate the average width of the error bounds. Figure 1 displays the widths of the error bounds in Theorem 1 (based on McDiarmid's inequality) and Theorem 4 (based on Zhang–Lei's inequality), respectively. Here, we set the parameters in the formulas of the error bounds at their true values.
In practice, the parameters $\eta$, $\sigma$ and $\nu$ are unknown, so we need to estimate them. We fit a logistic regression of $Z$ on $X$ as the propensity score model, denoted by $\hat{e}(x)$, and estimate $\eta$ by $\hat{\eta} = \min_i \hat{e}(X_i)$. The outcome regression, denoted by $\hat{m}(x)$, is fitted by logistic regression for binary outcomes and by linear regression for continuous outcomes. If $Y \in \{0, 1\}$ is binary, we can transform $Y$ to $Y - 0.5$, so that $M = 0.5$. If $Y$ is continuous, the bound of $Y$ can be estimated by $\hat{M} = \max_i |Y_i - \sum_{j=1}^n Y_j/n|$. The sub-Gaussian norms $\sigma$ and $\nu$ are estimated from the empirical $2k$-th moments of $Y_i - \hat{m}(X_i)$ and $\hat{m}(X_i)$, denoted by $\hat{\sigma}$ and $\hat{\nu}$. We then obtain two error bounds according to Theorems 1 and 4, respectively.
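A sketch of these plug-in parameter estimates is given below. The logistic and linear working models mirror the text; the truncated empirical sub-Gaussian norm (as in the Section 4 sketch) and the remaining handling details are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def sub_gaussian_norm(y, k_max=10):
    # empirical, truncated version of ||.||_G, as sketched in Section 4
    y = np.asarray(y, dtype=float)
    return max((np.mean(y ** (2 * k)) / float(np.prod(np.arange(1, 2 * k, 2)))) ** (1 / (2 * k))
               for k in range(1, k_max + 1))

def estimate_bound_parameters(Z, Y, X, binary=False):
    """Plug-in estimates of eta, M, sigma and nu from the observed data."""
    obs = Z == 1
    e_hat = LogisticRegression().fit(X, Z).predict_proba(X)[:, 1]
    eta_hat = e_hat.min()                                # eta-hat = min_i e-hat(X_i)
    if binary:
        fit = LogisticRegression().fit(X[obs], Y[obs].astype(int))
        m_hat = fit.predict_proba(X)[:, 1]
        M_hat = 0.5                                      # after recentering Y to Y - 0.5
    else:
        m_hat = LinearRegression().fit(X[obs], Y[obs]).predict(X)
        M_hat = np.max(np.abs(Y[obs] - Y[obs].mean()))   # estimated bound of the centered outcome
    sigma_hat = sub_gaussian_norm(Y[obs] - m_hat[obs])   # residual sub-Gaussian norm
    nu_hat = sub_gaussian_norm(m_hat)                    # regression-function sub-Gaussian norm
    return eta_hat, M_hat, sigma_hat, nu_hat
```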
Figure 2 displays the widths of the error bounds in Theorems 1 and 4 with estimated parameters. When the sample size $n$ is small, the empirical error bounds may be slightly narrower than the oracle versions. In the binary case, the error bound based on Theorem 1 (McDiarmid's inequality) performs better; in the continuous case, the error bound based on Theorem 4 (Zhang–Lei's inequality) performs better. In fact, if the outcome is binary, it is naturally bounded, and no information about higher-order moments is needed. If the outcome is continuous, the estimate of the bound $M$ is sensitive to extreme outcome values, so the error bound based on McDiarmid's inequality can be unstable and too wide. In contrast, Zhang–Lei's inequality only requires finite moments of the outcome without restricting its maximum value, so it is more appropriate for unbounded outcomes.

6. Discussion: Relation to Causal Inference

Compared with asymptotic arguments, non-asymptotic inference provides more accurate error bounds when the sample size is not large. However, non-asymptotic error bounds may be conservative, so it is of great interest to investigate the performance of different types of non-asymptotic error bounds. In the simulation study, we found that one type of error bound may outperform another in certain scenarios. Considerable effort has been devoted to sharpening error bounds under weaker conditions on the distribution (e.g., the moments) of the outcome variable [13,19,20,21,22]. In practice, we can construct several error bounds, as long as the conditions required by the corresponding concentration inequalities are satisfied, and choose the shortest one.
This work is closely related to causal inference. As a missing data problem, causal inference aims to draw statistical inferences about the difference between the average potential outcomes under treatment and under control [23]. Let $Z \in \{1, 0\}$ be the treatment indicator and $Y(z)$ be the potential outcome associated with treatment $z \in \{1, 0\}$. For each unit, only one of the two potential outcomes, either $Y(1)$ or $Y(0)$, is observable, while the other is missing. To identify the average causal effect, we usually assume causal consistency, $Y(Z) = Y$, and unconfoundedness, $(Y(1), Y(0)) \perp\!\!\!\perp Z \mid X$. Unconfoundedness, similar to missingness at random, says that the treatment assignment $Z$ can only rely on the observed baseline covariates $X$, rather than on the potential outcomes $(Y(1), Y(0))$, which could be missing by design [2].
We take inference on $Y(1)$ as an example. $E\{Y(1)\}$ is the population mean; $E\{Y(1) \mid Z = 1\}$ is the mean in the observed group (where $Y(1)$ is observable); and $E\{Y(1) \mid Z = 0\}$ is the mean in the unobserved group (where $Y(1)$ is unobservable), in both cases treating $Z = 1$ as observed and $Z = 0$ as missing. It is also straightforward to infer the average causal effect $E\{Y(1) - Y(0)\}$, since the AIPW estimator is just a combination of two parts, corresponding to $E\{Y(1)\}$ and $E\{Y(0)\}$, respectively. Under causal consistency, unconfoundedness, positivity, boundedness (or sub-Gaussianity), sup-norm consistency and risk decay, the non-asymptotic bounds on the tail probabilities of the AIPW estimators can be similarly established by McDiarmid's inequality or Zhang–Lei's inequality.
Other estimands that may be of interest are the average causal effect on the treated (ATT), $E\{Y(1) - Y(0) \mid Z = 1\}$, and the average causal effect on the controls (ATC), $E\{Y(1) - Y(0) \mid Z = 0\}$. Take the ATT as an example. It can be decomposed into $E\{Y(1) \mid Z = 1\}$ and $E\{Y(0) \mid Z = 1\}$. The former corresponds to the mean outcome in the observed group, as $Y(1)$ is observed in the $Z = 1$ group, and the latter corresponds to the mean outcome in the unobserved group, as $Y(0)$ is unobserved in the $Z = 1$ group. By examining the expression of the ATT estimator given in [17], the bounded difference condition holds with high probability, so the non-asymptotic law can be established.

Author Contributions

Conceptualization, F.W.; Methodology, Y.D.; Investigation, F.W. and Y.D.; Writing—original draft, F.W. and Y.D.; Writing—review & editing, Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Grant No. 2021YFF0901400), National Natural Science Foundation of China (Grant No. 12026606) and Fundamental Research Funds for the Central Universities (Grant No. 2022QNPY73). This work is also partly supported by Novo Nordisk A/S.

Data Availability Statement

Data availability is not applicable to this article as no new data were created or analysed in this study.

Acknowledgments

We thank the editor of this special issue “New Advances in High-Dimensional and Non-asymptotic Statistics” and reviewers for their kind comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592.
  2. Mealli, F.; Rubin, D.B. Clarifying missing at random and related definitions, and implications when coupled with exchangeability. Biometrika 2015, 102, 995–1000.
  3. Rosenbaum, P.R.; Rubin, D.B. The central role of the propensity score in observational studies for causal effects. Biometrika 1983, 70, 41–55.
  4. Hirano, K.; Imbens, G.W.; Ridder, G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 2003, 71, 1161–1189.
  5. Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 1995, 90, 106–121.
  6. Hahn, J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 1998, 66, 315–331.
  7. Tsiatis, A.A. Semiparametric Theory and Missing Data; Springer: Berlin/Heidelberg, Germany, 2006.
  8. Bang, H.; Robins, J.M. Doubly robust estimation in missing data and causal inference models. Biometrics 2005, 61, 962–973.
  9. Farrell, M.H. Robust inference on average treatment effects with possibly more covariates than observations. J. Econom. 2015, 189, 1–23.
  10. Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Hansen, C.; Newey, W.; Robins, J. Double/debiased machine learning for treatment and structural parameters. Econom. J. 2018, 21, C1–C68.
  11. Newey, W.K.; Robins, J.R. Cross-fitting and fast remainder rates for semiparametric estimation. arXiv 2018, arXiv:1801.09138.
  12. Kennedy, E.H. Optimal doubly robust estimation of heterogeneous causal effects. arXiv 2020, arXiv:2004.14497.
  13. Zhang, H.; Chen, S.X. Concentration inequalities for statistical inference. Commun. Math. Res. 2020, 37, 1–85.
  14. Zhang, H.; Wei, H. Sharper sub-Weibull concentrations. Mathematics 2022, 10, 2252.
  15. Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
  16. McDiarmid, C. On the method of bounded differences. Surv. Comb. 1989, 141, 148–188.
  17. Mercatanti, A.; Li, F. Do debit cards increase household spending? Evidence from a semiparametric causal analysis of a survey. Ann. Appl. Stat. 2014, 8, 2485–2508.
  18. Zhang, H.; Lei, X. Non-asymptotic optimal prediction error for growing-dimensional partially functional linear models. arXiv 2022, arXiv:2009.04729.
  19. Hsu, D.J.; Kakade, S.M.; Zhang, T. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 2012, 17, 1–6.
  20. Bennett, G. Probability inequalities for the sum of independent random variables. J. Am. Stat. Assoc. 1962, 57, 33–45.
  21. Bernstein, S. On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukr. Sect. Math 1924, 1, 38–49.
  22. Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013.
  23. Ding, P.; Li, F. Causal inference. Stat. Sci. 2018, 33, 214–237.
Figure 1. Width of the error bounds with increasing sample sizes (using true parameters).
Figure 2. Width of the error bounds with increasing sample sizes (using estimated parameters).

