Article

Robust Causal Estimation from Observational Studies Using Penalized Spline of Propensity Score for Treatment Comparison

by Tingting Zhou 1,*, Michael R. Elliott 2 and Roderick J. A. Little 2

1 U.S. Food and Drug Administration, 10903 New Hampshire Ave, Silver Spring, MD 20993, USA
2 Department of Biostatistics, School of Public Health, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA
* Author to whom correspondence should be addressed.
Stats 2021, 4(2), 529-549; https://doi.org/10.3390/stats4020032
Submission received: 29 April 2021 / Revised: 29 May 2021 / Accepted: 6 June 2021 / Published: 10 June 2021
(This article belongs to the Special Issue Robust Statistics in Action)

Abstract: Without randomization of treatments, valid inference of treatment effects from observational studies requires controlling for all confounders, because the treated subjects generally differ systematically from the control subjects. Confounding control is commonly achieved using the propensity score, defined as the conditional probability of assignment to a treatment given the observed covariates. The propensity score collapses all the observed covariates into a single measure and serves as a balancing score, such that treated and control subjects with similar propensity scores can be directly compared. Common propensity score-based methods include regression adjustment and inverse probability of treatment weighting using the propensity score. We recently proposed a robust multiple imputation-based method, penalized spline of propensity for treatment comparisons (PENCOMP), that includes a penalized spline of the assignment propensity as a predictor. Under the Rubin causal model assumptions that there is no interference across units, that each unit has a non-zero probability of being assigned to either treatment group, and that there are no unmeasured confounders, PENCOMP has a double robustness property for estimating treatment effects. In this study, we examine the impact on PENCOMP of using variable selection techniques that restrict predictors in the propensity score model to true confounders of the treatment-outcome relationship. We also propose a variant of PENCOMP and compare alternative approaches to standard error estimation for PENCOMP. Compared to the weighted estimators, PENCOMP is less affected by the inclusion of non-confounding variables in the propensity score model. We illustrate the use of PENCOMP and competing methods in estimating the impact of antiretroviral treatments on CD4 counts in HIV+ patients.

1. Introduction

Observational studies are important for evaluating causal effects, especially when randomization of treatments is unethical or expensive. Valid inferences about causal effects from observational studies can only be drawn by controlling for all confounders, that is, pre-treatment variables that are related to both treatment allocation and the outcome, because the treated subjects generally differ systematically from the control subjects. For example, sicker HIV patients are more likely to take antiretroviral treatments to control the virus when their CD4 cell counts drop too low. The CD4 cell count, a measure of how well the immune system functions, is one clinical measure of the effectiveness of an antiretroviral treatment. Direct comparison of the CD4 counts between the treated and the control would lead to the false conclusion that antiretroviral treatments result in lower CD4 counts. Thus, to assess the effects of using antiretroviral treatments on the CD4 counts from an observational study, such as the Multicenter AIDS Cohort study (MACS) [1], appropriate statistical methods are needed to remove confounding by patient characteristics.
To deal with confounding by patient characteristics, the propensity score, the conditional probability of assignment to a treatment given the observed covariates, is commonly used. Rosenbaum and Rubin (1983) [2] showed that controlling for the propensity score is sufficient to remove bias due to differences in the observed covariates between treatment groups. The propensity score summarizes the observed covariates into a single measure and serves as a dimension reduction technique. Due to the balancing property of the propensity score, the treated and control subjects with similar propensity scores can be directly compared [2]. For example, in our application, because sicker patients were more likely to be treated, we can adjust for that by controlling for the patient’s probability of receiving treatment given all observed histories prior to treatment. After controlling for the propensity score, the distribution of the observed covariates, in this case, the proportion of sicker patients, will be similar between the treated and the control subjects, so the CD4 counts between the treated and the control subjects with similar propensity scores can be compared.
More generally, propensity-score-based methods first estimate the probability of treatment assignment given potential confounding variables, and then use the estimated treatment probability as a weight, or as a predictor in regression models for the outcome under alternative treatment assignments. Inverse probability of treatment weighting (IPTW) controls for confounding by weighting subjects by the inverse of the estimated probability of receiving the observed treatment. The weights in effect create a pseudo-population that is free of treatment confounders. The IPTW estimator is consistent if the propensity score model is correct. Like IPTW, augmented IPTW estimation (AIPTW) uses the estimated propensity score as a weight but also incorporates predictions from a regression model for the outcome. The AIPTW estimator consistently estimates causal effects if either the propensity score model or the outcome model is correctly specified. Both IPTW and AIPTW estimators are based on the Rubin (1974) [3] causal model framework. As such, the estimators are consistent under the causal model assumptions that there is no interference across units (stable unit treatment value assumption, SUTVA), that each unit has a non-zero probability of being assigned to either treatment group (positivity), and that there are no unmeasured confounders (ignorability) [3]. Throughout, by robustness we mean robustness to mis-specification of the covariates in regression models, rather than robustness to outliers in the residuals; the latter might be achieved by replacing the assumption of normality in the distribution of errors with a longer-tailed distribution, such as Student's t [4].
Another recently developed method, Penalized Spline of Propensity Methods for Treatment Comparison (PENCOMP), imputes missing potential outcomes using regression models that include splines on the logit of the estimated probability to be assigned that treatment, as well as other covariates that are predictive of the outcome. The idea is based on the potential outcome framework of the Rubin causal model [3]. In the Rubin causal model, potential outcomes are defined as potentially observable outcomes under different treatments or exposure groups. Individual causal effects are defined as comparisons of the potential outcomes for that subject. Only the potential outcome corresponding to the treatment assigned is observed for any subject. Thus, causal inferences are based on comparisons of the imputed and the observed outcomes. Under the Rubin causal model assumptions, PENCOMP has a double robustness property for estimating treatment effects [5]. Specifically, under these standard causal inference assumptions, PENCOMP consistently estimates the causal effects if the propensity score model is correctly specified and the relationship between the outcome and the logit of the propensity score is modeled correctly, or if the relationship between the outcome and other covariates is modeled correctly.
In this paper, we study important, unresolved questions concerning how to generate robust causal inferences from observational studies. As mentioned above, common approaches to robust causal inference involve fitting two models: (a) a propensity score model, where the outcome is the indicator for which treatment is assigned and the predictors are potential confounding variables; (b) the outcome model, which relates the outcome to the treatment, and includes the propensity score as a predictor variable or as a weight, usually the inverse of the estimated propensity to be selected. Our paper concerns practical strategies for how these regression models are specified.
For valid inferences, all true confounders should be included in the propensity score model. Ideally, we would know the set of true confounders, but in observational studies this information is rarely if ever known. Given this fact, the question of how to select variables to be included in the propensity score model is important and controversial. Some researchers have argued that all pre-treatment potential confounders should be included in the propensity model prior to seeing the outcome data, to avoid “data snooping” and mimic, as closely as possible, a randomized trial, where randomization occurs prior to observing the outcomes [6].
On the other hand, this strategy may lead to the inclusion of variables that are associated with treatment selection but not with the outcome and, hence, are not true confounders; including them in the propensity score model can lead to highly inefficient and non-robust inferences. The reason is that including these variables shrinks the overlap region of the propensity score distributions for the treatments, leading to weighted estimators with extreme weights, or to regression estimators that are vulnerable to model mis-specification—for example, mis-specifying a nonlinear relationship as linear. To limit this problem, it has been argued that variable selection should consider the relationship between the variable and the outcome, provided this is not done in a way that prejudices the estimated treatment effect [7,8,9,10].
Another consideration is that including variables in the outcome model that are not associated with treatment allocation—and, hence, are not true confounders—but are related to the outcome can improve the efficiency of the causal estimate [11].
Our paper examines these aspects in detail with both simulation studies and an application, and offers a discussion with important takeaways both for researchers who believe all pre-treatment confounders should be included and for those who believe variable selection is always necessary. Specifically, we examine the performance of alternative confounder selection methods in PENCOMP, IPTW, and AIPTW, with and without considering the relationships between the covariates and the outcome. We also address issues of model selection and model uncertainty. For PENCOMP, we propose a new variant based on bootstrap smoothing, also called bagging. For AIPTW and IPTW, we consider an alternative approach for estimating standard errors and confidence intervals that accounts for model uncertainty.
In Section 2, we describe the estimands and causal inference assumptions, two versions of PENCOMP for estimating causal effects (one based on multiple imputation, the other based on bootstrap smoothing), and two estimation procedures for AIPTW and IPTW. In Section 3, we describe model selection for the propensity score model and the outcome model. In Section 4, we describe simulation studies of how the specification of the propensity score model affects the performance of PENCOMP, AIPTW, and IPTW, with results presented in Section 5. In Section 6, we illustrate our methods using the Multicenter AIDS Cohort Study (MACS) to estimate the effect of antiretroviral treatment on CD4 counts in HIV-infected patients. We conclude with a discussion of the results and some possible future work.

2. Materials and Methods

2.1. Estimands and Assumptions

Let $X_i$ denote the vector of baseline covariates, and let $Z_i \in \{0, 1\}$ denote a binary treatment, with $Z_i = 1$ for treatment and $Z_i = 0$ for control, for subjects $i = 1, \ldots, N$. Under Rubin's potential outcome framework [3], subject-level causal effects are defined as the difference between the potential outcome for a subject under treatment and the potential outcome under control. Only one of the potential outcomes is observed for each subject. Let $Y_i^z$ be the potential outcome under treatment $Z_i = z$. Here, we focus on inference about the average treatment effect (ATE), $E(Y^1 - Y^0)$, obtained by averaging subject-level causal effects across the entire population of interest.
We make the following assumptions in order to estimate causal effects.
(1)
The stable unit-treatment value assumption (SUTVA) states that (a) the potential outcome under a subject’s observed treatment is precisely the subject’s observed outcome. In other words, there are no different versions of potential outcomes under a given treatment for each subject, and (b) the potential outcomes for a subject are not influenced by the treatment assignments of other subjects [12,13].
(2)
Positivity states that each subject has a positive probability of being assigned to either treatment of interest: $0 < \Pr(Z_i = z_i \mid X_i) < 1$, where $\Pr(Z_i = z_i \mid X_i)$ denotes the probability of being assigned to treatment $z_i$ given the observed covariates $X_i$.
(3)
The ignorable treatment assignment assumption states that $(Y_i^1, Y_i^0) \perp Z_i \mid X_i$; that is, treatment assignment is as if randomized conditional on the set of covariates $X_i$.

2.2. PENCOMP and Multiple Imputation

Because each subject only receives one treatment, we observe the potential outcome under the observed treatment but not the potential outcome under the alternative treatment. PENCOMP imputes the missing potential outcomes using regression models that include splines on the logit of the estimated probability of being assigned that treatment, as well as other covariates that are predictive of the outcome. We then draw inferences based on comparisons of the imputed and observed outcomes. PENCOMP, which builds on the Penalized Spline of Propensity Prediction (PSPP) method for missing data problems [14,15], relies on the balancing property of the propensity score, in combination with the outcome model. Under the assumptions stated above, PENCOMP has a double robustness property for causal effects. Specifically, the causal effect of the treatment will be consistently estimated if either (1) the model for the propensity score and the relationship between the outcome and the propensity score, modeled through the penalized spline, are correctly specified, or (2) the outcome model is correct [5].
Here, we briefly describe the estimation procedure for PENCOMP based on multiple imputation with Rubin's combining rules [16]; an illustrative code sketch follows step (e) below.
(a)
For $d = 1, \ldots, D$, generate a bootstrap sample $S^d$ from the original data $S$ by sampling units with replacement, stratified by treatment group. Then, carry out steps (b)–(d) for each sample $S^d$:
(b)
Select and estimate the propensity score model for the distribution of $Z$ given $X$, with regression parameters $\alpha$. The estimated probability of being assigned treatment $Z = z$ is denoted as $\hat{P}_z(X) = \Pr(Z = z \mid X, \hat{\alpha}^d)$, where $\hat{\alpha}^d$ is the ML estimate of $\alpha$. Define the logit $\hat{\ell}_z = \log[\hat{P}_z(X) / (1 - \hat{P}_z(X))]$.
In practice, it is often unknown how treatments are assigned to subjects. Several approaches can be used to select the covariates to be included in the propensity score model. One approach is to include all the potential confounders from a large collection of pretreatment variables. Variables might also be selected based on how well they predict the treatment assignment. Lastly, variables can be selected based on how well they predict the outcome, whether or not they are related to the treatment. For a binary treatment, a logistic regression is often used to model the treatment assignment.
(c)
For each $z = 0, 1$, use the cases assigned to treatment group $z$ to estimate a normal linear regression of $Y^z$ on $X$, with mean

$$E(Y^z \mid X, Z = z, \theta_z, \beta_z) = s(\hat{\ell}_z \mid \theta_z) + g_z(X; \beta_z), \qquad (1)$$

where $s(\hat{\ell}_z \mid \theta_z)$ denotes a penalized spline with fixed knots [17,18,19], indexed by parameters $\theta_z$, and $g_z(\cdot)$ represents a parametric function of covariates predictive of the outcome, indexed by parameters $\beta_z$. The spline model can be formulated and estimated as a linear mixed model [19].
(d)
Impute the missing potential outcomes $Y^z$ for subjects in treatment group $1 - z$ in the original dataset $S$ with draws from the predictive distribution of $Y^z$ given $X$ from the regression in (c), with the ML estimates $\hat{\theta}_z^d, \hat{\beta}_z^d$ substituted for the parameters $\theta_z, \beta_z$, respectively. Repeat the above procedures to produce $D$ complete datasets.
(e)
Let $\hat{\Delta}_d$ and $W_d$ denote the difference in treatment means and the associated pooled variance estimate, based on the observed and imputed values of $Y$ in each treatment group. The MI estimate of $\Delta$ is then $\bar{\Delta}_D = \frac{1}{D}\sum_{d=1}^{D} \hat{\Delta}_d$, and the MI estimate of the variance of $\bar{\Delta}_D$ is

$$T_D = \bar{W}_D + (1 + 1/D) B_D,$$

where $\bar{W}_D = \sum_{d=1}^{D} W_d / D$ and $B_D = \sum_{d=1}^{D} (\hat{\Delta}_d - \bar{\Delta}_D)^2 / (D - 1)$. Inference is based on a $t$ distribution, $(\Delta - \bar{\Delta}_D) T_D^{-1/2} \sim t_v$, with degrees of freedom $v = (D - 1)\left(1 + \bar{W}_D / \{(1 + 1/D) B_D\}\right)^2$.
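To make steps (a)–(e) concrete, the following R code outlines the multiple-imputation version of PENCOMP for a toy dataset. It is a minimal sketch, not the authors' implementation (their R code is linked in the Data Availability Statement): the data frame `dat` with variables `y`, `z`, `x1`, and `x2` is hypothetical, mgcv's default penalized spline `s()` stands in for the fixed-knot mixed-model spline of the paper, and the complete-data variance in step (e) is a simple pooled variance of the mean difference.

```r
library(mgcv)

pencomp_mi <- function(dat, D = 100) {
  delta <- var_d <- numeric(D)
  for (d in seq_len(D)) {
    ## (a) bootstrap sample, stratified by treatment group
    idx  <- unlist(lapply(split(seq_len(nrow(dat)), dat$z),
                          function(i) sample(i, replace = TRUE)))
    boot <- dat[idx, ]
    ## (b) propensity score model; logit of the propensity for treatment z0
    ps_fit <- glm(z ~ x1 + x2, family = binomial, data = boot)
    lp <- function(newdat, z0) {
      p1 <- predict(ps_fit, newdat, type = "response")
      qlogis(if (z0 == 1) p1 else 1 - p1)
    }
    imp <- dat
    for (z0 in 0:1) {
      ## (c) outcome model in group z0: spline of logit propensity + covariates
      sub <- boot[boot$z == z0, ]
      sub$lps <- lp(sub, z0)
      out_fit <- gam(y ~ s(lps) + x1 + x2, data = sub)
      ## (d) impute the missing potential outcome for the other group,
      ## drawing from the estimated predictive distribution
      miss <- dat$z != z0
      nd <- dat[miss, ]
      nd$lps <- lp(nd, z0)
      yz <- ifelse(miss, NA, dat$y)          # observed outcomes kept as-is
      yz[miss] <- rnorm(sum(miss), predict(out_fit, nd), sqrt(out_fit$sig2))
      imp[[paste0("y", z0)]] <- yz
    }
    ## (e) complete-data estimate and a simple variance of the mean difference
    diff_i   <- imp$y1 - imp$y0
    delta[d] <- mean(diff_i)
    var_d[d] <- var(diff_i) / nrow(imp)
  }
  ## Rubin's combining rules
  W <- mean(var_d); B <- var(delta); Tvar <- W + (1 + 1 / D) * B
  list(est = mean(delta), se = sqrt(Tvar),
       df = (D - 1) * (1 + W / ((1 + 1 / D) * B))^2)
}
```

Calling `pencomp_mi(dat, D = 100)` returns the MI point estimate, its standard error, and the degrees of freedom from Rubin's rules.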

2.3. PENCOMP and Bagging

As an alternative to Rubin's multiple imputation combining rules, we can draw inferences about the ATE using the bagging estimator, a form of model averaging that accounts for model uncertainty. Let $S = (S_1, S_2, \ldots, S_N)$ denote the original dataset consisting of $N$ subjects, and let $S^d = (S_1^d, S_2^d, \ldots, S_N^d)$ denote a nonparametric bootstrap sample drawn with replacement. The procedure is similar to that described above, except that the missing potential outcomes are imputed in each bootstrap sample $S^d$ rather than in the original data $S$, and inference in step (e) is based on bootstrap smoothing.
(a)
For $d = 1, \ldots, D$, generate a bootstrap sample $S^d$. Repeat steps (b)–(d) for each bootstrap sample $S^d$ to produce $D$ complete datasets.
(b)
Select and estimate the propensity score model as described in Section 2.2 (b).
(c)
Estimate the outcome model for $Y^z$ given $X$ and a penalized spline on the logit of the propensity for treatment $z$, using the cases assigned to treatment $z$.
(d)
Impute the missing potential outcomes $Y^z$ for subjects in treatment group $1 - z$ in the bootstrap sample $S^d$, with draws from the predictive distribution estimated in (c).
(e)
Let $\tilde{\Delta}$ and $\tilde{sd}_D$ denote the estimate and standard error of the causal effect, respectively. The causal estimate is $\tilde{\Delta} = \sum_{d=1}^{D} \hat{\Delta}_{S^d} / D$, where $\hat{\Delta}_{S^d}$ is the mean difference in the potential outcomes obtained from bootstrap sample $S^d$. The standard error $\tilde{sd}_D$ is calculated as

$$\tilde{sd}_D = \Big( \sum_{j=1}^{N} \widehat{cov}_j^{\,2} \Big)^{1/2}, \quad \widehat{cov}_j = \frac{1}{D} \sum_{d=1}^{D} (Q_{dj} - Q_{\cdot j})(\hat{\Delta}_{S^d} - \tilde{\Delta}), \qquad (2)$$

where $Q_{\cdot j} = \sum_{d=1}^{D} Q_{dj} / D$ and $Q_{dj} = \#\{S^d = S_j\}$ is the number of times that observation $j$ of the original data $S$ was selected into the $d$th bootstrap sample $S^d$ [20]. $\widehat{cov}_j$ estimates the bootstrap covariance between $Q_{dj}$ and $\hat{\Delta}_{S^d}$. To estimate the standard error of the smoothed bootstrap causal estimate, a brute-force approach would use a second level of bootstrapping, requiring an enormous number of computations; the formula above provides an approximation to that standard error.
Inference is based on the bootstrap smoothed estimator $\tilde{\Delta}$ and the confidence interval $\tilde{\Delta} \pm 1.96\, \tilde{sd}_D$, instead of Rubin's multiple imputation combining rules. A short sketch of this computation follows.
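Efron's (2014) standard error in Equation (2) takes only a few lines of R. This is a generic sketch under assumed inputs: `est` holds the $D$ bootstrap estimates $\hat{\Delta}_{S^d}$, and `counts` is a $D \times N$ matrix of the resampling counts $Q_{dj}$.

```r
## Smoothed bootstrap (bagged) standard error, Equation (2).
## est:    vector of D bootstrap estimates
## counts: D x N matrix; counts[d, j] = Q_dj, times unit j appears in sample d
smoothed_se <- function(est, counts) {
  D <- length(est)
  ## cov_j = sum_d (Q_dj - Q_.j)(Delta_d - Delta_bar) / D, for each unit j
  cov_j <- crossprod(sweep(counts, 2, colMeans(counts)), est - mean(est)) / D
  sqrt(sum(cov_j^2))
}

## The counts can be recorded while bootstrapping, e.g.:
## idx <- sample(N, replace = TRUE); counts[d, ] <- tabulate(idx, nbins = N)
```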

2.4. Inverse Probability Treatment Weighted Estimator IPTW

Unlike PENCOMP, IPTW does not impute potential outcomes but uses only the observed outcomes. IPTW controls for confounding by weighting subjects based on their probabilities of receiving their observed treatments. Let $\hat{P}_1(X_i, \hat{\alpha})$ denote the estimated probability of being assigned to treatment $Z_i = 1$ given the observed covariates $X_i$, obtained from the propensity score model for the distribution of $Z_i$ given $X_i$, with regression parameters $\hat{\alpha}$. The treated subjects are assigned weights $1 / \hat{P}_1(X_i, \hat{\alpha})$, and the control subjects are assigned weights $1 / \{1 - \hat{P}_1(X_i, \hat{\alpha})\}$. Thus, subjects who are under-represented in a given treatment arm receive higher weights. The weights in effect create a pseudo-population in which the treatment groups are balanced with respect to the covariate distributions. Under the assumptions stated in Section 2.1, the IPTW estimator is consistent if the propensity score model is correct.
The IPTW estimator is defined as

$$\hat{\Delta}_{IPTW} = \frac{1}{N} \sum_{i=1}^{N} \frac{Z_i Y_i}{\hat{P}_1(X_i, \hat{\alpha})} - \frac{1}{N} \sum_{i=1}^{N} \frac{(1 - Z_i) Y_i}{1 - \hat{P}_1(X_i, \hat{\alpha})}.$$

Let $\hat{\Delta}_{IPTW}$ denote the causal estimate on the original data $S$. Here, we consider bootstrap methods for computing its standard errors and confidence intervals. The procedures are as follows (a code sketch follows step (d)).
(a)
For $d = 1, \ldots, D$, generate a bootstrap sample $S^d$. Then, repeat steps (b)–(d) for each sample $S^d$:
(b)
Select and estimate the propensity score model as described in Section 2.2 (b).
(c)
Compute $\hat{\Delta}_{IPTW}^d$ for each bootstrap sample $S^d$.
(d)
Estimate the standard error $\widehat{sd}_{IPTW,D}$ for $\hat{\Delta}_{IPTW}$ based on the $D$ bootstrap samples as

$$\widehat{sd}_{IPTW,D} = \sqrt{ \sum_{d=1}^{D} (\hat{\Delta}_{IPTW}^d - \tilde{\Delta}_{IPTW})^2 / (D - 1) },$$

where $\tilde{\Delta}_{IPTW} = \sum_{d=1}^{D} \hat{\Delta}_{IPTW}^d / D$. The standard 95% confidence interval is $\hat{\Delta}_{IPTW} \pm 1.96\, \widehat{sd}_{IPTW,D}$. Alternatively, the bagging estimate of the causal effect is $\tilde{\Delta}_{IPTW}$, and the 95% smoothed confidence interval is $\tilde{\Delta}_{IPTW} \pm 1.96\, \tilde{sd}_{IPTW,D}$, where the smoothed standard error $\tilde{sd}_{IPTW,D}$ is computed from Equation (2) [20].
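A minimal R sketch of the IPTW estimator and the bootstrap procedure above, again assuming a hypothetical data frame `dat` with outcome `y`, treatment `z`, and covariates `x1` and `x2`, and a fixed propensity model in place of a selection step:

```r
iptw_est <- function(dat) {
  ## (b) propensity score model; then the weighted mean difference
  p1 <- fitted(glm(z ~ x1 + x2, family = binomial, data = dat))
  mean(dat$z * dat$y / p1) - mean((1 - dat$z) * dat$y / (1 - p1))
}

iptw_boot <- function(dat, D = 1000) {
  ## (a)/(c) re-estimate on each bootstrap sample
  boot_est <- replicate(D, iptw_est(dat[sample(nrow(dat), replace = TRUE), ]))
  est <- iptw_est(dat)                  # estimate on the original data
  se  <- sd(boot_est)                   # (d) standard bootstrap SE
  c(est = est, lower = est - 1.96 * se, upper = est + 1.96 * se,
    bagged = mean(boot_est))            # bagging point estimate
}
```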

2.5. Augmented Inverse Probability Treatment Weighted Estimator (AIPTW)

An alternative to IPTW is the augmented IPTW (AIPTW) estimator. AIPTW uses the estimated propensity score as a weight, like IPTW, but also incorporates predictions from a regression model for the outcome. Incorporating covariates predictive of the outcome in the outcome model can improve efficiency and reduce variability, especially when the weights are variable. Under the assumptions stated in Section 2.1, the AIPTW estimator consistently estimates causal effects if either the propensity score model or the outcome model is correctly specified.
Each subject $i$ is weighted by the balancing weight $W_i = 1 / \{ Z_i \hat{P}_1(X_i, \hat{\alpha}) + (1 - Z_i)(1 - \hat{P}_1(X_i, \hat{\alpha})) \}$. The AIPTW estimate is calculated on the original dataset $S$ [21]:

$$\hat{\Delta}_{AIPTW} = \frac{1}{n} \sum_{i=1}^{n} \{ m_1(X_i, \beta_1) - m_0(X_i, \beta_0) \} + \frac{\sum_{i=1}^{n} W_i Z_i \{ Y_i - m_1(X_i, \beta_1) \}}{\sum_{i=1}^{n} W_i Z_i} - \frac{\sum_{i=1}^{n} W_i (1 - Z_i) \{ Y_i - m_0(X_i, \beta_0) \}}{\sum_{i=1}^{n} W_i (1 - Z_i)},$$

where $m_1(X_i, \beta_1) = E(Y_i \mid X_i, Z_i = 1, \beta_1)$ and $m_0(X_i, \beta_0) = E(Y_i \mid X_i, Z_i = 0, \beta_0)$. Procedures similar to those for IPTW can be used to obtain point estimates and standard 95% confidence intervals. Alternatively, the bagging estimate of the causal effect is $\tilde{\Delta}_{AIPTW}$, and the 95% smoothed confidence interval $\tilde{\Delta}_{AIPTW} \pm 1.96\, \tilde{sd}_{AIPTW,D}$ can be obtained using the smoothed standard error $\tilde{sd}_{AIPTW,D}$ from Equation (2) [20]. A code sketch follows.
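The AIPTW estimate above can be sketched analogously in R; here the outcome regressions $m_1$ and $m_0$ are taken, for illustration, to be linear models fit within each treatment group (same hypothetical `dat` as before):

```r
aiptw_est <- function(dat) {
  p1 <- fitted(glm(z ~ x1 + x2, family = binomial, data = dat))
  w  <- 1 / (dat$z * p1 + (1 - dat$z) * (1 - p1))   # balancing weights W_i
  ## outcome models m1 and m0, fit within each treatment group,
  ## then predicted for everyone
  m1 <- predict(lm(y ~ x1 + x2, data = dat[dat$z == 1, ]), dat)
  m0 <- predict(lm(y ~ x1 + x2, data = dat[dat$z == 0, ]), dat)
  mean(m1 - m0) +
    sum(w * dat$z * (dat$y - m1)) / sum(w * dat$z) -
    sum(w * (1 - dat$z) * (dat$y - m0)) / sum(w * (1 - dat$z))
}
```

The same bootstrap wrapper used for IPTW applies directly, replacing `iptw_est` with `aiptw_est`.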

3. Model Selection

We consider scenarios in which some pre-treatment variables are predictors of the outcome, some are predictors of the treatment, some are predictors of both the treatment and the outcome, and some are spurious, in the sense that they affect neither the treatment nor the outcome. We consider two strategies for building the propensity score model: (1) without seeing the outcome [6], and (2) taking into account the relationships between the covariates and the outcome.
For strategy 1, one simple approach is to use the stepwise variable selection algorithm with the Bayesian Information Criterion (BIC) to select the variables that are predictive of the treatment, regardless of how well they predict the outcome. Separately, we use the same stepwise algorithm to select the outcome model for PENCOMP and AIPTW. The algorithm, abbreviated as SW, does not use outcome data and, hence, satisfies Rubin’s recommendation of separating analysis from design.
In strategy 2, we use the outcome adaptive lasso approach proposed by Shortreed and Ertefaie (2017) [9]. By penalizing each covariate according to the strength of the relationship between the covariate and the outcome, the outcome adaptive lasso tends to select covariates that are predictive of the outcome and excludes covariates that are associated only with the treatment. The outcome adaptive lasso estimates for the propensity score model are:
$$\hat{\alpha}_{OAL} = \arg\min_{\alpha} \sum_{i=1}^{n} \left[ -Z_i (X_i^T \alpha) + \log(1 + e^{X_i^T \alpha}) \right] + \lambda_n \sum_{j=1}^{p} \hat{w}_{\alpha j} |\alpha_j|,$$

where $\hat{w}_{\alpha j} = 1 / |\hat{\beta}_j|^{\gamma}$ with $\gamma > 1$, and the tuning parameter $\lambda_n$ is chosen to minimize the mean weighted standardized difference between the treated and control groups. Here, $\hat{\beta}_j$ is the coefficient estimate for covariate $X_j$ from an ordinary least squares or ridge regression of the outcome $Y$ on the covariates and the treatment. Similarly, the outcome model can be selected via the adaptive lasso, whose estimates are given as follows [22]:

$$\hat{\beta}_{AL} = \arg\min_{\beta} \Big\| y - \sum_{j=1}^{p} X_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j|,$$

where $\hat{w}_j = 1 / |\hat{\beta}_j|^{\gamma}$ and $\gamma > 0$.
This method is subject to Rubin's criticism, because it uses the outcome data at the design stage; excluding the treatment variable from the outcome regressions used for variable selection might reduce the potential for biasing the results. A rough sketch of the outcome-adaptive lasso follows.
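As an illustration of strategy 2, the outcome-adaptive lasso idea can be approximated with glmnet's `penalty.factor` argument. This sketch is not the exact Shortreed–Ertefaie procedure: for simplicity, $\lambda_n$ is chosen here by cross-validation rather than by minimizing the weighted standardized-difference balance criterion, and `X`, `y`, and `z` are a hypothetical covariate matrix, outcome vector, and treatment vector.

```r
library(glmnet)

oal_ps <- function(X, y, z, gamma = 2) {
  ## stage 1: coefficients beta_j from regressing y on the covariates
  ## and the treatment (ordinary least squares)
  beta <- coef(lm(y ~ z + X))[-(1:2)]     # drop intercept and treatment
  w <- 1 / abs(beta)^gamma                # adaptive penalty weights w_j
  ## stage 2: logistic lasso for the propensity model with weighted penalties;
  ## covariates weakly related to the outcome get large penalties and drop out
  fit <- cv.glmnet(X, z, family = "binomial", penalty.factor = w)
  predict(fit, X, s = "lambda.min", type = "response")
}
```

Covariates with near-zero outcome coefficients receive very large (possibly infinite) penalties, which glmnet treats as exclusion, so the selected propensity model concentrates on outcome-related variables.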

4. Simulation

We simulate each dataset as described in Zigler and Dominici (2014) and Shortreed and Ertefaie (2017) [9,23]. Each simulated dataset contains $N$ subjects and $p$ covariates $X$. The treatment $Z_1$ is Bernoulli distributed with $\text{logit}\, P(Z_1 = 1 \mid X) = \sum_{j=1}^{p} \alpha_j X_j$. The outcome of interest $Y$ is normally distributed with mean $\eta Z_1 + \sum_{j=1}^{p} \beta_j X_j$ and variance 1. The treatment effect $\eta$ is set to 2, without loss of generality. We set all the coefficients to 0, except those of the first 6 covariates $X_1, \ldots, X_6$: $X_1$ and $X_2$ are true confounders; $X_3$ and $X_4$ are predictors of the outcome only; $X_5$ and $X_6$ are predictors of the treatment only. All the other $p - 6$ covariates are spurious. We vary the strength of the relationships among the covariates, the outcome, and the treatment. In the first scenario, $\beta$ and $\alpha$ are set as $\beta = (0.6, 0.6, 0.6, 0.6, 0, 0, 0, \ldots, 0)$ and $\alpha = (1, 1, 0, 0, 1, 1, 0, \ldots, 0)$. In the second scenario, the confounders $X_1$ and $X_2$ have a weaker relationship with the treatment: $\beta = (0.6, 0.6, 0.6, 0.6, 0, 0, 0, \ldots, 0)$ and $\alpha = (0.4, 0.4, 0, 0, 1, 1, 0, \ldots, 0)$. In the third scenario, the confounders $X_1$ and $X_2$ have a weaker relationship with the outcome: $\beta = (0.2, 0.2, 0.6, 0.6, 0, 0, 0, \ldots, 0)$ and $\alpha = (1, 1, 0, 0, 1, 1, 0, \ldots, 0)$. We also vary the sample size: $N = 200$ and $N = 1000$. A sketch of this data-generating mechanism is given below.
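The following R sketch generates data under the three scenarios, assuming independent standard-normal covariates as in Shortreed and Ertefaie (2017); the covariate distribution is an assumption of this illustration.

```r
gen_data <- function(N = 200, p = 20, eta = 2, scenario = 1) {
  beta  <- c(0.6, 0.6, 0.6, 0.6, rep(0, p - 4))   # outcome coefficients
  alpha <- c(1, 1, 0, 0, 1, 1, rep(0, p - 6))     # treatment coefficients
  if (scenario == 2) alpha[1:2] <- 0.4            # weaker confounder-treatment
  if (scenario == 3) beta[1:2]  <- 0.2            # weaker confounder-outcome
  X <- matrix(rnorm(N * p), N, p)                 # assumed N(0, 1) covariates
  z <- rbinom(N, 1, plogis(drop(X %*% alpha)))    # Bernoulli treatment
  y <- eta * z + drop(X %*% beta) + rnorm(N)      # normal outcome, variance 1
  data.frame(y = y, z = z, X)
}
```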
We consider four different specifications of the propensity score model: (1) true includes the true propensity score model used to generate the data; (2) trueConf includes only the true confounders; (3) outcomePred includes both the confounders and the predictors of the outcome; (4) allPotent includes all 20 variables. For these four specifications, the outcome models for PENCOMP and AIPTW are correctly specified. In addition, we consider the following variable selection techniques for the propensity score model and the outcome model.
(a)
SW: stepwise variable selection algorithm with the Bayesian Information Criterion (BIC), applied separately to the propensity score model and the outcome model.
(b)
OAL: outcome adaptive lasso [9] for the propensity score model, and adaptive lasso for the outcome model [22].
(c)
Step-ALT: outcome adaptive lasso for the propensity score model at the first stage and then adaptive lasso for the outcome model at the second stage using only the variables that are selected at the first stage.
(d)
Step-ALY: adaptive lasso for the outcome model at the first stage, and then a logistic regression model for the propensity score that includes all the variables selected at the first stage.
We evaluate the performance of the methods based on root mean squared error (RMSE), empirical non-coverage rate of the 95% confidence interval, empirical bias, and average length of 95% confidence intervals over 500 simulated datasets. For each dataset, the standard errors and confidence intervals are estimated using 1000 bootstrap samples. Within PENCOMP, we compare the multiple imputation approach and the bagging approach. Within AIPTW and IPTW, we compare the standard approach with the bagging approach.

5. Results

Table 1 shows the results on RMSEs for a sample size of 200. Comparing the four propensity score models that were fixed within each bootstrap sample (true, trueConf, outcomePred, and allPotent) shows that excluding spurious variables or variables associated only with the treatment reduced the RMSEs, and that including variables associated only with the outcome also reduced the RMSEs. Incorporating the outcome model, as in PENCOMP and AIPTW, attenuated the negative effect on RMSEs of including non-confounding variables. For example, using the standard approach in scenario 1, IPTW had RMSEs of 0.19, 0.22, 0.36, and 0.41 under outcomePred, trueConf, true, and allPotent, respectively. AIPTW had RMSEs of 0.16, 0.16, 0.22, and 0.28, respectively. Using Rubin's approach, PENCOMP had RMSEs of 0.16, 0.16, 0.21, and 0.21, respectively. Similar patterns were observed in scenarios 2 and 3.
Bagging reduced the RMSEs for IPTW and AIPTW, especially when irrelevant covariates were included in the propensity score model, as in true and allPotent. The standard approach and the bagging approach yielded similar RMSEs under outcomePred and trueConf but different RMSEs under true and allPotent. For example, in scenario 1 under allPotent, IPTW had an RMSE of 0.34 with the bagging approach, but an RMSE of 0.41 with the standard approach. AIPTW had an RMSE of 0.25 with the bagging approach, but an RMSE of 0.28 with the standard approach. PENCOMP had an RMSE of 0.22 with the bagging approach, but an RMSE of 0.21 with Rubin's combining rule. For PENCOMP, the bagging approach had slightly higher RMSEs than Rubin's multiple imputation combining rule when many irrelevant variables were included, as in allPotent. Similar patterns were observed in scenarios 2 and 3.
The results in the variable selection cases were similar to the results without variable selection. The outcome-adaptive selection procedures, such as OAL, resulted in smaller RMSEs than procedures, such as SW, that selected variables solely based on how well they predicted the treatment. Figure A1 in Appendix A presents the results on variable selection. For example, in scenario 1 with a sample size of 200, all the variable selection procedures selected the confounders X1 and X2 about 99% of the time. OAL, Step-ALT, and Step-ALY selected the non-confounders X3 and X4 about 99% of the time, while SW selected them about 40% of the time. OAL selected X5 and X6 about 30% of the time; Step-ALT and Step-ALY, about 8% of the time; and SW, about 99% of the time. SW, OAL, Step-ALT, and Step-ALY selected spurious variables about 40%, 34%, 8%, and 8% of the time, respectively.
An outcome-adaptive selection procedure can fail to select confounders that are weakly associated with the outcome, as seen in scenario 3 in Figure A1 in Appendix A. Similarly, the stepwise variable selection algorithm can fail to select confounders that are weakly associated with the treatment, as seen in scenario 2. Excluding weak confounders increased the bias, as seen in Table A1 in Appendix A. However, the reduction in variance from excluding irrelevant variables with an outcome-adaptive selection procedure could offset the bias, so the RMSEs could still be smaller, as seen in Table 1. For example, in scenario 3, the empirical bias for IPTW was 0.146 (7%) under Step-ALT and 0.033 (2%) under SW, while the RMSE for IPTW was 0.24 under Step-ALT and 0.33 under SW.
If the chosen selection procedure selects many irrelevant variables, especially the ones that are strong predictors of the treatment only, the bagging approach could reduce the RMSEs for IPTW and AIPTW. For example, in scenario 3, IPTW had an RMSE of 0.33 under SW when the standard approach was used, and an RMSE of 0.28 when the bagging approach was used. AIPTW had an RMSE of 0.25 under SW when the standard approach was used, and an RMSE of 0.23 when the bagging approach was used. PENCOMP had an RMSE of 0.21 under SW when the Rubin’s combining rule was used, and an RMSE of 0.22 when the bagging approach was used. In addition, performing variable selection within each bootstrap sample could increase the chance that weak confounders were selected in some bootstrap samples. Thus, in scenario 3, the bagging IPTW and AIPTW estimators had smaller empirical biases. For example, the empirical bias for IPTW under Step-ALT was 0.146 (7%) when the standard approach was used, but 0.083 (4%) when the bagging approach was used.
Table 2 shows the results on coverage for a sample size of 200. The bagging approach tended to have coverage rates closer to the nominal coverage than the multiple imputation approach (PENCOMP) and the standard approach (AIPTW, IPTW) for small samples. The smoothed standard errors (SEs) were closer to the empirical SEs, so the coverage rates were closer to the nominal 95% coverage, and the confidence interval widths were smaller. When there were many spurious variables in the propensity score model, and/or when different models could be selected across bootstrap samples, the distribution of the bootstrap estimates could become “jumpy and erratic”. Consequently, the bagging approach provided tighter confidence intervals.
As the sample size increased to 1000, the gain from using bootstrap smoothing attenuated, as seen in Table 3 and Table 4. Using the standard approach for calculating the confidence intervals in the case of IPTW and AIPTW, or using multiple imputation combining rules in the case of PENCOMP, performed better than using the bagging approach. At large sample sizes, each covariate had less impact on the estimates and the selected models were similar across the bootstrap samples, so there was little variability in the bootstrap estimates. In such settings, bagging led to greater confidence interval widths and overcoverage. In summary, bagging was advantageous when the sample size was small and the data were noisy.
Overall, both PENCOMP and AIPTW had smaller RMSEs than IPTW, and PENCOMP had smaller RMSEs than AIPTW when the propensity score model included many irrelevant covariates. Even without model selection, when the sample size was small and the propensity score model included many irrelevant variables, especially variables that were strong predictors of the treatment only, the bagging approach could stabilize the IPTW and AIPTW estimators.

6. Application

The Multicenter AIDS Cohort Study (MACS) was started in 1984 [1]. A total of 4954 gay and bisexual men were enrolled in the study and followed up semi-annually. At each visit, data from physical examinations, questionnaires about medical and behavioral history, and blood test results were collected. The primary outcome of interest was the CD4 count, a continuous measure of how well the immune system functions. We used this dataset to analyze the short-term effects of using antiretroviral treatment. We restricted our analyses to visit 12. Treatment was coded as 1 if the patient reported taking any antiretroviral treatment (ART) or enrolling in clinical trials of such drugs. We estimated the short-term (6-month) effects of using any antiretroviral treatment for HIV+ subjects. We excluded subjects with missing values on any of the covariates included in the models. We log-transformed the blood counts in this analysis.
We treated each visit as a single time-point treatment. Let $t = 1$ denote the time when the treatment was administered, and $t = 2$ the time 6 months later when the outcome was measured. In addition, let $t = -1, -2, -3$ denote 1, 2, and 3 visits before the current visit $t = 1$, respectively. Let $X(t = 1, -1, -2, -3)$ denote the blood count histories prior to treatment assignment, $Z$ the binary treatment indicator, and $Y(t = 2)$ the CD4 count 6 months after the treatment. For the outcome model, we considered the blood counts (CD4, CD8, white blood cell (WBC), red blood cell (RBC), and platelet counts) and treatment histories from the last 4 visits. For the propensity score model, we considered the same covariates as those in the outcome model, as well as demographic variables (college education, age, and race). The treatment $Z$ was modeled using a logistic regression. We estimated the mean CD4 count difference between the treated and the control, denoted as $\Delta$. For PENCOMP, we replaced simulated/imputed transformed CD4 values that were below 0 with 0 (i.e., below the detection level). A B-spline basis with 15 equally spaced knots was used.
Figure 1 shows that the propensity score distributions were skewed: the treated had propensities of treatment close to 1, and the controls close to 0. Here, we used the variable selection methods from the simulation studies to select the relevant variables for the propensity score model. To quantify the amount of overlap, we measured the proportion of subjects in the control group whose propensity scores fell between the 5th and 95th quantiles of the propensity score distribution of the treated group, denoted as $\pi_{z=0}^{0.95} = F_{z=0}(F_{z=1}^{-1}(0.95)) - F_{z=0}(F_{z=1}^{-1}(0.05))$, where $F$ is the cumulative distribution function. Similarly, $\pi_{z=1}^{0.95}$ denotes the proportion of treated subjects whose propensity scores fell between the 5th and 95th quantiles of the propensity score distribution of the control group. Including only the covariates that were selected more than 20% of the time by Step_ALT among 1000 bootstrap samples improved the overlap, as shown in Figure 1. A small sketch of this overlap measure is given below.
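The overlap measure $\pi$ is straightforward to compute; a minimal sketch, assuming `ps1` and `ps0` hold the estimated propensity scores of the treated and control subjects:

```r
## pi_{z=0}: proportion of controls whose propensity scores fall between the
## 5th and 95th quantiles of the treated group's propensity distribution
overlap_prop <- function(ps0, ps1) {
  F0 <- ecdf(ps0)   # empirical CDF of the reference group
  unname(F0(quantile(ps1, 0.95)) - F0(quantile(ps1, 0.05)))
}

## pi_{z=1} is obtained by swapping the roles: overlap_prop(ps1, ps0)
```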
Table A5 in Appendix B shows the proportion of times each variable was selected across 1000 bootstrap samples. Because subjects who were treated during recent visits were more likely to be treated again, prior treatments were highly predictive of future treatment. However, prior CD4 counts were more predictive of future CD4 count, because those earlier antiretroviral treatments were not as effective. Thus, when we accounted for the outcome-covariate relationship in selecting the propensity score model, prior treatment variables were selected less than 10% of the time, compared to close to 100% of the time under SW and 58% of the time under OAL.
We estimated the short-term effect of antiretroviral treatment on CD4 count using PENCOMP, AIPTW, and IPTW, as shown in Table 5. The standard errors were obtained using 1000 bootstrap samples. For PENCOMP, 1000 complete datasets were created. Overall, the IPTW estimates had the widest confidence intervals. Incorporating the outcome models, as in AIPTW and PENCOMP, decreased the standard errors and interval widths substantially. PENCOMP tended to have slightly smaller interval widths than AIPTW. The IPTW bootstrap estimates were much more variable than the PENCOMP or AIPTW bootstrap estimates. As seen in the simulation studies, the bagging approach helped stabilize the IPTW and AIPTW estimators when the weights were variable. For PENCOMP, the multiple imputation approach and the bagging approach yielded similar results. Excluding irrelevant covariates from the propensity score model, as in Step-ALT and Step-ALY, substantially improved the performance of IPTW in terms of standard errors and confidence interval widths. Incorporating the outcome models in AIPTW and PENCOMP attenuated some of the effect of including many irrelevant covariates.

7. Discussion

We propose a new version of PENCOMP via bagging that can improve confidence interval width and coverage, compared to PENCOMP with Rubin's multiple imputation combining rules, when the sample size is small and the data are noisy. However, when the sample size is large and there is little variability in the bootstrap estimates, the bagging approach tends to overcover. The bagging approach and the multiple imputation approach in PENCOMP have similar RMSEs, because both incorporate model selection and smooth over the estimates. Similarly, bagging improves the performance of IPTW and AIPTW in terms of RMSE, coverage, and confidence interval width, especially when the sample size is small and the data are noisy. In practice, the propensity score model is often selected first, and inferences are then based on the selected model; this simple approach ignores model uncertainty. Compared to the standard approach for AIPTW and IPTW, the bagging approach can perform better because it incorporates model selection effects.
Our simulation studies show that excluding strong predictors of the treatment that are not predictors of the outcome, as well as spurious variables, helps improve the performance of the propensity score methods, especially the weighted estimators. A propensity score model with many irrelevant non-confounding variables can produce extreme propensity scores, and hence extreme weights; PENCOMP is less affected than the weighted estimators by the inclusion of many non-confounding variables in the propensity score model.
An outcome-adaptive approach can help exclude strong predictors of the treatment only. However, one shortcoming of the outcome-adaptive approach is that it can miss many weak confounders. While the outcome-adaptive approach can decrease the standard errors of the estimates by excluding spurious variables and strong predictors of the treatment only, it can potentially increase bias by excluding variables that are weakly associated with the outcome, especially in small samples. Variable selection procedures that are blind to the outcome, such as the stepwise variable selection algorithm, have an analogous shortcoming: they can fail to select confounders that are weakly associated with the treatment.
Whether an outcome-adaptive approach is beneficial depends on the specific study. When there are many weak confounders in the data, the reduction in variance from using an outcome-adaptive approach might not offset the increase in bias; for example, when studying a new disease, researchers might decide to include all pretreatment variables. When there are many potential confounding variables in the propensity score model, including some that are strongly associated with the treatment only, and the weights are highly variable, PENCOMP provides a valuable approach for estimating causal effects. When variable selection is involved, smoothing over bootstrap samples can reduce the chance of excluding important confounders, an exclusion that results in bias.
In high-dimensional settings, including all the observed variables in the propensity score model can lead to highly unstable or even infeasible estimation. One criticism of focusing on confounders, rather than just on predictors of treatment assignment (i.e., balancing covariates between the treatment arms), is that incorporating the outcome in the estimation procedure, whether via the prognostic score [24] or as we have done here, violates the principle that causal inference methods using observational data should mimic, as closely as possible, randomized trial designs, where outcomes are not considered until the final estimation step. Following such a rule avoids both overt and inadvertent attempts to bias model building toward preferred outcomes (“the garden of forking paths”, per Gelman and Loken, 2013 [25]). One approach to reducing this potential for bias is to select variables into the propensity model based on a regression on the outcome that excludes the variables indicating the treatments. Moreover, with the advent of advanced “automatic” penalized regression methods, such as the adaptive lasso, the risk of such “model shopping” may be sufficiently reduced, though not eliminated; analysts who follow the approach outlined here should still endeavor to pre-specify the procedures before the analysis begins.

Author Contributions

Conceptualization, T.Z., R.J.A.L. and M.R.E.; methodology, T.Z., R.J.A.L. and M.R.E.; software, T.Z.; validation, T.Z.; formal analysis, T.Z.; investigation, T.Z.; resources, T.Z.; data curation, T.Z.; writing—original draft preparation, T.Z.; writing—review and editing, R.J.A.L. and M.R.E.; visualization, T.Z.; supervision, R.J.A.L. and M.R.E.; project administration, T.Z.; funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable as this was secondary use of publicly available data.

Informed Consent Statement

Not applicable as this was secondary use of publicly available data.

Data Availability Statement

The application datasets are publicly available. Our R code and the datasets used are available for download at https://github.com/TingtingKayla (accessed on 9 June 2021).

Acknowledgments

The authors thank the Multicenter AIDS Cohort Study (MACS) for providing us the datasets for analyses.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Additional Simulation Results

Figure A1. Proportions of each variable selected for the propensity model across 500 simulated datasets and 1000 bootstrap samples per simulated dataset, for sample sizes of 200 and 1000. X1 and X2 are the true confounders; X3 and X4 are predictors of the outcome but not of the treatment; X5 and X6 are predictors of the treatment but not of the outcome; all the other 14 covariates are spurious (results averaged across the spurious variables).

Figure A2. Proportions of each variable selected for the prediction model across 500 simulated datasets and 1000 bootstrap samples per simulated dataset, for sample sizes of 200 and 1000. X1 and X2 are the true confounders; X3 and X4 are predictors of the outcome but not of the treatment; X5 and X6 are predictors of the treatment but not of the outcome; all the other 14 covariates are spurious (results averaged across the spurious variables).
Table A1. 1000 × empirical bias with sample size of 200. The treatment effect η = 2. S1, S2, and S3 denote scenarios 1, 2, and 3, respectively.

| Approach | Model Select | PENCOMP S1 | PENCOMP S2 | PENCOMP S3 | AIPTW S1 | AIPTW S2 | AIPTW S3 | IPTW S1 | IPTW S2 | IPTW S3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard/Rubin | allPotent | 5 | −2 | 5 | 2 | 4 | 2 | 60 | 11 | 26 |
| Bagging | allPotent | 2 | −6 | 2 | 2 | 2 | 2 | 63 | 5 | 26 |
| Standard/Rubin | true | 6 | 4 | 6 | 6 | 3 | 6 | 61 | 11 | 27 |
| Bagging | true | 3 | 1 | 3 | 5 | 3 | 5 | 82 | 18 | 33 |
| Standard/Rubin | outcomePred | 10 | 7 | 10 | 8 | 7 | 8 | 33 | 9 | 19 |
| Bagging | outcomePred | 7 | 4 | 7 | 8 | 7 | 8 | 39 | 9 | 21 |
| Standard/Rubin | trueConf | 8 | 7 | 8 | 8 | 7 | 8 | 32 | 6 | 18 |
| Bagging | trueConf | 5 | 4 | 5 | 8 | 7 | 8 | 39 | 6 | 20 |
| Standard/Rubin | SW | 5 | −2 | 6 | 4 | −4 | 7 | 66 | 39 | 33 |
| Bagging | SW | 2 | −5 | 3 | 1 | 0 | 5 | 68 | 37 | 27 |
| Standard/Rubin | OAL | 6 | 0 | 25 | 4 | −1 | 17 | 35 | 2 | 26 |
| Bagging | OAL | 5 | −1 | 22 | 2 | −1 | 23 | 47 | 5 | 33 |
| Standard/Rubin | Step-ALT | 2 | −4 | 65 | 7 | 6 | 132 | 3 | 38 | 146 |
| Bagging | Step-ALT | 3 | −3 | 65 | 3 | −3 | 66 | 40 | 2 | 83 |
| Standard/Rubin | Step-ALY | 2 | −4 | 70 | 8 | 7 | 138 | 3 | 39 | 160 |
| Bagging | Step-ALY | 2 | −4 | 70 | 2 | −3 | 70 | 36 | 1 | 90 |
Table A2. 100 × mean 95% confidence interval width with sample size of 200. The treatment effect η = 2. S1, S2, and S3 denote scenarios 1, 2, and 3, respectively.

| Approach | Model Select | PENCOMP S1 | PENCOMP S2 | PENCOMP S3 | AIPTW S1 | AIPTW S2 | AIPTW S3 | IPTW S1 | IPTW S2 | IPTW S3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard/Rubin | allPotent | 129 | 101 | 129 | 131 | 112 | 131 | 195 | 158 | 166 |
| Bagging | allPotent | 95 | 81 | 95 | 98 | 88 | 98 | 138 | 117 | 119 |
| Standard/Rubin | true | 102 | 85 | 102 | 83 | 75 | 83 | 120 | 103 | 109 |
| Bagging | true | 92 | 80 | 92 | 83 | 76 | 83 | 118 | 105 | 110 |
| Standard/Rubin | outcomePred | 70 | 58 | 70 | 66 | 58 | 66 | 76 | 60 | 70 |
| Bagging | outcomePred | 71 | 63 | 71 | 69 | 62 | 69 | 78 | 63 | 72 |
| Standard/Rubin | trueConf | 69 | 58 | 69 | 65 | 57 | 65 | 89 | 76 | 84 |
| Bagging | trueConf | 71 | 63 | 71 | 69 | 62 | 69 | 94 | 82 | 90 |
| Standard/Rubin | SW | 129 | 101 | 128 | 121 | 104 | 121 | 180 | 147 | 154 |
| Bagging | SW | 95 | 81 | 94 | 92 | 83 | 92 | 126 | 107 | 111 |
| Standard/Rubin | OAL | 90 | 76 | 90 | 84 | 75 | 87 | 110 | 88 | 99 |
| Bagging | OAL | 78 | 71 | 79 | 75 | 70 | 77 | 89 | 75 | 82 |
| Standard/Rubin | Step-ALT | 77 | 66 | 81 | 73 | 65 | 78 | 86 | 69 | 82 |
| Bagging | Step-ALT | 73 | 65 | 75 | 71 | 64 | 74 | 80 | 66 | 75 |
| Standard/Rubin | Step-ALY | 77 | 65 | 80 | 73 | 65 | 78 | 87 | 68 | 82 |
| Bagging | Step-ALY | 72 | 65 | 75 | 71 | 64 | 74 | 80 | 66 | 75 |
Table A3. 1000 × empirical bias with sample size of 1000. The treatment effect η = 2. S1, S2, and S3 denote scenarios 1, 2, and 3, respectively.

| Approach | Model Select | PENCOMP S1 | PENCOMP S2 | PENCOMP S3 | AIPTW S1 | AIPTW S2 | AIPTW S3 | IPTW S1 | IPTW S2 | IPTW S3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard/Rubin | allPotent | 4 | 1 | 4 | 4 | 1 | 4 | 20 | 4 | 11 |
| Bagging | allPotent | 5 | 1 | 5 | 4 | 1 | 4 | 24 | 4 | 13 |
| Standard/Rubin | true | 4 | 2 | 4 | 5 | 2 | 5 | 25 | 5 | 14 |
| Bagging | true | 5 | 2 | 5 | 5 | 2 | 5 | 33 | 7 | 17 |
| Standard/Rubin | outcomePred | 0 | −0 | 0 | 2 | −0 | 2 | 14 | 1 | 7 |
| Bagging | outcomePred | 1 | 0 | 1 | 2 | −0 | 2 | 16 | 1 | 7 |
| Standard/Rubin | trueConf | 0 | −0 | 0 | 2 | −0 | 2 | 16 | 0 | 8 |
| Bagging | trueConf | 1 | 0 | 1 | 2 | 0 | 2 | 17 | 1 | 9 |
| Standard/Rubin | SW | 4 | 1 | – | 3 | 1 | 3 | 17 | 4 | 9 |
| Bagging | SW | 5 | 1 | 5 | 4 | 1 | 4 | 25 | 5 | 13 |
| Standard/Rubin | OAL | 2 | 0 | 3 | 4 | 1 | 5 | 17 | 2 | 9 |
| Bagging | OAL | 3 | 1 | 4 | 3 | 1 | 5 | 21 | 3 | 12 |
| Standard/Rubin | Step-ALT | 1 | −0 | 20 | 2 | −0 | 22 | 14 | 1 | 39 |
| Bagging | Step-ALT | 0 | −1 | 20 | 2 | −0 | 21 | 16 | 1 | 36 |
| Standard/Rubin | Step-ALY | 1 | −0 | 21 | 2 | −0 | 23 | 14 | 1 | 40 |
| Bagging | Step-ALY | 0 | −1 | 21 | 2 | −0 | 22 | 16 | 1 | 36 |
Table A4. 100 × mean 95% confidence interval width with sample size of 1000. The treatment effect η = 2. S1, S2, and S3 denote scenarios 1, 2, and 3, respectively.

| Approach | Model Select | PENCOMP S1 | PENCOMP S2 | PENCOMP S3 | AIPTW S1 | AIPTW S2 | AIPTW S3 | IPTW S1 | IPTW S2 | IPTW S3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard/Rubin | allPotent | 40 | 34 | 40 | 45 | 36 | 45 | 64 | 45 | 54 |
| Bagging | allPotent | 55 | 48 | 55 | 61 | 50 | 61 | 85 | 61 | 72 |
| Standard/Rubin | true | 40 | 33 | 40 | 42 | 35 | 42 | 63 | 48 | 55 |
| Bagging | true | 55 | 47 | 55 | 58 | 48 | 58 | 87 | 67 | 77 |
| Standard/Rubin | outcomePred | 29 | 25 | 29 | 29 | 25 | 29 | 33 | 26 | 30 |
| Bagging | outcomePred | 42 | 38 | 42 | 41 | 36 | 41 | 46 | 36 | 43 |
| Standard/Rubin | trueConf | 29 | 25 | 29 | 29 | 25 | 29 | 40 | 33 | 38 |
| Bagging | trueConf | 42 | 38 | 42 | 41 | 36 | 41 | 56 | 47 | 53 |
| Standard/Rubin | SW | 40 | 34 | 40 | 45 | 36 | 45 | 65 | 47 | 55 |
| Bagging | SW | 55 | 48 | 55 | 60 | 49 | 60 | 86 | 62 | 73 |
| Standard/Rubin | OAL | 33 | 30 | 34 | 34 | 30 | 35 | 41 | 32 | 39 |
| Bagging | OAL | 46 | 42 | 47 | 45 | 41 | 47 | 53 | 42 | 51 |
| Standard/Rubin | Step-ALT | 29 | 26 | 35 | 29 | 26 | 35 | 33 | 26 | 38 |
| Bagging | Step-ALT | 42 | 38 | 48 | 41 | 36 | 47 | 46 | 36 | 50 |
| Standard/Rubin | Step-ALY | 29 | 26 | 35 | 30 | 26 | 35 | 33 | 26 | 38 |
| Bagging | Step-ALY | 42 | 38 | 48 | 41 | 36 | 47 | 46 | 36 | 50 |

Appendix B. Application

Table A5. Proportion of each variable selected for the prediction (outcome) and propensity models across 1000 bootstrap samples.

| Covariate | Outcome SW | Outcome AL | Propensity SW | OAL | Step_ALT | Step_ALY |
|---|---|---|---|---|---|---|
| CD4 t = −1 | 100 | 100 | 26 | 100 | 100 | 100 |
| CD4 t = 1 | 100 | 100 | 100 | 100 | 100 | 100 |
| CD8 t = −1 | 71 | 20 | 20 | 77 | 20 | 20 |
| RBC t = 1 | 65 | 28 | 35 | 76 | 30 | 28 |
| RBC t = −2 | 64 | 7 | 4 | 18 | 18 | 7 |
| WBC t = 1 | 59 | 24 | 16 | 61 | 23 | 25 |
| college | 57 | 9 | 19 | 38 | 8 | 9 |
| CD4 t = −2 | 52 | 36 | 19 | 58 | 32 | 36 |
| platelet t = −1 | 49 | 14 | 37 | 65 | 12 | 14 |
| CD8 t = 1 | 46 | 13 | 62 | 56 | 14 | 13 |
| treat t = −3 | 43 | 7 | 38 | 59 | 6 | 6 |
| treat t = −1 | 42 | 11 | 100 | 58 | 12 | 11 |
| treat t = −2 | 41 | 7 | 80 | 42 | 9 | 7 |
| platelet t = −3 | 37 | 4 | 21 | 38 | 3 | 4 |
| WBC t = −1 | 30 | 1 | 17 | 40 | 2 | 1 |
| age | 24 | 2 | 28 | 15 | 1 | 2 |
| CD8 t = −2 | 23 | 1 | 11 | 35 | 2 | 1 |
| RBC t = −1 | 22 | 3 | 17 | 45 | 5 | 3 |
| white | 21 | 1 | 25 | 13 | 1 | 1 |
| platelet t = 1 | 19 | 1 | 20 | 36 | 1 | 1 |
| CD4 t = −3 | 18 | 3 | 12 | 39 | 3 | 3 |
| CD8 t = −3 | 17 | 2 | 28 | 25 | 2 | 2 |
| WBC t = −2 | 14 | 1 | 19 | 30 | 1 | 1 |
| WBC t = −3 | 13 | 1 | 29 | 25 | 1 | 1 |
| platelet t = −2 | 12 | 1 | 15 | 27 | 1 | 1 |
| RBC t = −3 | 10 | 0 | 21 | 15 | 1 | 0 |

References

  1. Kaslow, R.A.; Ostrow, D.G.; Detels, R.; Phair, J.P.; Polk, B.F.; Rinaldo, C.R., Jr. The Multicenter AIDS Cohort Study: Rationale, Organization, and Selected Characteristics of the Participants. Am. J. Epidemiol. 1987, 126, 310–318.
  2. Rosenbaum, P.R.; Rubin, D.B. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 1983, 70, 41–55.
  3. Rubin, D.B. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. J. Educ. Psychol. 1974, 66, 688–701.
  4. Lange, K.; Little, R.J.A.; Taylor, J.M.G. Robust Statistical Modeling Using the t Distribution. J. Am. Stat. Assoc. 1989, 84, 881–896.
  5. Zhou, T.; Elliott, M.R.; Little, R.J.A. Penalized Spline of Propensity Methods for Treatment Comparison. J. Am. Stat. Assoc. 2019, 114, 1–19.
  6. Rubin, D.B. The Design Versus the Analysis of Observational Studies for Causal Effects: Parallels with the Design of Randomized Trials. Stat. Med. 2007, 26, 20–36.
  7. Brookhart, M.A.; Schneeweiss, S.; Rothman, K.J.; Glynn, R.J.; Avorn, J.; Stürmer, T. Variable Selection for Propensity Score Models. Am. J. Epidemiol. 2006, 163, 1149–1156.
  8. de Luna, X.; Waernbaum, I.; Richardson, T.S. Covariate Selection for the Nonparametric Estimation of an Average Treatment Effect. Biometrika 2011, 98, 861–875.
  9. Shortreed, S.M.; Ertefaie, A. Outcome-Adaptive Lasso: Variable Selection for Causal Inference. Biometrics 2017, 73, 1111–1122.
  10. VanderWeele, T.J.; Shpitser, I. A New Criterion for Confounder Selection. Biometrics 2011, 67, 1406–1413.
  11. Rubin, D.B.; Thomas, N. Matching Using Estimated Propensity Scores: Relating Theory to Practice. Biometrics 1996, 52, 249–264.
  12. Angrist, J.D.; Imbens, G.W.; Rubin, D.B. Identification of Causal Effects Using Instrumental Variables. J. Am. Stat. Assoc. 1996, 91, 444–455.
  13. Rubin, D.B. Discussion of “Randomization Analysis of Experimental Data: The Fisher Randomization Test” by D. Basu. J. Am. Stat. Assoc. 1980, 75, 591–593.
  14. Little, R.J.A.; An, H. Robust Likelihood-Based Analysis of Multivariate Data with Missing Values. Stat. Sin. 2004, 14, 949–968.
  15. Zhang, G.; Little, R.J.A. Extensions of the Penalized Spline of Propensity Prediction Method of Imputation. Biometrics 2009, 65, 911–918.
  16. Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; Wiley: New York, NY, USA, 1987.
  17. Eilers, P.H.C.; Marx, B.D. Flexible Smoothing with B-Splines and Penalties. Stat. Sci. 1996, 11, 89–121.
  18. Ngo, L.; Wand, M.P. Smoothing with Mixed Model Software. J. Stat. Softw. 2004, 9, 1–54.
  19. Wand, M.P. Smoothing and Mixed Models. Comput. Stat. 2003, 18, 223–249.
  20. Efron, B. Estimation and Accuracy after Model Selection (with Discussion). J. Am. Stat. Assoc. 2014, 109, 991–1007.
  21. Mao, H.; Li, L.; Greene, T. Propensity Score Weighting Analysis and Treatment Effect Discovery. Stat. Methods Med. Res. 2019, 28, 2439–2454.
  22. Zou, H. The Adaptive Lasso and Its Oracle Properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
  23. Zigler, C.M.; Dominici, F. Uncertainty in Propensity Score Estimation: Bayesian Methods for Variable Selection and Model Averaged Causal Effects. J. Am. Stat. Assoc. 2014, 109, 95–107.
  24. Hansen, B.B. The Prognostic Analogue of the Propensity Score. Biometrika 2008, 95, 481–488.
  25. Gelman, A.; Loken, E. The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem Even When There Is No “Fishing Expedition” or “p-Hacking” and the Research Hypothesis Was Posited Ahead of Time. 2013. Available online: http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf (accessed on 9 June 2021).
Figure 1. Propensity score distributions for the treated (grey) and control (black) groups: (A) including all covariates in the propensity score model, $\pi_{z=1}^{0.95} = 18\%$ and $\pi_{z=0}^{0.95} = 22\%$; (B) including only the covariates that were selected more than 20% of the time by Step_ALT among 1000 bootstrap samples, $\pi_{z=1}^{0.95} = 33\%$ and $\pi_{z=0}^{0.95} = 49\%$.
Table 1. 1000 × RMSE with sample size of 200. The treatment effect η = 2. S1, S2, and S3 denote scenarios 1, 2, and 3, respectively.

| Approach | Model Select | PENCOMP S1 | PENCOMP S2 | PENCOMP S3 | AIPTW S1 | AIPTW S2 | AIPTW S3 | IPTW S1 | IPTW S2 | IPTW S3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard/Rubin | allPotent | 215 | 190 | 215 | 278 | 242 | 278 | 412 | 313 | 344 |
| Bagging | allPotent | 221 | 196 | 221 | 251 | 232 | 251 | 344 | 294 | 299 |
| Standard/Rubin | true | 207 | 189 | 207 | 222 | 203 | 222 | 356 | 291 | 308 |
| Bagging | true | 207 | 190 | 207 | 206 | 193 | 206 | 329 | 278 | 286 |
| Standard/Rubin | outcomePred | 159 | 142 | 159 | 163 | 143 | 163 | 187 | 145 | 171 |
| Bagging | outcomePred | 159 | 142 | 159 | 161 | 143 | 161 | 186 | 145 | 170 |
| Standard/Rubin | trueConf | 159 | 143 | 159 | 161 | 144 | 161 | 219 | 186 | 209 |
| Bagging | trueConf | 159 | 143 | 159 | 159 | 144 | 159 | 219 | 186 | 208 |
| Standard/Rubin | SW | 214 | 194 | 213 | 249 | 231 | 250 | 382 | 317 | 327 |
| Bagging | SW | 217 | 196 | 216 | 230 | 217 | 230 | 326 | 280 | 283 |
| Standard/Rubin | OAL | 177 | 166 | 183 | 183 | 172 | 193 | 217 | 180 | 202 |
| Bagging | OAL | 178 | 167 | 184 | 179 | 168 | 185 | 206 | 178 | 193 |
| Standard/Rubin | Step-ALT | 164 | 149 | 181 | 165 | 145 | 234 | 189 | 147 | 242 |
| Bagging | Step-ALT | 164 | 148 | 181 | 166 | 149 | 182 | 189 | 151 | 188 |
| Standard/Rubin | Step-ALY | 164 | 148 | 182 | 164 | 145 | 236 | 187 | 146 | 246 |
| Bagging | Step-ALY | 164 | 148 | 182 | 166 | 149 | 183 | 190 | 150 | 191 |
Table 2. 1000 × noncoverage rate (5%) with sample size of 200. The nominal coverage is 95%. The treatment effect η = 2. S1, S2, and S3 denote scenarios 1, 2, and 3, respectively.

| Approach | Model Select | PENCOMP S1 | PENCOMP S2 | PENCOMP S3 | AIPTW S1 | AIPTW S2 | AIPTW S3 | IPTW S1 | IPTW S2 | IPTW S3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard/Rubin | allPotent | 8 | 16 | 8 | 16 | 16 | 16 | 14 | 14 | 8 |
| Bagging | allPotent | 34 | 40 | 34 | 42 | 52 | 42 | 60 | 28 | 40 |
| Standard/Rubin | true | 16 | 34 | 16 | 44 | 52 | 44 | 74 | 64 | 60 |
| Bagging | true | 32 | 36 | 32 | 34 | 38 | 34 | 66 | 62 | 50 |
| Standard/Rubin | outcomePred | 28 | 36 | 28 | 44 | 44 | 44 | 56 | 40 | 58 |
| Bagging | outcomePred | 24 | 26 | 24 | 36 | 32 | 36 | 48 | 28 | 48 |
| Standard/Rubin | trueConf | 30 | 40 | 30 | 40 | 46 | 40 | 54 | 38 | 34 |
| Bagging | trueConf | 24 | 30 | 24 | 32 | 42 | 32 | 38 | 20 | 30 |
| Standard/Rubin | SW | 6 | 18 | 6 | 10 | 20 | 10 | 12 | 12 | 14 |
| Bagging | SW | 32 | 38 | 32 | 38 | 54 | 42 | 60 | 42 | 46 |
| Standard/Rubin | OAL | 16 | 26 | 20 | 20 | 32 | 24 | 18 | 16 | 24 |
| Bagging | OAL | 30 | 40 | 32 | 38 | 38 | 36 | 38 | 30 | 38 |
| Standard/Rubin | Step-ALT | 24 | 26 | 32 | 36 | 36 | 96 | 40 | 22 | 106 |
| Bagging | Step-ALT | 24 | 24 | 46 | 38 | 30 | 56 | 44 | 32 | 66 |
| Standard/Rubin | Step-ALY | 24 | 26 | 34 | 32 | 34 | 104 | 36 | 22 | 108 |
| Bagging | Step-ALY | 26 | 22 | 54 | 36 | 30 | 58 | 44 | 30 | 66 |
Table 3. 1000 × RMSE with sample size of 1000. The treatment effect η = 2. S1, S2, and S3 denote scenarios 1, 2, and 3, respectively.

| Approach | Model Select | PENCOMP S1 | PENCOMP S2 | PENCOMP S3 | AIPTW S1 | AIPTW S2 | AIPTW S3 | IPTW S1 | IPTW S2 | IPTW S3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard/Rubin | allPotent | 94 | 81 | 94 | 130 | 93 | 130 | 180 | 112 | 146 |
| Bagging | allPotent | 95 | 81 | 95 | 122 | 92 | 122 | 172 | 111 | 142 |
| Standard/Rubin | true | 94 | 80 | 94 | 117 | 89 | 117 | 186 | 124 | 157 |
| Bagging | true | 94 | 80 | 94 | 112 | 87 | 112 | 178 | 122 | 152 |
| Standard/Rubin | outcomePred | 72 | 64 | 72 | 74 | 64 | 74 | 86 | 65 | 78 |
| Bagging | outcomePred | 72 | 64 | 72 | 74 | 64 | 74 | 86 | 65 | 78 |
| Standard/Rubin | trueConf | 72 | 64 | 72 | 74 | 64 | 74 | 104 | 84 | 98 |
| Bagging | trueConf | 72 | 64 | 72 | 73 | 64 | 73 | 104 | 84 | 98 |
| Standard/Rubin | SW | 94 | 80 | 94 | 127 | 92 | 127 | 191 | 117 | 156 |
| Bagging | SW | 94 | 81 | 94 | 118 | 90 | 118 | 171 | 109 | 141 |
| Standard/Rubin | OAL | 76 | 67 | 78 | 78 | 67 | 81 | 91 | 69 | 87 |
| Bagging | OAL | 76 | 68 | 78 | 77 | 67 | 80 | 91 | 68 | 84 |
| Standard/Rubin | Step-ALT | 72 | 64 | 80 | 74 | 64 | 94 | 86 | 65 | 108 |
| Bagging | Step-ALT | 72 | 64 | 80 | 74 | 64 | 81 | 86 | 65 | 90 |
| Standard/Rubin | Step-ALY | 72 | 64 | 81 | 74 | 64 | 94 | 86 | 65 | 109 |
| Bagging | Step-ALY | 72 | 64 | 81 | 74 | 64 | 81 | 86 | 65 | 90 |
Table 4. 1000 × noncoverage rate (5%) with sample size of 1000. The nominal coverage is 95%. The treatment effect η = 2. S1, S2, and S3 denote scenarios 1, 2, and 3, respectively.

| Approach | Model Select | PENCOMP S1 | PENCOMP S2 | PENCOMP S3 | AIPTW S1 | AIPTW S2 | AIPTW S3 | IPTW S1 | IPTW S2 | IPTW S3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard/Rubin | allPotent | 34 | 44 | 34 | 50 | 38 | 50 | 70 | 54 | 52 |
| Bagging | allPotent | 6 | 10 | 6 | 10 | 12 | 10 | 28 | 10 | 4 |
| Standard/Rubin | true | 38 | 34 | 38 | 58 | 36 | 58 | 90 | 48 | 54 |
| Bagging | true | 4 | 8 | 4 | 6 | 12 | 6 | 28 | 4 | 10 |
| Standard/Rubin | outcomePred | 48 | 46 | 48 | 48 | 42 | 48 | 62 | 48 | 62 |
| Bagging | outcomePred | 4 | 2 | 4 | 8 | 2 | 8 | 12 | 6 | 10 |
| Standard/Rubin | trueConf | 46 | 48 | 46 | 52 | 44 | 52 | 48 | 60 | 52 |
| Bagging | trueConf | 4 | 0 | 4 | 6 | 4 | 6 | 6 | 6 | 8 |
| Standard/Rubin | SW | 34 | 40 | 34 | 54 | 44 | 54 | 76 | 42 | 54 |
| Bagging | SW | 6 | 8 | 6 | 8 | 12 | 8 | 28 | 6 | 6 |
| Standard/Rubin | OAL | 24 | 34 | 26 | 30 | 22 | 30 | 28 | 18 | 26 |
| Bagging | OAL | 4 | 2 | 4 | 6 | 4 | 6 | 6 | 4 | 6 |
| Standard/Rubin | Step-ALT | 44 | 36 | 24 | 48 | 40 | 60 | 60 | 44 | 84 |
| Bagging | Step-ALT | 4 | 2 | 2 | 8 | 4 | 4 | 12 | 6 | 6 |
| Standard/Rubin | Step-ALY | 44 | 36 | 24 | 48 | 40 | 56 | 60 | 44 | 84 |
| Bagging | Step-ALY | 4 | 2 | 2 | 8 | 4 | 4 | 12 | 6 | 6 |
Table 5. Treatment effect estimates and 95% confidence intervals.

| Model Select | Approach | IPTW | AIPTW | PENCOMP |
|---|---|---|---|---|
| allPotent | Rubin/standard | 7.5 (−2.2, 17.1) | 1.3 (−0.7, 3.3) | 0.7 (−1.4, 2.7) |
| allPotent | Bagging | 5.8 (−3.0, 14.6) | 0.9 (−0.9, 2.6) | 0.7 (−1.0, 2.4) |
| SW | Rubin/standard | 11.9 (1.4, 22.4) | 2.7 (−0.03, 5.4) | 0.9 (−1.9, 3.7) |
| SW | Bagging | 6.7 (−2.7, 16.0) | 1.7 (−0.5, 3.9) | 0.9 (−1.3, 3.1) |
| OAL | Rubin/standard | 2.5 (−6.6, 11.5) | 0.9 (−2.1, 3.9) | 0.6 (−1.5, 2.7) |
| OAL | Bagging | 4.9 (−3.2, 13.0) | 1.6 (−0.9, 4.1) | 0.6 (−1.3, 2.5) |
| Step-ALT | Rubin/standard | 0.5 (−6.9, 7.9) | −0.4 (−2.5, 1.7) | −0.05 (−1.8, 1.7) |
| Step-ALT | Bagging | 2.0 (−5.0, 9.0) | 0.4 (−1.6, 2.3) | −0.04 (−1.6, 1.5) |
| Step-ALY | Rubin/standard | 0.5 (−7.0, 7.9) | −0.4 (−2.4, 1.6) | −0.09 (−1.8, 1.7) |
| Step-ALY | Bagging | 1.9 (−5.3, 9.0) | 0.3 (−1.6, 2.2) | −0.08 (−1.7, 1.6) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

