A Simulation-Based Comparison of Covariate Adjustment Methods for the Analysis of Randomized Controlled Trials

Covariate adjustment methods are frequently used when baseline covariate information is available for randomized controlled trials. Using a simulation study, we compared the analysis of covariance (ANCOVA) with three nonparametric covariate adjustment methods with respect to point and interval estimation for the difference between means. The three alternative methods were based on important members of the generalized empirical likelihood (GEL) family, specifically on the empirical likelihood (EL) method, the exponential tilting (ET) method, and the continuous updated estimator (CUE) method. Two criteria were considered for the comparison of the four statistical methods: the root mean squared error and the empirical coverage of the nominal 95% confidence intervals for the difference between means. Based on the results of the simulation study, for sensitivity analysis purposes, we recommend the use of ANCOVA (with robust standard errors when heteroscedasticity is present) together with the CUE-based covariate adjustment method.


Introduction
When baseline covariate information is available for randomized controlled trials in the areas of environmental research and public health, statistical methods that perform covariate adjustment are usually employed. There are two main reasons to use covariate adjustment methods for the statistical analysis of randomized experiments: one is variance reduction for the estimators for the parameters of interest, which will lead to narrower confidence intervals and more powerful statistical tests; the other is to achieve the equivalence of the treatment groups that is expected as a consequence of randomization [1]. We note that under Neyman's causal model for randomization inference, the use of ordinary least squares regression covariate adjustment may increase the asymptotic variance in some cases [2]. This issue can be addressed by the inclusion of treatment by covariate interactions, or by the use of robust standard error estimators [3].
An example of a randomized controlled trial is the randomized study from Lanphear et al. (2000) [4] that investigated the long-term effect of dust control on blood lead concentrations. The participants were 275 children from Rochester, New York, who were randomized (together with their families) at six months of age to an intervention group (that received cleaning equipment and up to eight visits by a trained lead hazard control advisor) or to a control group. The intervention was terminated when the children were 24 months of age. The outcome for this experimental study was the natural log transformed blood lead concentration at the 48-month follow-up, while the natural log transformed blood lead concentration at the six-month baseline may be used as a covariate.
The analysis of randomized controlled trials, like the one described above, is usually performed using the classic analysis of covariance (ANCOVA). ANCOVA is a method that combines features of the analysis of variance (ANOVA) and linear regression [5]. It is a popular parametric method used to compare the means of the outcome variables for different treatment groups while controlling for the covariates. ANCOVA may involve one or more covariates, and compared to ANOVA, it reduces the variance of the estimators of interest. Recently, Wu and Ying [6] proposed the use of the empirical likelihood (EL) method to perform covariate adjustment for randomized clinical trials, as a nonparametric alternative to ANCOVA. This method allows the efficient incorporation of side information, such as the expected balance of the covariates between the treatment groups in a randomized study. Related nonparametric covariate adjustment methods can be developed by using the exponential tilting (ET) and continuous updated estimator (CUE) methods instead of the EL method.
In this paper, we evaluated the usefulness of three important members of the generalized empirical likelihood (GEL) family, namely the EL, ET, and CUE methods, with respect to performing covariate adjustment for randomized studies in environmental research and public health. We chose these three methods because they are important members of the GEL family and are implemented in the R package gmm [7,8]. Using a simulation study, we compared these three nonparametric covariate adjustment methods and ANCOVA. In addition to comparing ANCOVA with the three GEL methods, we also compared the three GEL methods among themselves, to identify whether one of them performs best in a consistent way. The evaluation of the performance of these four methods was based on the estimated root mean squared error (RMSE) and the empirical coverage of nominal 95% confidence intervals (CIs), for varying sample sizes, covariance structures, underlying distributions, and numbers of covariates, using 10,000 simulations per scenario.

Covariate Adjustment Methods
To compare outcome means between treatment groups, we use ANOVA (when we do not perform covariate adjustment) or ANCOVA (when we perform covariate adjustment), assuming that the error terms are independent, normally distributed, and with equal variance. For sensitivity analysis purposes, we may also want to use alternative statistical methods that do not make these parametric assumptions, to evaluate how robust the results of the ANOVA/ANCOVA methods are to their specific assumptions. In our paper, the covariate adjustment was performed using three GEL methods (EL estimation, ET estimation, and CUE), in addition to the ANCOVA method. The technical details regarding the GEL methods and the three nonparametric covariate adjustment methods based on the EL, ET, and CUE methods are included in the sections of the Appendix. Here, we provide only a simplified description of these covariate adjustment methods to allow the reader to understand the main ideas underlying them.
For simplicity, let us consider a randomized study where we have two treatment groups, one outcome, and one covariate. We want to estimate the outcome mean difference between the two treatment groups with adjustment for the covariate. The GEL-based covariate adjustment methods start with all observations having uniform weights 1/n, where n is the total sample size. To estimate the outcome mean difference, we reweigh the observations as little as possible, as measured by a "distance" between the uniform weights 1/n and the new weights, such that the weighted means (using the new weights) of the covariate for the two treatment groups are equal (i.e., covariate balance). The estimate of the outcome mean difference is the difference between the weighted means (using the new weights that provide covariate balance) of the outcome.
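As an illustration of this reweighting idea, the sketch below computes exponential-tilting weights (the ET "distance" is the Kullback-Leibler divergence, which makes the weights proportional to exp(λx)) that move the weighted covariate mean of each treatment group to a common target. This is a simplified, per-group version of the joint GEL optimization, not the actual procedure used in the paper (the analyses here rely on the R package gmm), and the toy data are made up:

```python
import math

def et_weights(x, target, tol=1e-12):
    """Exponential-tilting weights pi_i proportional to exp(lam * x_i),
    with lam chosen so that sum(pi_i * x_i) equals `target`.  This is the
    minimal reweighting of the uniform distribution, in Kullback-Leibler
    distance, that moves the weighted mean of x to the target."""
    def weighted_mean(lam):
        w = [math.exp(lam * xi) for xi in x]
        s = sum(w)
        return sum(wi * xi for wi, xi in zip(w, x)) / s
    # weighted_mean is increasing in lam (its derivative is a weighted
    # variance), so bisection on a wide bracket works
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if weighted_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * xi) for xi in x]
    s = sum(w)
    return [wi / s for wi in w]

# toy data: covariate x and outcome y for two randomized groups
x0, y0 = [0.8, 1.1, 1.4, 0.9], [2.0, 2.3, 2.9, 2.1]
x1, y1 = [1.0, 1.3, 1.6, 1.2], [2.2, 2.6, 3.1, 2.4]
mu_x = sum(x0 + x1) / len(x0 + x1)   # common covariate mean target

p0 = et_weights(x0, mu_x)
p1 = et_weights(x1, mu_x)
# covariate-adjusted estimate of the outcome mean difference
adj_diff = (sum(p * y for p, y in zip(p1, y1))
            - sum(p * y for p, y in zip(p0, y0)))
```

The same template applies to the EL and CUE methods; only the discrepancy measure, and hence the functional form of the weights, changes.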
To construct the 95% confidence interval for the outcome mean difference by using the test inversion method, for each hypothesized value for the outcome mean difference, we reweigh the observations to achieve covariate balance and to have the outcome (weighted) mean difference equal the hypothesized value. If the new weights are "too far" from the uniform weights, we do not include that specific hypothesized value (for the outcome mean difference) in the 95% confidence interval. Conceptually, to construct the 95% confidence interval, we perform this for all possible values of the outcome mean difference. It is important to note that the only difference between the three GEL-based covariate adjustment methods is the specific measure used to quantify the "distance" between the uniform weights and the new weights.

Simulation Study
The simulation study had two goals. The first goal was to estimate the root mean squared error (RMSE) for each method, using 10,000 simulations for each scenario. The second goal was to evaluate how well the nominal 95% CIs for the difference between means constructed by these methods cover the true mean difference (0, in our simulation study), by calculating the empirical coverage based on 10,000 simulations. The point estimates and corresponding 95% confidence intervals for the difference between means using the EL, ET, and CUE methods were constructed using the R package gmm [7,8]. The test-inversion confidence intervals for the GEL methods are available only starting with version 1.6 of the R package gmm.
Our simulation study is divided into three parts. In the first part, we consider situations involving equal sample sizes for the treatment groups, homoscedasticity, and no interaction between covariates and the treatment group. In the second part, we consider situations involving unequal sample sizes for the treatment groups, heteroscedasticity, and/or interactions between covariates and the treatment group. For both the first and the second part of the simulation study, we consider only the case when the true outcome mean difference is zero. In the third part of the simulation study, we use real data from Lanphear et al. [4] to investigate situations involving equal sample sizes for the treatment groups, homoscedasticity, and no interaction between covariates and the treatment group, similar to the first part of the simulation study, while considering situations where the true outcome mean difference is different from zero. We note that our simulation study is comprehensive, covering a broad range of possible situations and also including simulations based on real data.
The general setup for the simulation study was as follows:
1. We estimated the difference between means and constructed the corresponding 95% CIs, without adjustment and with adjustment for one or two covariates.
2. We performed 10,000 simulations for each scenario under investigation.
3. We considered a sample size of 200, from which 200(1 − δ) observations are assigned to group 1 (z = 0) and 200δ to group 2 (z = 1), where δ is between 0 and 1. Without loss of generality, the vector z is generated by setting the first 200(1 − δ) elements to 0 and the remaining ones to 1.
4. For the underlying distributions of the data, we considered the following three types of multivariate distributions for (y, x_1, x_2), where y is the outcome and x_1 and x_2 are the covariates: (a) normal (generated using the R package mvtnorm [9]); (b) t with three degrees of freedom (generated using the R package mnormt [10]); (c) centered lognormal (generated using the R package mvtnorm [9]). For each distribution, Var(y) = Var(x_1) = Var(x_2) = 1, Cor(y, x_1) = Cor(y, x_2) = Cor(x_1, x_2) = ρ, and the three variables have mean 0. For the lognormal, which is the exponential of a multivariate normal with mean 0 and covariance matrix Σ, the multivariate normal was selected so as to obtain the desired variances and correlations; we also subtracted from each variable its expected value.
5. We want to evaluate different scenarios, in particular allowing for unequal assignment to the treatment groups, Var(y|z = 1) ≠ Var(y|z = 0), and/or Cor(y, x_i|z = 1) ≠ Cor(y, x_i|z = 0). To accomplish that, after generating the 200 observations, the outcome is modified as follows: every y_i with z_i = 1 is multiplied by v_1, and then β_2(x_1i + x_2i) is added, where v_1 is a parameter that affects the variance of y when z = 1, and β_2 is another parameter that affects the correlation between y and the covariates when z = 1. This modification has no effect on y when z = 0, but it affects the variance of y and its correlation with the covariates when z = 1 in the following way: Var(y|z = 1) = v_1² + 2β_2²(1 + ρ) + 4v_1β_2ρ and Cor(y, x_i|z = 1) = (v_1ρ + β_2(1 + ρ))/√Var(y|z = 1).
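The outcome modification in step 5 can be checked numerically. A direct variance calculation under the setup above gives Var(y|z = 1) = v_1² + 2β_2²(1 + ρ) + 4v_1β_2ρ. The sketch below simulates the modification with a pure-Python stand-in for the multivariate normal generation (a shared factor induces the equicorrelation ρ) and compares the empirical variance with this expression; the parameter values are illustrative, not those of any scenario in the paper:

```python
import math
import random

random.seed(1)
rho, v1, b2 = 0.5, 1.2, 0.3     # illustrative parameter values
n = 200_000
sr, sc = math.sqrt(rho), math.sqrt(1 - rho)

ys = []
for _ in range(n):
    f = random.gauss(0, 1)                 # shared factor -> pairwise correlation rho
    y = sr * f + sc * random.gauss(0, 1)   # each variable has variance 1, mean 0
    x1 = sr * f + sc * random.gauss(0, 1)
    x2 = sr * f + sc * random.gauss(0, 1)
    ys.append(v1 * y + b2 * (x1 + x2))     # modification applied when z = 1

m = sum(ys) / n
var_emp = sum((v - m) ** 2 for v in ys) / n
var_theory = v1**2 + 2 * b2**2 * (1 + rho) + 4 * v1 * b2 * rho
```

With these values the formula gives Var(y|z = 1) = 2.43, and the empirical variance agrees up to Monte Carlo error.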

Equal Sample Sizes, Homoscedasticity, and No Interaction
For the first part of our simulation, we set δ = 0.5, v_1 = 1, and β_2 = 0, which implies Var(y|z) = 1 and Cor(y, x_i|z) = ρ for the two treatment groups. In this set of simulations, we want to compare the properties of the four methods for different values of the correlation coefficient ρ. In particular, we consider ρ equal to one of the following values: {0, 0.1, 0.3, 0.5, 0.7, 0.9}.
We note that the simulated data satisfy the moment conditions for the GEL methods for all three distributions considered. The data simulated using the normal distribution satisfy the ANOVA/ANCOVA assumptions. The data simulated using the t distribution with three degrees of freedom and the lognormal distribution satisfy the ANOVA/ANCOVA assumptions except for the normality assumption for the error terms, although the use of treatment groups with equal sample sizes makes the ANOVA/ANCOVA method robust to violations of the normality assumption; see [5,11]. Because of the randomization, there is no confounding due to the covariates; we adjust for covariates only to increase the efficiency of our estimators of the outcome mean difference between the two treatment groups.

Unequal Sample Sizes, Heteroscedasticity, and/or Interaction
For the second part of the simulation study, we consider scenarios involving unbalanced treatment groups, heteroscedasticity, and/or interactions between covariates and the treatment group. Specifically:
Case 1: unequal group sizes, heteroscedasticity, and interaction (Var(y|z = 1) = 4.16 and Cor(y, x_i|z = 1) = 0.71);
Case 2: equal group sizes, homoscedasticity, and interaction (Var(y|z = 1) = 2.75 and Cor(y, x_i|z = 1) = 0.75);
Case 3: unequal group sizes, homoscedasticity, and no interaction (Var(y|z = 1) = 1 and Cor(y, x_i|z = 1) = 0.5);
Case 4: equal group sizes, heteroscedasticity, and no interaction (Var(y|z = 1) = 2 and Cor(y, x_i|z = 1) = 0.5);
Case 5: unequal group sizes, heteroscedasticity, and no interaction (Var(y|z = 1) = 2 and Cor(y, x_i|z = 1) = 0.5).
We note that the validity of the GEL moment conditions is not affected by these changes, while the validity of the ANCOVA assumptions (i.e., homoscedasticity, no covariate by treatment interaction) is affected.

Real Data and Non-Null Effect Sizes
To enhance the paper, we have used real data from the randomized controlled trial described in Lanphear et al. [4] to perform additional simulations that are close to a real life situation, and also to illustrate the use of the four covariate adjustment methods with real data. We have used for this paper the data for the 169 children for whom both six-month baseline and 48-month follow-up blood lead concentrations are available. This includes 89 children randomized to the intervention group (group 2 or z = 1, using the above terminology) and 80 children randomized to the control group (group 1 or z = 0). Similar to the original study, we have used the natural log transformed blood lead concentration values instead of the original blood lead concentration values. The outcome for this experimental study was the natural log transformed blood lead concentration at the 48-month follow-up, while the covariate was the natural log transformed blood lead concentration at the six-month baseline.
For the third part of the simulation study, we used descriptive statistics (means, standard deviations, and the correlation coefficient) from the real data to consider scenarios where the mean difference is not null, allowing us to compare the statistical power of the four different methods. We expressed the mean difference in standard deviation units. The setup of the simulations was as follows: n = 100 for each treatment group; {y, x} is bivariate normal with mean {1.82 + D(z), 1.07} and covariance matrix Σ, with Σ_11 = 0.58², Σ_22 = 0.52², and Σ_12 = Σ_21 = (0.35)(0.58)(0.52), where Σ_ij is the element of Σ in the ith row and jth column; and the number of replications was 10,000. For the third part of the simulation study, we have δ = 0.5, v_1 = 1, and β_2 = 0, which implies the same variance and correlation for the two groups. The correlation between x and y is therefore equal to 0.35. Here, D(z) = ∆ when z = 0 (i.e., the control group or group 1) and D(z) = 0 when z = 1 (i.e., the intervention group or group 2), where ∆ ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. For ∆ = 0, we evaluate the size of the statistical tests, while, for all other values, we estimate the statistical power. We simulated normal data because the distributions of the natural log transformed blood lead concentrations for the two treatment groups were approximately normal.
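The power-estimation loop described above can be sketched in simplified form: simulate replications with a mean shift of ∆ standard deviation units in the control group, and estimate the rejection rate of an unadjusted two-sample test and of a simple ANCOVA-style adjusted test (the pooled within-group slope is removed before testing). This is only an illustration of the simulation loop, not the paper's GEL-based analyses, and the replication count and ∆ value are illustrative:

```python
import math
import random

random.seed(2)
n, rho, delta, reps = 100, 0.35, 0.5, 1000
sd_e = math.sqrt(1 - rho**2)

def mean(v):
    return sum(v) / len(v)

def rejects(a, b, crit=1.96):
    """Two-sample z-test of equal means at the two-sided 5% level."""
    ma, mb = mean(a), mean(b)
    va = sum((u - ma)**2 for u in a) / (len(a) - 1)
    vb = sum((u - mb)**2 for u in b) / (len(b) - 1)
    return abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b)) > crit

rej_unadj = rej_adj = 0
for _ in range(reps):
    # outcome in standard-deviation units; the control group mean is
    # shifted by delta, mimicking the D(z) shift described in the text
    x0 = [random.gauss(0, 1) for _ in range(n)]
    x1 = [random.gauss(0, 1) for _ in range(n)]
    y0 = [delta + rho * x + sd_e * random.gauss(0, 1) for x in x0]
    y1 = [rho * x + sd_e * random.gauss(0, 1) for x in x1]
    rej_unadj += rejects(y0, y1)
    # ANCOVA-style adjustment: subtract the pooled within-group slope
    mx0, my0, mx1, my1 = mean(x0), mean(y0), mean(x1), mean(y1)
    sxy = (sum((x - mx0) * (y - my0) for x, y in zip(x0, y0))
           + sum((x - mx1) * (y - my1) for x, y in zip(x1, y1)))
    sxx = (sum((x - mx0)**2 for x in x0)
           + sum((x - mx1)**2 for x in x1))
    b = sxy / sxx
    rej_adj += rejects([y - b * x for x, y in zip(x0, y0)],
                       [y - b * x for x, y in zip(x1, y1)])

power_unadj, power_adj = rej_unadj / reps, rej_adj / reps
```

With ρ = 0.35 and ∆ = 0.5, the adjusted test is expected to be slightly more powerful, reflecting the variance reduction from covariate adjustment.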

Results
Tables 1-3 show the estimated RMSE and the empirical coverage of the nominal 95% confidence intervals for each of the four covariate adjustment methods, separately for each scenario from the first part of the simulation study with a specified correlation structure and underlying distribution, based on 10,000 simulations. For completeness, we present the results without covariate adjustment, with adjustment for one covariate, and with adjustment for two covariates. It is important to note that even if baseline covariates are available, given the randomization, it is not required to adjust for them.
Table 1 presents the simulation results for the situation involving no covariate adjustment (i.e., either no covariate information is available, or the covariates are not used for adjustment although they are available). Overall, ANOVA and CUE perform equally well with respect to empirical coverage, and better than the EL and ET methods. The estimated RMSE for the EL method is smaller for the t-distribution and lognormal distribution cases, but this is associated with empirical coverage much below the nominal level.
Table 2 presents the simulation results for the situation involving adjustment for one covariate. For the normal distribution case, the estimated RMSE and empirical coverage are similar for the four methods. For the t-distribution case, CUE and ANCOVA perform equally well and better than the EL and ET methods, while having smaller estimated RMSE. For the lognormal distribution case, the performance of the EL and ET methods with respect to empirical coverage is even worse, while the estimated RMSE for CUE and ANCOVA tends to be smaller. We note that, for completeness, we have included the case when we adjust for a covariate that is uncorrelated with the outcome, although this situation is more of theoretical than practical interest.
For each distribution under consideration, the results are consistent across the different correlation values. Similar conclusions apply to the simulation results from Table 3, where we adjust for two covariates that are correlated with the outcome and among themselves with the same correlation ρ. Tables 4-6 present the results for the five different cases and the three distributions. Overall, CUE is the best method in terms of empirical coverage. ANCOVA performs poorly in Case 1 and Case 5, which are characterized by a high variance of the response variable for the smaller treatment group. This result indicates that CUE is robust to heteroscedasticity, while ANCOVA is not. However, we can see from the results from Table 7 that using the robust standard errors makes ANCOVA comparable to CUE in terms of empirical coverage.
The results from Table 8 indicate that CUE and ANCOVA provided the best control of the type I error, which corresponds to ∆ = 0, while having similar statistical power. It is important to note that the patterns of the estimated RMSE and empirical coverage results for these scenarios involving non-null mean differences were similar to those from the previous set of simulations that involved only null mean differences. The robust standard errors were computed using the HC3 type of heteroscedasticity-consistent covariance matrix estimators [12,13].
The results of the statistical analysis of the real data are presented in Table 9. The results of the four different covariate adjustment methods were similar for the parameter of main interest ∆, i.e., the mean difference between the control group and the intervention group with respect to the natural log transformed blood lead concentration at the follow-up, adjusted for the natural log transformed blood lead concentration at the baseline. Given the very small estimates for ∆, we have provided all the results with five decimal places. Although, in the original study [4], there was no adjustment for the natural log transformed blood lead concentration at the baseline, these covariate adjusted results provide additional support for the conclusion that there was no significant effect of the intervention on the blood lead concentration.
The results of the four methods were also similar for µ_1, i.e., the mean of the natural log transformed blood lead concentration at the follow-up for the intervention group, and for µ_x, i.e., the common mean of the natural log transformed blood lead concentration at the baseline. The table also illustrates the difference between the types of results provided by ANCOVA versus the GEL-based methods: the GEL methods provide an estimate for the common covariate mean (µ_x), while ANCOVA provides an estimate for the slope of the linear relationship between the covariate and the outcome (β_x).
Table 9. Results of the statistical analysis of the data from the randomized study described in Lanphear et al. (2000) [4]. µ_1: mean of the natural log transformed blood lead concentration at the follow-up for the intervention group; ∆: mean difference between the groups with respect to the natural log transformed blood lead concentration at the follow-up; µ_x: common mean of the natural log transformed blood lead concentration at the baseline; β_x: slope of the linear relationship between the natural log transformed blood lead concentration at the baseline and the natural log transformed blood lead concentration at the follow-up.

Conclusions
For our simulation study, we performed 10,000 simulations at different levels of treatment group sample size, covariance structure, underlying distribution, and number of covariates. We have also considered cases in which the variance of the outcome and its correlations with the covariates were different for the two treatment groups. We compared a parametric method, ANOVA/ANCOVA, and three GEL methods: the EL method, the ET method, and the CUE method. The main difference between ANCOVA and the GEL methods is that the former imposes an arbitrary parametric structure, while the latter methods only assume treatment randomization.
The results of the simulation study showed that, overall, the CUE-based covariate adjustment method and ANCOVA (with robust standard errors when heteroscedasticity is present) performed equally well and better than the covariate adjustment methods based on EL and ET. In terms of computational complexity, however, ANCOVA is clearly the simpler method, since it relies on least squares estimation. Among the GEL methods considered here, EL is the least computationally stable, especially when the distribution of the variables has heavy tails. For example, for our scenarios involving the t-distribution, 20 to 30 of the 10,000 simulations per scenario failed to converge. We note that the results for the EL method may be improved by using the Bartlett correction or bootstrap calibration [14]; we have not investigated the usefulness of these two approaches in the current paper due to the additional computational complexity involved. In future research, we will consider alternative GEL-based methods that are less sensitive to distributions with heavy tails [15] and more computationally stable [16]. In addition, exploring other forms of heteroscedasticity and considering a more general set of moment conditions for general functions f(·) may help us identify situations for which the benefits of using GEL outweigh the computational complexity.
Based on the results of our simulation study, for sensitivity analysis purposes, we recommend the use of ANCOVA (with robust standard errors when heteroscedasticity is present) together with the CUE-based covariate adjustment method. This recommendation is based on the similar overall good performance in our simulation study of these two different statistical methods. If the results of the ANCOVA and the CUE-based covariate adjustment method imply similar conclusions, then the robustness of these conclusions is supported. If the results of these two different methods imply qualitatively different conclusions, then the conclusion implied by the CUE-based covariate adjustment method may be preferred given that this method only assumes that the treatment has been randomly assigned.

Acknowledgments:
The authors would like to acknowledge Bruce Lanphear from Simon Fraser University for providing a copy of the data set used in this paper.
Author Contributions: Pierre Chaussé, Jin Liu and George Luta designed the simulation study; Pierre Chaussé and Jin Liu performed the simulations; Pierre Chaussé, Jin Liu and George Luta wrote the paper; all authors read and approved the final manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Minimum Discrepancy Methods
Cressie and Read [17] have unified the goodness-of-fit statistics for multinomial data using their power-divergence family of statistics. Estimation methods for discrete multivariate data based on the Cressie and Read (C-R) power-divergence family involve the constrained minimization of a divergence measure between the observed and the expected probability distributions. This class of estimation methods includes many important special cases, such as the empirical likelihood (EL) estimation, exponential tilting (ET) estimation, and the continuous updated estimation (CUE) [14]. Newey and Smith [18] describe the related generalized empirical likelihood (GEL) family.
Before presenting the GEL methods, we first provide a brief introduction to the minimum discrepancy (MD) methods. In particular, we focus on the MD methods based on the C-R family, because the GEL methods are closely related to them. The C-R power-divergence family of statistics, proposed in [19] and unified in [17] by Cressie and Read for multinomial data, can be written as

CR(π; a) = (1/(a(a + 1))) ∑_{i=1}^n [(nπ_i)^(a+1) − 1],

(note that we use the parametrization from Newey and Smith [18]), where a is a parameter that identifies the specific member of the family, and for a = 0 and a = −1 the function is defined as the corresponding limit. As noted by Owen [14], the limiting cases CR(π; −1) and CR(π; 0) correspond to the EL method and the ET method, respectively. Furthermore, the value a = 1 corresponds to Neyman's minimum chi-square method, or the Euclidean empirical likelihood method [14,20]. The CR(π; a) function measures the discrepancy between the probability distribution π and the uniform probability distribution that assigns a probability 1/n to each observation. We can easily verify that CR(1/n; a) = 0, and that CR(π; a) is strictly positive for any other probability distribution π, for all a.
In many applications, we are interested in estimating an unknown parameter vector θ, using a set of estimating equations that we know or assume to be valid. In most cases, these estimating equations can be expressed as a vector of moment conditions E[g(x; θ)] = 0. When the number of moment conditions is equal to the dimension of the parameter vector, θ can be estimated simply by solving the system of possibly nonlinear equations ∑_{i=1}^n (1/n)g(x_i; θ) = 0. In other words, we replace the population moment conditions by their sample versions, using the uniform probability distribution as an estimate of the true probability distribution. On the other hand, if we have more moment conditions than parameters, the system of estimating equations generally has no solution.
However, under some regularity conditions, there exist infinitely many sets of non-negative probabilities {π_1, ..., π_n} satisfying the condition ∑_{i=1}^n π_i = 1 for which the system of equations ∑_{i=1}^n π_i g(x_i; θ) = 0 has a solution. The MD methods involve finding a set of probabilities {π̂_1, ..., π̂_n} and an estimate θ̂ by minimizing CR(π; a) subject to the constraint ∑_{i=1}^n π_i g(x_i; θ) = 0, for a fixed a. The objective is therefore to be as close as possible to the uniform probability distribution, which is the best nonparametric estimate of the true distribution, while satisfying the moment conditions. Of course, our estimator θ̂ depends on the value of a, which specifies the discrepancy.
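The members of the C-R family used in this paper can be evaluated directly. The sketch below assumes the Newey-Smith parametrization CR(π; a) = (1/(a(a + 1))) ∑_i [(nπ_i)^(a+1) − 1], handles the a → 0 (ET) and a → −1 (EL) limits explicitly, and checks numerically that CR(1/n; a) = 0 and that the general formula approaches the two limiting cases:

```python
import math

def cr(pi, a, eps=1e-9):
    """Cressie-Read power divergence between the probabilities pi and the
    uniform distribution, in the Newey-Smith parametrization.  The limits
    a -> 0 and a -> -1 give exponential tilting and empirical likelihood."""
    n = len(pi)
    if abs(a) < eps:            # ET limit: sum n*pi*log(n*pi)
        return sum(n * p * math.log(n * p) for p in pi)
    if abs(a + 1) < eps:        # EL limit: -sum log(n*pi)
        return -sum(math.log(n * p) for p in pi)
    return sum(((n * p) ** (a + 1) - 1) / (a * (a + 1)) for p in pi)

pi = [0.1, 0.2, 0.3, 0.4]      # a non-uniform probability distribution
uniform = [0.25] * 4
```

The value a = 1 gives the Euclidean (minimum chi-square) discrepancy corresponding to CUE; the three methods compared in this paper differ only in this choice of a.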

Appendix B. Generalized Empirical Likelihood Methods
All MD estimators based on CR(π; a) belong to the family of GEL estimators [18]. (More specifically, the estimator θ̂ is the solution to

min_θ max_λ ∑_{i=1}^n ρ(λ′g(x_i; θ)),

where the function ρ(v) depends on a. This is the dual of the MD problem, and its objective function is numerically identical to the CR function at the optimum.) In particular, Smith [21] shows that this is the case for the EL and ET methods, while Newey and Smith [18] show that the Euclidean empirical likelihood estimator [20] is the same as the CUE estimator [22]. The advantage of the GEL methods is that they offer a numerically more tractable way of solving the MD estimation problem, and they also make it easier to derive the theoretical properties of the resulting estimators. Because it is more common to refer to the GEL methods than to the MD methods when the estimating equations are based on moment conditions, the former term will be preferred for the rest of this paper.
Although all GEL estimators are asymptotically equivalent to the generalized method of moments (GMM) estimator [23], their small sample properties are different [18]. For example, Newey and Smith [18] show that in some cases the bias of the EL estimator converges to zero faster than for the other GEL methods. However, those results are only valid in large samples. Small sample properties can only be evaluated through simulation studies.
One advantage of the GEL methods over the ordinary least squares (OLS) method used by ANCOVA for covariate adjustment is that they allow us to incorporate more information through additional moment conditions. This is an important advantage, since more information usually translates into higher statistical efficiency for the estimators. The GEL methods also allow us to test whether the moment conditions are valid. Suppose E[g(x; θ)] = 0 represents our q moment conditions, and suppose θ is a parameter vector of dimension k < q. In this case the model is said to be over-identified. If the moment conditions are valid, 2nCR(π̂; a) is asymptotically distributed as a chi-squared distribution with (q − k) degrees of freedom [18]. For example, if we add conditions that are only valid when we have a randomized experiment, we could test whether the randomization has been properly performed.
We can also test the null hypothesis H_0: θ_i = c, for one element of interest of θ, as follows. First, we fix θ_i at its value under the null hypothesis, and then we estimate the remaining elements of θ using the GEL methods. Let R̂(c) = 2nCR(π̂; a) be the optimized criterion under the null hypothesis, and R̂ = 2nCR(π̂; a) be the optimized criterion under the unrestricted model. Then, if the null hypothesis is true, Q(c) = R̂(c) − R̂ is asymptotically distributed as a chi-squared distribution with 1 degree of freedom, by a result similar to Wilks' theorem. Furthermore, we can construct a nonparametric 95% confidence interval for θ_i by inverting the statistical test, i.e., by searching for all values of c such that Q(c) < 3.8415, the critical value of the chi-squared distribution with 1 degree of freedom. This is how we have constructed our confidence intervals for the three GEL methods presented in this paper. This approach is computationally intensive because we need to estimate the restricted model for all values of c (i.e., the implied probabilities π̂_i are functions of c, and they must be recomputed each time). (This is how the function confint with the option type='invLR' constructs the confidence intervals in version 1.6 and above of the R package gmm [7].) However, this approach is more flexible than the OLS-based approach to constructing confidence intervals because it does not require the confidence intervals to be symmetrical around the point estimate.
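As a concrete illustration of the test inversion, the following sketch constructs such a confidence interval for a single mean using the Euclidean (a = 1) member of the C-R family, for which the restricted weights have a closed form (Euclidean weights may be negative). This is a toy version of the idea, not the gmm implementation:

```python
import math

def eel_ci(x, crit=3.8415, grid=2000):
    """95% CI for the mean by inverting the Euclidean (a = 1) member of
    the C-R family: keep every hypothesized mean c whose criterion stays
    below the chi-squared(1) critical value."""
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((v - xbar) ** 2 for v in x) / n

    def stat(c):
        # closed-form restricted weights for the constraint sum(pi*(x-c)) = 0
        b = -(xbar - c) / (n * s2)
        pi = [1 / n + b * (v - xbar) for v in x]
        # sanity checks: the weights satisfy both constraints
        assert abs(sum(pi) - 1) < 1e-9
        assert abs(sum(p * (v - c) for p, v in zip(pi, x))) < 1e-9
        return sum((n * p - 1) ** 2 for p in pi)   # equals n*(xbar-c)^2/s2

    lo, hi = xbar - 5 * math.sqrt(s2), xbar + 5 * math.sqrt(s2)
    step = (hi - lo) / grid
    kept = [lo + i * step for i in range(grid + 1)
            if stat(lo + i * step) < crit]
    return min(kept), max(kept)

x = [1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0, 1.4, 0.7, 1.6]   # toy data
ci = eel_ci(x)
```

For this member of the family the resulting interval essentially coincides with the usual Wald interval; for the EL and ET members the restricted weights must be found numerically, and the resulting interval need not be symmetric around the point estimate.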

Appendix C. Moment Conditions
In our study, we consider the estimation of the difference between two means without adjustment for covariates as well as with adjustment for one or two covariates. For the situation involving no covariate adjustment, the moment conditions can be written as follows:

E[(1 − z)(y − µ_1)] = 0,
E[z(y − µ_1 − ∆)] = 0,

where y is the dependent variable, z is a binary treatment group indicator (z = 0 for group 1, and z = 1 for group 2), µ_1 is the mean of the outcome for group 1, and ∆ is the difference between the means of the outcome for the two groups. The number of moment conditions is equal to the number of parameters {µ_1, ∆}, which implies that the point estimates from any of the GEL methods will be identical to the OLS estimates, although the confidence intervals will be different, as described above. To illustrate the procedure described in the previous section, if we want to test the hypothesis H_0: ∆ = c, we have to estimate µ_1 after imposing this null hypothesis. The restricted model is therefore over-identified (k = 1, q = 2). The confidence interval is constructed by searching for all c such that Q(c) < 3.8415, where Q(c) = R̂(c) because R̂ = 0 (π̂_i = 1/n for just-identified models). For the situation involving adjustment for one covariate, the moment conditions can be written as follows:

E[(1 − z)(y − µ_1)] = 0,
E[z(y − µ_1 − ∆)] = 0,
E[(1 − z)(x − µ_x)] = 0,
E[z(x − µ_x)] = 0,

where x is the covariate, µ_x is the common mean of the covariate for both groups, and the other notations are the same as for the case involving no adjustment. Here, we add two additional moment conditions for the covariate x to incorporate the information that we expect the two treatment groups to be balanced with respect to the covariate. We could add many more moment conditions because, in randomized experiments, E[z(f(x) − E(f(x)))] = 0 for all functions f. In our paper, we only analyze the simplest case f(x) = x.
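For the just-identified case, the fact that the GEL point estimates coincide with the sample group means can be seen directly by solving the sample versions of the two moment conditions; a toy numerical check with made-up data:

```python
# sample moment conditions for the unadjusted case:
#   (1/n) * sum((1 - z_i) * (y_i - mu1))         = 0
#   (1/n) * sum(z_i * (y_i - mu1 - delta))       = 0
y = [2.0, 2.3, 2.9, 2.1, 2.2, 2.6, 3.1, 2.4]   # toy outcomes
z = [0, 0, 0, 0, 1, 1, 1, 1]                    # toy group indicator

n0 = sum(1 for zi in z if zi == 0)
n1 = sum(1 for zi in z if zi == 1)
# condition 1 is solved by the group-1 sample mean
mu1 = sum(yi for yi, zi in zip(y, z) if zi == 0) / n0
# condition 2 is then solved by the difference of the group sample means
delta = sum(yi for yi, zi in zip(y, z) if zi == 1) / n1 - mu1
```

Both sample moment conditions are exactly zero at these values, so any GEL method reproduces the unadjusted difference of group means here; the methods differ only in the confidence intervals obtained by test inversion.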
For the situation involving adjustment for two covariates, the moment conditions can be written as follows:

E[(1 − z)(y − µ_1)] = 0,
E[z(y − µ_1 − ∆)] = 0,
E[(1 − z)(x_1 − µ_x1)] = 0,
E[z(x_1 − µ_x1)] = 0,
E[(1 − z)(x_2 − µ_x2)] = 0,
E[z(x_2 − µ_x2)] = 0,

where x_1 and x_2 are the two covariates, µ_x1 and µ_x2 are the common means of the first and second covariate for both groups, respectively, and the other notations are the same as for the case involving no adjustment.