Normalized Augmented Inverse Probability Weighting with Neural Network Predictions

The estimation of average treatment effect (ATE) as a causal parameter is carried out in two steps, where in the first step, the treatment and outcome are modeled to incorporate the potential confounders, and in the second step, the predictions are inserted into the ATE estimators such as the augmented inverse probability weighting (AIPW) estimator. Due to the concerns regarding the non-linear or unknown relationships between confounders and the treatment and outcome, there has been interest in applying non-parametric methods such as machine learning (ML) algorithms instead. Some of the literature proposes to use two separate neural networks (NNs) where there is no regularization on the network’s parameters except the stochastic gradient descent (SGD) in the NN’s optimization. Our simulations indicate that the AIPW estimator suffers extensively if no regularization is utilized. We propose the normalization of AIPW (referred to as nAIPW) which can be helpful in some scenarios. nAIPW, provably, has the same properties as AIPW, that is, the double-robustness and orthogonality properties. Further, if the first-step algorithms converge fast enough, under regulatory conditions, nAIPW will be asymptotically normal. We also compare the performance of AIPW and nAIPW in terms of the bias and variance when small to moderate L1 regularization is imposed on the NNs.


Introduction
Estimation of causal parameters such as the average treatment effect (ATE) in observational data requires confounder adjustment. The estimation and inference are carried out in two steps: In step 1, the treatment and outcome are predicted by a statistical models or machine learning (ML) algorithm, and in the second step the predictions are inserted into the causal effect estimator. If ML algorithms are employed in step 1, the non-linear relationships can potentially be taken into account. The relationship between the confounders and the treatment and outcome can be non-linear which make the application of machine learning (ML) algorithms, which are non-parametric models, appealing. Farrell et al. [1] proposed to use two separate neural networks (double NNs or dNNs) where there is no regularization on the network's parameters except the stochastic gradient descent (SGD) in the NN's optimization [2][3][4][5]. They derive the generalization bounds and prove that the NN's algorithms are fast enough so that the asymptotic distribution of causal estimators such as the augmented inverse probability weighting (AIPW) estimator [6][7][8] will be asymptotically linear, under regulatory conditions and the utilization of cross-fitting [9].
Farrell et al. [1] argue that the fact that SGD-type algorithms control the complexity of the NN algorithm to some extent [2,10] is sufficient for the first step. Our initial simulations and analyses, however, contradict this claim in scenarios where strong confounders and instrumental variables (IVs) exist in the data.
Conditioning on IVs is harmful to the performance of the causal effect estimators such as ATE (Myers et al. [11]) but there may be no prior knowledge about which covariates are IVs, confounders or otherwise. The harm comes from the fact that the complex NNs can provide near-perfect prediction in the treatment model which violates the empirical positivity assumption [12].
The positivity assumption (Section 2) is fundamental to hold to have an identifiable causal parameter in a population. However, in a finite sample, although the parameter is identifiable by making the positivity assumption, the bias and variance of the estimator can be inflated if the estimated propensity scores are close to zero or one bounds (or become zero or one by rounding errors). This is referred to as the empirical positivity assumption which is closely related to the concept of sparsity studied in Chapter 10 of Van der Laan and Rose [8]. The violation of the empirical positivity assumption can cause the inflation of the bias and variance of inverse probability weighting (IPW)-type and AIPW-type estimators.
The inverse probability weighting method dates at least back to Horvitz and Thompson [13] in the literature of sampling with unequal selection probabilities in sub-populations. IPWtype and matching methods have been extensively studied Lunceford and Davidian [7], Rubin [14], Rosenbaum and Rubin [15,16], Busso et al. [17]. IPW is proven to be a consistent estimator of ATE if the propensity scores (that are the conditional probability of treatment assignments) are estimated by a consistent parameter or non-parametric model. The other set of ATE estimators include those involving the modeling of the outcome and inserting the predictions directly into the ATE estimator (Section 2). They are referred to as single robust (SR) estimators as they provide √ n−consistent estimators for ATE if the outcome model is √ n−consistent. In this sense, IPW is also single robust as it is consistent if the treatment (or the propensity score) model is √ n−consistent. The focus of this work is to study the augmented IPW-type methods as they involve modeling both treatment and outcome and can be √ n−consistent estimators of ATE if either of the models is consistent.
We propose and study a simple potential remedy to the empirical positivity violation issue by studying the normalization of the AIPW estimator (similar to the normalization of IPW [7]), here referred to as nAIPW. In fact, both AIPW and nAIPW can be viewed as a more general estimator which is derived via the efficient influence function of ATE [18,19].
A general framework of estimators that includes nAIPW as a special case was proposed by [20]. In their work, the authors did not consider machine learning algorithms for the first-step estimation, but rather assumed parametric statistical models estimated by likelihood-based approaches. They focused on how to consistently estimate ATE within different sub-populations imposed by the covariates. There is a lack of numerical experimentation on these estimators especially when IVs and strong confounders exist in the set of candidate covariates.
To the best of our knowledge, the performance of nAIPW has not been previously studied in the machine learning context, with the assumption that strong confounders and IVs exist in the data. We will prove that this estimator has the doubly robust [6] and the rate doubly robust [19] property, and illustrate that it is robust against extreme propensity score values. Further, nAIPW (similar to AIPW), has the orthogonality property [9] which means that it is robust against small variations in the predictions of the outcome and treatment assignment predictions. One theoretical difference is that AIPW is the most efficient estimator among all the double robust estimators of ATE given both treatment and outcome models are correctly specified [21]. In practice, however, often there is no a priori knowledge about the true outcome and propensity score relationships with the input covariates and thus this feature of AIPW is probably of less practical use.
We argue that for causal parameter estimation, dNN with no regularization may lead to high variance for the causal estimator used in the second step. We compare AIPW and nAIPW through a simulation study where we allow for moderate to strong confounding and instrumental variable effects, that is, we allow for possible violation of the empirical positivity assumption. Further, a comparison between AIPW and nAIPW is made on the Canadian Community Health Survey (CCHS) dataset where the intervention/treatment is the food security vs. food insecurity and the outcome is individuals' body mass index (BMI).
Our contributions include presenting the proof for the orthogonality, doubly robust and rate doubly robust property of nAIPW. Further, it is proven that, under certain assumptions, nAIPW is asymptotically normal and we provide its consistent variance estimator. We analyze the estimation of ATE in the presence of not only confounders, but also IVs, ypredictors and noise variables. We demonstrate that in the presence of strong confounders and IVs, if complex neural networks without L 1 regularizations are used in the step 1 estimation, both AIPW and nAIPW estimators and their asymptotic variances perform poorly, but, relatively speaking, nAIPW performs better. In this paper, the NNs are mostly used as means of estimating the outcome and treatment predictions.
Organization of the article is as follows. In Section 2 we will formally introduce the nAIPW estimator to the readers and state its double robustness property, and in Section 3 we present the first-step prediction model, double neural networks. In Sections 4 and 5 we will present the theoretical aspects of the paper, including the asymptotic normality, doubly robustness and rate doubly robustness orthogonality of the proposed estimator (nAIPW) and the asymptotic normality. We will present the simulation scenarios and results of comparing the nAIPW estimator with other conventional estimators in Section 6. We apply the estimators on a real dataset in Section 7. The article will be concluded with a short discussion on the findings in Section 8. The proofs are straightforward but long and thus are included in Appendix A.

Normalized Doubly Robust Estimator
Let with W being the adjusting factors. P is the true observed data distribution,P n is the distribution of O such that its marginal distribution with respect to W is its empirical distribution and the expectation of the conditional distribution Y|A = a, W, for a = 0, 1, can be estimated. We denote the prediction function of the observed outcome given explanatory variables in the treated group Q 1 := Q(1, W) = E[Y|A = 1, W], and that in the untreated group Q 0 := Q(0, W) = E[Y|A = 0, W], and the propensity score as g(W) = E[A|W]. Throughout, the expectations E are with respect to P. The symbolˆon the population-level quantities indicates the corresponding finite sample estimator, and P is replaced byP n .
Let the causal parameter of interest be the average treatment effect (ATE) where Y 1 and Y 0 are the potential outcomes of the treatment and controls [6]. For identifiablity of the parameter, the following assumptions must hold true. The first assumption is the conditional independence, or unconfoundedness stating that, given the confounders, the potential outcomes are independent of the treatment assignments (Y 0 , Y 1 ⊥ A|W). The second assumption is positivity which entails that the assignment of treatment groups is not deterministic (0 < Pr(A = 1|W) < 1). The third assumption is consistency which states that the observed outcomes equal their corresponding potential outcomes (Y A = y). There are other modeling assumptions made such as time order (i.e., the covariates W are measured before the treatment), IID subjects and a linear causal effect.
A list of first candidates to estimate ATE are naive ATEβ naiveATE = 1 The naive average treatment effect (naive ATE) is a biased (due to the selection bias) estimator of ATE [22] and is the poorest estimator among all the candidates. The single robust (SR) is not an orthogonal estimator [9] and if ML algorithms which do not belong to the Donsker class ( [23], Section 19.2) or have entropy that grows with the sample size are used, this estimator also becomes biased and is not asymptotically normal. The inverse probability weighting (IPW) [13] and its normalization versions adjust (or weight) the observations in the treatment and control groups. IPW and nIPW are also not orthogonal estimators and are similar to SR in this respect. In addition, bothβ SR andβ IPW (andβ nIPW ) are single robust, that is, they are consistent estimators of ATE if the models used are √ n-consistent [7]. IPW is an unbiased estimator of ATE if g is correctly specified, but nIPW is not unbiased, but is less sensitive to extreme predictions. The augmented inverse probability weighting (AIPW) estimator [21] is an improvement over SR, IPW and nIPW, which involves the predictions for both treatment (the propensity score), and the causal parameter can be expressed as: and the sample version estimator of (3) iŝ . Among all the doubly robust estimators of ATE, AIPW is the most efficient estimator if both of the propensity score or outcome models are correctly specified, but is not necessarily efficient under incorrect model specification. In fact, this nice feature of AIPW may be less relevant in real-life problems as we might not have a priori knowledge about the predictors of the propensity score and outcome and we cannot correctly model them. Further, in practice, perfect or near-perfect prediction of the treatment assignment can inflate the variance of the AIPW estimator [8]. As a remedy, similar to the normalization of the IPW estimator, we can define a normalized version of the AIPW estimator which is less sensitive to extreme values of the predicted propensity score, referred to as the normalized augmented inverse probability weighting (nAIPW) estimator: where w . Both AIPW and nAIPW estimators add adjustment factors to the SR estimator which involve both models of the treatment and the outcome.
Both AIPW and nAIPW are examples of a class of estimators wherê where we refer to this general class as the general doubly robust (GDR) estimator. Lettingĥ 1 =ĝ andĥ 0 = 1 −ĝ gives the AIPW estimators and lettingĥ 1 =ĝÊ Â g and h 0 = (1 −ĝ)Ê 1−A 1−ĝ gives the nAIPW estimator. The GDR estimator can also be written aŝ If h 1 and h 0 are chosen so that by the total law of expectationβ GDR is an unbiased estimator of β.

Outcome and Treatment Predictions
The causal estimation and inference when utilizing the AIPW and nAIPW is carried out in two steps. In step 1, the treatment and outcome are predicted by a statistical or machine learning (ML) algorithm, and in the second step the predictions are inserted into the estimator. The ML algorithms in step 1 can capture the linear and non-linear relationships between the confounders and the treatment and the outcome.
Neural networks (NNs) [2][3][4] are a class of non-linear and non-parametric complex algorithms that can be employed to model the relationship between any set of inputs and some outcome. There has been a tendency to use NNs as they have achieved great success in the most complex artificial intelligence (AI) tasks such as computer vision and natural language understanding [2].
Farrell et al. [1] used two independent NNs for modeling the propensity score model and the outcome with the rectified linear unit (RELU) activation function [2], here referred to as the double NN or dNN: where two separate neural nets model y and A (no parameter sharing). Farrell et al. [1] proved that dNN algorithms almost attain n 1 4 -rates. By employing the cross-fitting method and theory developed by Chernozhukov et al. [9], an orthogonal causal estimator is asymptotically normal, under some regularity and smoothing conditions, if the dNN is used in the first step (see Theorem 1 in [1]).
These results assume no regularizations imposed on the NNs' weights, and only the stochastic gradient descent (SGD) is used. Farrell et al. claim that the fact that SGD controls the complexity of the NN algorithm to some extent [2,10] is sufficient for the first step. Our initial simulations, however, contradict this claim and we hypothesize that for causal parameter estimation, a dNN with no regularization leads to high variance for the causal estimator used in the second step. Our initial experiments indicate that L 2 regularization and dropout do not perform well in terms of the mean square error (MSE) of AIPW. The loss functions we use contain L 1 regularization (in addition to SGD during the optimization): where C L 1 , C L 1 are hyperparameters that can be set before training or be determined by cross-validation, that can cause the training to pay more attention to one part of the output layer. The dNN can have an arbitrary number of hidden layers, or the width of the network (HL) can be another hyperparameter. For a three-layer network, HL = [l 1 , l 2 , ..., l h ], where l j is the number neurons in layer j, j = 1, 2, ..., h. P y , P A are the connection parameters in the non-linear part of the networks, with Ωs being shared for the two outcome and propensity models. Note that the gradient descent-type optimizations in the deep learning platforms (such as pytorch in our case) do not cause the NN parameters to shrink to zero.

GDR Estimator Properties
In this section we will see that nAIPW (5) is doubly robust, that is, if either of the outcome or propensity score models are √ n-consistent, nAIPW will be consistent. Further, nAIPW is orthogonal [9] and is asymptotically linear under certain assumptions and we calculate its asymptotic variance.

Consistency and Asymptotic Distribution of nAIPW
In causal inference, estimating the causal parameter and drawing inference on the parameter are two major tasks. Employing a machine learning algorithm to estimate Q and g in (5) is a means to estimate and draw inference on the causal parameter; the ultimate goal is the relationship between the treatment and the outcome. This allows people to use blackbox ML models with no explanation how these models have learned from the explanatory features. The question is if the consistency and asymptotic normality of the second step causal estimator are preserved if complex ML algorithms are utilized twice for the treatment and outcome models, each with a convergence rate smaller than √ n, and entropy that grows with n. Chernozhukov et al. [24] provide numerical experiments illustrating that some estimators are not consistent or asymptotically normal if complex ML models are used that do not belong to the Donsker class and have entropy that grows with n. They further provide a solution by introducing "orthogonal" estimators that, under some regulatory conditions and cross-fitting, are asymptotically normal even if complex ML models can be used as long as their rates of convergence are as small as n 1 4 . The next two subsections provide an overview of the general theory and prove that nAIPW is asymptotically normal.

The Efficient Influence Function
Hahn [18] derives the efficient influence function (EIF) of β = β 1 − β 0 as To study the asymptotic behaviour of nAIPW, we write the scaled difference (12) where the first term is a normal distribution by the central limit theorem, and the third and fourth terms are controlled if the class of functions are Donsker and standard smoothing conditions are satisfied ( [9,23], Theorem 19.26). If the nuisance parameters are not Donsker, data splitting and cross-fitting guarantees plus the regulatory conditions are needed to control these two terms [1,9]. It is unclear, however, how the second term behaves, i.e., (13) whereβ = β(P n ), as it contains data-adaptive nuisance parameter estimations. There are different tricks to get rid of this term. One method is the one-step method in which we move this term to the left to create a new estimator which is exactly the same as the AIPW estimator with known propensity scores: Another trick is to let this term vanish which results in estimating equations whose solution is exactly the same as the one-step estimator. The targetted learning strategy is to manipulate the data generating process which results in a different estimator [8,19] (which we do not study here).
The requirement in the above estimator is that the propensity score is known, which is unrealistic. In reality, this quantity should be estimated using the data. However, replacing g with a data-adaptive estimator changes the remainder term in (12) that needs certain assumptions to achieve asymptotic properties such as consistency. We replace g and 1 − g in (14) byĥ 1 andĥ 0 , respectively, which provides a more general view of the above one-step estimator.

Doubly Robustness and Rate Doubly Robustness Properties of GDR
One of the appealing properties of AIPW is its doubly robust property which partially relaxes the restrictions of IPW and SR which require the consistency of the treatment and outcome models, respectively. This property is helpful when the first-step algorithms are √ n-consistent. The following theorem states that the nAIPW estimator (5) actually possesses the doubly robustness property.
The proof is left to the appendix. Theorem 1 is useful when we a priori knowledge about the propensity scores (such as in the experimental studies) or we estimate the propensity scores with √ n-rate converging algorithms. In practice, however, the correct specification is infeasible in the observational data, but √ n-rate algorithms such as parametric models, generalized additive models (GAMs) or the models that assume sparsity might be used [25].
This is restrictive and these model assumptions might not hold in practice which is why non-parametric ML algorithms such as NNs are used. As mentioned before, the NN we utilize here does not offer a √ n-consistent prediction model in the first step of the estimation [1]. This reduces the usefulness of the double robustness property of the GDR estimator when using complex ML algorithms. A more useful property when using complex ML algorithms is the rate double robustness (RDR) property [26]. RDR does not require either of the prediction models to be √ n-consistent; it suffices that they are consistent at any rate but together become √ n-consistent; that is, if the propensity score and outcome model are consistent at n r A and n r Y , respectively (r Y , r A < 0), we must have r A + r Y = 1 2 . To see that the DR has this property (as does DR [25]), note that the remainder (12) can be written as which, by the Hölder inequality, is upper bounded: Making the standard assumptions that that is, the GDR has the rate double robustness property. The assumptions in (17) are less restrictive than needing at least one of the prediction models to be √ n-consistent for the double robust property [19,25]. This means that the outcome and propensity score models can be at least as fast as o(n − 1 4 ) (which is an attainable generalization bound for many complex machine learning algorithms [9]), and the GDR estimator is still consistent. Farrell et al. [1] proves that two neural networks without regularization (except the one imposed by the stochastic gradient descent optimization) satisfy such bounds and can provide a convenient first-step prediction algorithm (when they utilize the AIPW estimator and the cross-fitting strategy proposed by Chernozhukov et al. [9]).
In order for a special case of GDR estimator to outperform the AIPW estimator, we must have , in addition to conditions in (17). Note that these two conditions are satisfied for nAIPW; replacing h 1 and h 0 withĝÊ Â g and (1 −ĝ)Ê 1−A 1−ĝ can help stabilize the bias and variance magnitude and help shrink the remainder (15) to zero. The scenario analysis performed in Section 4.4 provides an insight about the reduction in the sensitivity to the violation of the empirical positivity assumption.

Robustness of nAIPW against Extreme Propensity Scores
There are two scenarios in which the empirical positivity is violated, where the probability of receiving the treatment for the people who are treated is 1, that is, A k = 1 and P(A k = 1|W) = 1 (or vice versa for the untreated group A k = 0 and P(A k = 0|W) = 0), and where there are a handful of treated subjects whose probability of receiving the treatment is 0, that is, A k = 1 and P(A k = 1|W) = 0 (and vice versa for the untreated group, that is, A k = 0 and P(A k = 0|W) = 1). Although the identifiability assumptions guarantee that such scenarios do not occur, in practice, extremely small or large probabilities similar to the second scenario above, that is, where there exists a treated individual who has a near-zero probability of receiving the treatment, can impact the performance of the estimators that involve propensity score weighting. For example, replacing h 1 withĝ and h 0 with 1 −ĝ in practice can increase both the bias and variance of AIPW [8]. This can be seen by viewing the bias and variance of these weighting terms. As noted before, the AIPW and nAIPW add adjustments to the single robust estimator EQ 1 − Q 0 . The adjustments involve weightings A g or A gE A g to the residuals of Y and Q k , k = 0, 1. Under a correct specification of the propensity score g, these weights have the same expectations. The difference is in their variances: under the correct specification of the propensity score g. By letting g tend to zero in violation of the empirical positivity assumption, it can be seen that the nAIPW is less volatile than the AIPW estimator. That is, the weights in AIPW might have a larger variance than those in nAIPW.

Scenario Analysis
A scenario analysis is performed to see how nAIPW stabilizes the estimator: Assume that the empirical positivity is violated, that is, there is at least an observation k where A k = 1 whereĝ k is extremely close to zero, such asĝ k = 10 −s for s 0. AIPW will blow up in this case: where I a = {j : A j = a}, I a −k = {j : A j = a}, and subscripts a = 1 and a = 0 refer to the estimators of the first and the second components in ATE (1). However, nAIPW is robust against this empirical positivity violation: and Thus The factor 10 s in (20) can blow up the AIPW if 10 s n (and the outcome estimation is not close enough to the observer outcome), but this factor does not appear in the numerator of the nAIPW estimator. For such large factors, (23) can be simplified to Thus, the extreme probability does not make β 1,nAIPW blow up, but the adjustment to the β 1,SR that accounts for confounding effects. The second factor β 0,nAIPW is not impacted in this scenario. Considering a scenario that there is another treated individual with extremely small probability, such asĝ l = 10 −t , such that, without loss of generality, t > s 0, we will have: Depending on the values s and t, one of the first two terms in (25) might vanish, but the estimator does not blow up. There is at most only a handful of treated individuals with extremely small probabilities and, based on the above observation, the nAIPW estimator does not blow up. That said, nAIPW might not sufficiently correct the β SR for the confounding effects, although confounders have been taken into account in the calculation of β SR to some extent. The same observation can be made in the asymptotic variance of these estimators. This shows how extremely small probabilities for treated individuals (or extremely large probabilities for untreated individuals) can result in a biased and unstable estimator, while neither of the bias or variance of nAIPW suffer as much. Although not performed, the same observation can be made for the untreated individuals with extremely large probabilities.
The above scenario analysis indicates the bias and variance of nAIPW might go up in cases of the violation of empirical positivity, but it still is less biased and more stable than AIPW. The remainder term (15) is also more likely to be o(n − 1 2 ) in nAIPW versus AIPW as it contains k's where A k = 1, g k E n A k g k ≥ g k .

Asymptotic Sampling Distribution of nAIPW
Replacing g in the denominator of the von Mises expansion (12) with the normalizing terms is enough to achieve the asymptotic distribution of the nAIPW and its asymptotic standard error. However, we can see that nAIPW is also the solution to (extended) estimating equations. The solution to the estimating equations is important as van der Vaart (Chapters 19 and 25) proves that under certain regulatory conditions, if the prediction models belong to the Donsker class, the solutions to Z-estimators are consistent and asymptotically normal ( [23], Theorem 19.26). Thus, nAIPW that is the solution to a Z-estimator (also referred to an M-estimator) will inherit the consistency and asymptotic normality, assuming certain regulatory conditions and that the first-step prediction models belong to the Donsker class: The Donsker class assumption prevents too complex algorithms in the first step, algorithms such as tree-based models, NNs, cross-hybrid algorithms or their aggregations [19,27]. The Donsker class assumption can be relaxed if sample splitting (or cross-fitting) is utilized and the target parameter is orthogonal [9]. In the next section we see that nAIPW is orthogonal and, thus, theoretically, we can relax the Donsker class assumption under certain smoothing regulatory conditions. Before seeing the orthogonality property of nAIPW, let us review the smoothing regularity conditions necessary for asymptotic normality. Let β be the causal parameter, η ∈ T be the infinite dimensional nuisance parameters where T is a convex set with a norm. Additionally, let the score function φ : O × B × T → R be a measurable function, O be the measurable space of all random variables O with probability distribution P ∈ P n and B be an open subset of R containing the true causal parameter. Let the sample O = (O 1 , O 2 , ..., O n ) be observed and the set of probability measures P n expand with sample size n. In addition, let β ∈ B be the solution to the estimating equation Eφ(O, β, η) = 0. The assumptions that guarantee that the second-step orthogonal estimator β is asymptotically normal are [9]: (1) β does not fall on the boundary of B; (2) the map (β, η) → E P φ O, β, η is twice Gateauax differentiable (this holds by the positivity assumption). β is identifiable; (3) E P φ O, β, η is smooth enough; (4)η ∈ T with high probability and η ∈ T .η converges to η 0 at least as fast as n − 1 4 (similar but slightly stronger than first two assumptions in (17)); (5) score function(s) φ(., β, η) has finite second moment for all β ∈ B and all nuisance parameters η ∈ T ; (6) the score function(s) φ(., β, η) is measurable; (7) the number of folds increases by sample size.

Orthogonality and the Regulatory Conditions
The orthogonality condition [9] is a property related to the estimating equations We refer to an estimator drawn from the estimating Equation (27) as an orthogonal estimator. Let η ∈ T, where T is a convex set with a norm. Additionally, let the score functions φ : O × B × T → R be a measurable function, O is measurable space of all random variables O with probability distribution P ∈ P n and B is an open subset of R containing the true causal parameter. Let the sample O = (O 1 , O 2 , ..., O n ) be observed and the set of probability measures P n can expand with sample size n. The score function φ follows the Neyman orthogonality condition with respect to T ⊆ T, if the Gateauax derivative operator exists for all ∈ [0, 1): Chernozhukov et al. [24] presents a few examples of orthogonal estimating equations including the AIPW estimator (4). Utilizing cross-fitting, under standard regulatory conditions, the asymptotic normality of estimators with orthogonal estimating equations is guaranteed even if the nuisance parameters are estimated by ML algorithms not belonging to the Donsker class and without finite entropy conditions [24]. The regulatory conditions to be satisfied are (1) β does not fall on the boundary of B; (2) the map (β, η) → E P φ O, β, η is twice Gateauax differentiable. β is identifiable; (3) E P φ O, β, η is smooth enough; (4)η ∈ T with high probability and η ∈ T .η converges to η 0 at least as fast as n − 1 4 ; (5) score function(s) φ(., β, η) has finite second moment for all β ∈ B and all nuisance parameters η ∈ T ; (6) the score function(s) φ(., β, η) is measurable; (7) the number of folds increases by sample size.
By replacing λ and γ in the first line of (26) with their solutions in the second and third equations: Implementing the orthogonality condition (28), it can be verified that nAIPW (5) is also an example of an orthogonal estimator. To see this, we apply the definition of orthogonality [9]: where Q k = Q k + (1 − )Q k , k = 0, 1, and g = g + (1 − )g, and for some functions a and b. The last equality is because EA( under correct specification of the propensity score g.
Thus, nAIPW is orthogonal, and by utilizing cross-fitting for the estimation, nAIPW is consistent and asymptotically normal, under certain regulatory conditions.

Asymptotic Variance of nAIPW
To evaluate the asymptotic variance of nAIPW, we employ the M-estimation theory [23,28]. For causal inference for M-estimators, the bootstrap for the estimation of causal estimator variance is not generally valid even if the nuisance parameter estimators are √ n-convergent.
However, sub-sampling m out of n observations [29] can be shown to be universally valid, provided m → ∞ and m n → 0. In practice, however, we can face computational issues since nuisance parameters must be separately estimated (possibly with ML models) for each subsample/bootstrap sample.
The variance estimator of AIPW (4) is [7] The theorem below states that the variance estimator of AIPW (31) can intuitively extend to calculate the variance estimator of nAIPW (5) by moving the denominator n 2 to the square term in the summation and replacing it withĝÊ Â g or (1 −ĝ)Ê 1−A 1−ĝ in the terms containing g and 1 − g in the denominator, respectively. Theorem 2. The asymptotic variance of the nAIPW (5) iŝ The proof utilizing the estimating equation technique is straightforward and is left to Appendix A. The same result can be seen when deriving the estimator in the one-step method (see (12) and (14)). Since nAIPW is orthogonal,σ 2 nAIPW is consistent by applying the theories of [1,9], if the assumptions are met, cross-fitting is used, and the step 1 ML algorithms have the required convergence rates.
The above theorem states that the variance estimator of AIPW (31) can intuitively extend to calculate the variance estimator of nAIPW (5) by moving the denominator n 2 to the square term in the summation and replacing it withĝÊ Â g or (1 −ĝ)Ê 1−A 1−ĝ in the terms containing g and 1 − g in the denominator, respectively. This is intuitive because, by the law of total probability, E the first two terms is n.

Monte Carlo Experiments
A Monte Carlo simulation study (with 100 iterations) was performed to compare AIPW and nAIPW estimators, where the dNN is used for the first-step prediction. There are a total of 2 case scenarios according to the size of the data. We fixed the sample sizes to be n = 750 and n = 7500, with the number of covariates being p = 32 and p = 300, respectively. The predictors include four types of covariates: The confounders, X c , instrumental variables, X iv , the outcome predictors, X y , and the noise or irrelevant covariates, X irr . Their sizes for the scenarios are #X c = #X iv = #X y = #X irr = 8, 75 and they are independent from each other and drawn from the multivariate normal (MVN) distribution as X ∼ N (0, Σ), with Σ kj = ρ j−k and ρ = 0.5. The models to generate the treatment assignment and outcome were specified as and β = 1. The functions f a , g a , f y , g y select 20% of the columns and apply interactions and non-linear functions listed below (35). The strength of the instrumental variable and confounding effects were chosen as γ c , γ c , γ y ∼ Uni f (r 1 , r 2 ) where (r 1 = r 2 = 0.25), and γ iv ∼ Uni f (r 3 , r 4 ) where (r 3 = r 4 = 0.25).
The non-linearities are randomly selected from among the following functions: The networks' activation function is rectified linear unit (ReLU), with 3 hidden layers as large as the input size (p), with L 1 regularization and batch size equal to 3 * p and 200 epochs. The adaptive moment estimation (Adam) optimizer [30] with learning rate 0.01 and momentum 0.95 was used to estimate the network's parameters, including the causal parameter (ATE).

Simulation Results
The oracle estimations are plotted in all the graphs to compare the real-life situations with the truth. In almost all the scenarios we cannot obtain perfect causal effect estimation and inference. Figure 1 shows the distribution of AIPW and nAIPW for different hyperparameter settings of NNs. The nAIPW estimator outperforms AIPW in almost all the scenarios. As the AIPW gives huge values in some simulation iterations, the log of the estimation is taken in Figure 1.
where β = 1, withβ j 's are the AIPW or nAIPW estimations in the j th simulation 431 Figure 1. The distribution of log of the estimated AIPW and nAIPW in the 100 simulated iterations. The performance of nAIPW is clearly superior to the performance of AIPW as it is less dispersed and is more stable in terms of different hyperparameter settings. p is either 32 or 300 for the small or large datasets and q ≈ p We also compare the estimators in different scenarios with bias, variance and their tradeoff measures: where β = 1, withβ j s being the AIPW or nAIPW estimations in the jth simulation round, µ = 1 m ∑ m j=1β j and m = 100 being the number of simulation rounds andσ being the square root of (31) or (32). Figure 2 demonstrates the bias, MC standard deviation (MC std) and the root mean square error (RMSE) of AIPW and nAIPW estimators for the scenarios where n = 750 and n = 7500, and for four hyperparameter sets (L 1 regularization and width of the dNN). In general, in each figure of the panel, the hyperparameter scenarios in the left imply a more complex model (with less regularization or a narrower network). In these graphs, the lower the values, the better the estimator. For the smaller data size n = 750 in the left three panels, the worst results are attributed to AIPW when there is the least regularization and the hidden layers are as wide as the number of inputs. To have more clear plots for comparison, we skipped plotting the upper bounds as they were large numbers; the lower bounds are enough to show the significance of the results. In the scenarios where there are smaller numbers of hidden neurons with 0.01 L 1 regularization, the bias, variance and their tradeoff (here measured by RMSE) are more stable. By increasing the L 1 regularization, these measures go down which indicates the usefulness of regularization and AIPW normalization for causal estimation and inference. Almost the same pattern is seen for the larger size (n = 7500) scenario, except for the bump in all the three measures in the hyperparameter scenario where regularization remains the same (L 1 = 0.01) and the numbers of neurons in the first and last hidden layers are small too. In all three measures of bias, standard deviation and RMSE, nAIPW is superior to AIPW, or at least there is no statistically significant difference between AIPW and nAIPW.
We have noted that the results of the step 1 NN architecture without L 1 regularization are too unstable and cannot be visually presented in the graphs. To avoid that, we have allowed a span of values for the L 1 regularization strengths: L 1 = 0.01 and L 1 = 0.1. The former case is close to no regularization. So, if the results of the latter are better than the former's, this is evidence that enough L 1 must be imposed. Figure 3 illustrates how the theoretical standard error formulas perform in MC experiments, and how accurately they estimate the MC standard deviations. In these two graphs, smaller does not necessarily imply superiority. In fact, the best results will be achieved as long as the confidence intervals of asymptotic SEs and MC SDs intersect. In the left two scenarios where the NN's complexity is high, the MC std and SE are far from each other. Additionally, in the hyperparameter scenarios where both the width of the NNs is small and regularization is higher, the MC std and SE are well separated. The scenario with largest regularization and wide NN architecture seems to the best scenario. That said, none of the scenarios confirm the consistency of SEs, which would likely also result in low coverage probability of the resulting confidence intervals.

Application: Food Insecurity and BMI
The Canadian Community Health Survey (CCHS) is a cross-sectional survey that collects data related to health status, health care utilization and health determinants for the Canadian population in multiple cycles. The 2021 CCHS covers the population 12 years of age and over living in the ten provinces and the three territorial capitals. Excluded from the survey's coverage are: Persons living on reserves and other Aboriginal settlements in the provinces and some other sub-populations that altogether represent less than 3% of the Canadian population aged 12 and over. Examples of modules asked in most cycles are: General health, chronic conditions, smoking and alcohol use. For the 2021 cycle, thematic content on food security, home care, sedentary behavior and depression, among many others, was included. In addition to the health component of the survey are questions about respondent characteristics such as labor market activities, income and socio-demographics.
In this article, we use the CCHS dataset to investigate the causal relationship of food insecurity and body mass index (BMI). Other gathered information in the CCHS is used which might contain potential confounders, y-predictors and instrumental variables. The data are from a survey and need special methods such as the resampling or bootstrap methods to estimate the standard errors. However, here, we use the data to illustrate the utilization of a dNN on the causal parameters in the case of empirical positivity violation. In order to reduce the amount of variability in the data, we have focused on the sub-population 18-65 years of age. Figure 4 shows the ATE estimates and their 95% asymptotic confidence intervals with nIPW, DR and nDR methods, with four different neural networks which vary in terms of width and strength of L 1 regularization. The scenario that results in the largest R 2 (as a measure of outcome prediction performance) outperforms the other scenarios. The scenario that results in the largest AUC (as a measure of treatment model performance) results in the largest confidence intervals. This is because of more extreme propensity scores in this scenario. It is worth noting that the normalized IPW has smaller confidence intervals as compared to AIPW. However, as we do not know the truth about the ATE in this dataset, we can never know which estimator outperforms the other. To gain insight about this using the input matrix of this data, we simulated multiple treatments and outcomes with small to strong confounders and IVs and compared AIPW and nAIPW. In virtually all of them, the nAIPW is the best one. We do not present these results in this paper, but they can be provided to readers upon request.

Discussion
Utilizing machine learning algorithms such as NNs in the first-step estimation process is comforting as the concerns with regard to the non-linear relationships between the confounders and the treatment and outcome are addressed. However, there is no free lunch, and using NNs has its own caveats including theoretical as well as numerical challenges. Farrell et al. [1] addressed the theoretical concerns where they calculated the generalization bounds when two separate NNs are used to model the treatment and the outcome. However, they did not use or take into account regularization techniques such as L 1 or L 2 regularization. As NNs are complex algorithms, they provide perfect prediction for the treatment when the predictors are strong enough (or might overfit). Through Monte Carlo (MC) simulations, we illustrated that causal estimation and inference with double NNs can fail without the usage of regularization techniques such as L 1 and/or extreme propensity scores are not taken care of. If L 1 regularization is not used, the normalization of the AIPW estimator (i.e., nAIPW) is advised to be employed as it dilutes the extreme predictions of the propensity score model and provides better bias, variance and RMSE. Our scenario analysis also showed that in the case of violation of the empirical positivity assumption in AIPW, normalization helps avoid blowing up the estimator (and standard error), but might be ineffective in taking into account confounding effects for some observations.
We note that the nAIPW estimator cannot perform better when the empirical positivity is violated as compared to when it is not. However, when the empirical positivity is violated, nAIPW can perform better than AIPW. If the empirical positivity is not violated, our results indicated that AIPW outperforms nAIPW.
An alternative estimator might be trimming the propensity scores to avoid extreme values. However, the causal effect estimator will no longer be consistent and there is no determined method for where to trim. We hypothesize thatĥ 1 n will result in a consistent estimator, making the right assumptions, and will outperform both AIPW and nAIPW in the case of the empirical positivity violation. We will study this hypothesis in a future article.
Another reason why NNs without regularization fail in the causal estimation and inference is that the networks are not targeted, and are not directly designed for these tasks. NNs are complex algorithms with strong predictive powers. This does not accurately serve the purpose of causal parameter estimation, where the empirical positivity assumption can be violated if strong confounders and/or instrumental variables [22] exist in the data. Ideally, the network should target the confounders and should be able to automatically limit the strength of predictors so that the propensity scores are not extremely close to 1 or 0. This was not investigated in this article and a solution to this problem is postponed to another study.
In Section 7, we applied the asymptotic standard errors of both AIPW and nAIPW, where the latter achieves smaller standard errors. That said, we acknowledge the fact that the asymptotic standard errors when using complex ML are not reliable and, in fact, they underestimate the calculated MC standard deviations, as illustrated in the simulations Section 6. This is partly because of the usage of complex algorithms such as NNs for estimation of the nuisance parameters in the first step. Further, the asymptotic distributions of the estimators are not symmetric (and thus are not normal). However, nAIPW is more symmetric than AIPW, according to the simulations, while both estimators suffer from outliers. We will investigate the reasons and possible remedies for both the asymptotic distribution and standard errors of the estimators in a future paper. The consistency of the variance of nAIPW (and AIPW) relies on meeting the assumptions. More investigations are needed on how to achieve consistent and asymptotically normal estimators for ATE with a consistent variance estimator. Potential avenues can include proposing alternative estimators or improving the step 1 ML algorithms. Data Availability Statement: The simulated data can be regenerated using the codes, which can be provided to the interested user via an email request to the correspondence author. The CCHS data is not publicly available and only the authorized people can access and perform analyses on it.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
First, let us review the proof sketch of the AIPW double robustness: (3) can be consistently estimated bŷ The second formula guarantees the consistency of AIPW ifĝ is consistent, and the third expression shows that the consistency ofQ 0 i andQ 1 i is consistent.
Proof. From (A2),β nAIPW is a consistent estimator of β ifQ 0 i andQ 1 i are consistent. This is because the first termÊ(Q 1 −Q 0 ) converges to β, while the second term tends to zero.
By re-expressing (A2), the other argument is clear. Lettingŵ 1 =Ê[ Â g ] andŵ 0 =Ê[ 1−A 1−ĝ ], we have: The first expression in (A3) is the same as the nIPW estimator which is a consistent estimator of β [7]. Now, under the consistency ofĝ, the second term tends to zero, aŝ In the theorem below, it is shown that there is an M-estimation equivalent to β nAIPW and w 1 and w 0 . This, plus the continuous mapping theorem, proves that ∑ n i=1 A î g i converges in probability to n ifĝ p − → g.
Theorem A2. The asymptotic variance of the nAIPW (5) iŝ Proof. Let us define a few notations first: With this set of notations, the nAIPW estimator (5) can be written aŝ Following the methods in [28], to find an estimating equation whose solution iŝ β nAIPW , we introduce two more estimating equations. Employing the M-estimation theory, we will prove that nAIPW is asymptotically normal, and we will calculate its standard error.
It can be seen that (A6) is not a solution to an M-estimator directly. However, by defining two more parameters and concatenating their estimating equations, we obtain 3-dim multivariate estimating equations: To ease the calculations, we modify the first estimating equation with an equivalent one, but the results will not differ: By defining the following notations, The M-estimation theory implies that under regulatory conditions, the solutions to these estimating equations converge in distribution to a multivariate normal distribution: whose inverse is In order to estimate the variance ofβ nAIPW , we do not need to calculate all entries of the variance-covariance matrix, only the first entry: The entries are irrelevant to the calculation of variance of nAIPW and the term is a very long expression which involves terms converging to zero faster than the actual estimating Equation (A9) [19] (also verified by simulations): = −Eφη(nEuh + λ(β − q)) + EφΩ(nEv f − γ(β − q))− (nEuh + λ(β − q))(−Eη 2 (nEuh + λ(β − q)) + EηΩ(nEv f − γ(β − q))+ Eφη) + (nEv f − γ(β − q))(−EηΩ(nEuh + λ(β − q))+ EΩ 2 (nEv f − γ(β − q)) + EφΩ). where we replace E with sample averages in Expressions (A10)-(A12) and θ with their corresponding solutions to Equation (A8). Following this recipe, we obtain which is the same as (A4).