Direct and Indirect Effects under Sample Selection and Outcome Attrition

This paper extends the evaluation of direct and indirect treatment effects, i.e., mediation analysis, to the case that outcomes are only partially observed due to sample selection or outcome attrition. We assume sequential conditional independence of the treatment and the mediator, i.e., the variable through which the indirect effect operates. We also impose missing at random or instrumental variable assumptions on the outcome attrition process. Under these conditions, we derive identification results for the effects of interest that are based on inverse probability weighting by specific treatment, mediator, and/or selection propensity scores. We also provide a simulation study and an empirical application to the U.S. Project STAR data in which we assess the direct impact and indirect effect (via absenteeism) of smaller kindergarten classes on math test scores. The estimators considered are available in the ‘causalweight’ package for the statistical software ‘R’.


Introduction
Mediation analysis, i.e., the evaluation of direct and indirect causal effects, is widespread in the social sciences, following the seminal papers (Baron and Kenny 1986; Judd and Kenny 1981; Robins and Greenland 1992). The aim is to disentangle the total causal effect of a treatment on an outcome of interest into an indirect component operating through one or several intermediate variables, i.e., mediators, as well as a direct component. As an example, consider the effect of educational interventions on health, where part of the effect might be mediated by health behaviors, see (Brunello et al. 2016), or personality traits, see (Conti et al. 2016). While earlier studies on mediation typically rely on tight linear models, the more recent literature considers more flexible and possibly nonlinear specifications. A large number of contributions assume sequential conditional independence, implying that the assignment of the treatment is conditionally exogenous given observed covariates and that the assignment of the mediator is conditionally exogenous given the treatment and the covariates. For examples, see (Pearl 2001; Petersen et al. 2006; Robins 2003; Albert and Nelson 2011; Flores and Flores-Lagunes 2009; Hong 2010; Huber 2014a; Imai et al. 2010; Tchetgen and Shpitser 2012; VanderWeele 2009; Vansteelandt et al. 2012; Zheng and van der Laan 2012), among many others.
In this paper, we extend mediation analysis to account for the complication of outcome nonresponse and sample selection, implying that outcomes are only observed for a subset of the initial population of interest. Such problems frequently occur in empirical applications like wage gap decompositions, where wages are only observed for those who work. In a range of studies evaluating total (rather than direct and indirect) effects, sample selection is assumed to be missing at random (MAR), i.e., conditionally exogenous given observed variables, see for instance (Abowd et al. 2001; Little and Rubin 1987; Rubin 1976; Robins et al. 1994, 1995; Carroll et al. 1995; Fitzgerald et al. 1998; Shah et al. 1997; Wooldridge 2002, 2007). In contrast, nonignorable nonresponse models permit sample selection to be related to unobservables. Unless strong parametric assumptions are imposed, see for instance (Heckman 1976, 1979; Hausman and Wise 1979; Little 1995), identification requires an instrumental variable (IV) for sample selection (e.g., Das et al. 2003; Newey 2007; Huber 2012, 2014b). In this paper, we combine the evaluation of average natural direct and indirect effects based on sequential conditional independence with specific MAR or IV assumptions about sample selection. We identify the parameters of interest in the total as well as the selected population (whose outcomes are actually observed) by inverse probability weighting (IPW)¹ based on propensity scores for treatment and selection. Under MAR, effects in the total population are obtained through reweighting by the inverse of the selection propensity given observed characteristics. If selection is related to unobservables, we make use of a control function that can be regarded as a nonparametric version of the inverse Mills ratio in Heckman-type selection models. Under specific conditions, reweighting observations by the inverse of the selection propensity given observed characteristics and the control function identifies the effects in the selected or the total population. To convey the intuition of our identification results, we provide a brief simulation study in which the propensity scores are estimated by probit models.
As an empirical illustration, we evaluate the average natural direct and indirect effects of Project STAR, an educational experiment in Tennessee, which randomly assigned children to small classes in kindergarten and primary school. The positive impact of STAR classes on academic achievement has been demonstrated for example in (Krueger 1999), but less is known about the underlying causal mechanisms. We consider absenteeism in kindergarten as a potential mediator of the effect. The outcome of interest is the score in a standardized math test in the first grade of primary school, which is unobserved for a non-negligible share of children due to attrition. We apply one of our proposed IPW-based estimators to account for outcome attrition and compare the results to several alternative mediation estimators that make no corrections for sample selection. The results suggest that absenteeism is not an important driver of the total effect.²

The remainder of this paper is organized as follows. Section 2 discusses the parameters of interest, the assumptions, and the nonparametric identification results based on inverse probability weighting. Section 3 outlines estimation based on the sample analogs of the identification results. Section 4 presents a simulation study. Section 5 provides an application to Project STAR data. Section 6 concludes the paper.

Parameters of Interest
We would like to disentangle the average treatment effect (ATE) of a binary treatment variable D on an outcome variable Y into a direct effect and an indirect effect operating through the mediator M, which has bounded support and may be a scalar or a vector and discrete and/or continuous. To define the effects of interest, we use the potential outcome framework, see (Rubin 1974), which has been applied in the context of mediation analysis by (Rubin 2004; Ten Have et al. 2007; Albert 2008), among others. M(d) and Y(d, M(d′)) denote the potential mediator state as a function of the treatment and the potential outcome as a function of the treatment and the potential mediator, respectively, under treatments d, d′ ∈ {0, 1}. Only one potential outcome and mediator state, respectively, are observed for each unit, because the realized mediator and outcome values are

$$M = D \cdot M(1) + (1 - D) \cdot M(0), \qquad Y = D \cdot Y(1, M(1)) + (1 - D) \cdot Y(0, M(0)).$$

The ATE is given by

$$\Delta = E[Y(1, M(1)) - Y(0, M(0))].$$

To disentangle the latter, note that the (average) natural direct effect (using the denomination of (Pearl 2001))³ is identified by exogenously varying the treatment but keeping the mediator fixed at its potential value for D = d:

$$\theta(d) = E[Y(1, M(d)) - Y(0, M(d))], \quad d \in \{0, 1\}.$$

Equivalently, by exogenously shifting the mediator to its potential values under treatment and non-treatment but keeping the treatment fixed at D = d, the (average) natural indirect effect⁴ is obtained:

$$\delta(d) = E[Y(d, M(1)) - Y(d, M(0))], \quad d \in \{0, 1\}.$$

The ATE is the sum of the direct and indirect effects defined upon opposite treatment states:

$$\Delta = \theta(1) + \delta(0) = \theta(0) + \delta(1).$$

This follows from adding and subtracting E[Y(0, M(1))] or E[Y(1, M(0))], respectively. The notation θ(1), θ(0) and δ(1), δ(0) points to possible effect heterogeneity with respect to the potential treatment state, implying the presence of interaction effects between the treatment and the mediator. However, the effects cannot be identified without further assumptions, as either Y(1, M(1)) or Y(0, M(0)) is observed for any unit, whereas Y(1, M(0)) and Y(0, M(1)) are never observed.

1 The idea of using inverse probability weighting to control for selection problems goes back to (Horvitz and Thompson 1952).
2 The estimators considered in the simulation study and the empirical application are available in the 'causalweight' package by (Bodory and Huber 2018) for the statistical software 'R'.
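The decomposition of the ATE into direct and indirect effects can be checked numerically. The following is a minimal sketch under a purely hypothetical model: the coefficients, the interaction term, and the function names `potential_outcome` and `simulate_effects` are illustrative assumptions and are not taken from the paper. Because all potential outcomes are simulated, the never-observed quantities Y(1, M(0)) and Y(0, M(1)) are available here, which is not the case in real data.

```python
import random


def potential_outcome(d, m, u):
    """Hypothetical outcome model Y(d, m) with a treatment-mediator interaction."""
    return d + m + 0.5 * d * m + u


def simulate_effects(n=20000, seed=0):
    """Average unit-level effects over n simulated units."""
    rng = random.Random(seed)
    totals = {"ate": 0.0, "theta1": 0.0, "theta0": 0.0, "delta1": 0.0, "delta0": 0.0}
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)          # unobservable in the outcome
        w = rng.gauss(0.0, 1.0)          # unobservable in the mediator
        m0, m1 = w, 0.5 + w              # potential mediators M(0), M(1)
        y11 = potential_outcome(1, m1, u)
        y00 = potential_outcome(0, m0, u)
        y10 = potential_outcome(1, m0, u)    # Y(1, M(0)), never observed in practice
        y01 = potential_outcome(0, m1, u)    # Y(0, M(1)), never observed in practice
        totals["ate"] += y11 - y00
        totals["theta1"] += y11 - y01        # direct effect, mediator at M(1)
        totals["theta0"] += y10 - y00        # direct effect, mediator at M(0)
        totals["delta1"] += y11 - y10        # indirect effect, treatment fixed at 1
        totals["delta0"] += y01 - y00        # indirect effect, treatment fixed at 0
    return {key: value / n for key, value in totals.items()}


effects = simulate_effects()
print(effects)
```

In this design the interaction term makes θ(1) differ from θ(0) and δ(1) differ from δ(0), while both sums θ(1) + δ(0) and θ(0) + δ(1) reproduce the simulated ATE.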
In contrast to natural effects, which are functions of the potential mediators, the so-called controlled direct effect is obtained by setting the mediator to a predetermined value m, rather than M(d):

$$\gamma(m) = E[Y(1, m) - Y(0, m)], \quad m \text{ in the support of } M.$$

Whether θ(d) or γ(m) is of primary interest depends on the research question at hand. The controlled direct effect may provide policy guidance whenever mediators can be externally prescribed, as for instance in a sequence of active labor market programs assigned by a caseworker, where D and M denote assignment of the first and second program, respectively. This allows analyzing the direct effect of the first program under alternative combinations of program prescriptions. In contrast, the natural direct effect assesses the effectiveness of the first program given the status quo decision to participate in the second program in the light of participation or non-participation in the first program. We refer to (Pearl 2001) for further discussion of what he calls the descriptive and prescriptive natures of natural and controlled effects.
Our identification results will make use of a vector of observed covariates, denoted by X, that may confound the causal relations between D and M, D and Y, and M and Y. A further complication in our evaluation framework is that Y is assumed to be observed for a subpopulation only, i.e., conditional on S = 1, where S is a binary variable indicating whether Y is observed/selected, or not. We therefore also define the direct and indirect effects among the selected population:

$$\theta_{S=1}(d) = E[Y(1, M(d)) - Y(0, M(d)) \mid S = 1], \quad \delta_{S=1}(d) = E[Y(d, M(1)) - Y(d, M(0)) \mid S = 1], \quad \gamma_{S=1}(m) = E[Y(1, m) - Y(0, m) \mid S = 1].$$

Empirical examples with partially observed outcomes include wage regressions, with S being an employment indicator, see for instance (Gronau 1974), or the evaluation of the effects of policy interventions in education on test scores, with S being participation in the test, see (Angrist et al. 2006). Throughout our discussion, S is allowed to be a function of D, M, and X, i.e., S = S(D, M, X). However, S must neither be affected by nor affect Y.⁵ S is therefore not a mediator, as selection per se does not causally influence the outcome. An example for such a setup in terms of nonparametric structural models is given by

$$Y = \varphi(D, M, X, U), \qquad S = \psi(D, M, X, V),$$

where U, V are unobserved characteristics and φ, ψ are general functions.⁶

3 Robins (2003) and Robins and Greenland (1992) refer to this parameter as the total or pure direct effect and Flores and Flores-Lagunes (2009) as the net average treatment effect.
4 Robins (2003) and Robins and Greenland (1992) refer to this parameter as the total or pure indirect effect and Flores and Flores-Lagunes (2009) as the mechanism average treatment effect.

Assumptions and Identification Results under MAR
This section presents identifying assumptions that formalize the sequential conditional independence of D and M as imposed by (Imai et al. 2010) and many others as well as an MAR restriction on Y that implies that S is related to observables only.⁷

Assumption 1 (conditional independence of the treatment). (a) Y(d, m)⊥D|X = x, (b) M(d′)⊥D|X = x for all d, d′ ∈ {0, 1} and m, x in the support of M, X.

By Assumption 1, there are no unobservables jointly affecting the treatment, on the one hand, and the mediator and/or the outcome, on the other hand, conditional on X. In observational studies, the plausibility of this assumption crucially hinges on the richness of the data, while in experiments, it is satisfied if the treatment is randomized within strata defined by X or randomized independently of X.⁸

Assumption 2 (conditional independence of the mediator). Y(d, m)⊥M|D = d′, X = x for all d, d′ ∈ {0, 1} and m, x in the support of M, X.

By Assumption 2, there are no unobservables jointly affecting the mediator and the outcome conditional on D and X. Assumption 2 only appears realistic if detailed information on possible confounders of the mediator-outcome relation is available in the data (even in experiments with random treatment assignment) and if post-treatment confounders of M and Y can be plausibly ruled out when controlling for D and X.⁹

Assumption 3 (conditional independence of selection). Y⊥S|D = d, M = m, X = x for all d ∈ {0, 1} and m, x in the support of M, X.

By Assumption 3, there are no unobservables jointly affecting selection and the outcome conditional on D, M, X, such that outcomes are missing at random (MAR) in the denomination of (Rubin 1976). Put differently, selection is assumed to be selective with respect to observed characteristics only.

5 See for instance (Imai 2009) for an alternative set of restrictions, assuming that selection is related to the outcome but is independent of the treatment conditional on the outcome and other observable variables.
6 Note that Y(d, M(d′)) = φ(d, M(d′), X, U), which means that fixing the treatment and the potential mediator yields the potential outcome.
7 We implicitly also impose the Stable Unit Treatment Value Assumption (SUTVA, see (Rubin 1990)), stating that the potential mediators and outcomes for any individual are stable in the sense that their values do not depend on the treatment allocations in the rest of the population.
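Assumption 4 (common support). (a) Pr(D = d|M = m, X = x) > 0 and (b) Pr(S = 1|D = d, M = m, X = x) > 0 for all d ∈ {0, 1} and m, x in the support of M, X.

Assumption 4(a) is a common support restriction requiring that the conditional probability to receive a specific treatment given M, X, henceforth referred to as propensity score, is larger than zero in either treatment state. It follows that Pr(D = d|X = x) > 0 must hold, too. By Bayes' theorem, Assumption 4(a) implies that Pr(M = m|D = d, X = x) > 0, or in the case of M being continuous, that the conditional density of M given D, X is larger than zero. Conditional on X, M must not be deterministic in D, as otherwise identification fails due to the lack of comparable units in terms of the mediator across treatment states. Assumption 4(b) requires that for any combination of D, M, X, the probability to be observed is larger than zero. Otherwise, the outcome is not observed for some specific combinations of these variables, implying yet another common support issue.

Figure 1 illustrates the causal framework underlying our assumptions by means of a causal graph, see for instance (Pearl 1995), in which each arrow represents a potential causal effect. Further (unobserved) variables that only affect one of the variables explicitly displayed in the system are kept implicit. For instance, there may be unobservable variables U that affect the outcome, but do not influence D, M, or S; otherwise, there would be confounding. Under Assumptions 1 to 4, potential outcomes as well as direct and indirect effects in the total population are identified based on weighting by the inverse of the treatment and selection propensity scores.

Theorem 1.
(i) Under Assumptions 1 to 4, for d ∈ {0, 1},

$$E[Y(d, M(1-d))] = E\left[\frac{Y \cdot I\{D = d\} \cdot S}{\Pr(D = d|M, X) \cdot \Pr(S = 1|D, M, X)} \cdot \frac{\Pr(D = 1-d|M, X)}{\Pr(D = 1-d|X)}\right],$$

$$E[Y(d, M(d))] = E\left[\frac{Y \cdot I\{D = d\} \cdot S}{\Pr(D = d|X) \cdot \Pr(S = 1|D, M, X)}\right].$$

(ii) Under Assumptions 1(a), 2, 3, and 4, and M following a discrete distribution,

$$E[Y(d, m)] = E\left[\frac{Y \cdot I\{D = d\} \cdot I\{M = m\} \cdot S}{\Pr(D = d|X) \cdot \Pr(M = m|D, X) \cdot \Pr(S = 1|D, M, X)}\right].$$

Using the results of Theorem 1, it can be easily shown that the direct and indirect effects are identified by

$$\theta(d) = E\left[\left(\frac{Y \cdot D}{\Pr(D = 1|M, X)} - \frac{Y \cdot (1 - D)}{1 - \Pr(D = 1|M, X)}\right) \cdot \frac{\Pr(D = d|M, X)}{\Pr(D = d|X)} \cdot \frac{S}{\Pr(S = 1|D, M, X)}\right],$$

$$\delta(d) = E\left[\frac{Y \cdot I\{D = d\}}{\Pr(D = d|M, X)} \cdot \left(\frac{\Pr(D = 1|M, X)}{\Pr(D = 1|X)} - \frac{1 - \Pr(D = 1|M, X)}{1 - \Pr(D = 1|X)}\right) \cdot \frac{S}{\Pr(S = 1|D, M, X)}\right],$$

$$\gamma(m) = E\left[\left(\frac{Y \cdot D}{\Pr(D = 1|X)} - \frac{Y \cdot (1 - D)}{1 - \Pr(D = 1|X)}\right) \cdot \frac{I\{M = m\}}{\Pr(M = m|D, X)} \cdot \frac{S}{\Pr(S = 1|D, M, X)}\right].$$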
These expressions are related to the IPW-based identification in (Huber 2014a) for the case with no missing outcomes, with the difference that here, multiplication by S/Pr(S = 1|D, M, X) is included to account for sample selection. Furthermore, our results fit into the general framework of (Wooldridge 2002), who considers the IPW-based M-estimation of missing data models. Finally, for the identification of γ(m), Assumption 1 can be relaxed to Assumption 1(a) because (in contrast to θ(d), δ(d)) the distribution of the potential mediator M(d) need not be identified.
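To make the weighting logic concrete, the following is a minimal sketch of the sample analog of the expression for E[Y(d, M(1−d))] under MAR. It assumes a hypothetical design with binary X, D, and M, so that all propensity scores can be estimated by simple cell frequencies; the data generating process, its coefficients, and the function names `simulate` and `ipw_mar` are illustrative assumptions and do not reproduce the simulation design or the 'causalweight' implementation.

```python
import random
from collections import Counter


def outcome(d, m, x, u):
    """Hypothetical outcome model Y(d, m)."""
    return d + 2 * m + x + d * m + u


def simulate(n=50000, seed=0):
    """Return observed data (x, d, m, s, y) and the simulated mean of Y(1, M(0))."""
    rng = random.Random(seed)
    data, truth = [], 0.0
    for _ in range(n):
        x = int(rng.random() < 0.5)
        d = int(rng.random() < 0.3 + 0.4 * x)      # treatment depends on X only
        w = rng.random()
        m_pot = (int(w < 0.3 + 0.2 * x), int(w < 0.6 + 0.2 * x))   # M(0), M(1)
        u = rng.gauss(0.0, 1.0)
        truth += outcome(1, m_pot[0], x, u)
        m = m_pot[d]
        s = int(rng.random() < 0.4 + 0.2 * d + 0.2 * m + 0.1 * x)  # MAR selection
        y = outcome(d, m, x, u) if s else None     # outcome missing if S = 0
        data.append((x, d, m, s, y))
    return data, truth / n


def ipw_mar(data, d):
    """Sample analog of the IPW expression for E[Y(d, M(1-d))] under MAR."""
    n = len(data)
    n_x, n_dx, n_mx, n_dmx, n_sdmx = Counter(), Counter(), Counter(), Counter(), Counter()
    for x, dd, m, s, _ in data:
        n_x[x] += 1
        n_dx[dd, x] += 1
        n_mx[m, x] += 1
        n_dmx[dd, m, x] += 1
        n_sdmx[dd, m, x] += s
    total = 0.0
    for x, dd, m, s, y in data:
        if dd != d or s == 0:
            continue                                   # I{D = d} * S equals zero
        p_d_mx = n_dmx[d, m, x] / n_mx[m, x]           # Pr(D = d | M, X)
        p_o_mx = n_dmx[1 - d, m, x] / n_mx[m, x]       # Pr(D = 1-d | M, X)
        p_o_x = n_dx[1 - d, x] / n_x[x]                # Pr(D = 1-d | X)
        pi = n_sdmx[d, m, x] / n_dmx[d, m, x]          # Pr(S = 1 | D, M, X)
        total += y / (p_d_mx * pi) * p_o_mx / p_o_x
    return total / n


data, truth = simulate()
estimate = ipw_mar(data, d=1)
print(round(estimate, 3), round(truth, 3))
```

The estimate should lie close to the simulated mean of Y(1, M(0)), which is available here only because the potential outcomes are generated by the simulation.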

Assumptions and Identification Results under Selection Related to Unobservables
In the following discussion, we consider the case that selection is related to both observables and unobservables that are associated with the outcome. Assumptions 3 and 4 are therefore replaced by Assumption 5 below, which no longer imposes the independence of Y and S given observed characteristics. Instead, we assume that an instrumental variable for S is available to tackle sample selection. As the unobservable V in the selection equation is allowed to be associated with unobservables affecting the outcome, Assumptions 1 and 2 generally do not hold conditional on S = 1 due to the endogeneity of the post-treatment variable S. In fact, S = 1 implies that V ≤ Π(D, M, X, Z), such that conditional on X, the distribution of V generally differs across values of D, M. This entails a violation of the sequential conditional independence assumptions on D, M given S = 1 if potential outcome distributions differ across values of V. We therefore require an instrumental variable denoted by Z, which is allowed to be affected by D and M, but must not affect Y or be associated with unobservables affecting M or Y, as invoked in Assumption 5(a).¹⁰ We apply a control function approach based on this instrument,¹¹ which requires further assumptions.

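Assumption 5 (instrument for selection). (a) There exists an instrument Z that may be a function of D and M, is conditionally correlated with S, does not affect Y, and is not associated with unobservables affecting M or Y conditional on X; (b) S = I{V ≤ Π(D, M, X, Z)}, where Π is a general function and V is a scalar (index of) unobservable(s) with a strictly monotonic cumulative distribution function conditional on X; (c) V⊥(D, M, Z)|X.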
By the threshold crossing model postulated in Assumption 5(b), Pr(S = 1|D, M, X, Z) = Pr(V ≤ Π(D, M, X, Z)) = F_V(Π), where F_V(v) denotes the cumulative distribution function of V evaluated at v. We will henceforth use the notation p(W) = Pr(S = 1|D, M, X, Z) with W = (D, M, X, Z) for the sake of brevity. Again by Assumption 5(b), the selection probability p(W) increases strictly monotonically in Π conditional on X, such that there is a one-to-one correspondence between the distribution function F_V and specific values v given X. By Assumption 5(c), V is independent of (D, M, Z) given X, implying that the distribution function of V given X is (nonparametrically) identified. Figure 2 illustrates the causal framework underlying Assumptions 1, 2, and 5 by means of a causal graph.

By comparing individuals with the same p(W), we control for F_V and thus for the confounding associations of V with (i) D and {Y(d, m), M(d′)} and (ii) M and Y(d, m) that occur conditional on S = 1. In other words, p(W) serves as control function where the exogenous variation comes from Z. Controlling for the distribution of V based on the instrument is thus a feasible alternative to the (infeasible) approach of directly controlling for levels of V. More concisely, it follows from our assumptions for any bounded function g that

$$E[g(Y(d, m)) \mid D, M, X, p(W), S = 1] = E[g(Y(d, m)) \mid D, M, X, F_V, S = 1] = E[g(Y(d, m)) \mid D, X, F_V, S = 1] = E[g(Y(d, m)) \mid X, F_V, S = 1].$$

The first equality follows from p(W) = F_V under Assumption 5, the second from the fact that when controlling for F_V, conditioning on S = 1 does not result in an association between Y(d, m) and M given D, X such that Y(d, m)⊥M|D, X, p(W), S = 1 holds by Assumptions 2 and 5. This is due to the fact that conditional on p(W) (or F_V), there are no unobservables that are jointly related with S and Y. Therefore, conditioning on S = 1 when also controlling for p(W) does not introduce a statistical association between M and unobservables affecting Y (a phenomenon known as collider or sample selection bias). The third equality follows from the fact that when controlling for F_V, conditioning on S = 1 does not result in an association between Y(d, m) and D given X such that Y(d, m)⊥D|X, p(W), S = 1 holds by Assumptions 1 and 5.¹² Similarly,

$$E[g(M(d)) \mid D, X, p(W), S = 1] = E[g(M(d)) \mid D, X, F_V, S = 1] = E[g(M(d)) \mid X, F_V, S = 1]$$

follows from the fact that when controlling for F_V, conditioning on S = 1 does not result in an association between M(d) and D given X such that M(d)⊥D|X, p(W), S = 1 holds by Assumptions 1 and 5. These results will be useful in the proofs of Theorems 2 and 3, see Appendix A.2.
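The link between the selection probability and the distribution function of V in the threshold crossing model can be illustrated with a short simulation. This is a minimal sketch that assumes, purely for illustration, that V is standard normal and that the index Π takes a fixed value; the function names `normal_cdf` and `selection_frequency` are hypothetical helpers, not part of the paper.

```python
import math
import random


def normal_cdf(v):
    """Distribution function of a standard normal V (an illustrative assumption)."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def selection_frequency(index_value, n=50000, seed=0):
    """Share of units with S = I{V <= index_value} equal to one."""
    rng = random.Random(seed)
    selected = sum(rng.gauss(0.0, 1.0) <= index_value for _ in range(n))
    return selected / n


for index_value in (-0.5, 0.0, 0.5):
    # The simulated selection share approximates F_V evaluated at the index.
    print(index_value, round(selection_frequency(index_value), 3),
          round(normal_cdf(index_value), 3))
```

Since the distribution function is strictly increasing, units with the same selection probability share the same value of F_V at the threshold, which is what allows p(W) to act as a control function.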
Furthermore, identification requires the following common support assumption, which is similar to Assumption 4(a), but in contrast to the latter also includes p(W) as a conditioning variable.

Assumption 6 (common support). Pr(D = d|M = m, X = x, p(W) = p(w), S = 1) > 0 for all d ∈ {0, 1} and m, x, z in the support of M, X, Z.

By Bayes' theorem, Assumption 6 implies that the conditional density of p(W) = p(w) given D, M, X, S = 1 is larger than zero. This means that in fully nonparametric contexts, the instrument Z must in general be continuous and strong enough to substantially shift the selection probability p(W) conditional on D, M, X in the selected population. Assumptions 1, 2, 5, and 6 are sufficient for the identification of mean potential outcomes as well as direct and indirect effects in the selected population.

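Theorem 2.
(i) Under Assumptions 1, 2, 5, and 6, for d ∈ {0, 1},

$$E[Y(d, M(1-d)) \mid S = 1] = E\left[\frac{Y \cdot I\{D = d\}}{\Pr(D = d|M, X, p(W), S = 1)} \cdot \frac{\Pr(D = 1-d|M, X, p(W), S = 1)}{\Pr(D = 1-d|X, p(W), S = 1)} \,\Big|\, S = 1\right],$$

$$E[Y(d, M(d)) \mid S = 1] = E\left[\frac{Y \cdot I\{D = d\}}{\Pr(D = d|X, p(W), S = 1)} \,\Big|\, S = 1\right].$$

(ii) Under Assumptions 1(a), 2, 5, and 6, and M following a discrete distribution,

$$E[Y(d, m) \mid S = 1] = E\left[\frac{Y \cdot I\{D = d\} \cdot I\{M = m\}}{\Pr(D = d|X, p(W), S = 1) \cdot \Pr(M = m|D, X, p(W), S = 1)} \,\Big|\, S = 1\right].$$

Therefore, the direct and indirect effects are identified by

$$\theta_{S=1}(d) = E[Y(1, M(d)) \mid S = 1] - E[Y(0, M(d)) \mid S = 1], \quad \delta_{S=1}(d) = E[Y(d, M(1)) \mid S = 1] - E[Y(d, M(0)) \mid S = 1], \quad \gamma_{S=1}(m) = E[Y(1, m) \mid S = 1] - E[Y(0, m) \mid S = 1].$$

In nonparametric models that allow for general forms of effect heterogeneity related to unobservables, direct and indirect effects can generally only be identified among the selected population. The reason is that effects among selected observations cannot be extrapolated to the non-selected population if the effects of D and M interact with unobservables affecting the outcome, henceforth denoted by U, as the latter are in general distributed differently across S = 1, 0 even conditional on observed variables. To see this, note that conditional on p(W) = Pr(V ≤ Π(D, M, X, Z)), the distribution of V differs across the selected (satisfying V ≤ Π(D, M, X, Z)) and the non-selected (satisfying V > Π(D, M, X, Z)), such that the distribution of U differs, too, if V and U are associated. While the control function p(W) is required for the unconfoundedness of the treatment and the mediator in the selected subpopulation, it does not permit extrapolating effects to the population with unobserved outcomes, see also (Huber and Melly 2015) for further discussion.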
The identification of effects in the total population therefore requires additional assumptions. In Assumption 7 below, we impose homogeneity in the direct and indirect effects across selected and non-selected populations conditional on X, V. A sufficient condition for effect homogeneity is the separability of observed and unobserved components in the outcome variable, i.e., Y = η(D, M, X) + ν(U), where η, ν are general functions. Furthermore, common support as postulated in Assumption 6 needs to be strengthened to hold in the entire population. In addition, the selection probability p(w) must be larger than zero for any w in the support of W; otherwise, outcomes are not observed for some values of D, M, X. Assumption 8 formalizes these common support restrictions. While the mean potential outcomes in the total population remain unknown even under Assumptions 7 and 8, the effects of interest are nevertheless identified by the separability of U.
We conclude our discussion on identification by informally sketching an instrumental variable approach when the treatment D is not conditionally independent as postulated in Assumption 1. Consider for instance an experiment in which the access to the treatment is randomized, but actual treatment participation may endogenously deviate from the granted access based on unobserved characteristics. If Assumption 1 holds for the access variable, it may serve as an instrument for treatment participation under the additional assumptions that it shifts the treatment weakly monotonically and has no direct effect on the outcome other than through the treatment. Imbens and Angrist (1994) and Angrist et al. (1996) show that in the absence of sample selection, these assumptions permit identifying a local ATE (LATE) in the subpopulation of compliers, i.e., among those whose treatment status reacts to the instrument. This requires scaling the so-called intention-to-treat or reduced form effect of the instrument on the outcome by the first stage effect of the instrument on the treatment.
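The scaling of the intention-to-treat effect by the first stage can be sketched in a few lines. The following minimal example covers only the baseline case without sample selection or mediation; the function names `mean` and `wald_late` and the toy data are illustrative assumptions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def wald_late(y, d, z):
    """Intention-to-treat effect of z on y divided by the first-stage effect of z on d."""
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    d1 = [di for di, zi in zip(d, z) if zi == 1]
    d0 = [di for di, zi in zip(d, z) if zi == 0]
    itt = mean(y1) - mean(y0)              # reduced form effect on the outcome
    first_stage = mean(d1) - mean(d0)      # effect of access on participation
    if first_stage == 0:
        raise ValueError("the instrument does not shift the treatment")
    return itt / first_stage


# Toy data: access z is granted to two of four units, one of which takes up treatment.
z = [0, 0, 1, 1]
d = [0, 0, 1, 0]
y = [1.0, 1.0, 3.0, 1.0]
print(wald_late(y, d, z))
```

Here the intention-to-treat effect is 1 and the first stage is 0.5, so the ratio equals 2, the effect for the single complier in the toy data.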
Adding further complications to the identification problem like sample selection and/or mediation requires appropriately modifying the expression of the intention-to-treat effect before scaling it by the first stage. See for instance (Frölich and Huber 2014), who evaluate the LATE when assuming that sample selection is not associated with unobserved characteristics conditional on observables alone or conditional on observables and the compliance type (i.e., under latent ignorability, see (Frangakis and Rubin 1999)). Alternatively, Fricke et al. (2020) discuss identification when sample selection is associated with unobservables based on distinct instruments for the treatment and selection. In the absence of sample selection, Frölich and Huber (2017) consider disentangling the LATE into direct and indirect effects when an instrument for the mediator is available (in addition to that for the treatment). A combination of such approaches permits jointly tackling sample selection and mediator endogeneity in instrumental variable frameworks and is left for future research.

Extensions to Further Populations, Parameters, and Variable Distributions
This section briefly discusses how the identification results can be extended to further populations of interest, policy-relevant parameters, and richer distributions of the treatment and/or the mediator. First and in analogy to the concept of weighted treatment effects in (Hirano et al. 2003), direct and indirect effects can be identified for particular target populations defined upon covariates X by reweighting observations according to the distribution of X in the target population. To this end, we define ω(X) to be a well-behaved weighting function depending on X. Including ω(X)/E[ω(X)] in the expectation operators presented in the theorems above yields the parameters of interest for the target population. As an important example, consider ω(X) = Pr(D = 1|X). For some well-behaved function f(Y, D, M, S, X, Z) of the observed data,

$$E\left[\frac{\omega(X)}{E[\omega(X)]} \cdot f(Y, D, M, S, X, Z)\right] = E\left[\frac{\Pr(D = 1|X)}{\Pr(D = 1)} \cdot f(Y, D, M, S, X, Z)\right] = E[f(Y, D, M, S, X, Z) \mid D = 1], \quad (12)$$

i.e., the expected value of that function among the treated is identified. Likewise, defining ω(X) = 1 − Pr(D = 1|X) gives the expected value among the non-treated. Any of the expressions in the expectation operators of the theorems may serve as f(Y, D, M, S, X, Z) in (12).¹³

Second, the identification results may be extended to well-behaved functions of Y, rather than Y itself. For instance, replacing Y by I{Y ≤ a}, the indicator function that Y is not larger than some value a, everywhere in the theorems permits identifying distributional features or effects. The inversion of potential outcome distribution functions allows identifying quantile treatment effects.
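The effect of reweighting by ω(X)/E[ω(X)] with ω(X) = Pr(D = 1|X) can be illustrated for the simplest case of a function of X alone. The sketch below is a hypothetical example: the logistic form of the propensity score, the choice g(X) = X, and the function name `compare_treated_means` are illustrative assumptions. It compares the reweighted mean in the full sample with the plain mean among the treated.

```python
import math
import random


def compare_treated_means(n=40000, seed=0):
    """Return (reweighted full-sample mean, mean among treated) of g(X) = X."""
    rng = random.Random(seed)
    weighted_sum = weight_sum = 0.0
    treated_sum, treated_count = 0.0, 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-x))      # Pr(D = 1 | X), known in this sketch
        d = int(rng.random() < p)
        weighted_sum += p * x               # omega(X) * g(X)
        weight_sum += p                     # accumulates the sample analog of E[omega(X)]
        if d == 1:
            treated_sum += x
            treated_count += 1
    return weighted_sum / weight_sum, treated_sum / treated_count


reweighted, treated = compare_treated_means()
print(round(reweighted, 3), round(treated, 3))
```

Both numbers estimate the mean of X among the treated, so they should be close to each other and clearly above the population mean of zero, because units with larger X are more likely to be treated in this design.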
Third, our framework can be adapted to allow for multiple or multivalued (rather than binary) treatments. If D is multivalued discrete, the derived expressions may be applied under minor adjustments. For instance, for any d ≠ d′ in the discrete support of D, the expression for potential outcomes in Theorem 1 becomes

$$E[Y(d, M(d'))] = E\left[\frac{Y \cdot I\{D = d\} \cdot S}{\Pr(D = d|M, X) \cdot \Pr(S = 1|D, M, X)} \cdot \frac{\Pr(D = d'|M, X)}{\Pr(D = d'|X)}\right]$$

under appropriate common support conditions. If D is continuous, any indicator functions for treatment values, which are only appropriate in the presence of mass points, need to be replaced by kernel functions, while treatment propensity scores need to be substituted by conditional density functions. In analogy to (Hsu et al. 2018), who consider mediation analysis with continuous treatments in the absence of sample selection, the expression for potential outcomes in Theorem 1 becomes

$$E[Y(d, M(d'))] = \lim_{h \to 0} E\left[\frac{Y \cdot \omega(D; d, h) \cdot S}{E[\omega(D; d, h)|M, X] \cdot \Pr(S = 1|D, M, X)} \cdot \frac{E[\omega(D; d', h)|M, X]}{E[\omega(D; d', h)|X]}\right].$$

The weighting function is ω(D; d, h) = K((D − d)/h)/h, with K being a symmetric second order kernel function assigning more weight to observations closer to d and h being a bandwidth. For h going to zero, E[ω(D; d′, h)|X] and E[ω(D; d′, h)|M, X] correspond to the conditional densities of D given X and given M, X, respectively, also known as generalized propensity scores. We refer to (Hsu et al. 2018) for more discussion on direct and indirect effects of continuous treatments and how estimation may proceed based on generalized propensity scores. We also note that in the context of controlled direct effects, such kernel methods not only allow for a continuous treatment, but (contrary to our theorems) also for a continuous mediator.

13 For instance, the weighted versions of the parameters identified in Theorem 1 correspond to
$$E\left[\frac{\omega(X)}{E[\omega(X)]} \cdot \frac{Y \cdot I\{D = d\} \cdot S}{\Pr(D = d|M, X) \cdot \Pr(S = 1|D, M, X)} \cdot \frac{\Pr(D = 1-d|M, X)}{\Pr(D = 1-d|X)}\right], \quad E\left[\frac{\omega(X)}{E[\omega(X)]} \cdot \frac{Y \cdot I\{D = d\} \cdot S}{\Pr(D = d|X) \cdot \Pr(S = 1|D, M, X)}\right], \quad E\left[\frac{\omega(X)}{E[\omega(X)]} \cdot \frac{Y \cdot I\{D = d\} \cdot I\{M = m\} \cdot S}{\Pr(D = d|X) \cdot \Pr(M = m|D, X) \cdot \Pr(S = 1|D, M, X)}\right].$$
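The role of the kernel weighting function for continuous treatments can be sketched as follows. The example assumes a Gaussian kernel (one possible symmetric second order kernel) and a standard normal treatment without covariates; the function names `gaussian_kernel`, `kernel_weight`, and `average_weight` as well as the bandwidth value are illustrative assumptions.

```python
import math
import random


def gaussian_kernel(t):
    """Standard normal density, a symmetric second order kernel."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kernel_weight(treatment, d, h):
    """omega(D; d, h) = K((D - d) / h) / h."""
    return gaussian_kernel((treatment - d) / h) / h


def average_weight(sample, d, h):
    """Sample mean of the kernel weights, approximating the density of D at d."""
    return sum(kernel_weight(value, d, h) for value in sample) / len(sample)


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(50000)]
approx = average_weight(sample, d=0.0, h=0.1)
exact = gaussian_kernel(0.0)        # true density of D at d = 0 in this sketch
print(round(approx, 3), round(exact, 3))
```

With a small bandwidth, the average weight approximates the density of the treatment at the evaluation point, which is the sense in which the conditional expectations of the weights play the role of generalized propensity scores.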

Estimation
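The parameters of interest can be estimated using the normalized versions of the sample analogs of the IPW-based identification results in Section 2. This implies that the weights of the observations used for the computation of mean potential outcomes add up to unity, as advocated in (Busso et al. 2009; Imbens 2004). For instance, the normalized sample analogs of the results in Theorem 1, part (i), for d = 1 are given by

$$\hat{\mu}_{1,M(0)} = \frac{\sum_{i=1}^{n} Y_i \cdot D_i \cdot S_i \cdot (1 - \hat{p}(M_i, X_i)) \big/ \left[\hat{p}(M_i, X_i) \cdot (1 - \hat{p}(X_i)) \cdot \hat{\pi}(D_i, M_i, X_i)\right]}{\sum_{i=1}^{n} D_i \cdot S_i \cdot (1 - \hat{p}(M_i, X_i)) \big/ \left[\hat{p}(M_i, X_i) \cdot (1 - \hat{p}(X_i)) \cdot \hat{\pi}(D_i, M_i, X_i)\right]}, \qquad \hat{\mu}_{1,M(1)} = \frac{\sum_{i=1}^{n} Y_i \cdot D_i \cdot S_i \big/ \left[\hat{p}(X_i) \cdot \hat{\pi}(D_i, M_i, X_i)\right]}{\sum_{i=1}^{n} D_i \cdot S_i \big/ \left[\hat{p}(X_i) \cdot \hat{\pi}(D_i, M_i, X_i)\right]},$$

and analogously for d = 0, where p̂(X), p̂(M, X), and π̂(D, M, X) denote the estimates of Pr(D = 1|X), Pr(D = 1|M, X), and Pr(S = 1|D, M, X), respectively. The effects are estimated by θ̂(d) = μ̂_{1,M(d)} − μ̂_{0,M(d)} and δ̂(d) = μ̂_{d,M(1)} − μ̂_{d,M(0)}. When propensity scores are estimated parametrically, e.g., based on probit models as in the simulations and application below, then μ̂_{d,M(d′)}, θ̂(d), δ̂(d) satisfy the sequential GMM framework discussed in (Newey 1984), with propensity score estimation representing the first step and parameter estimation the second step. This approach is √n-consistent and asymptotically normal under standard regularity conditions. When the propensity scores are estimated nonparametrically, √n-consistency and asymptotic normality can be obtained if the first step estimators satisfy particular regularity conditions. See (Hsu et al. 2017), who consider series logit estimation of the propensity scores, albeit for the case without sample selection. Furthermore, the bootstrap is consistent for inference as the proposed IPW estimators are smooth and asymptotically normal.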
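Normalization of the weights can be sketched generically. In the minimal example below, the raw weights are taken as given (for instance the inverse of the product of estimated treatment and selection propensity scores, set to zero for units with missing outcomes); the function name `normalized_ipw_mean` and the toy numbers are illustrative assumptions.

```python
def normalized_ipw_mean(outcomes, raw_weights):
    """Weighted mean with weights rescaled to sum to one.

    Units with a zero raw weight (e.g., non-selected units whose outcome is
    missing, passed as None) do not contribute to the estimate.
    """
    total = sum(raw_weights)
    if total <= 0:
        raise ValueError("raw weights must have a positive sum")
    weights = [w / total for w in raw_weights]
    estimate = sum(w * y for w, y in zip(weights, outcomes) if w > 0)
    return estimate, weights


# Toy data: the fourth unit is not selected, so its outcome is missing.
outcomes = [1.0, 2.0, 3.0, None]
raw_weights = [1.0, 1.0, 2.0, 0.0]
estimate, weights = normalized_ipw_mean(outcomes, raw_weights)
print(estimate, sum(weights))
```

In the toy data the estimate equals (1 + 2 + 2 · 3)/4 = 2.25, and the normalized weights add up to one by construction, whereas the unnormalized sample analog would divide by the sample size instead.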
The suggested IPW estimators are computationally inexpensive and straightforwardly permit considering multiple mediators. On the negative side, IPW-based estimation is sensitive to (estimation errors in) propensity scores that are very close to one or zero, see the simulation results in (Busso et al. 2009;Frölich 2004) as well as the theoretical discussion in (Khan and Tamer 2010). This sensitivity can lead to an explosion in the variance and numerical instability in finite samples. Furthermore, as the propensity score directly enters the expression for estimating the potential outcomes or treatment effects, IPW may be less robust to propensity score misspecification than for instance propensity score matching, which merely uses the score to match observations across treatment states, see (Waernbaum 2012). This suggests the use of sufficiently flexible propensity score specifications, while the sensitivity issue can be tackled by trimming too extreme propensity scores, see (Crump et al. 2009), at the cost of somewhat reducing external validity.
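A trimming rule of the kind discussed above can be sketched as a simple filter on the estimated propensity scores. The record layout, the key names `p_mx` and `pi_dmx`, and the function names `keep_observation` and `trim` are hypothetical; the default thresholds mirror the values 0.05 and 0.95 used in the simulation study below but are otherwise tuning choices.

```python
def keep_observation(p_mx, pi_dmx, lower=0.05, upper=0.95):
    """Keep a unit if its treatment propensity score lies in [lower, upper]
    and its selection propensity score is at least lower."""
    return lower <= p_mx <= upper and pi_dmx >= lower


def trim(records, lower=0.05, upper=0.95):
    """Drop records with extreme estimated propensity scores."""
    return [
        record for record in records
        if keep_observation(record["p_mx"], record["pi_dmx"], lower, upper)
    ]


# Toy records with estimated Pr(D = 1 | M, X) and Pr(S = 1 | D, M, X).
records = [
    {"id": 1, "p_mx": 0.50, "pi_dmx": 0.80},
    {"id": 2, "p_mx": 0.02, "pi_dmx": 0.90},   # treatment score too small
    {"id": 3, "p_mx": 0.97, "pi_dmx": 0.70},   # treatment score too large
    {"id": 4, "p_mx": 0.40, "pi_dmx": 0.01},   # selection score too small
]
print([record["id"] for record in trim(records)])
```

Only the first record survives in this toy example. Discarding units changes the population to which the estimates refer, which is the loss in external validity mentioned above.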

Simulation Study
This section provides a brief simulation study, in which we investigate the finite sample properties of estimation of natural direct and indirect effects based on the sample analogs of Theorems 1 to 3. To this end, we consider a data generating process with the following structure. The outcome Y is a linear function of the observed variables D, M, X and an unobserved term U, and is only observed if the selection indicator S, which depends on D, M, X, an instrument Z, and an unobservable V, is equal to one. The parameter α gauges the interaction of D and U in the outcome equation. For α ≠ 0, the treatment effect is heterogeneous in U such that Assumption 7 is violated. W and R denote the unobservables in the linearly modeled mediator M and instrument Z, respectively. All unobservables as well as the observed covariate X are standard normally distributed and independent of each other. In this framework, the assumptions underlying Theorem 1 are satisfied.
We run 5000 Monte Carlo simulations with sample sizes n = 1000, 4000 and consider estimation of the natural direct and indirect effects in the total population (θ(d), δ(d)) based on three different estimators: (i) normalized IPW as suggested in (Huber 2014a) among the selected ('IPW w. S = 1') that controls for X but ignores selection bias, (ii) normalized IPW based on Theorem 1 assuming MAR (IPW MAR), and (iii) normalized IPW based on Theorem 3 (IPW IV). We estimate the treatment and selection propensity scores by probit and apply a trimming rule that discards observations withp(M, X) smaller than 0.05 or larger than 0.95 or withπ(D, M, X) smaller than 0.05 to prevent exploding weights due to small denominators. Trimming hardly affects IPW estimator (i), but reduces the variance of estimation based on Theorems 1 and 3 in several cases. Table 1 reports the simulations results under α = 0.25, 14 namely the bias, standard deviation (std), and the root mean squared error (RMSE) of the various estimators for the natural direct and indirect effects in the total population. Ignoring selection (IPW w. S = 1) yields biased estimates of the direct effects under either sample size, while biases are generally small for estimation based on Theorem 1. Interestingly, the latter result also holds for estimation related to Theorem 3, where the selection process accounts for the same observed factors as under the correct MAR assumption, plus the control function. Even though including the control function is not required for consistency, it does not jeopardize identification either, even if Assumption 7 requiring α = 0 is not satisfied, 15 as reflected in the low biases. However, accounting for this unnecessary variable entails an increase of the standard deviation in some cases. In general, the estimators based on Theorems 1 and 3 are (due to the estimation of the sample selection propensity score) less precise than IPW without selection correction in the selected sample. 
The proposed methods become relatively more competitive in terms of the RMSE as the sample size increases and gains in bias reduction become relatively more important compared to losses in precision.

Note to Table 1: std and rmse report the standard deviation and root mean squared error, respectively.
[14] Results are very similar when α = 0 and therefore omitted.
[15] Note that in spite of α = 0.25, estimation based on Theorem 3 (whose Assumption 7 is violated in this design) is consistent because the distribution of U is not associated with S conditional on D, M, X.
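The weighting and trimming steps described above can be sketched as follows. This is a minimal illustration and not the implementation in the causalweight package. It assumes that the propensity scores have already been estimated (for instance by probit) and are passed in as arrays, and that the mean potential outcome E[Y(d, M(d′))] is estimated by weighting selected observations with D = d by Pr(D = d′ | M, X) / [Pr(D = d | M, X) · π(D, M, X) · Pr(D = d′ | X)]; Theorem 1 remains the authoritative statement of the weights. The effect definitions θ(d) = E[Y(1, M(d))] − E[Y(0, M(d))] and δ(d) = E[Y(d, M(1))] − E[Y(d, M(0))] follow standard mediation notation. All function and argument names are hypothetical.

```python
import numpy as np


def trim_mask(p_mx, pi_dmx, lo=0.05, hi=0.95):
    """True for observations kept by the trimming rule:
    0.05 <= p(M, X) <= 0.95 and pi(D, M, X) >= 0.05."""
    p_mx = np.asarray(p_mx, dtype=float)
    pi_dmx = np.asarray(pi_dmx, dtype=float)
    return (p_mx >= lo) & (p_mx <= hi) & (pi_dmx >= lo)


def _pr(p, d_value):
    """Probability of D = d_value given a score p = Pr(D = 1 | .)."""
    return p if d_value == 1 else 1.0 - p


def mean_potential_outcome_mar(y, d, s, p_mx, p_x, pi_dmx, d_out, d_med):
    """Normalized IPW estimate of E[Y(d_out, M(d_med))] under MAR.

    y      : outcome (may be NaN where s == 0)
    d, s   : binary treatment and selection indicators
    p_mx   : estimated Pr(D = 1 | M, X)
    p_x    : estimated Pr(D = 1 | X)
    pi_dmx : estimated Pr(S = 1 | D, M, X)
    """
    y, d, s = np.asarray(y, float), np.asarray(d), np.asarray(s)
    p_mx, p_x = np.asarray(p_mx, float), np.asarray(p_x, float)
    pi_dmx = np.asarray(pi_dmx, float)

    # Only selected, untrimmed observations with D = d_out receive weight.
    idx = trim_mask(p_mx, pi_dmx) & (d == d_out) & (s == 1)
    w = (_pr(p_mx[idx], d_med)
         / (_pr(p_mx[idx], d_out) * pi_dmx[idx] * _pr(p_x[idx], d_med)))
    # Normalization: the weights are rescaled to sum to one.
    return np.sum(w * y[idx]) / np.sum(w)


def natural_effects_mar(y, d, s, p_mx, p_x, pi_dmx):
    """Natural direct (theta) and indirect (delta) effects from the four means."""
    mu = {(a, b): mean_potential_outcome_mar(y, d, s, p_mx, p_x, pi_dmx, a, b)
          for a in (0, 1) for b in (0, 1)}
    return {
        "theta(1)": mu[1, 1] - mu[0, 1],
        "theta(0)": mu[1, 0] - mu[0, 0],
        "delta(1)": mu[1, 1] - mu[1, 0],
        "delta(0)": mu[0, 1] - mu[0, 0],
    }
```

Setting π(D, M, X) to one for every observation and restricting the data to S = 1 reduces the sketch to a normalized IPW estimator that ignores selection, which is the sense in which the selection propensity score is the only additional ingredient.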
As a modification to our initial setup, we introduce a correlation between U and V, which implies that the assumptions underlying Theorem 1 no longer hold, while those of Theorem 2 are satisfied and those of Theorem 3 are satisfied when α = 0:

$$
\begin{pmatrix} U \\ V \end{pmatrix} \sim N(\mu, \Sigma), \quad \text{where } \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \text{ and } \Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}.
$$

Table 2 reports the results for the estimation of natural effects in the total population under α = 0 and 0.25 using the same methods as before. Non-negligible biases occur not only when ignoring sample selection (IPW w. S = 1), but also when selection is assumed to be related to observables only (IPW MAR). When α = 0, estimation based on Theorem 3 (IPW IV) is close to being unbiased and dominates the other methods in terms of RMSE under the larger sample size (n = 4000). When α = 0.25, however, the latter approach is also biased due to the violation of Assumption 7. Therefore, Table 3 considers the estimation of natural effects among the selected population only (θ_{S=1}(d), δ_{S=1}(d)) in the presence of the D-U interaction effect. We investigate the performance of estimation based on Theorem 2 (IPW IV w. S = 1), as well as of IPW among the selected ignoring selection. While the latter approach is biased, the former is close to being unbiased, but less precise. The relative performance of the IV method in terms of the RMSE improves as the sample size (and thus precision) increases. [16]

Table 2. Simulations with selection on unobservables, total population. Note: std and rmse report the standard deviation and root mean squared error, respectively.

Table 3. Simulations with selection on unobservables, selected population (S = 1). Note: std and rmse report the standard deviation and root mean squared error, respectively.

[16] Results are very similar when setting α = 0 and therefore omitted.
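For illustration, the correlated unobservables of the modified design can be drawn as in the sketch below. It only covers the joint draw of (U, V) with zero means, unit variances, and covariance 0.8; the remaining equations of the data generating process are not reproduced, and the function name and seed are hypothetical choices.

```python
import numpy as np


def draw_unobservables(n, rho=0.8, seed=None):
    """Draw n pairs (U, V) from a bivariate normal with zero means,
    unit variances, and correlation rho."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)
    sigma = np.array([[1.0, rho],
                      [rho, 1.0]])
    uv = rng.multivariate_normal(mu, sigma, size=n)
    return uv[:, 0], uv[:, 1]


# Example with the larger simulation sample size.
u, v = draw_unobservables(4000, seed=0)
print(np.corrcoef(u, v)[0, 1])  # sample correlation, close to 0.8
```

Because V enters the selection equation and U enters the outcome equation, a nonzero correlation of this kind is what makes selection depend on unobservables that also affect the outcome.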

It should be noted, however, that the usefulness of the instrument-based estimator might be limited in many empirical applications. In our simulations, IPW IV has the highest variance among the methods considered, which may outweigh the gains from a smaller bias and thus entail a higher RMSE, in particular in moderate samples. Furthermore, the high variance issue becomes considerably more severe if the instrument is weak and has (in contrast to our simulation design) only a limited effect on selection, at least when controlling for multiple covariates. In such realistic scenarios, the biased IPW MAR method most likely has a smaller RMSE than the unbiased but unstable IPW IV estimator. Nevertheless, the instrumental variable approach appears useful in research designs with randomly assigned instruments (that are sufficiently strong), e.g., financial incentives for responding in follow-up surveys such as vouchers, cash payments, or cash lotteries. See for instance (Castiglioni et al. 2008; Hsu et al. 2017; Pforr et al. 2015).
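The bias-variance trade-off discussed above can be made explicit with the standard decomposition of the mean squared error. The notation is assumed here for illustration: θ̂ denotes any of the estimators considered and θ the true effect.

$$
\mathrm{RMSE}(\hat{\theta}) = \sqrt{E\big[(\hat{\theta} - \theta)^2\big]} = \sqrt{\mathrm{bias}(\hat{\theta})^2 + \mathrm{std}(\hat{\theta})^2}, \qquad \mathrm{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta .
$$

An unbiased estimator therefore attains the smaller RMSE only if its standard deviation is below the square root of the competitor's squared bias plus squared standard deviation, a condition that becomes easier to meet as n grows, because the standard deviations shrink while the bias of an inconsistent estimator does not.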

Empirical Application
This section illustrates the evaluation of direct and indirect treatment effects in the presence of sample selection using data from Project Student-Teacher Achievement Ratio (STAR), an educational experiment conducted from 1985 to 1989 in Tennessee, USA. In the experiment, a cohort of students entering kindergarten and their teachers were randomly assigned within their school to one of three class types: small (13-17 students), regular (22-26 students), or regular with an additional teacher's aide. Students were supposed to remain in the assigned class type through third grade, returning to regular classes afterwards. The goal of Project STAR was to investigate the impact of class size on academic achievement measured by standardized and curriculum-based tests in mathematics, reading, and basic study skills. Numerous studies found positive effects of reduced class size on academic performance in the short run (Finn and Achilles 1990; Folger and Breda 1989; Krueger 1999) and the medium run (Finn et al. 1989; Krueger and Whitmore 2001; Nye et al. 2001), and even on later-life outcomes (Chetty et al. 2011). While the benefits of small class size are well documented, the causal mechanisms underlying the effect are less well understood. Finn and Achilles (1990) argue that the impact is likely driven by classroom processes related to higher teacher morale and satisfaction translated to students, increased teacher-student interactions and time for individual attention, and student involvement in learning activities.
We investigate whether the effect of reduced class size on academic performance is mediated by the number of days absent from school. There might be several explanations for why class size affects days of absence. A smaller concentration of children in a classroom may be related to reduced transmission of infectious diseases and hence absenteeism. [17] Increased student involvement and closer teacher-student relationships in smaller classes may represent further channels making children and their parents more engaged and less likely to miss classes. As for the link between school absence and academic performance, a number of studies demonstrated a negative association between the two, see for instance (Gershenson et al. 2017; Gottfried 2009; Morrissey et al. 2014).
We compare results using the IPW MAR estimator (IPW MAR in Table 5) based on Theorem 1 (relying on Assumptions 1 through 4) in Section 2 to three previously considered mediation estimators that ignore sample selection [18]: (i) a linear mediation estimator allowing for treatment-mediator interactions but neither accounting for observed pre-treatment confounders, nor selection, which is numerically equivalent to the decomposition of (Blinder 1973; Oaxaca 1973) (Lin w. S = 1, no X); [19] (ii) a semiparametric IPW-based analog of the linear mediation estimator not accounting for confounding also considered in (Huber 2015) (IPW w. S = 1, no X); and (iii) the IPW estimator suggested in (Huber 2014a) that incorporates observed pre-treatment covariates X but ignores sample selection when estimating the effect for the total population (IPW w. S = 1). We apply the same trimming rule as in the simulations presented in Section 4, which discards observations with treatment propensity scores p(M, X) smaller than 0.05 or larger than 0.95 or with π(D, M, X) smaller than 0.05. However, no observations are dropped for any IPW method as such extreme propensity scores do not occur in our sample.

[17] Odongo et al. (2017) find a positive correlation between school size and communicable disease prevalence rates in Kenya. We are, however, not aware of any such study considering class (rather than school) size.
[18] We do not consider IPW IV estimation based on Theorems 2 and 3, as our data do not contain credible instruments.
[19] See (Huber 2015) on the equivalence of conventional wage gap decompositions and a simple mediation model.
The treatment (D) is a binary indicator which is one if a child entering kindergarten was enrolled in a small class and zero otherwise. [20] The outcome (Y) is the first grade score in the Stanford Achievement Test (SAT) in mathematics. For IPW MAR estimation, a selection indicator S for missing outcomes is generated and all observations in our evaluation sample are preserved, such that effects are estimated for the entire population. In the case of the remaining three estimators, the evaluation is based on the data with non-missing Y, such that estimation relies on the selected sample only. The mediator (M) is the number of days a child was absent during the kindergarten year. Observed covariates (X) consist of a child's race, gender, year of birth, and free lunch status as a proxy for socio-economic status. They are controlled for in the IPW w. S = 1 and IPW MAR estimators. Even if these variables are initially balanced due to the random assignment of D, they might confound M and Y, implying that they are imbalanced when conditioning on the mediator for estimating direct and indirect effects. [21] We restrict the initial sample of 11,601 children to 6325 observations who were part of Project STAR in kindergarten such that their treatment status was observed. [22] About 30% of participants in the kindergarten year were randomized into small classes. Table 4 presents summary statistics for the variables included in our empirical illustration for individuals without any missing values in the covariates. It shows a positive and statistically significant association between reduced class size and the average score in the standardized math test. Furthermore, children in small classes are, on average, about 0.7 days less absent. This difference is significant at the 5% level, but arguably small in terms of absolute magnitude.
There are no statistically significant differences in students' gender, race [23], and free lunch status across treatment states due to treatment randomization. The sample is not perfectly balanced in terms of students' year of birth: children born in 1978 and 1980 were less likely to be in small classes (differences are statistically significant at the 1 and 10% levels, respectively), while those born in 1979 were more likely to be in small classes (significant at the 5% level). There is substantial attrition: math SAT scores in the first grade are observed for only 70% of program participants in the kindergarten year. The number of missing values in other key variables is much smaller. In the estimations, observations with missing values in M or X are dropped, which concerns all in all 83 cases, or about 1% of the sample.

[21] Ready (2010) reports a stronger negative impact of absenteeism on early literacy outcomes for students with lower socioeconomic status, which implies that socioeconomic status and absenteeism interact in explaining the outcome. If socioeconomic status in addition affects absenteeism, it is a confounder of the association between absenteeism and the literacy outcomes.
[22] 5276 students joined the program in subsequent years. About 2200 entered the experiment in the first grade, 1600 in the second and 1200 in the third grade.
[23] Less than 1% of students in the sample are Asian, Hispanic, Native American or other race. In our analysis, they are included in one group with black students.

Note to Table 4: Standard deviations are in square brackets. Cluster-robust standard errors are in parentheses.

Table 5 provides point estimates (est.), cluster-robust standard errors (s.e.) based on block-bootstrapping the effects 1999 times, and p-values for the total treatment effect, as well as natural direct and indirect effects under treatment and non-treatment (θ(1), θ(0), δ(1), δ(0)) for the four estimators.
The total average effect of small class assignment is very similar across all methods and highly statistically significant, amounting to an increase of almost 10 points. Furthermore, we find that, if anything, the contribution of the indirect effects due to reduced days of absence is rather small, ranging from 0.18 to 0.99 points across different methods and treatment states. This is not surprising in the light of the quite modest differences in absenteeism across treatment groups, see Table 4. The IPW MAR estimator yields the largest indirect effects (amounting to 3-11% of the total effect), and the indirect effect under non-treatment is statistically significant at the 10% level. It is thus the direct effects, which are highly statistically significant for any method, that mostly drive the total effect. IPW MAR yields direct effect estimates of 8.52 points under treatment and 7.75 points under non-treatment, which are slightly smaller than those of the other estimators exploiting the subsample with non-missing outcomes only (ranging from 9.01 to 9.55 points under treatment and from 8.77 to 9.55 points under non-treatment). We therefore conclude that causal mechanisms not observed in the data (possibly including teacher motivation and individual teacher-student interaction) and entering the direct effect are much more important than absenteeism for explaining the effect of small kindergarten classes on math performance.
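The cluster-robust standard errors reported in Table 5 rest on a block bootstrap, which can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used for the paper: the clustering unit is not specified in this section, so `cluster_ids` stands for whatever grouping is used, and `estimator` is a hypothetical callable that maps a data array to a single effect estimate. Whole clusters are resampled with replacement so that dependence within a cluster is preserved in each bootstrap sample.

```python
import numpy as np


def cluster_bootstrap_se(estimator, data, cluster_ids, n_boot=1999, seed=None):
    """Standard error of estimator(data) from resampling whole clusters.

    estimator   : callable taking a 2-D array of rows and returning a float
    data        : 2-D array, one row per observation
    cluster_ids : 1-D array assigning each row to a cluster
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    cluster_ids = np.asarray(cluster_ids)
    clusters = np.unique(cluster_ids)
    # Row indices of each cluster, computed once.
    rows_by_cluster = [np.flatnonzero(cluster_ids == c) for c in clusters]

    draws = np.empty(n_boot)
    for b in range(n_boot):
        # Draw as many clusters as in the sample, with replacement.
        picked = rng.integers(0, len(clusters), size=len(clusters))
        rows = np.concatenate([rows_by_cluster[k] for k in picked])
        draws[b] = estimator(data[rows])
    return draws.std(ddof=1)


# Toy usage: standard error of a mean with 20 clusters of 10 observations.
rng = np.random.default_rng(1)
ids = np.repeat(np.arange(20), 10)
scores = rng.normal(size=20)[ids] + rng.normal(size=200)
se = cluster_bootstrap_se(lambda a: a[:, 0].mean(), scores.reshape(-1, 1), ids, seed=2)
print(se)
```

In an application, `estimator` would re-estimate the propensity scores and the effect of interest on each bootstrap sample, so that the estimation error of the first-step scores is reflected in the standard error.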

Conclusions
In this paper, we proposed an approach for disentangling a total causal effect into a direct component and an indirect effect operating through a mediator in the presence of outcome attrition or sample selection. To this end, we combined sequential conditional independence assumptions about the assignment of the treatment and the mediator with either selection on observables/missing at random or instrumental variable assumptions on the outcome attrition process. We demonstrated the identification of the parameters of interest based on inverse probability weighting by specific treatment, mediator, and/or selection propensity scores and outlined estimation based on the sample analogs of these results. We also provided a brief simulation study and an empirical illustration based on the Project STAR experiment in the U.S. to evaluate the direct and indirect effects of small classes in kindergarten on math test scores in first grade. The estimators considered in the simulation study and the empirical application are available in the causalweight package for the statistical software R.
The first and sixth equalities follow from the law of iterated expectations, the second from basic probability theory, the third from the observational rule, the fourth from Assumptions 2 and 5 (which imply Y(d, m) ⊥ M | D = d, X = x, p(W) = p(w), S = 1), and the fifth from Assumptions 1 and 5 (which imply Y(d, m) ⊥ D | X = x, p(W) = p(w), S = 1).

The first and last equalities follow from the law of iterated expectations, the second from basic probability theory, the third from basic probability theory and the fact that Pr(S = 1 | D, M, X, p(W)) = Pr(S = 1 | D, M, X, Z) = p(W) (as p(W) is a deterministic function of Z conditional on D, M, X), the fourth from Bayes' theorem and the observational rule, the fifth from Assumptions 2 and 5 (which imply Y(d, m) ⊥ M | D = d′, X = x, p(W) = p(w), S = 1), the sixth from Assumptions 1 and 5 (which imply {Y(d, m), M(d′)} ⊥ D | X = x, p(W) = p(w), S = 1), and the seventh from Assumption 7 by acknowledging that p(W) = F_V.