Article

Comparing Meta-Learners for Estimating Heterogeneous Treatment Effects and Conducting Sensitivity Analyses

School of Mathematics and Statistics, Beijing Technology and Business University, Beijing 102488, China
*
Author to whom correspondence should be addressed.
Math. Comput. Appl. 2025, 30(6), 139; https://doi.org/10.3390/mca30060139
Submission received: 19 September 2025 / Revised: 11 December 2025 / Accepted: 11 December 2025 / Published: 16 December 2025

Abstract

In disciplines such as epidemiology, economics, and public health, inference and estimation of heterogeneous treatment effects (HTE) are critical. This approach helps reveal differences in treatment effect estimates between subgroups, which supports personalized decision-making processes. While a variety of meta-learners (e.g., S-, T-, X-learners) have been proposed for estimating HTE, there is a lack of consensus on their relative strengths and weaknesses under different data conditions. To address this gap and provide actionable guidance for applied researchers, this study conducts a comprehensive simulation-based comparison of these methods. We first introduce the causal inference framework and review the underlying principles of the methods used to estimate these effects. We then simulate different data generating processes (DGPs) and compare the performance of S-, T-, X-, DR-, and R-learners with the causal forest, highlighting the potential of meta-learners for HTE estimation. Our evaluation reveals that each learner excels under distinct conditions: the S-learner yields the least bias and is most robust when the conditional average treatment effect (CATE) is approximately zero; the T-learner provides accurate estimates when the response functions differ significantly between the treatment and control groups, resulting in a complex CATE structure; and the X-learner can accurately estimate the HTE in imbalanced data. Additionally, by integrating Z-bias—a bias that may arise when adjusting for a covariate that affects only the treatment variable—with a specific sensitivity analysis, this study demonstrates its effectiveness in reducing the bias of causal effect estimates. Finally, through an empirical analysis of the Trends in International Mathematics and Science Study (TIMSS) 2019 data, we illustrate how to implement these insights in practice, showcasing a workflow for HTE assessment within the meta-learner framework.

1. Introduction

In causal inference research, the central issue is estimating the treatment effects of interventions [1,2,3,4,5]. Traditional methods focus on average treatment effects (ATEs), overlooking potential heterogeneity in subgroups [6]. In contexts with significant individual differences, relying solely on ATE may not capture the true impact of interventions. As a result, there is an increasing demand for the estimation of HTE to reflect the diverse impacts of interventions more accurately [7].
To identify the sources of HTE, researchers often rely on prior knowledge of group populations based on different characteristics for subgroup analysis, or include interaction terms between population characteristics and interventions in traditional regression models to test their statistical significance [8]. However, each subgroup analysis typically groups by a single population characteristic, and performing multiple subgroup analyses may increase the risk of false positives. When treatment effects are influenced by the combination of multiple characteristics, subgroup analysis based on individual characteristics may fail to capture the true effects. Furthermore, in the analysis of interaction terms, the number of interactions increases exponentially when multiple population characteristics are considered simultaneously, making manual specification of these terms time-consuming and prone to model specification errors.
Previous research on causal inference has predominantly addressed identification challenges, whereas machine learning has been widely utilized for large-scale data processing and has shown exceptional performance in prediction tasks. To overcome the limitations of traditional methods, researchers have combined both approaches to estimate HTE in a data-driven manner. This approach shows great potential for accurately identifying the sources of HTE [9,10]. Rapid advancements in machine learning technologies have introduced innovative approaches for causal inference, as they do not rely on manual specifications and overcome the model misspecification issues often encountered with parametric models. Moreover, they can estimate the individual treatment effect (ITE), enabling HTE analysis. Recently, there has been increasing attention toward leveraging causal machine learning algorithms to estimate HTE and elucidate underlying causal mechanisms. For example, Hattab et al. [11] employed causal forest to perform heterogeneous analysis in the Oregon Health Insurance Experiment; Sverdrup et al. [12] applied the causal forest methodology to the field of psychiatry and conducted a comprehensive analysis. Due to their interpretability and ability to accurately estimate HTE in high-dimensional datasets, causal forest has become one of the most representative methods [13].
In addition to causal forest, meta-learners have also received significant attention in recent developments, and simulation comparisons of meta-learners already exist in the literature. In contrast to Brantner et al. [14], who extended single-study methods—such as the S-learner, X-learner, and causal forest—to multi-trial settings and evaluated the influence of method selection and aggregation on estimation accuracy, our simulation study offers a more comprehensive comparison in method selection. Okasa restricted their simulation comparisons to meta-learners, omitting causal forest [15], while Jacob considered only linear and nonlinear scenarios with balanced CATEs, overlooking imbalanced data settings [16]. Compared to the studies mentioned above, we conduct a more comprehensive evaluation of the S-, T-, X-, DR-, and R-learners and the causal forest method. Our simulation incorporates factors such as the selection of base learners (random forest and Bayesian additive regression trees), data balance, the functional form of the CATE, and estimation approaches (full-sample and cross-fitting).
It is essential to recognize that while causal machine learning methods like causal forest and meta-learners are suitable for randomized experiments, their application to observational data for causal inference hinges critically on the assumption of ignorability. This assumption requires that, conditional on the observed covariates, the treatment assignment is independent of the potential outcomes. Since counterfactual outcomes are fundamentally unverifiable, conducting a sensitivity analysis to probe potential unobserved confounding is necessary.
This paper evaluates the performance of S-, T-, X-, DR-, and R-learners for estimating HTE, contrasting them with the widely used causal forest. In addition, it introduces an integrated framework that combines Z-bias scenarios with sensitivity analysis, thereby addressing the fundamental untestability of the ignorability assumption in observational studies. Furthermore, we conduct an empirical analysis using the TIMSS 2019 dataset, a prominent observational study in education. Its established use in causal inference, including the Bayesian causal forest analysis by McJames et al. [17], provides a validated foundation and a relevant context for applying meta-learners and conducting sensitivity analysis in HTE estimation. The goal is to promote the application of meta-learners for estimating HTE, beyond the confines of causal forest.

2. Framework

2.1. Background

Consider data from an observational study, $D_i = \{X_i, T_i, Y_i\}$, where $i \in \{1, \dots, N\}$. Here, $X_i \in \mathcal{X}$ represents the covariates of individual $i$, which control for potential confounding and allow for the estimation of conditional average treatment effects (CATE). For a binary treatment $T_i \in \{0, 1\}$, $T_i = 1$ indicates treatment, and $T_i = 0$ indicates control. The corresponding potential outcomes for each individual are denoted as $Y_i(1)$ and $Y_i(0)$. In our TIMSS 2019 study, $Y_i(1)$ denotes the science achievement of students who participated in online collaborative activities, whereas $Y_i(0)$ corresponds to the science achievement of students who did not engage in such activities. Then the individual treatment effect (ITE) of person $i$ is defined as $\tau_i = Y_i(1) - Y_i(0)$. A fundamental challenge is the inability to observe both potential outcomes for the same individual simultaneously, making it impossible to directly compute $\tau_i$ [18]. However, under additional assumptions, the expected value of $\tau_i$ may be identified.
Therefore, the primary focus of statistical analysis shifts to estimating the CATE and ATE. The CATE, denoted as $\tau(x)$, is formally defined as:
$$\tau(x) = E[\tau_i \mid X_i = x] = E[Y_i(1) - Y_i(0) \mid X_i = x]$$
whereas the ATE is expressed as:
$$\tau = E[Y_i(1) - Y_i(0)] = E\big\{E[Y_i(1) - Y_i(0) \mid X_i]\big\} = E[\tau(X_i)]$$
When the CATEs are identical for all individuals, the treatment effects are deemed homogeneous. Conversely, if the treatment effects differ among individuals, they indicate the presence of HTE. In such cases, the CATE can identify subgroups that exhibit different responses to the treatment.
To estimate the CATE, we rely on three fundamental assumptions, as described below:
Assumption 1. 
Unconfoundedness:
$$(Y_i(0), Y_i(1)) \perp T_i \mid X_i$$
Assumption 2. 
Common Support:
$$0 < P(T_i = 1 \mid X_i = x) < 1, \quad \forall x \in \operatorname{supp}(X_i)$$
Assumption 3. 
Stable Unit Treatment Value Assumption (SUTVA):
$$Y_i = T_i \cdot Y_i(1) + (1 - T_i) \cdot Y_i(0)$$
Under these assumptions, the CATE can be expressed as:
$$\begin{aligned}
\tau(x) &= E\big[Y_i(1) - Y_i(0) \mid X_i = x\big] \\
&= E\big[Y_i(1) \mid X_i = x\big] - E\big[Y_i(0) \mid X_i = x\big] = \mu_1(x) - \mu_0(x) \\
&= E\big[Y_i(1) \mid X_i = x, T_i = 1\big] - E\big[Y_i(0) \mid X_i = x, T_i = 0\big] \\
&= E\big[Y_i \mid X_i = x, T_i = 1\big] - E\big[Y_i \mid X_i = x, T_i = 0\big] = \mu(x, 1) - \mu(x, 0)
\end{aligned}$$
where $\mu(x, 1)$ and $\mu(x, 0)$ represent the response functions for treated and untreated individuals, respectively.
In Equation (1), Assumption 2 ensures the existence of the quantities in the second equality. For example, if $P(T_i = 1 \mid X_i = x) = 1$, then $E[Y_i(0) \mid X_i = x]$ does not exist, and attempting to estimate it would be unreasonable. The third equality relies on Assumption 1, and the fourth equality depends on Assumption 3. Ultimately, we can use the observed data to estimate the CATE.

2.2. Sensitivity Analysis Method

In observational studies, Assumption 1 is untestable, and unmeasured confounding may lead to biased CATE estimates. Therefore, we need to conduct sensitivity analysis to assess how violations of Assumption 1, or the strength of unmeasured confounding, affect the estimation results. Ding and VanderWeele [19] proposed the E-value for sensitivity analysis, but it is only applicable to binary outcomes. Cinelli and Hazlett [20] developed a sensitivity analysis approach within the omitted variable bias framework, although its implementation relies on a regression model. Rubin and Rosenbaum [21] pioneered the development of parametric approaches. Tchetgen et al., Halloran et al., and Lu et al. [22,23,24] subsequently proposed analogous parametric methodologies across distinct research contexts. The parametric approach overcomes the limitations mentioned above by enabling quantification of unmeasured confounding effects on estimation outcomes, irrespective of the type of outcome variable and without relying on specific model specifications.
Under Assumption 1, let X denote the covariate set comprising both confounders and non-confounders. Pearl [25] noted that controlling for all observed covariates may lead to Z-bias. Therefore, given its unique double-robustness property, we employ the sensitivity analysis method from Lu et al. [24] to remove covariates that only affect the treatment variable T, with the remaining variables denoted as L.
First, we define two sensitivity parameters:
$$\frac{E\big[Y(1) \mid T = 1, L\big]}{E\big[Y(1) \mid T = 0, L\big]} = R_1(L), \qquad \frac{E\big[Y(0) \mid T = 1, L\big]}{E\big[Y(0) \mid T = 0, L\big]} = R_0(L).$$
The doubly robust estimator is a methodology for estimating ATE that also relies on the ignorability assumption, yielding an estimated causal effect τ D R :
$$\tau_{DR} = E\left[\mu_1(L) - \mu_0(L) + \frac{T\big(Y - \mu_1(L)\big)}{e(L)} - \frac{(1 - T)\big(Y - \mu_0(L)\big)}{1 - e(L)}\right].$$
where $\mu_1(L) = E[Y \mid T = 1, L]$, $\mu_0(L) = E[Y \mid T = 0, L]$, and $e(L)$ represents the propensity score $P(T = 1 \mid L)$. Through conversion, we can obtain $\tau_R$ as follows:
$$\begin{aligned}
\tau_R = {} & E\big[T\mu_1(L) + (1 - T)\mu_1(L)/R_1(L)\big] - E\big[T\mu_0(L)R_0(L) + (1 - T)\mu_0(L)\big] \\
& + E\big[W_1(L)\,T\big(Y - \mu_1(L)\big)/e(L) - W_0(L)(1 - T)\big(Y - \mu_0(L)\big)/(1 - e(L))\big],
\end{aligned}$$
where $W_1(L) = e(L) + [1 - e(L)]/R_1(L)$ and $W_0(L) = e(L)R_0(L) + 1 - e(L)$. Our analysis demonstrates that when Assumption 1 holds, $R_1(L) = R_0(L) = 1$, and hence $\tau_{DR} = \tau_R$.
Additionally, we can model the influence of observed covariates on the sensitivity parameters $R_1(L)$ and $R_0(L)$. Section 5's case study involves 34 covariates, with 4 covariates excluded, leaving 30 covariates $L = (L_1, L_2, \dots, L_{30})$. For $S \subseteq \{1, \dots, 30\}$, under the condition $Y(t) \perp T \mid L$, we can find
$$R_t(L_{-S}) = \frac{E\big[Y(t) \mid T = 1, L_{-S}\big]}{E\big[Y(t) \mid T = 0, L_{-S}\big]} = \frac{E\big[\mu_t(L) \mid T = 1, L_{-S}\big]}{E\big[\mu_t(L) \mid T = 0, L_{-S}\big]}$$
for $t = 0, 1$, where $L_{-S}$ represents the remaining covariates in $L$ after removing every $L_k$, $k \in S$. In fact, the estimated $R_t(L_{-S})$ from observational data quantifies the degree to which $L_S$ violates Assumption 1, or the confounding effect of $L_S$ ($L_S$ includes every $L_k$, $k \in S$). Taking the maximum and minimum values of the sensitivity parameter $R_t(L_{-S})$, we can establish the connection between $L_S$, unmeasured confounding, and the causal conclusion, which we will explain in Section 5.
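The sensitivity-adjusted estimator $\tau_R$ lends itself to a direct plug-in computation once the nuisance functions have been fitted. The sketch below assumes the response surfaces and propensity scores are already supplied as arrays; the function name and interface are illustrative, not part of the original method's software.

```python
import numpy as np

def tau_r(Y, T, mu1, mu0, e, R1=1.0, R0=1.0):
    """Plug-in sketch of the sensitivity-adjusted ATE estimator tau_R.

    Y, T     : outcome and binary treatment arrays
    mu1, mu0 : fitted response surfaces E[Y|T=1,L] and E[Y|T=0,L] at each L
    e        : fitted propensity scores P(T=1|L)
    R1, R0   : sensitivity parameters; R1 = R0 = 1 recovers tau_DR
    """
    W1 = e + (1 - e) / R1
    W0 = e * R0 + (1 - e)
    term1 = np.mean(T * mu1 + (1 - T) * mu1 / R1)
    term0 = np.mean(T * mu0 * R0 + (1 - T) * mu0)
    aug = np.mean(W1 * T * (Y - mu1) / e
                  - W0 * (1 - T) * (Y - mu0) / (1 - e))
    return term1 - term0 + aug
```

Varying $R_1(L)$ and $R_0(L)$ over a grid then traces out how far the causal conclusion moves as the assumed violation of Assumption 1 grows.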

3. Estimating CATE

Since it was put forward by Wager et al., the causal forest approach has gained widespread recognition and adoption among researchers in fields such as epidemiology and econometrics [13,26,27]. With the growing convergence of machine learning and causal inference, meta-learners have emerged as innovative tools for estimating HTE [28,29,30]. Note that traditional causal inference methods such as inverse probability weighting and doubly robust estimation are designed primarily to estimate the ATE at the population level and thus do not readily provide the ITE. In contrast, some methods—including causal forest and meta-learners—are capable of estimating ITEs when Assumptions 1–3 hold [31,32], thereby enabling analysis of treatment effect heterogeneity. This capability is particularly valuable in applied research: as illustrated in the application section of this study, ITE estimates can be used to partition the study population into subgroups with distinct causal effect magnitudes. Subsequent covariate profiling of these subgroups can then help identify key factors driving heterogeneity—an analytical pathway that is not feasible with conventional aggregate-level methods. Given these advantages, the present article focuses on explicating the underlying principles and methodological features of causal forest and meta-learners.

3.1. Causal Forest

Building on the foundation of decision trees, Athey et al. [33] integrated classification and regression trees into the causal inference framework, thereby introducing the concept of causal trees. These causal trees are distinguished from conventional regression trees by two key innovations: First, they employ a specialized splitting criterion designed to maximize the identification of treatment effect heterogeneity by minimizing the expected mean squared error. Second, they adhere to an “honesty” principle, whereby a tree typically uses one subsample for determining its split structure and a separate, non-overlapping subsample for estimating effects within terminal nodes. This approach effectively mitigates overfitting and ensures that the predictions from each tree are asymptotically normal. Subsequently, Wager et al. [13] extended this approach, leading to the development of causal forest. The entire ensemble composed of these causal trees is trained as an R-learner. Formally, this method estimates τ ( X i ) by minimizing the R-loss function, which is defined by the following equation:
$$\hat{\tau}(X_i) = \arg\min_{\tau} \left\{ \frac{1}{N} \sum_{i=1}^{N} \Big[ \big(Y_i - \hat{\mu}_i(X_i)\big) - \big(T_i - \hat{e}_i(X_i)\big)\tau(X_i) \Big]^2 + \Lambda_n(\tau(X_i)) \right\}$$
This procedure minimizes a locally centered loss function using out-of-bag predictions, with $\Lambda_n(\tau(X_i))$ acting as a regularizer.
A further innovation of the causal forest lies in its use of the ensemble as an adaptive kernel, which calculates weights to estimate treatment effects within local neighborhoods of the high-dimensional covariate space, in contrast to the random forest approach, which directly averages tree predictions. The weights for a covariate set $x$ are determined by evaluating each data point $i$ and computing the fraction of trees in which $X_i$ and $x$ share the same leaf node, while excluding trees that were trained on $X_i$ to prevent overfitting. This defines the kernel function $\alpha(x)$. Formally, for every tree $b$ within the ensemble $B$, which is built on a sample $S_b$,
$$\alpha_i(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{\mathbb{1}\big(\{X_i \in L_b(x),\; i \notin S_b\}\big)}{\big|\{j : X_j \in L_b(x),\; j \notin S_b\}\big|}.$$
Then the causal effect can be obtained as:
$$\hat{\tau}(x) = \frac{\sum_{i=1}^{n} \alpha_i(x)\big(Y_i - \hat{\mu}^{(-i)}(X_i)\big)\big(T_i - \hat{e}^{(-i)}(X_i)\big)}{\sum_{i=1}^{n} \alpha_i(x)\big(T_i - \hat{e}^{(-i)}(X_i)\big)^2}$$
This method yields an asymptotically unbiased and normal estimator that, owing to its minimal demand for manual hyperparameter tuning, has been widely adopted. However, an excessive number of variables—especially when only a few useful predictors are buried among many irrelevant ones—can degrade performance: the random subspace method used in each tree reduces the chance of selecting informative features, letting noise dominate the splits. In such scenarios, prior variable selection can be beneficial.
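To make the kernel-weighting step concrete, the weighted estimating equation can be evaluated directly once the forest weights and out-of-bag nuisance predictions are available. The sketch below assumes those inputs have already been computed (a causal forest implementation such as grf produces them internally); the function name is illustrative.

```python
import numpy as np

def forest_cate_at_x(alpha, Y, T, mu_hat, e_hat):
    """Kernel-weighted CATE estimate at one target point x.

    alpha  : forest weights alpha_i(x), one per training observation
    Y, T   : observed outcomes and binary treatments
    mu_hat : out-of-bag predictions of E[Y | X_i]
    e_hat  : out-of-bag propensity predictions P(T = 1 | X_i)
    """
    y_res = Y - mu_hat  # locally centered outcomes
    t_res = T - e_hat   # locally centered treatments
    return np.sum(alpha * y_res * t_res) / np.sum(alpha * t_res ** 2)
```

Points whose leaves never coincide with $x$ receive zero weight, so the estimate is a local residual-on-residual regression around $x$.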

3.2. S-Learner

S-learner refers to “Single-Learner” because this learner involves only a single model μ ( x , t ) . Uplift modeling in marketing and epidemiological research provided the initial conceptual basis for the S-learner methodology [34]. When using a general linear model as the base learner in the S-learner to estimate the ATE, epidemiologists refer to this method as the “parametric g-formula” [35,36].
The S-learner framework establishes a predictive model in which the outcome variable depends on both covariates and treatment assignment, expressed as:
$$\hat{\mu}(x; t) = \hat{E}\big[Y_i \mid X_i = x, T_i = t\big]$$
The CATE is subsequently derived through:
$$\hat{\tau}(x) = \hat{\mu}(x; 1) - \hat{\mu}(x; 0)$$
Nevertheless, the methodology presents several limitations. First, it fails to consider potential covariate distribution shifts between treatment conditions, a prevalent challenge in non-experimental settings. Second, by incorporating the treatment indicator as a regular predictor variable, the approach may yield unreliable CATE estimates, particularly when the treatment variable exhibits weak predictive capacity. This occurs because machine learning algorithms, such as regression trees, might disregard T i during model fitting, potentially excluding it from splitting criteria. Additionally, regularization effects can introduce estimation bias, potentially shrinking treatment effects toward null values [28].
Recent studies have identified critical constraints in the S-learner’s architecture [37,38]. The method’s uniform regularization approach fails to accommodate potential heterogeneity in sparsity patterns and smoothness requirements between treatment groups. When response functions vary significantly in complexity between groups, this limitation leads to suboptimal performance. In contrast, the S-learner performs well when the treatment effect structure is simple and the response functions have similar complexity levels.
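As an illustration, the S-learner procedure described above can be sketched in a few lines using scikit-learn's random forest as the base learner. The function name and hyperparameter choices are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def s_learner_cate(X, T, Y, X_new):
    """S-learner sketch: fit a single model mu(x, t) with the treatment
    indicator appended as an ordinary feature, then difference the
    predictions made at t = 1 and t = 0."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.column_stack([X, T]), Y)
    with_t1 = np.column_stack([X_new, np.ones(len(X_new))])
    with_t0 = np.column_stack([X_new, np.zeros(len(X_new))])
    return model.predict(with_t1) - model.predict(with_t0)
```

Because the treatment enters as just another feature, the base learner is free to ignore it, which is exactly the regularization-toward-zero behavior discussed above.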

3.3. T-Learner

The T-learner methodology, an abbreviation denoting “Two-Learner,” employs a dual-model estimation framework, requiring separate models for μ ( x , 1 ) and μ ( x , 0 ) . Originally developed within the context of marketing uplift modeling, this approach has gained significant traction in recent years, with widespread adoption across diverse research domains including medical studies and econometric analysis for HTE estimation [39,40,41,42].
The T-learner adopts a distinct modeling strategy compared to the S-learner, constructing separate response functions μ ( x , 1 ) and μ ( x , 0 ) for treated and control samples respectively before deriving CATE estimates. This methodological separation enables the preservation of group-specific distributional characteristics that may emerge from selection bias, while accommodating potential heterogeneity in sparsity and smoothness patterns conditioned on treatment status during the regression of Y on X. However, this approach presents a significant limitation: the independent estimation of response functions prevents information sharing between treatment groups, which becomes particularly detrimental in randomized experiments where individuals often share identical distributions. Consequently, the T-learner demonstrates superior performance in scenarios where the response functions exhibit markedly different distributional properties, yielding complex CATE structures, as evidenced by multiple simulation studies [26,28]. Conversely, its performance deteriorates when response functions display similar trends, resulting in simpler CATE forms. Additionally, the method’s reliability diminishes when applied to imbalanced datasets, as the response function estimated for the smaller group becomes susceptible to overfitting, potentially mistaking random noise for genuine HTE.
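The two-model strategy can be sketched analogously; again the function name and hyperparameters are illustrative choices rather than a prescribed implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def t_learner_cate(X, T, Y, X_new):
    """T-learner sketch: fit separate response surfaces on the treated
    and control subsamples, then difference their predictions."""
    m1 = RandomForestRegressor(n_estimators=200, random_state=0)
    m0 = RandomForestRegressor(n_estimators=200, random_state=1)
    m1.fit(X[T == 1], Y[T == 1])   # mu(x, 1) from treated units only
    m0.fit(X[T == 0], Y[T == 0])   # mu(x, 0) from control units only
    return m1.predict(X_new) - m0.predict(X_new)
```

Note that each arm's model sees only its own subsample, which is the source of both the method's flexibility and its vulnerability when one group is small.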

3.4. X-Learner

In analyzing the limitations of the T-learner, we mentioned that additional bias may arise in imbalanced data. Künzel et al. [28] proposed the X-learner as an enhanced alternative to address this concern. Building upon the T-learner’s foundation, the X-learner initially constructs separate response functions μ ( x , 1 ) and μ ( x , 0 ) . Subsequently, it integrates these estimates with observational data as follows:
$$\tilde{\xi}_i^1 = Y_i - \hat{\mu}(X_i, 0) \qquad \text{and} \qquad \tilde{\xi}_i^0 = \hat{\mu}(X_i, 1) - Y_i$$
which serve as pseudo-outcomes for estimating group-specific treatment effects:
$$\tau(x, 1) = E\big[\tilde{\xi}_i^1 \mid X_i = x, T_i = 1\big]$$
$$\tau(x, 0) = E\big[\tilde{\xi}_i^0 \mid X_i = x, T_i = 0\big]$$
Then the CATE is computed using the estimated propensity score e ^ ( x ) :
$$\hat{\tau}(x) = \hat{e}(x) \cdot \hat{\tau}(x, 0) + \big(1 - \hat{e}(x)\big) \cdot \hat{\tau}(x, 1)$$
To provide intuitive insight into the re-weighting mechanism, consider a scenario with disproportionate group sizes, where the control group vastly outnumbers the treatment group. In such configurations, μ ( x , 1 ) estimation typically demonstrates greater uncertainty relative to μ ( x , 0 ) . Given that τ ( x , 0 ) relies on μ ( x , 1 ) while τ ( x , 1 ) depends on μ ( x , 0 ) , the latter generally yields more reliable estimates. The X-learner’s weighting scheme, assigning 1 e ^ ( x ) to τ ( x , 1 ) and e ^ ( x ) to τ ( x , 0 ) , effectively prioritizes the more precise τ ( x , 1 ) estimates.
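The three stages above can be sketched as follows, again with random forests as base learners; the names and seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def x_learner_cate(X, T, Y, X_new):
    """X-learner sketch following the three stages in the text."""
    def rf(seed):
        return RandomForestRegressor(n_estimators=200, random_state=seed)
    # Stage 1: T-learner-style response surfaces
    m1 = rf(0).fit(X[T == 1], Y[T == 1])
    m0 = rf(1).fit(X[T == 0], Y[T == 0])
    # Stage 2: imputed individual effects as pseudo-outcomes
    xi1 = Y[T == 1] - m0.predict(X[T == 1])   # for treated units
    xi0 = m1.predict(X[T == 0]) - Y[T == 0]   # for control units
    g1 = rf(2).fit(X[T == 1], xi1)            # estimates tau(x, 1)
    g0 = rf(3).fit(X[T == 0], xi0)            # estimates tau(x, 0)
    # Stage 3: propensity-weighted combination
    pm = RandomForestClassifier(n_estimators=200, random_state=4)
    pm.fit(X, T.astype(int))
    e_hat = pm.predict_proba(X_new)[:, 1]
    return e_hat * g0.predict(X_new) + (1 - e_hat) * g1.predict(X_new)
```

When treated units are scarce, $\hat{e}(x)$ is small and the combination leans on $\hat{\tau}(x, 1)$, the estimate built on the well-fitted control surface, mirroring the intuition above.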

3.5. DR-Learner

The DR-learner integrates inverse probability weighting with regression modeling, achieving double robustness [29]. This property ensures consistent estimation provided either the response function or propensity score model is correctly specified. Like the T-learner, the DR-learner initially estimates response functions μ ( x , 1 ) and μ ( x , 0 ) . Subsequently, it combines these nuisance functions with treatment and outcome data to construct doubly robust scores:
$$\hat{\psi}_i = \frac{T_i\big(Y_i - \hat{\mu}(X_i, 1)\big)}{\hat{e}(X_i)} - \frac{(1 - T_i)\big(Y_i - \hat{\mu}(X_i, 0)\big)}{1 - \hat{e}(X_i)} + \hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0)$$
These scores are then regressed on covariates, enabling adaptation to structural characteristics like smoothness or sparsity.
While the DR-learner’s doubly robust property may provide advantages over the X-learner, it is susceptible to increased bias when both models are misspecified [43]. Furthermore, instability may arise with extreme propensity scores, particularly in high-dimensional settings, potentially yielding unreasonable estimates [44,45].
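A compact sketch of the score construction and final regression is given below. For brevity the nuisances are fit on the full sample; cross-fitting them, as discussed in Section 3.7, is the recommended practice. Propensity scores are clipped here as a pragmatic guard against the extreme values mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def dr_learner_cate(X, T, Y, X_new, clip=0.05):
    """DR-learner sketch: build doubly robust scores from fitted
    nuisance models, then regress the scores on the covariates."""
    m1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[T == 1], Y[T == 1])
    m0 = RandomForestRegressor(n_estimators=200, random_state=1).fit(X[T == 0], Y[T == 0])
    pm = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, T.astype(int))
    mu1, mu0 = m1.predict(X), m0.predict(X)
    e = np.clip(pm.predict_proba(X)[:, 1], clip, 1 - clip)  # guard extremes
    psi = (T * (Y - mu1) / e
           - (1 - T) * (Y - mu0) / (1 - e)
           + mu1 - mu0)
    final = RandomForestRegressor(n_estimators=200, random_state=3).fit(X, psi)
    return final.predict(X_new)
```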

3.6. R-Learner

Building on Robinson’s work on partially linear models [46], Nie et al. [30] proposed the R-learner method. The “R” not only acknowledges Robinson’s contribution but also refers to the “residualization” approach. The following two expressions are defined:
$$Y_i = m(X_i) + \tau(X_i) T_i + \epsilon_i$$
$$\mu(X_i) = E(Y_i \mid X_i) = m(X_i) + \tau(X_i) e(X_i)$$
where $Y_i$ represents the outcome model, and $\mu(X_i)$ denotes the conditional mean outcome model. Under this parameterization, the unconfoundedness assumption implies $E(\epsilon_i \mid X_i, T_i) = 0$, and
$$Y_i - \mu(X_i) = \big(T_i - e(X_i)\big)\tau(X_i) + \epsilon_i$$
Residual regression, a well-established technique in econometrics, achieves Neyman orthogonality, effectively addressing selection bias in observational studies. This concept is further developed in the double machine learning framework [47]. Building on this framework, Nie et al. [30] derived a loss function for CATE estimation:
$$\hat{\tau}(X_i) = \arg\min_{\tau} E\Big[\big\{\big(Y_i - \mu(X_i)\big) - \big(T_i - e(X_i)\big)\tau(X_i)\big\}^2\Big]$$
The R-learner implements this process in two steps. In the first step, cross-fitted predicted values $\hat{\mu}_i(X_i)$ and $\hat{e}_i(X_i)$ are obtained for each subject as follows: the dataset is divided into $K$ equally sized folds, and predictions for each observation are generated by fitting the outcome and treatment models using only the data from the $K - 1$ folds that exclude the $i$-th subject. In the second stage, the ITE is estimated through penalized empirical loss minimization:
$$\hat{\tau}(X_i) = \arg\min_{\tau} \big\{ L_n(\tau(X_i)) + \Lambda_n(\tau(X_i)) \big\}$$
where
$$L_n(\tau(X_i)) = \frac{1}{N} \sum_{i=1}^{N} \Big[\big(Y_i - \hat{\mu}_i(X_i)\big) - \big(T_i - \hat{e}_i(X_i)\big)\tau(X_i)\Big]^2$$
and $\Lambda_n(\tau(X_i))$ is a regularizer specific to the chosen machine learning method.
The R-learner’s two-stage estimation framework presents several notable advantages. First, it explicitly decouples the estimation of nuisance functions ( μ ^ ( x i ) and e ^ ( x i ) ) from the CATE estimation process, thereby facilitating the application of tailored regularization techniques. Second, this approach effectively mitigates spurious correlations between the outcome function μ ( x ) and the propensity score e ( x ) .
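The two stages can be sketched as follows. Stage 1 here uses the random forest's out-of-bag predictions as a simple stand-in for cross-fitting. Stage 2 exploits the algebraic fact that minimizing the empirical R-loss is equivalent to a weighted regression of the pseudo-outcome $(Y_i - \hat{\mu}_i(X_i))/(T_i - \hat{e}_i(X_i))$ on $X_i$ with weights $(T_i - \hat{e}_i(X_i))^2$; the function name and clipping choice are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def r_learner_cate(X, T, Y, X_new, clip=0.05):
    """R-learner sketch: residualize Y and T, then solve the R-loss
    as a weighted regression of the pseudo-outcome on X."""
    # Stage 1: nuisance models with out-of-bag (leave-self-out) predictions
    mu = RandomForestRegressor(n_estimators=500, oob_score=True,
                               random_state=0).fit(X, Y)
    pm = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=1).fit(X, T.astype(int))
    mu_hat = mu.oob_prediction_
    e_hat = np.clip(pm.oob_decision_function_[:, 1], clip, 1 - clip)
    y_res = Y - mu_hat
    t_res = T - e_hat
    # Stage 2: weighted regression equivalent to minimizing the R-loss,
    # since sum w_i (y_res_i / t_res_i - tau(X_i))^2 with w_i = t_res_i^2
    # equals sum (y_res_i - t_res_i * tau(X_i))^2
    tau = RandomForestRegressor(n_estimators=500, random_state=2)
    tau.fit(X, y_res / t_res, sample_weight=t_res ** 2)
    return tau.predict(X_new)
```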

3.7. Cross-Fitting

When employing meta-learner techniques to estimate CATE, the concurrent estimation of nuisance functions and treatment effects within the same data set may result in overfitting, thereby inducing bias in the CATE estimator. It is pertinent to acknowledge that S- and T-learners are immune to this form of overfitting, as they solely necessitate the estimation of the conditional mean function for the CATE computation. The research by Chernozhukov et al. [47] demonstrated that sample splitting can regulate overfitting bias in ATE estimation, a principle that has been subsequently generalized to CATE estimation by Kennedy and Nie et al. [22,30]. This approach involves utilizing one segment of the data sample for nuisance function estimation and the complementary segment for treatment effect estimation.
To counteract the overfitting bias engendered by the simultaneous estimation of nuisance and primary parameters on identical observations, we employ the cross-fitting technique [47]. Cross-fitting, in contrast to sample splitting, harnesses a more extensive portion of the data for estimation, thereby diminishing variance. The specific process is illustrated in Figure 1. The sample is divided into two subsets. The first subset is allocated for the estimation of nuisance parameter models, which subsequently facilitate the prediction of nuisance parameters in the other subset, yielding pseudo-outcomes. A regression function is then trained on the second subset, with the validation set employed to predict the CATE (denoted as $\hat{\tau}_1(x)$). The roles of the subsets are subsequently inverted to derive $\hat{\tau}_2(x)$. The CATE estimate is ultimately calculated as the mean of $\hat{\tau}_1(x)$ and $\hat{\tau}_2(x)$.
Furthermore, Newey et al. [48] and Jacob [49] have also introduced multiple adaptations of sample splitting and cross-fitting. Despite the variant implemented, the overarching aim of these methodologies is consistent: to guarantee that the nuisance functions utilized in the construction of individual pseudo-outcomes are estimated without utilizing the data from the individual.
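A generic helper for the fold-swapping step can be sketched as follows; with `n_splits=2` it matches the two-subset scheme in Figure 1, and the resulting predictions can be plugged into the pseudo-outcome constructions of the DR- and R-learners. The function name is illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

def crossfit_predictions(make_model, X, y, n_splits=2, seed=0):
    """Cross-fitted nuisance predictions: each observation is predicted
    by a model trained only on the folds that exclude it, so no
    individual's own data enters its nuisance estimate."""
    preds = np.empty(len(y), dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, held_idx in kf.split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        preds[held_idx] = model.predict(X[held_idx])
    return preds
```

Passing a factory (rather than a fitted model) ensures a fresh model per fold, which is what prevents information leaking from an observation into its own pseudo-outcome.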

4. Simulation Study

4.1. Simulation Study of HTE

To rigorously assess the efficacy and robustness of meta-learners in estimating HTE through both full-sample and cross-fitting methodologies, in comparison with the causal forest approach, we select random forests as the base learner and devise a series of simulation scenarios.
Random forests are selected as the base learners for all meta-learners due to their ensemble approach, which amalgamates predictions from multiple decision trees to mitigate model variance. Their utilization of out-of-bag estimation enhances robustness against overfitting and noise, proving advantageous in causal inference within observational studies by improving the accuracy of potential outcome predictions. As a non-parametric technique, random forests obviate the need for data preprocessing, allowing for flexible adaptation to complex functional forms inherent in the data. This adaptability is crucial for treatment effect estimation, where confounding functions are typically intricate, yet the CATE is often straightforward and sparse [28,50]. Additionally, the widespread adoption of random forests in empirical research, supported by numerous efficient software implementations, underscores their suitability for practical applications.
In contrast to conventional simulation studies that utilize simplistic, low-dimensional functions for nuisance parameters, our design incorporates highly nonlinear yet sparse functions within a high-dimensional covariate space [28,29]. This complexity rigorously tests the capabilities of machine learning methods.
Our study employs a sample size that substantially exceeds those used in prior simulation studies on cross-fitting methods, and performance metrics are evaluated on a validation set of 1000 observations [49,51]. The random forest algorithm is implemented with standard hyperparameter settings: 1000 trees, with the number of randomly selected split variables set to the square root of the number of features. For each DGP, simulations are conducted with training samples of size $N \in \{500, 5000\}$, using 1000 replications for the 500-sample size and 250 replications for the 5000-sample size, the latter reduced due to computational constraints.

4.1.1. Performance Measures

We evaluated model performance using three primary metrics: the root mean squared error (RMSE), absolute bias (|BIAS|), and standard deviation (SD) of the predictions for each observation in the validation sample.
$$\mathrm{RMSE}(\hat{\tau}(X_i)) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \big(\tau(X_i) - \hat{\tau}_n(X_i)\big)^2}$$
$$|\mathrm{BIAS}(\hat{\tau}(X_i))| = \left|\frac{1}{N} \sum_{n=1}^{N} \big(\tau(X_i) - \hat{\tau}_n(X_i)\big)\right|$$
$$\mathrm{SD}(\hat{\tau}(X_i)) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left(\hat{\tau}_n(X_i) - \frac{1}{N} \sum_{m=1}^{N} \hat{\tau}_m(X_i)\right)^2}$$
Furthermore, we compute the Jarque–Bera statistic to test the normality of τ ^ ( X i ) [52,53]:
$$\mathrm{JB}(\hat{\tau}(X_i)) = \frac{N}{6}\left[S(\hat{\tau}(X_i))^2 + \frac{1}{4}\big(K(\hat{\tau}(X_i)) - 3\big)^2\right]$$
where S ( · ) and K ( · ) denote the skewness and kurtosis functions, respectively.
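For one validation observation, the four metrics can be computed from its $N$ replication estimates as follows (a direct transcription of the formulas above; the function name is illustrative):

```python
import numpy as np

def performance_metrics(tau_true, tau_hat_reps):
    """Metrics for one validation observation across N replications.

    tau_true     : true CATE tau(X_i), a scalar
    tau_hat_reps : array of the N replication estimates tau_hat_n(X_i)
    """
    err = tau_true - tau_hat_reps
    rmse = np.sqrt(np.mean(err ** 2))
    abs_bias = np.abs(np.mean(err))
    sd = np.std(tau_hat_reps)
    # Jarque-Bera statistic from standardized estimates
    z = (tau_hat_reps - tau_hat_reps.mean()) / sd
    skew, kurt = np.mean(z ** 3), np.mean(z ** 4)
    jb = len(tau_hat_reps) / 6 * (skew ** 2 + 0.25 * (kurt - 3) ** 2)
    return {"RMSE": rmse, "|BIAS|": abs_bias, "SD": sd, "JB": jb}
```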

4.1.2. Simulation Design

First, for each unit we generate a p-dimensional covariate vector X_i ∈ R^p by sampling from a uniform distribution, a method also used by Okasa [15]. Specifically:
$$X \sim U\bigl([0,1]^{n \times p}\bigr)$$
In the simulation, the sample size is either n = 500 or n = 5000 , and p = 80 . Next, we simulate potential outcomes as follows:
Y i ( 0 ) = μ 0 ( X i ) + ϵ i ( 0 )
Y i ( 1 ) = μ 1 ( X i ) + ϵ i ( 1 )
with errors $\epsilon_i(0), \epsilon_i(1) \overset{\text{iid}}{\sim} N(0,1)$. The treatment assignment is simulated using the propensity score function:
$$T_i \sim \mathrm{Bern}\bigl(e(X_i)\bigr)$$
For all simulations, the response function is defined using the well-known Friedman function [54],
$$\mu_0(x) = \mu_1(x) - \tau(x) = \sin(\pi x_1 x_2) + 2\bigl(x_3 - \tfrac{1}{2}\bigr)^2 + x_4 + \tfrac{1}{2}x_5,$$
which creates a highly non-linear but sparse response function that is inherently difficult to estimate. This function has also been used in simulations by Biau [55].
The propensity score is modeled similarly to Wager et al. [13], where $\beta_{2,4}$ denotes the cumulative distribution function of the Beta(2,4) distribution:
$$e(x) = \alpha\bigl(1 + \beta_{2,4}(f(x))\bigr)$$
Here, the parameter $\alpha$ controls the proportion of treated units in the sample and ensures that the propensity scores are bounded away from 0 and 1, reducing extreme values that could destabilize meta-learners [56,57]. The index $f(x)$ entering the propensity score is non-linear, as proposed by Nie et al. [30]:
$$f(x) = \sin(\pi \cdot x_1 \cdot x_2 \cdot x_3)$$
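Putting the pieces together, the data-generating process can be sketched as follows (our reconstruction, using scipy.stats.beta for the Beta(2,4) distribution function; the τ argument supplies the case-specific CATE defined below):

```python
# Sketch of the simulation DGP: uniform covariates, Friedman response
# function, Beta(2,4)-based propensity score, and Bernoulli treatment.
import numpy as np
from scipy.stats import beta

def friedman_mu0(X):
    # sin(pi x1 x2) + 2 (x3 - 1/2)^2 + x4 + 0.5 x5
    return (np.sin(np.pi * X[:, 0] * X[:, 1]) + 2 * (X[:, 2] - 0.5) ** 2
            + X[:, 3] + 0.5 * X[:, 4])

def simulate(n=500, p=80, alpha=0.25, tau=lambda X: np.zeros(len(X)), seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))                     # X ~ U([0,1]^{n x p})
    mu0 = friedman_mu0(X)
    mu1 = mu0 + tau(X)                               # mu1 = mu0 + tau
    f = np.sin(np.pi * X[:, 0] * X[:, 1] * X[:, 2])  # non-linear index
    e = alpha * (1 + beta.cdf(f, 2, 4))              # propensity score
    T = rng.binomial(1, e)                           # treatment assignment
    Y = np.where(T == 1, mu1 + rng.normal(size=n), mu0 + rng.normal(size=n))
    return X, T, Y, e
```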
To comprehensively evaluate the accuracy and robustness of the meta-learners and the causal forest in estimating HTE, while accounting for factors such as sample size, data balance, base learners, and the functional form of the CATE, we designed six DGPs. Based on the theoretical properties of each meta-learner and prior research, Cases 1–5 are constructed to favor, respectively, the S-, T-, DR-, R-, and X-learners. Additionally, we included a complex DGP in which no method holds an a priori advantage, making it a key focus of our analysis. The six DGPs are described below:
Case 1: balanced treatment and constant zero CATE
τ ( x ) = 0
as used in Okasa [15], with a balanced treatment assignment ( α = 1/4 ), resulting in approximately 50% treated and 50% control units.
Case 2: balanced treatment and complex nonlinear CATE
$$\mu_1(x) = 1, \qquad \tau(x) = 1 - \sin(\pi x_1 x_2) - 2\bigl(x_3 - \tfrac{1}{2}\bigr)^2 - x_4 - \tfrac{1}{2}x_5$$
as used in Künzel et al. [28], with a balanced treatment assignment ( α = 1/4 ), yielding approximately 50% treated units.
Case 3: unbalanced treatment and simple CATE
τ ( x ) = 1 + 1 · 1 ( x 1 > 0.5 )
as used also in Künzel et al. [28], with an unbalanced treatment assignment ( α = 1/8 ), resulting in approximately 25% treated units.
Case 4: unbalanced treatment and linear CATE
$$\tau(x) = 1 + \tfrac{1}{2}x_1 + \tfrac{1}{2}x_2$$
as used in Nie et al. [30], with an unbalanced treatment assignment ( α = 1/8 ), yielding approximately 25% treated units.
Case 5: highly unbalanced treatment and linear CATE
$$\tau(x) = 1 + \tfrac{1}{2}x_1 + \tfrac{1}{2}x_2$$
as used also in Nie et al. [30], with a highly unbalanced treatment assignment ( α = 1/14 ), resulting in approximately 15% treated units.
Case 6: unbalanced treatment and nonlinear CATE
$$\tau(x) = 1 + \frac{4}{3}\sum_{i=1}^{3}\Bigl(\frac{1}{1 + e^{-12(x_i - 0.5)}} - \frac{1}{2}\Bigr)$$
as used in Wager et al. [13], with an unbalanced treatment assignment ( α = 1/8 ), resulting in approximately 25% treated units.
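The six CATE functions above can be collected as callables (our transcription of Cases 1–6; columns of X correspond to x1, x2, … with values in [0,1]):

```python
# CATE functions for the six simulation cases.
import numpy as np

def mu0(X):  # Friedman response function, shared by all cases
    return (np.sin(np.pi * X[:, 0] * X[:, 1]) + 2 * (X[:, 2] - 0.5) ** 2
            + X[:, 3] + 0.5 * X[:, 4])

CATE = {
    1: lambda X: np.zeros(len(X)),                        # constant zero CATE
    2: lambda X: 1.0 - mu0(X),                            # mu1(x) = 1
    3: lambda X: 1.0 + 1.0 * (X[:, 0] > 0.5),             # simple step CATE
    4: lambda X: 1.0 + 0.5 * X[:, 0] + 0.5 * X[:, 1],     # linear CATE
    5: lambda X: 1.0 + 0.5 * X[:, 0] + 0.5 * X[:, 1],     # linear CATE
    6: lambda X: 1.0 + (4 / 3) * sum(                     # nonlinear (sigmoid)
        1 / (1 + np.exp(-12 * (X[:, i] - 0.5))) - 0.5 for i in range(3)),
}
```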

4.1.3. Simulation Results Analysis

Our analysis is based on the results presented in Figure 2 and Figure 3. The Supplementary Materials provide additional results for comparison, including those obtained using Bayesian additive regression trees (BART) as the base learner, along with the corresponding RMSE and Jarque–Bera statistics.
In Case 1, as expected, the S-learner demonstrates significantly superior performance compared to other estimation methods when the CATE equals zero, exhibiting absolute dominance under this condition. In Case 2, the T-learner shows the best performance at sample sizes of both 500 and 5000. However, it exhibits higher variance, mainly due to its approach of estimating two separate response functions for CATE estimation. In Case 3, the performance of the DR-learner is substantially influenced by the cross-fitting, achieving the smallest bias among all methods at a sample size of 5000. In Cases 4 and 5, both the causal forest and X-learner methods display strong estimation capabilities, with the X-learner consistently outperforming others in large-sample settings. Notably, the X-learner shows no significant difference between full-sample and cross-fitting scenarios. This contrasts with the performance of DR- and R-learners, which struggle in these settings. The DR- and R-learners are affected by extreme propensity scores, thus exhibiting unstable estimation, leading to increased bias in highly imbalanced data. The instability of these estimators is further reflected in the heavy-tailed distribution of CATE, as indicated by larger Jarque-Bera statistics and higher variance. In Case 6, our primary setting of interest, the T-learner and X-learner achieved the smallest absolute bias at a sample size of 500, whereas the DR-learner performed best at a sample size of 5000.
In Cases 1, 4, and 5, where the CATE follows a simple linear form, the causal forest estimator accurately captures the treatment effect. In contrast, its performance deteriorates in the presence of nonlinear CATE structures (Cases 2, 3, and 6), with the largest absolute bias observed under the nonlinear setting of Case 6. These results suggest that complex nonlinear CATE structures may compromise the accuracy of the causal forest, and that the bias tends to increase with the complexity of the CATE function. This pattern can be explained by the intrinsic piecewise-constant approximation mechanism of the causal forest [13]. Specifically, the causal forest estimates the CATE through within-leaf differences between treated and control units, which amounts to a local constant approximation of the underlying CATE function. When the true CATE is linear, such local averaging introduces little systematic error. However, for nonlinear CATE functions, especially those with high curvature or rapid local variation, the within-leaf averaging induces non-negligible approximation bias, leading to systematic under- or over-estimation of the true treatment effect. As the complexity and nonlinearity of the CATE function increase, this local averaging bias becomes more pronounced. On the other hand, since the causal forest is pointwise consistent for the true treatment effect and has an asymptotically Gaussian and centered sampling distribution, it demonstrates remarkable robustness across almost all scenarios.
We observed that the standard deviation increases with larger sample sizes, particularly in Cases 2, 3, and 6. The primary reason for this lies in the bias–variance trade-off arising from the interplay between the nonlinear CATE structure and the high flexibility of the random forest models. When estimating the influence functions for the meta-learners, we used the default parameters of random forest—imposing no restrictions on the depth of the tree models. This allowed the trees to perform deep splits and fit highly complex nonlinear structures. While this effectively reduced the bias in CATE estimation, it also amplified the model’s sensitivity to sample-specific noise, leading to an increase in standard deviation. Therefore, appropriate parameter tuning may enable achieving both smaller absolute bias and lower standard deviation even at larger sample sizes.
In the Supplementary Materials, a comparative analysis of meta-learners employing random forests versus BART as base learners reveals several key insights. In terms of similarities, the estimation performance of all methods is subject to the influence of data balance and the functional form of the CATE. Among them, the causal forest proves to be the most robust estimator, while the X-learner consistently delivers reasonably accurate estimates across all scenarios. Regarding differences, BART-based meta-learners exhibit superior accuracy in capturing nonlinear CATE structures. With random forests as the base learner, no single method emerges as universally superior, a finding consistent with the existing literature [15]. By contrast, the BART-based S-learner achieves the highest estimation accuracy in every case, representing a novel contribution of this study.

5. Application to Real Data

As demonstrated by the simulation results, the X-learner exhibits stable performance in accurately estimating the CATE across all simulated datasets. Building on these findings, we subsequently applied the X-learner to analyze the TIMSS 2019 dataset.
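For reference, a minimal X-learner in the spirit of Künzel et al. [28] with random-forest base learners can be sketched as follows (scikit-learn usage and all names are our assumptions, not the paper's code):

```python
# Minimal X-learner sketch: per-arm outcome models, imputed individual
# effects, per-arm CATE models, and a propensity-weighted combination.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def x_learner_cate(X, T, Y, seed=0):
    rf = lambda: RandomForestRegressor(n_estimators=200, random_state=seed)
    X0, Y0 = X[T == 0], Y[T == 0]
    X1, Y1 = X[T == 1], Y[T == 1]
    # Stage 1: outcome models fitted separately in each arm
    m0 = rf().fit(X0, Y0)
    m1 = rf().fit(X1, Y1)
    # Stage 2: imputed individual effects, then CATE models per arm
    d1 = Y1 - m0.predict(X1)   # treated: observed minus imputed control
    d0 = m1.predict(X0) - Y0   # controls: imputed treated minus observed
    t1 = rf().fit(X1, d1)
    t0 = rf().fit(X0, d0)
    # Stage 3: combine with estimated propensity scores as weights
    g = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, T)
    e = g.predict_proba(X)[:, 1]
    return e * t0.predict(X) + (1 - e) * t1.predict(X)
```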

5.1. Data Description

TIMSS 2019 represents the seventh iteration of the Trends in International Mathematics and Science Study, conducted across 64 countries and 8 benchmarking systems at both fourth and eighth-grade levels [58]. In addition to academic performance metrics, TIMSS gathers policy-relevant data on students’ learning environments through questionnaires administered to students, parents or caregivers, teachers, and school principals. In summary, TIMSS offers comparative insights into student achievement across countries, alongside contextual factors related to family, school, and classroom environments.
In this study, we focus exclusively on the science achievement of eighth-grade students in the USA. TIMSS 2019 provides data from questionnaires completed by 8698 students, 472 science teachers, and 273 school principals. After integration, a combined dataset of 1135 observations was formed, containing 34 variables that describe the characteristics of students, science teachers, and schools. Key student-level characteristics include gender, frequency of absenteeism, and the number of educational resources at home. Important teacher-level characteristics encompass teacher gender, years of teaching experience, and job satisfaction. From the principal questionnaires, we also obtained information on school discipline, educational resources, and background. Our objective is to assess the impact of student online collaboration on science achievement.
Figure 4 displays the mean differences in covariates between the treatment and control groups. Typically, standardized mean differences below 0.1 are deemed acceptable, whereas values ≥0.1 indicate imbalance [59]. As depicted, there are numerous imbalanced covariates within the data, necessitating a heterogeneity analysis of treatment effects in this observational study.

5.2. Inference on HTE

Before performing heterogeneity analysis on the data, it is essential to conduct heterogeneity testing. Although Chernozhukov et al. [60] primarily developed their framework for randomized controlled trials, it can be extended to observational data. The adapted testing approach involves estimating the following regression model:
$$Y_i - \hat{\mu}(X_i) = \beta_1\bigl(T_i - \hat{e}(X_i)\bigr) + \beta_2\bigl(\hat{\tau}(X_i) - \bar{\hat{\tau}}\bigr)\bigl(T_i - \hat{e}(X_i)\bigr) + \epsilon$$
Here, $\hat{\mu}(X_i)$ represents the estimated response function for individual i, $\hat{e}(X_i)$ denotes the estimated propensity score, and $\bar{\hat{\tau}} = \frac{1}{n}\sum_{i=1}^{n}\hat{\tau}(X_i)$ is the average of the CATE estimates. If the meta-learner effectively captures the underlying heterogeneity, β2 should equal 1. Thus, a statistically significant β2 greater than zero suggests the presence of HTE, indicating that the meta-learner has captured this heterogeneity to some extent. To account for the uncertainty introduced by sample splitting, Chernozhukov et al. [60] recommend conducting multiple iterations of the experiment and deriving parameter estimates by computing their median across these repetitions. Table 1 presents the results of this analysis applied to an illustrative dataset using the X-learner.
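The regression in model (2) can be sketched with ordinary least squares as follows (a NumPy sketch with our own variable names):

```python
# Heterogeneity test of model (2): regress the outcome residual on the
# treatment residual and its interaction with the demeaned CATE estimate.
import numpy as np

def blp_test(Y, T, mu_hat, e_hat, tau_hat):
    """Return (beta1, beta2) from the model (2) regression."""
    w = T - e_hat                         # T_i - e_hat(X_i)
    z = (tau_hat - tau_hat.mean()) * w    # demeaned CATE x treatment residual
    Xmat = np.column_stack([w, z])
    coef, *_ = np.linalg.lstsq(Xmat, Y - mu_hat, rcond=None)
    return coef[0], coef[1]               # beta1 (ATE), beta2 (heterogeneity)
```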
The coefficient β1 in model (2) represents the ATE when the true functions μ(X_i) and e(X_i) are used in place of their estimates. The average CATE estimated by the X-learner is 1.307, and, as presented in Table 1, the regression estimate of 4.832 has a confidence interval that includes zero. Both estimates suggest that online collaboration does not have a statistically significant impact on the science achievement of the student cohort. Moreover, the significance of β2 and its proximity to 1 allow us to reject the null hypothesis of no HTE, indicating that the X-learner has effectively captured the heterogeneity in the CATE.
In the field of educational research, Talan [61] suggested that computer-supported collaborative learning had a moderately positive effect on academic performance. However, Jung [62] highlighted potential psychological challenges associated with online collaboration, including stress, which might stem from low self-efficacy, limited technical skills, or difficulties in the collaborative process. Prolonged or intense stress could foster negative attitudes and lead to poor academic outcomes. Consequently, the effects of online collaboration on student achievement are not uniform, emphasizing the importance of conducting heterogeneity analysis of treatment effects in this study.
Given the significant heterogeneity in treatment effects observed, it is crucial to explore how these effects differ across individuals. To address this, individuals can be ranked based on their estimated CATEs and grouped into quantiles. Adopting the methodology of Chernozhukov et al. [60], we partitioned the data into five subgroups, denoted as G 1 , , G 5 . Subsequently, the following regression model was applied:
$$Y_i - \hat{\mu}(X_i) = \bigl(T_i - \hat{e}(X_i)\bigr)\sum_{k=1}^{5}\gamma_k D_{k,i} + \epsilon_i$$
In Model (3), D k , i is a dummy variable for the k-th subgroup, indicating that D k , i = 1 if the predicted CATE for individual i falls into the k-th group, and D k , i = 0 , otherwise. The key parameter in this model is the coefficient γ k , which represents the CATE for the k-th subgroup when the true functions μ ( X i ) and e ( X i ) are used. Specifically:
γ k = E [ τ ( X i ) | G k ]
These subgroup-specific CATEs are termed sorted group average treatment effects (GATES) [60]. Figure 5 presents the GATES based on the study. As depicted, the effect of online collaboration among students is not statistically significant across most subgroups. However, a comparison of the 20% most affected students (Group 1 versus Group 5) reveals that the average science achievement in Group 5 is significantly higher than that in Group 1, with non-overlapping confidence intervals. This finding highlights substantial heterogeneity in the treatment effects between two extreme subgroups.
Specifically, while the intervention does not appear to exert a significant influence on the majority of students, it demonstrates a statistically significant and educationally relevant impact on the academic performance of students in Group 5 compared to those in Group 1. This underscores the importance of examining subgroup-specific effects to better understand the differential impacts of educational interventions and inform targeted policy decisions.
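The GATES estimation in model (3) can be sketched as follows (a NumPy sketch; quintile grouping and variable names are ours):

```python
# GATES: rank units by estimated CATE, form quantile groups, and regress
# the outcome residual on group dummies times the treatment residual.
import numpy as np

def gates(Y, T, mu_hat, e_hat, tau_hat, n_groups=5):
    # Quantile-group membership from the estimated CATE
    qs = np.quantile(tau_hat, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.searchsorted(qs, tau_hat, side="right") - 1,
                     0, n_groups - 1)
    w = T - e_hat
    D = np.zeros((len(Y), n_groups))
    D[np.arange(len(Y)), groups] = 1.0     # dummy D_{k,i}
    coef, *_ = np.linalg.lstsq(D * w[:, None], Y - mu_hat, rcond=None)
    return coef                            # gamma_1 ... gamma_{n_groups}
```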
When global tests and GATES reveal significant heterogeneity in treatment effects, it is essential to identify which variables drive this heterogeneity. Specifically, one can compare the average levels of covariates between the most affected and least affected subgroups, a process called classification analysis [60]. In the data example, we used Welch's t-tests to analyze differences in covariate means between students in Group 1 and Group 5, with the Holm correction applied to adjust for multiple testing. Table 2 presents the results for covariates with significant mean differences between participating and non-participating students (i.e., those with an absolute Hedges' g greater than 0.20).
First, we found that the student subgroups benefiting the most from the intervention had lower rates of absenteeism, parents with university-level education, and access to abundant home educational resources. Additionally, their teachers reported higher job satisfaction, a stronger focus on academic success, and a perception of a safe and orderly working environment. Schools with stricter disciplinary policies, an emphasis on academic achievement, and a more advantaged socioeconomic background were associated with greater improvements in science performance. Finally, among the most influential variables, “Parents went to college,” “Number of home educational resources,” “Safe and orderly school,” and “Socioeconomic background of the school” all demonstrated absolute Hedges' g values greater than 0.8, indicating substantial effect sizes. These variables consistently exhibited positive contributions to the causal treatment effect. In contrast, “Often absent” was the only variable showing a negative association, with an absolute Hedges' g of 0.53. These findings are consistent with earlier studies in educational research, which have consistently emphasized the role of family and school-related factors in determining science achievement [63,64,65,66,67]. Thus, our research contributes by verifying these findings through a causal inference approach. Lastly, although the dataset in this study originates from the field of education, the methodological approach can be extended to other social science domains, such as epidemiology, policy evaluation, and public health, to comprehensively assess the effects of interventions. This broader applicability underscores the potential of our framework for informing evidence-based decision-making across diverse contexts.
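The classification analysis above can be sketched as follows (SciPy's ttest_ind with equal_var=False is Welch's test; the Holm adjustment and Hedges' g with its small-sample correction are computed by hand):

```python
# Classification analysis: Welch t-tests per covariate between the least
# and most affected groups, Holm-adjusted p-values, and Hedges' g.
import numpy as np
from scipy.stats import ttest_ind

def hedges_g(a, b):
    na, nb = len(a), len(b)
    sp = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                 / (na + nb - 2))                # pooled standard deviation
    d = (a.mean() - b.mean()) / sp               # Cohen's d
    return d * (1 - 3 / (4 * (na + nb) - 9))     # small-sample correction

def holm_adjust(pvals):
    m = len(pvals)
    order = np.argsort(pvals)
    adj_sorted = np.maximum.accumulate((m - np.arange(m)) * pvals[order])
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

def classification_analysis(X1, X5):
    """X1, X5: (n x p) covariate matrices for the least/most affected groups."""
    p = X1.shape[1]
    pvals = np.array([ttest_ind(X1[:, j], X5[:, j], equal_var=False).pvalue
                      for j in range(p)])
    g = np.array([hedges_g(X5[:, j], X1[:, j]) for j in range(p)])
    return holm_adjust(pvals), g
```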

5.3. Sensitivity Analysis

The statistical analysis of Section 5.2 indicates that, in the TIMSS 2019 data, online collaboration does not demonstrate a significant positive or negative effect on students' science achievement. Because this conclusion relies on the unconfoundedness assumption, we conducted additional sensitivity analyses to investigate the potential impact of unobserved confounders on the causal inferences. Table 3 presents how variations in the sensitivity parameters R_0, R_1 influence the estimated causal effect and its confidence intervals. We find that when R_0 and R_1 ≤ 0.97 or ≥ 1.03, the original causal conclusion may be altered. For instance, R_0 = R_1 = 1.03 implies that, beyond the 4 covariates that are not directly related to the outcome variable, if there exist unmeasured confounders such that students engaged in online collaboration would, on average, score 16 points higher than their nonparticipating counterparts under both treatment and control conditions, our causal conclusion would no longer hold (the average value of the outcome variable is 538).
Furthermore, setting X_S = (School discipline, Socioeconomic background of the school), the covariates that show significant differences in Table 2, we obtain R_0(X_S) and R_1(X_S) within the ranges (0.989, 1.015) and (0.990, 1.014), respectively, and mark the extrema in the contour plot of Figure 6. This implies that unobserved confounders would need to be stronger than X_S = (School discipline, Socioeconomic background of the school) to conclude that online collaboration produces a significant enhancing or inhibitory effect on students' science achievement.
The extended sensitivity analysis method quantitatively characterizes the relationships between unobserved confounders, outcome variables, observed confounders, and causal inferences, thereby enhancing the interpretability and robustness of the findings.

6. Discussion

In the preceding sections, we reviewed recent advances in estimating HTE within observational studies and extended a sensitivity analysis method based on new sensitivity parameters. By conducting simulation studies comparing meta-learners and the causal forest, we have drawn several key conclusions.
Overall, the estimation performance of meta-learners is influenced by the base learner, sample size, data balance, and the functional form of the CATE. We therefore propose the following recommendations. First and most importantly, selecting an appropriate base learner can significantly enhance the estimation performance of meta-learners, as demonstrated in the Supplementary Materials; this represents a key advantage of meta-learners over the causal forest. Second, the X-learner consistently provides accurate CATE estimates under varying conditions, making it a recommended choice for heterogeneity analysis when the optimal estimation method is uncertain. Additionally, meta-learners exhibit different convergence rates; thus, sample size should be considered when selecting an estimation method. Under conditions of data imbalance, both the DR-learner and the X-learner achieve the smallest absolute bias: the DR-learner performs best in general imbalanced scenarios, whereas the X-learner has a distinct advantage under severe imbalance. Consequently, the DR-learner and X-learner may be the most suitable estimation methods for handling imbalanced data. The causal forest, in turn, is sensitive to the functional form of the CATE: its estimation accuracy deteriorates when the CATE is nonlinear, and the absolute bias tends to increase with the complexity of the CATE structure. Nevertheless, it demonstrates outstanding robustness across nearly all conditions evaluated. Moreover, in contrast to meta-learners, the causal forest is pointwise consistent for the true treatment effect and possesses an asymptotically Gaussian sampling distribution centered on the true parameter. This theoretical foundation enables the construction of statistically valid confidence intervals that quantify estimation uncertainty with formal guarantees.
While meta-learners may demonstrate excellent estimation accuracy, their variance estimation and the construction of valid confidence intervals remain unresolved methodological challenges in the current literature. This constitutes a major theoretical limitation of meta-learners and represents a critical direction for future research.
In our empirical analysis, we leveraged educational data to demonstrate how contemporary practices in educational research can be integrated into the meta-learner framework. We applied the X-learner, which showed superior performance in our simulations, to analyze HTE by estimating CATEs. Specifically, we tested for statistically significant heterogeneity and identified covariates associated with such heterogeneity.
Furthermore, we conducted a sensitivity analysis of the unconfoundedness assumption to rigorously assess the influence of potential confounders on the causal inferences, thereby improving the robustness and interpretability of the findings. Utilizing data from TIMSS 2019, this study successfully quantifies the effect of online collaboration on students’ science achievement and reveals that minor variations in sensitivity parameters can substantially impact causal conclusions—an aspect rarely addressed in conventional educational research. Thus, the proposed causal inference framework offers a novel and valuable tool for advancing empirical studies in education.
We anticipate that this study will promote the widespread adoption of meta-learner methods in HTE research and, combined with the sensitivity analysis approach extended here, enhance the interpretability of causal inference.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/mca30060139/s1.

Author Contributions

Conceptualization, J.Z., Y.J. and X.W.; Data curation, J.Z. and Y.J.; Formal analysis, J.Z. and X.W.; Funding acquisition, X.W.; Methodology, X.W.; Software, J.Z.; Supervision, X.W.; Writing—original draft, J.Z.; Writing—review & editing, J.Z. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This article is supported by funding from the Beijing Municipal Education Commission for the Emerging Interdisciplinary Platform for Digital Business at Beijing Technology and Business University (No. 19002024028).

Data Availability Statement

The data presented in the application are available from https://timss2019.org/reports/, accessed on 19 September 2025.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Winship, C.; Morgan, S.L. The estimation of causal effects from observational data. Annu. Rev. Sociol. 1999, 25, 659–706. [Google Scholar] [CrossRef]
  2. Hill, J.L. Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Stat. 2011, 20, 217–240. [Google Scholar] [CrossRef]
  3. Ding, P. A First Course in Causal Inference; Chapman and Hall/CRC: London, UK, 2024. [Google Scholar]
  4. Rosenbaum, P.R. Observational Studies. In Observational Studies; Springer: New York, NY, USA, 2002; pp. 1–17. [Google Scholar]
  5. Rubin, D.B. The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Stat. Med. 2007, 26, 20–36. [Google Scholar] [CrossRef]
  6. Choi, B.Y. Instrumental variable estimation of truncated local average treatment effects. PLoS ONE 2021, 16, e0249642. [Google Scholar] [CrossRef]
  7. Chen, X.; Harhay, M.O.; Tong, G.; Li, F. A Bayesian machine learning approach for estimating heterogeneous survivor causal effects: Applications to a critical care trial. Ann. Appl. Stat. 2024, 18, 350–374. [Google Scholar] [CrossRef]
  8. Xie, Y.; Brand, J.E.; Jann, B. Estimating heterogeneous treatment effects with observational data. Sociol. Methodol. 2012, 42, 314–347. [Google Scholar] [CrossRef] [PubMed]
  9. Athey, S.; Imbens, G.W. Machine learning methods for estimating heterogeneous causal effects. Stat 2015, 1050, 1–26. [Google Scholar]
  10. Li, S.; Pu, Z.; Cui, Z.; Lee, S.; Guo, X.; Ngoduy, D. Inferring heterogeneous treatment effects of crashes on highway traffic: A doubly robust causal machine learning approach. Transp. Res. Part C Emerg. Technol. 2024, 160, 104537. [Google Scholar] [CrossRef]
  11. Hattab, Z.; Doherty, E.; Ryan, A.M.; O’Neill, S. Heterogeneity within the Oregon Health Insurance Experiment: An application of causal forests. PLoS ONE 2024, 19, e0297205. [Google Scholar] [CrossRef] [PubMed]
  12. Sverdrup, E.; Petukhova, M.; Wager, S. Estimating treatment effect heterogeneity in Psychiatry: A review and tutorial with causal forests. Int. J. Methods Psychiatr. Res. 2025, 34, e70015. [Google Scholar] [CrossRef]
  13. Wager, S.; Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. 2018, 113, 1228–1242. [Google Scholar] [CrossRef]
  14. Brantner, C.L.; Nguyen, T.Q.; Tang, T.; Zhao, C.; Hong, H.; Stuart, E.A. Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Stat. Med. 2024, 43, 1291–1314. [Google Scholar] [CrossRef]
  15. Okasa, G. Meta-learners for Estimation of Causal Effects: Finite Sample Cross-fit Performance. arXiv 2022, arXiv:2201.12692. [Google Scholar]
  16. Jacob, D. CATE meets ML: Conditional average treatment effect and machine learning. Digit. Financ. 2021, 3, 99–148. [Google Scholar] [CrossRef]
  17. McJames, N.; O’Shea, A.; Goh, Y.C.; Parnell, A. Bayesian causal forests for multivariate outcomes: Application to Irish data from an international large scale education assessment. J. R. Stat. Soc. Ser. A Stat. Soc. 2025, 188, 428–450. [Google Scholar] [CrossRef]
  18. Holland, P.W. Statistics and causal inference. J. Am. Stat. Assoc. 1986, 81, 945–960. [Google Scholar] [CrossRef]
  19. VanderWeele, T.J.; Ding, P. Sensitivity analysis in observational research: Introducing the E-value. Ann. Intern. Med. 2017, 167, 268–274. [Google Scholar] [CrossRef]
  20. Cinelli, C.; Hazlett, C. Making sense of sensitivity: Extending omitted variable bias. J. R. Stat. Soc. Ser. B Stat. Methodol. 2020, 82, 39–67. [Google Scholar] [CrossRef]
  21. Rosenbaum, P.R.; Rubin, D.B. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. J. R. Stat. Soc. Ser. B (Methodol.) 1983, 45, 212–218. [Google Scholar] [CrossRef]
  22. Tchetgen, E.J.T.; Shpitser, I. Semiparametric theory for causal mediation analysis: Efficiency bounds, multiple robustness, and sensitivity analysis. Ann. Stat. 2012, 40, 1816. [Google Scholar] [CrossRef]
  23. VanderWeele, T.J.; Tchetgen, E.J.T.; Halloran, M.E. Interference and sensitivity analysis. Stat. Sci. A Rev. J. Inst. Math. Stat. 2015, 29, 687. [Google Scholar] [CrossRef]
  24. Lu, S.; Ding, P. Flexible sensitivity analysis for causal inference in observational studies subject to unmeasured confounding. arXiv 2023, arXiv:2305.17643. [Google Scholar]
  25. Pearl, J. Invited commentary: Understanding bias amplification. Am. J. Epidemiol. 2011, 174, 1223–1227. [Google Scholar] [CrossRef]
  26. Athey, S.; Wager, S. Estimating treatment effects with causal forests: An application. Obs. Stud. 2019, 5, 37–51. [Google Scholar] [CrossRef]
  27. Raghavan, S.; Josey, K.; Bahn, G.; Reda, D.; Basu, S.; Berkowitz, S.A.; Ghosh, D. Generalizability of heterogeneous treatment effects based on causal forests applied to two randomized clinical trials of intensive glycemic control. Ann. Epidemiol. 2022, 65, 101–108. [Google Scholar] [CrossRef] [PubMed]
  28. Künzel, S.R.; Sekhon, J.S.; Bickel, P.J.; Yu, B. Metalearners for estimating heterogeneous treatment effects using machine learning. Proc. Natl. Acad. Sci. USA 2019, 116, 4156–4165. [Google Scholar] [CrossRef]
  29. Kennedy, E.H. Towards optimal doubly robust estimation of heterogeneous causal effects. Electron. J. Stat. 2023, 17, 3008–3049. [Google Scholar] [CrossRef]
  30. Nie, X.; Wager, S. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 2021, 108, 299–319. [Google Scholar] [CrossRef]
  31. Lipkovich, I.; Svensson, D.; Ratitch, B.; Dmitrienko, A. Modern approaches for evaluating treatment effect heterogeneity from clinical trials and observational data. Stat. Med. 2024, 43, 4388–4436. [Google Scholar] [CrossRef]
Figure 1. Presentation of the full-sample and cross-fitting framework.
Figure 2. CATE results for Cases 1–3. The baseline methods comprise the causal forest and the S- and T-learners, while the X-, DR-, and R-learners are each implemented with both full-sample and cross-fitting estimation, giving nine distinct approaches in total.
Figure 3. CATE results for Cases 4–6. The baseline methods comprise the causal forest and the S- and T-learners, while the X-, DR-, and R-learners are each implemented with both full-sample and cross-fitting estimation, giving nine distinct approaches in total.
Figure 4. Initial covariate imbalance in the dataset, showing standardized mean differences between individuals who participated in online collaboration (T = 1) and those who did not (T = 0). For binary covariates (marked with asterisks), raw mean differences are presented instead of standardized values. The dashed lines mark the 0.1 threshold commonly used to assess acceptable covariate balance.
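The balance diagnostic behind Figure 4 can be sketched in a few lines. The helper below, `covariate_imbalance`, is a hypothetical illustration rather than the authors' code: it returns the raw mean difference for binary covariates and the pooled-SD standardized difference otherwise, for comparison against the 0.1 threshold.

```python
import numpy as np

def covariate_imbalance(x, t, binary=False):
    """Mean difference in covariate x between treated (t == 1) and
    control (t == 0) units.  Binary covariates use the raw difference
    in proportions; continuous ones are standardized by the pooled SD."""
    x, t = np.asarray(x, dtype=float), np.asarray(t)
    x1, x0 = x[t == 1], x[t == 0]
    diff = x1.mean() - x0.mean()
    if binary:
        return diff
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2.0)
    return diff / pooled_sd

# Toy example: a continuous covariate shifted by one pooled SD between arms
rng = np.random.default_rng(0)
t = np.repeat([1, 0], 500)
x = rng.normal(loc=t * 1.0, scale=1.0)
smd = covariate_imbalance(x, t)
print(abs(smd) > 0.1)  # exceeds the 0.1 balance threshold
```

In practice a difference like this would appear to the right of the dashed line in Figure 4 and flag the covariate for adjustment.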
Figure 5. GATES of the subgroups. The median treatment effects within subgroups, categorized according to the X-learner’s predicted CATE, were estimated across 50 data splits, with error bars indicating the median 95% confidence intervals.
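The GATES construction underlying Figure 5 can be sketched under simplifying assumptions. `gates_diff_in_means` below is a hypothetical helper: it bins observations by predicted CATE quantile and estimates each group's effect by a within-group difference in means, which is only appropriate under random treatment assignment; the article's aggregation over 50 data splits with confidence intervals is omitted here.

```python
import numpy as np

def gates_diff_in_means(y, t, cate_hat, n_groups=5):
    """Group Average Treatment Effects by predicted-CATE quantile.
    Observations are binned into quantile groups of the predicted CATE,
    and within each group the effect is a simple difference in means."""
    y, t, cate_hat = map(np.asarray, (y, t, cate_hat))
    edges = np.quantile(cate_hat, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.searchsorted(edges, cate_hat, side="right") - 1,
                     0, n_groups - 1)
    gates = []
    for g in range(n_groups):
        m = groups == g
        gates.append(y[m & (t == 1)].mean() - y[m & (t == 0)].mean())
    return np.array(gates)

# Simulated RCT whose true CATE increases with x
rng = np.random.default_rng(1)
n = 4000
x = rng.normal(size=n)
t = rng.integers(0, 2, size=n)
y = x + t * x + rng.normal(scale=0.5, size=n)  # true CATE = x
gates = gates_diff_in_means(y, t, cate_hat=x)  # oracle CATE predictions
print(np.all(np.diff(gates) > 0))  # effects increase across subgroups
```

With genuinely heterogeneous effects and a good CATE proxy, the estimated group effects rise monotonically from subgroup 1 to subgroup 5, mirroring the pattern in Figure 5.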
Figure 6. Contour plot of τ_R over the sensitivity parameters R_0 and R_1. The contour lines are generated through grid search.
Table 1. Results of the global test for HTE. The medians computed over 50 splits are displayed, with 95% confidence intervals provided in parentheses (α = 0.05).

|                     | β1               | β2             |
|---------------------|------------------|----------------|
| Estimate            | −4.832           | 0.818          |
| Confidence Interval | (−13.232, 4.028) | (0.515, 1.042) |
| p-value             | 0.284            | 0.000          |
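The global test in Table 1 follows the generic machine-learning "best linear predictor" (BLP) idea of Chernozhukov et al., in which β1 recovers the ATE and a β2 significantly different from zero signals detectable heterogeneity. The sketch below is a hypothetical, simplified weighted-OLS version (`blp_test` is an assumed helper name; sample splitting and inference are omitted).

```python
import numpy as np

def blp_test(y, t, p, s):
    """BLP regression of y on (t - p) and (t - p) * (s - mean(s)),
    with a baseline term in s and weights 1 / (p * (1 - p)).
    Returns (beta1, beta2): the ATE and the loading on the CATE proxy s."""
    y, t, p, s = map(np.asarray, (y, t, p, s))
    w = 1.0 / (p * (1.0 - p))
    X = np.column_stack([np.ones_like(y), s, t - p, (t - p) * (s - s.mean())])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[2], beta[3]

# Simulated randomized experiment with heterogeneous effect tau = 1 + x
rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
p = np.full(n, 0.5)
t = rng.binomial(1, p)
tau = 1.0 + x
y = x + t * tau + rng.normal(size=n)
b1, b2 = blp_test(y, t, p, s=tau)  # oracle CATE proxy
print(round(b1, 2), round(b2, 2))
```

Here both coefficients are close to 1: the ATE is E[τ] = 1, and since the proxy equals the true CATE, its loading is 1 as well; a β2 near zero would instead indicate that the learner's CATE predictions carry no heterogeneity signal, as with β1 in Table 1's insignificant estimate.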
Table 2. Results of classification analysis. The medians computed over 50 splits are displayed, with 95% confidence intervals provided in parentheses (α = 0.05). Statistically significant differences are highlighted in bold, with p-values adjusted for multiple comparisons via Holm's correction method.
| Covariate | Mean in Subgroup 1 (CI) | Mean in Subgroup 5 (CI) | Difference (CI) | Hedges' g |
|---|---|---|---|---|
| Expect to go to college | 0.69 (0.58, 0.80) | 0.88 (0.80, 0.96) | −0.19 (−0.33, −0.06) | −0.50 |
| Often absent | 0.21 (0.11, 0.30) | 0.03 (−0.01, 0.07) | 0.18 (0.06, 0.28) | 0.53 |
| Parents went to college | 0.37 (0.25, 0.48) | 0.76 (0.65, 0.86) | −0.40 (−0.56, −0.25) | −0.89 |
| Number of home educational resources | 10.08 (9.71, 10.41) | 11.71 (11.39, 12.02) | −1.68 (−2.15, −1.20) | −1.18 |
| Sense of school belonging | 9.39 (8.94, 9.84) | 10.07 (9.67, 10.47) | −0.71 (−1.34, −0.09) | −0.39 |
| Confidence in science | 10.50 (10.02, 10.99) | 10.98 (10.53, 11.42) | −0.43 (−1.05, 0.21) | −0.23 |
| Has study desk | 0.78 (0.68, 0.88) | 0.91 (0.84, 0.98) | −0.15 (−0.27, −0.02) | −0.40 |
| Home TV has premium TV channels | 0.85 (0.76, 0.93) | 0.93 (0.86, 0.99) | −0.06 (−0.18, 0.04) | −0.21 |
| Using the internet to access assignments | 0.76 (0.65, 0.86) | 0.94 (0.88, 1.00) | −0.18 (−0.30, −0.07) | −0.54 |
| Number of years teaching | 13.22 (11.38, 15.10) | 17.76 (14.93, 20.48) | −4.51 (−7.81, −0.96) | −0.43 |
| Teacher job satisfaction | 9.60 (9.12, 10.09) | 10.66 (10.28, 11.05) | −1.08 (−1.69, −0.49) | −0.60 |
| Safe and orderly school | 9.05 (8.58, 9.51) | 11.51 (11.08, 11.94) | −2.25 (−2.89, −1.61) | −1.21 |
| Teaching unaffected by unprepared students | 9.60 (9.18, 10.08) | 10.89 (10.39, 11.41) | −1.23 (−1.93, −0.51) | −0.59 |
| Teacher emphasizes on academic success | 9.78 (9.31, 10.29) | 11.12 (10.56, 11.64) | −1.33 (−2.07, −0.58) | −0.60 |
| Type of degree | 0.59 (0.47, 0.71) | 0.80 (0.71, 0.90) | −0.22 (−0.37, −0.07) | −0.49 |
| School discipline | 9.93 (9.62, 10.25) | 10.89 (10.64, 11.14) | −0.97 (−1.39, −0.57) | −0.79 |
| School emphasizes on academic success | 9.65 (9.16, 10.15) | 10.95 (10.44, 11.42) | −1.33 (−1.99, −0.65) | −0.66 |
| Science resources | 10.57 (10.17, 10.98) | 11.17 (10.76, 11.56) | −0.56 (−1.14, 0.02) | −0.32 |
| Socioeconomic background of the school | 0.06 (0.00, 0.12) | 0.49 (0.37, 0.61) | −0.44 (−0.57, −0.31) | −1.10 |
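The effect sizes and multiplicity adjustment reported in Table 2 can be reproduced in outline. Both helpers below are illustrative sketches rather than the authors' code: `hedges_g` applies the small-sample correction factor to the pooled-SD standardized mean difference, and `holm_adjust` implements Holm's step-down p-value correction.

```python
import numpy as np

def hedges_g(a, b):
    """Hedges' g: standardized mean difference with the small-sample
    bias-correction factor applied to Cohen's d."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    correction = 1.0 - 3.0 / (4.0 * (na + nb) - 9.0)
    return correction * (a.mean() - b.mean()) / pooled_sd

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (monotone, capped at 1)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p[i])
        adj[i] = min(1.0, running_max)
    return adj

print(holm_adjust([0.01, 0.04, 0.03]))  # adjusted: 0.03, 0.06, 0.06
```

Applied to each Table 2 covariate, `hedges_g` would be computed on subgroup-1 versus subgroup-5 samples, and the resulting family of p-values passed through `holm_adjust` before flagging significant differences.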
Table 3. The impact of the sensitivity parameters R_0 and R_1 on τ_R. Bootstrap variance estimation is performed to derive 95% confidence intervals, with intervals excluding zero highlighted in bold.
| R_1 \ R_0 | 0.97 | 0.98 | 0.99 | 1.00 | 1.01 | 1.02 | 1.03 |
|---|---|---|---|---|---|---|---|
| 0.97 | **15.78 (3.24, 28.33)** | 12.06 (−1.02, 25.15) | 8.41 (−4.46, 21.28) | 3.80 (−9.17, 16.77) | −0.18 (−13.34, 12.97) | −5.11 (−17.97, 7.76) | −8.05 (−20.73, 4.63) |
| 0.98 | **14.67 (1.67, 27.67)** | 11.76 (−1.31, 24.83) | 6.58 (−6.83, 20.00) | 2.31 (−10.52, 15.14) | −1.05 (−14.45, 12.35) | −5.31 (−18.28, 7.65) | −9.91 (−21.78, 1.95) |
| 0.99 | 13.01 (−0.17, 26.19) | 9.54 (−3.46, 22.55) | 6.15 (−6.90, 19.19) | 0.80 (−12.94, 14.55) | −2.59 (−15.44, 10.26) | −7.20 (−20.90, 6.49) | −10.82 (−23.89, 2.26) |
| 1.00 | 11.90 (−0.34, 24.14) | 7.61 (−5.40, 20.62) | 4.57 (−8.37, 17.52) | 0.47 (−12.41, 13.36) | −3.42 (−15.90, 9.07) | −8.92 (−21.73, 3.89) | −12.95 (−26.19, 0.30) |
| 1.01 | 11.55 (−1.70, 24.80) | 7.46 (−5.06, 19.98) | 2.61 (−10.34, 15.56) | −1.68 (−15.50, 12.14) | −4.82 (−17.46, 7.82) | −10.58 (−23.58, 2.41) | −12.65 (−26.94, 1.64) |
| 1.02 | 10.03 (−2.99, 23.06) | 5.74 (−7.01, 18.50) | 1.70 (−11.23, 14.64) | −1.58 (−15.76, 12.60) | −6.97 (−20.08, 6.15) | −9.38 (−22.86, 4.11) | **−13.69 (−27.10, −0.28)** |
| 1.03 | 9.66 (−3.16, 22.48) | 4.90 (−8.25, 18.04) | 1.13 (−11.44, 13.71) | −2.80 (−16.51, 10.90) | −6.99 (−20.96, 6.97) | −12.67 (−26.06, 0.71) | **−14.79 (−28.28, −1.30)** |
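The exact way R_0 and R_1 enter the estimator is not reproduced in this excerpt, so the following is only a hedged sketch: assuming the two parameters act as multiplicative perturbations of the fitted control- and treated-arm outcome predictions in a G-computation-style estimate, a grid of τ_R values with percentile-bootstrap intervals, like Table 3 and the Figure 6 contours, could be produced as follows. `sensitivity_grid` and `bootstrap_ci` are hypothetical helper names.

```python
import numpy as np

def sensitivity_grid(mu0_hat, mu1_hat, r0_grid, r1_grid):
    """tau_R over a grid (rows: R1, columns: R0), under the assumed
    convention that R0 and R1 multiplicatively rescale the fitted
    control- and treated-arm outcome predictions."""
    grid = np.empty((len(r1_grid), len(r0_grid)))
    for i, r1 in enumerate(r1_grid):
        for j, r0 in enumerate(r0_grid):
            grid[i, j] = np.mean(r1 * mu1_hat - r0 * mu0_hat)
    return grid

def bootstrap_ci(mu0_hat, mu1_hat, r0, r1, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% interval for a single (R0, R1) cell."""
    rng = np.random.default_rng(seed)
    n = len(mu0_hat)
    taus = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample units with replacement
        taus[b] = np.mean(r1 * mu1_hat[idx] - r0 * mu0_hat[idx])
    return np.percentile(taus, [2.5, 97.5])

# Toy fitted predictions with a true effect of 3
rng = np.random.default_rng(3)
mu0 = rng.normal(10.0, 2.0, size=1000)
mu1 = mu0 + 3.0
r = np.array([0.97, 1.00, 1.03])
grid = sensitivity_grid(mu0, mu1, r, r)    # unperturbed centre cell is 3
lo, hi = bootstrap_ci(mu0, mu1, r0=1.0, r1=1.01)
```

As in Table 3, the conclusion is judged robust when τ_R keeps its sign, and its interval excludes zero, over the plausible range of perturbations.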
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Zhang, J.; Jin, Y.; Wang, X. Comparing Meta-Learners for Estimating Heterogeneous Treatment Effects and Conducting Sensitivity Analyses. Math. Comput. Appl. 2025, 30, 139. https://doi.org/10.3390/mca30060139