Article

A Meta-Learning Approach for Estimating Heterogeneous Treatment Effects Under Hölder Continuity

School of Statistics, Capital University of Economics and Business, Beijing 100070, China
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(11), 1739; https://doi.org/10.3390/math13111739
Submission received: 2 May 2025 / Revised: 17 May 2025 / Accepted: 23 May 2025 / Published: 24 May 2025
(This article belongs to the Special Issue Statistical Machine Learning: Models and Its Applications)

Abstract

Estimating heterogeneous treatment effects plays a vital role in many statistical applications, such as precision medicine and precision marketing. In this paper, we propose a novel meta-learner, termed RXlearner, for estimating the conditional average treatment effect (CATE) within the general framework of meta-algorithms. RXlearner enhances the weighting mechanism of the traditional Xlearner to improve estimation accuracy. We establish non-asymptotic error bounds for RXlearner under a continuity classification criterion, specifically assuming that the response function satisfies Hölder continuity. Moreover, we show that these bounds are achievable by selecting an appropriate base learner. The effectiveness of the proposed method is validated through extensive simulation studies and a real-world data experiment.

1. Introduction

Causal inference plays a pivotal role across a wide range of scientific disciplines. With the rapid advancement of big data technologies, researchers now have access to more information than ever before, enabling more accurate estimation of causal effects. A central task in this field is the estimation of heterogeneous treatment effects (HTEs), which seeks to quantify how treatment effects vary across individuals or subpopulations. This is particularly critical in domains such as precision medicine and targeted policy-making, where practitioners—such as doctors, policymakers, and researchers—aim to determine whether newly developed treatments or interventions produce the desired outcomes for different groups.
As research in causal inference has progressed, it has become increasingly clear that average treatment effects (ATEs) may mask significant heterogeneity at the individual level. As a result, growing attention has been devoted to the estimation of conditional average treatment effects (CATEs), which aim to capture individual-level causal effects conditional on observed covariates.
Early approaches to estimating CATEs include semi-parametric models, such as partial linear models [1,2] and additive models [3], as well as classical non-parametric methods [4]. In addition, several weighting-based methods have been proposed, including inverse probability weighting (IPW), augmented IPW (AIPW) [5,6,7], and propensity score optimization techniques [8]. These traditional methods are supported by mature theoretical foundations but often rely on restrictive modeling assumptions, which limit their flexibility in capturing complex real-world data structures.
With the rise of machine learning, researchers have developed a variety of flexible, data-driven methods for estimating CATEs. Ref. [9] introduced a model-free meta-learning algorithm, which was further extended by [10] through a general framework that includes the Slearner and Tlearner [11,12], and later the Xlearner. Other notable contributions include the Rlearner based on Robinson decomposition [13], Athey’s Causal Forest [14], and the Double Machine Learning (DML) framework [15], which utilizes Neyman orthogonality and cross-fitting to reduce sensitivity to nuisance parameters. The Doubly Robust Learner (DRL) [16] and the unifying framework proposed by [17]—which incorporates the L1 loss to enhance robustness—further broaden the scope of meta-learning-based estimators.
These modern approaches leverage the flexibility of machine learning to model complex functional relationships and have demonstrated strong empirical performance. However, despite their practical success, most theoretical analyses of these methods rely on strong smoothness assumptions, such as requiring the response function to lie in a reproducing kernel Hilbert space (RKHS) or satisfy Lipschitz continuity. While these assumptions facilitate theoretical derivations, they may not hold in many real-world applications.
Among the existing meta-learning methods, the Xlearner has demonstrated strong empirical performance, particularly in scenarios with covariate imbalance and unequal treatment assignment rates. Its design leverages different base learners for the treated and control groups, which allows it to adapt well to unbalanced datasets and heterogeneous response surfaces [10]. These features make the Xlearner one of the most widely used and practically effective methods in CATE estimation. However, despite its strengths, the Xlearner still faces two main challenges. First, as noted by [10], its reliance on propensity scores as weighting functions can limit its flexibility, particularly in complex or high-dimensional data settings. Second, most of its theoretical guarantees rely on the assumption that the underlying response functions are Lipschitz continuous, which may not hold in many practical scenarios where the functions are less smooth or exhibit discontinuities.
To address the limitations of fixed-weight designs and the error accumulation inherent in the Xlearner, while preserving its structural advantages, we propose a novel meta-learning method called RXlearner. This method incorporates a data-driven, covariate-dependent weighting mechanism that adaptively combines pseudo-treatment effect estimates. By doing so, RXlearner enhances the model’s flexibility in capturing complex response patterns and mitigates cumulative errors across estimation stages.
We rigorously establish the theoretical properties of RXlearner by deriving a non-asymptotic error bound under the assumption that the response function satisfies Hölder continuity, a milder and more general condition than those typically assumed in the existing literature. The effectiveness of RXlearner is further demonstrated through extensive simulation studies and a real-world application. While the inherent variability in the data makes the identification of a universally optimal estimator elusive, our results show that RXlearner delivers consistently competitive and robust performance across a broad range of scenarios.
The remainder of this paper is organized as follows. Section 2 introduces the model assumptions and presents the RXlearner algorithm. Section 3 develops the theoretical analysis of the estimator, where we derive non-asymptotic error bounds under Hölder continuity. To evaluate the practical effectiveness of the proposed method, Section 4 reports results from simulation studies, and Section 5 applies RXlearner to a real-world dataset. Section 6 concludes the paper with further discussions and directions for future research. Technical proofs are provided in Appendix A.

2. Methodology

2.1. Models and Assumptions

We consider the estimation of the conditional average treatment effect (CATE) under the potential outcomes framework of Neyman–Rubin [18,19]. Let $W \in \{0, 1\}$ be a binary treatment indicator, and for each unit $i$, let $Y_i(1)$ and $Y_i(0)$ denote the potential outcomes under treatment and control, respectively. Let $X_i \in \mathbb{R}^p$ represent the $p$-dimensional covariates. We assume that the data are generated independently from a distribution $\mathcal{P}$, such that
$$\left( Y_i(1), Y_i(0), X_i, W_i \right) \sim \mathcal{P}.$$
Under the Stable Unit Treatment Value Assumption (SUTVA) [20], the observed outcome is given by
$$Y_i = \mathbb{I}(W_i = 1)\, Y_i(1) + \mathbb{I}(W_i = 0)\, Y_i(0),$$
where $\mathbb{I}(\cdot)$ denotes the indicator function. The observed dataset is denoted as
$$\mathcal{D} = \left\{ (Y_i, X_i, W_i) \right\}_{i=1}^{N}.$$
We define the response functions as
$$\mu_1(x) = \mathbb{E}[Y(1) \mid X = x], \qquad \mu_0(x) = \mathbb{E}[Y(0) \mid X = x],$$
and the corresponding conditional average treatment effect (CATE) function is
$$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x] = \mu_1(x) - \mu_0(x).$$
To simulate observed outcomes, we further adopt an additive noise model
$$Y_i = \mu_{W_i}(X_i) + \sigma \epsilon_i.$$
The parameter $\sigma$ controls the noise level and is varied across simulation scenarios.
The propensity score, i.e., the probability of receiving treatment conditional on covariates, is defined as
$$e(x) = P(W = 1 \mid X = x).$$
Our goal is to estimate $\tau$ with an estimator $\hat\tau$, and to evaluate its performance under the expected mean squared error (EMSE), defined as
$$\mathrm{EMSE}(\mathcal{P}, \hat\tau) = \mathbb{E}\left[ \left( \tau(X) - \hat\tau(X) \right)^2 \right].$$
Here, we follow the EMSE definition proposed in [10], where $X \sim \Lambda$ and $X$ is assumed to be independent of $\hat\tau$. In our implementation, this assumption is addressed via a sample-splitting strategy: the estimator $\hat\tau$ is trained on one subset of the data, and the EMSE is evaluated on an independent test sample drawn from the same distribution. This design ensures that $\hat\tau$ is independent of $X$ at evaluation time. Consequently, $\Lambda$ corresponds to the marginal distribution of the evaluation sample.
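For concreteness, the sample-splitting evaluation of the EMSE can be sketched as follows. This is a minimal illustration in which the true CATE and the fitted estimator are hypothetical stand-in functions, and the expectation over $X \sim \Lambda$ is approximated by a Monte Carlo average over an independent test sample:

```python
import random

def emse(tau, tau_hat, x_eval):
    """Monte Carlo approximation of EMSE(P, tau_hat) over evaluation points."""
    errs = [(tau(x) - tau_hat(x)) ** 2 for x in x_eval]
    return sum(errs) / len(errs)

random.seed(0)
# Toy example: true CATE tau(x) = 2x; a hypothetical estimator fitted elsewhere
# (on a separate training split) that carries a constant bias of 0.1.
tau = lambda x: 2.0 * x
tau_hat = lambda x: 2.0 * x + 0.1
x_test = [random.random() for _ in range(1000)]  # independent evaluation sample
print(round(emse(tau, tau_hat, x_test), 3))      # -> 0.01 (the squared bias)
```

Because the evaluation points are drawn independently of the (already fitted) estimator, the average approximates the EMSE under the evaluation distribution $\Lambda$.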
To establish the theoretical properties of the estimator, we make the following standard assumptions.
Assumption 1
(Ignorability). The treatment assignment $W_i$ is independent of the potential outcomes $(Y_i(1), Y_i(0))$ conditional on the covariates $X_i$, i.e.,
$$(Y_i(1), Y_i(0)) \perp W_i \mid X_i.$$
Assumption 2
(Positivity). The propensity score is bounded away from 0 and 1, i.e., $e(x) \in (0, 1)$ for all $x \in \mathcal{X}$.
Assumption 3
(Conditionally Independent Errors). The errors are independent of treatment assignment given the covariates, i.e., $\epsilon_i \perp W_i \mid X_i$. We further assume that $\mathbb{E}[\epsilon_i \mid X_i] = 0$ and that the conditional variance of the errors exists.
Remark 1.
Assumption 3 states that the error term $\epsilon_i$ is conditionally independent of the treatment assignment $W_i$ given the covariates $X_i$. This assumption ensures that the estimation of nuisance functions (such as the outcome regressions and the imputed treatment effect differences in the first stage) is unbiased. It is a standard condition in many meta-learner frameworks, including the Xlearner. In our RXlearner, this assumption supports the consistency of the refined weighting strategy. While the assumption may be restrictive in practice, it provides a clean theoretical foundation. We acknowledge this limitation and leave its relaxation for future exploration.

2.2. Meta-Algorithms

In this section, we begin by reviewing representative models within the meta-learner framework for CATE estimation. Meta-learners are a class of model-agnostic methods that reduce the problem of causal effect estimation to a series of supervised learning tasks, enabling the use of flexible machine learning algorithms as base learners. We briefly describe three widely used approaches: the Slearner, Tlearner, and Xlearner, which form the foundation for our proposed RXlearner.
The Slearner estimates treatment effects using a single predictive model, where “S” stands for “single”. It incorporates the treatment indicator as one of the input features, treating it on an equal footing with other covariates. The response function is modeled as
$$\mu(x, w) = \mathbb{E}[Y \mid X = x, W = w],$$
where $w \in \{0, 1\}$.
The estimated conditional average treatment effect (CATE) at the covariate value $x$ is then given by
$$\hat\tau_S(x) = \hat\mu(x, 1) - \hat\mu(x, 0).$$
While the Slearner uses a single model incorporating the treatment indicator, the Tlearner fits two models independently for each treatment group.
The Tlearner estimates treatment effects by splitting the dataset into treated and control groups and fitting separate models to each subgroup. The “T” stands for “two”, reflecting the use of two distinct models, one for each treatment condition.
The response function for the treated group is modeled as
$$\mu_1(x) = \mathbb{E}[Y(1) \mid X = x],$$
and for the control group as
$$\mu_0(x) = \mathbb{E}[Y(0) \mid X = x].$$
The estimated CATE is then computed as the difference between the two fitted models:
$$\hat\tau_T(x) = \hat\mu_1(x) - \hat\mu_0(x).$$
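As an illustration, the Tlearner reduces to a few lines once a base learner is fixed. The sketch below uses a deliberately trivial base learner (a group mean) as a stand-in for any regression method, on a toy dataset constructed so that the treatment effect is roughly constant:

```python
def fit_mean(ys):
    """Trivial base learner: predict the sample mean (a stand-in for any regressor)."""
    m = sum(ys) / len(ys)
    return lambda x: m

# Toy data: tuples (x, w, y); treated outcomes sit roughly 1.1 above control.
data = [(0.1, 0, 1.0), (0.4, 0, 1.2), (0.6, 1, 2.1), (0.9, 1, 2.3)]

# Tlearner: fit one model per treatment arm, then take the difference of the fits.
mu1_hat = fit_mean([y for x, w, y in data if w == 1])
mu0_hat = fit_mean([y for x, w, y in data if w == 0])
tau_T = lambda x: mu1_hat(x) - mu0_hat(x)
print(round(tau_T(0.5), 2))   # -> 1.1

# An Slearner would instead fit a single model mu(x, w) on the pooled data
# and report mu_hat(x, 1) - mu_hat(x, 0).
```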
A limitation of the Tlearner is that it estimates treatment effects separately for each group without borrowing strength from the other. The Xlearner improves upon this by incorporating cross-group information through imputation.
The Xlearner builds upon the Tlearner and proceeds in three main steps. The first step mirrors the Tlearner: the response functions for the treated and control groups, μ 1 ( x ) and μ 0 ( x ) , are estimated as in Equations (2) and (3).
In the second step, pseudo-treatment effects are imputed by leveraging the estimated response functions from the opposite group. Specifically, the imputed treatment effects are computed as
$$\tilde D_i^1 := Y_i(1) - \hat\mu_0(X_i^1), \qquad \tilde D_i^0 := \hat\mu_1(X_i^0) - Y_i(0).$$
These imputed differences are then used to estimate the treatment effects conditional on covariates:
$$\tau_1(x) := \mathbb{E}[\tilde D^1 \mid X = x], \qquad \tau_0(x) := \mathbb{E}[\tilde D^0 \mid X = x].$$
If the response functions are correctly estimated, i.e., $\hat\mu_0 = \mu_0$ and $\hat\mu_1 = \mu_1$, then
$$\tau(x) = \mathbb{E}[\tilde D^1 \mid X = x] = \mathbb{E}[\tilde D^0 \mid X = x].$$
Any supervised learning or regression method can be employed to estimate τ ( x ) by regressing the imputed treatment effects on the covariates within each treatment arm. This yields two estimators: τ ^ 1 ( x ) from the treated group and τ ^ 0 ( x ) from the control group.
In the third step, the final CATE estimate is obtained by combining the two pseudo-effect estimates using a weighting function $g(x) \in [0, 1]$:
$$\hat\tau_X(x) = g(x)\, \hat\tau_0(x) + (1 - g(x))\, \hat\tau_1(x),$$
where $g(x)$ typically depends on the propensity score, e.g., $g(x) = e(x)$ as in [10].
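The three Xlearner steps can be sketched compactly. The snippet below is a toy illustration, not the paper's implementation: a 1-nearest-neighbour rule stands in for all base learners, and the weight is fixed at a constant propensity $g(x) = e(x) = 0.5$:

```python
def knn1(train, x):
    """1-nearest-neighbour regressor over (x, y) pairs: a stand-in base learner."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

treated = [(0.2, 2.0), (0.8, 2.6)]   # (x, y) pairs with w = 1
control = [(0.3, 1.0), (0.7, 1.4)]   # (x, y) pairs with w = 0

# Step 1: fit response surfaces separately in each arm.
mu1 = lambda x: knn1(treated, x)
mu0 = lambda x: knn1(control, x)

# Step 2: impute treatment effects using the opposite arm's fit,
# then regress them on the covariates within each arm.
D1 = [(x, y - mu0(x)) for x, y in treated]   # treated: Y - mu0_hat(X)
D0 = [(x, mu1(x) - y) for x, y in control]   # control: mu1_hat(X) - Y
tau1 = lambda x: knn1(D1, x)
tau0 = lambda x: knn1(D0, x)

# Step 3: combine with a weight g(x); here the constant propensity e(x) = 0.5.
g = lambda x: 0.5
tau_X = lambda x: g(x) * tau0(x) + (1 - g(x)) * tau1(x)
print(round(tau_X(0.25), 2))   # -> 1.0
```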

2.3. RXlearner

To improve upon the standard Xlearner framework while retaining its structural advantages, we propose a novel meta-learning method termed RXlearner (Refitting Xlearner). This method enhances the traditional weighting strategy by incorporating a data-driven mechanism to adaptively combine the two pseudo-treatment effect estimators, $\tau_0(x)$ and $\tau_1(x)$, based on covariate information.
The procedure begins with a standard Xlearner step, wherein we estimate τ 0 ( x ) and τ 1 ( x ) on the training set, as well as the propensity score e ( x ) . Using these quantities, we construct a pseudo-response variable:
$$\tau_r(x) := e(x)\, \tau_0(x) + (1 - e(x))\, \tau_1(x),$$
which serves as a proxy for the unobserved individual-level treatment effect.
Instead of relying on the fixed weight function $g(x) = e(x)$ as in the original Xlearner, we adopt a two-stage refitting strategy. In the first stage, we fit a powerful regression model (e.g., a random forest) to predict $\tau_r(x)$ from $x$, effectively capturing complex nonlinear relationships between covariates and treatment heterogeneity. However, direct use of this refitted model may lead to instability, especially when the learned $\tau_r(x)$ is noisy in regions with limited overlap.
To mitigate this, we employ a second stage that recasts the refitted predictions into a convex combination form, aligning with the Xlearner's structure. Specifically, we aim to learn a new weighting function $g(x)$ by minimizing the squared error between the first-stage prediction $\hat\tau_{\mathrm{refit}}(x)$ and a convex combination of $\tau_0(x)$ and $\tau_1(x)$:
$$\hat g(x) := \arg\min_{g} \sum_{x \in \mathcal{D}_{\mathrm{test}}} \left( \hat\tau_{\mathrm{refit}}(x) - \left[ g(x)\, \tau_0(x) + (1 - g(x))\, \tau_1(x) \right] \right)^2,$$
where $\hat\tau_{\mathrm{refit}}(x)$ denotes the pseudo-treatment effect from the refitting step, and $\mathcal{D}_{\mathrm{test}}$ denotes the test set.
To approximate g ^ ( x ) , we frame this optimization as a supervised regression problem and adopt gradient boosted regression trees (GBRT) [21] implemented via the xgboost package. GBRT minimizes squared loss through functional gradient descent, providing explicit convergence guarantees under standard conditions and offering strong empirical stability in practice.
The final estimate of the CATE is then given by the following:
$$\hat\tau(x) = \hat g(x)\, \tau_0(x) + (1 - \hat g(x))\, \tau_1(x).$$
Remark 2.
It is important to note that the learned weighting function $\hat g(x)$ is defined on the test set $\mathcal{D}_{\mathrm{test}}$, rather than on the entire covariate space $\mathcal{X}$. This is because the optimization objective in Equation (6) is constructed only over the test points.
Remark 3.
To construct pseudo-outcomes, we apply propensity-score-based weighting similar to the Xlearner framework [10], assigning weight $1 - \hat e(X)$ to treated units and $\hat e(X)$ to control units. However, our RXlearner differs from the standard Xlearner in that the weight function $g(x)$ is not fixed to $\hat e(x)$. Instead, it is refined through a second-stage regression step, where the optimal weights are learned in a data-driven manner by minimizing a squared loss objective (Equation (6)). This allows RXlearner to adaptively learn context-specific weights that may improve performance.
This two-stage design leverages the flexibility of machine learning models to fit pseudo-response values while preserving the interpretable structure of the Xlearner. Specifically, RXlearner can be viewed as a data-driven generalization of the Xlearner: instead of using the propensity score $e(x)$ as a fixed weighting function, it learns a flexible weight $g(x)$ by minimizing a squared loss with respect to a pseudo-target. This allows RXlearner to adaptively combine the two pseudo-treatment effect estimators $\tau_0(x)$ and $\tau_1(x)$, depending on their relative reliability across the covariate space. Such adaptivity improves estimation robustness and accuracy, especially in the presence of covariate imbalance or heterogeneous noise, making RXlearner well-suited for complex causal inference tasks.
In particular, when the optimization problem in (6) is solved with the fixed weight function $g(x) = e(x)$, RXlearner reduces exactly to the original Xlearner formulation. That is, if the refitted model $\hat\tau_{\mathrm{refit}}(x)$ perfectly recovers the pseudo-response $\tau_r(x) = e(x)\, \tau_0(x) + (1 - e(x))\, \tau_1(x)$, then the optimal solution to (6) is attained at $g(x) = e(x)$. This establishes the Xlearner as a special case of RXlearner.
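For a single test point, the inner minimization in (6) has a simple closed form that helps build intuition for what the learned weight does: the least-squares $g$ places the combination as close as possible to $\hat\tau_{\mathrm{refit}}$ along the segment between $\tau_1$ and $\tau_0$, clipped to $[0, 1]$. The paper fits $g(x)$ with GBRT over all test points; the pointwise solution below is only an illustrative analogue:

```python
def learn_weight(tau_refit, tau0, tau1):
    """Pointwise least-squares weight:
    argmin_g (tau_refit - [g*tau0 + (1-g)*tau1])^2, clipped to [0, 1].
    (Illustrative analogue of (6); the paper learns g(x) with GBRT instead.)"""
    if tau0 == tau1:
        return 0.5                       # any g is optimal; pick the midpoint
    g = (tau_refit - tau1) / (tau0 - tau1)
    return min(1.0, max(0.0, g))

# Hypothetical values at one test point x:
tau0_x, tau1_x, tau_refit_x = 1.0, 2.0, 1.25
g_x = learn_weight(tau_refit_x, tau0_x, tau1_x)
tau_hat_x = g_x * tau0_x + (1 - g_x) * tau1_x
print(g_x, tau_hat_x)   # -> 0.75 1.25
```

Note that when $\hat\tau_{\mathrm{refit}}$ lies outside the interval spanned by $\tau_0$ and $\tau_1$, the clipped weight snaps to 0 or 1, which is exactly the convex-combination constraint at work.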
To facilitate implementation, we summarize the RXlearner procedure in Algorithm 1 as follows.
Algorithm 1 RXlearner Algorithm
1: Input: Observed dataset $\mathcal{D} = \{(X_i, Y_i, W_i)\}_{i=1}^{n}$
2: Output: Estimated conditional average treatment effect $\hat\tau(x)$ on the test set
3: Step 0: Data Splitting
4: Randomly split $\mathcal{D}$ into a training set $\mathcal{D}_{\mathrm{train}}$ and a test set $\mathcal{D}_{\mathrm{test}}$
5: Step 1: Estimate Nuisance Functions on the Training Set
6: Use $\mathcal{D}_{\mathrm{train}}$ to estimate the following:
  • Propensity score $\hat e(x) = P(W = 1 \mid X = x)$
  • Conditional response functions $\hat\mu_0(x) = \mathbb{E}[Y \mid X = x, W = 0]$ and $\hat\mu_1(x) = \mathbb{E}[Y \mid X = x, W = 1]$
7: Step 2: Construct Pseudo-Treatment Effects on the Training Set
8: For units in the treated group: $\tau_1(x) := Y - \hat\mu_0(x)$
9: For units in the control group: $\tau_0(x) := \hat\mu_1(x) - Y$
10: Step 3: Construct the Pseudo-Target $\tau_r(x)$
11: Combine the above using the estimated propensity score:
    $\tau_r(x) := \hat e(x)\, \tau_0(x) + (1 - \hat e(x))\, \tau_1(x)$
12: Step 4: Learn the Weighting Function $\hat g(x)$ by (6)
13: Step 5: Compute the Final CATE on the Test Set
14: On $\mathcal{D}_{\mathrm{test}}$, compute:
    $\hat\tau(x) = \hat g(x)\, \hat\tau_0(x) + (1 - \hat g(x))\, \hat\tau_1(x)$
15: return $\hat\tau(x)$ for $x \in \mathcal{D}_{\mathrm{test}}$
Remark 4.
To ensure reproducibility, we specify the base regression models and tuning parameters used in our implementation of the RXlearner. For the pseudo-target refitting model $\hat\tau_r(x)$ in Step 3, we use a random forest regressor (R package: randomForest, ntree = 100). For learning the weighting function $\hat g(x)$ in Step 4, we adopt gradient boosted regression trees (GBRT) via the xgboost package, with squared error loss, nrounds = 100, max_depth = 3, and learning_rate = 0.1. These choices balance flexibility and convergence stability across all experimental settings.

3. Asymptotic Properties

Assuming that only the weighting function is altered, the convergence analysis of RXlearner largely parallels that of Xlearner. Reference [10] established the convergence rate of the Xlearner under the assumption that the response function satisfies Lipschitz continuity. In this work, we generalize this assumption to Hölder continuity and derive the corresponding convergence rate for the RXlearner under this broader condition.

3.1. Fundamental Definitions and Results

To proceed, we begin by reviewing several fundamental definitions and results in the minimax nonparametric regression literature. Definition 1 introduces the concept of Hölder continuity, while Definitions 2 and 3, and Lemma 1 are adapted from [22].
It is worth noting that Hölder continuity provides a broader notion of smoothness than Lipschitz continuity, allowing for a controlled, potentially nonlinear rate of change in the function.
Definition 1.
Let $f : X \to \mathbb{R}$ be a function defined on a metric space $X$. $f$ is referred to as Hölder continuous if there exist constants $L > 0$ and $\alpha \in (0, 1]$ such that for any $x_1, x_2 \in X$, the following inequality holds:
$$|f(x_1) - f(x_2)| \le L\, \|x_1 - x_2\|^{\alpha}.$$
The constant $L$ is called the Hölder constant, and $\alpha$ is called the Hölder exponent. When $\alpha = 1$, Hölder continuity reduces to Lipschitz continuity.
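A quick numerical check illustrates the gap between the two notions of smoothness: $f(x) = |x|^{2/3}$ satisfies the Hölder condition with $L = 1$ and $\alpha = 2/3$ on $[-1, 1]$, yet its Lipschitz (first-order difference) ratio is unbounded near the origin:

```python
import random

def holder_ratio(f, x1, x2, alpha):
    """|f(x1) - f(x2)| / |x1 - x2|^alpha for x1 != x2."""
    return abs(f(x1) - f(x2)) / abs(x1 - x2) ** alpha

f = lambda x: abs(x) ** (2 / 3)   # Hölder with alpha = 2/3, L = 1; not Lipschitz at 0

random.seed(1)
pairs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(10000)]
pairs = [(a, b) for a, b in pairs if a != b]
# The Hölder condition holds with L = 1 on all sampled pairs ...
assert all(holder_ratio(f, a, b, 2 / 3) <= 1 + 1e-9 for a, b in pairs)
# ... while the Lipschitz ratio (alpha = 1) blows up near the origin.
print(holder_ratio(f, 1e-6, 0.0, 1.0))   # ratio ~ 100; diverges as x -> 0
```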
Definition 2.
Let $p = k + \beta$ with $k \in \mathbb{N}_0$ and $0 < \beta \le 1$, and let $C > 0$. A function $f : \mathbb{R}^d \to \mathbb{R}$ is called $(p, C)$-smooth if, for every $\alpha = (\alpha_1, \ldots, \alpha_d)$ with $\alpha_i \in \mathbb{N}_0$ and $\sum_{j=1}^{d} \alpha_j = k$, the $k$-th order partial derivative $\partial^k f / (\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d})$ exists and satisfies
$$\left| \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) - \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \right| \le C \cdot \|x - z\|^{\beta} \quad \forall\, x, z \in \mathbb{R}^d.$$
Let $\mathcal{F}^{(p, C)}$ denote the collection of all $(p, C)$-smooth functions $f : \mathbb{R}^d \to \mathbb{R}$. When $k = 0$, the functions in $\mathcal{F}^{(p, C)}$ as defined in Definition 2 satisfy the Hölder continuity condition.
Definition 3.
Let $\mathcal{D}^{(p, C)}$ be a class of distributions of $(X, Y)$ such that:
1. The features $X_i$ are identically distributed on $[0, 1]^d$;
2. $Y = m(X) + N$, where $X$ and $N$ are independent, and $N$ follows a standard normal distribution;
3. $m \in \mathcal{F}^{(p, C)}$.
Lemma 1 gives a minimax lower bound on the rate of convergence for this class of distributions.
Lemma 1.
For the class $\mathcal{D}^{(p, C)}$, the sequence
$$a_n = n^{-\frac{2p}{2p + d}}$$
is a lower minimax rate of convergence. Specifically, for some constant $C_1$ independent of $C$,
$$\liminf_{n \to \infty}\; \inf_{m_n}\; \sup_{(X, Y) \in \mathcal{D}^{(p, C)}} \frac{\mathbb{E}\left\{ \|m_n - m\|^2 \right\}}{C^{\frac{2d}{2p + d}}\, n^{-\frac{2p}{2p + d}}} \ge C_1 > 0.$$

3.2. RXlearner Convergence Rate in General

To establish the convergence rate of the RXlearner, some preparatory work is required. We give the following definitions.
Definition 4.
Let $\mathcal{F}^{H}$ be a class of distributions of $(X, Y) \in [0, 1]^d \times \mathbb{R}$ such that:
1. The features $X_i$ are independent and identically distributed (i.i.d.) and uniformly distributed on $[0, 1]^d$;
2. The observed outcomes are $Y_i = \mu(X_i) + \varepsilon_i$, where each $\varepsilon_i$ is independent and follows a normal distribution with mean 0 and variance $\sigma^2$;
3. $X_i$ and $\varepsilon_i$ are independent;
4. The response function $\mu$ is Hölder continuous with parameters $L$ and $\alpha$.
Definition 5.
Let $\mathcal{D}^{H}_{mn}$ be a family of distributions of $(Y(0), Y(1), W, X) \in \mathbb{R}^N \times \mathbb{R}^N \times \{0, 1\}^N \times [0, 1]^{d \times N}$ such that:
1. $N = m + n$;
2. The features $X_i$ are i.i.d. and uniformly distributed on $[0, 1]^d$;
3. There are $n$ treated units, so that $\sum_i W_i = n$;
4. The observed outcomes are $Y_i(\omega) = \mu_\omega(X_i) + \varepsilon_{\omega i}$, where each $(\varepsilon_{0i}, \varepsilon_{1i})$ is independent and follows a normal distribution with mean 0 and marginal variance $\sigma^2$;
5. $X$, $W$, and $\varepsilon = (\varepsilon_{0i}, \varepsilon_{1i})$ are mutually independent;
6. The response functions $\mu_0(x)$ and $\mu_1(x)$ are Hölder continuous with parameters $L$ and $\alpha$.
Remark 5.
For a fixed $n$ with $0 < n < N$, we consider the distribution of $(X_i, Y_i, W_i)_{i=1}^{N}$ given that we observe $n$ treated units and $m = N - n$ control units. We denote this conditional distribution by $\mathcal{P}_{mn}$:
$$(X_i, Y_i, W_i)_{i=1}^{N} \,\Big|\, \textstyle\sum_{i=1}^{N} W_i = n \;\sim\; \mathcal{P}_{mn}.$$
We note that under $\mathcal{P}_{mn}$ the $(X_i, Y_i, W_i)$ are identically distributed, but not independent.
Next, we derive a lower bound on the best achievable convergence rate over $\mathcal{D}^{H}_{mn}$. The following theorem establishes the minimax lower bound on the rate of convergence for any estimator $\hat\tau_{mn}$ under Hölder continuity.
Theorem 1
(Minimax Lower Bound). Let $\hat\tau_{mn}$ be an arbitrary estimator of $\tau$ over the class $\mathcal{D}^{H}_{mn}$, and let $a_1, a_2 > 0$ and $c > 0$ be constants such that for all $m, n \ge 1$,
$$\sup_{\mathcal{P} \in \mathcal{D}^{H}_{mn}} \mathrm{EMSE}(\mathcal{P}, \hat\tau_{mn}) \le c \left( m^{-a_1} + n^{-a_2} \right);$$
then the convergence exponents must satisfy
$$a_1, a_2 \le \frac{2\alpha}{2\alpha + d}.$$
We prove through Theorem 1 that the optimal rate of the RXlearner is $O\!\left( n^{-2\alpha/(2\alpha + d)} + m^{-2\alpha/(2\alpha + d)} \right)$.
Proof of Theorem 1.
See Appendix A.1. □

3.3. The Convergence Rate of RXlearner When the Base Learner Is KNN

We now show that RXlearner attains the minimax lower bound by choosing KNN as the base learner.
Theorem 2.
Let $d > 2$, and assume $(X, W, Y(0), Y(1)) \sim \mathcal{P} \in \mathcal{D}^{H}_{mn}$, where the response functions $\mu_0(x)$ and $\mu_1(x)$ are Hölder continuous with parameters $L$ and $\alpha$, satisfying
$$|\mu_\omega(x) - \mu_\omega(z)| \le L\, \|x - z\|^{\alpha},$$
where $\omega \in \{0, 1\}$, $X \sim \mathrm{Unif}[0, 1]^d$, and $0.5 < \alpha < 1$.
Furthermore, let $\hat\tau_{mn}$ be the RXlearner constructed using the KNN base learner with the following specifications:
1. $g \equiv 0$;
2. The first-stage base learner $\hat\mu_0$ for the control group is a KNN estimator with $k_0 = \left( \sigma^2 / L^2 \right)^{\frac{d}{2\alpha + d}} m^{\frac{2\alpha}{2\alpha + d}}$;
3. The second-stage base learner $\hat\tau_1$ for the treated group is a KNN estimator with $k_1 = \left( \sigma^2 / L^2 \right)^{\frac{d}{2\alpha + d}} n^{\frac{2\alpha}{2\alpha + d}}$.
Then, $\hat\tau_{mn}$ attains the minimax optimal convergence rate stated in Theorem 1. Moreover, there exists a constant $C > 0$ such that
$$\mathbb{E}\, \|\tau - \hat\tau_{mn}\|^2 \le C\, \sigma^{4\alpha/(2\alpha + d)}\, L^{\frac{2d}{2\alpha + d}} \left( m^{-2\alpha/(2\alpha + d)} + n^{-2\alpha/(2\alpha + d)} \right).$$
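The neighbourhood sizes $k_0$ and $k_1$ prescribed by the theorem are straightforward to compute. The sketch below pairs a toy KNN regressor with the rate-optimal $k$; rounding $k$ to an integer is an implementation choice not specified by the theorem:

```python
def knn_predict(train, x, k):
    """k-nearest-neighbour regression: average the y-values of the k closest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def k_choice(n_units, sigma, L, d, alpha):
    """Rate-optimal neighbourhood size from Theorem 2:
    k = (sigma^2 / L^2)^{d/(2*alpha+d)} * n^{2*alpha/(2*alpha+d)},
    rounded to an integer (an implementation choice)."""
    return max(1, round((sigma ** 2 / L ** 2) ** (d / (2 * alpha + d))
                        * n_units ** (2 * alpha / (2 * alpha + d))))

# With sigma = L = 1, d = 3, alpha = 2/3, k grows like n^{4/13}.
print(k_choice(1000, 1.0, 1.0, 3, 2 / 3))   # -> 8
```

Note how slowly $k$ grows in the sample size: the sublinear exponent $2\alpha/(2\alpha + d)$ is precisely what balances the bias of averaging over a neighbourhood against the variance of the noise.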
Proof of Theorem 2.
See Appendix A.2. □

4. Simulation

In this section, we consider several settings for the data-generating process, where the generation of $X_i$ follows the approach of [17]. Here, $p$ denotes the dimension of the covariates:
$$X_i \sim N_p(0, \Sigma), \quad \mathrm{diag}(\Sigma) = 1, \quad \mathrm{Corr}(X_{ij}, X_{ik}) = \rho^{|j - k|}, \quad i = 1, \ldots, n,$$
$$e_i = e(x_i), \quad w_i \sim \mathrm{Bernoulli}(e_i),$$
$$b(X_i) = 0.5\left( \mu_0(X_i) + \mu_1(X_i) \right), \quad \tau(X_i) = \mu_1(X_i) - \mu_0(X_i),$$
$$\epsilon_i \sim N(0, 1), \quad y_i = b(X_i) + (w_i - 0.5)\, \tau(X_i) + \sigma \epsilon_i.$$
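The data-generating process above can be sketched directly. The snippet below is a simplified illustration using independent covariates (i.e., $\rho = 0$) and a Simulation-2-style pair of response surfaces; indices are zero-based, so $X_{i2}$ and $X_{i5}$ become x[1] and x[4]:

```python
import random

random.seed(42)

def mu0(x): return abs(x[4])                  # control surface (Simulation-2 style)
def mu1(x): return 0.5 * x[1] + abs(x[4])     # treated surface; tau(x) = 0.5 * x[1]

def generate(n, p=5, sigma=1.0, e=0.5):
    """One draw of {(X_i, w_i, y_i)} from the DGP, with rho = 0 for simplicity."""
    data = []
    for _ in range(n):
        x = [random.gauss(0, 1) for _ in range(p)]
        w = 1 if random.random() < e else 0
        b = 0.5 * (mu0(x) + mu1(x))
        tau = mu1(x) - mu0(x)
        y = b + (w - 0.5) * tau + sigma * random.gauss(0, 1)
        data.append((x, w, y))
    return data

data = generate(200)
print(len(data), sum(w for _, w, _ in data) / 200)  # treated fraction near e = 0.5
```

Plugging in $w_i = 1$ recovers $y_i = \mu_1(X_i) + \sigma\epsilon_i$ and $w_i = 0$ recovers $y_i = \mu_0(X_i) + \sigma\epsilon_i$, confirming that the $b + (w - 0.5)\tau$ parameterization is just a symmetric rewriting of the additive noise model.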
In the following simulations, we set the training sample sizes to $n \in \{200, 500, 1000, 2000, 5000\}$, with a fixed test set size of $n_t = 10^5$, and conduct $T = 100$ independent replications for each setting. An exception is Simulation 5, which involves a highly imbalanced treatment assignment. Since a training size of $n = 200$ results in an insufficient number of treated units, we consider only training sizes from 500 to 5000 in this case.
We use MSE to evaluate the performance of each estimator. With the number of simulation replication T, we have
$$\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} \left( \hat\tau^{(t)}(x_t) - \tau(x_t) \right)^2,$$
where $x_t$ denotes the observed test-set values in the $t$-th replicate, and $\hat\tau^{(t)}(x)$ is the estimate of $\tau(x)$ in the $t$-th replicate.
We present our simulation results in both tabular and graphical formats. In the tables, the best-performing method under each setting is highlighted in bold for ease of comparison.

4.1. Different Response Functions

To evaluate the performance of the proposed estimator and validate the effectiveness of the theoretical framework, we first examine the performance of various estimators under different rates of change in the response function.
Simulation 1: Hölder Continuous Response Function
$$\mu_0(X_i) = 0.5\, |X_{i4}|^{2/3},$$
$$\mu_1(X_i) = 0.5\, |X_{i4}|^{2/3} + \log\left( 1 + |X_{i5}| \right).$$
Simulation 2: Lipschitz Continuous Response Function
$$\mu_0(X_i) = |X_{i5}|,$$
$$\mu_1(X_i) = 0.5\, X_{i2} + |X_{i5}|.$$
Simulation 1 reflects the Hölder continuity structure (with exponent $\alpha = 2/3$), which aligns with the theoretical analysis in Section 3. We choose the form $\mu_0(X_i) = 0.5\, |X_{i4}|^{2/3}$ to introduce a moderate level of nonlinearity and limited smoothness, which is representative of real-world settings where the underlying structural functions are often non-differentiable or exhibit sharp curvature changes. Such Hölder-type structures have been observed in various domains, including economics and biology (e.g., Kleiber's law and related metabolic scaling theory; see [23]). This choice also allows us to examine the sensitivity of different estimators to non-smoothness. To further assess robustness under alternative smoothness conditions, we include a Lipschitz continuous setting in Simulation 2.
For these two types of simulations, we consider $\rho = 0.5$, $e(x) = 0.5$, $p \in \{5, 10\}$, and $\sigma \in \{0.5, 1, 2\}$.

4.2. Cases in Special Situations

We next examine three specially designed simulation settings to evaluate the robustness and adaptability of different estimators under practically challenging conditions. These scenarios are intended to assess the performance of our proposed RXlearner when key identification assumptions are weakened. Specifically, Simulation 3 introduces confounding by violating the unconfoundedness assumption to evaluate the estimator’s robustness under model misspecification. Simulation 4 focuses on the null treatment effect case, where the true conditional treatment effect is uniformly zero across all covariate profiles, highlighting the ability of each method to avoid spurious heterogeneity. Simulation 5 investigates an extremely unbalanced treatment assignment scenario, in which the treated group constitutes a very small proportion of the sample. Given that the Xlearner is known to perform well in such settings, we aim to examine whether RXlearner can inherit or improve upon this advantage. The response function μ 0 ( X i ) in Simulation 3 is constructed following the specification of [24].
Simulation 3: Confounding
$$e_i = \frac{1}{1 + \exp\left( X_{i2} + X_{i3} \right)},$$
$$\mu_0(X_i) = \sin(\pi X_{i1} X_{i2}) + 2\left( X_{i3} - 0.5 \right)^2 + X_{i4} + 0.5\, X_{i5},$$
$$\mu_1(X_i) = \mu_0(X_i) + 3 \cdot \mathbb{I}(X_{i2} > 0.1).$$
We consider $\rho = 0.5$, $p = 10$, and $\sigma = 1$.
Simulation 4: No Treatment
$$\mu_0(X_i) = \sin(\pi X_{i1} X_{i2}) + 2\left( X_{i3} - 0.5 \right)^2 + X_{i4} + 0.5\, X_{i5},$$
$$\mu_1(X_i) = \mu_0(X_i).$$
We consider $\rho = 0.5$, $e(x) = 0.5$, $p = 10$, and $\sigma = 1$.
Simulation 5: Extremely Unbalanced Data
$$e_i = 0.01,$$
$$\mu_0(X_i) = \sin(\pi X_{i1} X_{i2}) + 2\left( X_{i3} - 0.5 \right)^2 + X_{i4} + 0.5\, X_{i5},$$
$$\mu_1(X_i) = \mu_0(X_i) + 3 \cdot \mathbb{I}(X_{i2} > 0.1).$$
To evaluate the tendency of each method to spuriously detect heterogeneity when the true treatment effect is absent, we compute the false positive rate (FPR) as the proportion of units whose estimated CATE exceeds a fixed threshold:
$$\mathrm{FPR}(\delta) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left\{ |\hat\tau(X_i)| > \delta \right\},$$
where $\delta \in \{0.05, 0.2\}$ is a pre-specified constant. These thresholds reflect varying levels of tolerance for deviations from zero. The FPR results are reported in Table 1, while the MSE results of Simulations 1–5 are presented in Table 2, Table 3 and Table 4.
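The FPR criterion is a one-line computation. The values below are hypothetical CATE estimates under the null ($\tau \equiv 0$), used only to illustrate the thresholding:

```python
def fpr(tau_hat_values, delta):
    """False positive rate: share of units whose estimated CATE exceeds delta in magnitude."""
    return sum(1 for t in tau_hat_values if abs(t) > delta) / len(tau_hat_values)

# Hypothetical estimates when the true effect is zero everywhere:
est = [0.01, -0.04, 0.3, -0.25, 0.02, 0.06, -0.01, 0.15, 0.0, -0.5]
print(fpr(est, 0.05), fpr(est, 0.2))   # -> 0.5 0.3
```

As the example shows, loosening the threshold from $\delta = 0.05$ to $\delta = 0.2$ can substantially lower the reported FPR, which is why both tolerance levels are examined.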

4.3. Cases with Different Correlations and Noise

We further consider two scenarios: one concerning the impact of correlations among variables, and the other involving variations in the noise term.
Simulation 6: Different variable correlations and noise terms
$$\mu_0(X_i) = \sin(\pi X_{i1} X_{i2}) + 2\left( X_{i3} - 0.5 \right)^2 + X_{i4} + 0.5\, X_{i5},$$
$$\mu_1(X_i) = \mu_0(X_i) + 3 \cdot \mathbb{I}(X_{i2} > 0.1).$$
We consider $\rho \in \{0, 0.5\}$, $e(x) = 0.5$, $p = 10$, and $\sigma \in \{0.5, 1, 2\}$. The corresponding MSE results are summarized in Table 5.

4.4. Summary of the Results

To visually compare the performances of different methods across all simulation settings, the corresponding results are presented below in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7.
Based on the simulations discussed above and the empirical results shown in the figures and tables, we summarize the key findings as follows:
1. In Simulations 1 and 2, both the Xlearner and RXlearner outperform the Slearner and Tlearner, with RXlearner achieving the best overall performance. Simulation results consistently demonstrate the superior performance of RXlearner across various noise levels ($\sigma = 0.5, 1, 2$), covariate dimensions ($p = 5, 10$), and sample sizes ($n = 200$ to $5000$). As the sample size increases, all methods improve, but RXlearner benefits the most, achieving the lowest MSE in nearly all settings. Its adaptive weighting mechanism proves particularly effective under high-noise and small-sample scenarios, where fixed-weight methods like Xlearner tend to suffer. Notably, RXlearner maintains strong performance even in higher dimensions, highlighting its robustness and favorable convergence behavior compared to traditional meta-learners.
2. In Simulation 3, under confounded treatment assignment, the Tlearner exhibits the highest estimation error, followed by the Slearner. In contrast, the Xlearner and RXlearner maintain robust performance.
3. In Simulation 4, where the treatment effect is null, the Slearner performs the best, consistent with its underlying model assumptions. Furthermore, our quantitative evaluation of the false positive rate (FPR) under this setting reveals that the Slearner maintains the lowest FPR across varying thresholds and sample sizes, followed by RXlearner, while Xlearner and Tlearner tend to produce spurious heterogeneity more frequently. This underscores the importance of cautious model selection when the true effect is absent.
4. In Simulation 5, which involves extremely unbalanced treatment assignment, the RXlearner successfully inherits the strength of the Xlearner, while the Tlearner performs the worst due to its separate modeling for each treatment group.
5. In Simulation 6, the four learners exhibit similar performance across different levels of variable correlation. The estimation accuracy remains stable under noise levels $\sigma = 0.5$ and $\sigma = 1$, with only a slight deterioration observed when $\sigma = 2$.
In summary, the RXlearner consistently demonstrates the best performance across various scenarios. By enhancing the weighting mechanism of the Xlearner, RXlearner maintains equal or superior accuracy in nearly all settings.

5. Applications

To illustrate our proposed method, we analyze a large-scale Get-Out-the-Vote (GOTV) experiment, which is the same experiment used by [10] to test the Xlearner. This experiment investigates whether social pressure can be used to increase voter turnout in U.S. elections. The authors considered all voters who participated in the 2004 general election as registered voters, randomly selected a subset, and assigned them to either the treatment or control group. Households in the treatment group were sent a mailer with the message “DO YOUR CIVIC DUTY—VOTE!”, and the outcomes were observed during the 2006 primary election. We follow the method of [10] for simulation, but differ in the selection of covariates. While social pressure is typically transmitted through nearby neighbors, it is also influenced by the number of household members. Therefore, we include eight covariates, adding the number of household members as the eighth covariate, in contrast to [10], who used gender, age, and voting history in the 2000, 2002, and 2004 primary elections and the 2000 and 2002 general elections.
A common challenge in evaluating the accuracy of heterogeneous treatment effect estimators on real data is the lack of ground truth. To address this, we introduce synthetic treatment effects into the original dataset. We use the CATE estimates generated by the random-forest-based Slearner, Tlearner, Xlearner and RXlearner as the ground truth. This allows us to assign potential outcomes to each sample and create a complete dataset. We can then verify whether the different methods successfully recover the true effects and investigate whether the CATE estimates from different estimators significantly impact the results. We select 1000 and 10,000 samples from the full dataset as training sets, with the remaining data used as test sets. The proportion of treated and control groups in the selected samples matches the full dataset, with $P(W = 1) = 0.167$. Figure 8 and Figure 9 present the results of this experiment. We find that the CATE estimates from different methods have a relatively minor impact on overall model performance. However, in smaller samples ($n = 1000$), the Slearner outperforms the Tlearner, while the opposite is true for larger samples (n = 10,000).
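The semi-synthetic construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are ours (not from [10]), and the noise is assumed Gaussian.

```python
import numpy as np

def make_semi_synthetic(W, mu0_hat, tau_truth, sigma=1.0, rng=None):
    """Impose a fitted CATE surface as ground truth on real treatment indicators.

    W: observed treatment indicators; mu0_hat: fitted control response;
    tau_truth: CATE estimates from one meta-learner, taken as the true effect.
    Returns the factual outcomes of the completed dataset.
    """
    rng = np.random.default_rng(rng)
    y0 = mu0_hat + sigma * rng.standard_normal(len(W))  # synthetic control outcome
    y1 = y0 + tau_truth                                 # synthetic treated outcome
    return np.where(W == 1, y1, y0)                     # reveal only the factual arm

def cate_mse(tau_hat, tau_truth):
    """Evaluation criterion: MSE of an estimate against the synthetic truth."""
    tau_hat, tau_truth = np.asarray(tau_hat), np.asarray(tau_truth)
    return float(np.mean((tau_hat - tau_truth) ** 2))
```

Since the imposed $\tau$ is known exactly, each estimator can then be scored by `cate_mse` on the held-out test set.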
Notably, we observe that RXlearner achieves a significantly lower MSE compared to other methods when n = 1000 . For the other meta-learners, the performance trends are consistent with those reported in [10], where the performance gaps between methods become more pronounced in low-sample regimes. We believe that in this case, there is still room for improvement in how Xlearner selects the weighting function g ( x ) , and the optimization of g ( x ) in RXlearner leads to a more effective combination of pseudo-responses, resulting in superior performance.
Overall, the RXlearner demonstrates significantly better performance compared to the other estimators.

6. Conclusions

This paper reviews CATE estimation methods under the meta-learner framework, including the Slearner, Tlearner, and Xlearner. Additionally, we propose a new algorithm, the RXlearner. The RXlearner retains the advantages of the Xlearner in handling unbalanced data while being more robust and effective. We conduct extensive simulation studies and real-data experiments, demonstrating that the RXlearner performs excellently in most scenarios. The error bounds of the estimator vary depending on the continuity of the response function. We establish error bounds for the case where the response function is Hölder continuous and show that using KNN as the base learner can achieve these bounds.
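For intuition on why the kNN base learner attains the stated rate, the Theorem 2 bound can be balanced directly; a heuristic sketch (assuming $m \asymp n$ and considering only the treated-arm terms):

```latex
\min_{k_1}\left\{ \frac{\sigma^2}{k_1} + L^2 \Big( \frac{k_1}{n} \Big)^{2\alpha/d} \right\}
\;\asymp\; n^{-2\alpha/(2\alpha+d)},
\qquad \text{attained at } k_1 \asymp n^{2\alpha/(2\alpha+d)},
```

which matches the minimax exponent $a = 2\alpha/(2\alpha + d)$ appearing in Theorem 1.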
There are still many potential improvements for the RXlearner, such as incorporating ideas from DML by applying cross-fitting to (8) [25], or adopting a more direct approach by adding structure to (8), as explored in the works of [26,27]. Another promising direction is to incorporate regularization or complexity control into the second-stage optimization in order to further mitigate the risk of overfitting. We leave the extension of the RXlearner to future work.

Author Contributions

Methodology, Z.Z.; Software, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Youth Academic Innovation Team Construction project of Capital University of Economics and Business, grant number QNTD202303.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

Proof of Theorem 1.
Many ideas in this proof are inspired by [10]; however, our analysis is conducted under the weaker assumption of Hölder continuity. We proceed by contradiction.
Let $a = 2\alpha/(2\alpha + d)$, and suppose, for contradiction, that $a_2 > a$. Denote by $C^H$ the class of Hölder continuous functions $f : [0,1]^d \to \mathbb{R}$. For any $f_1 \in C^H$, define $D(f_1) \in \mathcal{D}_{mn}^H$ as the joint distribution where $\mu_0 = 0$, $\mu_1 = f_1$, and $\varepsilon_0 \stackrel{d}{=} \varepsilon_1$. Then, Equation (8) implies the following:
$c(m^{-a_1} + n^{-a_2}) \geq \sup_{P \in \mathcal{D}_{mn}^H} \mathbb{E}_{(D_0^m \times D_1^n) \sim P} \big[ \tau_P(X) - \hat{\tau}_{mn}(X; D_0^m, D_1^n) \big]^2 \geq \sup_{f_1 \in C^H} \mathbb{E}_{(D_0^m \times D_1^n) \sim D(f_1)} \big[ \mu_1^{D(f_1)}(X) - \hat{\tau}_{mn}(X; D_0^m, D_1^n) \big]^2.$
Let $P_0$ denote the marginal distribution of the control-group sample $D_0^m = \{ (X_i^0, Y_i^0) \}_{i=1}^m$ under $D(f_1)$. We then have
$c(m^{-a_1} + n^{-a_2}) \geq \sup_{f_1 \in C^H} \mathbb{E}_{D_1^n \sim D_1(f_1)} \mathbb{E}_{D_0^m \sim P_0} \big[ \mu_1^{D_1(f_1)}(X) - \hat{\tau}_{mn}(X; D_0^m, D_1^n) \big]^2 \geq \sup_{f_1 \in C^H} \mathbb{E}_{D_1^n \sim D_1(f_1)} \big[ \mu_1^{D_1(f_1)}(X) - \mathbb{E}_{D_0^m \sim P_0}\, \hat{\tau}_{mn}(X; D_0^m, D_1^n) \big]^2,$
where the last inequality follows from Jensen’s inequality.
Now, let $m_n$ be a sequence such that $m_n^{-a_1} + n^{-a_2} \leq 2 n^{-a_2}$. Define
$\hat{\mu}_1^n(x; D_1^n) := \mathbb{E}_{D_0^{m_n} \sim P_0^{m_n}}\, \hat{\tau}_{m_n n}(x; D_0^{m_n}, D_1^n),$
and observe that
$\{ D_1(f_1) : f_1 \in C^H \} = \{ P_1 : P_1 \in \mathcal{F}^H \}.$
Hence, we can derive
$2c\, n^{-a_2} \geq c(m_n^{-a_1} + n^{-a_2}) \geq \sup_{f_1 \in C^H} \mathbb{E}_{D_1^n \sim D_1(f_1)} \big[ \mu_1^{D_1(f_1)}(X) - \hat{\mu}_1^n(X; D_1^n) \big]^2 = \sup_{P_1 \in \mathcal{F}^H} \mathbb{E}_{D_1^n \sim P_1^n} \big[ \mu_1^{P_1}(X) - \hat{\mu}_1^n(X; D_1^n) \big]^2,$
which contradicts the minimax lower bound for nonparametric regression under Hölder continuity when $a_2 > a = 2\alpha/(2\alpha + d)$. This completes the proof. □

Appendix A.2

To prove Theorem 2, we first introduce the following two auxiliary results.
Lemma A1.
Let $x \in [0,1]^d$, and suppose $X_1, \dots, X_n \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}([0,1]^d)$ with $d > 2$. Let $\tilde{X}(x)$ denote the nearest neighbor of $x$ among $\{X_1, \dots, X_n\}$. Then there exists a constant $c > 0$, independent of $n$, such that
$\mathbb{E}\, \| \tilde{X}(x) - x \|^{2\alpha} \leq c\, n^{-2\alpha/d}.$
Proof of Lemma A1.
Observe that
$\mathbb{P}\big( \| \tilde{X}(x) - x \| \geq \delta \big) = \big( 1 - \mathbb{P}( \| X_1 - x \| \leq \delta ) \big)^n \leq \big( 1 - \tilde{c}\, \delta^d \big)^n \leq e^{-\tilde{c}\, \delta^d n},$
for some constant $\tilde{c} > 0$. Hence,
$\mathbb{E}\, \| \tilde{X}(x) - x \|^{2\alpha} = \int_0^\infty \mathbb{P}\big( \| \tilde{X}(x) - x \|^{2\alpha} \geq \delta \big)\, d\delta = \int_0^\infty \mathbb{P}\big( \| \tilde{X}(x) - x \| \geq \delta^{1/(2\alpha)} \big)\, d\delta \leq \int_0^\infty \exp\big( -\tilde{c}\, \delta^{d/(2\alpha)} n \big)\, d\delta = c\, n^{-2\alpha/d},$
for some constant c > 0 , which completes the proof. □
Lemma A2.
Let $\hat{\mu}_0^m$ be the k-nearest-neighbors (kNN) estimator with fixed $k_0$, computed using only the control group, and let $\hat{\mu}_1^n$ be the kNN estimator with fixed $k_1$, computed using only the treated group. Under the assumptions in Theorem 2, we have
$\mathbb{E}\big[ \| \hat{\mu}_1^n - \mu_1 \|^2 \big] \leq \dfrac{\sigma^2}{k_1} + c L^2 \Big( \dfrac{k_1}{n} \Big)^{2\alpha/d} \quad \text{and} \quad \mathbb{E}\big[ \| \hat{\mu}_0^m - \mu_0 \|^2 \big] \leq \dfrac{\sigma^2}{k_0} + c L^2 \Big( \dfrac{k_0}{m} \Big)^{2\alpha/d}.$
Proof of Lemma A2.
We provide the proof for the first bound; the argument for $\hat{\mu}_0^m$ is analogous. We decompose the mean squared error as follows:
$\mathbb{E}\big( \hat{\mu}_1^n(x) - \mu_1(x) \big)^2 = \mathbb{E}\big[ \hat{\mu}_1^n(x) - \mathbb{E}[ \hat{\mu}_1^n(x) \mid X_1, \dots, X_n ] \big]^2 + \mathbb{E}\big[ \mathbb{E}[ \hat{\mu}_1^n(x) \mid X_1, \dots, X_n ] - \mu_1(x) \big]^2 =: I_1 + I_2.$
For the variance term $I_1$, we have
$I_1 = \mathbb{E}\Big[ \frac{1}{k_1} \sum_{i=1}^{k_1} \big( Y_{(i,n)}(x) - \mu_1( X_{(i,n)}(x) ) \big) \Big]^2 = \frac{1}{k_1^2} \sum_{i=1}^{k_1} \mathbb{E}\, \mathrm{Var}\big( Y_{(i,n)}(x) \mid X_{(i,n)}(x) \big) = \frac{1}{k_1^2} \sum_{i=1}^{k_1} \mathbb{E}\, \sigma^2( X_{(i,n)}(x) ) \leq \frac{\sigma^2}{k_1}.$
For the squared bias term $I_2$, using the Hölder continuity of $\mu_1(\cdot)$, we have
$I_2 = \mathbb{E}\Big[ \frac{1}{k_1} \sum_{i=1}^{k_1} \big( \mu_1( X_{(i,n)}(x) ) - \mu_1(x) \big) \Big]^2 \leq \mathbb{E}\Big[ \frac{1}{k_1} \sum_{i=1}^{k_1} C\, \| X_{(i,n)}(x) - x \|^{\alpha} \Big]^2 = C^2\, \mathbb{E}\Big[ \frac{1}{k_1} \sum_{i=1}^{k_1} \| X_{(i,n)}(x) - x \|^{\alpha} \Big]^2.$
To bound this expression, let $N = k_1 \lfloor n / k_1 \rfloor$, and divide the dataset into $k_1 + 1$ disjoint subsets. Define $\tilde{X}_j^x$ to be the nearest neighbor of $x$ in the $j$-th subset. Then $\tilde{X}_1^x, \dots, \tilde{X}_{k_1}^x$ are independent and satisfy
$\sum_{i=1}^{k_1} \| X_{(i,n)}(x) - x \|^{\alpha} \leq \sum_{j=1}^{k_1} \| \tilde{X}_j^x - x \|^{\alpha}.$
By Jensen's inequality,
$I_2(x) \leq C^2\, \mathbb{E}\Big[ \frac{1}{k_1} \sum_{j=1}^{k_1} \| \tilde{X}_j^x - x \|^{\alpha} \Big]^2 \leq C^2\, \frac{1}{k_1} \sum_{j=1}^{k_1} \mathbb{E}\, \| \tilde{X}_j^x - x \|^{2\alpha} = C^2\, \mathbb{E}\, \| X_{(1, \lfloor n/k_1 \rfloor)}(x) - x \|^{2\alpha}.$
Integrating over the distribution of $x$ (with density $p(x)$) and applying Lemma A1, we obtain
$\frac{1}{C^2} \Big( \frac{n}{k_1} \Big)^{2\alpha/d} \int I_2(x)\, p(x)\, dx \leq \Big( \frac{n}{k_1} \Big)^{2\alpha/d} \mathbb{E}\, \| X_{(1, \lfloor n/k_1 \rfloor)}(X) - X \|^{2\alpha} \leq \mathrm{const}.$
Combining the bounds for I 1 and I 2 completes the proof. □
Proof of Theorem 2.
We begin by decomposing the RXlearner estimator $\hat{\tau}_1^{mn}(x)$ as
$\hat{\tau}_1^{mn}(x) = \frac{1}{k_1} \sum_{i=1}^{k_1} \Big[ Y_{(i,n)}^1(x) - \hat{\mu}_0^m\big( X_{(i,n)}^1(x) \big) \Big] = \hat{\mu}_1^n(x) - \frac{1}{k_1} \sum_{i=1}^{k_1} \hat{\mu}_0^m\big( X_{(i,n)}^1(x) \big),$
where the stage-one estimators are defined as
$\hat{\mu}_0^m(x) = \dfrac{1}{k_0} \sum_{j=1}^{k_0} Y_{(j,m)}^0(x), \qquad \hat{\mu}_1^n(x) = \dfrac{1}{k_1} \sum_{i=1}^{k_1} Y_{(i,n)}^1(x).$
We evaluate the mean squared error as follows:
$\mathbb{E}\big[ \tau(X) - \hat{\tau}_1^{mn}(X) \big]^2 = \mathbb{E}\Big[ \mu_1(X) - \mu_0(X) - \hat{\mu}_1^n(X) + \frac{1}{k_1} \sum_{i=1}^{k_1} \hat{\mu}_0^m\big( X_{(i,n)}^1(X) \big) \Big]^2 \leq 2\, \mathbb{E}\big[ \mu_1(X) - \hat{\mu}_1^n(X) \big]^2 + 2\, \mathbb{E}\Big[ \mu_0(X) - \frac{1}{k_1} \sum_{i=1}^{k_1} \hat{\mu}_0^m\big( X_{(i,n)}^1(X) \big) \Big]^2 =: 2 I_3 + 2 I_4.$
From Lemma A2, we have
$I_3 = \mathbb{E}\, \| \hat{\mu}_1^n - \mu_1 \|^2 \leq \dfrac{\sigma^2}{k_1} + c_1 L^2 \Big( \dfrac{k_1}{n} \Big)^{2\alpha/d}.$
To bound $I_4$, we use $(x + y)^2 \leq 2x^2 + 2y^2$ to decompose it as $I_4 \leq 2(a) + 2(b)$, where
$(a) = \mathbb{E}\Big[ \mu_0(X) - \frac{1}{k_1 k_0} \sum_{i=1}^{k_1} \sum_{j=1}^{k_0} \mu_0\big( X_{(j,m)}^0( X_{(i,n)}^1(X) ) \big) \Big]^2,$
$(b) = \mathbb{E}\Big[ \frac{1}{k_1 k_0} \sum_{i=1}^{k_1} \sum_{j=1}^{k_0} \mu_0\big( X_{(j,m)}^0( X_{(i,n)}^1(X) ) \big) - \frac{1}{k_1} \sum_{i=1}^{k_1} \hat{\mu}_0^m\big( X_{(i,n)}^1(X) \big) \Big]^2.$
Term $(b)$ can be bounded by
$(b) \leq \max_i \frac{1}{k_0^2} \sum_{j=1}^{k_0} \mathbb{E}\Big[ \mu_0\big( X_{(j,m)}^0( X_{(i,n)}^1(X) ) \big) - Y_{(j,m)}^0\big( X_{(i,n)}^1(X) \big) \Big]^2 \leq \frac{\sigma^2}{k_0},$
using the fact that $Y_{(j,m)}^0(x) \sim \mathcal{N}( \mu_0(x), \sigma^2 )$.
Term $(a)$ is bounded via Hölder continuity and Jensen's inequality:
$(a) \leq \mathbb{E}\Big[ \frac{1}{k_1 k_0} \sum_{i=1}^{k_1} \sum_{j=1}^{k_0} L\, \big\| X - X_{(j,m)}^0( X_{(i,n)}^1(X) ) \big\|^{\alpha} \Big]^2 \leq \frac{L^2}{k_1 k_0} \sum_{i=1}^{k_1} \sum_{j=1}^{k_0} \mathbb{E}\, \big\| X - X_{(j,m)}^0( X_{(i,n)}^1(X) ) \big\|^{2\alpha} \leq 2 L^2 \Big[ \frac{1}{k_1} \sum_{i=1}^{k_1} \mathbb{E}\, \big\| X - X_{(i,n)}^1(X) \big\|^{2\alpha} + \frac{1}{k_1 k_0} \sum_{i=1}^{k_1} \sum_{j=1}^{k_0} \mathbb{E}\, \big\| X_{(i,n)}^1(X) - X_{(j,m)}^0( X_{(i,n)}^1(X) ) \big\|^{2\alpha} \Big].$
Applying Lemma A1 to each of the above terms yields
$(a) \leq 2 \tilde{c} L^2 \Big[ \Big( \dfrac{k_1}{n} \Big)^{2\alpha/d} + \Big( \dfrac{k_0}{m} \Big)^{2\alpha/d} \Big].$
Putting all the pieces together, we obtain
$\mathbb{E}\big[ \tau(X) - \hat{\tau}_1^{mn}(X) \big]^2 \leq \frac{2 \sigma^2}{k_1} + 2 ( c_1 + 4 \tilde{c} ) L^2 \Big( \frac{k_1}{n} \Big)^{2\alpha/d} + \frac{4 \sigma^2}{k_0} + 8 \tilde{c} L^2 \Big( \frac{k_0}{m} \Big)^{2\alpha/d} \leq C \Big[ \frac{\sigma^2}{k_1} + L^2 \Big( \frac{k_1}{n} \Big)^{2\alpha/d} + \frac{\sigma^2}{k_0} + L^2 \Big( \frac{k_0}{m} \Big)^{2\alpha/d} \Big],$
for the constant $C = 2 \max\{ 2,\, c_1 + 4 \tilde{c},\, 4 \tilde{c} \}$. □
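To make the construction concrete, the treated-side estimator $\hat{\tau}_1^{mn}$ analyzed above can be sketched in a few lines. This is a minimal NumPy illustration with Euclidean distances, not the authors' implementation; it omits the control-side estimator and the weighting step of the full RXlearner.

```python
import numpy as np

def knn_fit(X_train, y_train, x, k):
    """kNN regression at x: mean response over the k nearest training points."""
    idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return y_train[idx].mean()

def tau1_knn(X0, Y0, X1, Y1, x, k0=5, k1=5):
    """Treated-side estimator: over the k1 treated neighbors of x, average
    Y^1 minus the control kNN fit imputed at those neighbors."""
    idx1 = np.argsort(np.linalg.norm(X1 - x, axis=1))[:k1]
    imputed0 = np.array([knn_fit(X0, Y0, xi, k0) for xi in X1[idx1]])
    return float((Y1[idx1] - imputed0).mean())
```

Note that $k_0$ and $k_1$ are held fixed here, matching the setting of Theorem 2, where the error bound is stated as a function of $(k_0, k_1, m, n)$.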

References

1. Robins, J.M.; Mark, S.D.; Newey, W.K. Estimating Exposure Effects by Modelling the Expectation of Exposure Conditional on Confounders. Biometrics 1992, 48, 479–495.
2. Robinson, P.M. Root-N-Consistent Semiparametric Regression. Econometrica 1988, 56, 931–954.
3. Ravikumar, P.; Lafferty, J.; Liu, H.; Wasserman, L. Sparse Additive Models. J. R. Stat. Soc. Ser. B Stat. Methodol. 2009, 71, 1009–1030.
4. Robins, J.; Li, L.; Tchetgen, E.; Vaart, A. Higher Order Influence Functions and Minimax Estimation of Nonlinear Functionals. J. Am. Stat. Assoc. 2008, 2, 335–421.
5. Horvitz, D.G.; Thompson, D.J. A Generalization of Sampling Without Replacement from a Finite Universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
6. Hirano, K.; Imbens, G.W.; Ridder, G. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score. Econometrica 2003, 71, 1161–1189.
7. Robins, J.M.; Rotnitzky, A. Semiparametric Efficiency in Multivariate Regression Models with Missing Data. J. Am. Stat. Assoc. 1995, 90, 122–129.
8. Imai, K.; Ratkovic, M. Covariate Balancing Propensity Score. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014, 76, 243–263.
9. Van der Laan, M.J. Statistical Inference for Variable Importance. Int. J. Biostat. 2006, 2, 1–20.
10. Künzel, S.R.; Sekhon, J.S.; Bickel, P.J.; Yu, B. Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning. Proc. Natl. Acad. Sci. USA 2019, 116, 4156–4165.
11. Athey, S.; Imbens, G.W. Machine Learning Methods for Estimating Heterogeneous Causal Effects. Stat 2015, 1050, 1–26.
12. Athey, S.; Imbens, G. Recursive Partitioning for Heterogeneous Causal Effects. Proc. Natl. Acad. Sci. USA 2016, 113, 7353–7360.
13. Nie, X.; Wager, S. Quasi-Oracle Estimation of Heterogeneous Treatment Effects. Biometrika 2021, 108, 299–319.
14. Athey, S.; Tibshirani, J.; Wager, S. Generalized Random Forests. Ann. Stat. 2019, 47, 1148–1178.
15. Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Hansen, C.; Newey, W.; Robins, J. Double/Debiased Machine Learning for Treatment and Structural Parameters. Econom. J. 2018, 21, C1–C68.
16. Kennedy, E.H. Towards Optimal Doubly Robust Estimation of Heterogeneous Causal Effects. Electron. J. Stat. 2023, 17, 3008–3049.
17. Li, R.; Wang, H.; Tu, W. Robust Estimation of Heterogeneous Treatment Effects Using Electronic Health Record Data. Stat. Med. 2021, 40, 2713–2752.
18. Neyman, J. Sur les Applications de la Théorie des Probabilités aux Expériences Agricoles: Essai des Principes. Rocz. Nauk Rol. 1923, 10, 1–51.
19. Rubin, D.B. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. J. Educ. Psychol. 1974, 66, 688.
20. Cox, D.R. The Interpretation of the Effects of Non-Additivity in the Latin Square. Biometrika 1958, 45, 69–73.
21. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232.
22. Györfi, L.; Kohler, M.; Krzyzak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer Science & Business Media: New York, NY, USA, 2006.
23. White, C.R.; Seymour, R.S. Mammalian Basal Metabolic Rate is Proportional to Body Mass^(2/3). Proc. Natl. Acad. Sci. USA 2003, 100, 4046–4049.
24. Friedman, J.H. Multivariate Adaptive Regression Splines. Ann. Stat. 1991, 19, 1–67.
25. Schick, A. On Asymptotically Efficient Estimation in Semiparametric Models. Ann. Stat. 1986, 14, 1139–1151.
26. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
27. Imai, K.; Ratkovic, M. Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation. Ann. Appl. Stat. 2013, 7, 443–470.
Figure 1. The results of Simulation 1 are presented, with p = 5.
Figure 2. The results of Simulation 1 are presented, with p = 10.
Figure 3. The results of Simulation 2 are presented, with p = 5.
Figure 4. The results of Simulation 2 are presented, with p = 10.
Figure 5. Results of Simulations 3–5.
Figure 6. The results of Simulation 6 are presented, with ρ = 0.
Figure 7. The results of Simulation 6 are presented, with ρ = 0.5.
Figure 8. The boxplots of MSE of estimated CATE for each method when the training sample size is n = 1000 , based on synthetic data generated using the fitted CATE from Slearner, Tlearner, Xlearner and RXlearner, respectively.
Figure 9. The boxplots of MSE for each method when the training sample size is n = 10,000, based on synthetic data generated using the fitted CATE from Slearner, Tlearner, Xlearner and RXlearner, respectively.
Table 1. False Positive Rates under Simulation 4 where the true CATE is zero.
δ      n      Xlearner   Slearner   Tlearner   RXlearner
0.05   200    0.927      0.553      0.957      0.906
       500    0.902      0.516      0.946      0.867
       1000   0.882      0.501      0.937      0.834
       2000   0.857      0.471      0.925      0.793
       5000   0.822      0.431      0.911      0.732
0.2    200    0.716      0.062      0.830      0.640
       500    0.626      0.049      0.786      0.509
       1000   0.557      0.034      0.752      0.416
       2000   0.478      0.024      0.709      0.315
       5000   0.378      0.016      0.658      0.201
Table 2. Simulation results of Simulation 1.
p    σ    n      Xlearner   Slearner   Tlearner   RXlearner
5    0.5  200    0.160      0.196      0.236      0.141
          500    0.117      0.160      0.184      0.103
          1000   0.090      0.140      0.150      0.078
          2000   0.071      0.126      0.126      0.061
          5000   0.054      0.111      0.104      0.045
     1    200    0.156      0.186      0.230      0.137
          500    0.113      0.155      0.179      0.100
          1000   0.091      0.142      0.152      0.079
          2000   0.072      0.127      0.128      0.062
          5000   0.054      0.110      0.103      0.044
     2    200    0.465      0.280      0.775      0.372
          500    0.345      0.305      0.629      0.264
          1000   0.278      0.338      0.536      0.209
          2000   0.233      0.377      0.464      0.171
          5000   0.175      0.379      0.380      0.123
10   0.5  200    0.081      0.124      0.090      0.083
          500    0.052      0.074      0.065      0.054
          1000   0.037      0.055      0.049      0.041
          2000   0.027      0.041      0.038      0.026
          5000   0.017      0.029      0.028      0.017
     1    200    0.156      0.186      0.230      0.137
          500    0.113      0.155      0.179      0.100
          1000   0.091      0.142      0.152      0.079
          2000   0.072      0.127      0.128      0.062
          5000   0.054      0.110      0.103      0.044
     2    200    0.515      0.283      0.832      0.419
          500    0.350      0.261      0.618      0.280
          1000   0.262      0.246      0.484      0.204
          2000   0.200      0.245      0.396      0.152
          5000   0.139      0.233      0.306      0.100
Table 3. Simulation results of Simulation 2.
p    σ    n      Xlearner   Slearner   Tlearner   RXlearner
5    0.5  200    0.119      0.172      0.173      0.115
          500    0.085      0.129      0.132      0.082
          1000   0.065      0.102      0.103      0.063
          2000   0.050      0.076      0.085      0.048
          5000   0.037      0.058      0.067      0.034
     1    200    0.166      0.186      0.264      0.147
          500    0.116      0.154      0.200      0.101
          1000   0.091      0.125      0.166      0.079
          2000   0.074      0.102      0.142      0.062
          5000   0.054      0.085      0.113      0.044
     2    200    0.502      0.249      0.860      0.402
          500    0.350      0.214      0.646      0.268
          1000   0.283      0.197      0.558      0.209
          2000   0.229      0.183      0.475      0.164
          5000   0.171      0.187      0.387      0.116
10   0.5  200    0.100      0.205      0.136      0.099
          500    0.061      0.166      0.093      0.061
          1000   0.041      0.116      0.068      0.041
          2000   0.029      0.082      0.053      0.029
          5000   0.018      0.052      0.038      0.018
     1    200    0.166      0.186      0.264      0.147
          500    0.116      0.154      0.200      0.101
          1000   0.091      0.125      0.166      0.079
          2000   0.074      0.102      0.142      0.062
          5000   0.054      0.085      0.113      0.044
     2    200    0.510      0.244      0.853      0.416
          500    0.348      0.219      0.648      0.272
          1000   0.250      0.202      0.500      0.187
          2000   0.194      0.185      0.421      0.141
          5000   0.137      0.151      0.330      0.094
Table 4. Simulation results of Simulations 3–5.
Simulation     n      Xlearner   Slearner   Tlearner   RXlearner
Simulation 3   200    1.596      2.467      4.715      1.416
               500    0.741      1.868      3.662      0.610
               1000   0.406      1.405      2.931      0.302
               2000   0.261      1.169      2.492      0.178
               5000   0.157      0.926      1.946      0.096
Simulation 4   200    0.435      0.011      1.307      0.239
               500    0.266      0.010      0.882      0.137
               1000   0.189      0.008      0.656      0.094
               2000   0.127      0.007      0.446      0.061
               5000   0.077      0.005      0.299      0.037
Simulation 5   500    3.220      3.509      9.430      3.199
               1000   2.101      3.331      8.115      2.079
               2000   1.332      3.227      7.546      1.307
               5000   0.647      2.766      5.296      0.623
Table 5. Simulation results of Simulation 6.
ρ     σ    n      Xlearner   Slearner   Tlearner   RXlearner
0     0.5  200    0.937      2.190      1.802      0.848
           500    0.427      1.247      1.050      0.359
           1000   0.231      0.754      0.729      0.172
           2000   0.131      0.456      0.489      0.087
           5000   0.079      0.286      0.343      0.047
      1    200    1.052      2.367      1.901      0.935
           500    0.489      1.342      1.196      0.392
           1000   0.279      0.802      0.827      0.201
           2000   0.175      0.525      0.594      0.114
           5000   0.106      0.318      0.411      0.063
      2    200    1.451      2.746      2.600      1.240
           500    0.779      1.683      1.679      0.622
           1000   0.489      1.082      1.245      0.347
           2000   0.315      0.717      0.929      0.204
           5000   0.206      0.463      0.676      0.121
0.5   0.5  200    0.941      2.068      1.848      0.811
           500    0.400      1.117      1.093      0.323
           1000   0.225      0.655      0.720      0.168
           2000   0.131      0.433      0.488      0.090
           5000   0.073      0.265      0.325      0.045
      1    200    0.995      2.081      1.882      0.839
           500    0.466      1.140      1.149      0.372
           1000   0.281      0.766      0.822      0.207
           2000   0.178      0.499      0.603      0.118
           5000   0.103      0.316      0.397      0.061
      2    200    1.486      2.450      2.672      1.239
           500    0.752      1.559      1.682      0.572
           1000   0.502      1.063      1.241      0.362
           2000   0.325      0.692      0.946      0.212
           5000   0.211      0.496      0.669      0.125

Zhao, Z.; Zhou, C. A Meta-Learning Approach for Estimating Heterogeneous Treatment Effects Under Hölder Continuity. Mathematics 2025, 13, 1739. https://doi.org/10.3390/math13111739
