1. Introduction
In today’s data-rich marketing environment, the ability to identify which customers will respond positively to specific marketing interventions has become a critical competitive advantage [1]. Marketing interventions—targeted actions such as email campaigns, personalized offers, or digital advertisements designed to influence customer behavior toward desired outcomes like purchases or engagement—represent the core “treatments” that marketers deploy to drive business results. Marketing uplift modeling, also known as incremental modeling or true-lift modeling, aims to measure the causal impact of these interventions by identifying which customer segments will be positively influenced by which specific marketing actions [2]. This approach focuses on estimating the incremental effect (or “uplift”) of an intervention compared to no intervention across diverse customer segments. Traditional uplift modeling approaches, however, predominantly rely on static assumptions that fail to capture the dynamic nature of customer responses and the evolving marketing landscape [3].
Marketing interventions typically produce heterogeneous effects across different customer segments, and these effects often evolve over time as customers adapt to marketing strategies [4]. For instance, customers who initially respond positively to email promotions may develop “promotion fatigue” and become less responsive over time [1]. Similarly, the effectiveness of marketing interventions can be influenced by competitive actions, changing customer preferences, and seasonal factors that are difficult to capture in static models [5]. Moreover, the effectiveness of marketing interventions exhibits symmetry properties in how different customer segments respond to various treatments, which can be exploited for more efficient modeling.
The limitations of traditional uplift modeling approaches can be categorized into three key challenges: (1) Static treatment effect estimation: conventional methods estimate treatment effects at a fixed point in time, failing to capture how customer responsiveness evolves dynamically [6]. (2) Limited adaptation capability: traditional models lack mechanisms to adapt to changing customer behavior patterns and market conditions [7]. (3) Inability to optimize sequential decisions: most uplift models focus on one-time interventions, rather than optimizing sequences of marketing actions over extended customer lifecycles [8]. These fundamental limitations significantly impair the effectiveness of marketing campaigns in real-world settings, leading to suboptimal resource allocation, diminished returns on marketing investments, and missed opportunities for building long-term customer relationships. A solution that addresses these limitations must be capable of the following: (1) capturing the dynamic evolution of treatment effects over time, (2) continuously adapting to changing customer behaviors, and (3) optimizing sequences of marketing decisions, rather than isolated interventions.
Recent advances in causal machine learning, particularly causal forests, have shown promise in estimating heterogeneous treatment effects with improved accuracy and interpretability [9]. Causal forests extend random forests to the domain of causal inference, enabling the estimation of conditional average treatment effects (CATEs) across different customer segments [10]. However, while causal forests excel at identifying heterogeneous treatment effects, they still maintain the static assumption that customer responses remain constant over time. Simultaneously, reinforcement learning (RL) has emerged as a powerful framework for sequential decision-making under uncertainty [11]. Deep reinforcement learning combines the representational power of deep neural networks with RL algorithms, enabling the optimization of complex decision policies in dynamic environments [12]. In marketing contexts, RL offers the potential to optimize sequences of interventions while adapting to changing customer responses [13].
This paper introduces a novel symmetry-preserving framework that integrates causal forests with deep reinforcement learning to address the limitations of traditional uplift modeling approaches. Our dynamic marketing uplift modeling framework leverages causal forests to estimate heterogeneous treatment effects across customer segments, which then inform a reinforcement learning agent that optimizes intervention strategies dynamically. The integration is facilitated through a counterfactual simulation environment that emulates diverse customer response patterns and an adaptive reward mechanism that captures both immediate and long-term intervention outcomes. The symmetry-preserving aspect of our framework ensures balanced representation and fair optimization across different customer segments, maintaining the fundamental symmetry properties inherent in causal inference while allowing for adaptive learning. Our approach directly addresses the three key limitations of traditional methods: (1) it captures dynamic treatment effects by continuously updating causal effect estimates as new data becomes available; (2) it adapts to changing customer behaviors through reinforcement learning that optimizes policies based on recent customer responses; and (3) it optimizes sequential decisions by modeling marketing interventions as a Markov decision process that accounts for the long-term impact of current actions.
The key contributions of our work include the following:
A unified framework that combines the strengths of causal forests in estimating heterogeneous treatment effects with the adaptive capabilities of deep reinforcement learning for marketing uplift modeling.
A counterfactual simulation methodology that enables the exploration of diverse intervention strategies without requiring costly real-world experiments.
An adaptive reward mechanism that balances short-term conversion goals with long-term customer value considerations.
Empirical validation of our approach on both simulated and real-world marketing campaign data, demonstrating significant improvements over traditional static uplift models.
Our empirical evaluations demonstrate that our approach significantly outperforms traditional static uplift models, achieving up to 27% improvement in targeting efficiency and an 18% increase in return on marketing investment. These results confirm that our dynamic, symmetry-preserving framework effectively addresses the limitations of static uplift models in real-world marketing scenarios.
The remainder of this paper is organized as follows: Section 2 reviews related work in uplift modeling, causal machine learning, and reinforcement learning for marketing. Section 3 introduces the problem setting and the background concepts needed to understand our work. Section 4 presents our proposed framework, detailing the integration of causal forests with deep reinforcement learning. Section 5 describes our experimental setup and the results of our empirical evaluation. Finally, Section 6 concludes with a summary of our findings and directions for future research.
3. Preliminaries
This section establishes the formal framework for dynamic marketing uplift modeling, introduces the key concepts and notations used throughout this paper, and formulates the problem of optimizing marketing interventions as a causal reinforcement learning problem.
3.1. Problem Setting
We consider a marketing environment where a firm interacts with a population of customers over time. Let $\mathcal{I} = \{1, \ldots, N\}$ represent the set of customers. For each customer, $i \in \mathcal{I}$, the firm observes a set of features, $x_i$, that describe the customer’s characteristics, such as demographics, purchase history, browsing behavior, and past interactions with marketing campaigns.

The firm has at its disposal a set of marketing interventions (or treatments), $\mathcal{A} = \{0, 1, \ldots, K\}$, where 0 represents the control condition (no intervention), and $1, \ldots, K$ represent different marketing actions (e.g., email, mobile push notification, and discount offers). At discrete time points, $t = 1, 2, \ldots, T$, the firm decides which intervention, $a_{i,t} \in \mathcal{A}$, to apply to each customer, $i$.

After receiving an intervention, each customer generates an outcome (e.g., purchase, click, and engagement), which may be influenced by the intervention. The observed outcome is denoted as $y_{i,t}$, representing the outcome under the applied intervention. Importantly, for each customer at each time point, we only observe the outcome under the applied intervention and not under alternative interventions.

The customer’s features may evolve over time based on their interactions with the firm and external factors. We denote the customer’s features at time $t$ as $x_{i,t}$, which includes both static features (e.g., demographics) and dynamic features (e.g., recent purchase behavior and responsiveness to previous interventions).
3.2. Causal Inference Framework
Following the potential outcomes framework [47], we denote by $Y_{i,t}(a)$ the potential outcome that would be observed for customer $i$ at time $t$ if intervention $a \in \mathcal{A}$ were applied. Under the fundamental problem of causal inference, only one of these potential outcomes is observed for each customer at each time point—the one corresponding to the intervention that was actually applied.

The causal effect of intervention $a$ relative to the control condition for customer $i$ at time $t$ is defined as the difference in potential outcomes:

$$\tau_{i,t}(a) = Y_{i,t}(a) - Y_{i,t}(0).$$

This quantity, known as the individual treatment effect (ITE), represents the incremental impact of intervention $a$ on the outcome of interest. In marketing contexts, this is often referred to as the uplift for customer $i$ under intervention $a$ at time $t$.

Since we cannot observe all potential outcomes for a customer, we aim to estimate the conditional average treatment effect (CATE), which is the expected treatment effect conditional on customer features:

$$\tau_t(a, x) = \mathbb{E}\left[Y_{i,t}(a) - Y_{i,t}(0) \mid x_{i,t} = x\right].$$

Traditional uplift modeling approaches typically estimate CATE as a static function of customer features, ignoring the temporal dynamics and sequential nature of marketing interventions. The conditional average treatment effect estimation can be viewed through the lens of symmetry, where the balanced representation of treatment and control conditions is essential for unbiased causal inference. In contrast, our approach explicitly models how treatment effects evolve over time and how they are influenced by sequences of interventions.
3.3. Markov Decision Process Formulation
To capture the sequential and dynamic nature of marketing interventions, we formulate the problem as a Markov decision process (MDP) [48]. An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where the following applies:
$\mathcal{S}$ is the state space, which, in our context, represents the space of customer states. Each state, $s_{i,t} \in \mathcal{S}$, encapsulates the information about customer $i$ at time $t$, including their features, $x_{i,t}$, and relevant historical information.
$\mathcal{A}$ is the action space, corresponding to the set of available marketing interventions.
$P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function, where $P(s' \mid s, a)$ represents the probability of transitioning to state $s'$, given that action $a$ is taken in state $s$. In our context, this models how customer states evolve in response to marketing interventions.
$R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, where $R(s, a)$ represents the expected immediate reward of taking action $a$ in state $s$. In marketing contexts, rewards can be defined based on various business objectives, such as conversion rates, customer lifetime value, or returns on marketing investment.
$\gamma \in [0, 1)$ is the discount factor, which balances the importance of immediate versus future rewards.
A policy, $\pi: \mathcal{S} \to \mathcal{A}$, is a map from states to actions, specifying which marketing intervention to apply in each customer state. The goal is to find an optimal policy, $\pi^*$, that maximizes the expected cumulative discounted reward:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R\big(s_t, \pi(s_t)\big)\right].$$
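To make this objective concrete, the following sketch estimates the expected cumulative discounted reward of a fixed policy by Monte Carlo rollouts in a toy customer environment. The environment dynamics, feature dimensions, and intervention costs are illustrative assumptions for exposition only, not part of our framework.

```python
import numpy as np

class ToyCustomerEnv:
    """Hypothetical customer environment: 2-dimensional state, 3 interventions."""
    costs = np.array([0.0, 1.0, 2.0])  # control, email, discount offer (assumed costs)

    def reset(self, rng):
        return rng.normal(size=2)  # initial customer features

    def step(self, state, action, rng):
        # Conversion probability rises with feature 0 and with stronger interventions.
        p = 1.0 / (1.0 + np.exp(-(0.5 * state[0] + 0.3 * action)))
        reward = 10.0 * rng.binomial(1, p) - self.costs[action]
        next_state = 0.9 * state + rng.normal(scale=0.1, size=2)
        return next_state, reward


def rollout_return(env, policy, gamma=0.95, horizon=12, n_episodes=500, seed=0):
    """Monte Carlo estimate of E[sum_t gamma^t R(s_t, pi(s_t))] under a fixed policy."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        state, ret, discount = env.reset(rng), 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = env.step(state, action, rng)
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / n_episodes


# Example: always sending the email (action 1) versus never intervening (action 0).
print(rollout_return(ToyCustomerEnv(), policy=lambda s: 1),
      rollout_return(ToyCustomerEnv(), policy=lambda s: 0))
```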
3.4. Challenges in Dynamic Uplift Modeling
Several challenges arise when modeling dynamic marketing uplift effects as a causal reinforcement learning problem.
Delayed effects: Marketing interventions may have both immediate and delayed effects on customer behavior. For instance, a promotional email might not lead to an immediate purchase but could influence the customer’s purchasing decision in the future.
Carryover effects: The effect of a marketing intervention may persist over time and influence the customer’s response to future interventions. This creates dependencies between sequential interventions that need to be modeled explicitly.
Adaptation effects: Customers may adapt their behavior in response to repeated marketing interventions. For example, a customer might initially respond positively to promotional emails but develop “promotion fatigue” over time, leading to decreased responsiveness.
Exploration–exploitation tradeoff: When marketing interventions are optimized, there is a fundamental tradeoff between exploiting known effective strategies and exploring new strategies to gather more information about customer responses.
Counterfactual estimation: To evaluate the causal effect of a marketing intervention, we need to estimate what would have happened if a different intervention had been applied, which requires addressing the counterfactual estimation problem.
Our proposed framework addresses these challenges by integrating causal forests for heterogeneous treatment effect estimation with deep reinforcement learning for dynamic policy optimization. The next section details our methodology and explains how it overcomes these challenges to enable effective dynamic marketing uplift modeling.
4. Methodology
In this section, we present our integrated framework for dynamic marketing uplift modeling. As illustrated in
Figure 1, our approach combines causal forests for heterogeneous treatment effect estimation with deep reinforcement learning for dynamic policy optimization. The framework consists of three key components: (1) a causal forest model for estimating conditional average treatment effects (CATEs), (2) a counterfactual simulation environment that emulates diverse customer response patterns, and (3) a deep reinforcement learning agent that optimizes intervention strategies dynamically. Our approach maintains symmetry in the exploration–exploitation tradeoff, ensuring balanced learning across customer segments and intervention types.
4.1. Causal Forest for Heterogeneous Treatment Effect Estimation
The first component of our framework is a causal forest model that estimates heterogeneous treatment effects across different customer segments. Causal forests extend random forests to the domain of causal inference, enabling the estimation of treatment effects while controlling for potential confounding factors.
4.1.1. Model Formulation
Following the potential outcomes framework, we define the individual treatment effect (ITE) for customer $i$ at time $t$ under intervention $a$ as follows:

$$\tau_{i,t}(a) = Y_{i,t}(a) - Y_{i,t}(0),$$

where $Y_{i,t}(a)$ is the potential outcome if intervention $a$ is applied, and $Y_{i,t}(0)$ is the potential outcome under the control condition (no intervention). Since we can only observe one of these potential outcomes for each customer at each time point, we aim to estimate the conditional average treatment effect (CATE):

$$\tau_t(a, x) = \mathbb{E}\left[Y_{i,t}(a) - Y_{i,t}(0) \mid x_{i,t} = x\right].$$

To estimate the CATE, we employ a causal forest model that adapts the splitting criteria of traditional random forests to maximize the heterogeneity in treatment effects across leaves. Specifically, for each tree in the forest, the splitting criterion at each node is based on maximizing the difference in treatment effects between the resulting child nodes.
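As a conceptual illustration of this splitting criterion (not the internals of any particular causal forest implementation), the sketch below scores a candidate split by the squared gap between difference-in-means uplift estimates in the two child nodes; the variable names are hypothetical.

```python
import numpy as np

def naive_effect(y, w):
    """Difference-in-means treatment-effect estimate (w = 1 treated, w = 0 control)."""
    return y[w == 1].mean() - y[w == 0].mean()

def split_heterogeneity_score(x_col, threshold, y, w):
    """Score a candidate split by the squared gap in child-node treatment effects.

    A heterogeneity-maximizing criterion prefers splits whose children exhibit
    very different estimated uplifts.
    """
    left = x_col <= threshold
    right = ~left
    # Require both treated and control units on each side to estimate an effect.
    for mask in (left, right):
        if w[mask].sum() == 0 or (1 - w[mask]).sum() == 0:
            return -np.inf
    tau_left = naive_effect(y[left], w[left])
    tau_right = naive_effect(y[right], w[right])
    return (tau_left - tau_right) ** 2
```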
Figure 1.
Overview of the proposed framework for dynamic marketing uplift modeling. The framework integrates causal forests for heterogeneous treatment effect estimation with deep reinforcement learning for dynamic policy optimization. The causal forest model estimates conditional average treatment effects (CATE) based on historical data, which inform the initial policy of the reinforcement learning agent. The counterfactual simulation environment emulates diverse customer response patterns, enabling the agent to explore and optimize intervention strategies without costly real-world experiments. The reinforcement learning agent continuously refines its policy based on observed outcomes and estimated treatment effects, adapting to changing customer behaviors over time.
4.1.2. Feature Engineering
To capture the dynamic aspects of customer responses, we incorporate both static and dynamic features in our causal forest model:
Static features, $x_i^{\mathrm{s}}$, include customer demographics, acquisition channels, and other time-invariant characteristics. Dynamic features, $x_{i,t}^{\mathrm{d}}$, capture the customer’s recent behavior and interactions, including their recent purchase history (e.g., recency, frequency, and monetary value), engagement metrics (e.g., email open rates and click-through rates), responsiveness to previous interventions (e.g., conversion rates for past campaigns), temporal patterns (e.g., seasonality and time since last purchase), and interaction features that capture the dependencies between different feature dimensions. Additionally, we incorporate treatment history features that capture the sequence and timing of previous interventions:

$$\big\{(a_{i,t-j}, \Delta t_{i,j})\big\}_{j=1}^{J},$$

where $a_{i,t-j}$ represents the intervention applied to customer $i$ at time $t-j$, and $\Delta t_{i,j}$ represents the time elapsed since the $j$-th previous intervention.
4.1.3. Honesty and Sample Splitting
To ensure an unbiased estimation of treatment effects, we employ the honest estimation approach proposed by [9]. Honest estimation involves splitting the training data into two subsamples: one used for tree construction (determining splits) and the other for estimation (computing leaf node predictions). This approach reduces the adaptive estimation bias that can arise when the same data are used for both split selection and leaf node estimation.
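A minimal sketch of honest sample splitting, assuming the training data are indexed by integer positions: one half determines the tree structure, and the disjoint half provides the leaf-level effect estimates.

```python
import numpy as np

def honest_split(n_samples, seed=0):
    """Split sample indices into a structure half and an estimation half."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    half = n_samples // 2
    structure_idx, estimation_idx = idx[:half], idx[half:]
    return structure_idx, estimation_idx

# Usage: trees are grown on (X[structure_idx], w[structure_idx], y[structure_idx]);
# leaf-level treatment effects are then re-estimated on the held-out estimation_idx sample.
```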
4.1.4. Multi-Treatment Extension
To handle multiple treatment arms in marketing campaigns, we extend the causal forest model to estimate treatment effects for each intervention relative to the control condition. For each treatment, $a \in \{1, \ldots, K\}$, we train a separate causal forest model, $\mathrm{CF}_a$, that estimates $\tau_t(a, x)$. Alternatively, we can employ a multi-treatment causal forest approach that simultaneously estimates treatment effects for all interventions within a single model.
4.1.5. Time-Varying Treatment Effects
To capture how treatment effects evolve over time, we incorporate time-varying components in our causal forest model. Specifically, we estimate treatment effects conditionally on time features:

$$\tau(a, x, z_t) = \mathbb{E}\left[Y_{i,t}(a) - Y_{i,t}(0) \mid x_{i,t} = x, z_t\right],$$

where $z_t$ represents time-specific features such as day of week, month, or time since the start of the campaign. This approach allows the model to capture temporal patterns in treatment effects, such as diminishing returns or seasonal variations.
4.2. Counterfactual Simulation Environment
The second component of our framework is a counterfactual simulation environment that emulates diverse customer response patterns. This environment serves as a testbed for evaluating and optimizing intervention strategies without requiring costly real-world experiments.
4.2.1. Response Model
The core of the counterfactual simulation environment is a response model that predicts customer outcomes under different interventions. The simulation environment preserves the symmetry between observed and potential outcomes, maintaining the fundamental symmetry properties inherent in causal inference frameworks. We model the response function as follows:

$$y_{i,t} = f_0(x_{i,t}) + f_{a_{i,t}}(x_{i,t}, h_{i,t}) + \epsilon_{i,t},$$

where the following applies: $f_0(x_{i,t})$ is the baseline response function, representing the expected outcome under the control condition (no intervention); $f_a(x_{i,t}, h_{i,t})$ is the treatment effect function for intervention $a$, which depends on both the current features, $x_{i,t}$, and the treatment history, $h_{i,t}$; and $\epsilon_{i,t}$ is a random noise term that captures unobserved factors affecting the outcome.

The treatment history, $h_{i,t}$, includes the sequence and timing of previous interventions:

$$h_{i,t} = \big\{(a_{i,j}, t_j, y_{i,j}) : j < t\big\},$$

where $(a_{i,j}, t_j, y_{i,j})$ represents the intervention applied, the time point, and the observed outcome for customer $i$ at time $j$.
4.2.2. Modeling Dynamic Effects
To capture the dynamic aspects of customer responses, we incorporate several types of effects in the treatment effect function, $f_a(x_{i,t}, h_{i,t})$:
Carryover Effects
Carryover effects represent how the impact of an intervention persists over time. We model carryover effects using an exponential decay function:

$$c_a(\Delta t) = \beta_a \, e^{-\lambda_c \Delta t},$$

where $\beta_a$ represents the initial impact of intervention $a$, $\Delta t$ is the time elapsed since the intervention was applied, and $\lambda_c$ controls the decay rate.
Adaptation Effects
Adaptation effects capture how customers’ responsiveness changes with repeated exposures to the same intervention. We model adaptation effects as follows:

$$d_a(n_{i,a,t}) = \alpha_a \, e^{-\lambda_d \, n_{i,a,t}},$$

where $n_{i,a,t}$ is the number of times intervention $a$ has been applied to customer $i$ before time $t$, $\alpha_a$ is the initial responsiveness, and $\lambda_d$ controls the rate of adaptation.
Interaction Effects
Interaction effects capture how the impact of an intervention depends on previous interventions. We model interaction effects using a matrix, $M \in \mathbb{R}^{K \times K}$, where $M_{a,b}$ represents the interaction effect between interventions $a$ and $b$:

$$m_a(h_{i,t}) = \sum_{(b,\, t_j,\, y_j) \in h_{i,t}} M_{a,b} \, e^{-\lambda_m (t - t_j)},$$
where $\lambda_m$ controls the decay rate of interaction effects. The overall treatment effect function combines these dynamic effects:

$$f_a(x_{i,t}, h_{i,t}) = \hat{\tau}(a, x_{i,t}) \cdot d_a(n_{i,a,t}) + c_a(\Delta t_a) + m_a(h_{i,t}),$$

where $\hat{\tau}(a, x_{i,t})$ is the base treatment effect estimated by the causal forest model, and $\Delta t_a$ is the time elapsed since intervention $a$ was last applied.
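The sketch below illustrates how these dynamic components can be combined with the base CATE estimate. The parameter values and the exact way the terms are composed are assumptions for exposition, not the calibrated values used in our experiments.

```python
import numpy as np

def carryover(beta_a, lam_c, dt):
    """Exponentially decaying residual impact of the last exposure to intervention a."""
    return beta_a * np.exp(-lam_c * dt)

def adaptation(alpha_a, lam_d, n_prev):
    """Responsiveness multiplier that shrinks with repeated exposures (fatigue)."""
    return alpha_a * np.exp(-lam_d * n_prev)

def interaction(M, a, history, lam_m, t_now):
    """Decayed pairwise interaction between intervention a and past interventions b."""
    return sum(M[a, b] * np.exp(-lam_m * (t_now - t_b)) for b, t_b in history)

def dynamic_effect(tau_base, a, history, t_now, params):
    """Combine the base CATE with adaptation, carryover, and interaction terms."""
    n_prev = sum(1 for b, _ in history if b == a)
    dt_last = min((t_now - t_b for b, t_b in history if b == a), default=np.inf)
    return (tau_base * adaptation(params["alpha"][a], params["lam_d"], n_prev)
            + carryover(params["beta"][a], params["lam_c"], dt_last)
            + interaction(params["M"], a, history, params["lam_m"], t_now))

# Example with made-up parameters for two treatment arms; the customer already
# received arm 0 at times 1.0 and 3.0.
params = {"alpha": [1.0, 1.0], "beta": [0.05, 0.08],
          "M": np.array([[0.0, 0.02], [0.02, 0.0]]),
          "lam_c": 0.1, "lam_d": 0.3, "lam_m": 0.2}
effect = dynamic_effect(tau_base=0.12, a=0, history=[(0, 1.0), (0, 3.0)],
                        t_now=5.0, params=params)
```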
4.2.3. State Transition Model
In addition to modeling outcomes, the counterfactual simulation environment includes a state transition model that captures how customer states evolve in response to interventions. We model the state transition as follows:

$$s_{i,t+1} = g(s_{i,t}, a_{i,t}, y_{i,t}),$$

where $g$ is a function that maps the current state, intervention, and outcome to the next state. This function can be implemented using various machine learning models, such as neural networks or gradient-boosting machines trained on historical data.
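One possible implementation of the transition function $g$ is a small neural network trained on historical (state, action, outcome, next state) tuples; the sketch below is a minimal PyTorch version with assumed feature dimensions and layer sizes.

```python
import torch
import torch.nn as nn

class StateTransitionModel(nn.Module):
    """Predicts the next customer state s_{t+1} from (s_t, a_t, y_t)."""

    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions + 1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action, outcome):
        # One-hot encode the intervention and append the observed outcome.
        a_onehot = nn.functional.one_hot(action, self.n_actions).float()
        x = torch.cat([state, a_onehot, outcome.unsqueeze(-1)], dim=-1)
        return self.net(x)

# Usage (hypothetical dimensions): train on historical (s_t, a_t, y_t, s_{t+1}) tuples
# with a mean-squared-error loss against the observed next state.
```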
4.3. Deep Reinforcement Learning for Dynamic Policy Optimization
The third component of our framework is a deep reinforcement learning agent that optimizes intervention strategies dynamically. The RL agent learns a policy that maps customer states to interventions, maximizing cumulative rewards over time.
4.3.1. Markov Decision Process Formulation
We formulate the dynamic uplift modeling problem as a Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
State space $\mathcal{S}$: the state $s_{i,t}$ represents the information about customer $i$ at time $t$, including the features, $x_{i,t}$, and the treatment history, $h_{i,t}$.
Action space $\mathcal{A}$: the action $a_{i,t}$ represents the marketing intervention applied to customer $i$ at time $t$.
Transition function $P$: the transition function models how customer states evolve in response to interventions.
Reward function $R$: The reward function quantifies the immediate benefit of applying intervention $a_{i,t}$ to customer $i$ in state $s_{i,t}$. The reward function is designed with symmetric properties to ensure fair value attribution across different customer segments and intervention types.
Discount factor $\gamma$: the discount factor balances the importance of immediate versus future rewards.
4.3.2. State Representation
To effectively capture the relevant information for decision-making, we design a rich state representation that includes the following: the customer features, $x_{i,t}$, including both static and dynamic features, as described earlier; treatment history features, summarizing the sequence and timing of previous interventions; estimated treatment effects from the causal forest model for each potential intervention, providing the agent with information about the expected incremental impact of each action; and uncertainty estimates for the treatment effects, enabling the agent to balance exploration and exploitation.
4.3.3. Reward Design
The reward function is a critical component of the RL formulation, as it defines the objective that the agent aims to optimize. We design a reward function that captures both immediate and long-term business objectives:

$$R(s_{i,t}, a_{i,t}) = v(y_{i,t}) - c(a_{i,t}),$$

where $v(y_{i,t})$ is the immediate reward associated with the outcome (e.g., conversion value and purchase amount), and $c(a_{i,t})$ is the cost of intervention $a_{i,t}$ (e.g., marketing spend and operational costs).

To encourage the agent to consider long-term customer value, we incorporate a forward-looking component in the following reward function:

$$R(s_{i,t}, a_{i,t}) = v(y_{i,t}) - c(a_{i,t}) + \lambda \, \Delta \mathrm{CLV}_{i,t},$$

where $\Delta \mathrm{CLV}_{i,t}$ represents the estimated change in long-term customer value resulting from the intervention, and $\lambda$ is a weight parameter that balances immediate returns with long-term value.
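A minimal sketch of this reward computation; the cost table and the weight value are placeholder assumptions, not the configuration used in our experiments.

```python
# Hypothetical per-action costs: control, email, discount offer.
INTERVENTION_COST = {0: 0.0, 1: 0.15, 2: 0.50}

def reward(outcome_value, action, delta_clv, lambda_clv=0.3):
    """Immediate value minus intervention cost, plus a weighted long-term CLV term."""
    return outcome_value - INTERVENTION_COST[action] + lambda_clv * delta_clv

# Example: a $40 purchase after a discount offer with an estimated +$5 change in CLV.
r = reward(outcome_value=40.0, action=2, delta_clv=5.0)
```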
4.3.4. Adaptive Reward Mechanism
To address the challenge of non-stationary customer responses, we introduce an adaptive reward mechanism that adjusts the reward function based on observed outcomes and estimated treatment effects. The adaptive reward function is defined as follows:

$$\tilde{R}(s_{i,t}, a_{i,t}) = R(s_{i,t}, a_{i,t}) + \phi\big(y_{i,t} - \hat{y}_{i,t}\big),$$

where $\phi$ is an adaptation function that modifies the reward based on the discrepancy between the predicted outcome, $\hat{y}_{i,t}$, and the observed outcome, $y_{i,t}$. This function can be implemented using various approaches, such as Thompson sampling or Bayesian optimization, to balance exploration and exploitation in a non-stationary environment.
4.3.5. Policy Learning with Deep Q-Networks
We employ Deep Q-Networks (DQNs) as the reinforcement learning algorithm for policy optimization. A DQN combines deep neural networks with Q-learning to approximate the action-value function $Q(s, a)$, which represents the expected cumulative discounted reward of taking action $a$ in state $s$ and following the optimal policy thereafter.

The Q-function is approximated using a neural network with parameters $\theta$, denoted $Q(s, a; \theta)$. The network is trained to minimize the mean squared error between the predicted Q-values and the target Q-values:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\Big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\Big)^2\right],$$

where $\mathcal{D}$ is a replay buffer containing past experiences, and $\theta^{-}$ are the parameters of a target network that is periodically updated to stabilize training.
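The corresponding temporal-difference update can be written compactly in PyTorch as follows; q_net and target_net are assumed to map state batches to per-action Q-values.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.95):
    """Mean squared TD error between predicted and target Q-values (standard DQN)."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'; theta_minus), zeroed at episode termination.
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)
```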
To enhance the stability and performance of the DQN algorithm, we incorporate several extensions:
Double DQN: we decouple action selection from action evaluation, using the online network to select the greedy action and the target network to evaluate it, which reduces overestimation bias.
Prioritized experience replay: we prioritize experiences with higher temporal-difference errors for more efficient learning.
Dueling network architecture: we separate the estimation of state values and action advantages for better generalization.
Noisy networks: we incorporate parameter noise for exploration instead of epsilon-greedy action selection.
4.3.6. Bootstrapping with Causal Forest Estimates
To accelerate the learning process, we bootstrap the RL agent with estimates from the causal forest model. Specifically, we initialize the Q-function so that $Q(s_{i,t}, a; \theta)$ approximates the immediate reward implied by the estimated treatment effect $\hat{\tau}(a, x_{i,t})$, where $\hat{\tau}(a, x_{i,t})$ is the treatment effect estimated by the causal forest model. This initialization provides the RL agent with a head start based on insights from the causal forest model while still allowing it to adapt and refine its policy over time.
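A sketch of this bootstrapping step: before interacting with the simulator, the Q-network is regressed onto immediate-reward proxies derived from the causal forest CATE estimates. The tensor shapes and the uplift-minus-cost target are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pretrain_q_network(q_net, states, cate_hat, costs, epochs=20, lr=1e-3):
    """Fit Q(s, a) to an estimated immediate reward (uplift value minus action cost).

    states:   (n_customers, state_dim) tensor of state representations
    cate_hat: (n_customers, n_actions) tensor of causal forest CATE estimates
    costs:    (n_actions,) tensor of per-action intervention costs
    """
    targets = cate_hat - costs.unsqueeze(0)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.mse_loss(q_net(states), targets)
        loss.backward()
        optimizer.step()
    return q_net
```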
4.3.7. Policy Refinement with Counterfactual Simulation
We use the counterfactual simulation environment to refine the RL policy without requiring costly real-world experiments. The policy refinement process involves the following steps: (1) initialize the RL agent with a policy derived from causal forest estimates; (2) simulate customer trajectories using the counterfactual simulation environment, applying the current policy to determine interventions; (3) update the RL agent’s policy based on observed outcomes in the simulation; (4) repeat steps 2–3 until convergence or a maximum number of iterations is reached. This approach enables the RL agent to explore and learn from a diverse range of customer scenarios, improving its ability to adapt to different customer behaviors and marketing contexts.
4.3.8. Algorithm Implementation
Algorithm 1 presents the pseudocode for our dynamic marketing uplift modeling framework, illustrating how the causal forest, counterfactual simulation, and reinforcement learning components interact.
The algorithm begins with an offline learning phase in which causal forests are trained for each intervention to estimate heterogeneous treatment effects. These estimates are then used to initialize a counterfactual simulation environment that emulates customer responses. In the policy refinement phase, we use a Double Deep Q-Network algorithm with prioritized experience replay and dueling architecture to optimize the intervention policy. The agent interacts with the counterfactual simulation environment, collecting experiences that are used to update the Q-network through temporal difference learning. Key aspects that differentiate our implementation include the following: the bootstrapping of the Q-network with causal forest estimates (line 12), the adaptive reward mechanism that incorporates both immediate outcomes and long-term value, the symmetry-preserving exploration strategy that ensures balanced learning across customer segments, and the continuous updating of model components as new data become available.
4.4. Integrated Framework Operation
The three components of our framework—causal forests, counterfactual simulation, and deep reinforcement learning—work together in an integrated manner to enable dynamic marketing uplift modeling. The overall operation of the framework involves the following steps:
1. Offline learning phase: The offline phase begins with training causal forest models on historical data to estimate heterogeneous treatment effects for each intervention. For the Criteo dataset, we trained separate models for the single treatment type, while for the RetailCo dataset, we developed individual models for each of the three email campaign types. The causal forest training process involved the careful handling of temporal aspects to prevent data leakage, with data from earlier time periods used for training and more recent data for validation. Next, we developed the counterfactual simulation environment based on both historical patterns and domain knowledge. Historical patterns were extracted from the data through a statistical analysis of customer responses under different conditions, identifying temporal trends, seasonality effects, and response decay patterns. Domain knowledge was incorporated in several specific ways: for the RetailCo dataset, marketing experts provided insights about typical email fatigue effects (modeled as diminishing returns after 3–4 emails within a 14-day window), cross-channel interaction effects (e.g., email followed by mobile notification, typically showing 15–20% higher response rates than two consecutive emails), and seasonal purchasing behaviors (e.g., higher responsiveness to promotional content during holiday periods). For the Criteo dataset, we incorporated knowledge about typical display advertising effects, including view-through conversion windows and frequency capping implications. The counterfactual simulation parameters were calibrated using a held-out portion of the historical data, with particular attention to accurately reproducing the observed heterogeneity in treatment effects across different customer segments. We validated the simulation environment by comparing its predictions against actual outcomes from historical randomized tests, achieving correlation coefficients of 0.78 for the RetailCo dataset and 0.72 for the Criteo dataset. Finally, we bootstrapped the RL agent with causal forest estimates and refined its policy through iterative simulation. The bootstrapping process involved initializing the Q-network to approximate the immediate rewards based on the estimated treatment effects, providing a more efficient starting point for policy learning compared to random initialization.
2. Online deployment phase: During deployment, the learned policy was applied to determine optimal interventions for each customer in real time. For the RetailCo dataset, this involved daily batch processing to select customers for different email campaign types, while for the Criteo dataset, decisions were made in near real-time (sub-second latency) for display advertising placement. Feedback from actual customer responses was systematically collected through the company’s existing tracking systems, including email opens, clicks, website visits, and purchases. These feedback data were anonymized and stored in a dedicated database for subsequent model updates. The models were updated periodically, rather than after each individual customer interaction. Specifically, we implemented a dual-trigger update mechanism: models were refreshed either after accumulating 10,000 new customer interactions or on a weekly schedule, whichever came first. This batch-updating approach balanced computational efficiency with the need for timely adaptation to evolving patterns. During each update cycle, the causal forest models were retrained with the augmented dataset, simulation parameters were recalibrated, and the RL policy was refined through additional simulation episodes.
3. Continuous learning loop: The continuous learning loop was implemented as an automated pipeline that monitored the performance of the framework through several key business metrics tracked in real-time dashboards: conversion rates, average order value, returns on marketing investment, and unsubscribe rates. Performance was evaluated both overall and for specific customer segments to identify potential areas for improvement. Patterns and insights were extracted from both successful and unsuccessful interventions through a combination of automated analysis and periodic review meetings with marketing stakeholders. For example, the analysis might reveal diminishing returns from promotional emails for certain customer segments, triggering adjustments to the reward function to place more emphasis on long-term engagement for those segments. The framework components were refined based on performance feedback through a systematic procedure: if performance declined for specific segments, feature engineering was revisited to capture potentially overlooked factors; if overall performance plateaued, exploration parameters in the RL component were temporarily increased; if the gap between simulated and actual outcomes grew, the simulation environment parameters were recalibrated with greater weight on recent data.
Algorithm 1 Dynamic marketing uplift modeling framework.
Require: Historical data D, customer features X, interventions A
Ensure: Optimized policy π*
1: /* Offline Learning Phase */
2: D_train, D_val, D_test ← TemporalSplit(D, [0.7, 0.15, 0.15])
3: X ← FeatureEngineering(D_train) {Create static, dynamic, and history features}
4: for each intervention a ∈ A do
5:    CF_a ← TrainCausalForest(D_train, X, Y, a) {Train causal forest for intervention a}
6:    τ̂_a(x, t) ← EstimateCATE(CF_a, x, t) {Estimate conditional average treatment effects}
7: end for
8: CSim ← InitializeCounterfactualSimulation({τ̂_a}, D_train)
9: Initialize replay buffer B
10: Initialize Q-network Q(s, a; θ) and target network Q(s, a; θ⁻) with random weights
11: Initialize exploration parameter ε
12: /* Policy Refinement Phase */
13: for episode e = 1 to E do
14:    Sample batch of customers from D_train
15:    Get initial states {s_{i,0}}
16:    for t = 0 to T do
17:       for each customer i in batch do
18:          if rand() < ε then
19:             a_{i,t} ← random action from A
20:          else
21:             a_{i,t} ← argmax_a Q(s_{i,t}, a; θ)
22:          end if
23:          (s_{i,t+1}, y_{i,t}) ← CSim.Step(s_{i,t}, a_{i,t})
24:          r_{i,t} ← ComputeReward(s_{i,t}, a_{i,t}, y_{i,t})
25:          Store transition (s_{i,t}, a_{i,t}, r_{i,t}, s_{i,t+1}) in B
26:          if replay buffer has enough samples then
27:             Sample random mini-batch of transitions from B
28:             Calculate target values using double Q-learning:
29:                y ← r + γ · Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻)
30:             Update θ by minimizing L(θ) = (y − Q(s, a; θ))²
31:             Periodically update target network: θ⁻ ← θ
32:          end if
33:       end for
34:    end for
35:    Decay exploration parameter ε
36:    Evaluate policy on D_val
37: end for
38: /* Evaluate final policy on test set */
39: π*(s) ← argmax_a Q(s, a; θ)
40: Evaluate π* on D_test
41: return π*
This integrated approach combines the strengths of causal inference for heterogeneous treatment effect estimation with the adaptive capabilities of reinforcement learning for sequential decision-making. The counterfactual simulation environment serves as a bridge between these two components, enabling safe exploration and policy refinement without costly real-world experiments.

It is important to clarify how the conditional average treatment effect (CATE) estimates flow between the offline and online phases of our framework. As shown in Figure 1, the causal forest model estimates CATEs during the offline learning phase using historical data. These initial CATE estimates are then used to bootstrap the reinforcement learning agent’s Q-function initialization, stored as part of the customer state representation for the online deployment phase, and updated periodically as new data become available through the continuous learning loop. During the online phase, the customer state includes features, history, and the most recently estimated CATE values. This allows the framework to leverage both the robust causal estimates from the offline phase and the adaptive policy optimization from the reinforcement learning component. The combination provides a balance between reliable treatment effect estimation and dynamic adaptation to evolving customer behaviors. The CATE values in the customer state serve multiple purposes: they inform the exploration strategy of the RL agent (with higher uncertainty estimates leading to more exploration), they contribute to the reward function calculation, and they provide interpretable insights for business stakeholders regarding which customer segments are most responsive to specific interventions.
5. Experiments
In this section, we present an experimental evaluation of our proposed framework for dynamic marketing uplift modeling. We first describe the datasets used for the evaluation, followed by the baseline methods, evaluation metrics, implementation details, and experimental results.
5.1. Datasets
5.1.1. Criteo Uplift Dataset
The Criteo Uplift Dataset [49] is a publicly available dataset specifically designed for uplift modeling research. It contains data from a randomized controlled trial conducted by Criteo, a digital advertising company. The dataset includes approximately 13.9 million samples with 12 feature variables, a treatment indicator, and a binary conversion outcome. The treatment in this dataset corresponds to a display advertising campaign, and the outcome is whether the user converted (e.g., made a purchase) within a specific time window after exposure to the ad.

The specific marketing intervention in this dataset consists of personalized display advertisements shown to users while browsing various websites. For example, a user might be shown a targeted banner ad for a product they previously viewed but did not purchase, with the goal of encouraging them to complete the transaction. These ads might contain product images, promotional messages, discounts, or calls to action like “Limited time offer” or “Buy now”. The control group received no advertisement, allowing for direct measurement of the incremental effect of the display advertising intervention on conversion probability.

The dataset has several characteristics that make it suitable for evaluating dynamic uplift modeling approaches: it includes timestamps for each observation, allowing us to model temporal dynamics; the features are anonymized but include both static customer attributes and dynamic behavioral features; the randomized treatment assignment enables unbiased estimation of treatment effects; the large sample size allows for a reliable estimation of heterogeneous treatment effects across different customer segments. For our experiments, we augment the original dataset with additional derived features that capture temporal patterns and treatment history, such as the recency, frequency, and timing of previous exposures to advertisements.
5.1.2. RetailCo Customer Relationship Management Dataset
The second dataset is from a large retail company (anonymized as RetailCo), and it contains customer relationship management (CRM) data spanning two years. This proprietary dataset includes records of email marketing campaigns sent to customers and their subsequent purchasing behaviors. The dataset contains approximately 2.5 million samples from 450,000 unique customers, with 36 feature variables, multiple treatment options (different types of email campaigns), and both binary (purchase/no purchase) and continuous (purchase amount) outcome variables. This dataset encompasses a variety of marketing interventions in the form of different email campaign types. Promotional emails contained limited-time discount offers (e.g., “20% off your next purchase”), free shipping promotions, or buy-one-get-one offers. These interventions were designed to create urgency and immediate conversion. Informational emails featured new product announcements, seasonal catalogs, or style guides without explicit promotions, aiming to increase brand engagement and generate interest in new merchandise. Product recommendation emails contained personalized product suggestions based on customers’ browsing history, past purchases, or similar customer preferences. For example, these emails included “We thought you might like these items” sections with products from categories the customer previously showed interest in. Re-engagement emails targeted customers who had been inactive for a specific period (e.g., 60+ days), featuring special “We miss you” messaging and incentives to return to the site. The control condition in this dataset represents customers who were eligible for email campaigns but did not receive any during the observation period, allowing for measurement of the incremental impact of each intervention type. The key characteristics of this dataset include the following: longitudinal data with multiple interventions per customer over time; rich customer features, including demographics, purchase history, browsing behavior, and engagement metrics; multiple treatment types with varying content, offers, and sending times; and detailed outcome information, including both conversion events and monetary values. This dataset allows us to evaluate our approach in a more complex and realistic marketing scenario with sequential interventions and evolving customer behaviors.
5.2. Baseline Methods
We compare our dynamic marketing uplift modeling approach with several baseline methods.
Random assignment: randomly assigns treatments to customers, representing a non-targeted approach.
Response model: targets customers based on their predicted response probability, ignoring treatment effects.
Traditional uplift model [29]: estimates treatment effects using a meta-learner approach (S-learner) with gradient boosting machines as base learners.
Static causal forest [9]: applies the causal forest model for uplift estimation but without considering dynamic effects or sequential optimization.
Dynamic uplift model [24]: incorporates time-varying components in the uplift model but without reinforcement learning for policy optimization.
Contextual bandits [43]: uses a contextual multi-armed bandit approach to optimize interventions based on customer features, but without modeling long-term effects.
For the RetailCo dataset with multiple treatment options, we also included a business rule baseline that represents the company’s existing targeting strategy based on manually crafted business rules and segmentation.
5.3. Evaluation Metrics
We evaluate the performance of our approach using the following metrics.
Average treatment effect (ATE): The average difference in outcomes between the treatment and control groups. While simple to calculate, ATE masks heterogeneity in treatment effects across different customer segments [50]. A positive ATE indicates that the intervention is effective on average but does not provide insights into which specific segments benefit most from the intervention.
Qini coefficient: A measure of uplift model performance, calculated as the area between the uplift curve and the random targeting curve, normalized by the area of the perfect model [8,51]. The Qini coefficient quantifies targeting efficiency by measuring how much better the model performs compared to random targeting across different targeting thresholds. Higher values indicate superior ability to identify customers with positive treatment effects, with values close to 1 representing near-perfect targeting.
Expected response rate lift: The lift in response rate achieved by the model compared to random targeting, measured at different targeting thresholds. This metric directly translates to business value by showing the incremental gain in conversion rates when targeting specific percentiles of customers ranked by predicted uplift [52]. When reported at multiple thresholds (e.g., top 10% and 30%), it provides insights into the model’s performance at different campaign scales.
Return on investment (ROI): The net incremental return (incremental outcome value minus intervention cost) divided by the intervention cost. ROI connects model performance to financial outcomes, making it particularly relevant for business stakeholders [1]. This metric helps balance the trade-off between targeting accuracy and campaign costs.
Customer lifetime value (CLV) impact: The estimated change in customer lifetime value resulting from the optimized intervention strategy. Unlike immediate conversion metrics, CLV impact captures the long-term effects of interventions on customer value [53,54]. This is critical for sustainable marketing strategies that aim to build lasting customer relationships, rather than merely driving short-term conversions.
For the dynamic aspects of our approach, we also evaluate the following.
Adaptation performance: How well the model adapts to changing customer behaviors over time, measured by the stability of performance metrics across different time periods. This metric is particularly important for evaluating dynamic models in non-stationary environments where customer preferences evolve [55]. High adaptation performance indicates resilience to concept drift and temporal shifts in customer behavior.
Cumulative reward: The total reward accumulated over multiple interaction periods, capturing the long-term effectiveness of the intervention strategy. This reinforcement learning metric evaluates the model’s ability to optimize sequential decisions over time, rather than just immediate outcomes [35]. A higher cumulative reward indicates better long-term value optimization through sequential decision-making.
These metrics collectively assess both the immediate targeting efficiency and long-term value optimization capabilities of our approach, providing a comprehensive evaluation framework that aligns with both technical and business objectives.
5.4. Implementation Details
5.4.1. Data Preprocessing
For both datasets, we performed the following preprocessing steps. (1) Feature normalization: continuous features were standardized to have zero mean and unit variance. (2) Missing value imputation: missing values were imputed using a combination of mean imputation for continuous features and mode imputation for categorical features. (3) Feature engineering: we created additional features to capture temporal patterns, such as time since last treatment, day of week, month, and treatment frequency. (4) Temporal splitting: Data were split into training (70%), validation (15%), and test (15%) sets, maintaining the temporal order to prevent data leakage. For the RetailCo dataset, we also performed one-hot encoding of categorical variables and aggregated historical purchase data to create customer-level RFM (recency, frequency, and monetary value) features.
For the temporal split, we utilized 3 months of historical data for the Criteo Uplift Dataset and 12 months for the RetailCo CRM Dataset. Our experiments with various historical time windows revealed that 12-month windows provided the optimal balance between capturing long-term patterns and maintaining relevance to current behaviors. Shorter windows (3–6 months) resulted in approximately 8–10% lower Qini coefficients, while very long windows (18+ months) showed a 3–5% performance decrease. The effectiveness of historical data length varied by customer segment, with high-frequency purchase categories sometimes benefiting from shorter, more recent data windows, while considered purchases with longer decision cycles benefited from the full 12-month window.
For feature engineering, we incorporated several specific features across both datasets. The static features for the Criteo Dataset included the 12 anonymized features provided in the original dataset. For the RetailCo Dataset, static features comprised demographics (age group, gender, and location), acquisition source, customer tenure, loyalty tier, and preferred product categories. The dynamic features included recent engagement metrics such as open rates, click rates, and conversion rates calculated over multiple time windows (7-day, 30-day, and 90-day) to capture short, medium, and long-term patterns. We also incorporated recency–frequency–monetary (RFM) metrics, including days since the last purchase, purchase frequency in the last 90 days, and average order value in the last 90 days. Additional dynamic features included response decay (exponentially weighted average of past response rates with higher weights for more recent responses) and seasonal indicators such as the day of the week, the week of the month, and proximity to major shopping events. For history features, we incorporated the eight most recent interventions for each customer in the RetailCo dataset and the five most recent interventions in the Criteo dataset. We also included the intervention spacing (time intervals between consecutive interventions in days), the response to previous interventions (binary indicators for whether the customer responded to each of the previous interventions), and channel diversity (the entropy of communication channels used in recent interventions, for the RetailCo dataset only). For the RetailCo dataset, we also engineered cross-channel interaction features that capture how customers respond to sequences of different intervention types (e.g., promotional email followed by product recommendation). Time-based definitions were calibrated based on the purchase cycle in each dataset: for high-frequency retail purchases in the RetailCo dataset, “recent” typically referred to 7–30-day windows, while for the Criteo dataset, which involved higher consideration purchases, “recent” encompassed 30–90-day windows. All categorical features were one-hot encoded, while continuous features were standardized using z-score normalization (subtracting the mean and dividing by the standard deviation). For features with highly skewed distributions such as monetary values and time intervals, we applied log-transformation before standardization. Missing values were imputed as described earlier, with means for continuous features and modes for categorical features calculated from the training data only.
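As an illustration of the RFM-style dynamic features, the following sketch computes trailing-window recency, frequency, and monetary features with pandas; the column names (customer_id, order_date, amount) are hypothetical and do not reflect the proprietary RetailCo schema.

```python
import pandas as pd

def rfm_features(orders: pd.DataFrame, as_of: pd.Timestamp, window_days: int = 90):
    """Recency, frequency, and monetary value per customer over a trailing window."""
    recent = orders[(orders["order_date"] > as_of - pd.Timedelta(days=window_days))
                    & (orders["order_date"] <= as_of)]
    grouped = recent.groupby("customer_id").agg(
        frequency_90d=("order_date", "count"),   # number of orders in the window
        monetary_90d=("amount", "mean"),         # average order value in the window
        last_purchase=("order_date", "max"),
    )
    grouped["recency_days"] = (as_of - grouped["last_purchase"]).dt.days
    return grouped.drop(columns="last_purchase")
```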
5.4.2. Causal Forest Implementation
We implemented the causal forest model using the EconML library [56] (https://github.com/py-why/EconML, accessed on 1 April 2025), which provides efficient implementations of various causal machine learning methods. Specifically, we used the CausalForest class with the following hyperparameters. Number of trees: 2000; minimum samples leaf: 5; maximum depth: 10; criterion: “mse” (mean squared error); bootstrap: true; and number of jobs: −1 (use all available cores). The hyperparameters were tuned using a grid search with validation-set performance. For the multi-treatment case in the RetailCo dataset, we trained separate causal forest models for each treatment type.
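A minimal sketch of this setup, assuming the econml.grf.CausalForest interface (fit(X, T, y) and predict(X)); the one-vs-control arm encoding and the omission of the bootstrap setting are simplifications for illustration.

```python
import numpy as np
from econml.grf import CausalForest  # assuming the EconML forest API referenced above

def fit_causal_forest(X, treatment, y, arm):
    """Fit one causal forest for treatment `arm` versus the control condition (0)."""
    mask = np.isin(treatment, [0, arm])           # keep control and the chosen arm
    T_binary = (treatment[mask] == arm).astype(int)
    cf = CausalForest(
        n_estimators=2000,
        min_samples_leaf=5,
        max_depth=10,
        criterion="mse",
        n_jobs=-1,
    )
    cf.fit(X[mask], T_binary, y[mask])
    return cf

# Usage: per-customer CATE estimates for new feature vectors X_new.
# cate_hat = fit_causal_forest(X, treatment, y, arm=1).predict(X_new)
```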
5.4.3. Counterfactual Simulation Implementation
The counterfactual simulation environment was implemented as a Python 3.8 class using NumPy and Pandas. The environment simulates customer responses based on the estimated treatment effects from the causal forest model, combined with dynamic effect components. The simulation parameters were calibrated using historical data and expert knowledge from marketing practitioners at RetailCo.
Specifically, we estimated the following parameters from the data: the baseline response function, $f_0$, using a gradient boosting machine trained on control group data; the carryover effect parameters, $\beta_a$ and $\lambda_c$, by analyzing the persistence of treatment effects over time; and the adaptation effect parameters, $\alpha_a$ and $\lambda_d$, by examining how customer responsiveness changes with repeated exposures.
The state transition model was implemented using a recurrent neural network (RNN) that predicts the next state given the current state, action, and outcome.
5.4.4. Deep Reinforcement Learning Implementation
We implemented the deep reinforcement learning agent using PyTorch 2.2.0 and the RLlib library. Specifically, we used the Deep Q-Network (DQN) algorithm with the following specifications.
Neural network architecture. Input layer: dimension equal to the state representation size; hidden layers: three fully connected layers with 256, 128, and 64 neurons, each followed by ReLU activation; output layer: dimension equal to the number of actions.
Hyperparameters. Discount factor $\gamma$: 0.95; learning rate: 0.0001; replay buffer size: 100,000; mini-batch size: 64; target network update frequency: 1000 steps; exploration strategy: $\epsilon$-greedy with annealing from 1.0 to 0.01 over 100,000 steps.
We incorporated the extensions mentioned in the Methodology section, including Double DQN, prioritized experience replay, and dueling network architecture, to improve stability and performance. The details are as follows. Double DQN: We addressed overestimation bias by using separate networks for action selection and evaluation. The target value was calculated as $y = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\big)$, where $\theta^{-}$ represents the target network parameters. Prioritized experience replay: We assigned a priority, $p_i = |\delta_i| + \epsilon$, to each transition in the replay buffer, where $\delta_i$ is the temporal-difference error and $\epsilon$ is a small positive constant. Transitions were sampled with probability proportional to $p_i^{\alpha}$, and importance sampling weights, with the correction exponent $\beta$ annealed from 0.4 to 1.0, were applied to the loss function. Dueling network architecture: we split the network into value and advantage streams that were combined as $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$ to better identify state values independent of actions. Adaptive exploration: rather than using a simple $\epsilon$-greedy approach with fixed annealing, we implemented a more nuanced exploration strategy that adjusted exploration rates based on uncertainty estimates from the causal forest models.
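A minimal PyTorch sketch of the dueling architecture with the 256/128/64 layer sizes listed above; the exact module layout is an assumption.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q-network with separate state-value and advantage streams."""

    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.value = nn.Linear(64, 1)               # V(s)
        self.advantage = nn.Linear(64, n_actions)   # A(s, a)

    def forward(self, state):
        z = self.backbone(state)
        v, a = self.value(z), self.advantage(z)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return v + a - a.mean(dim=1, keepdim=True)
```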
5.4.5. Parameter Determination
For all model components, we determined optimal parameters through systematic hyperparameter optimization, rather than using default values. For the causal forest models, we performed a grid search over the following ranges: number of trees {500, 1000, 2000, 3000}, minimum samples leaf {1, 5, 10, 20}, and maximum depth {6, 8, 10, 12}. Selection was based on out-of-bag validation performance. For the reinforcement learning component, we conducted a two-stage optimization process. First, we performed a coarse grid search over candidate learning rates, discount factors {0.9, 0.95, 0.99}, and network architectures. Then, we performed a finer search around promising configurations. The final parameters were selected based on the model’s performance on the validation set, measured by the cumulative reward and Qini coefficient. The reward function weights, which balance immediate conversions with long-term value, were determined through A/B testing on a subset of the RetailCo dataset, comparing different weight configurations against business objectives. For the Criteo dataset, we used a simplified reward function since long-term customer value information was not available. For both datasets, we employed 5-fold cross-validation during the hyperparameter optimization phase to ensure robust parameter selection, with the final evaluation performed on the held-out test set to avoid overfitting to the validation data.
5.5. Experimental Results
5.5.1. Results on Criteo Uplift Dataset
Figure 2 shows the Qini curves for different methods on the Criteo Uplift Dataset. The Qini curve plots the incremental number of conversions as a function of the targeting size, with a steeper curve indicating better targeting performance. As shown in the figure, our Dynamic Causal RL approach achieves the highest Qini coefficient (0.52), followed by the static causal forest (0.41) and the traditional uplift model (0.37). The random assignment baseline has a Qini coefficient of approximately 0, as expected for a non-targeted approach.
Table 1 presents the detailed performance metrics for all methods on the Criteo Uplift Dataset. Our approach outperforms all baselines across multiple metrics, with particularly notable improvements in the expected response rate lift at the 30% targeting threshold.
Figure 3 illustrates how our approach adapts to changing customer behaviors over time. The plot shows the relative performance (measured by Qini coefficient) of different methods across six consecutive time periods. While all methods show some performance degradation over time due to evolving customer behaviors, our Dynamic Causal RL approach maintains higher performance throughout, with a significantly smaller decline in the later periods.
5.5.2. Results on RetailCo CRM Dataset
For the RetailCo dataset, we evaluated our approach on both binary (purchase/no purchase) and continuous (purchase amount) outcomes.
Figure 4 shows the expected response rate lift at different targeting thresholds for the binary outcome. Our approach achieves substantial improvements over the baselines, particularly in the top 10-20% of targeted customers.
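A minimal sketch of the expected response rate lift at a targeting threshold is shown below: rank customers by predicted uplift, keep the top k%, and compare treated versus control response rates within that slice. Variable names are illustrative and the inputs are assumed to be NumPy arrays.

```python
import numpy as np


def response_rate_lift_at_k(uplift_pred, y, t, k=0.10):
    """Absolute response-rate lift (treated minus control) within the top-k% of customers by predicted uplift."""
    top = np.argsort(-uplift_pred)[: max(1, int(k * len(uplift_pred)))]
    y_top, t_top = y[top], t[top]
    rate_treated = y_top[t_top == 1].mean() if (t_top == 1).any() else 0.0
    rate_control = y_top[t_top == 0].mean() if (t_top == 0).any() else 0.0
    return rate_treated - rate_control
```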
Table 2 summarizes the performance metrics for all methods on the RetailCo CRM Dataset. Our approach achieves a 27% improvement in targeting efficiency (measured by Qini coefficient) and an 18% increase in ROI compared to the business rule baseline used by the company.
Figure 5 shows the distribution of treatment assignments for different customer segments under our approach compared to the business rule baseline. Our approach demonstrates more nuanced targeting, with different treatments assigned to different customer segments based on their estimated response patterns.
To evaluate the long-term impact of our approach, we simulated customer interactions over multiple periods using the counterfactual simulation environment.
Figure 6 shows the cumulative reward (measured as incremental revenue minus marketing costs) over 12 simulated periods. Our approach accumulates significantly higher rewards over time compared to the baselines, highlighting the benefits of optimizing for long-term customer value, rather than immediate responses.
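The multi-period evaluation behind Figure 6 can be summarized by the loop below, a hedged sketch in which `env` and `policy` stand in for the counterfactual simulation environment and the learned targeting policy; the per-period reward is incremental revenue minus marketing cost, as in the text.

```python
def simulate_cumulative_reward(env, policy, n_periods=12):
    """Roll a targeting policy forward in a simulated customer environment and track cumulative reward."""
    state = env.reset()
    cumulative, trajectory = 0.0, []
    for _ in range(n_periods):
        actions = policy(state)                   # one marketing action per customer
        state, revenue, cost = env.step(actions)  # simulated customer responses for this period
        cumulative += revenue - cost              # incremental revenue minus marketing costs
        trajectory.append(cumulative)
    return trajectory
```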
5.5.3. Ablation Study
To understand the contribution of different components of our framework, we conducted an ablation study by removing or replacing key components and measuring the impact on performance.
Table 3 presents the results of this ablation study on the RetailCo CRM Dataset.
The ablation study reveals that all components of our framework contribute positively to the overall performance. Removing the causal forest component (replacing it with a traditional uplift model) leads to the largest performance drop, indicating the importance of accurate heterogeneous treatment effect estimation. The reinforcement learning component and the modeling of dynamic effects also contribute significantly to the performance, highlighting the value of sequential optimization and capturing temporal dynamics in customer responses.
To further evaluate the contribution of different reinforcement learning extensions, we conducted additional experiments comparing the performance of our framework with and without specific DQN enhancements.
Table 4 presents the results of these experiments on the RetailCo CRM Dataset.
As shown in Table 4, all three extensions contributed to improved performance, with the removal of all extensions (Vanilla DQN) resulting in a substantial performance drop. Prioritized experience replay provided the most significant individual contribution, improving the Qini coefficient by approximately 0.07 points compared to the variant without this extension. This is likely because marketing datasets typically contain imbalanced conversion events, and prioritizing experiences with higher temporal-difference errors helps the agent learn more effectively from rare but informative conversion events. Double DQN contributed to more accurate Q-value estimation, reducing the overestimation bias that could otherwise lead to suboptimal targeting decisions. The dueling network architecture improved the stability of learning, particularly for customer segments where the action selection had minimal impact on outcomes. Beyond the performance metrics shown in the table, we also observed that these extensions led to faster convergence during training (approximately 35% fewer iterations required) and greater consistency across multiple training runs (standard deviation of the Qini coefficient across five runs reduced by 42%). These results highlight the importance of advanced reinforcement learning techniques in addressing the unique challenges of dynamic marketing uplift modeling, particularly when dealing with sparse rewards and heterogeneous customer responses.
To further explore how different components interact with each other, we conducted additional ablation experiments incorporating combinations of component removals.
Table 5 presents the results of these combination experiments on the RetailCo CRM Dataset.
These combination experiments reveal several important insights about component interactions. First, removing both the causal forest and the dynamic effects components (Qini coefficient: 0.36) produced a performance drop larger than the sum of the drops from their individual removals (which yielded Qini coefficients of 0.43 and 0.46, respectively), suggesting a synergistic relationship between accurate heterogeneous treatment effect estimation and temporal dynamics modeling. Similarly, removing both reinforcement learning and the adaptive reward mechanism led to a Qini coefficient of 0.39, lower than would be expected from their individual contributions; this highlights how the adaptive reward mechanism enhances the effectiveness of the reinforcement learning component by providing more informative feedback signals. The most severe performance degradation occurred when both the causal forest and reinforcement learning components were removed (Qini coefficient: 0.33), reducing the framework essentially to a dynamic uplift model without the benefits of either robust causal inference or sequential decision optimization, which underscores the fundamental importance of these two components. When all key components were removed, yielding a traditional uplift modeling approach (Qini coefficient: 0.30), performance was substantially worse than for any other variant, confirming the significant value added by our integrated framework. Overall, these combination experiments provide strong evidence for the complementary nature of our framework components, with each addressing specific limitations of traditional approaches while also enhancing the effectiveness of the others through their interactions.
5.5.4. Computational Environment and Execution Time Measurement
To evaluate the computational efficiency of different methods, we measured their execution times under identical conditions. All experiments were conducted on a server equipped with an Intel Xeon E5-2680 v4 CPU (2.4 GHz, 14 cores), 128 GB of RAM, and an NVIDIA Tesla V100 GPU with 16 GB of memory.
Table 6 presents the end-to-end runtime for each method, including model training and inference on the test set. For reinforcement learning-based approaches, the execution time includes the simulation runs required for policy optimization. The reported times are averaged over five independent runs to ensure reliability. We excluded data preprocessing time from these measurements since it is common across all methods.
As shown in Table 6, our Dynamic Causal RL approach requires more computational resources than simpler baselines, which is expected given its more sophisticated modeling capabilities. The ablation study results provide additional insights into the computational cost of different components: removing the reinforcement learning component results in the largest reduction in execution time, while removing the adaptive reward mechanism has the smallest impact. Despite the higher computational requirements, the improved performance of our approach justifies the additional computational cost, especially in marketing scenarios where even small improvements in targeting efficiency can translate into significant financial gains. Furthermore, once trained, the inference time of our model is comparable to that of the other approaches, making it suitable for real-time deployment scenarios where decisions need to be made within milliseconds.
5.6. Case Study: Email Campaign Optimization
To demonstrate the practical application of our approach, we conducted a small-scale pilot study with RetailCo, focusing on optimizing email marketing campaigns for a subset of their customer base. For this pilot, we deployed our framework to optimize the targeting strategy for three types of email campaigns (promotional, informational, and product recommendation) across 50,000 customers over an 8-week period.
The pilot study compared three approaches: (1) business as usual (BAU), RetailCo’s existing targeting rules; (2) static uplift, a traditional uplift modeling approach; (3) our Dynamic Causal RL approach.
Figure 7 shows the key performance indicators (KPIs) for the three approaches, including conversion rates, average order values, and ROI. Our approach achieved a 23% higher conversion rate and a 31% higher ROI compared to the BAU approach, demonstrating the practical value of dynamic uplift modeling in real-world marketing scenarios.
Importantly, the pilot study also revealed that our approach led to a more balanced communication strategy, with fewer customers receiving excessive emails and more customers receiving relevant communications. This resulted in lower unsubscribe rates and improved customer satisfaction scores, highlighting the broader benefits of optimized targeting beyond immediate conversion metrics.
5.7. Discussion and Limitations
Our experimental results provide quantitative evidence for how our framework addresses the key challenges identified in Section 3.4. For delayed effects, 23% of conversions in the RetailCo dataset occurred more than 7 days after the intervention, and our framework showed a 15% improvement in long-term ROI compared to models optimizing only for immediate conversions. For carryover effects, interventions influenced customer behavior for an average of 18 days; modeling these patterns reduced the number of emails per customer by 26% while maintaining conversion rates, contributing approximately 0.04 points to the Qini coefficient improvement. For adaptation effects, response rates declined by 7% for each consecutive email within a 30-day period, and our adaptive approach reduced unsubscribe rates by 22% compared to baseline approaches by automatically adjusting frequency for fatigue-prone segments. The exploration–exploitation tradeoff was managed with a 20% initial exploration rate gradually reduced to 5%, resulting in 3% higher long-term ROI compared to completely eliminating exploration. For counterfactual estimation, our approach achieved 83% prediction accuracy for conversion events, with the simulation environment maintaining a 0.76 correlation with actual outcomes over 6 months. These results demonstrate how our methodological contributions directly address the specific challenges of dynamic marketing uplift modeling.
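As a concrete illustration of the exploration schedule mentioned above, the sketch below decays the exploration rate from 20% toward 5%; the exponential form and decay horizon are assumptions, since the text only states the two endpoints.

```python
import math


def exploration_rate(step, start=0.20, end=0.05, decay_steps=100_000):
    """Illustrative decay from `start` to roughly `end` over `decay_steps` steps; the shape is assumed."""
    frac = min(step / decay_steps, 1.0)
    return end + (start - end) * math.exp(-5.0 * frac)
```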
However, our approach also has several limitations that should be acknowledged. Computational complexity: the training and deployment of our framework require significant computational resources, particularly for large-scale marketing campaigns with millions of customers. Data requirements: the accurate estimation of heterogeneous treatment effects and dynamic components requires rich historical data with randomized treatment assignment, which may not be available in all marketing contexts. Model interpretability: while causal forests provide some level of interpretability, the integration with deep reinforcement learning introduces complexity that may reduce the transparency of targeting decisions for business stakeholders. Cold-start problem: the framework relies on historical data for initialization, which may limit its effectiveness for new customers or new campaign types without sufficient data. An additional limitation worth acknowledging is the challenge of adapting to sudden, major environmental shifts, such as the dramatic change in consumer behavior during the COVID-19 pandemic. While our framework’s reinforcement learning component and adaptive reward mechanism provide some ability to adjust to evolving patterns, extreme disruptions may still require explicit recalibration. In such scenarios, possible approaches include the following: (1) implementing change detection algorithms to identify when the environment has significantly shifted; (2) incorporating external context variables (e.g., lockdown status and economic indicators) that may explain behavioral changes; (3) applying transfer learning techniques to adapt pre-existing models to new conditions with limited data; and (4) temporarily increasing exploration rates when significant distribution shifts are detected. These adaptations represent promising directions for future research to enhance model resilience during major environmental changes. Despite these limitations, our framework provides a promising approach to optimizing marketing interventions in dynamic environments, with demonstrated benefits in both simulated scenarios and real-world applications.
6. Conclusions and Future Work
6.1. Conclusions
This paper has introduced a novel framework for dynamic marketing uplift modeling that integrates causal forests with deep reinforcement learning. By combining the strengths of both approaches, our framework addresses the key limitations of traditional uplift modeling methods, particularly their inability to capture the dynamic nature of customer responses and optimize sequential interventions.
Our main contributions include the following: (1) a unified framework that leverages causal forests for accurate heterogeneous treatment effect estimation and deep reinforcement learning for dynamic policy optimization, (2) a counterfactual simulation methodology that enables the exploration of diverse intervention strategies without costly real-world experiments, and (3) an adaptive reward mechanism that balances short-term conversion goals with long-term customer value considerations.
Empirical evaluations on both the Criteo Uplift Dataset and the RetailCo CRM Dataset demonstrated that our approach significantly outperforms traditional static uplift models and other baseline methods. Specifically, our framework achieved a 27% improvement in targeting efficiency (measured by the Qini coefficient) and an 18% increase in ROI compared to industry-standard approaches. The results also showed that our method adapts better to changing customer behaviors over time, maintaining higher performance across different time periods than static approaches.
The case study with RetailCo further validated the practical value of our approach, with a 23% higher conversion rate and a 31% higher ROI compared to business-as-usual marketing strategies. Importantly, the optimized targeting strategy led to more balanced communication patterns, reducing unsubscribe rates and improving overall customer satisfaction.
In summary, our dynamic marketing uplift modeling framework offers marketers a powerful tool for implementing personalized and adaptive campaign strategies that evolve with changing customer behaviors and market conditions. By preserving symmetry properties in both causal estimation and reinforcement learning components, our framework achieves more balanced and generalizable performance across different marketing contexts. By addressing the limitations of traditional static uplift models, our approach enables more effective and efficient marketing interventions that maximize both immediate responses and long-term customer value.
6.2. Future Work
While our framework demonstrates promising results, several directions for future research remain.
Multi-touch attribution: extending our framework to incorporate multi-touch attribution models would enable a more accurate assessment of the contribution of each marketing touchpoint to conversion events, further improving the precision of uplift estimates in multi-channel marketing environments.
Contextual bandits integration: exploring the integration of contextual bandits with our framework could provide a more efficient approach for balancing exploration and exploitation, particularly in scenarios with limited historical data or rapidly changing customer preferences.
Model interpretability: enhancing the interpretability of our framework would make it more accessible to marketing practitioners. Future work could focus on developing visualization tools and explanation methods to help marketers understand why specific interventions are recommended for different customer segments.
Online learning: implementing online learning capabilities would allow the model to continuously update and refine its parameters as new data becomes available, further improving its adaptability to evolving customer behaviors and market conditions.
Privacy-preserving uplift modeling: as privacy regulations become increasingly stringent, developing privacy-preserving versions of our framework that can operate with anonymized or federated data would be valuable for deployment in privacy-sensitive contexts.
Cross-domain transfer learning: investigating methods for transferring knowledge across different marketing campaigns or product categories could reduce the data requirements for new campaigns, enabling faster deployment and better initial performance.
Symmetry-preserving uplift modeling: further exploration of symmetry-preserving uplift modeling techniques could enhance the robustness and fairness of personalized marketing interventions, particularly in highly heterogeneous customer populations.
These research directions aim to address the current limitations of our framework and expand its applicability to a wider range of marketing scenarios and organizational contexts. As digital marketing continues to evolve with increasing personalization and real-time optimization requirements, frameworks like ours that combine causal inference with reinforcement learning will become increasingly important for effective customer engagement and value creation.