1. Introduction
In today’s data-rich marketing environment, the ability to identify which customers will respond positively to specific marketing interventions has become a critical competitive advantage [1]. Marketing interventions—targeted actions such as email campaigns, personalized offers, or digital advertisements designed to influence customer behavior toward desired outcomes like purchases or engagement—represent the core “treatments” that marketers deploy to drive business results. Marketing uplift modeling, also known as incremental modeling or true-lift modeling, aims to measure the causal impact of these interventions by identifying which customer segments will be positively influenced by which specific marketing actions [2]. This approach focuses on estimating the incremental effect (or “uplift”) of an intervention compared to no intervention across diverse customer segments. Traditional uplift modeling approaches, however, predominantly rely on static assumptions that fail to capture the dynamic nature of customer responses and the evolving marketing landscape [3].
Marketing interventions typically produce heterogeneous effects across different customer segments, and these effects often evolve over time as customers adapt to marketing strategies [4]. For instance, customers who initially respond positively to email promotions may develop “promotion fatigue” and become less responsive over time [1]. Similarly, the effectiveness of marketing interventions can be influenced by competitive actions, changing customer preferences, and seasonal factors that are difficult to capture in static models [5]. Moreover, the effectiveness of marketing interventions exhibits symmetry properties in how different customer segments respond to various treatments, which can be exploited for more efficient modeling.
The limitations of traditional uplift modeling approaches can be categorized into three key challenges: (1) Static treatment effect estimation: conventional methods estimate treatment effects at a fixed point in time, failing to capture how customer responsiveness evolves dynamically [6]. (2) Limited adaptation capability: traditional models lack mechanisms to adapt to changing customer behavior patterns and market conditions [7]. (3) Inability to optimize sequential decisions: most uplift models focus on one-time interventions, rather than optimizing sequences of marketing actions over extended customer lifecycles [8]. These fundamental limitations significantly impair the effectiveness of marketing campaigns in real-world settings, leading to suboptimal resource allocation, diminished returns on marketing investments, and missed opportunities for building long-term customer relationships. A solution that addresses these limitations must be capable of the following: (1) capturing the dynamic evolution of treatment effects over time, (2) continuously adapting to changing customer behaviors, and (3) optimizing sequences of marketing decisions, rather than isolated interventions.
Recent advances in causal machine learning, particularly causal forests, have shown promise in estimating heterogeneous treatment effects with improved accuracy and interpretability [9]. Causal forests extend random forests to the domain of causal inference, enabling the estimation of conditional average treatment effects (CATEs) across different customer segments [10]. However, while causal forests excel at identifying heterogeneous treatment effects, they still maintain the static assumption that customer responses remain constant over time. Simultaneously, reinforcement learning (RL) has emerged as a powerful framework for sequential decision-making under uncertainty [11]. Deep reinforcement learning combines the representational power of deep neural networks with RL algorithms, enabling the optimization of complex decision policies in dynamic environments [12]. In marketing contexts, RL offers the potential to optimize sequences of interventions while adapting to changing customer responses [13].
This paper introduces a novel symmetry-preserving framework that integrates causal forests with deep reinforcement learning to address the limitations of traditional uplift modeling approaches. Our dynamic marketing uplift modeling framework leverages causal forests to estimate heterogeneous treatment effects across customer segments, which then inform a reinforcement learning agent that optimizes intervention strategies dynamically. The integration is facilitated through a counterfactual simulation environment that emulates diverse customer response patterns and an adaptive reward mechanism that captures both immediate and long-term intervention outcomes. The symmetry-preserving aspect of our framework ensures balanced representation and fair optimization across different customer segments, maintaining the fundamental symmetry properties inherent in causal inference while allowing for adaptive learning. Our approach directly addresses the three key limitations of traditional methods: (1) it captures dynamic treatment effects by continuously updating causal effect estimates as new data becomes available; (2) it adapts to changing customer behaviors through reinforcement learning that optimizes policies based on recent customer responses; and (3) it optimizes sequential decisions by modeling marketing interventions as a Markov decision process that accounts for the long-term impact of current actions.
The key contributions of our work include the following:
A unified framework that combines the strengths of causal forests in estimating heterogeneous treatment effects with the adaptive capabilities of deep reinforcement learning for marketing uplift modeling.
A counterfactual simulation methodology that enables the exploration of diverse intervention strategies without requiring costly real-world experiments.
An adaptive reward mechanism that balances short-term conversion goals with long-term customer value considerations.
Empirical validation of our approach on both simulated and real-world marketing campaign data, demonstrating significant improvements over traditional static uplift models.
Our empirical evaluations demonstrate that our approach significantly outperforms traditional static uplift models, achieving up to 27% improvement in targeting efficiency and an 18% increase in return on marketing investment. These results confirm that our dynamic, symmetry-preserving framework effectively addresses the limitations of static uplift models in real-world marketing scenarios.
The remainder of this paper is organized as follows: Section 2 reviews related work in uplift modeling, causal machine learning, and reinforcement learning for marketing. Section 3 introduces the problem setting and the background concepts needed to understand our work. Section 4 presents our proposed framework, detailing the integration of causal forests with deep reinforcement learning. Section 5 describes our experimental setup and the results of our empirical evaluation. Finally, Section 6 concludes with a summary of our findings and directions for future research.
3. Preliminaries
This section establishes the formal framework for dynamic marketing uplift modeling, introduces the key concepts and notations used throughout this paper, and formulates the problem of optimizing marketing interventions as a causal reinforcement learning problem.
3.1. Problem Setting
We consider a marketing environment where a firm interacts with a population of customers over time. Let $\mathcal{I} = \{1, \ldots, N\}$ represent the set of customers. For each customer, $i \in \mathcal{I}$, the firm observes a set of features, $x_i$, that describe the customer’s characteristics, such as demographics, purchase history, browsing behavior, and past interactions with marketing campaigns.

The firm has at its disposal a set of marketing interventions (or treatments), $\mathcal{A} = \{0, 1, \ldots, K\}$, where 0 represents the control condition (no intervention), and $1, \ldots, K$ represent different marketing actions (e.g., email, mobile push notification, and discount offers). At discrete time points, $t = 1, 2, \ldots, T$, the firm decides which intervention, $a_{i,t} \in \mathcal{A}$, to apply to each customer, $i$.

After receiving an intervention, each customer generates an outcome (e.g., purchase, click, and engagement), which may be influenced by the intervention. The observed outcome is denoted as $y_{i,t}$, representing the outcome under the applied intervention. Importantly, for each customer at each time point, we only observe the outcome under the applied intervention and not under alternative interventions.

The customer’s features may evolve over time based on their interactions with the firm and external factors. We denote the customer’s features at time $t$ as $x_{i,t}$, which includes both static features (e.g., demographics) and dynamic features (e.g., recent purchase behavior and responsiveness to previous interventions).
3.2. Causal Inference Framework
Following the potential outcomes framework [47], we denote by $Y_{i,t}(a)$ the potential outcome that would be observed for customer $i$ at time $t$ if intervention $a \in \mathcal{A}$ were applied. Under the fundamental problem of causal inference, only one of these potential outcomes is observed for each customer at each time point—the one corresponding to the intervention that was actually applied.

The causal effect of intervention $a$ relative to the control condition for customer $i$ at time $t$ is defined as the difference in potential outcomes:

$$\tau_{i,t}(a) = Y_{i,t}(a) - Y_{i,t}(0).$$

This quantity, known as the individual treatment effect (ITE), represents the incremental impact of intervention $a$ on the outcome of interest. In marketing contexts, this is often referred to as the uplift for customer $i$ under intervention $a$ at time $t$.

Since we cannot observe all potential outcomes for a customer, we aim to estimate the conditional average treatment effect (CATE), which is the expected treatment effect conditional on customer features:

$$\tau_t(a, x) = \mathbb{E}\left[Y_{i,t}(a) - Y_{i,t}(0) \mid x_{i,t} = x\right].$$

Traditional uplift modeling approaches typically estimate CATE as a static function of customer features, ignoring the temporal dynamics and sequential nature of marketing interventions. The conditional average treatment effect estimation can be viewed through the lens of symmetry, where the balanced representation of treatment and control conditions is essential for unbiased causal inference. In contrast, our approach explicitly models how treatment effects evolve over time and how they are influenced by sequences of interventions.
3.3. Markov Decision Process Formulation
To capture the sequential and dynamic nature of marketing interventions, we formulate the problem as a Markov decision process (MDP) [48]. An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where the following applies:
$\mathcal{S}$ is the state space, which, in our context, represents the space of customer states. Each state, $s_{i,t} \in \mathcal{S}$, encapsulates the information about customer $i$ at time $t$, including their features, $x_{i,t}$, and relevant historical information.
$\mathcal{A}$ is the action space, corresponding to the set of available marketing interventions.
$P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function, where $P(s' \mid s, a)$ represents the probability of transitioning to state $s'$, given that action $a$ is taken in state $s$. In our context, this models how customer states evolve in response to marketing interventions.
$R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, where $R(s, a)$ represents the expected immediate reward of taking action $a$ in state $s$. In marketing contexts, rewards can be defined based on various business objectives, such as conversion rates, customer lifetime value, or returns on marketing investment.
$\gamma \in [0, 1)$ is the discount factor, which balances the importance of immediate versus future rewards.
A policy, $\pi: \mathcal{S} \to \mathcal{A}$, is a map from states to actions, specifying which marketing intervention to apply in each customer state. The goal is to find an optimal policy, $\pi^*$, that maximizes the expected cumulative discounted reward:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R\big(s_t, \pi(s_t)\big)\right].$$
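To make this objective concrete, the following sketch estimates the expected cumulative discounted reward of a fixed policy by Monte Carlo rollouts in a toy customer environment. The environment dynamics, feature dimensions, and intervention costs are illustrative assumptions for exposition only, not part of our framework.

```python
import numpy as np

class ToyCustomerEnv:
    """Hypothetical customer environment: 2-dimensional state, 3 interventions."""
    costs = np.array([0.0, 1.0, 2.0])  # control, email, discount offer (assumed costs)

    def reset(self, rng):
        return rng.normal(size=2)  # initial customer features

    def step(self, state, action, rng):
        # Conversion probability rises with feature 0 and with stronger interventions.
        p = 1.0 / (1.0 + np.exp(-(0.5 * state[0] + 0.3 * action)))
        reward = 10.0 * rng.binomial(1, p) - self.costs[action]
        next_state = 0.9 * state + rng.normal(scale=0.1, size=2)
        return next_state, reward


def rollout_return(env, policy, gamma=0.95, horizon=12, n_episodes=500, seed=0):
    """Monte Carlo estimate of E[sum_t gamma^t R(s_t, pi(s_t))] under a fixed policy."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        state, ret, discount = env.reset(rng), 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = env.step(state, action, rng)
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / n_episodes


# Example: always sending the email (action 1) versus never intervening (action 0).
print(rollout_return(ToyCustomerEnv(), policy=lambda s: 1),
      rollout_return(ToyCustomerEnv(), policy=lambda s: 0))
```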
3.4. Challenges in Dynamic Uplift Modeling
Several challenges arise when modeling dynamic marketing uplift effects as a causal reinforcement learning problem.
Delayed effects: Marketing interventions may have both immediate and delayed effects on customer behavior. For instance, a promotional email might not lead to an immediate purchase but could influence the customer’s purchasing decision in the future.
Carryover effects: The effect of a marketing intervention may persist over time and influence the customer’s response to future interventions. This creates dependencies between sequential interventions that need to be modeled explicitly.
Adaptation effects: Customers may adapt their behavior in response to repeated marketing interventions. For example, a customer might initially respond positively to promotional emails but develop “promotion fatigue” over time, leading to decreased responsiveness.
Exploration–exploitation tradeoff: When marketing interventions are optimized, there is a fundamental tradeoff between exploiting known effective strategies and exploring new strategies to gather more information about customer responses.
Counterfactual estimation: To evaluate the causal effect of a marketing intervention, we need to estimate what would have happened if a different intervention had been applied, which requires addressing the counterfactual estimation problem.
Our proposed framework addresses these challenges by integrating causal forests for heterogeneous treatment effect estimation with deep reinforcement learning for dynamic policy optimization. The next section details our methodology and explains how it overcomes these challenges to enable effective dynamic marketing uplift modeling.
4. Methodology
In this section, we present our integrated framework for dynamic marketing uplift modeling. As illustrated in
Figure 1, our approach combines causal forests for heterogeneous treatment effect estimation with deep reinforcement learning for dynamic policy optimization. The framework consists of three key components: (1) a causal forest model for estimating conditional average treatment effects (CATEs), (2) a counterfactual simulation environment that emulates diverse customer response patterns, and (3) a deep reinforcement learning agent that optimizes intervention strategies dynamically. Our approach maintains symmetry in the exploration–exploitation tradeoff, ensuring balanced learning across customer segments and intervention types.
4.1. Causal Forest for Heterogeneous Treatment Effect Estimation
The first component of our framework is a causal forest model that estimates heterogeneous treatment effects across different customer segments. Causal forests extend random forests to the domain of causal inference, enabling the estimation of treatment effects while controlling for potential confounding factors.
4.1.1. Model Formulation
Following the potential outcomes framework, we define the individual treatment effect (ITE) for customer $i$ at time $t$ under intervention $a$ as follows:

$$\tau_{i,t}(a) = Y_{i,t}(a) - Y_{i,t}(0),$$

where $Y_{i,t}(a)$ is the potential outcome if intervention $a$ is applied, and $Y_{i,t}(0)$ is the potential outcome under the control condition (no intervention). Since we can only observe one of these potential outcomes for each customer at each time point, we aim to estimate the conditional average treatment effect (CATE):

$$\tau_t(a, x) = \mathbb{E}\left[Y_{i,t}(a) - Y_{i,t}(0) \mid x_{i,t} = x\right].$$

To estimate the CATE, we employ a causal forest model that adapts the splitting criteria of traditional random forests to maximize the heterogeneity in treatment effects across leaves. Specifically, for each tree in the forest, the splitting criterion at each node is based on maximizing the difference in treatment effects between the resulting child nodes.
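As a conceptual illustration of this splitting criterion (not the internals of any particular causal forest implementation), the sketch below scores a candidate split by the squared gap between difference-in-means uplift estimates in the two child nodes; the variable names are hypothetical.

```python
import numpy as np

def naive_effect(y, w):
    """Difference-in-means treatment-effect estimate (w = 1 treated, w = 0 control)."""
    return y[w == 1].mean() - y[w == 0].mean()

def split_heterogeneity_score(x_col, threshold, y, w):
    """Score a candidate split by the squared gap in child-node treatment effects.

    A heterogeneity-maximizing criterion prefers splits whose children exhibit
    very different estimated uplifts.
    """
    left = x_col <= threshold
    right = ~left
    # Require both treated and control units on each side to estimate an effect.
    for mask in (left, right):
        if w[mask].sum() == 0 or (1 - w[mask]).sum() == 0:
            return -np.inf
    tau_left = naive_effect(y[left], w[left])
    tau_right = naive_effect(y[right], w[right])
    return (tau_left - tau_right) ** 2
```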
Figure 1.
Overview of the proposed framework for dynamic marketing uplift modeling. The framework integrates causal forests for heterogeneous treatment effect estimation with deep reinforcement learning for dynamic policy optimization. The causal forest model estimates conditional average treatment effects (CATE) based on historical data, which inform the initial policy of the reinforcement learning agent. The counterfactual simulation environment emulates diverse customer response patterns, enabling the agent to explore and optimize intervention strategies without costly real-world experiments. The reinforcement learning agent continuously refines its policy based on observed outcomes and estimated treatment effects, adapting to changing customer behaviors over time.
4.1.2. Feature Engineering
To capture the dynamic aspects of customer responses, we incorporate both static and dynamic features in our causal forest model:
Static features, $x_i^{\mathrm{s}}$, include customer demographics, acquisition channels, and other time-invariant characteristics. Dynamic features, $x_{i,t}^{\mathrm{d}}$, capture the customer’s recent behavior and interactions, including their recent purchase history (e.g., recency, frequency, and monetary value), engagement metrics (e.g., email open rates and click-through rates), responsiveness to previous interventions (e.g., conversion rates for past campaigns), temporal patterns (e.g., seasonality and time since last purchase), and interaction features that capture the dependencies between different feature dimensions. Additionally, we incorporate treatment history features that capture the sequence and timing of previous interventions:

$$\big\{(a_{i,t-j}, \Delta t_{i,j})\big\}_{j=1}^{J},$$

where $a_{i,t-j}$ represents the intervention applied to customer $i$ at time $t-j$, and $\Delta t_{i,j}$ represents the time elapsed since the $j$-th previous intervention.
4.1.3. Honesty and Sample Splitting
To ensure an unbiased estimation of treatment effects, we employ the honest estimation approach proposed by [9]. Honest estimation involves splitting the training data into two subsamples: one used for tree construction (determining splits) and the other for estimation (computing leaf node predictions). This approach reduces the adaptive estimation bias that can arise when the same data are used for both split selection and leaf node estimation.
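A minimal sketch of honest sample splitting, assuming the training data are indexed by integer positions: one half determines the tree structure, and the disjoint half provides the leaf-level effect estimates.

```python
import numpy as np

def honest_split(n_samples, seed=0):
    """Split sample indices into a structure half and an estimation half."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    half = n_samples // 2
    structure_idx, estimation_idx = idx[:half], idx[half:]
    return structure_idx, estimation_idx

# Usage: trees are grown on (X[structure_idx], w[structure_idx], y[structure_idx]);
# leaf-level treatment effects are then re-estimated on the held-out estimation_idx sample.
```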
4.1.4. Multi-Treatment Extension
To handle multiple treatment arms in marketing campaigns, we extend the causal forest model to estimate treatment effects for each intervention relative to the control condition. For each treatment, $a \in \{1, \ldots, K\}$, we train a separate causal forest model, $\mathrm{CF}_a$, that estimates $\tau_t(a, x)$. Alternatively, we can employ a multi-treatment causal forest approach that simultaneously estimates treatment effects for all interventions within a single model.
4.1.5. Time-Varying Treatment Effects
To capture how treatment effects evolve over time, we incorporate time-varying components in our causal forest model. Specifically, we estimate treatment effects conditionally on time features:

$$\tau(a, x, z_t) = \mathbb{E}\left[Y_{i,t}(a) - Y_{i,t}(0) \mid x_{i,t} = x, z_t\right],$$

where $z_t$ represents time-specific features such as day of week, month, or time since the start of the campaign. This approach allows the model to capture temporal patterns in treatment effects, such as diminishing returns or seasonal variations.
4.2. Counterfactual Simulation Environment
The second component of our framework is a counterfactual simulation environment that emulates diverse customer response patterns. This environment serves as a testbed for evaluating and optimizing intervention strategies without requiring costly real-world experiments.
4.2.1. Response Model
The core of the counterfactual simulation environment is a response model that predicts customer outcomes under different interventions. The simulation environment preserves the symmetry between observed and potential outcomes, maintaining the fundamental symmetry properties inherent in causal inference frameworks. We model the response function as follows:

$$y_{i,t} = f_0(x_{i,t}) + f_{a_{i,t}}(x_{i,t}, h_{i,t}) + \epsilon_{i,t},$$

where the following applies: $f_0(x_{i,t})$ is the baseline response function, representing the expected outcome under the control condition (no intervention); $f_a(x_{i,t}, h_{i,t})$ is the treatment effect function for intervention $a$, which depends on both the current features, $x_{i,t}$, and the treatment history, $h_{i,t}$; and $\epsilon_{i,t}$ is a random noise term that captures unobserved factors affecting the outcome.

The treatment history, $h_{i,t}$, includes the sequence and timing of previous interventions:

$$h_{i,t} = \big\{(a_{i,j}, t_j, y_{i,j}) : j < t\big\},$$

where $(a_{i,j}, t_j, y_{i,j})$ represents the intervention applied, the time point, and the observed outcome for customer $i$ at time $j$.
4.2.2. Modeling Dynamic Effects
To capture the dynamic aspects of customer responses, we incorporate several types of effects in the treatment effect function, $f_a(x_{i,t}, h_{i,t})$:
Carryover Effects
Carryover effects represent how the impact of an intervention persists over time. We model carryover effects using an exponential decay function:

$$c_a(\Delta t) = \beta_a \, e^{-\lambda_c \Delta t},$$

where $\beta_a$ represents the initial impact of intervention $a$, $\Delta t$ is the time elapsed since the intervention was applied, and $\lambda_c$ controls the decay rate.
Adaptation Effects
Adaptation effects capture how customers’ responsiveness changes with repeated exposures to the same intervention. We model adaptation effects as follows:

$$d_a(n_{i,a,t}) = \alpha_a \, e^{-\lambda_d \, n_{i,a,t}},$$

where $n_{i,a,t}$ is the number of times intervention $a$ has been applied to customer $i$ before time $t$, $\alpha_a$ is the initial responsiveness, and $\lambda_d$ controls the rate of adaptation.
Interaction Effects
Interaction effects capture how the impact of an intervention depends on previous interventions. We model interaction effects using a matrix, $M \in \mathbb{R}^{K \times K}$, where $M_{a,b}$ represents the interaction effect between interventions $a$ and $b$:

$$m_a(h_{i,t}) = \sum_{(b,\, t_j,\, y_j) \in h_{i,t}} M_{a,b} \, e^{-\lambda_m (t - t_j)},$$
where $\lambda_m$ controls the decay rate of interaction effects. The overall treatment effect function combines these dynamic effects:

$$f_a(x_{i,t}, h_{i,t}) = \hat{\tau}(a, x_{i,t}) \cdot d_a(n_{i,a,t}) + c_a(\Delta t_a) + m_a(h_{i,t}),$$

where $\hat{\tau}(a, x_{i,t})$ is the base treatment effect estimated by the causal forest model, and $\Delta t_a$ is the time elapsed since intervention $a$ was last applied.
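The sketch below illustrates how these dynamic components can be combined with the base CATE estimate. The parameter values and the exact way the terms are composed are assumptions for exposition, not the calibrated values used in our experiments.

```python
import numpy as np

def carryover(beta_a, lam_c, dt):
    """Exponentially decaying residual impact of the last exposure to intervention a."""
    return beta_a * np.exp(-lam_c * dt)

def adaptation(alpha_a, lam_d, n_prev):
    """Responsiveness multiplier that shrinks with repeated exposures (fatigue)."""
    return alpha_a * np.exp(-lam_d * n_prev)

def interaction(M, a, history, lam_m, t_now):
    """Decayed pairwise interaction between intervention a and past interventions b."""
    return sum(M[a, b] * np.exp(-lam_m * (t_now - t_b)) for b, t_b in history)

def dynamic_effect(tau_base, a, history, t_now, params):
    """Combine the base CATE with adaptation, carryover, and interaction terms."""
    n_prev = sum(1 for b, _ in history if b == a)
    dt_last = min((t_now - t_b for b, t_b in history if b == a), default=np.inf)
    return (tau_base * adaptation(params["alpha"][a], params["lam_d"], n_prev)
            + carryover(params["beta"][a], params["lam_c"], dt_last)
            + interaction(params["M"], a, history, params["lam_m"], t_now))

# Example with made-up parameters for two treatment arms; the customer already
# received arm 0 at times 1.0 and 3.0.
params = {"alpha": [1.0, 1.0], "beta": [0.05, 0.08],
          "M": np.array([[0.0, 0.02], [0.02, 0.0]]),
          "lam_c": 0.1, "lam_d": 0.3, "lam_m": 0.2}
effect = dynamic_effect(tau_base=0.12, a=0, history=[(0, 1.0), (0, 3.0)],
                        t_now=5.0, params=params)
```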
4.2.3. State Transition Model
In addition to modeling outcomes, the counterfactual simulation environment includes a state transition model that captures how customer states evolve in response to interventions. We model the state transition as follows:

$$s_{i,t+1} = g(s_{i,t}, a_{i,t}, y_{i,t}),$$

where $g$ is a function that maps the current state, intervention, and outcome to the next state. This function can be implemented using various machine learning models, such as neural networks or gradient-boosting machines trained on historical data.
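One possible implementation of the transition function $g$ is a small neural network trained on historical (state, action, outcome, next state) tuples; the sketch below is a minimal PyTorch version with assumed feature dimensions and layer sizes.

```python
import torch
import torch.nn as nn

class StateTransitionModel(nn.Module):
    """Predicts the next customer state s_{t+1} from (s_t, a_t, y_t)."""

    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions + 1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action, outcome):
        # One-hot encode the intervention and append the observed outcome.
        a_onehot = nn.functional.one_hot(action, self.n_actions).float()
        x = torch.cat([state, a_onehot, outcome.unsqueeze(-1)], dim=-1)
        return self.net(x)

# Usage (hypothetical dimensions): train on historical (s_t, a_t, y_t, s_{t+1}) tuples
# with a mean-squared-error loss against the observed next state.
```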
4.3. Deep Reinforcement Learning for Dynamic Policy Optimization
The third component of our framework is a deep reinforcement learning agent that optimizes intervention strategies dynamically. The RL agent learns a policy that maps customer states to interventions, maximizing cumulative rewards over time.
4.3.1. Markov Decision Process Formulation
We formulate the dynamic uplift modeling problem as a Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
State space $\mathcal{S}$: the state $s_{i,t}$ represents the information about customer $i$ at time $t$, including the features, $x_{i,t}$, and the treatment history, $h_{i,t}$.
Action space $\mathcal{A}$: the action $a_{i,t}$ represents the marketing intervention applied to customer $i$ at time $t$.
Transition function $P$: the transition function models how customer states evolve in response to interventions.
Reward function $R$: The reward function quantifies the immediate benefit of applying intervention $a_{i,t}$ to customer $i$ in state $s_{i,t}$. The reward function is designed with symmetric properties to ensure fair value attribution across different customer segments and intervention types.
Discount factor $\gamma$: the discount factor balances the importance of immediate versus future rewards.
4.3.2. State Representation
To effectively capture the relevant information for decision-making, we design a rich state representation that includes the following: the customer features, $x_{i,t}$, including both static and dynamic features, as described earlier; treatment history features, summarizing the sequence and timing of previous interventions; estimated treatment effects from the causal forest model for each potential intervention, providing the agent with information about the expected incremental impact of each action; and uncertainty estimates for the treatment effects, enabling the agent to balance exploration and exploitation.
4.3.3. Reward Design
The reward function is a critical component of the RL formulation, as it defines the objective that the agent aims to optimize. We design a reward function that captures both immediate and long-term business objectives:

$$R(s_{i,t}, a_{i,t}) = v(y_{i,t}) - c(a_{i,t}),$$

where $v(y_{i,t})$ is the immediate reward associated with the outcome (e.g., conversion value and purchase amount), and $c(a_{i,t})$ is the cost of intervention $a_{i,t}$ (e.g., marketing spend and operational costs).

To encourage the agent to consider long-term customer value, we incorporate a forward-looking component in the following reward function:

$$R(s_{i,t}, a_{i,t}) = v(y_{i,t}) - c(a_{i,t}) + \lambda \, \Delta \mathrm{CLV}_{i,t},$$

where $\Delta \mathrm{CLV}_{i,t}$ represents the estimated change in long-term customer value resulting from the intervention, and $\lambda$ is a weight parameter that balances immediate returns with long-term value.
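A minimal sketch of this reward computation; the cost table and the weight value are placeholder assumptions, not the configuration used in our experiments.

```python
# Hypothetical per-action costs: control, email, discount offer.
INTERVENTION_COST = {0: 0.0, 1: 0.15, 2: 0.50}

def reward(outcome_value, action, delta_clv, lambda_clv=0.3):
    """Immediate value minus intervention cost, plus a weighted long-term CLV term."""
    return outcome_value - INTERVENTION_COST[action] + lambda_clv * delta_clv

# Example: a $40 purchase after a discount offer with an estimated +$5 change in CLV.
r = reward(outcome_value=40.0, action=2, delta_clv=5.0)
```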
4.3.4. Adaptive Reward Mechanism
To address the challenge of non-stationary customer responses, we introduce an adaptive reward mechanism that adjusts the reward function based on observed outcomes and estimated treatment effects. The adaptive reward function is defined as follows:

$$\tilde{R}(s_{i,t}, a_{i,t}) = R(s_{i,t}, a_{i,t}) + \phi\big(y_{i,t} - \hat{y}_{i,t}\big),$$

where $\phi$ is an adaptation function that modifies the reward based on the discrepancy between the predicted outcome, $\hat{y}_{i,t}$, and the observed outcome, $y_{i,t}$. This function can be implemented using various approaches, such as Thompson sampling or Bayesian optimization, to balance exploration and exploitation in a non-stationary environment.
4.3.5. Policy Learning with Deep Q-Networks
We employ Deep Q-Networks (DQNs) as the reinforcement learning algorithm for policy optimization. A DQN combines deep neural networks with Q-learning to approximate the action-value function $Q(s, a)$, which represents the expected cumulative discounted reward of taking action $a$ in state $s$ and following the optimal policy thereafter.

The Q-function is approximated using a neural network with parameters $\theta$, denoted $Q(s, a; \theta)$. The network is trained to minimize the mean squared error between the predicted Q-values and the target Q-values:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\Big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\Big)^2\right],$$

where $\mathcal{D}$ is a replay buffer containing past experiences, and $\theta^{-}$ are the parameters of a target network that is periodically updated to stabilize training.
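The corresponding temporal-difference update can be written compactly in PyTorch as follows; q_net and target_net are assumed to map state batches to per-action Q-values.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.95):
    """Mean squared TD error between predicted and target Q-values (standard DQN)."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'; theta_minus), zeroed at episode termination.
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)
```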
To enhance the stability and performance of the DQN algorithm, we incorporate several extensions:
Double DQN: we decouple action selection from action evaluation, using the online network to select the greedy action and the target network to evaluate it, which reduces overestimation bias.
Prioritized experience replay: we prioritize experiences with higher temporal-difference errors for more efficient learning.
Dueling network architecture: we separate the estimation of state values and action advantages for better generalization.
Noisy networks: we incorporate parameter noise for exploration instead of epsilon-greedy action selection.
4.3.6. Bootstrapping with Causal Forest Estimates
To accelerate the learning process, we bootstrap the RL agent with estimates from the causal forest model. Specifically, we initialize the Q-function so that $Q(s_{i,t}, a; \theta)$ approximates the immediate reward implied by the estimated treatment effect $\hat{\tau}(a, x_{i,t})$, where $\hat{\tau}(a, x_{i,t})$ is the treatment effect estimated by the causal forest model. This initialization provides the RL agent with a head start based on insights from the causal forest model while still allowing it to adapt and refine its policy over time.
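A sketch of this bootstrapping step: before interacting with the simulator, the Q-network is regressed onto immediate-reward proxies derived from the causal forest CATE estimates. The tensor shapes and the uplift-minus-cost target are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pretrain_q_network(q_net, states, cate_hat, costs, epochs=20, lr=1e-3):
    """Fit Q(s, a) to an estimated immediate reward (uplift value minus action cost).

    states:   (n_customers, state_dim) tensor of state representations
    cate_hat: (n_customers, n_actions) tensor of causal forest CATE estimates
    costs:    (n_actions,) tensor of per-action intervention costs
    """
    targets = cate_hat - costs.unsqueeze(0)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.mse_loss(q_net(states), targets)
        loss.backward()
        optimizer.step()
    return q_net
```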
4.3.7. Policy Refinement with Counterfactual Simulation
We use the counterfactual simulation environment to refine the RL policy without requiring costly real-world experiments. The policy refinement process involves the following steps: (1) initialize the RL agent with a policy derived from causal forest estimates; (2) simulate customer trajectories using the counterfactual simulation environment, applying the current policy to determine interventions; (3) update the RL agent’s policy based on observed outcomes in the simulation; (4) repeat steps 2–3 until convergence or a maximum number of iterations is reached. This approach enables the RL agent to explore and learn from a diverse range of customer scenarios, improving its ability to adapt to different customer behaviors and marketing contexts.
4.3.8. Algorithm Implementation
Algorithm 1 presents the pseudocode for our dynamic marketing uplift modeling framework, illustrating how the causal forest, counterfactual simulation, and reinforcement learning components interact.
The algorithm begins with an offline learning phase in which causal forests are trained for each intervention to estimate heterogeneous treatment effects. These estimates are then used to initialize a counterfactual simulation environment that emulates customer responses. In the policy refinement phase, we use a Double Deep Q-Network algorithm with prioritized experience replay and dueling architecture to optimize the intervention policy. The agent interacts with the counterfactual simulation environment, collecting experiences that are used to update the Q-network through temporal difference learning. Key aspects that differentiate our implementation include the following: the bootstrapping of the Q-network with causal forest estimates (line 12), the adaptive reward mechanism that incorporates both immediate outcomes and long-term value, the symmetry-preserving exploration strategy that ensures balanced learning across customer segments, and the continuous updating of model components as new data become available.
4.4. Integrated Framework Operation
The three components of our framework—causal forests, counterfactual simulation, and deep reinforcement learning—work together in an integrated manner to enable dynamic marketing uplift modeling. The overall operation of the framework involves the following steps:
1. Offline learning phase: The offline phase begins with training causal forest models on historical data to estimate heterogeneous treatment effects for each intervention. For the Criteo dataset, we trained separate models for the single treatment type, while for the RetailCo dataset, we developed individual models for each of the three email campaign types. The causal forest training process involved the careful handling of temporal aspects to prevent data leakage, with data from earlier time periods used for training and more recent data for validation. Next, we developed the counterfactual simulation environment based on both historical patterns and domain knowledge. Historical patterns were extracted from the data through a statistical analysis of customer responses under different conditions, identifying temporal trends, seasonality effects, and response decay patterns. Domain knowledge was incorporated in several specific ways: for the RetailCo dataset, marketing experts provided insights about typical email fatigue effects (modeled as diminishing returns after 3–4 emails within a 14-day window), cross-channel interaction effects (e.g., email followed by mobile notification, typically showing 15–20% higher response rates than two consecutive emails), and seasonal purchasing behaviors (e.g., higher responsiveness to promotional content during holiday periods). For the Criteo dataset, we incorporated knowledge about typical display advertising effects, including view-through conversion windows and frequency capping implications. The counterfactual simulation parameters were calibrated using a held-out portion of the historical data, with particular attention to accurately reproducing the observed heterogeneity in treatment effects across different customer segments. We validated the simulation environment by comparing its predictions against actual outcomes from historical randomized tests, achieving correlation coefficients of 0.78 for the RetailCo dataset and 0.72 for the Criteo dataset. Finally, we bootstrapped the RL agent with causal forest estimates and refined its policy through iterative simulation. The bootstrapping process involved initializing the Q-network to approximate the immediate rewards based on the estimated treatment effects, providing a more efficient starting point for policy learning compared to random initialization.
2. Online deployment phase: During deployment, the learned policy was applied to determine optimal interventions for each customer in real time. For the RetailCo dataset, this involved daily batch processing to select customers for different email campaign types, while for the Criteo dataset, decisions were made in near real-time (sub-second latency) for display advertising placement. Feedback from actual customer responses was systematically collected through the company’s existing tracking systems, including email opens, clicks, website visits, and purchases. These feedback data were anonymized and stored in a dedicated database for subsequent model updates. The models were updated periodically, rather than after each individual customer interaction. Specifically, we implemented a dual-trigger update mechanism: models were refreshed either after accumulating 10,000 new customer interactions or on a weekly schedule, whichever came first. This batch-updating approach balanced computational efficiency with the need for timely adaptation to evolving patterns. During each update cycle, the causal forest models were retrained with the augmented dataset, simulation parameters were recalibrated, and the RL policy was refined through additional simulation episodes.
3. Continuous learning loop: The continuous learning loop was implemented as an automated pipeline that monitored the performance of the framework through several key business metrics tracked in real-time dashboards: conversion rates, average order value, returns on marketing investment, and unsubscribe rates. Performance was evaluated both overall and for specific customer segments to identify potential areas for improvement. Patterns and insights were extracted from both successful and unsuccessful interventions through a combination of automated analysis and periodic review meetings with marketing stakeholders. For example, the analysis might reveal diminishing returns from promotional emails for certain customer segments, triggering adjustments to the reward function to place more emphasis on long-term engagement for those segments. The framework components were refined based on performance feedback through a systematic procedure: if performance declined for specific segments, feature engineering was revisited to capture potentially overlooked factors; if overall performance plateaued, exploration parameters in the RL component were temporarily increased; if the gap between simulated and actual outcomes grew, the simulation environment parameters were recalibrated with greater weight on recent data.
Algorithm 1 Dynamic marketing uplift modeling framework.
Require: Historical data D, customer features X, interventions A
Ensure: Optimized policy π*
1: /* Offline Learning Phase */
2: D_train, D_val, D_test ← TemporalSplit(D, [0.7, 0.15, 0.15])
3: X ← FeatureEngineering(D_train) {Create static, dynamic, and history features}
4: for each intervention a ∈ A do
5:    CF_a ← TrainCausalForest(D_train, X, Y, a) {Train causal forest for intervention a}
6:    τ̂_a(x, t) ← EstimateCATE(CF_a, x, t) {Estimate conditional average treatment effects}
7: end for
8: CSim ← InitializeCounterfactualSimulation({τ̂_a}, D_train)
9: Initialize replay buffer B
10: Initialize Q-network Q(s, a; θ) and target network Q(s, a; θ⁻) with random weights
11: Initialize exploration parameter ε
12: /* Policy Refinement Phase */
13: for episode e = 1 to E do
14:    Sample batch of customers from D_train
15:    Get initial states {s_{i,0}}
16:    for t = 0 to T do
17:       for each customer i in batch do
18:          if rand() < ε then
19:             a_{i,t} ← random action from A
20:          else
21:             a_{i,t} ← argmax_a Q(s_{i,t}, a; θ)
22:          end if
23:          (s_{i,t+1}, y_{i,t}) ← CSim.Step(s_{i,t}, a_{i,t})
24:          r_{i,t} ← ComputeReward(s_{i,t}, a_{i,t}, y_{i,t})
25:          Store transition (s_{i,t}, a_{i,t}, r_{i,t}, s_{i,t+1}) in B
26:          if replay buffer has enough samples then
27:             Sample random mini-batch of transitions from B
28:             Calculate target values using double Q-learning:
29:                y ← r + γ · Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻)
30:             Update θ by minimizing L(θ) = (y − Q(s, a; θ))²
31:             Periodically update target network: θ⁻ ← θ
32:          end if
33:       end for
34:    end for
35:    Decay exploration parameter ε
36:    Evaluate policy on D_val
37: end for
38: /* Evaluate final policy on test set */
39: π*(s) ← argmax_a Q(s, a; θ)
40: Evaluate π* on D_test
41: return π*
This integrated approach combines the strengths of causal inference for heterogeneous treatment effect estimation with the adaptive capabilities of reinforcement learning for sequential decision-making. The counterfactual simulation environment serves as a bridge between these two components, enabling safe exploration and policy refinement without costly real-world experiments.

It is important to clarify how the conditional average treatment effect (CATE) estimates flow between the offline and online phases of our framework. As shown in Figure 1, the causal forest model estimates CATEs during the offline learning phase using historical data. These initial CATE estimates are then used to bootstrap the reinforcement learning agent’s Q-function initialization, stored as part of the customer state representation for the online deployment phase, and updated periodically as new data become available through the continuous learning loop. During the online phase, the customer state includes features, history, and the most recently estimated CATE values. This allows the framework to leverage both the robust causal estimates from the offline phase and the adaptive policy optimization from the reinforcement learning component. The combination provides a balance between reliable treatment effect estimation and dynamic adaptation to evolving customer behaviors. The CATE values in the customer state serve multiple purposes: they inform the exploration strategy of the RL agent (with higher uncertainty estimates leading to more exploration), they contribute to the reward function calculation, and they provide interpretable insights for business stakeholders regarding which customer segments are most responsive to specific interventions.
5. Experiments
In this section, we present an experimental evaluation of our proposed framework for dynamic marketing uplift modeling. We first describe the datasets used for the evaluation, followed by the baseline methods, evaluation metrics, implementation details, and experimental results.
5.1. Datasets
5.1.1. Criteo Uplift Dataset
The Criteo Uplift Dataset [49] is a publicly available dataset specifically designed for uplift modeling research. It contains data from a randomized controlled trial conducted by Criteo, a digital advertising company. The dataset includes approximately 13.9 million samples with 12 feature variables, a treatment indicator, and a binary conversion outcome. The treatment in this dataset corresponds to a display advertising campaign, and the outcome is whether the user converted (e.g., made a purchase) within a specific time window after exposure to the ad.

The specific marketing intervention in this dataset consists of personalized display advertisements shown to users while browsing various websites. For example, a user might be shown a targeted banner ad for a product they previously viewed but did not purchase, with the goal of encouraging them to complete the transaction. These ads might contain product images, promotional messages, discounts, or calls to action like “Limited time offer” or “Buy now”. The control group received no advertisement, allowing for direct measurement of the incremental effect of the display advertising intervention on conversion probability.

The dataset has several characteristics that make it suitable for evaluating dynamic uplift modeling approaches: it includes timestamps for each observation, allowing us to model temporal dynamics; the features are anonymized but include both static customer attributes and dynamic behavioral features; the randomized treatment assignment enables unbiased estimation of treatment effects; the large sample size allows for a reliable estimation of heterogeneous treatment effects across different customer segments. For our experiments, we augment the original dataset with additional derived features that capture temporal patterns and treatment history, such as the recency, frequency, and timing of previous exposures to advertisements.
5.1.2. RetailCo Customer Relationship Management Dataset
The second dataset is from a large retail company (anonymized as RetailCo), and it contains customer relationship management (CRM) data spanning two years. This proprietary dataset includes records of email marketing campaigns sent to customers and their subsequent purchasing behaviors. The dataset contains approximately 2.5 million samples from 450,000 unique customers, with 36 feature variables, multiple treatment options (different types of email campaigns), and both binary (purchase/no purchase) and continuous (purchase amount) outcome variables. This dataset encompasses a variety of marketing interventions in the form of different email campaign types. Promotional emails contained limited-time discount offers (e.g., “20% off your next purchase”), free shipping promotions, or buy-one-get-one offers. These interventions were designed to create urgency and immediate conversion. Informational emails featured new product announcements, seasonal catalogs, or style guides without explicit promotions, aiming to increase brand engagement and generate interest in new merchandise. Product recommendation emails contained personalized product suggestions based on customers’ browsing history, past purchases, or similar customer preferences. For example, these emails included “We thought you might like these items” sections with products from categories the customer previously showed interest in. Re-engagement emails targeted customers who had been inactive for a specific period (e.g., 60+ days), featuring special “We miss you” messaging and incentives to return to the site. The control condition in this dataset represents customers who were eligible for email campaigns but did not receive any during the observation period, allowing for measurement of the incremental impact of each intervention type. The key characteristics of this dataset include the following: longitudinal data with multiple interventions per customer over time; rich customer features, including demographics, purchase history, browsing behavior, and engagement metrics; multiple treatment types with varying content, offers, and sending times; and detailed outcome information, including both conversion events and monetary values. This dataset allows us to evaluate our approach in a more complex and realistic marketing scenario with sequential interventions and evolving customer behaviors.
5.2. Baseline Methods
We compare our dynamic marketing uplift modeling approach with several baseline methods.
Random assignment: randomly assigns treatments to customers, representing a non-targeted approach.
Response model: targets customers based on their predicted response probability, ignoring treatment effects.
Traditional uplift model [29]: estimates treatment effects using a meta-learner approach (S-learner) with gradient boosting machines as base learners.
Static causal forest [9]: applies the causal forest model for uplift estimation but without considering dynamic effects or sequential optimization.
Dynamic uplift model [24]: incorporates time-varying components in the uplift model but without reinforcement learning for policy optimization.
Contextual bandits [43]: uses a contextual multi-armed bandit approach to optimize interventions based on customer features, but without modeling long-term effects.
For the RetailCo dataset with multiple treatment options, we also included a business rule baseline that represents the company’s existing targeting strategy based on manually crafted business rules and segmentation.
5.3. Evaluation Metrics
We evaluate the performance of our approach using the following metrics.
Average treatment effect (ATE): The average difference in outcomes between the treatment and control groups. While simple to calculate, ATE masks heterogeneity in treatment effects across different customer segments [50]. A positive ATE indicates that the intervention is effective on average but does not provide insights into which specific segments benefit most from the intervention.
Qini coefficient: A measure of uplift model performance, calculated as the area between the uplift curve and the random targeting curve, normalized by the area of the perfect model [8,51]. The Qini coefficient quantifies targeting efficiency by measuring how much better the model performs compared to random targeting across different targeting thresholds. Higher values indicate superior ability to identify customers with positive treatment effects, with values close to 1 representing near-perfect targeting.
Expected response rate lift: The lift in response rate achieved by the model compared to random targeting, measured at different targeting thresholds. This metric directly translates to business value by showing the incremental gain in conversion rates when targeting specific percentiles of customers ranked by predicted uplift [52]. When reported at multiple thresholds (e.g., top 10% and 30%), it provides insights into the model’s performance at different campaign scales.
Return on investment (ROI): The net incremental return (incremental outcome value minus intervention cost) divided by the intervention cost. ROI connects model performance to financial outcomes, making it particularly relevant for business stakeholders [1]. This metric helps balance the trade-off between targeting accuracy and campaign costs.
Customer lifetime value (CLV) impact: The estimated change in customer lifetime value resulting from the optimized intervention strategy. Unlike immediate conversion metrics, CLV impact captures the long-term effects of interventions on customer value [53,54]. This is critical for sustainable marketing strategies that aim to build lasting customer relationships, rather than merely driving short-term conversions.
For the dynamic aspects of our approach, we also evaluate the following.
Adaptation performance: How well the model adapts to changing customer behaviors over time, measured by the stability of performance metrics across different time periods. This metric is particularly important for evaluating dynamic models in non-stationary environments where customer preferences evolve [55]. High adaptation performance indicates resilience to concept drift and temporal shifts in customer behavior.
Cumulative reward: The total reward accumulated over multiple interaction periods, capturing the long-term effectiveness of the intervention strategy. This reinforcement learning metric evaluates the model’s ability to optimize sequential decisions over time, rather than just immediate outcomes [35]. A higher cumulative reward indicates better long-term value optimization through sequential decision-making.
These metrics collectively assess both the immediate targeting efficiency and long-term value optimization capabilities of our approach, providing a comprehensive evaluation framework that aligns with both technical and business objectives.
5.4. Implementation Details
5.4.1. Data Preprocessing
For both datasets, we performed the following preprocessing steps. (1) Feature normalization: continuous features were standardized to have zero mean and unit variance. (2) Missing value imputation: missing values were imputed using a combination of mean imputation for continuous features and mode imputation for categorical features. (3) Feature engineering: we created additional features to capture temporal patterns, such as time since last treatment, day of week, month, and treatment frequency. (4) Temporal splitting: Data were split into training (70%), validation (15%), and test (15%) sets, maintaining the temporal order to prevent data leakage. For the RetailCo dataset, we also performed one-hot encoding of categorical variables and aggregated historical purchase data to create customer-level RFM (recency, frequency, and monetary value) features.
For the temporal split, we utilized 3 months of historical data for the Criteo Uplift Dataset and 12 months for the RetailCo CRM Dataset. Our experiments with various historical time windows revealed that 12-month windows provided the optimal balance between capturing long-term patterns and maintaining relevance to current behaviors. Shorter windows (3–6 months) resulted in approximately 8–10% lower Qini coefficients, while very long windows (18+ months) showed a 3–5% performance decrease. The effectiveness of historical data length varied by customer segment, with high-frequency purchase categories sometimes benefiting from shorter, more recent data windows, while considered purchases with longer decision cycles benefited from the full 12-month window.
For feature engineering, we incorporated several specific features across both datasets. The static features for the Criteo Dataset included the 12 anonymized features provided in the original dataset. For the RetailCo Dataset, static features comprised demographics (age group, gender, and location), acquisition source, customer tenure, loyalty tier, and preferred product categories. The dynamic features included recent engagement metrics such as open rates, click rates, and conversion rates calculated over multiple time windows (7-day, 30-day, and 90-day) to capture short, medium, and long-term patterns. We also incorporated recency–frequency–monetary (RFM) metrics, including days since the last purchase, purchase frequency in the last 90 days, and average order value in the last 90 days. Additional dynamic features included response decay (exponentially weighted average of past response rates with higher weights for more recent responses) and seasonal indicators such as the day of the week, the week of the month, and proximity to major shopping events. For history features, we incorporated the eight most recent interventions for each customer in the RetailCo dataset and the five most recent interventions in the Criteo dataset. We also included the intervention spacing (time intervals between consecutive interventions in days), the response to previous interventions (binary indicators for whether the customer responded to each of the previous interventions), and channel diversity (the entropy of communication channels used in recent interventions, for the RetailCo dataset only). For the RetailCo dataset, we also engineered cross-channel interaction features that capture how customers respond to sequences of different intervention types (e.g., promotional email followed by product recommendation). Time-based definitions were calibrated based on the purchase cycle in each dataset: for high-frequency retail purchases in the RetailCo dataset, “recent” typically referred to 7–30-day windows, while for the Criteo dataset, which involved higher consideration purchases, “recent” encompassed 30–90-day windows. All categorical features were one-hot encoded, while continuous features were standardized using z-score normalization (subtracting the mean and dividing by the standard deviation). For features with highly skewed distributions such as monetary values and time intervals, we applied log-transformation before standardization. Missing values were imputed as described earlier, with means for continuous features and modes for categorical features calculated from the training data only.
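As an illustration of the RFM-style dynamic features, the following sketch computes trailing-window recency, frequency, and monetary features with pandas; the column names (customer_id, order_date, amount) are hypothetical and do not reflect the proprietary RetailCo schema.

```python
import pandas as pd

def rfm_features(orders: pd.DataFrame, as_of: pd.Timestamp, window_days: int = 90):
    """Recency, frequency, and monetary value per customer over a trailing window."""
    recent = orders[(orders["order_date"] > as_of - pd.Timedelta(days=window_days))
                    & (orders["order_date"] <= as_of)]
    grouped = recent.groupby("customer_id").agg(
        frequency_90d=("order_date", "count"),   # number of orders in the window
        monetary_90d=("amount", "mean"),         # average order value in the window
        last_purchase=("order_date", "max"),
    )
    grouped["recency_days"] = (as_of - grouped["last_purchase"]).dt.days
    return grouped.drop(columns="last_purchase")
```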
5.4.2. Causal Forest Implementation
We implemented the causal forest model using the EconML library [56] (https://github.com/py-why/EconML, accessed on 1 April 2025), which provides efficient implementations of various causal machine learning methods. Specifically, we used the CausalForest class with the following hyperparameters. Number of trees: 2000; minimum samples leaf: 5; maximum depth: 10; criterion: “mse” (mean squared error); bootstrap: true; and number of jobs: −1 (use all available cores). The hyperparameters were tuned using a grid search with validation-set performance. For the multi-treatment case in the RetailCo dataset, we trained separate causal forest models for each treatment type.
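A minimal sketch of this setup, assuming the econml.grf.CausalForest interface (fit(X, T, y) and predict(X)); the one-vs-control arm encoding and the omission of the bootstrap setting are simplifications for illustration.

```python
import numpy as np
from econml.grf import CausalForest  # assuming the EconML forest API referenced above

def fit_causal_forest(X, treatment, y, arm):
    """Fit one causal forest for treatment `arm` versus the control condition (0)."""
    mask = np.isin(treatment, [0, arm])           # keep control and the chosen arm
    T_binary = (treatment[mask] == arm).astype(int)
    cf = CausalForest(
        n_estimators=2000,
        min_samples_leaf=5,
        max_depth=10,
        criterion="mse",
        n_jobs=-1,
    )
    cf.fit(X[mask], T_binary, y[mask])
    return cf

# Usage: per-customer CATE estimates for new feature vectors X_new.
# cate_hat = fit_causal_forest(X, treatment, y, arm=1).predict(X_new)
```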
5.4.3. Counterfactual Simulation Implementation
The counterfactual simulation environment was implemented as a Python 3.8 class using NumPy and Pandas. The environment simulates customer responses based on the estimated treatment effects from the causal forest model, combined with dynamic effect components. The simulation parameters were calibrated using historical data and expert knowledge from marketing practitioners at RetailCo.
Specifically, we estimated the following parameters from the data: the baseline response function, $f_0$, using a gradient boosting machine trained on control group data; the carryover effect parameters, $\beta_a$ and $\lambda_c$, by analyzing the persistence of treatment effects over time; and the adaptation effect parameters, $\alpha_a$ and $\lambda_d$, by examining how customer responsiveness changes with repeated exposures.
The state transition model was implemented using a recurrent neural network (RNN) that predicts the next state given the current state, action, and outcome.
5.4.4. Deep Reinforcement Learning Implementation
We implemented the deep reinforcement learning agent using PyTorch 2.2.0 and the RLlib library. Specifically, we used the Deep Q-Network (DQN) algorithm with the following specifications.
Neural network architecture. Input layer: dimension equal to the state representation size; hidden layers: three fully connected layers with 256, 128, and 64 neurons, each followed by ReLU activation; output layer: dimension equal to the number of actions.
Hyperparameters. Discount factor $\gamma$: 0.95; learning rate: 0.0001; replay buffer size: 100,000; mini-batch size: 64; target network update frequency: 1000 steps; exploration strategy: $\epsilon$-greedy with annealing from 1.0 to 0.01 over 100,000 steps.
We incorporated the extensions mentioned in the Methodology section, including Double DQN, prioritized experience replay, and dueling network architecture, to improve stability and performance. The details are as follows. Double DQN: We addressed overestimation bias by using separate networks for action selection and evaluation. The target value was calculated as $y = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\big)$, where $\theta^{-}$ represents the target network parameters. Prioritized experience replay: We assigned a priority, $p_i = |\delta_i| + \epsilon$, to each transition in the replay buffer, where $\delta_i$ is the temporal-difference error and $\epsilon$ is a small positive constant. Transitions were sampled with probability proportional to $p_i^{\alpha}$, and importance sampling weights, with the correction exponent $\beta$ annealed from 0.4 to 1.0, were applied to the loss function. Dueling network architecture: we split the network into value and advantage streams that were combined as $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$ to better identify state values independent of actions. Adaptive exploration: rather than using a simple $\epsilon$-greedy approach with fixed annealing, we implemented a more nuanced exploration strategy that adjusted exploration rates based on uncertainty estimates from the causal forest models.
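A minimal PyTorch sketch of the dueling architecture with the 256/128/64 layer sizes listed above; the exact module layout is an assumption.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q-network with separate state-value and advantage streams."""

    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.value = nn.Linear(64, 1)               # V(s)
        self.advantage = nn.Linear(64, n_actions)   # A(s, a)

    def forward(self, state):
        z = self.backbone(state)
        v, a = self.value(z), self.advantage(z)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return v + a - a.mean(dim=1, keepdim=True)
```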
5.4.5. Parameter Determination
For all model components, we determined optimal parameters through systematic hyperparameter optimization, rather than using default values. For the causal forest models, we performed a grid search over the following ranges: number of trees {500, 1000, 2000, 3000}, minimum samples leaf {1, 5, 10, 20}, and maximum depth {6, 8, 10, 12}. Selection was based on out-of-bag validation performance. For the reinforcement learning component, we conducted a two-stage optimization process. First, we performed a coarse grid search over candidate learning rates, discount factors {0.9, 0.95, 0.99}, and network architectures. Then, we performed a finer search around promising configurations. The final parameters were selected based on the model’s performance on the validation set, measured by the cumulative reward and Qini coefficient. The reward function weights, which balance immediate conversions with long-term value, were determined through A/B testing on a subset of the RetailCo dataset, comparing different weight configurations against business objectives. For the Criteo dataset, we used a simplified reward function since long-term customer value information was not available. For both datasets, we employed 5-fold cross-validation during the hyperparameter optimization phase to ensure robust parameter selection, with the final evaluation performed on the held-out test set to avoid overfitting to the validation data.
5.5. Experimental Results
5.5.1. Results on Criteo Uplift Dataset
Figure 2 shows the Qini curves for different methods on the Criteo Uplift Dataset. The Qini curve plots the incremental number of conversions as a function of the targeting size, with a steeper curve indicating better targeting performance. As shown in the figure, our Dynamic Causal RL approach achieves the highest Qini coefficient (0.52), followed by the static causal forest (0.41) and the traditional uplift model (0.37). The random assignment baseline has a Qini coefficient of approximately 0, as expected for a non-targeted approach.
Table 1 presents the detailed performance metrics for all methods on the Criteo Uplift Dataset. Our approach outperforms all baselines across multiple metrics, with particularly notable improvements in the expected response rate lift at the 30% targeting threshold.
Figure 3 illustrates how our approach adapts to changing customer behaviors over time. The plot shows the relative performance (measured by Qini coefficient) of different methods across six consecutive time periods. While all methods show some performance degradation over time due to evolving customer behaviors, our Dynamic Causal RL approach maintains higher performance throughout, with a significantly smaller decline in the later periods.
5.5.2. Results on RetailCo CRM Dataset
For the RetailCo dataset, we evaluated our approach on both binary (purchase/no purchase) and continuous (purchase amount) outcomes.
Figure 4 shows the expected response rate lift at different targeting thresholds for the binary outcome. Our approach achieves substantial improvements over the baselines, particularly in the top 10-20% of targeted customers.
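A minimal sketch of the expected response rate lift at a targeting threshold is shown below: rank customers by predicted uplift, keep the top k%, and compare treated versus control response rates within that slice. Variable names are illustrative and the inputs are assumed to be NumPy arrays.

```python
import numpy as np


def response_rate_lift_at_k(uplift_pred, y, t, k=0.10):
    """Absolute response-rate lift (treated minus control) within the top-k% of customers by predicted uplift."""
    top = np.argsort(-uplift_pred)[: max(1, int(k * len(uplift_pred)))]
    y_top, t_top = y[top], t[top]
    rate_treated = y_top[t_top == 1].mean() if (t_top == 1).any() else 0.0
    rate_control = y_top[t_top == 0].mean() if (t_top == 0).any() else 0.0
    return rate_treated - rate_control
```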
Table 2 summarizes the performance metrics for all methods on the RetailCo CRM Dataset. Our approach achieves a 27% improvement in targeting efficiency (measured by Qini coefficient) and an 18% increase in ROI compared to the business rule baseline used by the company.
Figure 5 shows the distribution of treatment assignments for different customer segments under our approach compared to the business rule baseline. Our approach demonstrates more nuanced targeting, with different treatments assigned to different customer segments based on their estimated response patterns.
To evaluate the long-term impact of our approach, we simulated customer interactions over multiple periods using the counterfactual simulation environment.
Figure 6 shows the cumulative reward (measured as incremental revenue minus marketing costs) over 12 simulated periods. Our approach accumulates significantly higher rewards over time compared to the baselines, highlighting the benefits of optimizing for long-term customer value, rather than immediate responses.
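The multi-period evaluation behind Figure 6 can be summarized by the loop below, a hedged sketch in which `env` and `policy` stand in for the counterfactual simulation environment and the learned targeting policy; the per-period reward is incremental revenue minus marketing cost, as in the text.

```python
def simulate_cumulative_reward(env, policy, n_periods=12):
    """Roll a targeting policy forward in a simulated customer environment and track cumulative reward."""
    state = env.reset()
    cumulative, trajectory = 0.0, []
    for _ in range(n_periods):
        actions = policy(state)                   # one marketing action per customer
        state, revenue, cost = env.step(actions)  # simulated customer responses for this period
        cumulative += revenue - cost              # incremental revenue minus marketing costs
        trajectory.append(cumulative)
    return trajectory
```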
5.5.3. Ablation Study
To understand the contribution of different components of our framework, we conducted an ablation study by removing or replacing key components and measuring the impact on performance.
Table 3 presents the results of this ablation study on the RetailCo CRM Dataset.
The ablation study reveals that all components of our framework contribute positively to the overall performance. Removing the causal forest component (replacing it with a traditional uplift model) leads to the largest performance drop, indicating the importance of accurate heterogeneous treatment effect estimation. The reinforcement learning component and the modeling of dynamic effects also contribute significantly to the performance, highlighting the value of sequential optimization and capturing temporal dynamics in customer responses.
To further evaluate the contribution of different reinforcement learning extensions, we conducted additional experiments comparing the performance of our framework with and without specific DQN enhancements.
Table 4 presents the results of these experiments on the RetailCo CRM Dataset.
As shown in Table 4, all three extensions contributed to improved performance, with the removal of all extensions (Vanilla DQN) resulting in a substantial performance drop. Prioritized experience replay provided the most significant individual contribution, improving the Qini coefficient by approximately 0.07 points compared to the variant without this extension. This is likely because marketing datasets typically contain imbalanced conversion events, and prioritizing experiences with higher temporal-difference errors helps the agent learn more effectively from rare but informative conversion events. Double DQN contributed to more accurate Q-value estimation, reducing the overestimation bias that could otherwise lead to suboptimal targeting decisions. The dueling network architecture improved the stability of learning, particularly for customer segments where the action selection had minimal impact on outcomes. Beyond the performance metrics shown in the table, we also observed that these extensions led to faster convergence during training (approximately 35% fewer iterations required) and greater consistency across multiple training runs (standard deviation of the Qini coefficient across five runs reduced by 42%). These results highlight the importance of advanced reinforcement learning techniques in addressing the unique challenges of dynamic marketing uplift modeling, particularly when dealing with sparse rewards and heterogeneous customer responses.
To further explore how different components interact with each other, we conducted additional ablation experiments incorporating combinations of component removals.
Table 5 presents the results of these combination experiments on the RetailCo CRM Dataset.
These combination experiments reveal several important insights about component interactions. First, removing both the causal forest and the dynamic effects components (Qini coefficient: 0.36) produced a performance drop larger than the sum of the drops from their individual removals (which yielded Qini coefficients of 0.43 and 0.46, respectively), suggesting a synergistic relationship between accurate heterogeneous treatment effect estimation and temporal dynamics modeling. Similarly, removing both reinforcement learning and the adaptive reward mechanism led to a Qini coefficient of 0.39, lower than would be expected from their individual contributions; this highlights how the adaptive reward mechanism enhances the effectiveness of the reinforcement learning component by providing more informative feedback signals. The most severe performance degradation occurred when both the causal forest and reinforcement learning components were removed (Qini coefficient: 0.33), reducing the framework essentially to a dynamic uplift model without the benefits of either robust causal inference or sequential decision optimization, which underscores the fundamental importance of these two components. When all key components were removed, yielding a traditional uplift modeling approach (Qini coefficient: 0.30), performance was substantially worse than for any other variant, confirming the significant value added by our integrated framework. Overall, these combination experiments provide strong evidence for the complementary nature of our framework components, with each addressing specific limitations of traditional approaches while also enhancing the effectiveness of the others through their interactions.
5.5.4. Computational Environment and Execution Time Measurement
To evaluate the computational efficiency of different methods, we measured their execution times under identical conditions. All experiments were conducted on a server equipped with an Intel Xeon E5-2680 v4 CPU (2.4 GHz, 14 cores), 128 GB of RAM, and an NVIDIA Tesla V100 GPU with 16 GB of memory.
Table 6 presents the end-to-end runtime for each method, including model training and inference on the test set. For reinforcement learning-based approaches, the execution time includes the simulation runs required for policy optimization. The reported times are averaged over five independent runs to ensure reliability. We excluded data preprocessing time from these measurements since it is common across all methods.
As shown in Table 6, our Dynamic Causal RL approach requires more computational resources than simpler baselines, which is expected given its more sophisticated modeling capabilities. The ablation study results provide additional insights into the computational cost of different components: removing the reinforcement learning component results in the largest reduction in execution time, while removing the adaptive reward mechanism has the smallest impact. Despite the higher computational requirements, the improved performance of our approach justifies the additional computational cost, especially in marketing scenarios where even small improvements in targeting efficiency can translate into significant financial gains. Furthermore, once trained, the inference time of our model is comparable to that of the other approaches, making it suitable for real-time deployment scenarios where decisions need to be made within milliseconds.
5.6. Case Study: Email Campaign Optimization
To demonstrate the practical application of our approach, we conducted a small-scale pilot study with RetailCo, focusing on optimizing email marketing campaigns for a subset of their customer base. For this pilot, we deployed our framework to optimize the targeting strategy for three types of email campaigns (promotional, informational, and product recommendation) across 50,000 customers over an 8-week period.
The pilot study compared three approaches: (1) business as usual (BAU), RetailCo’s existing targeting rules; (2) static uplift, a traditional uplift modeling approach; (3) our Dynamic Causal RL approach.
Figure 7 shows the key performance indicators (KPIs) for the three approaches, including conversion rates, average order values, and ROI. Our approach achieved a 23% higher conversion rate and a 31% higher ROI compared to the BAU approach, demonstrating the practical value of dynamic uplift modeling in real-world marketing scenarios.
Importantly, the pilot study also revealed that our approach led to a more balanced communication strategy, with fewer customers receiving excessive emails and more customers receiving relevant communications. This resulted in lower unsubscribe rates and improved customer satisfaction scores, highlighting the broader benefits of optimized targeting beyond immediate conversion metrics.
5.7. Discussion and Limitations
Our experimental results provide quantitative evidence for how our framework addresses the key challenges identified in Section 3.4. For delayed effects, 23% of conversions in the RetailCo dataset occurred more than 7 days after the intervention, and our framework showed a 15% improvement in long-term ROI compared to models optimizing only for immediate conversions. For carryover effects, interventions influenced customer behavior for an average of 18 days; modeling these patterns reduced the number of emails per customer by 26% while maintaining conversion rates, contributing approximately 0.04 points to the Qini coefficient improvement. For adaptation effects, response rates declined by 7% for each consecutive email within a 30-day period, and our adaptive approach reduced unsubscribe rates by 22% compared to baseline approaches by automatically adjusting frequency for fatigue-prone segments. The exploration–exploitation tradeoff was managed with a 20% initial exploration rate gradually reduced to 5%, resulting in 3% higher long-term ROI compared to completely eliminating exploration. For counterfactual estimation, our approach achieved 83% prediction accuracy for conversion events, with the simulation environment maintaining a 0.76 correlation with actual outcomes over 6 months. These results demonstrate how our methodological contributions directly address the specific challenges of dynamic marketing uplift modeling.
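As a concrete illustration of the exploration schedule mentioned above, the sketch below decays the exploration rate from 20% toward 5%; the exponential form and decay horizon are assumptions, since the text only states the two endpoints.

```python
import math


def exploration_rate(step, start=0.20, end=0.05, decay_steps=100_000):
    """Illustrative decay from `start` to roughly `end` over `decay_steps` steps; the shape is assumed."""
    frac = min(step / decay_steps, 1.0)
    return end + (start - end) * math.exp(-5.0 * frac)
```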
However, our approach also has several limitations that should be acknowledged. Computational complexity: the training and deployment of our framework require significant computational resources, particularly for large-scale marketing campaigns with millions of customers. Data requirements: the accurate estimation of heterogeneous treatment effects and dynamic components requires rich historical data with randomized treatment assignment, which may not be available in all marketing contexts. Model interpretability: while causal forests provide some level of interpretability, the integration with deep reinforcement learning introduces complexity that may reduce the transparency of targeting decisions for business stakeholders. Cold-start problem: the framework relies on historical data for initialization, which may limit its effectiveness for new customers or new campaign types without sufficient data. An additional limitation worth acknowledging is the challenge of adapting to sudden, major environmental shifts, such as the dramatic change in consumer behavior during the COVID-19 pandemic. While our framework’s reinforcement learning component and adaptive reward mechanism provide some ability to adjust to evolving patterns, extreme disruptions may still require explicit recalibration. In such scenarios, possible approaches include the following: (1) implementing change detection algorithms to identify when the environment has significantly shifted; (2) incorporating external context variables (e.g., lockdown status and economic indicators) that may explain behavioral changes; (3) applying transfer learning techniques to adapt pre-existing models to new conditions with limited data; and (4) temporarily increasing exploration rates when significant distribution shifts are detected. These adaptations represent promising directions for future research to enhance model resilience during major environmental changes. Despite these limitations, our framework provides a promising approach to optimizing marketing interventions in dynamic environments, with demonstrated benefits in both simulated scenarios and real-world applications.
6. Conclusions and Future Work
6.1. Conclusions
This paper has introduced a novel framework for dynamic marketing uplift modeling that integrates causal forests with deep reinforcement learning. By combining the strengths of both approaches, our framework addresses the key limitations of traditional uplift modeling methods, particularly their inability to capture the dynamic nature of customer responses and optimize sequential interventions.
Our main contributions include the following: (1) a unified framework that leverages causal forests for accurate heterogeneous treatment effect estimation and deep reinforcement learning for dynamic policy optimization, (2) a counterfactual simulation methodology that enables the exploration of diverse intervention strategies without costly real-world experiments, and (3) an adaptive reward mechanism that balances short-term conversion goals with long-term customer value considerations.
Empirical evaluations on both the Criteo Uplift Dataset and the RetailCo CRM Dataset demonstrated that our approach significantly outperforms traditional static uplift models and other baseline methods. Specifically, our framework achieved a 27% improvement in targeting efficiency (measured by the Qini coefficient) and an 18% increase in ROI compared to industry-standard approaches. The results also showed that our method adapts better to changing customer behaviors over time, maintaining higher performance across different time periods than static approaches.
The case study with RetailCo further validated the practical value of our approach, with a 23% higher conversion rate and a 31% higher ROI compared to business-as-usual marketing strategies. Importantly, the optimized targeting strategy led to more balanced communication patterns, reducing unsubscribe rates and improving overall customer satisfaction.
In summary, our dynamic marketing uplift modeling framework offers marketers a powerful tool for implementing personalized and adaptive campaign strategies that evolve with changing customer behaviors and market conditions. By preserving symmetry properties in both causal estimation and reinforcement learning components, our framework achieves more balanced and generalizable performance across different marketing contexts. By addressing the limitations of traditional static uplift models, our approach enables more effective and efficient marketing interventions that maximize both immediate responses and long-term customer value.
6.2. Future Work
While our framework demonstrates promising results, several directions for future research remain.
Multi-touch attribution: extending our framework to incorporate multi-touch attribution models would enable a more accurate assessment of the contribution of each marketing touchpoint to conversion events, further improving the precision of uplift estimates in multi-channel marketing environments.
Contextual bandits integration: exploring the integration of contextual bandits with our framework could provide a more efficient approach for balancing exploration and exploitation, particularly in scenarios with limited historical data or rapidly changing customer preferences.
Model interpretability: enhancing the interpretability of our framework would make it more accessible to marketing practitioners. Future work could focus on developing visualization tools and explanation methods to help marketers understand why specific interventions are recommended for different customer segments.
Online learning: implementing online learning capabilities would allow the model to continuously update and refine its parameters as new data becomes available, further improving its adaptability to evolving customer behaviors and market conditions.
Privacy-preserving uplift modeling: as privacy regulations become increasingly stringent, developing privacy-preserving versions of our framework that can operate with anonymized or federated data would be valuable for deployment in privacy-sensitive contexts.
Cross-domain transfer learning: investigating methods for transferring knowledge across different marketing campaigns or product categories could reduce the data requirements for new campaigns, enabling faster deployment and better initial performance.
Symmetry-preserving uplift modeling: further exploration of symmetry-preserving uplift modeling techniques could enhance the robustness and fairness of personalized marketing interventions, particularly in highly heterogeneous customer populations.
These research directions aim to address the current limitations of our framework and expand its applicability to a wider range of marketing scenarios and organizational contexts. As digital marketing continues to evolve with increasing personalization and real-time optimization requirements, frameworks like ours that combine causal inference with reinforcement learning will become increasingly important for effective customer engagement and value creation.