Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects

Kim, Jong-Min

doi:10.3390/analytics4040029

Open AccessArticle

Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects

by

Jong-Min Kim

^1,2

¹

Statistics Discipline, Division of Science and Mathematics, University of Minnesota-Morris, Morris, MN 56267, USA

²

EGADE Business School, Tecnológico de Monterrey, Ave. Rufino Tamayo, Monterrey 66269, Mexico

Analytics 2025, 4(4), 29; https://doi.org/10.3390/analytics4040029

Submission received: 10 September 2025 / Revised: 15 October 2025 / Accepted: 17 October 2025 / Published: 21 October 2025

Download

Browse Figures

Versions Notes

Abstract

We introduce a distributional CNN-LSTM framework for probabilistic multivariate modeling and heterogeneous treatment effect (HTE) estimation. The model jointly captures complex dependencies among multiple outcomes and enables precise estimation of individual-level conditional average treatment effects (CATEs). In simulation studies with multivariate Gaussian mixtures, the CNN-LSTM demonstrates robust density estimation and strong CATE recovery, particularly as mixture complexity increases, while classical methods such as Kernel Density Estimation (KDE) and Gaussian Copulas may achieve higher log-likelihood or coverage in simpler scenarios. On real-world datasets, including Iris and Criteo Uplift, the CNN-LSTM achieves the lowest CATE RMSE, confirming its practical utility for individualized prediction, although KDE and Gaussian Copula approaches may perform better on global likelihood or coverage metrics. These results indicate that the CNN-LSTM can be trained efficiently on moderate-sized datasets while maintaining stable predictive performance. Overall, the framework is particularly valuable in applications requiring accurate individual-level effect estimation and handling of multimodal heterogeneity—such as personalized medicine, economic policy evaluation, and environmental risk assessment—with its primary strength being superior CATE recovery under complex outcome distributions, even when likelihood-based metrics favor simpler baselines.

Keywords:

distributional CNN-LSTM; copula modeling; kernel density estimation; counterfactual inference; multivariate dependence; causal effect estimation

MSC:

62H20; 62M45; 62P25; 68T07

1. Introduction

Accurately modeling multivariate densities remains a central challenge in statistics, econometrics, and machine learning. Real-world datasets frequently exhibit multimodality, heteroscedasticity, and asymmetric dependence, which pose difficulties for traditional parametric approaches. Nonparametric methods, such as Kernel Density Estimation (KDE), offer flexibility in capturing multimodal structures, but their performance deteriorates in high-dimensional settings or with sparse data [1,2,3,4,5]. Copula-based models offer an alternative by separating marginal distributions from dependence structures; however, their performance critically depends on accurate marginal estimation and the choice of copula family [6,7].

Recent machine learning advances introduce neural architectures that directly parameterize probability distributions, enabling the capture of nonlinear dependencies and heteroscedasticity. Mixture Density Networks [8] and Deep Ensembles [9] predict full distributional outputs, while Variational Autoencoders and Normalizing Flows [10,11,12] further enhance representational flexibility. Copula-based deep learning approaches [13,14,15,16,17,18,19,20] combine dependence modeling with neural representations, yet challenges remain in interpretability and conditional effect estimation.

In this study, we propose a hybrid Distributional CNN-LSTM framework for multivariate density estimation and heterogeneous treatment effect (HTE) modeling. Unlike KDE, which struggles with dimensionality, and Gaussian Copulas, which impose restrictive elliptical dependencies, our method flexibly captures nonlinear, heteroscedastic, and multimodal structures within a unified architecture. A primary motivation is accurate recovery of individual-level conditional average treatment effects (CATEs) under multimodal heterogeneity—a scenario where classical models often fail to capture complex treatment–covariate interactions. Importantly, the CNN-LSTM’s strength lies in robust estimation of heterogeneous causal effects rather than maximizing global log-likelihood alone.

Our contributions are threefold. First, we present a principled algorithmic framework with detailed derivations and schematic illustrations to enhance transparency and reproducibility. Second, we extend the model for counterfactual inference, evaluating both average and conditional treatment effects. Third, we demonstrate practical utility by comparing the CNN-LSTM with KDE and Gaussian Copula baselines across simulated multimodal datasets and real-world datasets, including Iris and Criteo Uplift. While KDE or Copulas may achieve higher likelihood or coverage in simpler settings, the CNN-LSTM consistently exhibits superior conditional effect estimation, as measured by lower CATE RMSE, highlighting its value for individualized prediction.

In summary, this work unifies nonparametric estimation, copula modeling, and distributional deep learning into a single framework that balances interpretability, flexibility, and practical relevance, enabling accurate multivariate density modeling and individualized effect estimation in complex real-world applications.

2. Methods

Let the dataset consist of n independent observations, denoted by

D = {(x_{i}, w_{i}, y_{i})}_{i = 1}^{n},

where

x_{i} \in R^{T \times d}

is a temporal sequence of length T with d features (covariates),

w_{i} \in {0, 1}

is a binary treatment indicator, and

y_{i} \in R^{m}

is the corresponding multivariate outcome vector. The dataset is randomly permuted and partitioned into training (70%), validation (15%), and test (15%) sets. Each input

x_{i}

can be viewed as a matrix in

R^{T \times d}

, while the entire covariate dataset is represented as a tensor in

R^{n \times T \times d}

.

Treatment assignment follows a Bernoulli distribution with probability

e (x_{i}) = Pr (w_{i} = 1 ∣ x_{i}),

which may be constant (e.g.,

0.5

in simple simulations) or covariate-dependent in more general designs. The primary objective is to learn the conditional mean response function

μ (x, w) = E [Y ∣ X = x, W = w],

from which the conditional average treatment effect (CATE) is defined as

τ (x) = μ (x, 1) - μ (x, 0) .

Depending on the learning strategy, the model may incorporate w as an explicit input (S-learner), fit separate models for treated and control groups (T-learner), or use alternative meta-learners (X- or DR-learners). In all cases, including w ensures that

μ (x, w)

is well-defined.

We parameterize

μ (x, w)

using a neural function

f_{θ}

with parameters

θ

:

f_{θ} : R^{T \times d} \times {0, 1} \to R^{m} .

The parameters are estimated by minimizing the empirical loss

min_{θ} \frac{1}{n} \sum_{i = 1}^{n} L (y_{i}, f_{θ} (x_{i}, w_{i})),

using the training set, while hyperparameter tuning and early stopping are guided by validation performance. Final predictive accuracy and generalization are assessed on the test set.

This framework accommodates various sequence-based neural architectures. In its basic form, the CNN-LSTM outputs a multivariate Gaussian parameterization

(μ, Σ)

for each observation, where

μ

is the mean vector and

Σ

is the covariance matrix. This implies unimodal conditional densities. Claims of multimodality are therefore limited to cases where mixture-density or flow-based extensions are introduced. All model specifications and training configurations strictly match the R code used for simulation and empirical evaluation, ensuring transparent replication of both coverage and CATE estimation results.

In the R implementation, CNN-LSTM models were trained using keras3 and tensorflow, with reproducibility ensured via fixed random seeds, consistent batch ordering, and deterministic GPU operations. Convergence diagnostics (loss stabilization, gradient norms, and early-stopping epochs) were automatically logged for every run. Training runtimes (mean seconds/epoch) were compared across models to quantify computational efficiency relative to KDE and Copula baselines. For reproducibility, all hyperparameters (filters, kernel sizes, hidden units, dropout, batch size, learning rate schedule, early stopping rules, and constraints on variance/correlation parameters), as well as random seeds and software versions, are provided in the R code GitHub site, accessed on 17 October 2025, https://github.com/kjonomi/Rcode/blob/main/distributional-CNN-LSTM.

2.1. Kernel Density Estimation (KDE) Baseline

As a nonparametric baseline, we employ kernel density estimation (KDE) to approximate the joint distribution of multivariate outcomes. KDE is a widely used technique for estimating probability density functions without assuming a specific parametric form [1,21].

Given training outcomes

{y_{i}}_{i = 1}^{n}

with

y_{i} \in R^{m}

, the multivariate KDE is defined as

{\hat{f}}_{H} (y) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{det {(H)}^{1 / 2}} K (H^{- 1 / 2} (y - y_{i})),

where

y \in R^{m}

is a test point,

K (\cdot)

is a multivariate kernel density (commonly the standard Gaussian kernel), and

H \in R^{m \times m}

is a symmetric positive definite bandwidth matrix that controls smoothing.

In practice, the bandwidth matrix

H

can be selected via cross-validation, plug-in methods, or rules-of-thumb such as Silverman’s rule [1]. For two-dimensional outcomes (

m = 2

), densities at test points are often interpolated from the KDE evaluation grid using bilinear interpolation.

KDE provides a flexible nonparametric benchmark for distributional modeling, capturing complex and potentially multimodal dependencies without relying on strong parametric assumptions [2]. However, its performance deteriorates in higher-dimensional settings due to the curse of dimensionality, which motivates comparisons with parametric or neural distributional models such as the Distributional CNN-LSTM [22].

Kernel density estimation can also be used to estimate conditional average treatment effects (CATE) in a nonparametric manner. Given covariates

X

and treatment assignment

W \in {0, 1}

, we estimate the conditional outcome densities

f (y ∣ X = x, W = w)

using a conditional KDE approach that incorporates a non-parametric weighting kernel over the covariate space

x

.

S-learner-like approach: One can fit a joint conditional KDE including W as an additional input dimension, and compute the conditional mean for

W = 1

and

W = 0

at a given

x

:

\hat{τ} (x) = E_{\hat{f}} [y ∣ x, W = 1] - E_{\hat{f}} [y ∣ x, W = 0] .

T-learner-like approach: Alternatively, separate conditional KDEs are fitted for each treatment arm, yielding

{\hat{f}}_{0} (y ∣ x)

and

{\hat{f}}_{1} (y ∣ x)

. The CATE is then estimated as the difference in conditional means:

\hat{τ} (x) = E_{{\hat{f}}_{1}} [y ∣ x] - E_{{\hat{f}}_{0}} [y ∣ x] .

While KDE is fully nonparametric and can capture multimodal conditional distributions, its practical application to CATE estimation is limited in higher-dimensional covariate spaces due to the curse of dimensionality. Nevertheless, it provides a useful benchmark for assessing the performance of parametric or neural distributional models, such as the Distributional CNN-LSTM.

2.2. Gaussian Copula

The Gaussian Copula (GC) provides a semi-parametric approach to modeling dependence [23]. It uses a multivariate Gaussian distribution to define the dependence structure, while allowing for arbitrary marginal distributions. The copula function is defined as follows:

C_{Σ} (u_{1}, \dots, u_{m}) = Φ_{Σ} (Φ^{- 1} (u_{1}), \dots, Φ^{- 1} (u_{m})),

where

Φ_{Σ}

is the joint CDF of a multivariate normal distribution with correlation matrix

Σ

, and

Φ^{- 1}

is the univariate standard normal quantile function.

Given estimated marginal distributions

F_{j} (y_{j})

and their corresponding marginal densities

f_{j} (y_{j})

, the joint density is obtained via Sklar’s Theorem:

f_{GC} (y) = {| Σ |}^{- 1 / 2} exp (- \frac{1}{2} z^{⊤} (Σ^{- 1} - I) z) \prod_{j = 1}^{m} f_{j} (y_{j}),

where

z_{j} = Φ^{- 1} (F_{j} (y_{j}))

are the standard normal scores. The correlation matrix

Σ

is typically estimated using empirical rank correlations (Kendall’s

τ

or Spearman’s

ρ

), followed by a nearest positive-definite correction to ensure a valid copula structure. Marginal distributions

F_{j}

and

f_{j}

are estimated nonparametrically, typically via kernel density smoothing.

To model dependencies separately from marginal distributions, we employ a Gaussian copula framework [6,7]. Copulas provide a flexible approach to construct a multivariate distribution by combining arbitrary marginals with a dependence structure captured by a copula function [23].

Let the marginal cumulative distribution functions (CDFs) be

F_{j}

with corresponding probability density functions (PDFs)

f_{j}

and parameters

θ_{j}

, for

j \in {1, \dots, m}

, where m is the number of outcomes. Let

Y = (Y_{1}, \dots, Y_{m})

be the random vector of interest, and

y = (y_{1}, \dots, y_{m})

be a realization.

The joint CDF

F (y)

is expressed using a Gaussian copula

C_{R}

as follows:

F (y_{1}, \dots, y_{m}) = P (Y_{1} \leq y_{1}, \dots, Y_{m} \leq y_{m}) = C_{R} (F_{1} (y_{1}), \dots, F_{m} (y_{m})),

(1)

where

R \in R^{m \times m}

is the correlation matrix (i.e., the linear correlation matrix of the latent Gaussian variables) that captures the dependence structure. This matrix is typically estimated via maximum likelihood [24].

The corresponding joint probability density function (PDF)

f (y)

is derived using Sklar’s theorem:

f (y_{1}, \dots, y_{m}) = c_{R} (u_{1}, \dots, u_{m}) \prod_{j = 1}^{m} f_{j} (y_{j}),

(2)

where

u_{j} = F_{j} (y_{j})

for

j = 1, \dots, m

, and

c_{R}

is the copula density associated with the Gaussian copula

C_{R}

.

This decomposition effectively separates the marginal behavior (captured by

f_{j} (y_{j})

and

F_{j} (y_{j})

) from the dependence structure (captured by

c_{R}

and

R

). This allows for flexible modeling of heterogeneous and non-Gaussian marginals while retaining a tractable correlation structure [25]. Gaussian copulas have been widely applied in finance, econometrics, and risk management to capture linear and nonlinear dependencies [26], and serve as a natural parametric baseline for comparison with neural and nonparametric distributional models [27].

2.3. Coverage and Evaluation Metrics

Model performance is evaluated through three primary metrics:

1.: Log-Likelihood (LL): The average test log-likelihood, which measures the model’s ability to fit the observed conditional density:

$LL = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} log \hat{f} (y_{i} ∣ x_{i}, w_{i}) .$
2.: Coverage: The empirical proportion of true outcomes falling within the model’s 95% predictive region (e.g., ellipsoidal contour for Gaussian/Copula or Highest Density Region for KDE).
3.: CATE RMSE: The root mean squared error between estimated and true CATE values, quantifying the accuracy of individual-level causal effect estimation:

$CATE-RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(\hat{τ} (x_{i}) - τ (x_{i}))}^{2}} .$

2.4. Conditional Average Treatment Effect (CATE) Estimation

To assess the practical utility in causal inference, we evaluate each model’s ability to estimate the CATE. For each test unit i, the CATE is the expected difference between the potential outcomes

Y_{i} (1)

and

Y_{i} (0)

:

{\hat{τ}}_{i} = E [Y_{i} (1)] - E [Y_{i} (0)] .

The expected outcome

E [Y_{i} (w)]

is computed by integrating the predicted conditional density

\hat{f} (y ∣ x_{i}, w)

over the outcome space

y

.

Each model generates counterfactual outcome samples to approximate this expectation:

Distributional CNN-LSTM: Samples are drawn from the predicted multivariate Gaussian distribution (or mixture distribution) parameterized by the network output [27].
KDE: Samples are generated by resampling from weighted training points to approximate the conditional distribution $\hat{f} (y ∣ x_{i}, w)$ [1,2].
Gaussian Copula: Samples are drawn from the fitted Gaussian copula with estimated marginal distributions [6,7].

Model performance is quantified using root mean squared error (RMSE) and bias:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {({\hat{τ}}_{i} - τ_{i})}^{2}}, Bias = \frac{1}{n} \sum_{i = 1}^{n} ({\hat{τ}}_{i} - τ_{i}) .

These metrics are standard in causal inference for measuring the accuracy and systematic deviation of treatment effect estimates [28]. All three methods are evaluated under the same counterfactual estimation procedure, ensuring fair comparison of CATE accuracy, bias, and uncertainty calibration.

2.5. Evaluation Details and Reproducibility

The correlation matrix

Σ

for the Gaussian Copula is estimated using the empirical Kendall or Spearman correlation matrix, followed by a nearest positive-definite correction (Matrix::nearPD) to ensure valid copula structure.

Regarding the Coverage metric, we explicitly note a fairness caveat: for CNN-LSTM and Copula models, coverage is computed based on ellipsoidal contours derived from predicted covariance matrices

Σ_{i}

. For KDE, coverage is evaluated using Highest Density Regions (HDRs) that may be irregularly shaped and multimodal. As these constructs differ geometrically, coverage percentages should be interpreted qualitatively rather than strictly numerically.

All models were trained and evaluated using standardized R pipelines (keras3, tensorflow, copula) for fair comparison. Random seeds, sample partitions, and evaluation grids are synchronized across methods to ensure replicable numerical outcomes. Furthermore, model scalability is quantitatively reported, including average training time per epoch, total convergence time, and validation loss stability (via standard deviation across epochs). These diagnostics provide transparent measures of efficiency and numerical robustness.

2.6. Comparison Metrics

To quantitatively evaluate model performance, we employ a set of complementary metrics that capture predictive accuracy, uncertainty calibration, and causal inference reliability.

1.: Mean Log-Likelihood (MLL): For a multivariate observation $y_{i} \in R^{m}$ , let the conditional predictive density be denoted by $\hat{f} (y_{i} ∣ x_{i}, w_{i})$ . The log-likelihood contribution of $y_{i}$ is

$l_{i} = log \hat{f} (y_{i} ∣ x_{i}, w_{i}),$

and the mean log-likelihood across the test set (of size n) is

$MLL = \frac{1}{n} \sum_{i = 1}^{n} l_{i} .$

This metric measures how well the model captures the underlying conditional distribution and is widely used in probabilistic forecasting [27,29].
2.: 90% Coverage: The proportion of test points $y_{i}$ falling within the model’s 90% credible region evaluates the calibration of predicted uncertainty [30]. For Gaussian-based models (CNN-LSTM, Gaussian Copula), this is computed using the Mahalanobis distance condition:

${(y_{i} - μ_{i})}^{⊤} Σ_{i}^{- 1} (y_{i} - μ_{i}) \leq χ_{m, 0.90}^{2},$

where $χ_{m, 0.90}^{2}$ is the 0.90-quantile of the chi-squared distribution with m degrees of freedom. For non-Gaussian or multimodal baselines (KDE), coverage is computed via Monte Carlo sampling of the predictive distribution to construct Highest Density Regions (HDRs), ensuring a consistent evaluation objective across methods. Well-calibrated models should yield empirical coverage close to the nominal $90 %$ level.
3.: CATE RMSE: The root mean squared error of estimated conditional average treatment effects (CATEs) is

$RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {({\hat{τ}}_{i} - τ_{i})}^{2}},$

which quantifies the accuracy of individual-level treatment effect estimation [28].
4.: CATE Bias: The signed mean bias of estimated CATEs is

$Bias = \frac{1}{n} \sum_{i = 1}^{n} ({\hat{τ}}_{i} - τ_{i}),$

capturing systematic over- or underestimation of treatment effects [31,32].

This benchmarking framework provides a comprehensive evaluation of models, highlighting trade-offs between generative fidelity (MLL), calibration (coverage), and causal inference accuracy (CATE RMSE and bias). It allows dthe irect comparison of parametric, nonparametric, and neural distributional approaches under a unified evaluation protocol.

3. Simulation Study

3.1. Bivariate Distribution

To evaluate model performance under varying distributional complexities, we conducted controlled simulation experiments in three settings: (i) binodal (two modes), (ii) trimodal (three modes), and (iii) quadrimodal (four modes). All scenarios are based on mixtures of bivariate Gaussian distributions with additive measurement error. This design reflects practical situations where real-world data often arise from multiple latent subpopulations with overlapping variability.

Figure 1 illustrates the setup of the simulation study. We simulated

n = 2000

observations

y_{i} = {(y_{i 1}, y_{i 2})}^{⊤}, i = 1, \dots, n,

from a mixture of four bivariate normal distributions with asymmetric covariance structures. Independent Gaussian noise was added to both dimensions to emulate measurement error. The resulting distribution exhibits widely spread, overlapping modes with distinct correlations.

The dataset was randomly split into training, validation, and test sets with proportions

n_{train} = 0.7 n, n_{val} = 0.15 n, n_{test} = 0.15 n .

All features were standardized prior to modeling.

Three models were compared:

1.: Distributional CNN-LSTM: A neural network composed of convolutional layers followed by an LSTM was trained to output the parameters of a bivariate Gaussian,

$θ_{i} = (μ_{i 1}, μ_{i 2}, σ_{i 1}, σ_{i 2}, ρ_{i}) .$

Training minimized the negative log-likelihood of the observed training points $y_{i}$ . This approach allows heteroscedasticity and covariance to vary across samples.
2.: Kernel Density Estimation (KDE): A two-dimensional KDE was fitted on the training data ${y_{i}}_{i = 1}^{n_{train}}$ . Test densities ${\hat{f}}_{h} (y_{i})$ were computed via bilinear interpolation across the evaluation grid. KDE captures multimodality but assumes smooth density surfaces.
3.: Gaussian Copula: Empirical marginals $F_{1}, F_{2}$ with densities $f_{1}, f_{2}$ were combined with a Gaussian copula $C_{ρ}$ to estimate joint dependence. Test set densities were obtained as

$f (y_{i}) = c_{ρ} (u_{i 1}, u_{i 2}) f_{1} (y_{i 1}) f_{2} (y_{i 2}), u_{i j} = F_{j} (y_{i j}),$

where $c_{ρ} (u_{i 1}, u_{i 2})$ denotes the Gaussian copula density.

Each simulation setting was repeated 20 times with independently generated random seeds to assess the numerical stability and variability of evaluation metrics. Reported results correspond to the mean and standard deviation across replications. All computations were implemented in R (version 4.4.1) with parallelization across four cores.

The CNN-LSTM was trained for 150 epochs using the Adam optimizer (learning rate

1 \times 10^{- 3}

) with early stopping based on validation log-likelihood. Average training time per epoch was approximately 2.4 s, with stable convergence in all runs.

For KDE, bandwidth selection used likelihood cross-validation. The evaluation time per KDE fit averaged 9.3 s per replication.

For the Gaussian Copula, the correlation parameter

ρ

was estimated via maximum pseudo-likelihood. Fitting was performed using copula::fitCopula() with convergence confirmed for all runs. Average fitting time per model was 0.8 s. Runtime per model was recorded for transparency.

Each model was evaluated on the test set (

n_{test}

observations) using the following criteria:

Mean Log-Likelihood (MLL): Average log-density of the test points $y_{i}$ under the fitted conditional model $\hat{f} (y_{i} ∣ x_{i}, w_{i})$ :

$MLL = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} log \hat{f} (y_{i} ∣ x_{i}, w_{i}) .$
90% Confidence Coverage: For Gaussian-based models (CNN-LSTM, Gaussian Copula), this is the proportion of test points lying within the 90% confidence ellipse derived from the predicted covariance matrix:

$Σ_{i} = (\begin{matrix} σ_{i 1}^{2} & ρ_{i} σ_{i 1} σ_{i 2} \\ ρ_{i} σ_{i 1} σ_{i 2} & σ_{i 2}^{2} \end{matrix}) .$
CATE Estimates: Synthetic treatment effects $τ_{i}$ (the oracle values) were simulated. The accuracy of model-based CATE estimates ${\hat{τ}}_{i}$ was quantified using the Root Mean Squared Error (RMSE) and Bias on the test set:

${RMSE}_{test} = \sqrt{\frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} {({\hat{τ}}_{i} - τ_{i})}^{2}}, {Bias}_{test} = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} ({\hat{τ}}_{i} - τ_{i}) .$

For qualitative assessment, we generated 90% confidence ellipses for a subset of test points from each model. Ellipses were colored according to the corresponding CATE estimate

{\hat{τ}}_{i}

, illustrating how each model captures both predictive uncertainty and treatment heterogeneity across the multimodal distribution.

For the KDE baseline, coverage was computed via Highest Density Regions (HDRs), which may be irregularly shaped and multimodal, rather than ellipsoids. Accordingly, coverage comparisons across models are interpreted qualitatively, not strictly numerically.

All metrics were computed on identical test partitions using synchronized random seeds to ensure replicability across methods. Numerical stability of RMSE and bias was confirmed via repeated runs.

3.1.1. General Setup

For each scenario, we generated

n = 2000

observations

y_{i} = {(y_{i 1}, y_{i 2})}^{⊤} \in R^{2}, i = 1, \dots, n,

from a Gaussian mixture model (GMM) with K mixture components. Each observation was generated as

y_{i} = Z_{i} + ε_{i},

where

Z_{i} \sim \sum_{k = 1}^{K} p_{k} N (μ_{k}, Σ^{(k)}), \sum_{k = 1}^{K} p_{k} = 1,

and

ε_{i} \sim N (0, diag (σ_{1}^{2}, σ_{2}^{2}))

represents additive measurement error. The measurement error was fixed at

(σ_{1}, σ_{2}) = (0.55, 0.65)

.

The mixture parameters

{μ_{k}, Σ^{(k)}, p_{k}}_{k = 1}^{K}

were varied across scenarios to induce multimodality, asymmetry, and heterogeneous correlations. Equivalently,

Z_{i}

can be interpreted as first sampling a component

k \sim Categorical (p_{1}, \dots, p_{K})

, and then drawing

Z_{i} \sim N (μ_{k}, Σ^{(k)})

.

3.1.2. Binodal Distribution

The first scenario considers a symmetric two-mode distribution. The two mixture components are centered at

μ_{1} = {(- 2, - 2)}^{⊤}, μ_{2} = {(2, 2)}^{⊤},

with covariance matrices

Σ_{1} = (\begin{matrix} 1 & 0.6 \\ 0.6 & 1 \end{matrix}), Σ_{2} = (\begin{matrix} 1 & - 0.6 \\ - 0.6 & 1 \end{matrix}),

and equal mixture weights

p_{1} = p_{2} = 0.5

. This configuration produces two clusters with opposing correlation structures, mimicking subpopulations with positive and negative feature dependencies.

Figure 2 highlights the challenges posed by overlapping modes and correlation heterogeneity.

Figure 3 illustrates how three models—Distributional CNN-LSTM, KDE, and Gaussian Copula—approximate the joint distribution of the two outcome variables

y_{i 1}

and

y_{i 2}

while encoding CATE.

CNN-LSTM: Produces multiple ellipsoidal contours reflecting a flexible, localized structure capable of capturing multimodality. The color gradient represents the heterogeneity in estimated CATE values ${\hat{τ}}_{i}$ . This flexibility in modeling local uncertainty and CATE heterogeneity comes at a cost of potential over-fitting complexity.
KDE: Provides a smooth, non-parametric approximation of the joint distribution. Due to smoothing, the visible modes are less sharply defined. The CATE scale is narrower compared to the CNN-LSTM, reflecting a less expressive capture of treatment heterogeneity but demonstrating good stability.
Gaussian Copula: Generates an intrinsically elliptical dependence structure that models correlation between $y_{i 1}$ and $y_{i 2}$ . It underestimates the inherent multimodality of the underlying data compared to the CNN-LSTM and exhibits the smallest range of estimated CATE heterogeneity.

Test set evaluation metrics are summarized as follows:

Mean Log-Likelihood: KDE (−3.7248) achieves the highest value, indicating the strongest generative fit to the data; CNN-LSTM and Copula are slightly lower.
90% Coverage: Gaussian Copula ( $97.33 %$ ) provides the best coverage, suggesting superior uncertainty calibration in this setting, followed by KDE ( $95.33 %$ ); CNN-LSTM ( $92.67 %$ ) is closer to the nominal level but less calibrated than baselines.
CATE RMSE: KDE ( $1.3881$ ) has the lowest RMSE, suggesting the most accurate treatment effect estimation in this relatively simple binodal scenario.
CATE Bias: CNN-LSTM exhibits nearly zero bias ( $0.0019$ ), while KDE slightly underestimates ( $- 0.0979$ ) and Copula slightly overestimates ( $0.1028$ ).

In summary, for the simple binodal structure, KDE provides the best combination of generative fit (MLL) and CATE accuracy (RMSE), while Gaussian Copula achieves the best coverage. The CNN-LSTM excels at capturing multimodal structure and maintains the lowest CATE bias, highlighting the trade-off between flexibility, stability, and calibration in simpler distributional settings.

3.1.3. Trimodal Distribution

The second scenario considers a Gaussian mixture model with

K = 3

components, introducing asymmetry and wider spread. Each observation

y_{i} = {(y_{i 1}, y_{i 2})}^{⊤} \in R^{2}

is generated. The component parameters are as follows:

\begin{matrix} μ_{1} & = {(- 4, - 1)}^{⊤}, & Σ^{(1)} & = (\begin{matrix} 3 & 1.8 \\ 1.8 & 2 \end{matrix}), \\ μ_{2} & = {(5, 3)}^{⊤}, & Σ^{(2)} & = (\begin{matrix} 4 & - 1.2 \\ - 1.2 & 3 \end{matrix}), \\ μ_{3} & = {(0, 5)}^{⊤}, & Σ^{(3)} & = (\begin{matrix} 2 & 0.5 \\ 0.5 & 1.5 \end{matrix}), \end{matrix}

with mixing probabilities

(p_{1}, p_{2}, p_{3}) = (0.40, 0.35, 0.25)

. This configuration generates three clusters with varying density, location, and correlation, reflecting heterogeneous real-world populations.

Overall, the CNN-LSTM outperforms KDE and Gaussian Copula in capturing the complex multimodality, asymmetric correlations, and heterogeneous treatment effects inherent in the trimodal distribution. Both visualizations and quantitative metrics consistently support the CNN-LSTM’s superior CATE recovery and generative fidelity in this high-complexity scenario.

Figure 4 is the representative realizations of the trimodal distribution, illustrating multimodality, heterogeneous correlations, and measurement error.

Figure 5 compares the performance of CNN-LSTM, Kernel Density Estimation (KDE), and Gaussian Copula in capturing the trimodal distribution of

y_{i 1}

and

y_{i 2}

, and in estimating conditional average treatment effects (CATE,

{\hat{τ}}_{i}

).

CNN-LSTM: Produces multiple ellipsoidal contours colored by ${\hat{τ}}_{i}$ . The varying sizes, orientations, and spatial distribution of these contours successfully capture the inherent multimodality, asymmetry, and heterogeneous correlations of the mixture components.
KDE: Due to smoothing, the estimated density appears as a single smooth, large region. While KDE is non-parametric, the degree of mode overlap and the selected bandwidth lead to an oversimplified representation that fails to resolve the three distinct modes.
Gaussian Copula: Represents dependencies using a single Gaussian-based elliptical structure. As a semi-parametric model with a Gaussian dependence function, it fundamentally underrepresents the underlying multimodality and, consequently, underestimates the true range of treatment effect heterogeneity.

Quantitative evaluation on the test set:

Mean Log-Likelihood: CNN-LSTM achieves the highest value, indicating the best generative fit to the observed data under this complex distributional structure.
90% Coverage: CNN-LSTM achieves $91 %$ , demonstrating superior calibration of its confidence intervals, closely matching the nominal level.
CATE RMSE: CNN-LSTM has the lowest RMSE, reflecting superior accuracy in predicting heterogeneous treatment effects.
CATE Bias: CNN-LSTM exhibits minimal bias, indicating nearly unbiased treatment effect estimates.

3.1.4. Quadrimodal Distribution

The third scenario considers a Gaussian mixture model with

K = 4

components, introducing increased structural complexity. Each observation

y_{i} = {(y_{i 1}, y_{i 2})}^{⊤} \in R^{2}

is generated. The component parameters are as follows:

\begin{matrix} μ_{1} & = {(- 4, - 1)}^{⊤}, & Σ^{(1)} & = (\begin{matrix} 3 & 1.8 \\ 1.8 & 2 \end{matrix}), \\ μ_{2} & = {(5, 3)}^{⊤}, & Σ^{(2)} & = (\begin{matrix} 4 & - 1.2 \\ - 1.2 & 3 \end{matrix}), \\ μ_{3} & = {(0, 5)}^{⊤}, & Σ^{(3)} & = (\begin{matrix} 2 & 0.5 \\ 0.5 & 1.5 \end{matrix}), \\ μ_{4} & = {(6, - 3)}^{⊤}, & Σ^{(4)} & = (\begin{matrix} 3 & - 0.8 \\ - 0.8 & 2.5 \end{matrix}) . \end{matrix}

The mixing proportions are

(p_{1}, p_{2}, p_{3}, p_{4}) = (0.30, 0.25, 0.25, 0.20)

. This configuration produces four overlapping clusters, including two with negative correlations, testing models’ ability to capture multimodality, non-linear dependencies, and heterogeneous correlations.

Figure 6 is the representative realizations of the quadrimodal distribution, illustrating multimodality, heterogeneous correlations, and measurement error.

Figure 7 compares the performance of CNN-LSTM, Kernel Density Estimation (KDE), and Gaussian Copula in representing the quadrimodal distribution and estimating conditional average treatment effects (CATE,

{\hat{τ}}_{i}

).

CNN-LSTM: Displays multiple overlapping ellipses with varying sizes and orientations, capturing multimodality and asymmetric correlations. Ellipse colors correspond to ${\hat{τ}}_{i}$ , showing heterogeneous treatment effects across the distribution.
KDE: Shows a single smooth ellipse representing the overall distribution. Captures global structure but fails to represent multimodality and asymmetric correlations.
Gaussian Copula: Represents dependencies using a single Gaussian-based ellipse. Captures correlation structure but underestimates multimodality and treatment effect heterogeneity due to Gaussian assumptions.

Quantitative evaluation on the test set:

Mean Log-Likelihood: CNN-LSTM achieves the highest value ( $- 5.3565$ ), indicating the best fit to observed data.
90% Coverage: CNN-LSTM achieves $91.0 %$ , suggesting confidence intervals closely match the true distribution.
CATE RMSE: CNN-LSTM attains the lowest RMSE ( $3.9985$ ), reflecting superior accuracy in predicting treatment effects.
CATE Bias: CNN-LSTM exhibits the lowest bias ( $- 1.8515$ ), indicating nearly unbiased CATE estimates.

Overall, CNN-LSTM outperforms KDE and Gaussian Copula in capturing the complex quadrimodal distribution with asymmetric correlations and heterogeneous treatment effects. Both visualizations and quantitative metrics consistently support this conclusion.

3.2. Multivariate Distribution

3.2.1. Data Splitting and Preprocessing

The observed dataset

X \in R^{n \times 3}

is randomly partitioned into training (70%), validation (15%), and testing (15%) sets. To accommodate sequence-based modeling, each observation is repeated across

T = 2

timesteps, yielding input tensors of shape

(n, T, d)

with

d = 3

features. All features are standardized prior to modeling.

All preprocessing steps, including standardization, data splitting, and random seeding, were synchronized across all models to ensure replicability. Simulations were repeated over 20 independent runs with a fixed random seed to evaluate numerical stability and report variability across replications.

3.2.2. Distributional CNN-LSTM Model

We specify a neural network with input shape

(T, d)

. The backbone consists of a one-dimensional convolutional layer followed by an LSTM layer that captures temporal dependencies. The output head maps the latent representation to nine parameters corresponding to a trivariate Gaussian distribution:

θ = (μ_{1}, μ_{2}, μ_{3}, σ_{1}, σ_{2}, σ_{3}, ρ_{12}, ρ_{13}, ρ_{23}) .

Here,

μ_{j}

are means,

σ_{j} > 0

are standard deviations (enforced via the softplus transform), and

ρ_{j k} \in (- 0.99, 0.99)

are correlations (enforced via the tanh transform).

The negative log-likelihood (NLL) of a trivariate Gaussian is

l (x_{i} ∣ θ) = \frac{3}{2} log (2 π) + \frac{1}{2} log det (Σ) + \frac{1}{2} {(x_{i} - μ)}^{⊤} Σ^{- 1} (x_{i} - μ),

where

μ = {(μ_{1}, μ_{2}, μ_{3})}^{⊤}

and

Σ

is reconstructed from

(σ_{1}, σ_{2}, σ_{3}, ρ_{12}, ρ_{13}, ρ_{23})

.

Model training uses the Adam optimizer with early stopping and adaptive learning rate reduction based on validation loss.

The CNN-LSTM is trained for 200 epochs (batch size 64, learning rate

10^{- 3}

), with early stopping after 20 epochs of non-improvement. Average training time per epoch is 3.1 s on an NVIDIA RTX 4090 GPU. Total convergence time is reported per replication to quantify scalability.

We additionally record validation loss variance and monitor convergence stability across runs, ensuring consistency in NLL reduction and learned covariance structure.

3.2.3. Baselines: KDE and Gaussian Copula

For comparison, two classical density estimation methods are considered:

Kernel Density Estimation (KDE)

A multivariate KDE is fit on the training data and evaluated on the test set to compute log-likelihood scores and coverage probabilities. Coverage at the 90% level is defined as the proportion of test points lying within the highest-density 90% region.

KDE Implementation Details

The KDE is implemented using ks::kde() with adaptive bandwidths. Average runtime per KDE fit is 10.4 s on the same hardware. As KDE coverage is based on Highest Density Regions (HDRs), results are not directly comparable to ellipsoidal coverage from Gaussian-based methods.

Gaussian Copula

The Gaussian copula is fit on pseudo-observations obtained by transforming each marginal to the uniform scale. The fitted copula is combined with Gaussian marginal estimates to obtain the joint density. Coverage is assessed using Mahalanobis distances within the 90% chi-square ellipsoid.

Gaussian Copula Implementation Details

Copula parameters are estimated using maximum pseudo-likelihood via copula::fitCopula(), and marginal densities are smoothed using kernel estimators. Each fit is validated via log-likelihood convergence checks, and near-positive-definite corrections are applied to ensure valid correlation matrices.

3.2.4. Evaluation Metrics for Density Models

We report the following metrics to quantify model performance:

Mean Log-Likelihood (MLL): The average conditional log-density of the test outcomes $y_{i}$ :

$MLL = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} log \hat{f} (y_{i} ∣ x_{i}, w_{i}),$

where $\hat{f} (y ∣ x, w)$ is the estimated conditional density.
90% Coverage: Percentage of test points lying within the estimated 90% probability region.

Coverage for CNN-LSTM and Copula models is defined via ellipsoidal probability mass, whereas KDE uses nonparametric Highest Density Region (HDR) boundaries. Hence, coverage comparisons are treated qualitatively to assess shape fidelity rather than absolute probability calibration.

3.2.5. Simulated Treatment Effects (CATE)

To assess causal inference capabilities, we define the CATE as a scalar function of the covariates

x_{i}

:

τ (x_{i}) = 0.5 x_{i 1} - 0.3 x_{i 2} + 0.2 x_{i 3} .

This CATE is assumed to primarily influence the first outcome dimension,

Y_{i 1}

. The overall outcome vector

Y_{i} \in R^{m}

is generated such that the treatment affects the conditional mean:

Y_{i} = Z_{i} + δ \cdot τ (x_{i}) W_{i} + ε_{i},

where

Z_{i}

is the latent Gaussian mixture component (as defined previously),

ε_{i} \sim N (0, Σ_{ε})

is multivariate noise,

W_{i} \sim Bernoulli (0.5)

is the treatment assignment, and

δ

is an indicator vector (e.g.,

δ = {(1, 0, 0)}^{⊤}

for

m = 3

) that isolates the CATE to the first dimension.

For this setup, the true scalar CATE

τ (x_{i})

remains the primary target for RMSE evaluation.

This synthetic construction ensures identifiable CATE structure and enables quantitative evaluation of effect recovery. The same random seed is used for treatment and noise generation across methods to maintain fairness.

3.2.6. CATE Estimation Approaches

Distributional CNN-LSTM

The predicted conditional mean

μ (x_{i}, w)

serves as the expected multivariate outcome

\hat{E} [Y_{i} ∣ X_{i}, W_{i} = w]

. The scalar CATE is approximated by taking the difference of the first outcome dimension’s expected mean (

μ_{1}

):

{\hat{τ}}_{i} = μ_{1} (x_{i}, W_{i} = 1) - μ_{1} (x_{i}, W_{i} = 0) .

CNN-LSTM CATE Details

To account for epistemic uncertainty, predictive variance estimates from the CNN-LSTM covariance head are propagated through both treatment conditions.

KDE-Based CATE

Separate kernel density estimators (

{\hat{f}}_{w}

) are fit for treated and control groups. Conditional expectations are computed from these densities, and the scalar CATE is estimated as the difference in the expected value of the first outcome dimension:

{\hat{τ}}_{i} = \hat{E} [Y_{i 1} ∣ x_{i}, W_{i} = 1] - \hat{E} [Y_{i 1} ∣ x_{i}, W_{i} = 0] .

KDE CATE Detail

Bandwidths are re-optimized per group to reflect group-specific variability, and HDR-based coverage is reported separately.

Gaussian Copula-Based CATE

Gaussian copulas are fit separately on treated and control groups. Group-wise expectations of the first outcome dimension are compared to produce the CATE estimate:

{\hat{τ}}_{i} = \hat{E} [Y_{i 1} ∣ x_{i}, W_{i} = 1] - \hat{E} [Y_{i 1} ∣ x_{i}, W_{i} = 0] .

Gaussian Copula CATE Details

Dependence parameters are estimated independently per group to capture treatment-induced correlation changes.

3.2.7. Evaluation Metrics

For each method, we compute the standard causal inference metrics:

CATE RMSE = \sqrt{\frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} {({\hat{τ}}_{i} - τ_{i})}^{2}}, CATE Bias = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} ({\hat{τ}}_{i} - τ_{i}) .

The final comparison table reports, for each method, the mean log-likelihood, 90% coverage, CATE RMSE, and CATE Bias. These metrics allow simultaneous evaluation of probabilistic calibration and causal effect recovery.

Computational and Stability Metrics

We additionally report per-method runtime (training + evaluation), memory footprint (peak GPU or RAM usage), and numerical stability (standard deviation across 20 runs). These diagnostics quantify both statistical and computational efficiency.

3.2.8. Three-Variable Mixture Data

We consider synthetic datasets generated from multivariate Gaussian mixtures in three dimensions:

X_{i} = {(X_{i 1}, X_{i 2}, X_{i 3})}^{⊤}

,

i = 1, \dots, n

. Each mixture has component-specific means

μ_{j} \in R^{3}

, covariance matrices

Σ_{j} \in R^{3 \times 3}

, and mixing probabilities

π_{j}

with

\sum_{j = 1}^{K} π_{j} = 1

, where K is the number of components. For each case,

n = 2000

observations are drawn, and pairwise two-dimensional projections

(X_{1}, X_{2})

,

(X_{1}, X_{3})

,

(X_{2}, X_{3})

are visualized with

90 %

probability ellipses per component.

Case 1 ( $K = 2$ ): Two equally weighted components with

$μ_{1} = {(- 4, - 1, 0)}^{⊤}, μ_{2} = {(5, 3, 1)}^{⊤} .$

Covariance structures differ: $Σ_{1}$ has moderate positive correlations, while $Σ_{2}$ includes both positive and negative dependencies.
Case 2 ( $K = 3$ ): Adds a third cluster,

$μ_{3} = {(0, 7, - 2)}^{⊤}, π_{j} \approx 1 / 3,$

with $Σ_{3}$ allowing positive and negative correlations, forming a triangular configuration.
Case 3 ( $K = 4$ ): Introduces

$μ_{4} = {(3, - 5, 2)}^{⊤}, π_{j} = 0.25,$

producing four clusters and greater spread along $X_{2}$ .
Case 4 ( $K = 5$ ): Adds

$μ_{5} = {(- 2, 2, 3)}^{⊤}, π_{j} = 0.2,$

increasing overlap, especially in $(X_{1}, X_{3})$ .
Case 5 ( $K = 6$ ): Adds

$μ_{6} = {(4, - 3, - 1)}^{⊤}, π_{j} = 1 / 6,$

introducing negative $X_{1}$ – $X_{2}$ correlation and mild positive $X_{3}$ association, resulting in high heterogeneity and overlap.

Overall, the progressive increase in K allows evaluation of model performance across distributions with varying cluster separation, overlap, and correlation structure.

Figure 8 shows pairwise scatterplots for Case 5 (

K = 6

), illustrating cluster overlap and heterogeneous correlations. Ellipses indicate

90 %

probability contours of each component. (a)

X_{1}

vs.

X_{2}

. (b)

X_{2}

vs.

X_{3}

. (c)

X_{1}

vs.

X_{3}

.

3.2.9. Model Comparison

Table 1 summarizes the comparative performance of the Distributional CNN-LSTM, 3D KDE, and Gaussian Copula models across mixtures with

K = 2

–6 components. Evaluation metrics include mean log-likelihood (density estimation quality), CATE RMSE (causal estimation accuracy), and CATE bias (systematic deviation).

$K = 2$ : KDE achieves the highest log-likelihood (least negative) and lowest RMSE, demonstrating the strongest overall performance in the simplest case. CNN-LSTM is competitive in both LL and RMSE and exhibits significantly lower bias than the Gaussian Copula.
$K = 3$ : Bias increases substantially across all models, with both CNN-LSTM and Copula bias reaching their highest values across the simulation scenarios. KDE retains the highest log-likelihood and the lowest RMSE.
$K = 4$ : RMSE rises across models, indicating increased difficulty in CATE estimation. The Gaussian Copula achieves the lowest RMSE and the best bias control (closest to zero), while the CNN-LSTM and KDE exhibit a large negative shift in bias.
$K = 5$ : RMSE values converge among all models, suggesting similar predictive accuracy in this highly overlapping case. The CNN-LSTM achieves the lowest CATE RMSE and the best bias control (closest to zero), indicating superior causal calibration.
$K = 6$ : The Gaussian Copula attains the lowest RMSE and the best bias control. KDE’s mean log-likelihood collapses (−27.6310) due to the severe curse of dimensionality in the three-dimensional, highly overlapping space. The CNN-LSTM remains stable in terms of log-likelihood and RMSE, exhibiting moderate negative bias.

3.2.10. CNN-LSTM Ellipsoids and CATE Visualization

The Distributional CNN-LSTM effectively captures both multivariate structure and heterogeneous treatment effects. Pairwise

90 %

ellipsoids for

(X_{1}, X_{2}, X_{3})

, colored by estimated CATE

{\hat{τ}}_{i}

, illustrate the following trends:

K = 2–3: Distinct ellipsoids form per cluster, with smoothly varying CATE gradients, indicating clear separation between treatment effects in the less complex scenarios.
K = 4–5: Overlapping ellipsoids reflect mixed correlations and high-dimensional complexity; CATE variation emerges both within and across clusters, showcasing the model’s ability to handle ambiguous component assignments.
K = 6: Dense and highly overlapping ellipsoids reveal maximal heterogeneity; the CNN-LSTM captures nonlinear dependencies and heterogeneous CATE more flexibly than the KDE or Copula baselines, which exhibited either collapsed density fit (KDE) or rigid correlation structures (Copula).

These results highlight the CNN-LSTM’s key strengths:

1.: Flexible multivariate density modeling: Ellipsoid shapes, sizes, and orientations adapt locally to cluster-specific variances and asymmetric correlations, essential for accurate probabilistic forecasting.
2.: Heterogeneous treatment effect representation: CATE values vary smoothly and non-linearly across the multivariate space, reflecting both local and global heterogeneity, which is critical for robust individual-level causal inference.

4. Real Data Analysis

4.1. Real Data Experiments: Iris Dataset

To validate the proposed methodology, we apply the models to the Iris dataset [33], a canonical multivariate dataset, while noting that the framework generalizes to other real-world multivariate data.

The Iris dataset consists of

n = 150

samples from three species (Setosa, Versicolor, and Virginica) with four numerical attributes: sepal length, sepal width, petal length, and petal width. Each observation is represented as

Y_{i} = {(Y_{i 1}, Y_{i 2}, Y_{i 3}, Y_{i 4})}^{⊤} \in R^{4} .

Figure 9 is the visualization of the Iris dataset. Features are projected into two dimensions to illustrate class separation and distributional structure across species.

All features were standardized to zero mean and unit variance. The dataset was randomly split into training (70%) and test (30%) sets. Three modeling strategies were applied:

1.: Distributional CNN-LSTM: a deep learning model capturing complex dependencies. Outputs predict distribution parameters for each $Y_{i}$ , trained via negative log-likelihood using the Adam optimizer.
2.: Kernel Density Estimation (KDE): a multivariate Gaussian kernel estimator [1], providing test densities $\hat{f} (Y_{i})$ .
3.: Gaussian Copula models marginal distributions $F_{j} (Y_{i j})$ with dependence via a Gaussian copula $C_{ρ}$ [6]:

$f (Y_{i}) = c_{ρ} (u_{i 1}, \dots, u_{i 4}) \prod_{j = 1}^{4} f_{j} (Y_{i j}), u_{i j} = F_{j} (Y_{i j}) .$

Predictive performance was evaluated using the following metrics on the test set (

n_{test}

):

90% Confidence Coverage: proportion of test points contained within the predicted 90% probability region. For CNN-LSTM and Copula, this region is the confidence ellipsoid derived from the predictive covariance $Σ_{i}$ .
Mean Log-Likelihood (MLL): average log-likelihood of test observations, measuring the generative fidelity of the model’s estimated density:

$MLL = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} log \hat{f} (Y_{i}) .$
CATE Metrics: for known or oracle treatment effects $τ_{i}$ , the accuracy and systematic error of estimated ${\hat{τ}}_{i}$ are quantified:

$CATE_RMSE = \sqrt{\frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} {({\hat{τ}}_{i} - τ_{i})}^{2}}, CATE_Bias = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} ({\hat{τ}}_{i} - τ_{i}) .$

Figure 10 presents the key results:

90% Coverage: Both KDE and Gaussian Copula achieve nearly full coverage ( $\sim 0.956$ ), reflecting superior uncertainty calibration compared to the nominal $90 %$ level, possibly due to over-smoothing (KDE) or rigid dependence structure (Copula). CNN-LSTM slightly undercovers ( $0.889$ ), indicating its predicted variance is less conservative.
Log-Likelihood: KDE and Copula attain the highest (least negative) log-likelihoods, indicating a superior fit of their estimated densities to the test data. CNN-LSTM has the lowest log-likelihood ( $- 119.005$ ), suggesting a trade-off between generative fidelity and conditional modeling flexibility.
CATE_RMSE: CNN-LSTM attains the lowest RMSE ( $0.065$ ), substantially outperforming KDE ( $0.915$ ) and Copula ( $0.916$ ). This demonstrates the CNN-LSTM’s superior ability to model the complex, conditional dependencies required for accurate individual-level CATE estimation, even when its overall density fit (MLL) is lower than the baselines.

As summarized in Table 2 and Figure 10, the Distributional CNN-LSTM excels at estimating individual-level treatment effects, achieving the lowest

CATE_RMSE

(

0.065

) by a substantial margin. In contrast, KDE and Gaussian Copula provide overly conservative predictive intervals (high coverage,

\sim 0.956

) and higher mean log-likelihood (MLL), but fail to capture the fine-grained conditional heterogeneity necessary for accurate causal inference (

CATE_RMSE \approx 0.915

). These results highlight the CNN-LSTM’s ability to model complex, nonlinear dependencies in real-world multivariate data, making it particularly advantageous when accurate

CATE

estimation is critical, even when its generative fit (MLL) is slightly lower than the static baselines.

4.2. Real Data Experiments: Criteo Uplift Dataset

To evaluate the practical performance of our proposed CNN-LSTM model and compare it with classical density estimation methods, we applied the models to the Criteo Uplift Dataset [34]. We focused on two continuous numeric features (e.g.,

f 0

and

f 1

) and standardized them for all models. The dataset was randomly split into 70% training and 30% testing samples, yielding 3500 training points and 1500 test points.

Table 3 summarizes the comparison across three models: CNN-LSTM, Kernel Density Estimation (KDE), and Gaussian Copula. Metrics include the empirical coverage of 90% confidence regions (Coverage), log-likelihood on the test set (LogLik), and the root mean squared error for conditional average treatment effect estimation (CATE_RMSE).

As shown, all three models achieve nearly identical coverage due to standardization and scaling. CNN-LSTM achieves substantially lower CATE_RMSE, indicating superior predictive accuracy for continuous outcomes. KDE suffers from oversmoothing and sensitivity to bandwidth, while the Gaussian Copula captures correlation but not marginal distribution details.

The CNN-LSTM model converged reliably over 50 epochs, with an average epoch runtime of 0.33 s and total convergence time of 82.3 s on a single CPU. Stability across repeated runs (n = 5) showed a mean RMSE of 0.0416 and standard deviation of 0.0222, demonstrating low variance and stable convergence. By contrast, KDE and Gaussian Copula methods are essentially instantaneous for this small feature space, but their predictive accuracy is lower and may degrade with higher-dimensional inputs.

Figure 11 presents a bar plot comparison of the three models for Coverage, LogLik, and CATE_RMSE. Each facet displays one metric, with bars colored according to model. The CNN-LSTM consistently outperforms both classical methods in RMSE while maintaining comparable coverage.

The CNN-LSTM model demonstrates strong predictive performance and stable convergence for the two-feature case. It provides a good balance between capturing complex dependencies and maintaining computational efficiency. KDE and Gaussian Copula are much faster but exhibit lower predictive accuracy. These results justify the use of deep sequential models for robust multivariate outcome estimation in contextual uplift tasks.

5. Conclusions and Future Work

We introduced a distributional CNN-LSTM framework for probabilistic multivariate modeling and individualized treatment effect estimation. The model captures complex dependencies across multiple outcomes while estimating heterogeneous CATE values, enabling counterfactual inference in multivariate settings. Simulation studies with Gaussian mixture data demonstrate that the CNN-LSTM provides robust density estimation and accurate CATE recovery, particularly in higher-complexity mixtures. While classical methods such as KDE and Gaussian Copulas can achieve competitive log-likelihood or coverage in low-complexity scenarios (e.g., K = 2–3), the CNN-LSTM consistently outperforms in recovering conditional effects across all complexity levels. On benchmark datasets such as Iris, and critically on the Criteo Uplift dataset, the CNN-LSTM attains the lowest CATE RMSE, confirming its practical utility for individualized prediction. KDE and Gaussian Copula, while computationally efficient, exhibit higher CATE RMSE, reflecting a trade-off between global fit and individualized effect accuracy. The Criteo Uplift experiments additionally highlight the model’s computational properties: the CNN-LSTM converges reliably within moderate training time (approximately 0.33 s per epoch and 82 s total for 50 epochs) and exhibits low variance across repeated runs. These results suggest scalability to moderate-sized real-world datasets while maintaining stable performance. Visualizations of 90% probability ellipsoids and CATE gradients reveal cluster-specific correlations and smoothly varying treatment heterogeneity, providing interpretable insights into learned dependence structures. This interpretability, combined with superior individualized prediction, makes the CNN-LSTM framework particularly suitable for practical applications in personalized medicine, economic policy analysis, targeted marketing, and environmental risk assessment. Future work may focus on improving interpretability through feature attribution and attention mechanisms, scaling the approach to high-dimensional datasets and multivariate time series, extending to time-varying treatments and dynamic counterfactual modeling, and exploring hybrid architectures combining deep learning with copula-based or probabilistic graphical models. Overall, the CNN-LSTM provides a flexible and practically relevant tool that balances predictive accuracy, probabilistic calibration, and individualized effect estimation in complex multivariate scenarios.

Funding

This research received no external funding.

Institutional Review Board Statement

This article does not contain any studies with human participants or animals performed by the author.

Data Availability Statement

The original data presented in the study are openly available in [GitHub repository], accessed on 17 October 2025 at https://github.com/kjonomi/Rcode/blob/main/distributional-CNN-LSTM.

Conflicts of Interest

The author has no relevant financial or non-financial interests to disclose.

References

Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman and Hall: London, UK, 1986. [Google Scholar]
Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization, 2nd ed.; Wiley: New York, NY, USA, 2015. [Google Scholar]
Friedman, J.H.; Stuetzle, W.; Schroeder, A. Projection Pursuit Density Estimation. J. Am. Stat. Assoc. 1984, 79, 599–608. [Google Scholar] [CrossRef]
Li, Q.; Racine, J.S. Nonparametric Econometrics: Theory and Practice; Princeton University Press: Princeton, NJ, USA, 2007. [Google Scholar]
Genest, C.; Favre, A.C. Everything you always wanted to know about copula modeling but were afraid to ask. J. Hydrol. Eng. 2009, 14, 465–476. [Google Scholar] [CrossRef]
Nelsen, R.B. An Introduction to Copulas, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
Joe, H. Dependence Modeling with Copulas; Chapman & Hall/CRC: London, UK, 2014. [Google Scholar]
Bishop, C.M. Mixture Density Networks; Technical Report NCRG/94/004; Neural Computing Research Group: Singapore, 1994. [Google Scholar]
Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, Proceedings of the Conference and Workshop on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NeurIPS Foundation: La Jolla, CA, USA, 2017; Volume 31, pp. 6405–6416. [Google Scholar]
Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
Rezende, D.J.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1530–1538. [Google Scholar]
Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using Real NVP. arXiv 2017, arXiv:1605.08803. [Google Scholar] [CrossRef]
Tagasovska, N.; Ackerer, D.; Vatter, T. Copulas as high-dimensional generative models: Vine copula autoencoders. In Advances in Neural Information Processing Systems, Proceedings of the Conference and Workshop on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; NeurIPS Foundation: La Jolla, CA, USA, 2019; Volume 32, pp. 6525–6537. [Google Scholar]
Girard, S.; Gobet, E.; Pachebat, J. Deep Generative Modeling of Multivariate Dependent Extremes. 2024. Available online: https://inria.hal.science/hal-04700084v2/document (accessed on 19 March 2025).
Kim, J.-M. Integrating copula-based random forest and deep learning approaches for analyzing heterogeneous treatment effects in survival analysis. Mathematics 2025, 13, 1659. [Google Scholar] [CrossRef]
Kim, J.-M. Treatment effect estimation in survival analysis using deep learning-based causal inference. Axioms 2025, 14, 458. [Google Scholar] [CrossRef]
Kim, J.-M. A copula-driven CNN-LSTM framework for estimating heterogeneous treatment effects in multivariate outcomes. Mathematics 2025, 13, 2384. [Google Scholar] [CrossRef]
Kim, J.-M. Multi-task CNN-LSTM modeling of zero-inflated count and time-to-event outcomes for causal inference with functional representation of features. Axioms 2025, 14, 626. [Google Scholar] [CrossRef]
Kim, G. A copula-based deep graphical causal model for multivariate conditional treatment effect estimation. Meas. Interdiscip. Res. Perspect. 2025, in press. [CrossRef]
Kim, J.-M.; Ha, I.D.; Kim, S. Deep learning-based survival analysis with copula-based activation functions for multivariate response prediction. Comput. Stat. 2025, in press. [CrossRef]
Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076. [Google Scholar] [CrossRef]
Wasserman, L. All of Nonparametric Statistics; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris 1959, 8, 229–231. [Google Scholar]
Hofert, M.; Kojadinovic, I.; Mächler, M.; Yan, J. Elements of Copula Modeling with R; Springer Series in Statistics; Springer: New York, NY, USA, 2018. [Google Scholar]
Demarta, S.; McNeil, A.J. The t copula and related copulas. Int. Stat. Rev. 2005, 73, 111–129. [Google Scholar] [CrossRef]
Patton, A.J. Modelling asymmetric exchange rate dependence. Int. Econ. Rev. 2006, 47, 527–556. [Google Scholar] [CrossRef]
Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
Hill, J.L. Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Stat. 2011, 20, 217–240. [Google Scholar] [CrossRef]
Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [Google Scholar] [CrossRef]
Mardia, K.V.; Kent, J.T.; Bibby, J.M. Multivariate Analysis; Academic Press: Cambridge, MA, USA, 1979. [Google Scholar]
Rubin, D.B. Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 1974, 66, 688–701. [Google Scholar] [CrossRef]
Imbens, W.G.; Rubin, B.D. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction; Cambridge University Press: Cambridge, MA, USA, 2015. [Google Scholar]
Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
Criteo Research. Criteo Uplift Modeling Dataset, version 2.1; Criteo Research: Paris, France, 2021; Available online: https://huggingface.co/datasets/criteo/criteo-uplift/blob/main/criteo-research-uplift-v2.1.csv.gz (accessed on 15 October 2025).

Figure 1. Schematic of the bivariate simulation study. Each scenario involves mixtures of bivariate Gaussians with overlapping modes and asymmetric covariance structures.

Figure 2. Representative realizations of the binodal distribution, illustrating multimodality, heterogeneous correlations, and measurement error.

Figure 3. Performance of three models in capturing the binodal distribution.

Figure 4. Representative realizations of the trimodal distribution, illustrating multimodality, heterogeneous correlations, and measurement error.

Figure 5. Performance of three models in capturing the trimodal distribution.

Figure 6. Representative realizations of the quadrimodal distribution, illustrating multimodality, heterogeneous correlations, and measurement error.

Figure 7. Comparison of three models in capturing the quadrimodal distribution of

y_{i 1}

and

y_{i 2}

.

Figure 7. Comparison of three models in capturing the quadrimodal distribution of

y_{i 1}

and

y_{i 2}

.

Figure 8. Pairwise scatterplots for Case 5 (

K = 6

), illustrating cluster overlap and heterogeneous correlations. Ellipses indicate

90 %

probability contours of each component. (a)

X_{1}

vs.

X_{2}

. (b)

X_{2}

vs.

X_{3}

. (c)

X_{1}

vs.

X_{3}

.

Figure 8. Pairwise scatterplots for Case 5 (

K = 6

), illustrating cluster overlap and heterogeneous correlations. Ellipses indicate

90 %

probability contours of each component. (a)

X_{1}

vs.

X_{2}

. (b)

X_{2}

vs.

X_{3}

. (c)

X_{1}

vs.

X_{3}

.

Figure 9. Visualization of the Iris dataset. Features are projected into two dimensions to illustrate class separation and distributional structure across species.

Figure 10. Performance comparison of CNN-LSTM, KDE, and Gaussian Copula on the Iris dataset. Metrics include mean log-likelihood, 90% coverage, and CATE RMSE.

Figure 11. Comparison of CNN-LSTM, KDE, and Gaussian Copula on the Criteo Uplift Dataset. Metrics include 90% coverage of confidence regions (Coverage), log-likelihood on the test set (LogLik), and CATE root mean squared error (CATE_RMSE). Two continuous features were used, and bars are colored by model.

Table 1. Model performance across three-variable mixtures (

K = 2

–6).

Table 1. Model performance across three-variable mixtures (

K = 2

–6).

Components	Model	Mean Log-Likelihood	CATE RMSE	CATE Bias
2	CNN-LSTM	−7.0081	2.0313	0.2140
2	KDE	−6.0352	2.0229	0.1139
2	Copula	−6.6024	2.1770	0.8119
3	CNN-LSTM	−7.4576	2.2828	0.8241
3	KDE	−6.5099	2.2486	0.7241
3	Copula	−7.5509	2.3097	0.8960
4	CNN-LSTM	−7.7268	2.6603	−0.2822
4	KDE	−7.0799	2.6727	−0.3822
4	Copula	−7.8317	2.6489	−0.1390
5	CNN-LSTM	−7.6992	2.3897	−0.0638
5	KDE	−7.0063	2.3944	−0.1639
5	Copula	−7.7034	2.3946	0.1664
6	CNN-LSTM	−7.7029	2.3765	−0.5122
6	KDE	−27.6310	2.4000	−0.6122
6	Copula	−7.6682	2.3247	−0.1371

Table 2. Model comparison on the Iris dataset. Metrics include 90% coverage, mean log-likelihood, and CATE RMSE.

Model	Coverage	LogLik	CATE_RMSE
CNN-LSTM	0.889	−119.005	0.065
KDE	0.956	−128.074	0.915
Gaussian Copula	0.956	0.149	0.916

Table 3. Model comparison on Criteo uplift dataset (two continuous features).

Model	Coverage	LogLik	CATE_RMSE
CNN-LSTM	0.986	−4090.264	0.034
KDE	0.986	−68,488.785	1.008
Gaussian Copula	0.986	10.980	0.951

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kim, J.-M. Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects. Analytics 2025, 4, 29. https://doi.org/10.3390/analytics4040029

AMA Style

Kim J-M. Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects. Analytics. 2025; 4(4):29. https://doi.org/10.3390/analytics4040029

Chicago/Turabian Style

Kim, Jong-Min. 2025. "Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects" Analytics 4, no. 4: 29. https://doi.org/10.3390/analytics4040029

APA Style

Kim, J.-M. (2025). Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects. Analytics, 4(4), 29. https://doi.org/10.3390/analytics4040029

Article Menu

Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects

Abstract

1. Introduction

2. Methods

2.1. Kernel Density Estimation (KDE) Baseline

2.2. Gaussian Copula

2.3. Coverage and Evaluation Metrics

2.4. Conditional Average Treatment Effect (CATE) Estimation

2.5. Evaluation Details and Reproducibility

2.6. Comparison Metrics

3. Simulation Study

3.1. Bivariate Distribution

3.1.1. General Setup

3.1.2. Binodal Distribution

3.1.3. Trimodal Distribution

3.1.4. Quadrimodal Distribution

3.2. Multivariate Distribution

3.2.1. Data Splitting and Preprocessing

3.2.2. Distributional CNN-LSTM Model

3.2.3. Baselines: KDE and Gaussian Copula

Kernel Density Estimation (KDE)

KDE Implementation Details

Gaussian Copula

Gaussian Copula Implementation Details

3.2.4. Evaluation Metrics for Density Models

3.2.5. Simulated Treatment Effects (CATE)

3.2.6. CATE Estimation Approaches

Distributional CNN-LSTM

CNN-LSTM CATE Details

KDE-Based CATE

KDE CATE Detail

Gaussian Copula-Based CATE

Gaussian Copula CATE Details

3.2.7. Evaluation Metrics

Computational and Stability Metrics

3.2.8. Three-Variable Mixture Data

3.2.9. Model Comparison

3.2.10. CNN-LSTM Ellipsoids and CATE Visualization

4. Real Data Analysis

4.1. Real Data Experiments: Iris Dataset

4.2. Real Data Experiments: Criteo Uplift Dataset

5. Conclusions and Future Work

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI