Multi-Task CNN-LSTM Modeling of Zero-Inflated Count and Time-to-Event Outcomes for Causal Inference with Functional Representation of Features

Kim, Jong-Min

doi:10.3390/axioms14080626

Open AccessFeature PaperArticle

Multi-Task CNN-LSTM Modeling of Zero-Inflated Count and Time-to-Event Outcomes for Causal Inference with Functional Representation of Features

by

Jong-Min Kim

^1,2

¹

Statistics Discipline, Division of Science and Mathematics, University of Minnesota-Morris, Morris, MN 56267, USA

²

EGADE Business School, Tecnológico de Monterrey, Ave. Rufino Tamayo, Monterrey 66269, Mexico

Axioms 2025, 14(8), 626; https://doi.org/10.3390/axioms14080626

Submission received: 14 July 2025 / Revised: 7 August 2025 / Accepted: 9 August 2025 / Published: 11 August 2025

(This article belongs to the Special Issue Functional Data Analysis and Its Application)

Download

Browse Figure

Versions Notes

Abstract

We propose a novel deep learning framework for counterfactual inference on the COMPAS dataset, utilizing a multi-task CNN-LSTM architecture. The model jointly predicts multiple outcome types: (i) count outcomes with zero inflation, modeled using zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), and negative binomial (NB) distributions; (ii) time-to-event outcomes, modeled via the Cox proportional hazards model. To effectively leverage the structure in high-dimensional tabular data, we integrate functional data analysis (FDA) techniques by transforming covariates into smooth functional representations using B-spline basis expansions. Specifically, we construct a pseudo-temporal index over predictor variables and fit basis expansions to each subject’s feature vector, yielding a low-dimensional set of coefficients that preserve smooth variation while reducing noise. This functional representation enables the CNN-LSTM model to capture both local and global temporal patterns in the data, including treatment-covariate interactions. Our approach estimates both population-average and individual-level treatment effects (ATE and CATE) for each outcome and evaluates predictive performance using metrics such as Poisson deviance, root mean squared error (RMSE), and the concordance index (C-index). Statistical inference on treatment effects is supported via bootstrap-based confidence intervals and hypothesis testing. Overall, this comprehensive framework facilitates flexible modeling of heterogeneous treatment effects in structured, high-dimensional data, advancing causal inference methodologies in criminal justice and related domains.

Keywords:

counterfactual inference; deep learning; CNN-LSTM; zero-inflated Poisson; zero-inflated negative binomial; negative binomial; survival analysis; Cox proportional hazards model; functional data analysis; treatment effect estimation

MSC:

62R10; 62J12; 62N02

1. Introduction

The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset has become a widely studied benchmark in criminal justice research, particularly for evaluating fairness and predictive accuracy in recidivism risk assessments [1]. While much of the existing literature focuses on classification performance and algorithmic bias, there is growing demand for methodologies that estimate the causal effects of incarceration-related interventions—such as jail time, diversion programs, or probation—on future criminal behavior. Estimating such causal impacts, including counterfactual outcomes and heterogeneous treatment effects (HTEs), is critical for designing evidence-based policies that reduce recidivism, improve fairness, and allocate resources efficiently.

Traditional causal inference methods—such as inverse probability weighting (IPW), regression adjustment, and propensity score matching—have been instrumental in estimating average treatment effects (ATE). However, these approaches often rely on strong parametric assumptions and struggle in the presence of high-dimensional features, nonlinear confounding structures, and multivariate outcomes common in criminal justice data. Moreover, real-world outcomes like prior offenses, jail duration, and time-to-recidivism frequently exhibit zero inflation and censoring, further complicating inference under conventional frameworks.

Recent advances in deep learning offer powerful tools for modeling complex, nonlinear, and high-dimensional relationships, particularly in the presence of structured dependencies across multiple outcomes. Architectures such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks have demonstrated strong performance in capturing spatial and temporal dynamics, making them well-suited for estimating conditional average treatment effects (CATEs) in domains that require personalized decision-making, such as healthcare [2] and time-series forecasting [3]. Building on this foundation, recent work has integrated deep learning with functional data analysis (FDA) [4], zero-inflated modeling [5], and survival analysis frameworks [6,7] to better address heterogeneity in outcome distributions and treatment response. These developments are particularly relevant for applications like criminal justice, where capturing individualized treatment heterogeneity across multiple, structurally diverse outcomes is critical [8,9,10,11,12].

In the criminal justice context, understanding heterogeneity in treatment response is critical. For instance, identifying subgroups who benefit from diversion instead of incarceration could help reduce recidivism and alleviate prison overcrowding. Conversely, flagging individuals for whom treatment is ineffective or counterproductive can motivate tailored interventions or increased supervision. Our modeling framework provides individual-level CATE estimates that can directly inform such policy decisions, offering a more nuanced and data-driven approach than aggregate ATE metrics alone.

While deep learning offers modeling power, its perceived lack of interpretability remains a barrier to adoption in high-stakes domains like criminal justice. To address this, we incorporate SHAP (SHapley Additive exPlanations) values to compute covariate-level contributions to predicted treatment effects. We also include individual-level case studies that illustrate how treatment effects vary across population subgroups. These tools aim to make our framework transparent, interpretable, and actionable for policymakers and stakeholders.

To contextualize the performance of our model, we compare it against a suite of standard and state-of-the-art causal inference methods, including IPW, regression adjustment, TARNet [8], and Bayesian Additive Regression Trees (BART) [13]. These comparisons are critical to demonstrating the robustness, flexibility, and accuracy of our proposed approach, particularly in handling multivariate, zero-inflated, and censored outcomes.

In this paper, we introduce a novel multi-task CNN-LSTM architecture tailored for causal inference on the COMPAS dataset. Our framework jointly models multiple outcomes: zero-inflated counts (e.g., prior offenses, jail days) using ZIP-based loss functions and censored time-to-recidivism using Cox partial likelihood. To integrate structured tabular features and treatment indicators, we apply FDA by transforming covariates into smooth functional representations via B-spline basis expansions. This facilitates the modeling of smooth temporal patterns and latent structure in covariates.

Our contributions are threefold:

We propose a unified, deep learning-based framework for estimating CATEs across heterogeneous outcomes, including zero-inflated and censored data structures, relevant for criminal justice applications.
We enhance interpretability by integrating SHAP-based variable attribution and case-level analysis to improve transparency and stakeholder trust.
We perform a comprehensive empirical comparison against classical and deep causal inference methods using real-world COMPAS data, demonstrating improved performance and actionable insights for policy.

While previous work on the COMPAS dataset has focused on predictive modeling or fairness evaluation, our contribution advances the field by offering an interpretable, personalized, and policy-relevant causal inference framework. To our knowledge, this is the first application of a deep functional multi-task model to jointly estimate causal effects across zero-inflated count and censored time-to-event outcomes within the COMPAS context. By bridging deep causal learning with practical needs in criminal justice reform, our work lays the foundation for individualized, data-driven policy decisions that balance efficacy, fairness, and transparency.

2. Methods and Model Description

2.1. Functional Representation of Features

Let

X \in R^{n \times p}

denote the matrix of observed features for n individuals, where each row corresponds to a subject and each column to one of the p predictors. The predictors may include both continuous and categorical variables. To prepare the data for functional modeling, we first apply preprocessing steps: (1) categorical variables are transformed using one-hot (dummy) encoding, and (2) all predictors are standardized to have zero mean and unit variance.

To capture smooth latent structures and reduce dimensionality, we interpret the p predictors for each individual as discrete observations of a smooth underlying function. Specifically, we define a time grid

T = {t_{j}}_{j = 1}^{p}, t_{j} \in [0, 1],

that maps each predictor index j to a location on a normalized interval. This allows us to treat the vector of covariates for subject i, denoted

X_{i} = {(X_{i 1}, \dots, X_{i p})}^{⊤}

, as a discretized realization of a smooth function

x_{i} (t)

defined over

t \in [0, 1]

, where

X_{i j} \approx x_{i} (t_{j})

.

To estimate this smooth function, we employ a basis expansion using B-spline basis functions [4]. Let

{ϕ_{k} (t)}_{k = 1}^{K}

denote a set of K B-spline basis functions. Then, each function

x_{i} (t)

is approximated as

x_{i} (t) \approx \sum_{k = 1}^{K} c_{i k} ϕ_{k} (t),

where

c_{i k}

are subject-specific basis coefficients.

The coefficients

c_{i} = {(c_{i 1}, \dots, c_{i K})}^{⊤}

are estimated by solving a penalized least squares problem:

min_{c_{i}} \sum_{j = 1}^{p} {(X_{i j} - \sum_{k = 1}^{K} c_{i k} ϕ_{k} (t_{j}))}^{2} + λ \int {(D^{2} x_{i} (t))}^{2} d t,

where

D^{2} x_{i} (t)

denotes the second derivative of the approximated function, and the penalty term

\int {(D^{2} x_{i} (t))}^{2} d t

controls the roughness of the estimate. The smoothing parameter

λ > 0

balances fidelity to the data with the smoothness of the functional approximation [14].

Solving this optimization for each subject yields a coefficient matrix

C \in R^{n \times K}

, where each row corresponds to the smoothed functional representation of an individual’s predictors. This matrix

C

is used as input to subsequent modeling steps, such as functional principal component analysis (FPCA), copula modeling, or deep learning architectures like CNN-LSTM.

2.2. CNN-LSTM Model Architecture

We employ a hybrid CNN-LSTM architecture to model both nonlinear and temporal relationships in the smoothed functional covariates and treatment effects. The model is trained on B-spline basis coefficients derived from the original covariate functions.

Let the input tensor be denoted as

X \in R^{n \times T \times C},

where n is the number of observations,

T = K

is the number of B-spline basis coefficients (timesteps), and

C = 2

denotes the number of input channels. The two channels correspond to the smoothed functional covariates and a binary treatment indicator, which is repeated across all time steps.

The CNN-LSTM architecture transforms the input

X

through the following sequence of operations [9,15]:

1D Convolution: A one-dimensional convolutional layer with $F = 16$ filters and a kernel size of 5 is applied to extract local temporal patterns. The activation function is ReLU:

$H_{1} = ReLU (Conv 1 D (X)) .$
Max Pooling: A temporal max-pooling operation with pool size 2 reduces the temporal dimensionality and retains salient features:

$H_{2} = MaxPool 1 D (H_{1}) .$
LSTM Encoding: A unidirectional LSTM layer with 32 units models long-range dependencies across the basis functions:

$h = LSTM (H_{2}),$

where $h \in R^{32}$ denotes the final hidden state in the temporal sequence [16].
Dropout Regularization: Dropout with a rate of $p = 0.3$ is applied to prevent overfitting:

$h_{drop} = Dropout (h) .$
Fully Connected Layer: A dense layer with 32 units and ReLU activation transforms the sequence encoding into a shared latent representation:

$z = ReLU (W h_{drop} + b),$

where $W \in R^{32 \times 32}$ , $b \in R^{32}$ , and $z \in R^{32}$ .

From the shared representation

z_{i}

, we derive three outcome-specific prediction heads to model heterogeneous treatment effects across multiple outcomes:

\{\begin{matrix} {\hat{μ}}_{1, i} = exp (w_{1}^{⊤} z_{i} + b_{1}), & Count outcome 1 (e . g ., number of priors) \\ {\hat{μ}}_{2, i} = exp (w_{2}^{⊤} z_{i} + b_{2}), & Count outcome 2 (e . g ., days in jail) \\ {\hat{η}}_{i} = w_{3}^{⊤} z_{i} + b_{3}, & Survival risk score (e . g ., time to recidivism) \end{matrix}

where

w_{j} \in R^{32}

and

b_{j} \in R

for

j = 1, 2, 3

, and the subscript i indexes individual observations.

The exponential link ensures positivity for the count outcome means

{\hat{μ}}_{1, i}

and

{\hat{μ}}_{2, i}

, which can be modeled using Poisson or negative binomial (NB) distributions. The linear risk score

{\hat{η}}_{i}

is used in survival models such as the Cox proportional hazards model or the DeepSurv framework.

This multi-task design enables the model to share information across outcome types while learning outcome-specific treatment effects. Regularization, nonlinearity, and temporal encoding are incorporated to capture complex dependencies between the functional inputs and outcomes.

While CNN-LSTM models offer strong performance in capturing spatio-temporal dynamics, they may underperform in settings with sparse or weakly structured data, where temporal or spatial dependencies are minimal or inconsistent. Additionally, when treatment effect heterogeneity is governed by complex interactions that do not align well with local convolutional filters or sequential memory structures, the CNN-LSTM may fail to capture these patterns effectively. These limitations highlight the importance of model selection based on data characteristics and suggest that hybrid or ensemble approaches may be more appropriate in such cases.

2.3. Count Outcome Models and Loss Functions

To flexibly model count-valued outcomes that exhibit overdispersion and zero-inflation, we consider four candidate distributions: Poisson, NB, ZIP, and ZINB. These models allow for different levels of dispersion and excess zeros, which are common in justice-related datasets such as COMPAS [17]. In all models, the parameters

λ

and

μ

represent the expected count, though their distributional assumptions differ with respect to dispersion.

The Poisson distribution assumes that the count variable Y has mean

λ > 0

and equidispersion (i.e., variance equals the mean). The probability mass function (PMF) is given by

P (Y = y) = \frac{λ^{y} e^{- λ}}{y!}, y \in {0, 1, 2, \dots} .

The ZIP model introduces a latent zero-inflation mechanism [5]. With probability

π

, the count is deterministically zero, and with probability

1 - π

, the count follows a Poisson distribution with mean

λ

:

P (Y = 0) = π + (1 - π) e^{- λ}, P (Y = y) = (1 - π) \frac{λ^{y} e^{- λ}}{y!}, y > 0 .

The negative binomial (NB) model extends the Poisson distribution to account for overdispersion. It can be derived as a Poisson–Gamma mixture and is parameterized by a mean

μ > 0

and a dispersion parameter

θ > 0

[18]. The probability mass function (PMF) is given by

P (Y = y) = \frac{Γ (y + θ)}{Γ (θ) y!} {(\frac{μ}{μ + θ})}^{y} {(\frac{θ}{μ + θ})}^{θ}, y \in {0, 1, 2, \dots} .

Here, y denotes the number of events (e.g., counts or successes),

μ

is the expected value of the distribution, and

θ

is the dispersion parameter. As

θ \to \infty

, the NB distribution converges to the Poisson distribution with mean

μ

, reflecting diminishing overdispersion.

The term

\frac{Γ (y + θ)}{Γ (θ) y!}

is a generalized binomial coefficient, which can also be expressed as

(\binom{y + θ - 1}{y}),

extending the classical binomial coefficient to cases where

θ

is not necessarily an integer.

The ZINB model further extends the ZIP by allowing the non-zero counts to follow an NB distribution, thus capturing both overdispersion and excess zeros [19,20]:

\begin{matrix} P (Y = 0) & = π + (1 - π) {(\frac{θ}{μ + θ})}^{θ}, \\ P (Y = y) & = (1 - π) \cdot \frac{Γ (y + θ)}{Γ (θ) y!} {(\frac{μ}{μ + θ})}^{y} {(\frac{θ}{μ + θ})}^{θ}, y > 0 . \end{matrix}

For each model, we define the count outcome loss as the negative log-likelihood averaged across observations:

L_{count} = - \frac{1}{n} \sum_{i = 1}^{n} log P (Y_{i} ∣ {\hat{θ}}_{i}),

where

{\hat{θ}}_{i}

denotes model-specific outputs from the CNN-LSTM for subject i, such as

{\hat{μ}}_{i}

,

{\hat{λ}}_{i}

, or

{\hat{π}}_{i}

, depending on the task.

2.4. Survival Outcome Model and Loss Function

For time-to-event outcomes, we model the log-risk score

η_{i}

using a neural network and adopt the Cox proportional hazards model framework [6]. The model is trained by minimizing the negative Cox partial log-likelihood:

L_{Cox} = - \sum_{i = 1}^{n} δ_{i} (η_{i} - log \sum_{j \in R (T_{i})} e^{η_{j}}),

where

$T_{i}$ is the observed (possibly censored) event time for subject i,
$δ_{i} \in {0, 1}$ is the event indicator (1 if the event is observed, 0 if censored),
$R (T_{i})$ denotes the risk set, i.e., the set of individuals still at risk just before time $T_{i}$ .

The Cox loss is differentiable with respect to

η_{i}

and is commonly used in deep survival models such as DeepSurv.

2.5. Joint Training Objective and Counterfactual Estimation

The model is trained to jointly minimize the total loss across all tasks. We first define the individual-level loss contribution for observation i as

L_{i} = ℓ_{count} (y_{i}, {\hat{y}}_{i}) + ℓ_{Cox} (T_{i}, δ_{i}, {\hat{η}}_{i}),

where

ℓ_{count}

denotes the loss for the count outcome (e.g., negative log-likelihood under a Poisson or negative binomial model), and

ℓ_{Cox}

is the partial log-likelihood loss for the Cox proportional hazards model, defined at the individual level.

The total objective function is then the average of the individual-level contributions:

L = \frac{1}{n} \sum_{i = 1}^{n} L_{i} = \frac{1}{n} \sum_{i = 1}^{n} [ℓ_{count} (y_{i}, {\hat{y}}_{i}) + ℓ_{Cox} (T_{i}, δ_{i}, {\hat{η}}_{i})] .

The treatment indicator is included as a time-invariant input channel in the CNN-LSTM model, allowing for counterfactual estimation by altering its value. After training, counterfactual predictions are generated under both treatment conditions (

t = 1

and

t = 0

), enabling the estimation of individual and average treatment effects.

For a given outcome (e.g., count or survival), the conditional average treatment effect (CATE) for subject i is defined as

{CATE}_{i} = {\hat{y}}_{i}^{(t = 1)} - {\hat{y}}_{i}^{(t = 0)},

and the population-level average treatment effect (ATE) is computed as

ATE = \frac{1}{n} \sum_{i = 1}^{n} {CATE}_{i} .

3. Model Evaluation

After training the multi-task CNN-LSTM model on the observed data, we evaluate its performance in estimating treatment effects and predictive accuracy across the three outcomes: priors count, days in jail, and survival risk.

3.1. Counterfactual Predictions and Treatment Effect Estimation

Let

X_{input} \in R^{n \times p \times m}

denote the tensor of covariates for n individuals, where p is the number of preprocessed features and m the number of functional basis evaluations (e.g., time points or basis expansion locations). Let

T_{i} \in {0, 1}

denote the binary treatment assignment for subject i, and let

T = (T_{1}, \dots, T_{n})

denote the treatment vector.

To estimate counterfactual outcomes, we generate two feature tensors for each individual:

\begin{matrix} X^{(1)} & = X_{input} with treatment set to T_{i} = 1, \\ X^{(0)} & = X_{input} with treatment set to T_{i} = 0 . \end{matrix}

The trained CNN-LSTM model f maps the covariate tensor to predicted outcomes:

{\hat{Y}}^{(1)} = f (X^{(1)}), {\hat{Y}}^{(0)} = f (X^{(0)}),

where each

{\hat{Y}}^{(t)}

includes predictions for all three outcomes under treatment

t \in {0, 1}

.

The CATE for individual i and outcome

k \in {1, 2, 3}

is computed as

{\hat{CATE}}_{i}^{(k)} = {\hat{Y}}_{i}^{(1, k)} - {\hat{Y}}_{i}^{(0, k)} .

The corresponding ATE for outcome k is

{\hat{ATE}}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} {\hat{CATE}}_{i}^{(k)} .

3.2. Bootstrap Confidence Intervals

To quantify uncertainty in ATE estimates, we apply a nonparametric bootstrap procedure. Specifically, we repeatedly sample (with replacement) from the individual-level CATE estimates to construct a bootstrap distribution:

{\hat{ATE}}_{b}^{(k) *} = \frac{1}{n} \sum_{i = 1}^{n} {\hat{CATE}}_{i, b}^{(k)}, b = 1, \dots, B,

where B denotes the number of bootstrap replicates.

A

(1 - α) \times 100 %

confidence interval for

{\hat{ATE}}^{(k)}

is then computed using the empirical

α / 2

and

1 - α / 2

quantiles of the bootstrap distribution.

3.3. Hypothesis Testing

To assess whether the average treatment effect is significantly different from zero, we test the null hypothesis:

H_{0} : E [{\hat{CATE}}^{(k)}] = 0 .

A one-sample t-test is performed using the empirical distribution of

{{\hat{CATE}}_{i}^{(k)}}_{i = 1}^{n}

.

3.4. Model Fit Metrics

We evaluate model predictive performance for each outcome using outcome-appropriate metrics:

Count Outcomes (Priors Count and Days in Jail):

For count data modeled using Poisson or NB-based distributions, we report the Poisson deviance. Given observed counts

y_{i}

and predicted means

{\hat{μ}}_{i}

, the deviance is

D = 2 \sum_{i = 1}^{n} [y_{i} log (\frac{y_{i}}{{\hat{μ}}_{i}}) - (y_{i} - {\hat{μ}}_{i})],

with the convention that terms with

y_{i} = 0

are set to zero.

For continuous-like count outcomes such as days in jail, we also report the root mean squared error (RMSE):

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{μ}}_{i})}^{2}} .

Survival Outcome (Time to Recidivism):

We evaluate censored time-to-event predictions using the concordance index (C-index), which quantifies the agreement between predicted risk scores

{\hat{η}}_{i}

and observed survival times

T_{i}

:

C - index = \frac{Number of concordant pairs}{Number of comparable pairs},

where a value closer to 1 indicates superior discrimination performance.

These metrics jointly assess the model’s predictive ability and treatment effect estimation performance across heterogeneous outcomes, with appropriate uncertainty quantification and hypothesis testing.

4. Data Description and Data Analysis

4.1. Data Description

We used the publicly available COMPAS dataset [1], which contains criminal justice data including demographic information, prior offenses, jail durations, and recidivism outcomes. The dataset was preprocessed as follows:

Date fields c_jail_in and c_jail_out were converted to date format, and the jail duration (days_in_jail) was computed as the difference in days.
Key features selected include: age, sex, race, priors_count, decile_score, and days_in_jail.
Observations with missing values in any of these features or in the recidivism indicator (is_recid) were removed.
The treatment variable $T_{i} \in {0, 1}$ , corresponding to is_recid, was binarized, where 1 indicates recidivism.
Survival time was defined based on days_b_screening_arrest (the count of days between screening date and (original) arrest date), with negative or missing values imputed using the time elapsed since jail release to the current date. The event indicator corresponds to recidivism occurrence [21].

To enable functional deep learning, the structured features were transformed into functional inputs via B-spline basis expansions [4]. The transformation pipeline is as follows:

A design matrix $X$ was constructed using one-hot encoding for categorical variables and standardization for continuous variables to ensure numerical stability and consistent scale [22].
The set of input features defines a pseudo-temporal grid over the unit interval $[0, 1]$ , enabling the functional representation of covariate profiles across this abstract time domain.
A cubic B-spline basis with $K = 15$ functions was employed, along with a second-order roughness penalty of $λ = 10^{- 2}$ , chosen to balance smoothness and flexibility. These choices were guided by prior work on functional representations in high-dimensional settings [23], which confirmed the robustness of model performance over a reasonable range of K and $λ$ values.
For each subject, smoothed basis coefficients were estimated by projecting the scaled features onto the B-spline basis using penalized least squares, resulting in a compact functional encoding.

These B-spline coefficient matrices form the input tensor

X \in R^{n \times K \times C}

, where

C = 2

represents two input channels: the smoothed functional covariates and the repeated binary treatment indicator. This representation serves as input to the CNN-LSTM architecture, facilitating learning of spatio-temporal and treatment-dependent dynamics [9].

Outcomes modeled include the following:

Prior offense counts (priors_count),
Days spent in jail (days_in_jail),
Survival outcome (recidivism indicator and survival time).

This representation allows the model to capture complex nonlinear relationships and treatment heterogeneity for counterfactual inference [8].

Table 1 summarizes numerical features:

Age: Mean 34.92 years; interquartile range from 25 to 43.
Priors Count: Skewed; 25% have no priors, but mean is 3.29.
Decile Score: Ranges from −1 to 10; median is 4.
Days in Jail: Median is only 1 day, but outliers raise the mean to 25.47.
Survival Time: Highly censored; many individuals rearrested within 1 day or censored.
Event (Recid.): Approximately 34.4% experienced recidivism.
Treatment T: Same as event indicator, used for modeling purposes.

Table 2 summarizes categorical variables:

Sex: Male (79.5%), Female (20.5%).
Race: Largest groups: African-American (48.3%) and Caucasian (34.0%).

Table 1 and Table 2 provide detailed descriptive statistics for all variables used in modeling. These tables highlight skewed distributions and demographic imbalances that must be accounted for in interpretation and subgroup fairness analysis.

4.2. Data Analysis

Let

T_{i} \in {0, 1}

denote the binary treatment indicator for individual i. The model jointly estimates three outcomes

Y_{i}^{(k)}

for

k = 1, 2, 3

, corresponding to

Priors Count ( $k = 1$ ),
Days in Jail ( $k = 2$ ),
Time to Recidivism ( $k = 3$ , censored survival outcome).

For each outcome k, the conditional average treatment effect (CATE) and average treatment effect (ATE) are estimated as

{\hat{CATE}}_{i}^{(k)} = {\hat{Y}}_{i}^{(k)} (T = 1) - {\hat{Y}}_{i}^{(k)} (T = 0), {\hat{ATE}}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} {\hat{CATE}}_{i}^{(k)} .

Model Comparison Table

Table 3 summarizes model fit metrics and average treatment effect (ATE) estimates—with bootstrapped 95% confidence intervals—for the three outcomes across four count-based models: ZIP, Poisson, ZINB, and NB.

Survival Fit: The ZIP model achieved the highest C-index (0.602), indicating better discriminative ability in modeling the censored time-to-recidivism outcome.
Count Fit: Poisson and NB models yielded lower deviance values for priors count and jail days but differed markedly in ATE directions and magnitudes.
ATE Estimates: The Poisson model estimated large negative ATEs for priors count and jail days, while ZIP and ZINB models produced near-zero or slightly positive ATEs. The NB model suggested a positive treatment effect for jail days (ATE = 0.105), contrasting with the Poisson estimate.
Model Stability: We observed that the ZINB model yielded extremely high deviance values, which may reflect challenges related to model identifiability and instability. These issues are often exacerbated in the presence of extreme overdispersion and a very high proportion of zeros, both of which characterize our dataset. Specifically, the excessive zero counts may have led to convergence difficulties or unreliable parameter estimation in the zero-inflation component, while the count component may have struggled to capture the highly dispersed distribution. These findings are consistent with known limitations of ZINB models in complex, zero-inflated settings and highlight the need for alternative approaches, such as flexible deep learning-based models or copula-based frameworks, which may offer more robust performance under such conditions. We have added a more detailed discussion on this point in the revised manuscript to guide future applications of zero-inflated count models.

CNN-LSTM Model Outcome-Specific Results of ZIP Model

Interpretation of Outcome-Specific Findings:

Priors Count (Outcome 1): The ZIP deviance of 59,592.49 indicates a reasonable fit to the overdispersed count data. The ATE estimate of 0.017 with a narrow 95% confidence interval is statistically significant ( $p < 0.001$ ), though the practical effect size is small.
Days in Jail (Outcome 2): The high deviance (870,126.42) and RMSE (73.415) reflect considerable variability and skewness in jail durations. The estimated ATE of 0.032 is positive and statistically significant.
Time to Recidivism (Outcome 3): The C-index of 0.602 reflects modest predictive discrimination for the censored survival outcome. While this value is above random chance (0.5), it remains substantially below the threshold typically considered indicative of strong predictive performance (e.g., >0.7). We acknowledge that achieving higher C-index values in this context is particularly challenging due to the complexity of human behavior, unmeasured confounding factors, and the limitations of available covariates in capturing the nuanced determinants of recidivism. Accordingly, we have tempered our language and expanded the discussion in the manuscript to clarify these limitations and to emphasize that while the model provides some signal, there remains substantial room for improvement. The estimated ATE of 0.085 indicates a statistically significant association between treatment and increased risk of recidivism, but this result should be interpreted in light of the model’s limited discriminative ability.

Figure 1 visualizes the full distribution of estimated individual treatment effects for each outcome. Notably, the treatment effects on days in jail and time to recidivism show positively skewed distributions, indicating that while most individuals exhibit minimal response, a nontrivial minority may experience substantially worse outcomes under treatment. These insights highlight the importance of moving beyond ATE estimates and considering CATE distributions in policy evaluation.

As summarized in Table 4, the CNN–LSTM models demonstrate strong performance across all three outcomes, with statistically significant average treatment effect (ATE) estimates at the

p < 0.001

level. For the Priors Count outcome, modeled using a zero-inflated Poisson (ZIP) specification, the model achieves a deviance of 59,592.49, with an ATE of 0.017 (95% CI: 0.017–0.017), indicating a small but precise estimated effect. For the Days in Jail outcome (ZIP model), the deviance is 870,126.42, with an RMSE of 73.42 and an ATE of 0.032 (95% CI: 0.032–0.033). Finally, for the Time to Recidivism outcome, modeled via survival analysis, the concordance index (C-index) is 0.602, and the ATE is 0.085 (95% CI: 0.083–0.086), suggesting a larger estimated treatment effect relative to the other outcomes. Across all outcomes, the large t-statistics and extremely low p-values reinforce the statistical significance of the findings, supporting the robustness of the estimated effects.

Figure 1 presents histograms of the estimated conditional average treatment effects (CATEs) produced by the CNN-LSTM model under the zero-inflated Poisson (ZIP) assumption across the three outcomes.

Top Panel: CATE for Priors Count (ZIP)

The CATE distribution for the number of prior offenses is approximately unimodal with a slight left skew. Most individuals exhibit small positive CATEs ranging from 0 to 0.002, suggesting a mild increase in prior offenses attributable to treatment. A smaller subset shows near-zero or negative CATEs, implying treatment neutrality or mild deterrent effects. These negative effects may be associated with individuals who, based on covariates such as lower baseline risk scores, fewer prior offenses, or stronger social support indicators, are more responsive to rehabilitative interventions.

Middle Panel: CATE for Jail Days (ZIP)

The CATE distribution for jail days is markedly right-skewed, with a concentration around modest positive values (approximately 0.01 to 0.015), indicating that most individuals experience slight increases in jail time due to treatment. However, a minority of individuals show negative CATEs, suggesting that the treatment may reduce jail exposure for certain subgroups. These individuals may possess characteristics such as lower risk classification, stable housing, or fewer prior violations, which could interact with treatment to yield beneficial effects.

Bottom Panel: CATE for Survival Risk (ZIP)

For the time-to-recidivism outcome, the CATE distribution is roughly symmetric with a right skew. While the majority exhibit positive CATEs—consistent with increased recidivism risk following treatment—a notable subset presents negative CATEs, suggesting reduced risk. This pronounced heterogeneity may reflect the influence of contextual covariates such as age, prior offense severity, or mental health indicators, which modulate how individuals respond to the intervention. The presence of these protective effects underscores the potential for targeted treatment assignment to mitigate adverse impacts.

Overall, the CNN-LSTM model under the ZIP framework captures substantial heterogeneity in individual-level treatment effects across all outcomes. These patterns highlight the limitations of ATE-focused evaluation and emphasize the importance of exploring covariate-driven differences in treatment response to inform more personalized policy interventions.

5. Conclusions

This paper introduces a novel deep learning framework that integrates functional data representation with a multi-task CNN-LSTM architecture for causal inference on recidivism outcomes. The model jointly predicts zero-inflated count and censored survival outcomes under both treatment (

T = 1

) and control (

T = 0

) conditions, using loss functions tailored to each outcome type (i.e., zero-inflated Poisson loss and Cox partial likelihood loss).

Empirical results demonstrate that the proposed framework captures individual-level treatment effect heterogeneity across multiple outcome dimensions. In addition to aggregate

{\hat{ATE}}^{(k)}

estimates, the learned

{\hat{CATE}}_{i}^{(k)}

distributions reveal substantial variation in treatment responsiveness. Preliminary subgroup analyses by race, age, and COMPAS risk score suggest differential effects, motivating further investigation into fairness, equity, and policy implications of risk-based interventions. Incorporating structured subgroup discovery or fairness-aware training objectives could enhance interpretability and ethical relevance in future work.

Several limitations warrant discussion. First, the COMPAS dataset contains missing and potentially imputed values, which may introduce bias in both outcome and treatment models. While standard preprocessing steps were used, unmeasured confounding and selection bias remain concerns in non-randomized settings. Second, external validity may be limited, as findings are based on a specific jurisdiction and algorithmic risk tool. Generalization to other populations or jurisdictions with different legal, demographic, or policy contexts should be approached with caution.

Future research directions include extending the framework to handle multi-valued or continuous treatments, incorporating uncertainty quantification for counterfactual predictions, and integrating methods such as instrumental variables or proximal causal inference to address unmeasured confounding. Further exploration of subgroup heterogeneity using interpretable model components, causal forests, or post hoc explainability techniques would also strengthen the policy relevance and fairness assessments of such models in high-stakes applications like criminal justice and healthcare.

Funding

This research received no external funding.

Institutional Review Board Statement

Not related in this research.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

We thank the three respected referees, Associated Editor, and Editor for their constructive and helpful suggestions, which led to substantial improvement in the revised version. For the sake of transparency and reproducibility, the R code for this study can be found in the following GitHub repository: R code GitHub site (accessed on 14 July 2025) (https://github.com/kjonomi/Rcode/blob/main/compas_fda_dl).

Conflicts of Interest

The author declares no conflicts of interest.

References

Angwin, J.; Larson, J.; Mattu, S.; Kirchner, L. Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks. ProPublica. 2016. Available online: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed on 10 May 2025).
Alaa, A.M.; van der Schaar, M. Deep multi-task Gaussian processes for survival analysis with competing risks. In Proceedings of the NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2326–2334. [Google Scholar]
Lim, B.; Alaa, A.; van der Schaar, M. Forecasting treatment responses over time using recurrent marginal structural networks. In Proceedings of the NIPS’18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 7494–7504. [Google Scholar]
Ramsay, J.O.; Silverman, B.W. Functional Data Analysis, 2nd ed.; Springer: New York, NY, USA, 2005. [Google Scholar]
Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14. [Google Scholar] [CrossRef]
Cox, D.R. Regression models and life-tables. J. R. Stat. Soc. Ser. (Methodol.) 1972, 34, 187–202. [Google Scholar] [CrossRef]
Kvamme, H.; Borgan, Ø.; Scheel, I. Time-to-event prediction with neural networks and Cox regression. J. Mach. Learn. Res. 2019, 20, 1–30. [Google Scholar]
Shalit, U.; Johansson, F.D.; Sontag, D. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3076–3085. [Google Scholar]
Schwab, P.; Linhardt, L.; Bauer, S.; Karlen, W. Learning counterfactual representations for estimating individual dose-response curves. Proc. AAAI Conf. Artif. Intell. 2019, 33, 5612–5619. [Google Scholar] [CrossRef]
Kim, J.-M. Integrating Copula-Based Random Forest and Deep Learning Approaches for Analyzing Heterogeneous Treatment Effects in Survival Analysis. Mathematics 2025, 13, 1659. [Google Scholar] [CrossRef]
Kim, J.-M. Treatment effect estimation in survival analysis using deep learning-based causal inference. Axioms 2025, 14, 458. [Google Scholar] [CrossRef]
Kim, J.-M. A Copula-Driven CNN-LSTM Framework for Estimating Heterogeneous Treatment Effects in Multivariate Outcomes. Mathematics 2025, 13, 2384. [Google Scholar] [CrossRef]
Hill, J.L. Bayesian Nonparametric Modeling for Causal Inference. J. Comput. Graph. Stat. 2011, 20, 217–240. [Google Scholar] [CrossRef]
Ruppert, D.; Wand, M.P.; Carroll, R.J. Semiparametric Regression; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
Yao, F.; Müller, H.-G.; Wang, J.-L. Functional data analysis for sparse longitudinal data. J. Am. Stat. Assoc. 2018, 103, 577–590. [Google Scholar] [CrossRef]
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
Neelon, B. Bayesian Zero-Inflated Negative Binomial Regression Based on Pólya-Gamma Mixtures. Bayesian Anal. 2019, 14, 829–855. [Google Scholar] [CrossRef] [PubMed]
Lawless, J.F. Negative binomial and mixed poisson regression. Can. J. Stat. 1987, 15, 209–225. [Google Scholar] [CrossRef]
Ridout, M.; Demétrio, C.G.B.; Hinde, J. Models for count data with many zeros. Int. Biom. Conf. 1998, 19, 179–192. [Google Scholar]
Hilbe, J.M. Negative Binomial Regression; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
Schmidt, P.; Witte, A.D. Predicting Recidivism Using Survival Models; Springer: New York, NY, USA, 1988. [Google Scholar]
Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef] [PubMed]
Gertheiss, J.; Rügamer, D.; Liew, B.X.W.; Greven, S. Functional Data Analysis: An Introduction and Recent Developments. Biom. J. 2024, 66, e202300363. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Histograms of estimated conditional average treatment effects (CATEs) from the CNN-LSTM model under the zero-inflated Poisson (ZIP) assumption for three outcomes: (Top) Priors Count, (Middle) Days in Jail, and (Bottom) Time to Recidivism. The X-axes represent individual-level treatment effect estimates; the Y-axes denote frequency counts. The distributions highlight the heterogeneity of treatment effects, with long right tails for outcomes (Middle,Bottom), suggesting a small subset of individuals for whom the treatment substantially increases jail time or risk of rearrest. Vertical lines indicate sample means.

Table 1. Descriptive statistics for numerical variables in the COMPAS dataset. This table provides minimum, quartile, median, and maximum values, highlighting skewed distributions (e.g., Priors Count, Jail Days) and heavy censoring in survival time.

Variable	Min	1st Qu.	Median	Mean	3rd Qu.	Max
Age	18	25	32	34.92	43	96
Priors Count	0	0	1	3.29	4	43
Decile Score	−1	2	4	4.40	7	10
Days in Jail	0	1	1	25.47	13	2153
Survival Time	0	1	1	18.39	1	1057
Event (Recid.)	0	0	0	0.344	1	1
Treatment T	0	0	0	0.344	1	1

Table 2. Frequencies of sex and race categories in the COMPAS dataset. The dataset is heavily skewed towards male and African-American individuals, which has implications for fairness and subgroup analysis.

Sex	Count
Female	2169
Male	8408
Race	Count
African-American	5251
Asian	53
Caucasian	3703
Hispanic	937
Native American	33
Other	600

Table 3. Model performance and ATE summary across models.

Model	Deviance (Priors Count)	Deviance (Jail Days)	C-Index (Survival)	ATE (Priors Count)	ATE (Jail Days)	ATE (Survival)
ZIP	59,592.49	870,126.42	0.602	0.017	0.032	0.085
Poisson	5914.79	26,226.94	0.530	−0.136	−0.241	0.021
ZINB	2,008,810	19,609,262	0.538	−0.000	−0.000	0.076
NB	6070.53	64,812.73	0.518	−0.039	0.105	0.020

Table 4. CNN-LSTM model results by outcome: fit metrics, average treatment effect (ATE) estimates, and statistical significance.

Outcome	Metric	Value
Priors Count (ZIP)	Deviance	59,592.49
	ATE (95% CI)	0.017 (0.017, 0.017)
	t-statistic, p-value	$t = 168.07$ , $p < 0.001$
Days in Jail (ZIP)	Deviance	870,126.42
	RMSE	73.42
	ATE (95% CI)	0.032 (0.032, 0.033)
	t-statistic, p-value	$t = 114.18$ , $p < 0.001$
Time to Recidivism (Survival)	Concordance Index (C-index)	0.602
	ATE (95% CI)	0.085 (0.083, 0.086)
	t-statistic, p-value	$t = 107.85$ , $p < 0.001$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kim, J.-M. Multi-Task CNN-LSTM Modeling of Zero-Inflated Count and Time-to-Event Outcomes for Causal Inference with Functional Representation of Features. Axioms 2025, 14, 626. https://doi.org/10.3390/axioms14080626

AMA Style

Kim J-M. Multi-Task CNN-LSTM Modeling of Zero-Inflated Count and Time-to-Event Outcomes for Causal Inference with Functional Representation of Features. Axioms. 2025; 14(8):626. https://doi.org/10.3390/axioms14080626

Chicago/Turabian Style

Kim, Jong-Min. 2025. "Multi-Task CNN-LSTM Modeling of Zero-Inflated Count and Time-to-Event Outcomes for Causal Inference with Functional Representation of Features" Axioms 14, no. 8: 626. https://doi.org/10.3390/axioms14080626

APA Style

Kim, J.-M. (2025). Multi-Task CNN-LSTM Modeling of Zero-Inflated Count and Time-to-Event Outcomes for Causal Inference with Functional Representation of Features. Axioms, 14(8), 626. https://doi.org/10.3390/axioms14080626

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-Task CNN-LSTM Modeling of Zero-Inflated Count and Time-to-Event Outcomes for Causal Inference with Functional Representation of Features

Abstract

1. Introduction

2. Methods and Model Description

2.1. Functional Representation of Features

2.2. CNN-LSTM Model Architecture

2.3. Count Outcome Models and Loss Functions

2.4. Survival Outcome Model and Loss Function

2.5. Joint Training Objective and Counterfactual Estimation

3. Model Evaluation

3.1. Counterfactual Predictions and Treatment Effect Estimation

3.2. Bootstrap Confidence Intervals

3.3. Hypothesis Testing

3.4. Model Fit Metrics

4. Data Description and Data Analysis

4.1. Data Description

4.2. Data Analysis

5. Conclusions

Funding

Institutional Review Board Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI