1. Introduction
The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset has become a widely studied benchmark in criminal justice research, particularly for evaluating fairness and predictive accuracy in recidivism risk assessments [1]. While much of the existing literature focuses on classification performance and algorithmic bias, there is growing demand for methodologies that estimate the causal effects of incarceration-related interventions (such as jail time, diversion programs, or probation) on future criminal behavior. Estimating such causal impacts, including counterfactual outcomes and heterogeneous treatment effects (HTEs), is critical for designing evidence-based policies that reduce recidivism, improve fairness, and allocate resources efficiently.
Traditional causal inference methods—such as inverse probability weighting (IPW), regression adjustment, and propensity score matching—have been instrumental in estimating average treatment effects (ATE). However, these approaches often rely on strong parametric assumptions and struggle in the presence of high-dimensional features, nonlinear confounding structures, and multivariate outcomes common in criminal justice data. Moreover, real-world outcomes like prior offenses, jail duration, and time-to-recidivism frequently exhibit zero inflation and censoring, further complicating inference under conventional frameworks.
Recent advances in deep learning offer powerful tools for modeling complex, nonlinear, and high-dimensional relationships, particularly in the presence of structured dependencies across multiple outcomes. Architectures such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks have demonstrated strong performance in capturing spatial and temporal dynamics, making them well-suited for estimating conditional average treatment effects (CATEs) in domains that require personalized decision-making, such as healthcare [2] and time-series forecasting [3]. Building on this foundation, recent work has integrated deep learning with functional data analysis (FDA) [4], zero-inflated modeling [5], and survival analysis frameworks [6,7] to better address heterogeneity in outcome distributions and treatment response. These developments are particularly relevant for applications like criminal justice, where capturing individualized treatment heterogeneity across multiple, structurally diverse outcomes is critical [8,9,10,11,12].
In the criminal justice context, understanding heterogeneity in treatment response is critical. For instance, identifying subgroups who benefit from diversion instead of incarceration could help reduce recidivism and alleviate prison overcrowding. Conversely, flagging individuals for whom treatment is ineffective or counterproductive can motivate tailored interventions or increased supervision. Our modeling framework provides individual-level CATE estimates that can directly inform such policy decisions, offering a more nuanced and data-driven approach than aggregate ATE metrics alone.
While deep learning offers modeling power, its perceived lack of interpretability remains a barrier to adoption in high-stakes domains like criminal justice. To address this, we incorporate SHAP (SHapley Additive exPlanations) values to compute covariate-level contributions to predicted treatment effects. We also include individual-level case studies that illustrate how treatment effects vary across population subgroups. These tools aim to make our framework transparent, interpretable, and actionable for policymakers and stakeholders.
To contextualize the performance of our model, we compare it against a suite of standard and state-of-the-art causal inference methods, including IPW, regression adjustment, TARNet [8], and Bayesian Additive Regression Trees (BART) [13]. These comparisons are critical to demonstrating the robustness, flexibility, and accuracy of our proposed approach, particularly in handling multivariate, zero-inflated, and censored outcomes.
In this paper, we introduce a novel multi-task CNN-LSTM architecture tailored for causal inference on the COMPAS dataset. Our framework jointly models multiple outcomes: zero-inflated counts (e.g., prior offenses, jail days) using ZIP-based loss functions and censored time-to-recidivism using Cox partial likelihood. To integrate structured tabular features and treatment indicators, we apply FDA by transforming covariates into smooth functional representations via B-spline basis expansions. This facilitates the modeling of smooth temporal patterns and latent structure in covariates.
Our contributions are threefold:
- We propose a unified, deep learning-based framework for estimating CATEs across heterogeneous outcomes, including zero-inflated and censored data structures, relevant for criminal justice applications. 
- We enhance interpretability by integrating SHAP-based variable attribution and case-level analysis to improve transparency and stakeholder trust. 
- We perform a comprehensive empirical comparison against classical and deep causal inference methods using real-world COMPAS data, demonstrating improved performance and actionable insights for policy. 
While previous work on the COMPAS dataset has focused on predictive modeling or fairness evaluation, our contribution advances the field by offering an interpretable, personalized, and policy-relevant causal inference framework. To our knowledge, this is the first application of a deep functional multi-task model to jointly estimate causal effects across zero-inflated count and censored time-to-event outcomes within the COMPAS context. By bridging deep causal learning with practical needs in criminal justice reform, our work lays the foundation for individualized, data-driven policy decisions that balance efficacy, fairness, and transparency.
  2. Methods and Model Description
  2.1. Functional Representation of Features
Let $X \in \mathbb{R}^{n \times p}$ denote the matrix of observed features for $n$ individuals, where each row corresponds to a subject and each column to one of the $p$ predictors. The predictors may include both continuous and categorical variables. To prepare the data for functional modeling, we first apply preprocessing steps: (1) categorical variables are transformed using one-hot (dummy) encoding, and (2) all predictors are standardized to have zero mean and unit variance.
To capture smooth latent structures and reduce dimensionality, we interpret the $p$ predictors for each individual as discrete observations of a smooth underlying function. Specifically, we define a time grid
$$0 \le t_1 < t_2 < \cdots < t_p \le 1$$
that maps each predictor index $j$ to a location on a normalized interval. This allows us to treat the vector of covariates for subject $i$, denoted $x_i = (x_{i1}, \dots, x_{ip})^\top$, as a discretized realization of a smooth function $x_i(t)$ defined over $[0, 1]$, where $x_i(t_j) = x_{ij}$.
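To estimate this smooth function, we employ a basis expansion using B-spline basis functions [4]. Let $\{\phi_k(t)\}_{k=1}^{K}$ denote a set of $K$ B-spline basis functions. Then, each function $x_i(t)$ is approximated as
$$x_i(t) \approx \tilde{x}_i(t) = \sum_{k=1}^{K} c_{ik}\, \phi_k(t),$$
where $c_{ik}$ are subject-specific basis coefficients.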
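The coefficients $c_i = (c_{i1}, \dots, c_{iK})^\top$ are estimated by solving a penalized least squares problem:
$$\hat{c}_i = \arg\min_{c_i} \sum_{j=1}^{p} \Big( x_{ij} - \sum_{k=1}^{K} c_{ik}\, \phi_k(t_j) \Big)^2 + \lambda \int_0^1 \big[ \tilde{x}_i''(t) \big]^2 \, dt,$$
where $\tilde{x}_i''(t)$ denotes the second derivative of the approximated function, and the penalty term $\lambda \int_0^1 [\tilde{x}_i''(t)]^2 \, dt$ controls the roughness of the estimate. The smoothing parameter $\lambda \ge 0$ balances fidelity to the data with the smoothness of the functional approximation [14].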
Solving this optimization for each subject yields a coefficient matrix $C \in \mathbb{R}^{n \times K}$, where each row corresponds to the smoothed functional representation of an individual's predictors. This matrix $C$ is used as input to subsequent modeling steps, such as functional principal component analysis (FPCA), copula modeling, or deep learning architectures like CNN-LSTM.
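The penalized least squares step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bspline_basis` and `smooth_rows`, the equally spaced grid, the clamped uniform knot vector, the default values of `K` and `lam`, and the trapezoid-rule approximation of the roughness integral are all choices made here for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(K, degree=3):
    """BSpline object whose K outputs are clamped B-spline basis functions on [0, 1]."""
    inner = np.linspace(0.0, 1.0, K - degree + 1)
    knots = np.concatenate([np.zeros(degree), inner, np.ones(degree)])
    # Identity coefficients: evaluating the spline returns all K basis functions at once.
    return BSpline(knots, np.eye(K), degree)

def smooth_rows(X, K=6, lam=1e-2, n_quad=401):
    """Penalized least-squares B-spline coefficients for each row of X (n x p)."""
    n, p = X.shape
    basis = bspline_basis(K)
    t = np.linspace(0.0, 1.0, p)          # assumed equally spaced pseudo-time grid
    Phi = basis(t)                        # (p, K) basis values at the grid points
    # Roughness matrix R[k, l] ~ integral of phi_k''(u) * phi_l''(u) du (trapezoid rule).
    u = np.linspace(0.0, 1.0, n_quad)
    D2 = basis.derivative(2)(u)           # (n_quad, K) second derivatives
    w = np.full(n_quad, u[1] - u[0])
    w[[0, -1]] *= 0.5
    R = (D2 * w[:, None]).T @ D2
    # Normal equations (Phi'Phi + lam * R) c_i = Phi' x_i, solved for all subjects at once.
    C = np.linalg.solve(Phi.T @ Phi + lam * R, Phi.T @ X.T).T
    return C, Phi
```

Calling `smooth_rows(X)` on a standardized `n x p` matrix returns an `n x K` coefficient matrix together with the basis matrix, so `Phi @ C[i]` gives the smoothed profile of subject `i` on the grid. Larger values of `lam` pull each profile toward a straight line, which has zero second derivative.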
  2.2. CNN-LSTM Model Architecture
We employ a hybrid CNN-LSTM architecture to model both nonlinear and temporal relationships in the smoothed functional covariates and treatment effects. The model is trained on B-spline basis coefficients derived from the original covariate functions.
Let the input tensor be denoted as
$$\mathcal{X} \in \mathbb{R}^{n \times K \times 2},$$
where $n$ is the number of observations, $K$ is the number of B-spline basis coefficients (timesteps), and $2$ denotes the number of input channels. The two channels correspond to the smoothed functional covariates and a binary treatment indicator, which is repeated across all time steps.
The CNN-LSTM architecture transforms the input $\mathcal{X}$ through the following sequence of operations [9,15]:
- 1D Convolution: A one-dimensional convolutional layer with $F$ filters and a kernel size of 5 is applied to extract local temporal patterns. The activation function is ReLU: $H^{(1)} = \mathrm{ReLU}(\mathrm{Conv1D}(\mathcal{X}))$.
- Max Pooling: A temporal max-pooling operation with pool size 2 reduces the temporal dimensionality and retains salient features: $H^{(2)} = \mathrm{MaxPool}(H^{(1)})$.
- LSTM Encoding: A unidirectional LSTM layer with 32 units models long-range dependencies across the basis functions: $h_T = \mathrm{LSTM}(H^{(2)})$, where $h_T \in \mathbb{R}^{32}$ denotes the final hidden state in the temporal sequence [16].
- Dropout Regularization: Dropout with a rate of $r$ is applied to prevent overfitting: $\tilde{h} = \mathrm{Dropout}_r(h_T)$.
- Fully Connected Layer: A dense layer with 32 units and ReLU activation transforms the sequence encoding into a shared latent representation: $z = \mathrm{ReLU}(W \tilde{h} + b)$, where $W \in \mathbb{R}^{32 \times 32}$, $b \in \mathbb{R}^{32}$, and $z \in \mathbb{R}^{32}$.
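From the shared representation $z_i$, we derive three outcome-specific prediction heads to model heterogeneous treatment effects across multiple outcomes:
$$\mu_{1i} = \exp(w_1^\top z_i + b_1), \qquad \mu_{2i} = \exp(w_2^\top z_i + b_2), \qquad \eta_i = w_3^\top z_i + b_3,$$
where $w_k \in \mathbb{R}^{32}$ and $b_k \in \mathbb{R}$ for $k = 1, 2, 3$, and the subscript $i$ indexes individual observations.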
The exponential link ensures positivity for the count outcome means $\mu_{1i}$ and $\mu_{2i}$, which can be modeled using Poisson or negative binomial (NB) distributions. The linear risk score $\eta_i$ is used in survival models such as the Cox proportional hazards model or the DeepSurv framework.
This multi-task design enables the model to share information across outcome types while learning outcome-specific treatment effects. Regularization, nonlinearity, and temporal encoding are incorporated to capture complex dependencies between the functional inputs and outcomes.
While CNN-LSTM models offer strong performance in capturing spatio-temporal dynamics, they may underperform in settings with sparse or weakly structured data, where temporal or spatial dependencies are minimal or inconsistent. Additionally, when treatment effect heterogeneity is governed by complex interactions that do not align well with local convolutional filters or sequential memory structures, the CNN-LSTM may fail to capture these patterns effectively. These limitations highlight the importance of model selection based on data characteristics and suggest that hybrid or ensemble approaches may be more appropriate in such cases.
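The layer sequence described in this section can be sketched in PyTorch as follows. This is an illustrative sketch rather than the authors' code: the class name `MultiTaskCNNLSTM`, the filter count (16), the dropout rate (0.2), and the use of "same" padding in the convolution are assumptions, since those values are not stated in the text. The kernel size, pool size, LSTM width, dense width, and the three heads follow the description above.

```python
import torch
import torch.nn as nn

class MultiTaskCNNLSTM(nn.Module):
    """Conv1D -> MaxPool -> LSTM -> Dropout -> Dense -> three outcome heads."""

    def __init__(self, in_channels=2, n_filters=16, kernel_size=5,
                 lstm_units=32, dense_units=32, dropout=0.2):
        super().__init__()
        # n_filters, dropout and the padding scheme are assumed values.
        self.conv = nn.Conv1d(in_channels, n_filters, kernel_size,
                              padding=kernel_size // 2)
        self.pool = nn.MaxPool1d(2)
        self.lstm = nn.LSTM(n_filters, lstm_units, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.dense = nn.Linear(lstm_units, dense_units)
        self.head_priors = nn.Linear(dense_units, 1)   # log-mean of priors count
        self.head_jail = nn.Linear(dense_units, 1)     # log-mean of jail days
        self.head_risk = nn.Linear(dense_units, 1)     # Cox log-risk score

    def forward(self, x):
        # x: (batch, K, 2); Conv1d expects (batch, channels, length).
        h = torch.relu(self.conv(x.transpose(1, 2)))
        h = self.pool(h)
        # Back to (batch, length, features) for the LSTM; keep the final hidden state.
        _, (h_last, _) = self.lstm(h.transpose(1, 2))
        z = torch.relu(self.dense(self.drop(h_last[-1])))
        mu1 = torch.exp(self.head_priors(z)).squeeze(-1)  # positive count mean
        mu2 = torch.exp(self.head_jail(z)).squeeze(-1)    # positive count mean
        eta = self.head_risk(z).squeeze(-1)               # unconstrained risk score
        return mu1, mu2, eta
```

A batch of shape `(batch, K, 2)` yields three vectors of length `batch`. A zero-inflated variant would need an additional head for the zero-inflation probability, which is omitted here because the text describes three heads.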
  2.3. Count Outcome Models and Loss Functions
To flexibly model count-valued outcomes that exhibit overdispersion and zero-inflation, we consider four candidate distributions: Poisson, NB, ZIP, and ZINB. These models allow for different levels of dispersion and excess zeros, which are common in justice-related datasets such as COMPAS [17]. In all models, the parameters $\mu_1$ and $\mu_2$ represent the expected count, though their distributional assumptions differ with respect to dispersion.
The Poisson distribution assumes that the count variable $Y$ has mean $\mu$ and equidispersion (i.e., variance equals the mean). The probability mass function (PMF) is given by
$$P(Y = y) = \frac{\mu^{y} e^{-\mu}}{y!}, \qquad y = 0, 1, 2, \dots$$
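The ZIP model introduces a latent zero-inflation mechanism [5]. With probability $\pi$, the count is deterministically zero, and with probability $1 - \pi$, the count follows a Poisson distribution with mean $\mu$:
$$P(Y = y) = \begin{cases} \pi + (1 - \pi)\, e^{-\mu}, & y = 0, \\ (1 - \pi)\, \dfrac{\mu^{y} e^{-\mu}}{y!}, & y = 1, 2, \dots \end{cases}$$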
The negative binomial (NB) model extends the Poisson distribution to account for overdispersion. It can be derived as a Poisson–Gamma mixture and is parameterized by a mean $\mu$ and a dispersion parameter $\theta$ [18]. The probability mass function (PMF) is given by
$$P(Y = y) = \binom{y + \theta - 1}{y} \left( \frac{\theta}{\theta + \mu} \right)^{\theta} \left( \frac{\mu}{\theta + \mu} \right)^{y}, \qquad y = 0, 1, 2, \dots$$
Here, $y$ denotes the number of events (e.g., counts or successes), $\mu$ is the expected value of the distribution, and $\theta > 0$ is the dispersion parameter. As $\theta \to \infty$, the NB distribution converges to the Poisson distribution with mean $\mu$, reflecting diminishing overdispersion.
The term
$$\binom{y + \theta - 1}{y}$$
is a generalized binomial coefficient, which can also be expressed as
$$\frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!},$$
extending the classical binomial coefficient to cases where $\theta$ is not necessarily an integer.
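The ZINB model further extends the ZIP by allowing the non-zero counts to follow an NB distribution, thus capturing both overdispersion and excess zeros [19,20]:
$$P(Y = y) = \begin{cases} \pi + (1 - \pi) \left( \dfrac{\theta}{\theta + \mu} \right)^{\theta}, & y = 0, \\ (1 - \pi)\, \dfrac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!} \left( \dfrac{\theta}{\theta + \mu} \right)^{\theta} \left( \dfrac{\mu}{\theta + \mu} \right)^{y}, & y = 1, 2, \dots \end{cases}$$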
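For each model, we define the count outcome loss as the negative log-likelihood averaged across observations:
$$\mathcal{L}_{\text{count}} = -\frac{1}{n} \sum_{i=1}^{n} \log P(Y_i = y_i \mid \vartheta_i),$$
where $\vartheta_i$ denotes model-specific outputs from the CNN-LSTM for subject $i$, such as $\mu_i$, $\pi_i$, or $\theta_i$, depending on the task.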
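The four count likelihoods and the averaged negative log-likelihood can be written compactly in NumPy. The sketch below is an illustration under the standard mean and dispersion parameterization of the NB distribution; the function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np
from scipy.special import gammaln

def poisson_logpmf(y, mu):
    return y * np.log(mu) - mu - gammaln(y + 1.0)

def nb_logpmf(y, mu, theta):
    # Mean/dispersion parameterization; approaches Poisson as theta grows.
    return (gammaln(y + theta) - gammaln(theta) - gammaln(y + 1.0)
            + theta * np.log(theta / (theta + mu))
            + y * np.log(mu / (theta + mu)))

def _zero_inflate(y, base_logpmf, pi):
    # y == 0: log(pi + (1 - pi) * P_base(0));  y > 0: log((1 - pi) * P_base(y)).
    log_zero = np.logaddexp(np.log(pi), np.log1p(-pi) + base_logpmf)
    log_pos = np.log1p(-pi) + base_logpmf
    return np.where(y == 0, log_zero, log_pos)

def zip_logpmf(y, mu, pi):
    return _zero_inflate(y, poisson_logpmf(y, mu), pi)

def zinb_logpmf(y, mu, theta, pi):
    return _zero_inflate(y, nb_logpmf(y, mu, theta), pi)

def count_nll(logpmf_values):
    """Negative log-likelihood averaged across observations."""
    return -np.mean(logpmf_values)
```

For example, `count_nll(zip_logpmf(y, mu, pi))` gives the ZIP loss for arrays of observed counts, predicted means, and zero-inflation probabilities. Working on the log scale with `logaddexp` avoids underflow when the predicted mean is large.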
  2.4. Survival Outcome Model and Loss Function
For time-to-event outcomes, we model the log-risk score $\eta_i$ using a neural network and adopt the Cox proportional hazards model framework [6]. The model is trained by minimizing the negative Cox partial log-likelihood:
$$\mathcal{L}_{\text{Cox}} = -\sum_{i=1}^{n} \delta_i \Big[ \eta_i - \log \sum_{j \in \mathcal{R}(\tau_i)} \exp(\eta_j) \Big],$$
where
- $\tau_i$ is the observed (possibly censored) event time for subject $i$,
- $\delta_i$ is the event indicator (1 if the event is observed, 0 if censored),
- $\mathcal{R}(\tau_i) = \{ j : \tau_j \ge \tau_i \}$ denotes the risk set, i.e., the set of individuals still at risk just before time $\tau_i$.
The Cox loss is differentiable with respect to $\eta_i$ and is commonly used in deep survival models such as DeepSurv.
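As a concrete illustration, the negative Cox partial log-likelihood can be computed with a pairwise risk-set mask. This is a simple O(n²) sketch with a hypothetical function name; ties are handled by including every subject whose time is at least the event time in the risk set, which is an assumption rather than a documented choice of the paper.

```python
import numpy as np

def cox_neg_partial_loglik(eta, time, event):
    """Negative Cox partial log-likelihood for risk scores eta."""
    eta = np.asarray(eta, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    # at_risk[i, j] is True when subject j is still at risk at subject i's time.
    at_risk = time[None, :] >= time[:, None]
    # Stable log-sum-exp of eta over each risk set.
    m = eta.max()
    log_denom = m + np.log((np.exp(eta - m)[None, :] * at_risk).sum(axis=1))
    # Only observed events (event == 1) contribute to the sum.
    return -np.sum(event * (eta - log_denom))
```

Because the loss depends on the scores only through differences within each risk set, adding a constant to every score leaves it unchanged, which is why the Cox head needs no intercept.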
  2.5. Joint Training Objective and Counterfactual Estimation
The model is trained to jointly minimize the total loss across all tasks. We first define the individual-level loss contribution for observation $i$ as
$$\ell_i = \ell_i^{\text{count}} + \ell_i^{\text{Cox}},$$
where $\ell_i^{\text{count}}$ denotes the loss for the count outcome (e.g., negative log-likelihood under a Poisson or negative binomial model), and $\ell_i^{\text{Cox}}$ is the partial log-likelihood loss for the Cox proportional hazards model, defined at the individual level.
The total objective function is then the average of the individual-level contributions:
$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \ell_i.$$
The treatment indicator is included as a time-invariant input channel in the CNN-LSTM model, allowing for counterfactual estimation by altering its value. After training, counterfactual predictions are generated under both treatment conditions ($T = 1$ and $T = 0$), enabling the estimation of individual and average treatment effects.
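For a given outcome (e.g., count or survival), the conditional average treatment effect (CATE) for subject $i$ is defined as
$$\mathrm{CATE}_i = \hat{Y}_i^{(1)} - \hat{Y}_i^{(0)},$$
where $\hat{Y}_i^{(t)}$ denotes the model prediction for subject $i$ with the treatment channel set to $t$, and the population-level average treatment effect (ATE) is computed as
$$\mathrm{ATE} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{CATE}_i.$$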
  3. Model Evaluation
After training the multi-task CNN-LSTM model on the observed data, we evaluate its performance in estimating treatment effects and predictive accuracy across the three outcomes: priors count, days in jail, and survival risk.
  3.1. Counterfactual Predictions and Treatment Effect Estimation
Let $\mathcal{X} \in \mathbb{R}^{n \times p \times m}$ denote the tensor of covariates for $n$ individuals, where $p$ is the number of preprocessed features and $m$ the number of functional basis evaluations (e.g., time points or basis expansion locations). Let $T_i \in \{0, 1\}$ denote the binary treatment assignment for subject $i$, and let $\mathbf{T} = (T_1, \dots, T_n)^\top$ denote the treatment vector.
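To estimate counterfactual outcomes, we generate two feature tensors for each individual:
$$\mathcal{X}^{(1)} = (\mathcal{X},\ \mathbf{T} = \mathbf{1}), \qquad \mathcal{X}^{(0)} = (\mathcal{X},\ \mathbf{T} = \mathbf{0}).$$
The trained CNN-LSTM model $f$ maps the covariate tensor to predicted outcomes:
$$\hat{Y}^{(t)} = f(\mathcal{X}^{(t)}), \qquad t \in \{0, 1\},$$
where each $\hat{Y}^{(t)}$ includes predictions for all three outcomes under treatment $t$.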
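The CATE for individual $i$ and outcome $k \in \{1, 2, 3\}$ is computed as
$$\mathrm{CATE}_i^{(k)} = \hat{Y}_{ik}^{(1)} - \hat{Y}_{ik}^{(0)}.$$
The corresponding ATE for outcome $k$ is
$$\mathrm{ATE}^{(k)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{CATE}_i^{(k)}.$$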
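The counterfactual step amounts to predicting twice, once with the treatment channel set to 1 and once with it set to 0, and differencing. The sketch below shows the mechanics with a hypothetical stand-in predictor `toy_predict` in place of a trained network; `with_treatment` and `cate_and_ate` are illustrative names, and the `(n, K, 2)` input layout is assumed from the architecture description.

```python
import numpy as np

def with_treatment(X_func, t):
    """Stack functional covariates (n, K) with a constant treatment channel -> (n, K, 2)."""
    T = np.full_like(X_func, float(t))
    return np.stack([X_func, T], axis=-1)

def cate_and_ate(predict, X_func):
    """predict: callable mapping an (n, K, 2) tensor to an (n, n_outcomes) array."""
    y1 = predict(with_treatment(X_func, 1))   # everyone treated
    y0 = predict(with_treatment(X_func, 0))   # everyone untreated
    cate = y1 - y0                            # individual-level effects
    return cate, cate.mean(axis=0)            # ATE per outcome

def toy_predict(X):
    """Hypothetical stand-in for a trained model (two outcomes)."""
    base = X[:, :, 0].mean(axis=1)
    treated = X[:, 0, 1]
    return np.column_stack([np.exp(0.1 * base + 0.5 * treated),
                            base + 2.0 * treated])
```

With the toy predictor the second outcome has a constant effect of 2 for every subject, while the first varies across subjects because of the exponential link. Estimates obtained this way are only as credible as the model's ability to adjust for confounding.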
  3.2. Bootstrap Confidence Intervals
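To quantify uncertainty in ATE estimates, we apply a nonparametric bootstrap procedure. Specifically, we repeatedly sample (with replacement) from the individual-level CATE estimates to construct a bootstrap distribution:
$$\widehat{\mathrm{ATE}}^{(k)}_{b} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{CATE}^{(k)}_{i_b^{*}}, \qquad b = 1, \dots, B,$$
where $i_b^{*}$ indexes the subjects drawn in the $b$-th resample and $B$ denotes the number of bootstrap replicates.
A $100(1 - \alpha)\%$ confidence interval for $\mathrm{ATE}^{(k)}$ is then computed using the empirical $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap distribution.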
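A percentile bootstrap over the individual-level estimates takes only a few lines. The function name, the default of 2000 replicates, and the fixed seed below are illustrative assumptions.

```python
import numpy as np

def bootstrap_ate_ci(cate, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of the CATE estimates."""
    rng = np.random.default_rng(seed)
    cate = np.asarray(cate, dtype=float)
    n = cate.shape[0]
    # Each row of idx is one resample of n subjects drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_ates = cate[idx].mean(axis=1)
    lo, hi = np.quantile(boot_ates, [alpha / 2, 1 - alpha / 2])
    return cate.mean(), lo, hi
```

Because the resampling is applied to fitted CATE values with the model held fixed, the resulting interval reflects variation of the predictions across individuals and does not include uncertainty from re-estimating the network.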
  3.3. Hypothesis Testing
To assess whether the average treatment effect is significantly different from zero, we test the null hypothesis:
$$H_0: \mathrm{ATE}^{(k)} = 0.$$
A one-sample t-test is performed using the empirical distribution of $\{\mathrm{CATE}_i^{(k)}\}_{i=1}^{n}$.
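The test can be run with SciPy's `ttest_1samp` applied to the vector of individual-level estimates. The wrapper name below is hypothetical.

```python
import numpy as np
from scipy import stats

def ate_t_test(cate):
    """Two-sided one-sample t-test of H0: mean(CATE) = 0."""
    cate = np.asarray(cate, dtype=float)
    res = stats.ttest_1samp(cate, popmean=0.0)
    return res.statistic, res.pvalue
```

The statistic equals the sample mean divided by its standard error, so with several thousand subjects and tightly clustered predictions it can be very large even when the mean effect itself is small.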
  3.4. Model Fit Metrics
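We evaluate model predictive performance for each outcome using outcome-appropriate metrics:
For count data modeled using Poisson or NB-based distributions, we report the Poisson deviance. Given observed counts $y_i$ and predicted means $\hat{\mu}_i$, the deviance is
$$D = 2 \sum_{i=1}^{n} \Big[ y_i \log\Big( \frac{y_i}{\hat{\mu}_i} \Big) - (y_i - \hat{\mu}_i) \Big],$$
with the convention that the term $y_i \log(y_i / \hat{\mu}_i)$ is set to zero when $y_i = 0$.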
For continuous-like count outcomes such as days in jail, we also report the root mean squared error (RMSE):
$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu}_i)^2 }.$$
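We evaluate censored time-to-event predictions using the concordance index (C-index), which quantifies the agreement between predicted risk scores $\hat{\eta}_i$ and observed survival times $\tau_i$:
$$C = \frac{ \sum_{i \ne j} \delta_i \, \mathbb{1}(\tau_i < \tau_j) \, \mathbb{1}(\hat{\eta}_i > \hat{\eta}_j) }{ \sum_{i \ne j} \delta_i \, \mathbb{1}(\tau_i < \tau_j) },$$
where $\delta_i$ is the event indicator and a value closer to 1 indicates superior discrimination performance.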
These metrics jointly assess the model’s predictive ability and treatment effect estimation performance across heterogeneous outcomes, with appropriate uncertainty quantification and hypothesis testing.
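The three fit metrics can be computed directly from predictions. The sketch below uses hypothetical function names; the C-index counts ties in predicted risk as half-concordant, which is a common convention assumed here rather than one stated in the text.

```python
import numpy as np

def poisson_deviance(y, mu):
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # For y == 0 the ratio is replaced by 1 so that y * log(ratio) contributes zero.
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

def rmse(y, mu):
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return np.sqrt(np.mean((y - mu) ** 2))

def c_index(time, event, risk):
    """Share of comparable pairs in which the earlier event has the higher risk score."""
    time, event, risk = map(np.asarray, (time, event, risk))
    # Pair (i, j) is comparable when i had an observed event before j's time.
    comparable = (time[:, None] < time[None, :]) & (event[:, None] == 1)
    concordant = comparable & (risk[:, None] > risk[None, :])
    tied = comparable & (risk[:, None] == risk[None, :])
    return (concordant.sum() + 0.5 * tied.sum()) / comparable.sum()
```

The pairwise C-index is quadratic in the number of subjects, which is manageable for a dataset of this size but would need a sorted implementation for much larger samples.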
  4. Data Description and Data Analysis
  4.1. Data Description
We used the publicly available COMPAS dataset [1], which contains criminal justice data including demographic information, prior offenses, jail durations, and recidivism outcomes. The dataset was preprocessed as follows:
- Date fields c_jail_in and c_jail_out were converted to date format, and the jail duration (days_in_jail) was computed as the difference in days. 
- Key features selected include: age, sex, race, priors_count, decile_score, and days_in_jail. 
- Observations with missing values in any of these features or in the recidivism indicator (is_recid) were removed. 
- The treatment variable $T$, corresponding to is_recid, was binarized, where 1 indicates recidivism. 
- Survival time was defined based on days_b_screening_arrest (the number of days between the screening date and the original arrest date), with negative or missing values imputed using the time elapsed from jail release to the current date. The event indicator corresponds to recidivism occurrence [21]. 
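The first four preprocessing steps can be sketched with pandas as follows. The column names come from the list above; the function name `preprocess_compas` and the output column `T` are illustrative, and the jail duration is taken as whole elapsed days, which is an assumption about how the difference was rounded. The survival-time construction is omitted because its imputation rule depends on details not given here.

```python
import pandas as pd

def preprocess_compas(df):
    out = df.copy()
    # Convert the jail date fields and compute the stay length in whole days.
    out["c_jail_in"] = pd.to_datetime(out["c_jail_in"])
    out["c_jail_out"] = pd.to_datetime(out["c_jail_out"])
    out["days_in_jail"] = (out["c_jail_out"] - out["c_jail_in"]).dt.days
    # Keep the selected features and drop rows with any missing value.
    keep = ["age", "sex", "race", "priors_count", "decile_score",
            "days_in_jail", "is_recid"]
    out = out.dropna(subset=keep)
    # Binary treatment indicator: 1 when is_recid equals 1.
    out["T"] = (out["is_recid"] == 1).astype(int)
    return out[keep + ["T"]]
```

Rows with a missing jail date produce a missing duration and are therefore removed by the same `dropna` call that handles the other features.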
To enable functional deep learning, the structured features were transformed into functional inputs via B-spline basis expansions [4]. The transformation pipeline is as follows:
- A design matrix $X$ was constructed using one-hot encoding for categorical variables and standardization for continuous variables to ensure numerical stability and consistent scale [22]. 
- The set of input features defines a pseudo-temporal grid over the unit interval $[0, 1]$, enabling the functional representation of covariate profiles across this abstract time domain. 
- A cubic B-spline basis with $K$ functions was employed, along with a second-order roughness penalty with smoothing parameter $\lambda$, chosen to balance smoothness and flexibility. These choices were guided by prior work on functional representations in high-dimensional settings [23], which confirmed the robustness of model performance over a reasonable range of $K$ and $\lambda$ values. 
- For each subject, smoothed basis coefficients were estimated by projecting the scaled features onto the B-spline basis using penalized least squares, resulting in a compact functional encoding. 
These B-spline coefficient matrices form the input tensor $\mathcal{X} \in \mathbb{R}^{n \times K \times 2}$, where the last dimension represents two input channels: the smoothed functional covariates and the repeated binary treatment indicator. This representation serves as input to the CNN-LSTM architecture, facilitating learning of spatio-temporal and treatment-dependent dynamics [9].
Outcomes modeled include the following:
- Prior offense counts (priors_count), 
- Days spent in jail (days_in_jail), 
- Survival outcome (recidivism indicator and survival time). 
This representation allows the model to capture complex nonlinear relationships and treatment heterogeneity for counterfactual inference [8].
Table 1 summarizes numerical features:
 - Age: Mean 34.92 years; interquartile range from 25 to 43. 
- Priors Count: Skewed; 25% have no priors, but mean is 3.29. 
- Decile Score: Ranges from −1 to 10; median is 4. 
- Days in Jail: Median is only 1 day, but outliers raise the mean to 25.47. 
- Survival Time: Highly censored; many individuals rearrested within 1 day or censored. 
- Event (Recid.): Approximately 34.4% experienced recidivism. 
- Treatment T: Same as event indicator, used for modeling purposes. 
Table 2 summarizes categorical variables:
 - Sex: Male (79.5%), Female (20.5%). 
- Race: Largest groups: African-American (48.3%) and Caucasian (34.0%). 
Table 1 and Table 2 provide detailed descriptive statistics for all variables used in modeling. These tables highlight skewed distributions and demographic imbalances that must be accounted for in interpretation and subgroup fairness analysis.
   4.2. Data Analysis
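Let $T_i \in \{0, 1\}$ denote the binary treatment indicator for individual $i$. The model jointly estimates three outcomes $Y_{ik}$ for $k = 1, 2, 3$, corresponding to
$$Y_{i1}: \text{priors count}, \qquad Y_{i2}: \text{days in jail}, \qquad Y_{i3}: \text{time to recidivism}.$$
For each outcome $k$, the conditional average treatment effect (CATE) and average treatment effect (ATE) are estimated as
$$\mathrm{CATE}_i^{(k)} = \hat{Y}_{ik}^{(1)} - \hat{Y}_{ik}^{(0)}, \qquad \mathrm{ATE}^{(k)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{CATE}_i^{(k)},$$
where $\hat{Y}_{ik}^{(t)}$ is the predicted value of outcome $k$ for individual $i$ with the treatment indicator set to $t$.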
Table 3 summarizes model fit metrics and average treatment effect (ATE) estimates—with bootstrapped 95% confidence intervals—for the three outcomes across four count-based models: ZIP, Poisson, ZINB, and NB.
 - Survival Fit: The ZIP model achieved the highest C-index (0.602), indicating better discriminative ability in modeling the censored time-to-recidivism outcome. 
- Count Fit: Poisson and NB models yielded lower deviance values for priors count and jail days but differed markedly in ATE directions and magnitudes. 
- ATE Estimates: The Poisson model estimated large negative ATEs for priors count and jail days, while ZIP and ZINB models produced near-zero or slightly positive ATEs. The NB model suggested a positive treatment effect for jail days (ATE = 0.105), contrasting with the Poisson estimate. 
- Model Stability: We observed that the ZINB model yielded extremely high deviance values, which may reflect challenges related to model identifiability and instability. These issues are often exacerbated in the presence of extreme overdispersion and a very high proportion of zeros, both of which characterize our dataset. Specifically, the excessive zero counts may have led to convergence difficulties or unreliable parameter estimation in the zero-inflation component, while the count component may have struggled to capture the highly dispersed distribution. These findings are consistent with known limitations of ZINB models in complex, zero-inflated settings and highlight the need for alternative approaches, such as flexible deep learning-based models or copula-based frameworks, which may offer more robust performance under such conditions. 
- Priors Count (Outcome 1): The ZIP deviance of 59,592.49 indicates a reasonable fit to the overdispersed count data. The ATE estimate of 0.017 with a narrow 95% confidence interval is statistically significant, though the practical effect size is small. 
- Days in Jail (Outcome 2): The high deviance (870,126.42) and RMSE (73.415) reflect considerable variability and skewness in jail durations. The estimated ATE of 0.032 is positive and statistically significant. 
- Time to Recidivism (Outcome 3): The C-index of 0.602 reflects modest predictive discrimination for the censored survival outcome. While this value is above random chance (0.5), it remains substantially below the threshold typically considered indicative of strong predictive performance (e.g., >0.7). Achieving higher C-index values in this context is particularly challenging due to the complexity of human behavior, unmeasured confounding factors, and the limitations of available covariates in capturing the nuanced determinants of recidivism. Thus, while the model provides some signal, there remains substantial room for improvement. The estimated ATE of 0.085 indicates a statistically significant association between treatment and increased risk of recidivism, but this result should be interpreted in light of the model's limited discriminative ability. 
Figure 1 visualizes the full distribution of estimated individual treatment effects for each outcome. Notably, the treatment effects on days in jail and time to recidivism show positively skewed distributions, indicating that while most individuals exhibit minimal response, a nontrivial minority may experience substantially worse outcomes under treatment. These insights highlight the importance of moving beyond ATE estimates and considering CATE distributions in policy evaluation.
As summarized in Table 4, the CNN-LSTM models demonstrate strong performance across all three outcomes, with statistically significant average treatment effect (ATE) estimates. For the Priors Count outcome, modeled using a zero-inflated Poisson (ZIP) specification, the model achieves a deviance of 59,592.49, with an ATE of 0.017 (95% CI: 0.017–0.017), indicating a small but precise estimated effect. For the Days in Jail outcome (ZIP model), the deviance is 870,126.42, with an RMSE of 73.42 and an ATE of 0.032 (95% CI: 0.032–0.033). Finally, for the Time to Recidivism outcome, modeled via survival analysis, the concordance index (C-index) is 0.602, and the ATE is 0.085 (95% CI: 0.083–0.086), suggesting a larger estimated treatment effect relative to the other outcomes. Across all outcomes, the large t-statistics and extremely low p-values reinforce the statistical significance of the findings, supporting the robustness of the estimated effects.
Figure 1 presents histograms of the estimated conditional average treatment effects (CATEs) produced by the CNN-LSTM model under the zero-inflated Poisson (ZIP) assumption across the three outcomes.
 The CATE distribution for the number of prior offenses is approximately unimodal with a slight left skew. Most individuals exhibit small positive CATEs ranging from 0 to 0.002, suggesting a mild increase in prior offenses attributable to treatment. A smaller subset shows near-zero or negative CATEs, implying treatment neutrality or mild deterrent effects. These negative effects may be associated with individuals who, based on covariates such as lower baseline risk scores, fewer prior offenses, or stronger social support indicators, are more responsive to rehabilitative interventions.
The CATE distribution for jail days is markedly right-skewed, with a concentration around modest positive values (approximately 0.01 to 0.015), indicating that most individuals experience slight increases in jail time due to treatment. However, a minority of individuals show negative CATEs, suggesting that the treatment may reduce jail exposure for certain subgroups. These individuals may possess characteristics such as lower risk classification, stable housing, or fewer prior violations, which could interact with treatment to yield beneficial effects.
For the time-to-recidivism outcome, the CATE distribution is roughly symmetric with a slight right skew. The majority of individuals exhibit positive CATEs, consistent with increased recidivism risk following treatment, but a notable subset presents negative CATEs, suggesting reduced risk. This pronounced heterogeneity may reflect the influence of contextual covariates such as age, prior offense severity, or mental health indicators, which modulate how individuals respond to the intervention. The presence of these protective effects underscores the potential for targeted treatment assignment to mitigate adverse impacts.
Overall, the CNN-LSTM model under the ZIP framework captures substantial heterogeneity in individual-level treatment effects across all outcomes. These patterns highlight the limitations of ATE-focused evaluation and emphasize the importance of exploring covariate-driven differences in treatment response to inform more personalized policy interventions.
  5. Conclusions
This paper introduces a novel deep learning framework that integrates functional data representation with a multi-task CNN-LSTM architecture for causal inference on recidivism outcomes. The model jointly predicts zero-inflated count and censored survival outcomes under both treatment ($T = 1$) and control ($T = 0$) conditions, using loss functions tailored to each outcome type (i.e., zero-inflated Poisson loss and Cox partial likelihood loss).
Empirical results demonstrate that the proposed framework captures individual-level treatment effect heterogeneity across multiple outcome dimensions. In addition to aggregate ATE estimates, the learned CATE distributions reveal substantial variation in treatment responsiveness. Preliminary subgroup analyses by race, age, and COMPAS risk score suggest differential effects, motivating further investigation into fairness, equity, and policy implications of risk-based interventions. Incorporating structured subgroup discovery or fairness-aware training objectives could enhance interpretability and ethical relevance in future work.
Several limitations warrant discussion. First, the COMPAS dataset contains missing and potentially imputed values, which may introduce bias in both outcome and treatment models. While standard preprocessing steps were used, unmeasured confounding and selection bias remain concerns in non-randomized settings. Second, external validity may be limited, as findings are based on a specific jurisdiction and algorithmic risk tool. Generalization to other populations or jurisdictions with different legal, demographic, or policy contexts should be approached with caution.
Future research directions include extending the framework to handle multi-valued or continuous treatments, incorporating uncertainty quantification for counterfactual predictions, and integrating methods such as instrumental variables or proximal causal inference to address unmeasured confounding. Further exploration of subgroup heterogeneity using interpretable model components, causal forests, or post hoc explainability techniques would also strengthen the policy relevance and fairness assessments of such models in high-stakes applications like criminal justice and healthcare.