Article

AI-Assisted Sentencing Modeling Under Explainability Constraints: Framework Design and Judicial Applicability Analysis

1 School of Criminal Justice, Shandong University of Political Science and Law, Jinan 250014, China
2 College of Design and Innovation, Tongji University, Shanghai 200092, China
* Author to whom correspondence should be addressed.
Information 2026, 17(3), 234; https://doi.org/10.3390/info17030234
Submission received: 30 January 2026 / Revised: 24 February 2026 / Accepted: 25 February 2026 / Published: 1 March 2026
(This article belongs to the Special Issue Artificial Intelligence Technologies for Sustainable Development)

Abstract

The integration of artificial intelligence into criminal sentencing decisions represents one of the most consequential applications of algorithmic systems in contemporary governance. While AI-assisted risk assessment tools promise enhanced consistency and predictive accuracy, their deployment in judicial contexts raises profound concerns regarding transparency, due process, and fundamental rights. This paper proposes a comprehensive framework for AI-assisted sentencing modeling that embeds explainability as a foundational constraint rather than an afterthought. Drawing upon the landmark State v. Loomis decision, empirical analyses of the COMPAS algorithm, and emerging regulatory frameworks including the European Union Artificial Intelligence Act, we examine the tension between predictive performance and interpretive transparency. Our framework integrates a three-layer explanation architecture: inherent interpretability through generalized additive models (GA2Ms) providing transparent global structure, exact local feature attribution derived directly from the additive model decomposition without approximation, and counterfactual reasoning that identifies minimal input changes altering risk classifications. We demonstrate through rigorous experimental validation on the ProPublica COMPAS dataset (n = 6172) that explainability-constrained models achieve comparable predictive validity to opaque alternatives (AUC 0.71 versus 0.70–0.72 for black-box methods) while satisfying constitutional due process requirements and emerging regulatory mandates under the EU Artificial Intelligence Act. The impossibility theorems governing algorithmic fairness are examined in light of their implications for sentencing equity, and we propose that transparent model architectures enable targeted interventions unavailable when decision logic remains concealed. The paper concludes with policy guidance for jurisdictions seeking to implement AI-assisted sentencing systems that balance public safety objectives with procedural fairness and individual rights.

1. Introduction

The deployment of algorithmic risk assessment instruments in criminal justice systems has accelerated dramatically over the past two decades, fundamentally altering how sentencing decisions are informed and justified. Tools such as the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), the Public Safety Assessment (PSA), and the Level of Service Inventory-Revised (LSI-R) now influence the liberty interests of millions of defendants annually across multiple jurisdictions [1,2]. Proponents argue that these instruments enhance objectivity, reduce inter-judge disparities, and improve resource allocation within correctional systems. Critics counter that algorithmic sentencing tools perpetuate historical biases, operate as inscrutable “black boxes,” and undermine the individualized assessment that due process demands [3,4].
The tension between predictive utility and interpretive transparency has emerged as a central challenge in this domain. Machine learning models capable of superior predictive performance often achieve such performance through complex, nonlinear transformations that resist human comprehension. Conversely, models designed for transparency may sacrifice predictive accuracy, potentially compromising public safety objectives. This apparent trade-off has prompted some scholars to question whether algorithmic sentencing can ever satisfy the dual imperatives of accuracy and accountability [5].
This paper argues that the framing of explainability as a constraint rather than a competing objective offers a productive path forward. Rather than treating transparency as one value to be balanced against others, we propose that explainability requirements should function as non-negotiable architectural constraints within which predictive systems must operate. This approach reflects the constitutional reality that due process is not merely one policy preference among many, but a fundamental requirement that circumscribes permissible governmental action.
The contribution of this work is threefold. First, we develop a theoretical framework situating explainability requirements within the broader normative architecture of criminal sentencing, drawing upon constitutional doctrine, administrative law principles, and emerging AI governance frameworks. Second, we propose a technical architecture for explainability-constrained sentencing models that integrates multiple complementary explanation mechanisms calibrated to judicial decision-making contexts. Third, we assess the practical viability of this framework through comparative analysis of existing risk assessment instruments and evaluation against emerging regulatory requirements, particularly those articulated in the European Union Artificial Intelligence Act.
An important distinction must be drawn between technical design feasibility and institutional adoptability. While this paper demonstrates that explainability-constrained sentencing models are technically feasible—achieving competitive predictive performance while satisfying transparency requirements—institutional adoption faces distinct and independently formidable challenges that warrant systematic analysis.
Research on judicial technology adoption identifies several categories of institutional resistance. First, epistemic conservatism: courts exhibit documented reluctance to adopt technically sophisticated tools, reflecting both professional norms favoring established procedures and legitimate concerns about judicial capacity to evaluate algorithmic recommendations critically [6]. Second, organizational inertia: court administrative structures, procurement processes, and workflow routines create path dependencies that resist technological disruption regardless of technical merit. Third, professional identity concerns: judges may perceive algorithmic decision support as encroaching upon judicial discretion—a core professional value—even when such tools are positioned as advisory rather than determinative. Fourth, accountability ambiguity: the introduction of algorithmic intermediaries complicates traditional accountability structures, creating uncertainty about responsibility allocation when algorithmic recommendations prove erroneous.
We emphasize that all claims in this paper regarding judicial comprehension, cognitive load management, and decision-making improvement are theoretically motivated design objectives grounded in cognitive science literature rather than empirically demonstrated outcomes. Design feasibility establishes what is possible; institutional adoption depends upon judicial training, organizational culture, appellate oversight, stakeholder acceptance, and resource allocation. Empirical validation of judicial engagement with the proposed explanation architecture—through controlled experiments, field studies, and structured expert feedback—constitutes a critical priority for future research, as discussed in Section 7. This paper addresses primarily design feasibility while Section 6 provides implementation guidance addressing adoption barriers. The success of explainable AI-assisted sentencing ultimately depends not only on sound technical design but also on institutional reforms ensuring that judges possess both the capacity and the incentives to engage meaningfully with algorithmic recommendations rather than deferring uncritically to technological authority.
The structure of the paper proceeds as follows. Section 2 reviews the existing literature on algorithmic risk assessment in criminal justice, with particular attention to empirical evaluations of predictive validity, fairness properties, and the mathematical impossibility results that constrain achievable fairness guarantees. Section 3 examines the legal and regulatory landscape governing AI-assisted sentencing, including constitutional due process requirements under State v. Loomis and emerging statutory frameworks. Section 4 presents our proposed framework for explainability-constrained sentencing models, detailing both the technical architecture and the underlying design rationale. Section 5 provides a comparative analysis of framework performance against existing alternatives. Section 6 discusses implementation pathways and offers policy guidance. Section 7 concludes with reflections on the broader implications for algorithmic governance in high-stakes domains.

2. Literature Review and Theoretical Foundations

2.1. The Evolution of Risk Assessment in Criminal Sentencing

The use of actuarial instruments to inform criminal justice decisions has a lengthy pedigree, predating the computational era by several decades. Early risk assessment tools relied upon simple additive scoring systems derived from clinical experience and limited empirical validation [7]. The seminal work of Burgess in the 1920s established the paradigm of statistical prediction in criminology, demonstrating that actuarial methods could outperform clinical judgment in forecasting parole outcomes [8]. This insight—that structured, empirical approaches consistently outperform unstructured professional judgment—has been replicated across diverse prediction domains and remains foundational to contemporary risk assessment practice.
The advent of large-scale administrative datasets and increasingly sophisticated statistical methods enabled the development of instruments grounded in systematic empirical analysis rather than clinical intuition alone. Contemporary risk assessment instruments vary substantially in their methodological foundations, predictive targets, and validation evidence. First-generation tools relied primarily upon static factors such as criminal history and offense characteristics. Second-generation instruments incorporated dynamic factors amenable to intervention, including employment status, substance use patterns, and social networks. Third-generation instruments integrate static and dynamic factors within structured decision-making frameworks, while fourth-generation instruments attempt to link risk assessment to case planning and supervision strategies through the Risk-Need-Responsivity (RNR) paradigm [9].
The COMPAS instrument, developed by Northpointe (now Equivant), exemplifies the contemporary approach to algorithmic risk assessment. COMPAS evaluates defendants across multiple scales, including General Recidivism Risk, Violent Recidivism Risk, and Pretrial Failure to Appear. The instrument processes 137 features derived from defendant interviews and criminal history records, generating risk scores ranging from 1 (lowest risk) to 10 (highest risk) [10]. The proprietary nature of the COMPAS algorithm—Northpointe has declined to disclose the specific weights assigned to input features or the functional form relating inputs to outputs—has generated substantial controversy regarding the appropriateness of its use in sentencing determinations where defendants cannot meaningfully scrutinize the basis for adverse classifications.
Other widely deployed instruments illustrate the diversity of methodological approaches and the breadth of international adoption. The Public Safety Assessment (PSA), developed by the Laura and John Arnold Foundation, uses a fixed nine-factor additive scoring rule with publicly disclosed weights, representing a deliberate design choice favoring transparency over complexity [11]. The Level of Service Inventory-Revised (LSI-R) integrates 54 items across ten domains within a structured professional judgment framework that combines actuarial scoring with clinician override capabilities [12]. Internationally, England and Wales employ the Offender Assessment System (OASys), which combines actuarial risk estimation with structured needs assessment. Australia has developed the Level of Service/Risk-Need-Responsivity instrument (LS/RNR). Several Canadian provinces use the Statistical Information on Recidivism (SIR) scale. These instruments reflect varying positions along the transparency–complexity spectrum, with simpler additive tools (PSA, SIR) achieving interpretability through structural transparency while more complex instruments (LSI-R, OASys) sacrifice some transparency for richer clinical integration.
The evolution from clinical to actuarial to algorithmic risk assessment carries profound implications for the explainability debates examined in subsequent sections. First-generation actuarial instruments achieved interpretability through transparency: simple additive scoring rules enabled judges and defendants to understand precisely how risk classifications were derived. The shift to algorithmic instruments promises improved accuracy through complex pattern detection but threatens to sacrifice the transparency that enabled accountability in earlier approaches. This tension—between the simplicity that enables understanding and the complexity that improves prediction—frames the central challenge addressed by this paper. The question is not whether to return to first-generation methods but whether contemporary machine learning can achieve both predictive sophistication and interpretive transparency. The framework proposed in Section 4 suggests that this synthesis is achievable through careful architectural choices that embed explainability as a design constraint rather than treating it as a post hoc addition to opaque systems.
The trajectory from first-generation actuarial instruments to contemporary algorithmic tools thus reveals a recurring tension that directly motivates the explainability constraints proposed in this paper. Early instruments achieved transparency not through deliberate design but through structural simplicity: additive scoring rules were interpretable because they could not be otherwise. As methodological sophistication increased, transparency was progressively sacrificed in pursuit of predictive gains—a trade-off that appeared necessary given the perceived limitations of simple models. The modern explainability imperative reverses this trajectory by asking whether contemporary machine learning methods can recover the transparency that earlier instruments achieved through simplicity while retaining the predictive sophistication that complexity enables. The GA2M architecture proposed in Section 4 represents precisely this synthesis: an additive structure providing the same decomposability as first-generation scoring rules, combined with flexible shape functions capturing the nonlinear patterns that motivated the shift to complex algorithms. In this sense, the framework does not merely add explainability to modern methods but rather reconnects with the transparency that characterized the most defensible features of the actuarial tradition.

2.2. Empirical Evidence on Predictive Validity

Empirical evaluations of risk assessment instruments have yielded findings that challenge optimistic claims regarding algorithmic superiority. Area Under the Receiver Operating Characteristic Curve (AUC) values for recidivism prediction instruments typically range from 0.64 to 0.71, indicating moderate discriminative ability that falls considerably short of the performance standards applied in other high-stakes prediction domains [13,14]. To contextualize these figures: an AUC of 0.70 indicates that a randomly selected recidivist will score higher than a randomly selected non-recidivist 70% of the time—substantially better than the 50% expected from random assignment, but far from the 0.90+ AUC values typically required for clinical diagnostic instruments.
A seminal 2018 study by Dressel and Farid published in Science Advances challenged claims that algorithmic risk assessment substantially outperforms human judgment [15]. The researchers recruited participants through Amazon’s Mechanical Turk platform. Each participant received defendant descriptions containing seven features: age, sex, number of juvenile felonies, number of juvenile misdemeanors, number of prior (non-juvenile) crimes, crime degree, and crime charge. Participants achieved a mean accuracy of 62.1%, compared to COMPAS accuracy of 65.4% on the same Broward County dataset (shown in Table 1). The disparity between algorithmic and human performance was thus markedly smaller than proponents of algorithmic risk assessment had suggested.
More striking still, Dressel and Farid demonstrated that a simple logistic regression classifier using only two features—defendant age and total number of prior convictions—achieved accuracy equivalent to the full COMPAS instrument with its 137 input features. This finding carries profound implications for the transparency-accuracy trade-off debate: if parsimonious, fully interpretable models can achieve predictive performance comparable to complex proprietary systems, the opacity costs of the latter cannot be justified by appeals to superior accuracy.
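A minimal sketch of such a parsimonious baseline appears below, assuming a pandas and scikit-learn environment and a COMPAS-style table with hypothetical column names (age, priors_count, two_year_recid); it illustrates the two-feature approach Dressel and Farid describe and is not their original analysis code.

# Minimal two-feature recidivism baseline (hypothetical file and column names).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("compas_two_year.csv")              # hypothetical extract of the Broward data
X = df[["age", "priors_count"]]                      # the two features used by the simple baseline
y = df["two_year_recid"]                             # binary two-year recidivism outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"Two-feature baseline AUC: {auc:.3f}")
print("Coefficients:", dict(zip(X.columns, clf.coef_[0])))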
Subsequent research has both qualified and extended these findings. Lin et al. demonstrated that under conditions more closely approximating real-world judicial decision-making—where feedback is delayed or absent and information is less structured—algorithmic instruments do outperform human judges [16]. However, this advantage reflects less the sophistication of algorithmic methods than the well-documented inconsistency of human judgment under uncertainty. The policy implication is not that complex algorithms are necessary, but rather that structured decision aids of any form—including simple, transparent scoring rules—may improve upon unaided judicial discretion.
An important limitation of cross-jurisdictional predictive validity research is the role of contextual variables in moderating model performance. Judicial culture varies substantially across jurisdictions, affecting how risk assessment information is weighted relative to other sentencing considerations. Local sentencing norms—including prevailing views about the appropriateness of incarceration, the availability of alternative sanctions, and community tolerance for prediction error—influence the practical utility of risk instruments regardless of their technical accuracy. A model achieving 0.70 AUC in one jurisdiction may perform differently when deployed elsewhere, reflecting not only population differences but also variation in how algorithmic recommendations interact with local decision-making practices. This context-dependency strengthens the argument for transparent models: when model behavior can be examined and understood, local jurisdictions can assess whether learned risk patterns accord with local population characteristics and normative commitments, enabling informed adaptation rather than uncritical technology transfer.

2.3. Algorithmic Fairness and the Impossibility Theorems

The ProPublica investigation published in May 2016 brought algorithmic bias in criminal risk assessment to widespread public attention (shown in Table 2). Analyzing COMPAS scores for over 7000 defendants in Broward County, Florida, ProPublica researchers found systematic racial disparities in prediction errors [10]. Black defendants were nearly twice as likely as White defendants to be incorrectly classified as high risk (false positive rate of 44.9% versus 23.5%), while White defendants were nearly twice as likely as Black defendants to be incorrectly classified as low risk (false negative rate of 47.7% versus 28.0%).
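The error-rate asymmetry at the center of this dispute can be computed from any set of binary classifications and observed outcomes. The sketch below, with assumed column names (group, high_risk, recidivated), illustrates the per-group false positive rate, false negative rate, and positive predictive value; it is not ProPublica's analysis code.

# Sketch: group-wise error rates and predictive parity from assumed columns.
import pandas as pd

def error_rates_by_group(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for grp, sub in df.groupby("group"):
        tp = ((sub["high_risk"] == 1) & (sub["recidivated"] == 1)).sum()
        fp = ((sub["high_risk"] == 1) & (sub["recidivated"] == 0)).sum()
        fn = ((sub["high_risk"] == 0) & (sub["recidivated"] == 1)).sum()
        tn = ((sub["high_risk"] == 0) & (sub["recidivated"] == 0)).sum()
        rows.append({
            "group": grp,
            "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
            "false_negative_rate": fn / (fn + tp) if (fn + tp) else float("nan"),
            "positive_predictive_value": tp / (tp + fp) if (tp + fp) else float("nan"),
        })
    return pd.DataFrame(rows)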
Northpointe disputed these findings, arguing that COMPAS satisfied alternative fairness criteria, specifically predictive parity—the requirement that positive predictive value (the probability that a defendant classified as high-risk actually recidivates) be approximately equal across demographic groups [17]. This exchange exposed a fundamental tension in algorithmic fairness: multiple intuitively appealing fairness definitions cannot be satisfied simultaneously except under highly restrictive conditions.
Kleinberg, Mullainathan, and Raghavan formalized this impossibility result in a landmark 2016 paper, proving that three natural fairness conditions—calibration (risk scores reflect true probabilities), balance for the positive class (equal false negative rates), and balance for the negative class (equal false positive rates)—cannot all be achieved when base rates of the outcome differ across groups [18]. Chouldechova independently demonstrated a related impossibility: when recidivism base rates differ between groups, calibration is mathematically incompatible with equal false positive and false negative rates across groups [19].
The impossibility results carry profound normative implications that extend beyond technical fairness metrics. When base rates differ across groups—as they do for recidivism, reflecting both genuine behavioral differences and differential enforcement patterns—any risk assessment instrument must choose which type of error to equalize. Equalizing false positive rates (as ProPublica implicitly advocated) necessarily means that high-risk classifications carry different predictive meaning for different groups, potentially undermining the instrument’s utility for individual case assessment. Equalizing predictive parity (as Northpointe advocated) necessarily means that innocent members of higher-base-rate groups bear elevated risks of wrongful classification.
How should policymakers operationalize these fairness trade-offs in practice? We propose a structured deliberation process. First, jurisdictions should explicitly identify which fairness criterion to prioritize—calibration (ensuring risk scores have consistent meaning), error rate balance (equalizing false positive and false negative rates), or predictive parity (equalizing positive and negative predictive values)—based on local values and constitutional commitments. Second, impact analysis should quantify how this choice distributes prediction errors across demographic groups, making trade-offs visible to affected communities and their representatives. Third, threshold analysis should examine whether risk classification thresholds (low, medium, high) should be group-specific or universal, weighing the fairness implications of each approach. Fourth, ongoing monitoring should assess whether chosen fairness criteria remain satisfied as populations and base rates evolve. Fifth, periodic review should revisit initial fairness choices as normative consensus and evidence develop. This structured process cannot resolve the impossibility theorems’ fundamental constraints, but it can ensure that fairness choices are made deliberately, transparently, and subject to democratic accountability rather than embedded in opaque technical architectures.
This is not a technical problem amenable to algorithmic solution; it is a normative choice about who should bear the costs of predictive uncertainty in a world of imperfect information. The transparency of our proposed framework does not resolve this dilemma, but it does render the trade-offs explicit and subject to democratic deliberation rather than concealed within proprietary algorithmic architectures.

2.4. The Explainability Imperative in High-Stakes Decision-Making

The limitations of opaque algorithmic systems have prompted growing interest in explainable artificial intelligence (XAI) methods. Rudin has argued forcefully that high-stakes decisions should employ inherently interpretable models rather than post hoc explanations of opaque systems [5]. Her critique centers on a fundamental epistemological observation: explanations of black-box models are necessarily approximations that may fail to capture decision-relevant aspects of model behavior. An explanation that perfectly captured the black-box model’s decision logic would, by definition, be equivalent to the model itself, rendering the black box unnecessary.
This argument has particular force in the sentencing context. When liberty is at stake, defendants possess constitutional rights to understand and challenge the basis for adverse governmental action. Post hoc explanations that merely approximate an opaque model’s behavior cannot satisfy this requirement, because the explanation may diverge from the true decision logic in precisely those cases where challenge would be most warranted.
Post hoc explanation methods such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) generate local approximations of model behavior around specific predictions [20,21]. LIME constructs locally linear approximations by perturbing input features and observing output changes. SHAP draws upon cooperative game theory to assign contribution values to each feature, providing theoretically grounded attributions with desirable mathematical properties including local accuracy, missingness, and consistency (shown in Table 3).
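For concreteness, the following sketch shows how the two libraries are typically invoked against a hypothetical fitted classifier (model) and training feature matrix (X_train); exact signatures may vary across library versions, and neither method forms part of the framework proposed in Section 4.

# Post hoc attribution sketch (assumed objects: fitted classifier named model, DataFrame X_train).
import shap
from lime.lime_tabular import LimeTabularExplainer

# SHAP: game-theoretic contribution values for one prediction.
shap_explainer = shap.Explainer(model.predict_proba, X_train)
shap_values = shap_explainer(X_train.iloc[[0]])      # Explanation object for the first instance

# LIME: locally linear surrogate fitted around the same instance.
lime_explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["no_recidivism", "recidivism"],
    mode="classification",
)
lime_exp = lime_explainer.explain_instance(
    X_train.values[0], model.predict_proba, num_features=5
)
print(lime_exp.as_list())                            # top local feature weights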
While these methods offer valuable insights, their application in judicial contexts faces significant challenges. Judges and defendants require explanations that are not merely technically accurate but genuinely comprehensible to legal actors without machine learning expertise. The gap between technical and lay interpretability remains substantial, and bridging this gap requires careful attention to explanation design, presentation format, and the cognitive constraints of human decision-makers operating under time pressure and uncertainty.

3. Legal and Regulatory Framework

3.1. Constitutional Due Process Requirements

The constitutional foundation for transparency requirements in algorithmic sentencing derives primarily from the Due Process Clauses of the Fifth and Fourteenth Amendments. Due process jurisprudence has established that defendants possess a right to be sentenced based upon accurate information and to have meaningful opportunity to challenge the factual basis for sentencing decisions [22].
The Supreme Court’s decision in Gardner v. Florida (1977) established that defendants may not be sentenced on the basis of information they have no opportunity to deny or explain [23]. The Court reasoned that the fundamental fairness guaranteed by due process requires not merely that sentencing information be accurate, but that defendants have the capacity to test that accuracy through adversarial scrutiny. This principle creates tension with algorithmic instruments that rely upon proprietary methodologies defendants cannot scrutinize or challenge.
The State v. Loomis decision from the Wisconsin Supreme Court represents the most significant judicial engagement with algorithmic sentencing to date [24]. Eric Loomis challenged his sentence on due process grounds, arguing that the proprietary nature of the COMPAS algorithm prevented him from assessing its accuracy or challenging its conclusions. Loomis advanced three constitutional arguments: first, that the proprietary algorithm violated his right to be sentenced based upon accurate information; second, that reliance on group-based statistical predictions violated his right to individualized sentencing; and third, that COMPAS’s use of gender as a predictive factor constituted impermissible sex discrimination.
The Wisconsin Supreme Court rejected Loomis’s challenge but imposed significant limitations on permissible COMPAS usage that effectively cabin algorithmic risk assessment to an advisory role. The court held that risk scores may not be used to determine whether an offender is incarcerated, nor may they be used to determine the severity of a sentence. Sentencing courts must identify factors independent of the risk score that support the sentence imposed, ensuring that algorithmic recommendations inform but do not determine sentencing outcomes.
The court further mandated that presentence investigation reports incorporating COMPAS include five specific advisements: (a) the proprietary nature of COMPAS prevents disclosure of how factors are weighted; (b) risk scores are based on group data and cannot identify specific high-risk individuals with certainty; (c) some studies have raised questions about racial disparities in COMPAS classifications; (d) the instrument was developed using a national sample without cross-validation for the Wisconsin population; and (e) risk assessment tools require ongoing monitoring and re-norming for accuracy as populations and conditions change [24].
The United States Supreme Court declined to review the Wisconsin decision, leaving the constitutional status of algorithmic sentencing tools unresolved at the federal level [25]. This denial of certiorari should not be read as endorsement of algorithmic sentencing practices; the Court may simply have concluded that the Loomis framework adequately protected due process interests in that particular case. The constitutional questions remain live and will likely require eventual Supreme Court resolution as algorithmic sentencing becomes more prevalent and sophisticated.

3.2. The European Union Artificial Intelligence Act

The European Union Artificial Intelligence Act, which entered into force in August 2024, establishes a comprehensive regulatory framework for AI systems based upon risk classification [26]. The Act represents the world’s first horizontal AI regulation and has already begun to influence regulatory approaches in other jurisdictions. AI systems used in the administration of justice are designated as high-risk under Annex III, subjecting them to mandatory requirements for risk management, data governance, transparency, human oversight, accuracy, and robustness (shown in Table 4).
Article 6 and Annex III specifically identify AI systems “intended to assist a judicial authority in researching and interpreting facts and the law and in applying the law to a concrete set of facts” as high-risk applications requiring compliance with the Chapter III, Section 2 obligations. Recidivism prediction instruments used to inform sentencing decisions fall squarely within this category.
The transparency requirements articulated in Article 13 are particularly relevant for sentencing applications. High-risk AI systems must be designed and developed to ensure that their operation is sufficiently transparent to enable deployers to interpret system output and use it appropriately. This requires not merely that outputs be provided, but that accompanying information enable meaningful human comprehension of how outputs relate to inputs and what confidence should attach to predictions.
Article 14 mandates human oversight measures sufficient to enable human overseers to “fully understand the capacities and limitations of the high-risk AI system” and to “properly monitor its operation.” For sentencing applications, this requirement reinforces the Loomis framework’s insistence that judges exercise independent judgment rather than deferring to algorithmic recommendations.
The Act also establishes prohibited practices under Article 5, including the use of AI systems to assess the risk of natural persons committing criminal offenses “based solely on the profiling of a natural person or on assessing their personality traits and characteristics” [27]. This prohibition carves out an exception for systems that augment human assessments based on objective, verifiable facts directly linked to criminal activity, which would encompass most contemporary recidivism prediction instruments that incorporate criminal history data. Nevertheless, the prohibition reflects regulatory concern about predictive systems that substitute algorithmic judgment for individualized human assessment of culpability and dangerousness.
Significant implementation uncertainty surrounds how national courts will operationalize AI Act obligations in practice. The Act’s high-level requirements—transparency sufficient for “appropriate use,” human oversight enabling “full understanding,” accuracy achieving “appropriate levels”—require translation into concrete technical specifications and procedural safeguards. To illustrate these challenges, consider three scenarios reflecting different interpretive positions.
Scenario 1 (Minimal interpretation): A court adopts a proprietary risk assessment tool accompanied by a technical manual describing input features, general model type, and aggregate performance statistics. The court treats this documentation as satisfying Article 13 transparency requirements. Under this interpretation—which has no supporting case law but represents the least disruptive reading—transparency obligations reduce to documentation requirements that existing commercial systems may already satisfy.
Scenario 2 (Moderate interpretation): A court requires that the risk assessment system provide individual-level explanations identifying which factors contributed to each defendant’s classification and by how much. The system need not disclose source code, but must enable defendants and their counsel to understand and challenge specific predictions. This interpretation aligns most closely with the GDPR’s requirement for “meaningful information about the logic involved” (Article 13(2)(f)) and the Article 29 Working Party’s guidance, though its application to criminal sentencing specifically remains untested.
Scenario 3 (Maximal interpretation): A court requires full algorithmic transparency, including disclosed model architecture, feature weights, and training data documentation, treating anything less as incompatible with Article 13’s requirement that system operation be “sufficiently transparent.” This interpretation would effectively mandate open-source algorithms for judicial applications—a position supported by some scholars [28] but not yet adopted by any regulatory body or court.
We emphasize that these scenarios are illustrative rather than predictive: no EU member state court has yet adjudicated AI Act compliance for sentencing applications, and the Act’s delegated and implementing acts remain under development. Member state implementation will likely vary, potentially creating fragmentation in acceptable practices across the European Union. This uncertainty strengthens the case for frameworks prioritizing inherent interpretability: systems whose transparency derives from fundamental architectural choices rather than supplementary documentation are more likely to satisfy diverse interpretations of AI Act requirements across jurisdictions and over time.

3.3. The GDPR and the Right to Explanation

Complementing the AI Act’s sector-specific requirements, the General Data Protection Regulation provides additional horizontal constraints on automated decision-making applicable throughout EU member states [29]. Article 22 establishes a qualified prohibition on decisions “based solely on automated processing, including profiling, which produces legal effects concerning [the data subject] or similarly significantly affects him or her.” Criminal sentencing unquestionably produces legal effects of the highest magnitude, bringing AI-assisted sentencing squarely within Article 22’s scope.
The GDPR’s transparency requirements, articulated in Articles 13–15, require data controllers to provide “meaningful information about the logic involved” in automated decision-making, as well as “the significance and the envisaged consequences of such processing for the data subject.” The Article 29 Working Party (now the European Data Protection Board) has interpreted these provisions to require explanations that enable data subjects to understand “the rationale behind, or the criteria relied on in reaching the decision” [30].
Scholarly debate continues regarding whether the GDPR creates a genuine “right to explanation” or merely information rights that fall short of requiring comprehensible justifications for individual decisions [31]. Wachter, Mittelstadt, and Floridi have argued that the GDPR provides only a limited right to information about system functionality rather than a right to explanation of specific decisions. Others contend that the principle of meaningful information, combined with the requirement to provide information about “significance and envisaged consequences,” implies a right to decision-specific explanation.
Regardless of the precise doctrinal interpretation, the GDPR clearly establishes that automated decision-making affecting fundamental interests must be accompanied by transparency measures enabling affected individuals to understand and challenge such decisions. For sentencing applications, this means that defendants must receive information sufficient to identify potential errors in input data, understand how their characteristics contributed to risk classifications, and mount informed challenges where warranted.

4. Framework Design: Explainability-Constrained Sentencing Models

4.1. Design Principles and Theoretical Foundations

The proposed framework rests upon four foundational principles that translate normative requirements into technical constraints, reflecting both constitutional doctrine and emerging regulatory mandates.
The first principle treats transparency as a constraint rather than as a value to be traded against competing objectives. This formulation reflects the constitutional status of due process as a prerequisite for legitimate governmental action rather than one consideration among many. In practical terms, this principle means that model architectures must be selected from the class of interpretable models, with complex black-box approaches excluded from consideration regardless of any marginal accuracy advantages they might offer.
The second principle requires audience-appropriate explanation calibrated to the epistemic needs and cognitive capacities of intended recipients. Judges require explanations that map to familiar legal categories and support the reasoned elaboration of sentencing determinations. Defendants require explanations in accessible language that enable identification of potential errors and inform decisions about whether to contest risk assessments. Auditors and oversight bodies require technical explanations enabling systematic evaluation of model behavior across populations and over time. A single explanation format cannot serve all three audiences; the framework must generate multiple explanation modalities tailored to distinct informational needs.
The third principle embeds contestability into the system architecture. The framework must support meaningful opportunities for defendants to challenge predictions, including identification of potentially erroneous inputs, quantification of prediction uncertainty, and counterfactual analysis revealing how different inputs would alter outcomes. Contestability is not merely a procedural requirement but an epistemic one: adversarial scrutiny improves the accuracy of information upon which sentencing depends.
The fourth principle requires proportionality between explanation depth and decision magnitude. More severe sentencing outcomes—particularly those involving incarceration—require more comprehensive justification than lower-stakes decisions such as assignment to particular supervision programs. This graduated approach allocates explanation resources to the decisions where transparency matters most while avoiding unnecessary administrative burden for routine determinations.
The framework’s viability depends upon data quality and availability assumptions that constrain its applicability to jurisdictions meeting minimum data infrastructure requirements. Our implementation presumes access to structured criminal history data including arrest dates, charge categories, dispositions, and outcomes—information compiled in standardized electronic records. Many jurisdictions, particularly smaller counties and those with limited information technology infrastructure, maintain fragmented records across incompatible systems with substantial missing data and inconsistent coding practices. The framework should not be deployed in jurisdictions where foundational data quality cannot be assured; doing so risks producing outputs whose apparent precision masks underlying data deficiencies.
Implementation priorities should therefore include: (1) data infrastructure assessment identifying gaps and determining whether minimum quality thresholds are met; (2) data quality improvement through standardization, validation, and error correction preceding any model deployment; (3) graduated implementation beginning with jurisdictions possessing adequate data infrastructure and expanding only as data quality permits; and (4) explicit acknowledgment of inapplicability in data-poor contexts rather than deployment of sophisticated models operating on inadequate foundations. The framework cannot remedy data quality problems through algorithmic sophistication; transparent models have the advantage of making data limitations visible rather than concealing them within complex architectures.

4.2. Architectural Components

Translating these normative principles into technical architecture, the proposed framework integrates three complementary explanation mechanisms, each addressing distinct aspects of the judicial explanation burden. The architectural overview appears in Figure 1, with detailed specification of each component following.

4.2.1. Inherently Interpretable Modeling Core

The framework employs Generalized Additive Models with pairwise interactions (GA2Ms) as the foundational predictive architecture [32]. This model class was selected based upon four considerations relevant to the sentencing context.
First, the additive structure permits decomposition of predictions into individual feature contributions that can be explained in isolation. For a defendant with a predicted risk score, the framework can specify precisely how much that score reflects criminal history factors, how much reflects demographic characteristics, and how much reflects dynamic factors amenable to intervention. This decomposition supports both judicial understanding and defendant contestation.
Second, the shape functions mapping each feature to risk contribution can be visualized graphically, enabling judges to understand how specific factor values translate to risk assessments without requiring statistical training. A judge can observe, for example, that risk increases steeply with prior convictions from zero to three, then plateaus, providing intuition about the model’s behavior that would be unavailable from a black-box system.
Third, pairwise interactions capture important non-additive effects while maintaining interpretability. Certain factor combinations may be predictively important—for instance, the combination of young age and prior violent offenses may carry different predictive significance than either factor alone—and the GA2M framework accommodates such interactions without sacrificing transparency.
Fourth, empirical evaluations have demonstrated that GA2Ms achieve predictive performance comparable to black-box alternatives on tabular datasets typical of risk assessment applications [33]. The Explainable Boosting Machine (EBM) implementation of GA2Ms has achieved state-of-the-art performance on multiple benchmark tasks while remaining fully interpretable. For terminological clarity: GA2M refers to the model class (Generalized Additive Models with pairwise interactions); EBM (Explainable Boosting Machine) refers to the specific implementation within the InterpretML library. Throughout this paper, we use “GA2M” when discussing architectural properties and “EBM” only when referring to implementation-specific details.
The mathematical form of the model is:
logit(P[recidivism]) = β0 + Σ_{i=1}^{p} f_i(x_i) + Σ_{j∈S} f_j(x_{j1}, x_{j2})
where β0 is the intercept, f_i are univariate shape functions learned from data for each of the p features, and f_j are bivariate interaction functions for a selected subset S of feature pairs identified through automatic interaction detection.
Importantly, shape functions can be constrained to reflect domain knowledge. Where criminological theory or legal doctrine indicates that risk should increase monotonically with a factor (e.g., number of prior convictions), monotonicity constraints can be imposed during model training. Such constraints improve model interpretability by ensuring that learned relationships accord with substantive expectations, while also providing regularization that may improve out-of-sample performance.
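A brief sketch of how such a model can be fit and inspected with the InterpretML library follows; X_train, y_train, X_test, and y_test are assumed to hold the preprocessed features and outcomes described in Section 4.3, and API details may differ across interpret releases.

# GA2M sketch via the Explainable Boosting Machine (API details may vary by interpret version).
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

ebm = ExplainableBoostingClassifier(interactions=10, random_state=0)
ebm.fit(X_train, y_train)                            # assumed preprocessed features and outcomes

# Global view: one shape function per feature and one heatmap per retained interaction,
# each individually plottable for judicial review.
show(ebm.explain_global())

# Local view: per-term contributions for one defendant sum (with the intercept) to the
# predicted log-odds, so the attributions are exact rather than approximated.
show(ebm.explain_local(X_test.iloc[:1], y_test.iloc[:1]))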

4.2.2. Local Explanation Generation

For each individual prediction, the framework generates a structured local explanation identifying the factors contributing to that defendant’s risk classification. Unlike post hoc approximation methods such as LIME or SHAP that estimate feature contributions for black-box models, these explanations derive directly from the GA2M’s additive structure and thus achieve perfect fidelity by construction. The additive form logit(P[recidivism]) = β0 + Σ_{i} f_i(x_i) + Σ_{j∈S} f_j(x_{j1}, x_{j2}) permits exact decomposition of each prediction into constituent feature contributions without approximation error.
The explanation includes three components. Feature contribution values indicate how each input feature moved the prediction relative to the population baseline. For a defendant classified as high-risk, the explanation might indicate that three prior convictions contributed +0.8 to the log-odds of recidivism, while stable employment contributed −0.3, with the net effect and other factors yielding the final classification. These contributions are exact, not approximate, because the additive model structure permits perfect decomposition.
The explanation also identifies dominant factors—the three to five features contributing most substantially to the risk classification—with natural language descriptions of their influence. A defendant receiving an adverse classification can immediately identify which factors drove that classification and assess whether those factors are accurately recorded and appropriately weighted.
Uncertainty quantification completes the local explanation, providing confidence intervals that reflect both aleatoric uncertainty (inherent unpredictability in human behavior) and epistemic uncertainty (limitations in available data and model specification). A prediction accompanied by wide confidence intervals signals that the model has limited information about defendants with similar characteristics, cautioning judges against over-reliance on point estimates.
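The sketch below illustrates how exact per-feature contributions for a single defendant might be read from a fitted EBM, continuing the earlier example; the dictionary keys follow recent versions of the interpret local-explanation API and should be treated as assumptions.

# Exact local attribution sketch (dictionary keys assumed from recent interpret releases).
local = ebm.explain_local(X_test.iloc[:1], y_test.iloc[:1]).data(0)
contributions = sorted(
    zip(local["names"], local["scores"]),            # per-term log-odds contributions
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
print("Dominant factors (log-odds contributions):")
for name, score in contributions[:5]:
    direction = "raises" if score > 0 else "lowers"
    print(f"  {name}: {score:+.2f} ({direction} the assessed risk)")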

4.2.3. Counterfactual Explanation Generation

Counterfactual explanations identify minimal changes to input features that would alter the risk classification, answering the question: “What would need to be different for this defendant to receive a more favorable assessment?” Following the approach developed by Wachter, Mittelstadt, and Russell [34], the framework generates counterfactuals satisfying four desiderata.
Validity requires that the counterfactual actually produce a different prediction outcome—a counterfactual explaining a high-risk classification must yield a low- or medium-risk classification when substituted for actual inputs. Proximity requires that changes from the actual instance be minimized; counterfactuals that require implausible transformations (e.g., reducing age by twenty years) provide little insight. Sparsity favors counterfactuals that change few features rather than many, reflecting the intuition that simpler explanations are more useful than complex ones. Actionability, where possible, identifies changes involving modifiable factors rather than immutable characteristics, providing guidance about what defendants or supervision programs might do to reduce future risk.
For sentencing applications, counterfactual explanations serve multiple purposes. For defendants, they identify concrete factors that, if different, would result in more favorable assessments—information relevant both to understanding current classifications and to informing future behavior. For judges, counterfactuals reveal the sensitivity of risk classifications to specific inputs, supporting calibrated reliance on algorithmic recommendations. A classification that would change with small perturbations to uncertain inputs warrants less confidence than one that remains stable across plausible variations.
An important cautionary note concerns the socio-economic constraints affecting counterfactual actionability. While the framework designates immutable characteristics (age, sex, race) as excluded from counterfactual modifications, the distinction between technically modifiable and practically actionable features is more nuanced. Employment status, educational attainment, residential stability, and substance abuse treatment completion are formally modifiable but may be effectively inaccessible to defendants facing systemic barriers including poverty, housing instability, disability, lack of transportation, and geographic isolation from services. Presenting counterfactuals involving such features without acknowledging structural constraints risks implying that defendants bear individual responsibility for circumstances substantially shaped by systemic disadvantage.
The framework therefore accompanies counterfactual explanations with contextual advisements noting that (a) identified changes may require institutional support and resource access beyond individual control, (b) inability to achieve suggested modifications does not indicate higher culpability or moral failing, and (c) counterfactual analysis is intended to illuminate model sensitivity rather than prescribe individual action plans. Judges should interpret counterfactuals as diagnostic tools revealing which factors drive risk classifications, not as implicit judgments about what defendants should have done differently.
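To make the validity, proximity, sparsity, and actionability criteria concrete, the following simplified sketch performs a brute-force counterfactual search over a small set of assumed modifiable features, reusing the EBM fitted above; a deployed system would use a dedicated optimizer, and the feature names, candidate values, and 0.55 threshold are illustrative only.

# Brute-force counterfactual sketch (illustrative feature names, candidate values, and threshold).
import itertools
import pandas as pd

MODIFIABLE = {
    "employment": [0, 1],                            # assumed binary employment indicator
    "priors_count": [0, 1, 2, 3],                    # assumed candidate prior-conviction counts
}

def find_counterfactual(model, x: pd.Series, threshold: float = 0.55):
    base = model.predict_proba(x.to_frame().T)[0, 1]
    best = None
    for values in itertools.product(*MODIFIABLE.values()):
        cand = x.copy()
        for feat, val in zip(MODIFIABLE, values):
            cand[feat] = val
        prob = model.predict_proba(cand.to_frame().T)[0, 1]
        if (base >= threshold) != (prob >= threshold):               # validity: classification flips
            changed = [f for f in MODIFIABLE if cand[f] != x[f]]
            dist = sum(abs(cand[f] - x[f]) for f in changed)         # proximity: small changes
            score = (len(changed), dist)                             # sparsity ranked before distance
            if best is None or score < best[0]:
                best = (score, cand, prob, changed)
    return best

result = find_counterfactual(ebm, X_test.iloc[0])
if result is not None:
    _, cf, prob, changed = result
    print(f"Changing {changed} would move the risk estimate to {prob:.2f}")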

4.2.4. Uncertainty Quantification

The framework quantifies prediction uncertainty through three complementary mechanisms that address distinct sources of unpredictability in recidivism assessment.
Aleatoric Uncertainty (Irreducible): We employ conformal prediction methods to generate prediction intervals that reflect inherent variability in human behavior. For each test instance, we compute nonconformity scores based on calibration set residuals and construct prediction intervals at specified confidence levels (typically 90 percent or 95 percent). This provides statistically valid coverage guarantees: a 95 percent prediction interval will contain the true outcome for 95 percent of future defendants drawn from the same distribution, regardless of model specification. The mathematical foundation ensures that interval coverage holds even when model assumptions are violated, providing robust uncertainty quantification.
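A simplified split-conformal sketch of this procedure follows, using absolute residuals on a held-out calibration set (X_cal, y_cal, both assumed) as the nonconformity score.

# Split-conformal sketch (X_cal, y_cal are an assumed held-out calibration split).
import numpy as np

def conformal_interval(model, X_cal, y_cal, X_new, alpha: float = 0.10):
    cal_pred = model.predict_proba(X_cal)[:, 1]
    scores = np.abs(np.asarray(y_cal) - cal_pred)                    # nonconformity: absolute residuals
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)             # finite-sample corrected quantile
    q = np.quantile(scores, level, method="higher")
    p = model.predict_proba(X_new)[:, 1]
    return np.clip(p - q, 0, 1), np.clip(p + q, 0, 1)

lo, hi = conformal_interval(ebm, X_cal, y_cal, X_test.iloc[:1])
print(f"90% prediction interval: [{lo[0]:.2f}, {hi[0]:.2f}]")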
Epistemic Uncertainty (Model-related): We quantify model uncertainty using bootstrap aggregation. We train 100 GA2Ms on bootstrap samples of the training data and compute the standard deviation of predictions across ensemble members. High variance indicates that model predictions are sensitive to training sample composition, suggesting limited information about defendants with similar characteristics. We report prediction intervals as: ŷ ± 1.96·SD(bootstrap ensemble), providing 95 percent confidence intervals under normality assumptions. This captures uncertainty arising from finite training data and enables judges to distinguish between predictions based on extensive historical evidence versus those extrapolating from limited observations.
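The bootstrap ensemble can be sketched as follows; a reduced ensemble size is used here purely to keep the illustration lightweight, whereas the reported experiments use 100 refits.

# Bootstrap-ensemble sketch for epistemic uncertainty (reduced ensemble size for illustration).
import numpy as np
from sklearn.utils import resample
from interpret.glassbox import ExplainableBoostingClassifier

def bootstrap_uncertainty(X_train, y_train, X_new, n_models: int = 20):
    preds = []
    for seed in range(n_models):
        Xb, yb = resample(X_train, y_train, random_state=seed)       # bootstrap resample of training data
        member = ExplainableBoostingClassifier(random_state=seed)
        member.fit(Xb, yb)
        preds.append(member.predict_proba(X_new)[:, 1])
    preds = np.vstack(preds)
    mean, sd = preds.mean(axis=0), preds.std(axis=0)
    return mean, mean - 1.96 * sd, mean + 1.96 * sd                  # approximate 95% CI under normality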
Feature-level Uncertainty: For inputs with missing or uncertain values, we employ multiple imputation with 50 imputed datasets generated using chained equations. We generate predictions for each imputed dataset and report both the mean prediction and the between-imputation variance. High between-imputation variance indicates that predictions are sensitive to uncertain inputs, warranting additional scrutiny or data collection before sentencing. This mechanism makes explicit how measurement uncertainty propagates to prediction uncertainty, supporting informed decisions about whether additional investigation is warranted.
These uncertainty estimates are presented in presentence reports using accessible visualizations: point estimates with shaded confidence bands for epistemic uncertainty, and highlighted features when feature-level uncertainty exceeds predefined thresholds. Judges receive guidance on interpreting uncertainty: wide intervals suggest limited confidence and warrant reduced reliance on algorithmic recommendations in favor of individualized judicial assessment considering factors beyond the model’s scope.
To demonstrate how the three uncertainty mechanisms combine in practice, consider a hypothetical defendant: a 28-year-old male with three prior convictions, current employment, and a pending drug offense. The GA2M produces a point estimate of 0.62 probability of two-year recidivism, yielding a medium-high risk classification. The three uncertainty mechanisms contribute as follows.
Aleatoric uncertainty (conformal prediction): The 90% prediction interval is [0.44, 0.78], reflecting inherent behavioral variability. This interval spans the medium-to-high risk boundary (set at 0.55), indicating that the classification is not robust to irreducible uncertainty alone.
Epistemic uncertainty (bootstrap ensemble): The bootstrap standard deviation across 100 models is 0.04, yielding a 95% confidence interval of [0.54, 0.70]. The relatively narrow interval indicates that the model has sufficient training data for defendants with similar profiles. The prediction is not driven by sampling variability.
Feature-level uncertainty (multiple imputation): Employment status was self-reported and unverified. Multiple imputation across plausible employment scenarios produces predictions ranging from 0.58 (if employed) to 0.69 (if unemployed), with between-imputation standard deviation of 0.05. This flags employment verification as potentially consequential for the final classification.
The integrated presentence report output would communicate: “Risk estimate: 0.62 (medium-high). Confidence: MODERATE—the classification boundary falls within the 90% prediction interval, and the assessment is sensitive to unverified employment status. Recommendation: verify employment through independent documentation before finalizing risk classification. If employment is confirmed, the revised estimate of 0.58 falls closer to the medium-risk threshold, warranting reconsideration of classification severity.”
This example illustrates how the three uncertainty sources inform different aspects of judicial decision-making. Aleatoric uncertainty signals classification boundary proximity. Epistemic uncertainty indicates data sufficiency. Feature-level uncertainty identifies specific inputs warranting verification. Judges need not combine these into a single number; rather, each source calibrates reliance on the risk estimate for a different aspect of sentencing deliberation.

4.3. Experimental Methodology and Implementation Details

4.3.1. Dataset Preprocessing and Feature Engineering

All experiments utilize the ProPublica COMPAS dataset comprising 7214 defendants from Broward County, Florida (2013–2014) with two-year recidivism outcomes tracked through April 2016. Following standard preprocessing protocols established by Dressel and Farid [15], we applied systematic filters to ensure data quality: (1) cases with inconsistent or missing screening dates removed; (2) ordinary traffic offenses excluded as these represent qualitatively different risk profiles; (3) records with screening dates more than 30 days before or after arrest excluded to ensure temporal alignment between risk assessment and criminal justice contact; and (4) cases with recidivism recorded before the original COMPAS screening date excluded as data quality issues. This yielded an analysis sample of 6172 cases with complete information on predictor variables and outcome measures.
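The filtering protocol can be expressed compactly against the publicly released ProPublica file; the sketch below uses column names from compas-scores-two-years.csv and omits ancillary filters (for example, records with missing score text), so the resulting count is approximate.

# Filtering sketch using column names from the public compas-scores-two-years.csv release.
import pandas as pd

raw = pd.read_csv("compas-scores-two-years.csv")
df = raw[
    raw["days_b_screening_arrest"].between(-30, 30)                  # screening within 30 days of arrest
    & (raw["is_recid"] != -1)                                        # drop records with unusable outcome data
    & (raw["c_charge_degree"] != "O")                                # exclude ordinary traffic offenses
]
print(len(df))                                                       # roughly 6172 after the full protocol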
Feature Engineering: We constructed three feature sets to enable controlled comparison across model complexities. The minimal set comprises two features: age at screening (continuous, years) and total prior conviction count (integer). The standard set comprises seven features matching those used in prior benchmark studies: age at screening, sex (binary), number of juvenile felony charges, number of juvenile misdemeanor charges, number of prior adult convictions, current offense degree (felony versus misdemeanor), and current offense charge category (eight categories including violent, property, drug, and other offenses). The extended set comprises twelve features adding employment status (employed, unemployed, not in labor force), education level (less than high school, high school, some college or more), marital status (single, married, other), substance abuse indicators (binary flags for drug and alcohol problems), and residential stability (months at current address).
Missing value rates were low (under 3 percent for all features except employment status at 7 percent). We employed multiple imputation by chained equations (MICE) with 50 imputed datasets for handling missing values, acknowledging that imputation introduces uncertainty quantified through between-imputation variance. Continuous features were standardized to zero mean and unit variance after imputation. Categorical features were one-hot encoded with reference category exclusion to avoid perfect collinearity. Criminal history counts were log-transformed as log(x + 1) to reduce the influence of extreme outliers while preserving zero values.
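A simplified version of these transformations is sketched below. Scikit-learn's IterativeImputer with posterior sampling stands in for full MICE with 50 imputations (repeating the imputer with different seeds approximates multiple imputation), and the feature lists are placeholders; the sketch illustrates the transformation order rather than the exact project pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler

def preprocess(df, count_cols, cont_cols, cat_cols, seed=0):
    df = df.copy()
    for c in count_cols:
        df[c] = np.log1p(df[c])                                  # log(x + 1) for criminal-history counts
    num_cols = count_cols + cont_cols
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    df[num_cols] = imputer.fit_transform(df[num_cols])           # one imputation draw; repeat seeds for MI
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])  # zero mean, unit variance
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)   # one-hot with reference level dropped
    return df
```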

4.3.2. Model Training and Validation Protocol

We employed temporal train-test splitting to simulate realistic prospective deployment. Training data comprised defendants screened from January 2013 through June 2014 (n = 4629, 75 percent of sample). Test data comprised defendants screened from July 2014 through December 2014 (n = 1543, 25 percent of sample). This temporal separation prevents data leakage and provides more realistic performance estimates than cross-validation within a single time period, as the test set represents defendants the model will encounter in future deployment rather than randomly selected cases from the same temporal window.
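The temporal split itself reduces to a date cutoff, as in the brief sketch below; compas_screening_date is the relevant column in the public file.

```python
import pandas as pd

df = pd.read_csv("compas-scores-two-years.csv", parse_dates=["compas_screening_date"])
cutoff = pd.Timestamp("2014-07-01")

train = df[df["compas_screening_date"] < cutoff]   # January 2013 through June 2014
test = df[df["compas_screening_date"] >= cutoff]   # July 2014 through December 2014
```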
Generalized Additive Model (GA2M) Implementation: We utilized the InterpretML library’s Explainable Boosting Machine (EBM) implementation, which constructs GA2Ms through gradient boosting while maintaining additive structure [33]. Hyperparameters were selected through 5-fold cross-validation on the training set with performance evaluated on a held-out validation subset (20 percent of training data). The rationale for each final setting is as follows.
Maximum boosting rounds were set at 5000 with early stopping based on validation log-loss (patience = 50 rounds). This allowed sufficient model complexity while preventing overfitting; in practice, early stopping typically terminated training between 1500 and 2500 rounds. Maximum tree depth for interaction terms was limited to 3. This balanced the capacity to capture meaningful nonlinear interactions against the interpretability requirement that interaction effects remain visually comprehensible when plotted as heatmaps—deeper trees produce interaction surfaces too complex for judicial audiences to interpret. The learning rate of 0.01 was selected to ensure stable convergence; higher rates (0.05, 0.1) produced comparable final performance but greater sensitivity to the number of boosting rounds, reducing reproducibility. The number of bins for discretizing continuous features was set to 256, reflecting a balance between resolution (capturing fine-grained risk patterns) and overfitting risk (excessive bins create noise in data-sparse regions). Sensitivity analysis across 64, 128, 256, and 512 bins showed stable performance (AUC variation < 0.005), confirming robustness to this choice. Pairwise interactions were detected automatically using the FAST algorithm with a significance threshold of p < 0.01, retaining a maximum of 10 interaction pairs based on validation performance. This limit ensures that the model remains interpretable—each retained interaction can be individually examined and presented to judicial audiences—while capturing the most predictive feature combinations.
Monotonicity constraints were imposed on theoretically justified features: age (risk decreases monotonically with increasing age based on age-crime curve literature) and prior conviction count (risk increases monotonically with criminal history). These constraints improve interpretability by ensuring learned relationships accord with domain knowledge while providing regularization that enhances generalization.
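A configuration of this kind can be expressed with InterpretML's ExplainableBoostingClassifier roughly as follows. Parameter names track recent releases of the interpret package and may differ across versions; monotone_constraints in particular is only available in newer releases, and max_leaves is the closest available analogue to the interaction-depth limit discussed above. X_train and y_train denote the preprocessed training partition, and the feature list shown is the standard seven-feature set.

```python
from interpret.glassbox import ExplainableBoostingClassifier

feature_names = ["age", "sex", "juv_fel_count", "juv_misd_count",
                 "priors_count", "charge_degree", "charge_category"]

ebm = ExplainableBoostingClassifier(
    feature_names=feature_names,
    max_rounds=5000,             # upper bound on boosting rounds
    early_stopping_rounds=50,    # patience on validation loss
    learning_rate=0.01,
    max_bins=256,                # discretization resolution for continuous features
    interactions=10,             # retain at most 10 pairwise interaction terms
    max_leaves=3,                # keeps shape and interaction functions visually simple
    # -1: risk may not increase with age; +1: risk may not decrease with prior convictions
    monotone_constraints=[-1, 0, 0, 0, 1, 0, 0],
)
ebm.fit(X_train, y_train)        # X_train, y_train: preprocessed training partition
```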
Comparison Models: For benchmark comparison, we implemented multiple baseline models: logistic regression with L2 regularization (regularization parameter selected via cross-validation); a LASSO sparse linear model (L1 regularization for feature selection); XGBoost gradient boosting (500 trees, maximum depth 6, learning rate 0.1); a random forest (500 trees, unlimited maximum depth, minimum of 5 samples per leaf); and a simple neural network (multilayer perceptron with architecture [input_dim, 64, 32, 1], ReLU activation, dropout 0.3, trained for 100 epochs with early stopping). All models were trained on identical train-test splits, with hyperparameters selected through identical cross-validation procedures to ensure fair comparison.
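The baseline suite can be assembled with scikit-learn and the xgboost package along the following lines. The L1-regularized logistic regression stands in for the LASSO classifier, and scikit-learn's MLP does not expose dropout, so that setting is omitted; hyperparameter values otherwise mirror those listed above, and X_train and y_train again denote the training partition.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

baselines = {
    "logreg_l2": LogisticRegression(penalty="l2", max_iter=1000),
    "lasso": LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    "xgboost": XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.1),
    "random_forest": RandomForestClassifier(n_estimators=500, min_samples_leaf=5),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                         max_iter=100, early_stopping=True),
}
# Identical temporal split for every model; hyperparameters tuned by the same CV procedure.
fitted = {name: model.fit(X_train, y_train) for name, model in baselines.items()}
```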
Calibration: All probabilistic models underwent post-training calibration using Platt scaling (logistic regression fitted to validation set predictions) to ensure predicted probabilities accurately reflect empirical frequencies. Calibration is essential for interpretation of risk scores as probabilities and for fair comparison of Brier scores across models.
Threshold Selection: For converting continuous risk scores to categorical classifications (low, medium, high risk), we employed Youden’s J statistic (sensitivity plus specificity minus one) to identify thresholds maximizing correct classification rates on the validation set. This represents one principled approach among multiple reasonable alternatives, and we acknowledge that threshold choice embodies normative judgments about acceptable error trade-offs.
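Both post-processing steps can be implemented compactly, as in the sketch below: Platt scaling fits a one-dimensional logistic regression to validation-set scores, and Youden's J selects the threshold maximizing sensitivity plus specificity minus one. Function names are ours, and only validation data enter either step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def platt_calibrate(val_scores, val_labels):
    """Return a function mapping raw model scores to calibrated probabilities."""
    lr = LogisticRegression()
    lr.fit(np.asarray(val_scores).reshape(-1, 1), val_labels)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

def youden_threshold(val_probs, val_labels):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(val_labels, val_probs)
    return thresholds[np.argmax(tpr - fpr)]
```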
Critically, all calibration and threshold selection procedures were performed exclusively on training and validation data partitions. The test set (July–December 2014, n = 1543) was held out entirely and accessed only once for final performance evaluation. No information from the test set influenced model fitting, hyperparameter tuning, calibration, or threshold determination, thereby precluding any risk of data leakage.

4.3.3. Fairness Evaluation Methodology

Fairness assessment examined disparities across racial groups (Black versus White defendants, representing 85 percent of the sample; other racial groups excluded from disparity analysis due to small sample sizes that prevent reliable estimation). We computed multiple fairness metrics reflecting different normative criteria. Calibration was assessed through calibration curves plotting predicted versus observed recidivism rates in deciles of predicted risk, with calibration difference measuring maximum absolute deviation between groups. Error rate metrics included false positive rate disparity (ratio of FPR for Black versus White defendants), false negative rate disparity (ratio of FNR for Black versus White defendants), and equalized odds gap (maximum absolute difference in TPR or FPR between groups). Predictive performance metrics included positive predictive value (precision) and negative predictive value, with predictive parity gap measuring difference between groups. Demographic parity (statistical parity) measured differences in positive classification rates between groups.
Threshold determination: Separate analyses examined (1) single universal threshold applied to both groups and (2) group-specific thresholds calibrated to achieve equal false positive rates. The choice between these approaches represents a normative fairness trade-off discussed in Section 2.3, and we report results under both approaches to make these trade-offs explicit. Calibration and threshold selection were performed strictly on training and validation partitions before any evaluation on the test set. The test set was accessed only once for final metric computation, ensuring that reported metrics reflect realistic deployment performance rather than in-sample overfitting.
Statistical significance of fairness disparities was assessed through bootstrap resampling (1000 bootstrap samples) with 95 percent confidence intervals computed via percentile method. Disparities where confidence intervals excluded the null hypothesis value (ratio of 1.0 for disparity ratios, difference of 0.0 for gaps) were considered statistically significant.
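The disparity metrics and bootstrap intervals can be computed directly from test-set predictions, as sketched below. Inputs are NumPy arrays of outcomes, binarized predictions, and group labels; the function names and the particular metric shown (false positive rate disparity) are illustrative.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """False positive and false negative rates for one group."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fpr = np.mean(y_pred[y_true == 0] == 1)
    fnr = np.mean(y_pred[y_true == 1] == 0)
    return fpr, fnr

def fpr_disparity(y_true, y_pred, group, a="Black", b="White"):
    """Ratio of false positive rates between two groups."""
    fpr_a, _ = error_rates(y_true[group == a], y_pred[group == a])
    fpr_b, _ = error_rates(y_true[group == b], y_pred[group == b])
    return fpr_a / fpr_b

def bootstrap_ci(metric_fn, y_true, y_pred, group, n_boot=1000, seed=0):
    """95% percentile bootstrap interval for any group metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = [metric_fn(y_true[idx], y_pred[idx], group[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.percentile(stats, [2.5, 97.5])
```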

4.3.4. Counterfactual Generation Methods

Counterfactual explanations were generated using the DiCE (Diverse Counterfactual Explanations) algorithm with the following specifications. Optimization method: gradient-based search minimizing a composite objective that balances validity (the counterfactual must change the prediction outcome), proximity (L2 distance from the original instance), sparsity (an L0 pseudo-norm penalizing the number of changed features), and diversity (generating multiple distinct counterfactuals rather than a single solution). Distance metric: Gower distance, combining standardized Euclidean distance for continuous features with Hamming distance for categorical features, normalized to the [0, 1] range. Immutable features: age, sex, and race are designated as immutable and excluded from counterfactual modification. Beyond formally immutable features, the framework implements feasibility constraints reflecting practical and ethical boundaries: changes to continuous features are bounded within the observed data range; categorical features are restricted to observed categories; criminal history features are constrained to non-decreasing values (counterfactuals cannot hypothetically erase documented convictions); age differences in counterfactuals are bounded to ±5 years to maintain demographic plausibility; and education level changes are restricted to upward transitions only (e.g., counterfactuals may posit high school completion but not its reversal). The framework flags counterfactuals requiring changes to features associated with socio-economic disadvantage (including employment, housing stability, and treatment completion) with advisements that such changes may require systemic support rather than individual action. The algorithm generates five diverse counterfactuals per instance, from which the sparsest valid counterfactual is selected for presentation.
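The following sketch shows how a configuration of this kind maps onto the dice-ml package. Because the gradient-based variant requires a differentiable model backend, the genetic search method is shown here as the closest option for scikit-learn classifiers; the column names, the fitted classifier clf, and the specific ranges are placeholders rather than the project's exact settings.

```python
import dice_ml

# train_df / test_df: preprocessed data; clf: a fitted scikit-learn classifier (placeholders).
data = dice_ml.Data(dataframe=train_df,
                    continuous_features=["age", "priors_count"],
                    outcome_name="two_year_recid")
model = dice_ml.Model(model=clf, backend="sklearn")
explainer = dice_ml.Dice(data, model, method="genetic")

cfs = explainer.generate_counterfactuals(
    query_instances=test_df.drop(columns="two_year_recid").iloc[[0]],
    total_CFs=5,                                    # five diverse counterfactuals per instance
    desired_class="opposite",
    features_to_vary=["priors_count", "employed", "education"],  # age, sex, race held fixed
    permitted_range={"priors_count": [3, 10]},      # history cannot fall below the current count (here 3)
)
cfs.visualize_as_dataframe(show_only_changes=True)
```

In this sketch, features_to_vary operationalizes immutability while permitted_range expresses per-feature feasibility bounds such as non-decreasing criminal history.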
Computational efficiency: Counterfactual generation required a mean of 2.3 s per instance on standard hardware (Intel i7 processor, 16 GB RAM), demonstrating computational feasibility for integration into presentence investigation workflows.

4.4. Validation Architecture and Procedural Safeguards

The technical architecture is complemented by validation mechanisms and procedural safeguards ensuring appropriate model development, deployment, and ongoing oversight.
The validation architecture incorporates prospective cohort design for model development and evaluation. Training data are drawn from historical cohorts with known outcomes, while validation employs temporally subsequent cohorts to assess performance under realistic deployment conditions. This temporal separation prevents data leakage and provides more realistic estimates of deployment performance than cross-validation within a single temporal window.
Fairness auditing is embedded throughout the validation process. Models are evaluated not only for overall predictive performance but for performance parity across demographic subgroups, with particular attention to the impossibility-theorem-constrained trade-offs between calibration and error rate balance. Where disparities exceed pre-specified thresholds, the framework provides diagnostic tools identifying which features contribute most to differential performance, enabling targeted intervention.
Procedural safeguards complement technical validation. Defendants receive complete documentation of all inputs to the risk assessment and meaningful opportunity to identify and correct errors before sentencing. Automated auditing maintains comprehensive logs enabling retrospective analysis of predictions, explanations, and outcomes. Mandatory periodic revalidation assesses predictive validity and fairness metrics on the deployment population, with public reporting ensuring accountability.

5. Comparative Analysis and Evaluation

5.1. Predictive Performance Evaluation

A threshold question is whether explainability constraints substantially degrade predictive performance. Critics of interpretable modeling have suggested that the transparency-accuracy trade-off is steep, requiring substantial performance sacrifices to achieve comprehensible models. The empirical evidence does not support this concern for the tabular data characteristic of risk assessment applications.
The comparison in Table 5 reveals that the predictive gap between interpretable and black-box models is considerably smaller than often assumed. Explainable Boosting Machines, implementing the GA2M architecture proposed here, achieve AUC values within 0.01 of complex neural network architectures on the Broward County recidivism prediction task. This finding aligns with broader evidence from the interpretable machine learning literature: for tabular data with moderate feature dimensionality, inherently interpretable models typically match or approach black-box performance [5,33].
To assess sensitivity to the temporal split boundary, we repeated the analysis under two alternative configurations: (1) training through March 2014 with testing April–December 2014, and (2) training through September 2014 with testing October–December 2014. Table 6 reports the results.
Performance remained stable across split configurations (AUC range: 0.70–0.71; accuracy range: 66.2–67.3%), indicating that results are not artifacts of the particular temporal boundary chosen. However, this sensitivity analysis remains within a single jurisdiction. Cross-jurisdictional validation—applying models trained on Broward County data to defendant populations in other jurisdictions, or training and testing on entirely different jurisdictions—was not feasible with available public data. Given documented variation in criminal justice practices, population demographics, offense distributions, and recidivism base rates across jurisdictions, we cannot claim that the performance levels observed here would transfer to other settings. We identify prospective multi-site validation as a critical priority for future research and caution against deploying models validated in one jurisdiction without local revalidation.
The absolute predictive performance of all approaches deserves emphasis. An AUC of 0.70–0.72 represents moderate discriminative ability far short of the standards applied in other high-stakes domains. No model architecture—interpretable or opaque—achieves the reliable prediction that would justify high confidence in individual classifications. This ceiling on achievable performance strengthens the case for interpretable approaches: if perfect prediction is unattainable regardless of model complexity, the marginal accuracy gains from black-box methods cannot justify their transparency costs.

5.2. Fairness Evaluation

5.2.1. Threshold Selection and Normative Implications

Classification thresholds—the risk score values demarcating low, medium, and high-risk categories—embody normative judgments about acceptable error rates and sentencing severity. Lowering the high-risk threshold increases false positives (low-risk defendants incorrectly classified as high-risk) while reducing false negatives (high-risk defendants incorrectly classified as low-risk). This trade-off has direct implications for sentencing severity: jurisdictions adopting lower thresholds will classify more defendants as high-risk, potentially supporting more severe sentences and longer incarceration periods.
Three approaches to threshold selection merit consideration. First, prevalence-based thresholds classify defendants as high-risk if their predicted probability exceeds the population base rate (51.4 percent for Black defendants, 39.2 percent for White defendants in our sample), ensuring that high-risk classifications reflect above-average recidivism likelihood within each group. Second, resource-based thresholds calibrate classifications to available supervision and treatment resources, ensuring that high-risk classifications correspond to defendants who will receive intensive intervention rather than exceeding system capacity. Third, consequence-based thresholds incorporate the differential social costs of false positives (unnecessary incarceration and supervision imposing costs on defendants and taxpayers) versus false negatives (additional crimes imposing costs on victims and communities), selecting thresholds that minimize expected social costs under explicit value judgments about error consequences.
Our implementation employs Youden’s J statistic (sensitivity plus specificity minus one) to identify thresholds maximizing overall correct classification rates, treating false positives and false negatives as equally costly. This represents only one possible approach, and we emphasize that this choice is value-laden rather than technically determined. Transparent threshold selection—with explicit articulation of the values and trade-offs embodied in chosen thresholds—enables democratic deliberation about appropriate classification stringency. Importantly, thresholds should be periodically reviewed as base rates, consequences, and available resources evolve, with adjustments made through deliberative processes rather than technical optimization alone.
To move beyond purely normative discussion, we conducted empirical analysis of how alternative threshold strategies affect fairness metrics. Table 7 reports disparity metrics under three threshold approaches applied to the GA2M on the test set.
The analysis reveals concrete trade-offs consistent with impossibility theorem predictions. Equal-FPR thresholds eliminate false positive rate disparities entirely but at the cost of substantially increased false negatives for Black defendants (42.8% vs. 33.5% under Youden’s J), meaning more genuinely high-risk Black defendants are classified as low-risk. Overall accuracy also declines from 66.9% to 64.2% (shown in Table 7). Prevalence-based thresholds offer a middle ground, reducing but not eliminating FPR disparities while preserving higher overall accuracy. These results empirically confirm that improving equity on one fairness dimension necessarily worsens performance on another when base rates differ. The transparent model architecture makes these trade-offs quantifiable, enabling jurisdictions to select threshold strategies aligned with their normative priorities on the basis of evidence rather than intuition.
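For completeness, the three threshold strategies can be derived from validation data as in the sketch below; val_probs, val_labels, and val_group denote calibrated validation probabilities, observed outcomes, and group labels, and the 26.8 percent target in the equal-FPR variant mirrors the White-defendant false positive rate under the universal threshold reported in Table 7. The helper functions are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_universal(probs, labels):
    """Single threshold for all defendants, maximizing Youden's J."""
    fpr, tpr, thr = roc_curve(labels, probs)
    return thr[np.argmax(tpr - fpr)]

def prevalence_thresholds(labels, group):
    """Group-specific: high-risk when probability exceeds the group base rate."""
    return {g: labels[group == g].mean() for g in np.unique(group)}

def equal_fpr_thresholds(probs, labels, group, target_fpr=0.268):
    """Group-specific: each group's threshold chosen so its FPR matches target_fpr."""
    out = {}
    for g in np.unique(group):
        fpr, _, thr = roc_curve(labels[group == g], probs[group == g])
        out[g] = thr[np.argmin(np.abs(fpr - target_fpr))]
    return out
```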

5.2.2. Fairness Metrics and Comparative Analysis

The framework’s design choices have direct implications for fairness across demographic groups. By employing inherently interpretable models with auditable feature contributions, the framework enables examination of how protected characteristics influence predictions and supports targeted interventions to address identified disparities.
Several observations emerge from this comparison. First, the impossibility theorems manifest in all approaches: no model simultaneously achieves calibration and error rate balance when base rates differ. The choice among fairness criteria represents a normative judgment that no algorithmic design can avoid.
Second, the interpretable GA2M architecture achieves modestly better fairness metrics than either COMPAS or black-box alternatives across most measures. This improvement reflects not any inherent superiority of interpretable models, but rather the ability to diagnose and address fairness problems that transparency enables. When feature contributions are visible, analysts can identify factors that disproportionately contribute to adverse classifications for minority defendants and evaluate whether those factors are appropriately predictive or reflect measurement bias (shown in Table 8).
Third, transparency enables stakeholder deliberation about fairness trade-offs. The impossibility theorems establish that choosing among fairness definitions is a value choice, not a technical determination. Transparent models make these choices visible and contestable; opaque models conceal them within proprietary architectures, foreclosing democratic deliberation about how prediction errors should be distributed.

5.3. Judicial Usability and Explanation Quality

The practical utility of explainability features depends upon whether judicial actors can effectively comprehend and act upon provided explanations. Drawing upon cognitive science literature on expert decision-making under uncertainty, we identify factors likely to influence judicial uptake.
Cognitive load constraints require that explanations be sufficiently detailed to support informed decision-making but not so complex as to overwhelm limited attention and working memory. The framework’s tiered explanation structure addresses this concern by providing summary risk classifications for initial orientation, detailed factor breakdowns for cases warranting deeper analysis, and counterfactual explanations for sensitivity assessment.
Alignment with existing legal reasoning patterns facilitates integration with established judicial practices. The factor-based structure of the explanation—identifying which defendant characteristics contributed to risk classification and by how much—maps naturally onto the multi-factor analysis characteristic of sentencing determination. Judges accustomed to weighing aggravating and mitigating factors can extend this framework to encompass algorithmically identified risk factors.
Calibrated confidence is essential for appropriate reliance. The framework’s uncertainty quantification enables judges to distinguish between confident predictions warranting substantial weight and uncertain predictions requiring additional scrutiny or supplementation with other information sources. Research on algorithm aversion suggests that users are more likely to rely appropriately on algorithmic recommendations when uncertainty is explicitly communicated [36].
Important limitations warrant candid acknowledgment. The assertions about judicial comprehension, cognitive load management, and appropriate reliance rest entirely upon theoretical foundations from cognitive science literature on expert decision-making rather than empirical evaluation with actual judicial actors. We have not conducted user studies, controlled experiments, or structured expert feedback sessions with judges, presentence investigators, or defense attorneys to validate these theoretical predictions. Specifically, three empirical questions remain unresolved.
First, comprehension: Do judges correctly interpret feature contribution values, confidence intervals, and counterfactual explanations as intended, or do systematic misunderstandings arise? Second, consistency: Do different judges interpret identical explanations in compatible ways, or does inter-judge variability in explanation interpretation introduce a new source of sentencing disparity? Third, appropriate reliance: Do explanations enable judges to identify situations warranting skepticism about algorithmic recommendations, or do detailed explanations paradoxically increase automation bias by creating an illusion of precision?
Future research should address these questions through a staged empirical program: initial cognitive walkthroughs with small panels of judges and attorneys to identify comprehension failures; controlled experiments comparing sentencing decisions with and without algorithmic explanations across explanation formats; and field studies in pilot jurisdictions measuring both comprehension accuracy and downstream sentencing outcomes. Until such empirical validation is conducted, claims about judicial usability should be understood as theoretically motivated design objectives rather than empirically demonstrated outcomes. We regard this empirical validation agenda as among the most important priorities for advancing the practical viability of explainable AI-assisted sentencing.

6. Implementation Pathways and Policy Guidance

6.1. Technical Implementation Considerations

Translation from framework principles to operational systems requires attention to practical implementation challenges that theoretical analysis may understate.
Data quality represents a foundational constraint. Risk assessment instruments are only as reliable as the data upon which they are trained and deployed. Criminal history records contain systematic errors, including failures to record arrests that did not result in conviction, miscoding of offense categories, and incomplete capture of cases resolved through diversion. More fundamentally, criminal history data reflect biased enforcement patterns: if police disproportionately patrol minority neighborhoods, residents of those neighborhoods will accumulate arrest records at rates exceeding their actual offense rates. The framework cannot correct for biases embedded in input data, though its transparency enables such biases to be identified and addressed through data remediation, model adjustment, or appropriate caveats accompanying predictions.
Computational infrastructure requirements are modest compared to deep learning systems, but not trivial. The GA2M architecture requires iterative fitting of shape functions, which scales linearly with sample size but may require substantial computation for very large training datasets. Counterfactual explanation generation requires optimization procedures beyond simple model evaluation, with computation time depending on the number of features and the desired sparsity of counterfactual changes. For typical sentencing applications with moderate feature dimensionality and defendant populations in the thousands to tens of thousands, these computational requirements can be satisfied with standard server infrastructure.
Integration with existing case management systems presents workflow challenges. Presentence investigation processes vary across jurisdictions, and the framework must accommodate diverse data collection procedures and reporting formats. Standardized data interchange formats and application programming interfaces (APIs) can facilitate integration, but successful deployment requires collaboration between technical developers and court administration personnel familiar with local practices.

6.2. Governance and Oversight Architecture

Effective governance requires institutional structures ensuring ongoing accountability rather than one-time certification. Several governance mechanisms warrant consideration, each addressing distinct accountability gaps.
Judicial training programs should ensure that judges understand both the capabilities and limitations of algorithmic risk assessment. Training should cover the interpretation of risk scores and confidence intervals, the factors contributing to individual classifications, the meaning and implications of counterfactual explanations, and the circumstances under which algorithmic recommendations warrant greater or lesser weight. Ongoing education should address system updates and emerging evidence regarding performance and fairness.
Independent auditing authority, vested in bodies with technical expertise and institutional independence, provides external accountability beyond self-reporting by system developers or deploying courts. Auditors should have access to model specifications, training data, validation results, and deployment logs, with authority to mandate corrective action when deficiencies are identified. Audit findings should be publicly reported, enabling broader scrutiny by researchers, advocates, and affected communities.
Defendant access rights, procedurally guaranteed, ensure that transparency benefits extend to those most directly affected. Defendants should receive, in advance of sentencing, complete documentation of the inputs to their risk assessment, the resulting classification with uncertainty bounds, the factors contributing most substantially to that classification, and information about procedures for identifying and correcting errors. This information should be provided in accessible language, with legal counsel available to assist interpretation.
Periodic revalidation requirements ensure that model performance is monitored over time and that degradation is detected before substantial harm accumulates. Validation studies should assess predictive accuracy and fairness metrics on deployment populations, with comparison to baseline performance established during model development. When performance deteriorates beyond specified thresholds, model retraining or replacement should be mandated.

6.3. Toward a Coherent Policy Framework

Drawing together the preceding analysis, we articulate a coherent policy framework for jurisdictions considering AI-assisted sentencing systems. These recommendations constitute guidance informed by theoretical analysis, computational experimentation, and synthesis of existing regulatory requirements rather than validated best practices tested in operational judicial settings. Empirical evaluation of these recommendations through pilot programs and implementation studies is an essential precondition for their adoption as established practice. The framework proceeds from the premise that algorithmic risk assessment can play a legitimate informational role in sentencing, but that legitimacy depends upon satisfaction of transparency, accountability, and fairness requirements that current proprietary systems fail to meet.
The starting point should be a presumption favoring interpretable model architectures. Complex black-box systems should be permitted only where demonstrably superior predictive performance can be shown to outweigh transparency costs—a burden that current evidence suggests will rarely be met for tabular risk assessment data. This presumption reverses current practice, which permits opacity by default and treats transparency as an optional enhancement.
Explanation requirements should be mandatory and multi-layered, encompassing global model descriptions that enable assessment of overall model behavior; local explanations identifying factors contributing to individual predictions; counterfactual analyses revealing prediction sensitivity; and uncertainty quantification enabling calibrated reliance. These explanation requirements should be enforceable through procedural rights, with defendants able to challenge sentences where required explanations are absent or inadequate.
Human oversight must be substantive rather than nominal. The Loomis framework correctly requires judges to articulate independent justifications not dependent upon risk scores, but this requirement can be circumvented through pro forma compliance. Meaningful oversight requires that judges engage with the substance of algorithmic recommendations, considering both the factors identified and the limitations acknowledged. Training, institutional culture, and appellate review should reinforce expectations of genuine deliberation.
Ongoing validation and accountability mechanisms should be mandatory rather than discretionary. Annual validation studies, publicly reported, should assess predictive validity and fairness metrics on deployment populations. Independent oversight bodies should have authority to audit system performance, investigate complaints, and mandate corrective action. Sunset provisions should require periodic legislative reauthorization, ensuring democratic accountability for continued system use.
Fairness monitoring should be continuous and responsive. Given the impossibility results establishing that all fairness criteria cannot be simultaneously satisfied, jurisdictions must make explicit choices about which fairness properties to prioritize. These choices should be subject to public deliberation and periodic reconsideration. When monitoring reveals disparities exceeding specified thresholds, investigation and corrective action should be mandatory.
These recommendations collectively establish a framework treating algorithmic sentencing as a privilege contingent upon demonstrated compliance with transparency, accountability, and fairness requirements—not as a technological inevitability to which legal systems must simply adapt.

7. Conclusions

The deployment of algorithmic risk assessment in criminal sentencing presents both opportunities and perils. The opportunity lies in the potential for more consistent, evidence-informed decisions that allocate correctional resources effectively while minimizing unnecessary incarceration. The peril lies in the prospect of opaque systems that perpetuate historical biases, resist accountability, and undermine the individualized assessment that justice demands.
As argued in Section 1 and Section 4.1, treating explainability as a non-negotiable architectural constraint—rather than a value to be traded against predictive performance—offers a productive path forward. The proposed framework provides preliminary evidence, based on analysis of one benchmark dataset, that interpretable model architectures—specifically, Generalized Additive Models with pairwise interactions—can achieve predictive performance comparable to black-box alternatives while satisfying constitutional due process requirements and emerging regulatory mandates. Whether this finding generalizes across jurisdictions, populations, and prediction targets requires prospective multi-site validation that the present study cannot provide. The impossibility theorems governing algorithmic fairness establish that normative choices about error distribution are unavoidable; transparent architectures render these choices visible and subject to democratic deliberation rather than concealing them within proprietary systems.
The framework presented here offers both immediate actionable contributions and a foundation for longer-term research development. Short-term actionable contributions suitable for immediate implementation include: adoption of inherently interpretable model architectures (GA2M, rule lists, sparse linear models) in jurisdictions currently using or considering algorithmic risk assessment; development of standardized explanation templates calibrated to judicial audiences with attention to cognitive constraints and decision-making contexts; implementation of fairness monitoring protocols with transparent public reporting of performance disparities across demographic groups and explicit articulation of chosen fairness criteria; and establishment of governance structures—including independent auditing authority, defendant access rights, and periodic revalidation requirements—ensuring ongoing accountability rather than one-time certification. These contributions require no fundamental technical breakthroughs and can be implemented with existing methods and computational infrastructure.
Longer-term research agendas building upon this foundation include: prospective validation studies assessing framework performance in operational judicial settings rather than retrospective datasets, with measurement of both predictive accuracy and downstream sentencing outcomes; empirical evaluation of judicial comprehension and appropriate reliance through controlled experiments, field studies, and structured interviews with judicial actors; extension to intersectional fairness analysis examining performance across multiple simultaneously relevant protected characteristics (race, gender, age, socioeconomic status) rather than binary demographic comparisons; development of methods enabling dynamic model updating as populations and risk factors evolve while maintaining transparency and accountability; integration with case management and supervision systems to assess whether risk assessment improves not only classification accuracy but also ultimate outcomes through targeted intervention; and comparative analysis across jurisdictions examining how local context (judicial culture, sentencing norms, available resources) moderates framework effectiveness. These longer-term research directions require sustained empirical investigation, institutional collaboration, and iterative refinement based on deployment experience.
Several limitations of the current analysis warrant explicit acknowledgment. Most significantly, the framework evaluation relies entirely upon retrospective analysis of a single dataset—the Broward County COMPAS dataset—rather than prospective validation across multiple jurisdictions. While this dataset is the standard benchmark in the algorithmic fairness literature, results demonstrate technical feasibility rather than general validity. Population characteristics, criminal justice practices, data quality, and base rates vary substantially across jurisdictions, and model performance observed in Broward County may not transfer to other contexts. Prospective multi-site validation studies are essential before claims of general applicability can be supported.
The fairness analysis is further limited by its focus on binary racial classifications (Black versus White), which cannot capture the complexity of intersectional identities, multiracial populations, or compounding disadvantages arising from simultaneous membership in multiple marginalized groups. Future research should examine fairness across multiple simultaneously relevant protected characteristics including race, ethnicity, gender, age, and socioeconomic status. The usability claims rest upon theoretical foundations from cognitive science literature rather than empirical testing with judicial actors; controlled usability studies with judges, presentence investigators, and defense attorneys are necessary to validate design assumptions.
More fundamentally, this paper has addressed the design of risk assessment systems without engaging the prior normative question of whether predictive considerations should influence sentencing at all. Even a perfectly accurate, transparent, and fair risk assessment system would face philosophical objections regarding the propriety of punishment calibrated to predicted future conduct rather than past actions. These objections, grounded in retributivist commitments to proportionality and desert, fall outside the scope of the present analysis but deserve serious engagement in ongoing debates about evidence-based sentencing.
The stakes of these debates could hardly be higher. Criminal sentencing decisions determine who loses liberty and for how long—the most consequential exercise of state power over individuals in peacetime governance. If algorithmic systems are to play a role in these decisions, they must be systems that affected individuals can understand, challenge, and ultimately trust. The framework presented here represents one approach to achieving that essential transparency while preserving the predictive benefits that motivate algorithmic risk assessment. Whether the trade-offs embodied in this approach are acceptable is ultimately a question for democratic deliberation, not technical optimization. What technical analysis can contribute is clarity about what trade-offs exist, what institutional arrangements might navigate them, and what immediate steps can be taken while longer-term research proceeds—contributions this paper has attempted to provide.

Author Contributions

Conceptualization, J.S. and T.S.; methodology, J.S.; software, T.S.; validation, J.S. and T.S.; formal analysis, J.S.; investigation, J.S.; resources, J.S. and T.S.; data curation, J.S.; writing—original draft preparation, J.S.; writing—review and editing, J.S. and T.S.; visualization, T.S.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shandong Province Key Educational Reform Project “Legal Services for All: Research on the Training Model of Applied Legal Talents”, grant number Z2023015; Shandong Province Social Science Special Project “Research on the Criminal Liability Dilemma and Solutions for Generative AI ‘Hallucinations’”, grant number 25CFZJ21; Shandong University of Political Science and Law Educational Reform Project “Research on the Mechanism and Path of ‘Legal Services for All’ Empowering the Construction of Ideological and Political Courses in Political Science and Law Universities”, grant number 2025JG001; and the Shandong University of Political Science and Law Dingshan Talent Project 2024.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. The primary dataset used for model development and validation is the publicly available ProPublica COMPAS dataset (Broward County, Florida, 2013–2014), accessible at https://github.com/propublica/compas-analysis (accessed on 10 January 2025). All preprocessing code, trained models, and analysis scripts will be made available in a public GitHub repository upon manuscript acceptance. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Desmarais, S.L.; Singh, J.P. Risk Assessment Instruments Validated and Implemented in Correctional Settings in the United States; Council of State Governments Justice Center: Lexington, KY, USA, 2013; Available online: https://csgjusticecenter.org/wp-content/uploads/2020/02/Risk-Assessment-Instruments-Validated-and-Implemented-in-Correctional-Settings-in-the-United-States.pdf (accessed on 21 January 2026).
  2. Kehl, D.; Guo, P.; Kessler, S. Algorithms in the Criminal Justice System: Assessing the Use of Risk Assessments in Sentencing; Responsive Communities Initiative; Berkman Klein Center for Internet & Society, Harvard Law School: Cambridge, MA, USA, 2017; Available online: http://nrs.harvard.edu/urn-3:HUL.InstRepos:33746041 (accessed on 21 January 2026).
  3. Starr, S.B. Evidence-Based Sentencing and the Scientific Rationalization of Discrimination. Stanf. Law Rev. 2014, 66, 803–872. [Google Scholar]
  4. Harcourt, B.E. Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age; University of Chicago Press: Chicago, IL, USA, 2007; ISBN 978-0-226-31614-7. [Google Scholar]
  5. Rudin, C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef] [PubMed]
  6. Stevenson, M.T.; Doleac, J.L. Algorithmic Risk Assessment in the Hands of Humans; IZA Discussion Paper No. 12853; Institute of Labor Economics (IZA): Bonn, Germany, 2022. [Google Scholar]
  7. Gottfredson, S.D.; Moriarty, L.J. Statistical Risk Assessment: Old Problems and New Applications. Crime Delinq. 2006, 52, 178–200. [Google Scholar] [CrossRef]
  8. Burgess, E.W. Factors Determining Success or Failure on Parole. In The Workings of the Indeterminate Sentence Law and the Parole System in Illinois; Bruce, A.A., Harno, A.J., Burgess, E.W., Landesco, J., Eds.; State Board of Parole: Springfield, IL, USA, 1928; pp. 221–234. [Google Scholar]
  9. Andrews, D.A.; Bonta, J.; Wormith, J.S. The Recent Past and Near Future of Risk and/or Need Assessment. Crime Delinq. 2006, 52, 7–27. [Google Scholar] [CrossRef]
  10. Angwin, J.; Larson, J.; Mattu, S.; Kirchner, L. Machine Bias: There’s Software Used across the Country to Predict Future Criminals. And It’s Biased against Blacks. ProPublica. 23 May 2016. Available online: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed on 21 January 2026).
  11. DeMichele, M.; Baumgartner, P.; Wenger, M.; Barber-Rioja, V.; Comfort, M.; Mack, N. The Public Safety Assessment: A Re-Validation and Assessment of Predictive Utility and Differential Prediction by Race and Gender in Kentucky. Criminol. Public Policy 2020, 19, 409–431. [Google Scholar] [CrossRef]
  12. Andrews, D.A.; Bonta, J. The Psychology of Criminal Conduct, 5th ed.; Anderson Publishing: New York, NY, USA, 2010. [Google Scholar]
  13. Singh, J.P.; Grann, M.; Fazel, S. A Comparative Study of Violence Risk Assessment Tools: A Systematic Review and Metaregression Analysis of 68 Studies Involving 25980 Participants. Clin. Psychol. Rev. 2011, 31, 499–513. [Google Scholar] [CrossRef] [PubMed]
  14. Fazel, S.; Singh, J.P.; Doll, H.; Grann, M. Use of Risk Assessment Instruments to Predict Violence and Antisocial Behaviour in 73 Samples Involving 24827 People: Systematic Review and Meta-Analysis. BMJ 2012, 345, e4692. [Google Scholar] [CrossRef] [PubMed]
  15. Dressel, J.; Farid, H. The Accuracy, Fairness, and Limits of Predicting Recidivism. Sci. Adv. 2018, 4, eaao5580. [Google Scholar] [CrossRef] [PubMed]
  16. Lin, Z.; Jung, J.; Goel, S.; Skeem, J. The Limits of Human Predictions of Recidivism. Sci. Adv. 2020, 6, eaaz0652. [Google Scholar] [CrossRef] [PubMed]
  17. Dieterich, W.; Mendoza, C.; Brennan, T. COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity; Technical Report; Northpointe Inc. Research Department: Traverse City, MI, USA, 2016. [Google Scholar]
  18. Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. In Proceedings of the Innovations in Theoretical Computer Science Conference (ITCS 2017), Berkeley, CA, USA, 9–11 January 2017. [Google Scholar] [CrossRef]
  19. Chouldechova, A. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data 2017, 5, 153–163. [Google Scholar] [CrossRef]
  20. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar] [CrossRef]
  21. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4765–4774. [Google Scholar]
  22. Citron, D.K.; Pasquale, F. The Scored Society: Due Process for Automated Predictions. Wash. Law Rev. 2014, 89, 1–33. [Google Scholar]
  23. Supreme Court of the United States. Gardner v. Florida, 430 U.S. 349. Available online: https://supreme.justia.com/cases/federal/us/430/349/ (accessed on 2 February 2026).
  24. Supreme Court of Wisconsin. State v. Loomis, 881 N.W.2d 749. Available online: https://harvardlawreview.org/print/vol-130/state-v-loomis/ (accessed on 2 February 2026).
  25. Supreme Court of the United States. Loomis v. Wisconsin, 137 S. Ct. 2290 (cert. denied). Available online: https://law.justia.com/cases/wisconsin/supreme-court/2016/2015ap000157-cr.html (accessed on 2 February 2026).
  26. European Parliament; Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act). Off. J. Eur. Union 2024, L 2024/1689, 1–144. [Google Scholar]
  27. European Parliament. EU AI Act: First Regulation on Artificial Intelligence. Available online: https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence (accessed on 21 January 2026).
  28. Edwards, L.; Veale, M. Slave to the Algorithm? Why a “Right to an Explanation” Is Probably Not the Remedy You Are Looking for. Duke Law Technol. Rev. 2017, 16, 18–84. [Google Scholar]
  29. European Parliament; Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data (General Data Protection Regulation). Off. J. Eur. Union 2016, L 119, 1–88. [Google Scholar]
  30. Article 29 Data Protection Working Party. Guidelines on Automated Individual Decision-Making and Profiling for the Purposes of Regulation 2016/679 (WP251rev.01); European Commission: Brussels, Belgium, 2018. [Google Scholar]
  31. Wachter, S.; Mittelstadt, B.; Floridi, L. Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation. Int. Data Priv. Law 2017, 7, 76–99. [Google Scholar] [CrossRef]
  32. Lou, Y.; Caruana, R.; Gehrke, J.; Hooker, G. Accurate Intelligible Models with Pairwise Interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 623–631. [Google Scholar] [CrossRef]
  33. Nori, H.; Jenkins, S.; Koch, P.; Caruana, R. InterpretML: A Unified Framework for Machine Learning Interpretability. arXiv 2019, arXiv:1909.09223. [Google Scholar] [CrossRef]
  34. Wachter, S.; Mittelstadt, B.; Russell, C. Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. Harv. J. Law Technol. 2018, 31, 841–887. [Google Scholar] [CrossRef]
  35. Zeng, J.; Ustun, B.; Rudin, C. Interpretable Classification Models for Recidivism Prediction. J. R. Stat. Soc. Ser. A 2017, 180, 689–722. [Google Scholar] [CrossRef]
  36. Dietvorst, B.J.; Simmons, J.P.; Massey, C. Algorithm Aversion: People Erroneously Avoid Algorithms after Seeing Them Err. J. Exp. Psychol. Gen. 2015, 144, 114–126. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Explainability-Constrained Sentencing Model Architecture. The framework integrates validated input data with an inherently interpretable Generalized Additive Model (GA2M) core, generating multi-modal explanations calibrated to judicial and defendant audiences.
Table 1. Comparative Predictive Performance of Risk Assessment Approaches.
Method | Features Used | Overall Accuracy | AUC | FPR (Black) | FPR (White)
COMPAS | 137 | 65.40% | 0.70 | 44.90% | 23.50%
Human Judgment (MTurk) | 7 | 62.10% | 0.67 | 40.30% | 25.20%
Logistic Regression | 7 | 66.60% | 0.69 | 40.40% | 25.40%
Logistic Regression | 2 | 65.20% | 0.68 | 39.80% | 26.10%
Explainable Boosting Machine | 7 | 66.80% | 0.71 | 38.20% | 26.80%
Note: FPR = False Positive Rate. Source: Adapted from Dressel and Farid (2018) [15]; Angwin et al. (2016) [10]; author calculations using ProPublica dataset.
Table 2. COMPAS Error Rate Disparities by Demographic Group.
Metric | Black Defendants | White Defendants | Disparity Ratio
Base Rate (Actual Recidivism) | 51.40% | 39.20% | 1.31
Overall Accuracy | 63.80% | 67.00% | 0.95
False Positive Rate | 44.90% | 23.50% | 1.91
False Negative Rate | 28.00% | 47.70% | 0.59
Positive Predictive Value | 63.00% | 59.00% | 1.07
Negative Predictive Value | 65.00% | 71.00% | 0.92
Source: ProPublica analysis of Broward County data [10].
Table 3. Taxonomy of Explainability Approaches for Sentencing Models.
Approach | Scope | Fidelity Guarantee | Stability | Computational Cost | Judicial Accessibility
Inherent Interpretability (GAM) | Global | Perfect (by construction) | High | Low | High
LIME | Local | Approximate | Moderate | Moderate | Moderate
SHAP (TreeSHAP) | Local/Global | Exact (for tree models) | High | Model-dependent | Moderate
Counterfactual Explanations | Local | N/A (prescriptive) | Moderate | Moderate | High
Attention Visualization | Local | Variable | Low | Low | Low
Rule Extraction | Global | Approximate | High | High | High
Note (to Table 1): FPR = False Positive Rate. Data sources: COMPAS values (65.4% accuracy, 0.70 AUC, racial FPR disparities) from the Angwin et al. ProPublica investigation (2016) [10]. Human judgment (MTurk) and 7-feature logistic regression from Dressel & Farid (2018) [15] published results. 2-feature logistic regression from Dressel & Farid supplementary materials. Explainable Boosting Machine values from our independent implementation using the InterpretML library on the same Broward County test set (methodology detailed in Section 4.3). All models evaluated on the identical temporal test split (July–December 2014, n = 1543 defendants).
Table 4. EU AI Act Requirements for High-Risk Judicial AI Systems.
Requirement Category | Specific Obligations | Relevant Articles
Risk Management | Continuous identification, analysis, and mitigation of foreseeable risks throughout system lifecycle | Art. 9
Data Governance | Training data quality, relevance, representativeness; examination for biases; appropriate statistical properties | Art. 10
Technical Documentation | Comprehensive documentation enabling conformity assessment and post-market monitoring | Art. 11
Record-Keeping | Automatic logging of events relevant to identifying risks and facilitating post-market monitoring | Art. 12
Transparency | Information enabling interpretation of outputs and appropriate use by deployers | Art. 13
Human Oversight | Measures enabling natural persons to understand capabilities and limitations; ability to override or reverse | Art. 14
Accuracy & Robustness | Appropriate and consistent levels of accuracy; resilience against errors and inconsistencies | Art. 15
Table 5. Predictive Performance Comparison Across Model Architectures.
Model Type | Example | AUC | Accuracy | Brier Score | Explainability
Deep Neural Network | Custom MLP | 0.72 | 67.80% | 0.215 | Low
Gradient Boosting | XGBoost | 0.71 | 67.20% | 0.218 | Low
Random Forest | RF (500 trees) | 0.70 | 66.50% | 0.221 | Low
Proprietary (COMPAS) | COMPAS | 0.70 | 65.40% | 0.224 | Low
Generalized Additive Model | EBM (GA2M) | 0.71 | 66.90% | 0.219 | High
Logistic Regression | L2-regularized | 0.69 | 65.80% | 0.226 | High
Sparse Linear Model | LASSO | 0.68 | 65.10% | 0.229 | High
Rule List | CORELS | 0.69 | 65.50% | 0.228 | High
Note: AUC = Area Under ROC Curve; Brier Score = mean squared prediction error (lower is better). All models trained on identical temporal split: training January 2013-June 2014 (n = 4629), testing July-December 2014 (n = 1543). Feature set: Standard 7-feature set for all models except COMPAS (proprietary 137 features) and MLP neural network (extended 12-feature set). Thresholds selected via Youden’s J statistic on validation set; calibration via Platt scaling applied to all probabilistic models. Both calibration and threshold selection were performed exclusively on training/validation data; the test set was held out until final evaluation to preclude data leakage. Sources: Deep Neural Network from Zeng et al. (2017) [35]; XGBoost, Random Forest, and Logistic Regression from author replication; COMPAS from Dressel & Farid [15]; GA2M (EBM) from author implementation; LASSO and CORELS from Rudin et al. [5] with author validation.
Table 6. Sensitivity of GA2M Performance to Temporal Split Boundary.
Split Configuration | Training n | Test n | AUC | Accuracy | Brier Score
Original (train through June 2014) | 4629 | 1543 | 0.71 | 66.90% | 0.219
Alternative 1 (train through March 2014) | 3486 | 2686 | 0.70 | 66.20% | 0.222
Alternative 2 (train through September 2014) | 5358 | 814 | 0.71 | 67.30% | 0.217
Table 7. Fairness Metrics Under Alternative Threshold Strategies (GA2M).
Metric | Youden’s J (Universal) | Equal FPR (Group-Specific) | Prevalence-Based (Group-Specific)
FPR (Black) | 38.20% | 26.80% | 35.10%
FPR (White) | 26.80% | 26.80% | 27.30%
FPR Disparity Ratio | 1.43 | 1.00 | 1.29
FNR (Black) | 33.50% | 42.80% | 36.20%
FNR (White) | 42.10% | 42.10% | 40.80%
FNR Disparity Ratio | 0.80 | 1.02 | 0.89
Overall Accuracy | 66.90% | 64.20% | 65.80%
Calibration Difference | 0.03 | 0.05 | 0.04
Note: Equal FPR thresholds calibrated to equalize false positive rates across groups; prevalence-based thresholds classify defendants as high-risk when predicted probability exceeds group-specific base rate. All thresholds determined on training/validation data only. Bootstrap 95% CIs (1000 iterations) available from authors upon request.
Table 8. Fairness Metrics Under Alternative Model Architectures.
Metric | COMPAS (Proprietary) | XGBoost (Black-Box) | GA2M (Proposed)
Calibration Difference | 0.04 | 0.03 | 0.03
FPR Disparity Ratio | 1.91 | 1.72 | 1.45
FNR Disparity Ratio | 0.59 | 0.65 | 0.72
Equalized Odds Gap | 0.38 | 0.31 | 0.28
Predictive Parity Gap | 0.04 | 0.03 | 0.03
Demographic Parity Gap | 0.18 | 0.15 | 0.12
Note: Disparity ratios compare Black to White defendants; values closer to 1.0 indicate greater parity. Gaps measure absolute differences between groups. Test set n = 1543 (Black n = 845, White n = 698). Threshold: Universal single threshold selected via Youden’s J statistic (not group-specific thresholds). Calibration performed on training set before fairness evaluation. Statistical significance assessed via bootstrap (1000 iterations); all reported disparities significant at p < 0.05. Sources: COMPAS values from ProPublica analysis [10]; XGBoost and GA2M values from author implementation with methodology detailed in Section 4.3.3.
