Article

Interpretable Binary Classification Under Constraints for Financial Compliance Modeling

by Álex Paz 1,2, Broderick Crawford 3,*, Eric Monfroy 2, Eduardo Rodriguez-Tello 4, José Barrera-García 5,*, Felipe Cisternas-Caneo 3, Benjamín López Cortés 3, Yoslandy Lazo 3, Andrés Yáñez 1,2, Álvaro Peña Fritz 1 and Ricardo Soto 3

1 Escuela de Ingeniería en Construcción y Transporte, Pontificia Universidad Católica de Valparaíso, Avenida Brasil 2147, Valparaíso 2362804, Chile
2 Laboratoire d’Étude et de Recherche en Informatique d’Angers (LERIA), Université d’Angers, UFR Sciences, 2 Bd de Lavoisier, 49000 Angers, France
3 Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Avenida Brasil 2241, Valparaíso 2362807, Chile
4 Cinvestav Unidad Tamaulipas, Km. 5.5 Carretera Victoria-Soto La Marina, Victoria 87130, Tamaulipas, Mexico
5 Escuela de Negocios y Economía, Pontificia Universidad Católica de Valparaíso, Amunátegui 1838, Viña del Mar 2580129, Chile
* Authors to whom correspondence should be addressed.
Mathematics 2026, 14(3), 429; https://doi.org/10.3390/math14030429
Submission received: 31 December 2025 / Revised: 18 January 2026 / Accepted: 23 January 2026 / Published: 26 January 2026

Abstract

This study addresses an interpretable supervised binary classification problem under constrained feature availability and class imbalance. The objective is to evaluate whether reliable predictive performance can be achieved using exclusively pre-event administrative variables while preserving transparency and analytical traceability of model decisions. A comparative framework is developed using linear and ensemble-based classifiers, combined with resampling strategies and exhaustive hyperparameter optimization embedded within cross-validation. Model performance is evaluated using standard classification metrics, with particular emphasis on the Matthews correlation coefficient as a robust measure under imbalance. In addition to predictive accuracy, the analysis incorporates global, structural, and local interpretability mechanisms, including permutation feature importance, explicit decision paths derived from tree-based models, and additive local explanations. Experimental results show that optimized ensemble models achieve consistent performance gains over linear baselines while maintaining a balanced error structure across classes. Importantly, the most influential predictors exhibit stable rankings across models and explanation methods, indicating a concentrated and robust discriminative signal within the constrained feature space. The interpretability analysis demonstrates that complex classifiers can be decomposed into verifiable decision rules and locally coherent feature contributions. Overall, the findings confirm that interpretable supervised classification can be reliably conducted under administrative data constraints, providing a reproducible modeling framework that balances predictive performance, error analysis, and explainability in applied mathematical settings.

1. Introduction

1.1. Background and Motivation

Income-contingent student loan systems rely on annual borrower compliance to ensure both equity and financial sustainability. From a computational perspective, monitoring such compliance can be formulated as a prediction problem under uncertainty, where decisions must be made using limited and heterogeneous information available prior to critical administrative deadlines. In this context, supervised learning models offer a systematic framework for estimating the likelihood of borrower compliance based on historical academic and administrative records [1,2].
In Chile, the University Credit Solidarity Fund (Fondo Solidario de Crédito Universitario, FSCU) constitutes a large-scale income-contingent loan system administered at the institutional level. The program generates extensive structured data describing academic trajectories, loan characteristics, and administrative events, which can be represented as high-dimensional feature spaces suitable for predictive modeling. However, the effective use of this information is challenged by class imbalance, delayed outcomes, and the need for transparent decision rules [3,4].
At the Pontificia Universidad Católica de Valparaíso (PUCV), a significant proportion of undergraduate students are beneficiaries of the FSCU, making early identification of non-compliance patterns a recurring operational problem. While detailed administrative records are routinely collected, their potential for predictive analysis has not been fully exploited. This motivates the application of machine learning techniques that can transform institutional data into quantitative risk estimates, enabling anticipatory decision-making.
Beyond this specific case, the problem addressed in this study reflects a broader class of data-driven classification tasks in which outcomes depend on socioeconomic and behavioral variables observed prior to an event of interest. As such, it provides a relevant setting for evaluating supervised learning models under realistic constraints of imbalance, interpretability, and limited observability.

1.2. Problem Statement in the Context of FSCU

The University Credit Solidarity Fund (Fondo Solidario de Crédito Universitario, FSCU) is a state-backed student loan system applied at Chilean universities affiliated with the Council of Rectors (CRUCH) [5]. Its operational design establishes a repayment mechanism that relies on the borrower’s annual income declaration to determine whether the debt installment is adjusted to income or fixed under statutory rules.
From a modeling perspective, this mechanism induces a binary observable outcome at the borrower level. Let $y_i \in \{0, 1\}$ denote the declaration status of borrower $i$, where $y_i = 1$ represents the timely submission of the income declaration and $y_i = 0$ otherwise. The outcome is governed by the legal framework established by Law No. 19,287 [6] and its amendment in Law No. 20,572 [7], which mandate annual declarations and impose asymmetric consequences for compliant and non-compliant borrowers.
Official institutional reports document persistent levels of non-declaration and repayment difficulties among FSCU beneficiaries, particularly among borrowers with incomplete academic trajectories or greater socioeconomic vulnerability [8,9]. In addition, the structure, availability, and accounting treatment of the administrative records used for monitoring and collection are formally defined by regulatory guidelines issued by the Superintendence of Higher Education [10]. These elements establish the empirical and operational context in which declaration outcomes are observed and recorded.
Crucially, the declaration decision must be made using only information available before the declaration deadline. Let $X \in \mathbb{R}^{n \times p}$ denote the matrix of pre-declaration features describing academic trajectories, loan characteristics, and administrative records for a cohort of $n$ borrowers. The problem addressed in this study consists of estimating the conditional probability of declaration compliance under conditions of class imbalance, heterogeneous features, and delayed outcome realization, defined as
$$P(y = 1 \mid X),$$
where $y = 1$ indicates timely submission of the income declaration.
The FSCU collection process further constrains the prediction task. Borrowers who submit their income declaration are assigned a variable installment proportional to their reported income. In contrast, non-compliant borrowers are automatically assigned fixed installments with longer repayment horizons and the loss of associated benefits [11]. These asymmetric outcomes create a strong incentive structure, making early identification of non-compliance particularly relevant for institutional planning.
Accordingly, the problem can be formalized as a supervised binary classification task with interpretability requirements, where predictions are intended to support anticipatory decision-making rather than automated enforcement. This formulation enables analysis of the FSCU case within a general mathematical framework applicable to income-contingent mechanisms and compliance-related prediction problems.
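The formulation above can be made concrete with a minimal sketch: a linear classifier estimating $P(y = 1 \mid X)$ from a feature matrix. The data below are synthetic stand-ins for the pre-declaration features; the generating process is purely illustrative, not the FSCU data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the pre-declaration feature matrix X
# (n borrowers, p features); labels follow an illustrative linear signal.
rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# A linear baseline whose predict_proba output estimates P(y = 1 | X).
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
```

Any probabilistic classifier exposing calibrated or rankable scores fits this formulation; the linear model here simply serves as the transparent baseline against which ensembles are later compared.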

1.3. Research Objectives and Questions

The objective of this study is to construct and evaluate supervised classification models that estimate whether a beneficiary of the FSCU at the Pontificia Universidad Católica de Valparaíso (PUCV) will submit their first annual income declaration. The prediction is performed using exclusively pre-declaration academic, socioeconomic, and administrative features, framing the task as a binary classification problem under institutional and informational constraints.
More specifically, this study aims to
  • Identify the pre-declaration variables that contribute most to the prediction of income declaration compliance.
  • Evaluate the predictive performance of multiple supervised machine learning algorithms under class imbalance conditions.
  • Assess the interpretability of model outputs through feature importance and explanation techniques in an institutional data setting.
Based on these objectives, the following research questions are formulated:
  • RQ1: Which pre-declaration variables exhibit the strongest predictive contribution to income declaration compliance?
  • RQ2: How accurately can supervised learning models predict declaration outcomes using only information available before the declaration cycle?
  • RQ3: To what extent can interpretable classification models provide transparent and reliable predictions under institutional data constraints, beyond predictive accuracy alone?
This study does not aim to introduce new learning algorithms or novel imbalance-handling techniques. Instead, it adopts a deliberately applied and institutionally grounded perspective. The originality of the work lies in the formulation and validation of a predictive framework designed under realistic administrative constraints, where only pre-declaration information is available and severe class imbalance is inherent to the problem. By prioritizing operational feasibility, methodological coherence, and audit-oriented interpretability over algorithmic novelty, the study addresses a gap in applied machine learning research, where predictive models are often evaluated under conditions misaligned with real-world institutional deployment.

1.4. Intended Contributions

This study makes the following contributions to the applied machine learning literature:
  • A pre-event predictive problem formulation grounded in realistic administrative constraints, explicitly reflecting the information available to institutions before the target compliance behavior occurs.
  • A controlled and reproducible methodological pipeline for benchmarking established supervised learning models and imbalance-handling strategies under a unified validation and partitioning protocol.
  • An imbalance-appropriate evaluation strategy that prioritizes the Matthews Correlation Coefficient (MCC) as the primary performance metric, explicitly linking model assessment to the balanced management of Type I and Type II errors in severely imbalanced settings.
  • A triangulated interpretability design that combines global, structural, and local explanation methods, positioned as an audit mechanism to support institutional decision-making rather than as a claim of direct model transparency.
  • A transferability analysis that distinguishes context-specific elements from pipeline-level methodological insights applicable to other income-contingent financing schemes and administrative compliance prediction problems under similar pre-event constraints.
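Since the MCC is positioned as the primary metric, a small worked example may help fix its definition. The label vectors below are illustrative, not drawn from the FSCU data.

```python
from sklearn.metrics import matthews_corrcoef

# Tiny illustrative example of the MCC on an imbalanced label vector.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]

# From the confusion matrix (TP=2, TN=4, FP=1, FN=1):
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) = 7/15
mcc = matthews_corrcoef(y_true, y_pred)
```

Because all four confusion-matrix cells enter the formula, the MCC degrades whenever either error type dominates, which is why it is preferred here over accuracy under severe imbalance.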

1.5. Structure of the Paper

The remainder of this paper is organized as follows: Section 2 reviews the most relevant studies and outlines the main research gaps identified in the literature. Section 3 describes the databases, selection criteria, and methodological framework adopted for the empirical analysis. Section 4 details the experimental design and evaluation procedures applied to ensure replicability and transparency. Section 5 presents the main findings obtained from the comparative analysis. Section 6 discusses these findings in light of existing evidence and highlights the implications for future research. Finally, Section 7 summarizes the conclusions and proposes potential directions for further investigation.

2. Related Work

2.1. Abandonment and Default Risk Prediction

Student loan repayment and abandonment have become persistent concerns in higher education financing systems, particularly in contexts where repayment depends on long-term income trajectories rather than fixed installment schedules. In income-contingent systems such as those implemented in Australia and the United Kingdom, legal default is relatively uncommon; instead, the central challenge lies in anticipating long-run non-repayment and associated fiscal risks [12,13]. Conversely, systems with weaker collection mechanisms or limited income linkage tend to exhibit higher levels of arrears and borrower distress [14,15]. These contrasting designs highlight the importance of early risk identification over ex post recovery.
Comparative policy analyses consistently identify academic non-completion and socioeconomic vulnerability as primary structural drivers of repayment difficulties [8,15]. Borrowers who fail to complete their programs or who enter informal labor markets exhibit reduced repayment capacity and higher probabilities of falling into arrears. From a modeling perspective, these findings motivate the use of predictive approaches capable of integrating academic, socioeconomic, and administrative information to estimate repayment or compliance risk before adverse outcomes materialize.
Recent studies have demonstrated the effectiveness of machine learning techniques for predicting repayment-related outcomes. Thuy et al. [16] showed that machine learning and deep learning models outperform traditional statistical approaches in student credit scoring tasks. Related work in educational analytics further supports the use of institutional administrative data for early risk detection. For instance, Suleiman and Anane [17] and Yakubu and Abubakar [18] applied supervised learning models to academic and socioeconomic data to predict student performance and progression, demonstrating that profile-based representations improve predictive accuracy.
Taken together, the literature indicates that repayment distress and non-compliance behaviors can be framed as predictive problems driven by multidimensional risk factors observable before default or abandonment. This perspective supports the development of supervised classification models that estimate the likelihood of adverse outcomes using pre-event institutional data, providing the methodological foundation for the approach adopted in this study.

2.2. Profile-Based Representation in Predictive Modeling

The representation of individuals through multidimensional profiles plays a central role in predictive modeling for higher education and credit-related applications. Traditional econometric approaches typically rely on a limited set of explanatory variables, such as income, loan amount, or repayment history, to model default or non-compliance outcomes [19]. While these models offer interpretability, they often fail to capture the complex interactions that arise when academic, socioeconomic, and administrative factors jointly influence borrower behavior, a limitation that has motivated the adoption of machine learning techniques in both educational and credit risk settings [20].
From a machine learning perspective, profile-based modeling represents each individual as a feature vector in a multidimensional space, allowing heterogeneous attributes to be integrated within a unified predictive framework. Institutional datasets commonly include variables describing academic trajectories, enrollment continuity, completion status, and financial characteristics, which can be transformed into structured feature representations suitable for supervised learning. Empirical evidence suggests that such representations often contribute more to predictive performance than the specific choice of algorithm. For example, Suleiman and Anane [17] demonstrated that regression-based machine learning applied to institutional data can successfully identify at-risk students, emphasizing the importance of feature construction. Similarly, Yakubu and Abubakar [18] showed that combining socioeconomic, demographic, and academic variables improves predictive accuracy in educational contexts.
In credit management settings, the same representational logic applies. Borrower profiles that integrate academic progression, socioeconomic background, and administrative engagement can reveal latent patterns associated with future non-compliance or repayment distress. By embedding these profiles in a high-dimensional feature space, machine learning models can capture nonlinear relationships among variables that are not readily captured by linear modeling assumptions [21].
Overall, the literature supports the view that profile-based representation is a critical determinant of model effectiveness in predictive tasks involving heterogeneous institutional data. This insight motivates the adoption of supervised learning models that leverage structured feature spaces to estimate compliance-related outcomes, forming a key methodological pillar of the approach proposed in this study.
Taken together, the reviewed literature highlights two converging insights. First, repayment distress and compliance-related outcomes in student loan systems are driven by multidimensional factors that extend beyond purely financial attributes. Second, integrating academic, socioeconomic, and administrative data into profile-based representations enables more accurate and robust predictive modeling. These findings motivate the development of supervised learning approaches that treat compliance behavior as a classification problem in heterogeneous feature spaces under class-imbalance constraints. Building on this methodological foundation, the following section describes the data sources, feature construction, and modeling procedures adopted in this study.

3. Materials and Methods

3.1. Data Sources and Legal Context

Each year, the PUCV Finance Department requests that borrowers submit their income declaration by 31 May. The declaration form includes personal identification, contact information, pension affiliation, and the borrower’s monthly gross income, as well as that of the spouse, when applicable. Supporting documents are required for verification. All information is integrated into the university system and stored in a relational database.
In this study, we set a cutoff date of 24 April 2024, and restrict the analysis to obligations maturing from 2012 onward, following the 2012 legal reform that standardized the annual income-declaration process. Focusing on the post-reform period ensures a consistent operational regime and avoids structural breaks caused by legacy rules.
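The temporal restriction described above reduces to a simple filter over maturity dates. The sketch below assumes hypothetical column names (`borrower_id`, `maturity_date`), not the actual FSCU schema fields.

```python
import pandas as pd

# Illustrative loan records; column names and values are assumptions.
loans = pd.DataFrame({
    "borrower_id": [1, 2, 3],
    "maturity_date": pd.to_datetime(["2010-05-01", "2013-06-01", "2020-03-15"]),
})

# Restrict to post-reform obligations (maturing from 2012 onward)
# observed up to the extraction cutoff of 24 April 2024.
cutoff = pd.Timestamp("2024-04-24")
post_reform = loans[(loans["maturity_date"].dt.year >= 2012)
                    & (loans["maturity_date"] <= cutoff)]
```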

3.2. Database Schema

The source database comprises eight relational tables with historical records of the FSCU portfolio and enrollment information: Person, Promissory Note, Due Group, Debt, Installment, Payment Slip, Income Declaration, and Enrollments. For data management and query performance, the contents were migrated to PostgreSQL prior to dataset construction.

3.3. Cohort Definition and Target

The working dataset is constructed at the borrower level using exclusively information available prior to the first income-declaration deadline. Let $y_i \in \{0, 1\}$ denote the declaration outcome for borrower $i$, where $y_i = 1$ indicates submission of the first income declaration and $y_i = 0$ otherwise. To ensure consistency with the current operational regime, only records corresponding to obligations maturing from 2012 onward were included, following the legislative reform that standardized the annual income-declaration process.
The feature space comprises numerical, categorical, and date-derived variables describing borrower demographics, loan characteristics, and academic trajectory. Exact variable counts by type are reported in Table 1. In total, the initial dataset consists of the binary target variable and a heterogeneous collection of features derived from administrative and academic records available before the declaration event.
As discussed in Section 1, compliance with the income-declaration requirement plays a central role in the functioning of income-contingent loan systems. From a modeling perspective, understanding the factors associated with first-time declaration behavior is essential for characterizing compliance patterns under informational constraints.
Accordingly, this study focuses on predicting whether a borrower will submit the first income declaration using pre-declaration information only, including attributes related to the borrower profile, loan characteristics, and academic history. By analyzing both compliant and non-compliant cases, the objective is to identify systematic patterns that can inform classification-based risk estimation within a supervised learning framework.

3.4. Feature Construction

The feature construction process was guided by the need to balance predictive expressiveness, interpretability, and strict temporal validity. In particular, all representations were deliberately constrained to borrower-level summaries observable before the first income-declaration deadline, reflecting the information realistically available for institutional decision-making at that stage.
The predictive task requires a borrower-level representation in which each observation corresponds to the information available before the first income-declaration deadline. Accordingly, a flat dataset was constructed, where each row represents a unique borrower and each column corresponds to a pre-declaration attribute derived from academic, financial, or administrative records.
Let $\mathcal{D} = \{(X_i, y_i)\}_{i=1}^{n}$ denote the resulting dataset, where $X_i \in \mathbb{R}^p$ is the feature vector associated with borrower $i$, and $y_i \in \{0, 1\}$ indicates whether the borrower submitted the first income declaration. Feature construction was strictly constrained to information observable before the declaration deadline to prevent temporal leakage.
The original data are stored in a relational schema comprising multiple tables with one-to-many relationships, such as enrollment records and promissory notes. To obtain a fixed-dimensional representation suitable for supervised learning, borrower-level aggregation operators were applied to recurring records. In particular, count-based and sum-based aggregations were used to summarize enrollment history and loan-related information, yielding scalar features that preserve cumulative exposure while ensuring dimensional consistency across observations.
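The count- and sum-based aggregation described above can be sketched with a toy one-to-many table of promissory-note records; identifiers and amounts are illustrative, but the derived feature names (`conteo_pagare`, `monto_total_pagare`) follow the variables defined later in the paper.

```python
import pandas as pd

# Toy one-to-many promissory-note records (several rows per borrower).
notes = pd.DataFrame({
    "borrower_id": [1, 1, 2, 3, 3, 3],
    "note_amount": [100.0, 150.0, 200.0, 50.0, 75.0, 25.0],
})

# Count- and sum-based aggregations collapse recurring records into
# one fixed-dimensional row per borrower.
features = notes.groupby("borrower_id").agg(
    conteo_pagare=("note_amount", "size"),   # number of promissory notes
    monto_total_pagare=("note_amount", "sum"),  # total promissory-note value
).reset_index()
```

The same pattern applies to enrollment history, yielding scalar features that preserve cumulative exposure while keeping the design matrix rectangular.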
Two additional categorical attributes describing the undergraduate program were appended after extraction. These variables are static with respect to the prediction horizon and do not depend on post-declaration information, making them admissible for inclusion in the pre-declaration feature space.
A small subset of borrowers holds more than one loan in the source database (166 cases, representing less than 0.01% of the sample). To preserve a consistent unit of analysis and avoid duplicate borrower histories, only the first loan per borrower was retained. Specifically, for borrowers with multiple loans, the feature vector $X_i$ was constructed from the earliest loan record, ensuring that each observation corresponds to a single, well-defined prediction instance.
As a result, the final feature matrix $X \in \mathbb{R}^{n \times p}$ provides a borrower-centric, fixed-dimensional representation that integrates academic trajectory, loan characteristics, and institutional attributes available prior to the declaration event. This construction enables the application of standard supervised classification algorithms while maintaining a clear correspondence between model inputs and the underlying administrative processes.

3.5. Feature Set Overview

The modeling process begins with an initial pool of features extracted from academic and administrative records available prior to the first income-declaration deadline. This initial feature pool is subsequently refined through the data cleaning, consolidation, encoding, and transformation steps described in the following subsections, yielding a reduced and consistent feature set used for model training and evaluation.
For transparency and reproducibility, both the initial feature pool and the final feature set are reported. Table 1 summarizes the variables initially extracted from the source databases, while Table 2 provides semantic descriptions of these variables prior to any preprocessing, consolidation, or feature selection steps. Table 3 reports the variables exhibiting missing observations, and Table 4 reports the features retained after preprocessing and feature engineering. Variable names are retained in their original Spanish form, as they correspond directly to field identifiers used in the official FSCU administrative databases; preserving this nomenclature ensures traceability, consistency, and alignment with operational institutional data structures. For clarity, all variables are explicitly described and interpreted in English within the tables, allowing international readers to follow the analysis without ambiguity.
Reporting both the initial and final feature sets allows the reader to trace how methodological decisions progressively reduce dimensionality while preserving institutional meaning, thereby supporting transparency and reliability in an applied administrative context.

3.6. Data Cleaning and Preprocessing

Data cleaning and preprocessing decisions were driven by the dual objective of preserving as much administratively meaningful information as possible while ensuring numerical stability and interpretability under severe class imbalance. Rather than applying aggressive filtering or imputation, the adopted strategy prioritizes conservative transformations aligned with institutional data quality and deployment constraints.

3.6.1. Missing Values

Figure 1 summarizes the number of missing values observed in each extracted feature. This exploratory analysis enables the identification of variables affected by incomplete information and guides subsequent preprocessing decisions.
Five variables exhibit missing observations, as reported in Table 3. The variables fecha_nacimiento, edad, and edad_dias present identical missingness patterns, since the latter two are deterministically derived from the birth date. Given the high proportion and complete overlap of missing values across these three attributes, they were excluded from the feature set to avoid redundant loss of information and unstable imputations.
Formally, let $X \in \mathbb{R}^{n \times p}$ denote the original feature matrix. The filtered feature matrix $X'$ was obtained by removing the columns corresponding to the affected variables, such that
$$X' = X \setminus \{\text{fecha\_nacimiento}, \text{edad}, \text{edad\_dias}\}.$$
In contrast, the variable escuela presents a small number of missing values corresponding to degree programs without an associated school. Rather than discarding these observations, a dedicated categorical level was introduced to encode the absence of an assigned school, thereby preserving the affected records and retaining potentially informative structure in the data.
This handling strategy reflects a deliberate trade-off between information retention and model robustness, favoring the exclusion of highly incomplete and redundant variables while preserving partially missing categorical information through explicit encoding.
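The two handling strategies above (column exclusion versus explicit encoding of absence) can be sketched on a miniature frame; the values and the `SIN_ESCUELA` placeholder label are illustrative assumptions, not the actual encoding used.

```python
import pandas as pd

# Miniature example with the same missingness structure described in the text.
df = pd.DataFrame({
    "fecha_nacimiento": [None, "1990-01-01", None],
    "edad": [None, 34.0, None],
    "edad_dias": [None, 12400.0, None],
    "escuela": ["ING", None, "NEG"],
    "deud_monto": [100.0, 200.0, 50.0],
})

# Drop the three jointly missing, mutually redundant attributes.
df = df.drop(columns=["fecha_nacimiento", "edad", "edad_dias"])

# Encode the absence of an assigned school as an explicit categorical level,
# preserving the affected records instead of discarding them.
df["escuela"] = df["escuela"].fillna("SIN_ESCUELA")
```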

3.6.2. Class Consolidation and Rare Categories

Class consolidation decisions were guided by the need to reduce sparsity and unstable parameter estimation while maintaining a semantically coherent representation aligned with institutional practice.
To assess the distributional properties of the constructed feature space, an exploratory analysis was performed on both numerical and categorical variables. Figure 2, Figure 3 and Figure 4 summarize the empirical distributions observed across the dataset and provide guidance for subsequent consolidation decisions.
Numerical Feature Distribution
Figure 2 presents the distributions of selected numerical variables, including the number of enrollments (conteo_matr), the number of promissory notes (conteo_pagare), the outstanding debt amount (deud_monto), and the total value of promissory notes (monto_total_pagare).
Although some variables exhibit similar distributional shapes (e.g., Figure 2c,d), none of the numerical features display degenerate or constant behavior. Consequently, all numerical variables were retained at this stage and further examined through correlation analysis to evaluate potential redundancy, as discussed in Section 3.6.3.
Categorical Feature Distribution
Figure 3, Figure 4, Figure 5 and Figure 6 illustrate the empirical distributions of the categorical variables. These features describe marital status, loan attributes, declaration status, academic trajectory, and institutional affiliation. The visual inspection highlights dominant categories, sparsity patterns, and variables with limited variability.
Categorical variables exhibiting invariant behavior within the analyzed cohort were removed, as they provide no discriminative information for the classification task. Specifically, the features carr_t_carrera, cod_inst_ult_matr, deud_t_deuda, and e_ult_matr were excluded from the feature set.
Very low-frequency categories were also addressed to reduce sparsity and prevent unstable parameter estimation. The “foreign” category in nacionalidad, comprising six observations, was removed due to its negligible representation. Similarly, a single observation corresponding to the year 2008 in anio_ult_matr was excluded. To further control categorical cardinality, the variables cod_carr_ult_matr and escuela were consolidated at the facultad level, yielding a more compact and semantically coherent representation.
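The removal of invariant columns and negligible-frequency categories can be sketched as follows; the toy distributions are illustrative, not the real cohort, and the category labels are assumed spellings of the administrative values.

```python
import pandas as pd

# Toy categorical frame mirroring the three cases handled in the text.
df = pd.DataFrame({
    "deud_t_deuda": ["FSCU"] * 9,                        # invariant column
    "nacionalidad": ["CHILENA"] * 8 + ["EXTRANJERA"],    # rare category
    "anio_ult_matr": [2015] * 4 + [2016] * 4 + [2008],   # single 2008 record
})

# Remove columns with a single observed value (no discriminative information).
df = df.loc[:, df.nunique() > 1]

# Remove negligible-frequency categories and the isolated 2008 observation.
df = df[df["nacionalidad"] != "EXTRANJERA"]
df = df[df["anio_ult_matr"] != 2008]
```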
In addition, a small subset of borrowers holds more than one FSCU debt associated with the same institution (166 cases, representing less than 0.01% of the final sample after excluding non-PUCV loans). These records correspond to second or subsequent debts acquired by the same borrower, rather than to independent or parallel loan events. To preserve a consistent and institutionally meaningful unit of analysis, only the first FSCU debt per borrower was retained. Including secondary debts as separate observations would implicitly introduce a longitudinal dimension based on a very limited number of cases, increasing model complexity while risking bias toward atypical borrower trajectories.
This aggregation strategy was further validated in consultation with the FSCU management unit at PUCV, which confirmed that the first debt constitutes the primary administrative reference for enforceability and early-stage monitoring processes. Accordingly, the resulting dataset adopts a borrower-centric representation aligned with institutional practice while avoiding unnecessary dimensional expansion or instability in subsequent modeling stages.
After consolidating categorical variables and reducing sparsity, the resulting feature space was examined for redundancy among numerical attributes, as detailed in the following subsection.

3.6.3. Correlation Screening

Correlation screening was introduced as a pragmatic dimensionality-reduction step to mitigate multicollinearity effects that could distort both model estimation and downstream interpretability analyses.
To identify potential redundancy among numerical variables, pairwise linear dependence was assessed using Pearson's correlation coefficient. Let $x_j$ and $x_k$ denote two numerical features. Their Pearson correlation is defined as
$$ r_{jk} = \frac{\operatorname{cov}(x_j, x_k)}{\sigma_{x_j}\, \sigma_{x_k}}, $$
where $\operatorname{cov}(\cdot, \cdot)$ denotes the covariance and $\sigma_x$ the standard deviation of variable $x$.
Figure 7 presents the empirical correlation matrix computed over the numerical feature subset. A threshold of $|r_{jk}| \geq 0.70$ was adopted as a pragmatic criterion to flag pairs of variables exhibiting strong linear association and, therefore, potential collinearity.
Three variable pairs exceeded the selected threshold: (deud_monto, monto_total_pagare), (monto_total_pagare, conteo_pagare), and (deud_monto, conteo_pagare). To mitigate multicollinearity effects, only one representative variable from this correlated group was retained. Specifically, deud_monto was preserved due to its direct interpretability and explicit association with the loan magnitude, while monto_total_pagare and conteo_pagare were removed from the feature set.
Formally, let $\mathcal{F}_{\text{num}}$ denote the set of numerical features prior to screening and $\mathcal{F}'_{\text{num}}$ the reduced set after correlation filtering. The selection can be expressed as
$$ \mathcal{F}'_{\text{num}} = \mathcal{F}_{\text{num}} \setminus \{\texttt{monto\_total\_pagare},\ \texttt{conteo\_pagare}\}. $$
This selection balances dimensional parsimony and interpretability, ensuring that strongly collinear monetary variables do not distort model estimation, numerical stability, or feature importance analyses in subsequent classification stages.
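As a minimal sketch of this screening step (the column names below are synthetic placeholders, not the actual FSCU fields), the flagging rule can be implemented directly over a Pandas correlation matrix:

```python
import numpy as np
import pandas as pd

def flag_collinear_pairs(df, threshold=0.70):
    """Return (col_i, col_j, |r|) for feature pairs meeting the threshold."""
    corr = df.corr(method="pearson").abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j],
                              round(float(corr.iloc[i, j]), 3)))
    return pairs

# Toy data: 'a' and 'b' are strongly collinear, 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.1, size=200),
                   "c": rng.normal(size=200)})
print(flag_collinear_pairs(df))  # only the ('a', 'b', ...) pair is flagged
```

From the flagged pairs, retaining a single representative per correlated group (as done with deud_monto) is then a domain-driven choice rather than an automated one.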

3.6.4. Encoding and Scaling

To ensure compatibility with supervised learning algorithms and to avoid introducing artificial ordinal relationships, categorical variables were transformed using one-hot encoding. Let $x^{(c)} \in \mathcal{C}$ denote a categorical feature taking values in a finite set of categories $\mathcal{C}$. The encoding maps $x^{(c)}$ to a binary vector in $\{0, 1\}^{|\mathcal{C}|}$, where each component indicates the presence of a specific category. This transformation preserves category membership while enabling linear and non-linear classifiers to operate on a numerical feature space.
This choice ensures that categorical distinctions are preserved without imposing artificial ordinality, while numerical scaling supports stable optimization across heterogeneous model families.
Numerical variables were standardized to ensure comparable scales and stable numerical behavior during model training. Given a numerical feature x, standardization was performed using the z-score transformation
$$ z = \frac{x - \mu}{\sigma}, $$
where $\mu$ and $\sigma$ denote the sample mean and standard deviation, respectively. This transformation yields features with zero mean and unit variance, reducing scale dominance effects in distance-based and gradient-based learning algorithms.
The combined preprocessing pipeline can be viewed as a transformation
$$ \Phi : \mathcal{X} \to \tilde{\mathcal{X}}, $$
where $\mathcal{X}$ denotes the original feature matrix and $\tilde{\mathcal{X}}$ the encoded and standardized representation used for model estimation.
This preprocessing strategy preserves interpretability while improving numerical stability and reducing noise during training. In particular, by retaining a single representative monetary feature following correlation screening, the transformation avoids redundancy among variables expressing similar financial magnitude and maintains a clear conceptual link between financial exposure and the probability of non-compliance.
After encoding categorical variables and scaling numerical ones, temporal features were processed separately, as described in the following subsection.

3.6.5. Date Handling

Temporal variables were handled with particular caution to retain interpretability while avoiding unnecessary dimensional expansion in a setting with limited temporal granularity.
After preprocessing categorical and numerical variables, temporal information was handled separately. Only one date-related attribute remains in the feature set, namely deud_fecha_exigibilidad. Its empirical distribution, as shown in Figure 8, exhibits two pronounced density valleys associated with specific enforceability periods, while the remaining dates present relatively uniform frequencies. The valleys correspond to 1 January 2013 (six observations) and 2 January 2022 (one observation); for context, 3 January 2013 accounts for 1109 observations and 1 January 2022 for 1006.
To obtain a numerical representation suitable for supervised learning, the date variable was decomposed into its constituent components: day, month, and year. Let d i denote the enforceability date associated with borrower i. The transformation can be expressed as
$$ d_i \mapsto (\mathrm{day}_i,\ \mathrm{month}_i,\ \mathrm{year}_i). $$
The resulting marginal distributions of these components are shown in Figure 9 as histograms with kernel density estimate (KDE) overlays, providing a smooth view of each distribution. As all observations correspond to the same calendar month (May), as evidenced in Figure 9b, the month component exhibits zero variance across the dataset and therefore provides no discriminative information for the classification task. Consequently, it was excluded from the feature set.
The retained temporal components thus define a reduced representation $(\mathrm{day}_i, \mathrm{year}_i)$, operationalized in the final dataset as the variables dia_exigibilidad and anio_exigibilidad. This decomposition keeps the numerical encoding of dates informative yet parsimonious, supporting consistent scaling and interpretation within the machine learning pipeline while allowing temporal information to contribute to the estimation of classification functions without unnecessary dimensional complexity.
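The decomposition and zero-variance pruning can be sketched in Pandas; the dates below are illustrative, not drawn from the FSCU records:

```python
import pandas as pd

# Illustrative enforceability dates, all in the same calendar month.
dates = pd.Series(pd.to_datetime(
    ["2013-05-03", "2015-05-10", "2022-05-01"]),
    name="deud_fecha_exigibilidad")

components = pd.DataFrame({
    "dia_exigibilidad": dates.dt.day,
    "mes_exigibilidad": dates.dt.month,
    "anio_exigibilidad": dates.dt.year,
})

# A zero-variance component (here the month) carries no discriminative
# information and is dropped from the feature set.
zero_var = [c for c in components.columns if components[c].nunique() == 1]
reduced = components.drop(columns=zero_var)
print(zero_var, list(reduced.columns))
```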
The resulting feature matrix and target variable, presented in Table 4, were subsequently used to train and evaluate supervised classification models under a consistent validation protocol. The next section describes the computational setup, training and testing strategy, and evaluation metrics.

4. Experimental Setup

This section describes the complete experimental configuration used to evaluate the proposed supervised binary classification framework. All stages of data partitioning, preprocessing, resampling, model training, hyperparameter optimization, and evaluation were designed to ensure methodological rigor and to prevent information leakage, supporting reliable and auditable empirical assessment.

4.1. Computational Environment

All experiments were executed on a dedicated server equipped with an Intel Core i9-10900K CPU and 64 GB of DDR4 RAM. Data manipulation and numerical operations were conducted using Pandas (v2.1.4) and NumPy (v1.26.3). Machine learning models were implemented using scikit-learn (v1.7.1), while gradient boosting models were trained using LightGBM (v4.6.0). Class imbalance techniques were applied via imbalanced-learn (v0.14.0). Feature attribution analyses were supported by the SHAP library (v0.48.0).

4.2. Data Partitioning and Validation Protocol

The dataset was initially divided into a training set (70%) and an independent test set (30%) as a widely adopted practice in the machine learning literature (e.g., [22,23]), and because it provides a sufficiently large hold-out set to obtain stable and reliable estimates of performance metrics under class imbalance. From an operational perspective, this split enables a clear separation between model development and final evaluation, with the test set serving as a proxy for unseen future cohorts, while preserving enough training data to support robust model fitting and cross-validated hyperparameter tuning.
Within the training set, all model selection and hyperparameter tuning procedures were conducted using stratified K-fold cross-validation. This strategy ensures that class proportions remain consistent across folds and provides an unbiased estimate of generalization performance. At each fold, preprocessing, resampling, and model fitting were performed exclusively on the corresponding training partition, thereby preventing any form of data leakage.
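A compact sketch of this partitioning and validation protocol with scikit-learn, using a synthetic imbalanced target in place of the institutional data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic imbalanced target (about 20% minority) standing in for the
# declaration outcome.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

# 70/30 hold-out split, stratified so both partitions keep the ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Stratified K-fold on the training set only, used for model selection;
# each validation fold preserves the original class proportions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_ratios = [y_tr[val].mean() for _, val in skf.split(X_tr, y_tr)]
print(round(float(y.mean()), 3), [round(float(r), 3) for r in fold_ratios])
```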

4.3. Pipeline Structure

Each experiment followed a unified pipeline architecture composed of the following sequential stages:
  • Training–validation split according to the cross-validation fold.
  • Feature scaling when required by the learning algorithm.
  • Application of class imbalance handling techniques.
  • Model training using a specific hyperparameter configuration.
  • Validation performance estimation using predefined evaluation metrics.
All transformations were fitted exclusively on training data within each fold and subsequently applied to the corresponding validation subset. This pipeline was applied uniformly across all experiments.

4.4. Predictive Models

Seven supervised learning algorithms were evaluated: K Nearest Neighbors, Naive Bayes, Logistic Regression, Linear Support Vector Classifier, Decision Tree, Random Forest, and Light Gradient Boosting Machine. Non-linear kernel variants of Support Vector Machines were excluded after preliminary analysis due to consistently inferior performance on the studied feature space.

4.5. Class Imbalance Handling

The target variable exhibits a pronounced class imbalance, with the non-declaration class representing the minority group. To address this issue, three resampling strategies were evaluated within the training folds, following standard approaches for learning from imbalanced data [3]. The resampling procedures were implemented using the imbalanced-learn library [24]:
  • Synthetic Minority Over-Sampling Technique (SMOTE),
  • Adaptive Synthetic Sampling (ADASYN),
  • Random Under-Sampling of the majority class.
The choice of resampling method and its associated parameters were treated as hyperparameters and jointly optimized with the classifier configuration. Resampling was applied exclusively to the training portion of each cross-validation fold.

4.6. Hyperparameter Optimization

Model and sampling hyperparameters were optimized using an exhaustive grid search strategy embedded within the cross-validation procedure applied to the training set. Let $\Theta$ denote the discrete search space defined by the Cartesian product of all candidate hyperparameter values for a given model–sampling configuration. For each $\theta \in \Theta$, model performance was estimated using stratified K-fold cross-validation.
Formally, let $\mathrm{MCC}_k(\theta)$ denote the Matthews Correlation Coefficient obtained on the validation subset of the $k$-th fold when training the model with configuration $\theta$. The optimal hyperparameter configuration $\hat{\theta}$ was selected by maximizing the mean validation performance across folds, defined as
$$ \hat{\theta} = \arg\max_{\theta \in \Theta} \frac{1}{K} \sum_{k=1}^{K} \mathrm{MCC}_k(\theta). $$
This optimization process was applied consistently across all classifiers and resampling strategies. The complete hyperparameter grids explored for predictive models and sampling methods are reported in Table 5 and Table 6, respectively.
Although exhaustive grid search entails a higher computational cost compared to heuristic or randomized alternatives, it ensures a systematic exploration of the predefined parameter space and avoids biases associated with ad hoc hyperparameter selection. This design choice supports a fair and methodologically controlled comparison across models and configurations.
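This selection rule maps directly onto scikit-learn's GridSearchCV with MCC as the scoring function; the grid and data below are illustrative, not the grids of Table 5 and Table 6:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced problem; the parameter grid is a toy example.
X, y = make_classification(n_samples=800, n_features=10,
                           weights=[0.8, 0.2], random_state=1)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=2000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring=make_scorer(matthews_corrcoef),  # mean MCC across folds
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
grid.fit(X, y)
# best_params_ realizes the argmax over the grid; best_score_ is the
# mean validation MCC of the selected configuration.
print(grid.best_params_, round(grid.best_score_, 3))
```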

4.7. Evaluation Metrics

Model performance was assessed using five complementary metrics derived from the confusion matrix: Accuracy, Precision, Recall, F1-score, and Matthews Correlation Coefficient (MCC). These metrics capture distinct aspects of classification behavior and are particularly appropriate for binary classification problems under class imbalance.
Let $TP$, $FP$, $TN$, and $FN$ denote the number of true positives, false positives, true negatives, and false negatives, respectively. Accuracy measures the proportion of correctly classified instances over the total number of observations:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. $$
Although widely used, Accuracy may provide misleading assessments when class distributions are highly unbalanced.
Precision quantifies the proportion of positive predictions that are correctly classified, and is defined as:
$$ \text{Precision} = \frac{TP}{TP + FP}. $$
Recall measures the proportion of actual positive instances that are correctly identified:
$$ \text{Recall} = \frac{TP}{TP + FN}. $$
Precision and Recall characterize complementary aspects of classification error, particularly in scenarios where the costs of false positives and false negatives differ.
The F1-score corresponds to the harmonic mean of Precision and Recall, providing a balanced summary of both measures:
$$ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. $$
Finally, the Matthews Correlation Coefficient (MCC) provides a comprehensive evaluation by incorporating all four elements of the confusion matrix:
$$ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. $$
MCC ranges from $-1$ to $1$, where values close to 1 indicate perfect classification, 0 corresponds to random prediction, and negative values indicate systematic disagreement between predictions and true labels. This metric is particularly robust in imbalanced classification settings, as it accounts for all four components of the confusion matrix and provides a balanced evaluation even when class distributions are skewed [25].
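The five definitions can be verified numerically; the confusion-matrix counts below are invented to illustrate how a high Accuracy can coexist with a modest MCC under imbalance:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute the five confusion-matrix metrics defined above."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, precision, recall, f1, mcc

# Invented counts for a 1000-instance set dominated by positives:
# Accuracy looks strong while MCC reveals weak minority-class behavior.
acc, prec, rec, f1, mcc = classification_metrics(tp=850, fp=60, tn=40, fn=50)
print(f"Accuracy={acc:.3f}  F1={f1:.3f}  MCC={mcc:.3f}")
# Accuracy=0.890  F1=0.939  MCC=0.361
```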
All reported results correspond to performance on the held-out test set using the hyperparameter configuration selected during cross-validation.

4.8. Feature Importance Analysis

To analyze the contribution of individual predictors, both model-specific and model-agnostic feature importance techniques were employed. For classifiers providing intrinsic interpretability, native importance measures were extracted, including impurity-based importance for tree-based models and coefficient magnitudes for linear models.
Additionally, permutation feature importance was computed as a model-agnostic approach [26]. This method quantifies the decrease in predictive performance induced by randomly permuting a single feature, thereby breaking its association with the target variable. Formally, the importance of feature $f_j$ is defined as
$$ i_j = s - \frac{1}{R} \sum_{r=1}^{R} s_{r,j}, $$
where $s$ denotes the original model score and $s_{r,j}$ the score obtained after the $r$-th permutation of feature $f_j$.
This analysis was conducted on validation data and enables consistent comparison of feature relevance across heterogeneous model families, subject to known limitations in the presence of highly correlated predictors.
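A sketch of this procedure with scikit-learn's permutation_importance on synthetic data, using the same number of repetitions (31) that Section 5.3.1 reports for the averaged PFI:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only 2 of 6 features carry signal.
X, y = make_classification(n_samples=600, n_features=6, n_informative=2,
                           n_redundant=0, random_state=3)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=3)

model = RandomForestClassifier(random_state=3).fit(X_tr, y_tr)
# 31 shuffles per feature, averaged; importances are score drops on
# held-out validation data, not on the training set.
result = permutation_importance(model, X_va, y_va, n_repeats=31,
                                random_state=3)
ranking = np.argsort(result.importances_mean)[::-1]
print("top features:", ranking[:2],
      result.importances_mean[ranking[:2]].round(3))
```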

4.9. Reproducibility

All experiments were conducted using fixed random seeds for data partitioning, resampling, and model initialization. Software versions and experimental configurations were explicitly controlled to ensure that results could be consistently replicated under identical conditions. This design supports transparent verification of the reported findings and facilitates methodological scrutiny in applied institutional settings.

5. Results

This section presents the results obtained from the experiments described in the previous sections. Section 5.1 summarizes the evaluation metrics achieved by each algorithm introduced in Section 4.4, including both the baseline configurations and those incorporating data-balancing techniques discussed in Section 4.5. In the best-performing configurations reported in this section, the synthetic oversampling methods SMOTE and ADASYN used a sample ratio of 0.7 , while Random Undersampling reduced the majority class to 5500 instances. Each configuration was also tested with and without hyperparameter optimization, as described in Section 4.6, resulting in eight experimental combinations per algorithm.
Section 5.2 complements this analysis by presenting the confusion matrices of the best-performing experiment for each model, providing a detailed view of error distribution and class-level performance.
Finally, Section 5.3 explores the factors that drive the models’ predictions, combining global and local interpretability analyses. It integrates permutation importance and model-wise feature analyses with visual inspections of Decision Tree structures at multiple depths (Section 5.3.3) and SHAP value visualizations (Section 5.3.4). Together, these results provide both aggregate and instance-level explanations, offering a comprehensive understanding of how model decisions align with observable borrower behavior.

5.1. Model Performance

Table 7, Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13 summarize the test-set performance of all algorithms under the eight experimental configurations described earlier. As the primary selection criterion, MCC differentiates the most competitive configurations under class imbalance, while Accuracy, Precision, Recall, and the F1-score provide complementary views of classification behavior. Unoptimized Decision Trees and Naive Bayes pipelines show weaker performance, whereas their optimized variants improve substantially.
Precision, recall, and the F1-score followed a consistent trend across models, generally remaining above 0.80 , while MCC exhibited greater sensitivity to model and sampling choices. Slight deviations were observed in a few configurations, such as the KNN with ADASYN, where precision reached 0.785 , and the unoptimized Naive Bayes, where recall dropped to 0.449 and F1-scores remained below 0.70 . Once optimized, however, all Naive Bayes variants improved notably, reaching F1-scores of approximately 0.80 . This pattern indicates that even simple models benefit from parameter adjustment when trained on administrative data with moderate class imbalance.
The Matthews Correlation Coefficient (MCC) displayed higher variability across experiments, as expected under class imbalance, ranging from approximately 0.28 to 0.42 . Models such as Naive Bayes and Decision Tree showed the greatest sensitivity to hyperparameter tuning, while the Random Forest and LightGBM achieved consistent improvements after optimization. In particular, LightGBM achieved the highest MCC values across all experiments (up to 0.419 ), suggesting that boosting methods capture nonlinear interactions among financial and academic features more effectively than other algorithms.
Overall, the results indicate that both linear and ensemble classifiers achieve reliable generalization on the held-out test set using exclusively pre-declaration features. Linear models (Logistic Regression and Linear SVM) yield stable performance with transparent decision functions, while ensemble methods (Random Forest and LightGBM) provide a modest gain in predictive power, as reflected by their higher MCC values. From an error-analysis standpoint, the joint inspection of Precision, Recall, and MCC shows that competitive configurations maintain a favorable trade-off between false positives and false negatives under class imbalance, without collapsing into degenerate majority-class predictions.
From an operational perspective, differences in Matthews Correlation Coefficient (MCC) translate into meaningful trade-offs between Type I and Type II errors, which are directly relevant for institutional decision-making. In the present context, Type I errors (false positives) correspond to borrowers incorrectly classified as compliant, potentially delaying preventive outreach, whereas Type II errors (false negatives) correspond to borrowers incorrectly classified as non-compliant, potentially triggering unnecessary monitoring actions. Because MCC jointly accounts for all cells of the confusion matrix, improvements in MCC reflect a more balanced reduction of both error types, rather than gains driven by majority-class dominance or asymmetric error minimization.
Consequently, configurations achieving higher MCC values—such as optimized Random Forest and LightGBM models—offer more robust discrimination capacity under uncertainty, supporting earlier and more proportionate administrative responses. Importantly, these gains should not be interpreted as deterministic decision thresholds, but as improvements in risk ranking quality that enhance the efficiency of targeted communication and follow-up strategies while preserving institutional discretion.
These findings address RQ2 by showing that supervised models can predict declaration outcomes with consistent performance using only pre-event information. They also support RQ3 by demonstrating that interpretability can be preserved under constrained administrative feature spaces: linear decision functions and tree-based structures provide explicit, verifiable decision criteria, while the cross-model stability of the top-ranked predictors motivates the interpretability analyses developed in the subsequent subsections.
Regarding computational cost, the full experimental training pipeline was executed on a standard commercial off-the-shelf workstation (as described in Section 4.1) and required approximately three days to complete, including hyperparameter optimization and cross-validation across all evaluated configurations. Once trained, inference is computationally lightweight: the average prediction time is approximately 0.002 seconds per instance on the held-out test set.
Given that the institutional dataset comprises on the order of 10 3 records per year, batch inference over new cohorts can be performed in negligible time on conventional hardware, without imposing any operational burden. From an institutional deployment perspective, this clear separation between moderate offline training cost and negligible online inference cost makes the proposed framework fully feasible for routine use in administrative settings.

5.2. Confusion Matrices

Figure 10 and Figure 11 display the confusion matrices corresponding to the best-performing experiment for each model. These visualizations provide a more granular view of how each classifier distinguishes between borrowers who submitted their first income declaration and those who did not.
Overall, all models exhibit a strong ability to differentiate between the two classes, though the nature of the misclassifications varies. Some models show a tendency toward Type I errors (false positives—predicting a borrower will declare when they will not), while others lean toward Type II errors (false negatives—predicting a borrower will not declare when they actually do).
The confusion matrices for Naive Bayes, Logistic Regression, Linear SVM, and Decision Tree reveal a predominance of Type II errors, consistent with their lower recall values reported in Section 5.1. These models tend to miss a portion of actual declarants, prioritizing conservative classifications that favor the majority class.
In contrast, KNN, Random Forest, and LightGBM display a stronger inclination toward Type I errors, predicting more declarants than those who actually filed. Although this behavior slightly reduces precision, it prevents severe drops in recall and yields higher overall F1-scores. In practical terms, this trade-off is favorable for early-warning systems, as it minimizes the risk of failing to identify potential defaulting borrowers.
From an error-analysis perspective, the observed asymmetry between Type I and Type II errors has direct implications for model selection under uncertainty. Configurations exhibiting a mild bias toward Type I errors prioritize higher recall at the cost of a moderate increase in false positives, whereas models dominated by Type II errors achieve higher precision but risk systematically missing true positive cases. This trade-off is consistent with the metric profiles reported in Section 5.1, particularly the joint behavior of Recall, F1-score, and MCC.
Under a constrained feature setting and class imbalance, ensemble models such as Random Forest and LightGBM exhibit a more balanced error structure, avoiding extreme concentration on either error type. Their confusion matrices show that gains in recall are not achieved at the expense of severe precision degradation, which explains their consistently higher MCC values. From a computational standpoint, this balance indicates a more robust discrimination capacity across both classes, rather than reliance on majority-class dominance.

5.3. Model Interpretability

This section analyzes which variables most strongly drive the predictive behavior of the models and how these relationships can be interpreted to provide transparent and verifiable explanations of model predictions. Beyond supporting transparency, this interpretability layer also plays a key role in identifying and monitoring potential socioeconomic biases present in the underlying administrative data. By making feature contributions, split thresholds, and decision rules explicit, the proposed approach allows institutional analysts to detect patterns that may disproportionately affect specific groups, enabling informed oversight and periodic review. Importantly, interpretability is not presented as a bias-mitigation mechanism per se, but as a diagnostic tool to support responsible use, human judgment, and the design of complementary governance or corrective strategies when needed.
Section 5.3.1 reports the average permutation feature importance (PFI) across all trained models, providing a global view of variable relevance. Section 5.3.2 presents the top fifteen model-wise importances for the best-performing experiment of each interpretable model, highlighting differences between linear and tree-based algorithms.
To complement these aggregate analyses, Section 5.3.3 illustrates decision paths extracted from the optimized Decision Tree (OPT RUS DecisionTree) at multiple depths, showing how model structure can be translated into human-readable rules. Finally, Section 5.3.4 introduces SHAP value visualizations, which quantify the individual contribution of each feature to specific predictions, enhancing transparency and case-level explainability.

5.3.1. Permutation Feature Importance Results

Figure 12 shows the averaged PFI computed for every model. To reduce the effect of random shuffling, the procedure was repeated thirty-one times per model and the results were averaged.
Two features stand out clearly: deud_monto (total loan amount) and conteo_matr (total number of enrollments). They are followed by estado_civil (marital status) and anio_exigibilidad (year of enforceability). The consistent prominence of these four variables across models indicates that financial exposure, academic trajectory, and basic demographics jointly explain most of the predictive signal.
The remaining variables contribute progressively less. Most faculty dummies have limited impact, with the notable exception of the indicator corresponding to the FACULTY OF LAW (see Table 14), which ranks among the top features and suggests program-specific differences in declaration behavior. This result indicates that representing academic affiliation at the faculty level provides sufficient and stable information to capture program-level trends, allowing newly introduced academic programs to be accommodated through their faculty assignment without altering the model structure.
At the lower end of the chart, some features exhibit slightly negative average PFI values. Given their very small magnitude and the known sensitivity of permutation to sampling noise and collinearity, these values do not by themselves justify feature removal.
While a formal feature ablation study was not conducted, the permutation feature importance analysis provides an indirect indication of model sensitivity to reduced feature availability. Across all evaluated models, predictive performance is largely driven by a small subset of highly influential features, whereas the permutation of remaining variables results in negligible changes in performance. This suggests that the learned decision structure is not critically dependent on a large number of marginal features. However, it should be noted that permutation importance reflects sensitivity to information degradation rather than actual feature removal; a systematic retraining-based ablation analysis is therefore left as future work.

5.3.2. Model-Wise Feature Importance

Figure 13 shows the feature importances for the best experiments of the linear models, while Figure 14 reports the importances for the best tree-based models.
For the linear models, the coefficient-based importances in Figure 13 show that estado_civil (marital status) dominates the decision boundaries in both Logistic Regression and Linear SVMs, reflecting its strong marginal effect under the standardized feature space. In the Logistic Regression model, academic and institutional variables such as facultad_7 (Faculty of Law), the STEM indicator, and several anio_ult_matr dummies are also influential, suggesting that the academic program and enrollment history contribute to the likelihood of timely declaration. In contrast, the Linear SVM assigns higher relative weights to recent enrollment years (anio_ult_matr_2011, 2015, and 2020) and to the total debt amount (deud_monto), capturing the impact of both temporal and financial dimensions. These differences are expected, since permutation importance evaluates overall predictive dependence, whereas linear coefficients reflect local marginal effects conditioned on feature scaling.
For the tree-based models (Figure 14), the feature importance rankings are broadly consistent across the Decision Tree, Random Forest, and LightGBM. In all three cases, the same dominant predictors identified by permutation importance define the core predictive structure.
The Decision Tree model assigns the greatest weight to conteo_matr, followed closely by estado_civil and deud_monto, indicating that a single borrower’s academic trajectory and financial exposure are key splitting criteria. Random Forest and LightGBM reinforce this pattern but invert the top two variables—deud_monto slightly surpasses conteo_matr—highlighting that ensemble averaging emphasizes financial magnitude over enrollment frequency. The consistent presence of anio_exigibilidad among the top features across all three models underscores the importance of the repayment timeline in distinguishing between declaring and non-declaring borrowers.
Lower-ranked variables, such as facultad indicators and STEM affiliation, contribute marginally to model performance, offering limited incremental information once the main financial and academic variables are included. This stability of rankings across independent tree-based architectures suggests that the predictive signal is dominated by a small, interpretable subset of features directly linked to borrower behavior and loan structure.
From a predictive and interpretability standpoint, these results align with the performance analysis. Variables related to debt magnitude and academic trajectory consistently carry the strongest explanatory weight across models, indicating that a small subset of administrative features concentrates most of the discriminative signal. This concentration supports stable interpretation under uncertainty, as the same variables govern both predictive accuracy and explanatory structure.

5.3.3. Decision Tree Snapshots at Different Depths

To illustrate how model structure supports decision-making, Figure 15, Figure 16 and Figure 17 display the same Decision Tree trained under one of the best-performing configurations, namely the Optimized Random under-sampling Decision Tree (OPT RUS DecisionTree) described in Section 5.1, and visualized at three different depths ( d = 4 , d = 5 , and d = 11 ). These visualizations are intended as illustrative artifacts rather than as objects of exhaustive node-by-node inspection. The shallow representations ( d = 4 and d = 5 ) highlight a small set of high-yield splits that can be readily examined, whereas the deeper tree ( d = 11 ) introduces finer partitions that capture niche interactions at the cost of interpretability, exemplifying how structural complexity rapidly limits direct human inspection in administrative prediction settings.
From an institutional perspective, the decision tree structure enables the extraction of explicit and auditable decision rules that can be interpreted as early-warning signals rather than deterministic prescriptions. Split thresholds and branch conditions identify combinations of administrative and academic attributes that are systematically associated with elevated risk of non-submission. When used with appropriate caution, these rules can inform high-level monitoring criteria or screening heuristics to prioritize outreach, communication, or follow-up actions while avoiding automated enforcement or exclusion. Importantly, these rule-based patterns are intended to support human oversight and contextual judgment, not to replace institutional decision-making processes.
At d = 4 , the tree typically places estado_civil, anio_exigibilidad, and facultad_4 among the first splits, followed by conteo_matr and deud_monto. These nodes yield compact rules with broad coverage. For example, a single debtor with a low loan amount and few enrollments may exhibit an increased probability of not submitting the first income declaration. Such rules are easy to operationalize as “portfolio filters” for early outreach.
At d = 5 , the model refines these segments, introducing thresholds that separate borderline cases (for instance, specific ranges of deud_monto, particular faculties (facultad_X), or whether the undergraduate program is a STEM program (stem)). This level balances fidelity and interpretability.
At increasing depths, the Decision Tree exposes progressively finer-grained interactions among features. While deeper representations ( d = 11 ) may improve local fit by capturing higher-order combinations, they also reduce transparency and increase sensitivity to sampling variability. In contrast, shallower trees ( d = 4 and d = 5 ) emphasize a small set of high-yield splits that yield compact and stable decision rules. From an interpretability standpoint, these shallow structures provide a favorable balance between expressive power and human verifiability, making them suitable for analytical inspection and rule-based reasoning under uncertainty.
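Depth-controlled views of a single fitted tree, as used in Figures 15–17, can be reproduced with scikit-learn's `export_text`, which truncates the printed rules at a chosen depth. The sketch below is a hedged illustration on synthetic data; the feature names are stand-ins from the text, not the FSCU dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = ((X[:, 0] > 0) & (X[:, 1] < 0.5)).astype(int)
# Flip 10% of labels so the deep tree grows fine-grained, noise-fitting splits
flip = rng.random(500) < 0.1
y = np.where(flip, 1 - y, y)
names = ["estado_civil", "anio_exigibilidad", "conteo_matr"]  # illustrative

# One tree fitted once, then displayed at three depths, as in the figures
tree = DecisionTreeClassifier(max_depth=11, random_state=0).fit(X, y)
views = {d: export_text(tree, feature_names=names, max_depth=d) for d in (4, 5, 11)}

for d, v in views.items():
    print(f"--- view at depth {d}: {len(v.splitlines())} printed lines ---")
print(views[4])  # the shallow view stays compact enough for manual inspection
```

The growth in printed lines between the depth-4 and depth-11 views gives a concrete sense of how quickly node-by-node inspection becomes impractical.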
These structural observations are consistent with the global and model-wise importance analyses: the first-level splits systematically involve the same dominant variables (estado_civil, deud_monto, conteo_matr, and anio_exigibilidad) identified by permutation importance and ensemble-based rankings. This alignment indicates that the learned decision paths are not artifacts of model depth, but rather reflect stable predictive signals present in the restricted administrative feature space. Consequently, the extracted rules provide explicit, auditable explanations of individual predictions, reinforcing the interpretability claims examined in relation to RQ3.
To illustrate the internal reasoning of the chosen model, Table 15 summarizes one representative decision path extracted from the tree with depth d = 4 . This path shows how a borrower’s characteristics sequentially lead the model to predict a higher probability of not submitting the first income declaration.
This path illustrates a borrower whose marital status corresponds to a single individual (estado_civil = 1), with a loan enforceable in 2019 and below-average academic enrollments (conteo_matr standardized value = −0.5). The model first follows the left branch for single borrowers, then the right branch for recent enforceability years, and subsequently the left branches for both low enrollment count and below-average debt (deud_monto = −0.4).
The resulting classification, Never Declared, arises from the combination of limited academic continuity (fewer enrollments than average) and below-average financial exposure (total debt below the mean). The probability attached to the model output, 85% in this case, is the class proportion at the terminal node reached by this path; that is, it corresponds to the empirical class frequency observed at the leaf node and does not represent a calibrated posterior probability.
This example shows how the decision tree structure enables a transparent, rule-based explanation of predictions: each split represents a human-interpretable condition that links administrative attributes to behavioral outcomes. Such explicit reasoning enables predictions to be traced, verified, and analytically justified through a sequence of human-interpretable conditions defined on observed features.
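A decision path like the one summarized in Table 15 can be extracted programmatically via scikit-learn's `decision_path`. The sketch below traces one synthetic observation through a small tree; the feature names are illustrative stand-ins, and the leaf "probability" printed at the end is, as noted above, an empirical class frequency rather than a calibrated posterior.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
names = ["estado_civil", "anio_exigibilidad", "conteo_matr"]  # illustrative

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
x = X[:1]  # one observation to explain

node_indicator = tree.decision_path(x)  # sparse indicator of visited nodes
leaf = tree.apply(x)[0]                 # terminal node reached
t = tree.tree_

for node in node_indicator.indices:     # node ids increase along the path
    if node == leaf:
        counts = t.value[node][0]
        # class proportion at the leaf = empirical frequency, not calibrated
        print(f"leaf {node}: class proportions {counts / counts.sum()}")
    else:
        f, thr = t.feature[node], t.threshold[node]
        op = "<=" if x[0, f] <= thr else ">"
        print(f"node {node}: {names[f]} {op} {thr:.2f}")
```

Each printed line corresponds to one human-interpretable condition, so the full path can be read as an explicit rule of the kind described above.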

5.3.4. SHAP Values for the Light Gradient Boosting Machine

Figure 18 shows the SHAP value distribution for all features in the Base Light Gradient Boosting Machine (LGBM) model. The TreeExplainer method from the SHAP library was applied, as it provides accurate local attributions for ensemble-based algorithms. Each point represents a single observation: its position along the x-axis indicates the magnitude and direction of its contribution to the model output, while the color encodes the original feature value (blue for low and red for high). Points distributed farther from zero correspond to stronger impacts on the final prediction.
The SHAP summary plot reveals patterns consistent with the permutation and tree-based feature importance analyses (Figure 12 and Figure 14). The dominant variables—deud_monto (loan amount), conteo_matr (number of enrollments), estado_civil (marital status), and anio_exigibilidad (loan enforceability year)—exhibit the largest SHAP magnitudes. These features drive the model’s predictions in interpretable directions: high deud_monto, high conteo_matr, and higher estado_civil codes (married borrowers) tend to push predictions toward the Declares class, while more recent anio_exigibilidad values (later enforceability years) shift the prediction toward Never Declared.
Features related to academic programs (faculty dummies and the STEM indicator) show minimal dispersion around zero, confirming their marginal influence on model decisions. Notably, the dummy variable corresponding to the Faculty of Law (facultad_7) displays a slightly asymmetric distribution, suggesting a weak but consistent positive contribution to declaration probability.
These patterns indicate that marital status exerts a moderate but consistent influence on declaration behavior, with married or partnered borrowers showing slightly higher compliance. Financial exposure also plays a central role: larger loan amounts are associated with higher declaration probability, whereas smaller debts correspond to increased non-declaration risk. Academic continuity further contributes to the model output, as a lower number of enrollments (conteo_matr) is systematically linked to reduced declaration likelihood. Temporal effects are present but weaker, with earlier enforceability years (anio_exigibilidad) marginally increasing the probability of declaration. Finally, program-related variables such as faculty affiliation exhibit only secondary effects, with the Faculty of Law showing a small but consistent positive contribution relative to other faculties.
Overall, the SHAP analysis complements the global and model-wise interpretability results by providing instance-level attributions that are consistent with the previously identified feature rankings. The agreement between permutation importance, tree-based importances, and SHAP value distributions indicates that the contribution of the dominant predictors is stable across explanation paradigms and model families.
From a formal interpretability perspective, SHAP values offer a locally additive decomposition of the model output, enabling each prediction to be expressed as a sum of feature-level contributions relative to a baseline expectation. This property ensures traceability and internal coherence of explanations, even in ensemble-based models with complex nonlinear decision functions. Under the restriction to pre-declaration administrative features, such locally consistent explanations allow predictions to be examined, compared, and validated without reliance on latent or post-event information.
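The locally additive decomposition (the efficiency property) can be checked directly. Rather than reproducing SHAP's TreeExplainer algorithm, the sketch below computes exact Shapley values by brute-force enumeration of feature coalitions for a tiny surrogate model, then verifies that the contributions sum to the prediction minus the baseline expectation. This is a pedagogical illustration under its own assumptions, not the implementation used in the study, and it is only tractable because the surrogate has three features.

```python
import itertools
import math
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * (X[:, 0] > 0) + X[:, 1]
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

def f_cond(x, S, background):
    """Expected model output with features in S fixed to x's values and the
    rest drawn from the background data (interventional expectation)."""
    Xb = background.copy()
    if S:
        Xb[:, list(S)] = x[list(S)]
    return model.predict(Xb).mean()

def shapley_values(x, background):
    d = x.shape[0]
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(len(others) + 1):
            for S in itertools.combinations(others, k):
                w = math.factorial(k) * math.factorial(d - k - 1) / math.factorial(d)
                # Marginal contribution of feature i to coalition S
                phi[i] += w * (f_cond(x, S + (i,), background) - f_cond(x, S, background))
    return phi

x = X[0]
phi = shapley_values(x, X)
base = model.predict(X).mean()            # baseline expectation E[f]
fx = model.predict(x.reshape(1, -1))[0]   # this observation's prediction
print("phi:", phi, "| sum:", phi.sum(), "| f(x) - E[f]:", fx - base)
```

The efficiency check at the end is exactly the traceability property invoked above: every prediction decomposes into feature-level contributions relative to a baseline.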
Taken together, the stability of feature rankings, the availability of explicit decision rules in tree-based models, and the locally faithful explanations provided by SHAP jointly address RQ3. They demonstrate that reliable interpretability can be achieved in supervised classification tasks operating on constrained institutional datasets, supporting transparent reasoning about predictions under uncertainty rather than opaque score-based classification.

5.3.5. Consistency and Complementarity Across Interpretation Layers

The interpretability framework adopted in this study integrates global (permutation feature importance), structural (decision paths), and local (SHAP) explanation methods, each addressing a distinct aspect of model behavior. These approaches are not expected to yield identical explanations, as they operate at different analytical levels and respond to different interpretative questions.
Global explanations identify variables that exert consistent influence across the borrower population, structural explanations reveal how such variables are combined within the internal decision logic of the models, and local explanations provide instance-level attributions for individual predictions. Apparent discrepancies between explanation layers are therefore not treated as methodological inconsistencies, but rather as complementary perspectives that jointly characterize predictive behavior.
From an institutional perspective, this layered interpretability strategy supports decision-making at multiple levels. Global explanations inform strategic prioritization and policy-level resource allocation, structural explanations enhance transparency and auditability of decision rules, and local explanations enable case-by-case review when targeted monitoring or preventive actions are considered. Rather than resolving disagreements by privileging a single interpretability method, the proposed framework emphasizes triangulation across explanation layers to ensure robust, interpretable, and context-aware decision support.

5.4. Rule-Based Threshold Baseline Comparison

To contextualize the performance gains achieved by the machine learning models, a simple rule-based baseline was implemented using threshold rules on debt amount and enrollment count, which were consistently identified as the most influential numerical features across the interpretability analyses (Section 5.3.2, Section 5.3.3 and Section 5.3.4). High-risk cases were defined as those falling in the fourth quartile (i.e., above the 75th percentile) of each respective distribution.
As reported in Table 16, this baseline exhibits high precision but very low recall, resulting in poor overall performance, as reflected by low F1-score and MCC values. This behavior indicates that the rule-based approach captures only a small subset of extreme-risk borrowers while failing to identify a large proportion of non-compliant cases. The corresponding confusion matrix (Figure 19) confirms this pattern, showing a limited number of true positives alongside a substantial number of false negatives.
In contrast, the Light Gradient Boosting Machine achieves substantially higher and more balanced performance (e.g., F1 = 0.861 and MCC = 0.418; see Table 13), demonstrating its ability to exploit multivariate and non-linear relationships beyond simple threshold rules. These results highlight the limitations of practical rule-based heuristics and underscore the added value of machine learning models for early risk identification in this institutional context.
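One plausible reading of this comparison can be sketched on synthetic data: flag a case as high risk when both stand-in features exceed their training-set 75th percentile, and compare the rule against a learned model. The feature names, thresholds, and the use of scikit-learn's gradient boosting in place of LightGBM are all simplifying assumptions made to keep the sketch self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import matthews_corrcoef, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 2))  # stand-ins for deud_monto and conteo_matr
y = ((0.8 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(scale=0.7, size=n)) > 1.0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Rule baseline: thresholds estimated on the training split only
q0 = np.quantile(Xtr[:, 0], 0.75)
q1 = np.quantile(Xtr[:, 1], 0.75)
rule_pred = ((Xte[:, 0] > q0) & (Xte[:, 1] > q1)).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
ml_pred = model.predict(Xte)

rule_mcc = matthews_corrcoef(yte, rule_pred)
ml_mcc = matthews_corrcoef(yte, ml_pred)
print("rule precision:", precision_score(yte, rule_pred))
print("rule recall:", recall_score(yte, rule_pred))
print("rule MCC:", rule_mcc, "| model MCC:", ml_mcc)
```

Even on this simple synthetic task the conjunctive threshold rule shows the high-precision, low-recall profile described above, while the learned model exploits the joint distribution of the two features.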

6. Discussion

The contribution of this work is methodological and institutional rather than algorithmic. The findings should be read as evidence about what can be achieved with established models when the problem is formulated under realistic administrative constraints, evaluated with imbalance-appropriate metrics, and accompanied by interpretability mechanisms designed for auditability. Accordingly, the manuscript does not claim new learning theory or new imbalance-handling methods, but provides a defensible blueprint for deploying predictive decision support in comparable administrative compliance settings.
Given the breadth of evaluated configurations, the interpretation in this section emphasizes cross-model patterns and relative performance tiers rather than isolated numerical differences. While full metric tables are retained for transparency and completeness, the discussion focuses on aggregated trends—such as the comparative behavior of linear versus ensemble models, the effect of optimization and sampling strategies, and the stability of MCC across configurations—where additional numerical detail does not yield proportional interpretive value.
This aggregation strategy avoids overemphasis on marginal metric fluctuations and aligns the analysis with the study’s applied objective: assessing whether administratively deployable models achieve reliable, interpretable, and operationally meaningful performance under realistic data constraints.

6.1. Model Performance and Interpretability

Although non-submission of the first income declaration may, in general, reflect heterogeneous behavioral conditions, the present study considers this outcome within the specific institutional context of the FSCU system. The obligation to submit the declaration is contractually established, becomes enforceable after a defined grace period, and is supported by systematic informational and reminder mechanisms. Accordingly, first-time non-submission is interpreted as an early manifestation of non-compliance within a fully informed contractual framework, rather than as a consequence of lack of awareness. While individual circumstances may differ, such distinctions cannot be reliably inferred from pre-declaration administrative data alone. Consequently, the interpretability analyses presented in this study should be understood as identifying correlates of elevated early non-compliance risk, rather than as causal explanations of distinct underlying behavioral mechanisms.
Across all experiments, most algorithms achieved strong and stable predictive performance. Linear models, particularly Logistic Regression and the Linear Support Vector Machine, consistently achieved F1-scores above 0.85 and Matthews Correlation Coefficients (MCC) near 0.37 , indicating balanced performance between the two classes despite a moderate imbalance in the dataset. Tree-based ensemble methods, such as Random Forest and LightGBM, achieved slightly higher MCC values (around 0.41 – 0.42 ), suggesting that non-linear relationships exist between borrower characteristics and repayment behavior. However, the gap in performance between ensemble and linear models was narrow, reflecting that the underlying patterns can be captured effectively without complex architectures. This consistency across algorithms indicates that administrative data contain a strong and stable signal that can be modeled reliably through interpretable approaches under constrained feature spaces. Assessing classification quality through MCC, which is particularly appropriate in imbalanced settings because it accounts for all cells of the confusion matrix, further supports reliable identification and prioritization of borrowers at elevated non-compliance risk. In operational terms, this can inform earlier and more targeted outreach (e.g., reminders and guidance) and a more efficient allocation of administrative follow-up resources, without treating the model output as a deterministic decision rule.
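The rationale for emphasizing MCC over accuracy can be shown in a few lines: on an imbalanced sample, a trivial majority-class predictor attains high accuracy yet an MCC of zero, because MCC draws on all four cells of the confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = np.array([1] * 90 + [0] * 10)  # 90/10 class imbalance
y_trivial = np.ones_like(y_true)        # always predict the majority class

acc = accuracy_score(y_true, y_trivial)
mcc = matthews_corrcoef(y_true, y_trivial)  # degenerate case: sklearn returns 0.0
print("accuracy:", acc)  # high despite no discriminative ability
print("MCC:", mcc)       # zero, exposing the trivial predictor
```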
The confusion matrix analysis confirmed these trends: linear models favored conservative classifications with higher precision but lower recall (Type II errors), whereas ensemble methods offered more balanced results, slightly increasing false positives (Type I errors) to improve recall for non-declarants. Hyperparameter tuning produced marginal yet consistent improvements across all models, while Random Under Sampling often enhanced minority-class recall without substantial accuracy loss. Synthetic oversampling methods (SMOTE and ADASYN) achieved similar effects, marginally improving precision in some configurations.
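As an illustration of the resampling step, Random Under Sampling reduces the majority class to the minority class size on the training split only. The sketch below implements this directly with NumPy to stay self-contained; libraries such as imbalanced-learn provide equivalent utilities for this and for SMOTE/ADASYN.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = np.array([1] * 80 + [0] * 20)  # imbalanced toy labels
Xb, yb = random_undersample(X, y)
print(np.bincount(yb))  # both classes reduced to the minority size
```

Applying this only to the training split, never to the evaluation data, is what keeps the reported metrics leakage-free.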
Interpretability analyses further strengthened the robustness and transparency of these results. Permutation feature importance and model-specific coefficients consistently highlighted financial and academic variables—particularly total debt (deud_monto), enrollment count (conteo_matr), marital status (estado_civil), and loan enforceability year (anio_exigibilidad)—as the main determinants of first-declaration behavior. The decision-tree visualizations provided concrete rule-based explanations, showing how these variables interact to form decision paths (e.g., combinations of high debt and limited enrollment predicting non-declaration). Complementarily, SHAP value analysis quantified each feature’s contribution to individual predictions, confirming that higher debt levels and continuous enrollment increase the probability of compliance, whereas more recent enforceability years and single marital status lean toward non-declaration.
It should be noted that these interpretability techniques do not address identical explanatory questions and may therefore yield partially divergent insights. Permutation Feature Importance (PFI) captures global sensitivity by measuring performance degradation under feature perturbation, whereas SHAP values provide conditional, instance-level attributions, and decision trees offer simplified structural approximations of learned relationships. As a result, discrepancies between global rankings and local explanations are expected and should be interpreted as complementary perspectives rather than as methodological contradictions.
Together, these interpretability layers—global (PFI), structural (tree paths), and local (SHAP)—provide a comprehensive understanding of model behavior. They ensure that predictions can be traced, analytically justified, and examined across multiple levels of abstraction, reinforcing the reliability of supervised learning models operating under constrained administrative feature spaces.
At the same time, the interpretability of ensemble models should be understood as mediated rather than intrinsic. While post-hoc explanation tools enable analytical inspection of model behavior, ensemble methods such as Random Forests and gradient boosting do not yield transparent decision rules in a strict sense. Accordingly, the explanations presented in this study should be viewed as audit-oriented approximations that support diagnostic reasoning and institutional scrutiny, rather than as fully transparent representations of the underlying decision logic.
At the current stage, no institution-specific decision threshold is defined for translating predicted risk scores into automatic actions. This reflects the fact that, within the FSCU system, formalized risk tolerance criteria and cost-sensitive decision policies have not yet been established. Consequently, the proposed models are conceived as an initial screening and monitoring tool, providing continuous risk indicators rather than binary decision triggers. These outputs are intended to support early identification, targeted communication, and preventive follow-up strategies, leaving final decisions to institutional judgment. The definition of optimized thresholds aligned with explicit institutional risk preferences is therefore identified as a natural extension of this work, once such policies are formally specified.
From an operational standpoint, the interpretability framework is intended to support institutional processes rather than individual-level adjudication. For example, a borrower characterized by high outstanding debt, limited enrollment history, and a recent enforceability year may be flagged as presenting elevated early non-compliance risk. In such cases, interpretability outputs can guide targeted communication, administrative follow-up, or preventive guidance, without being used as deterministic or punitive decision rules.

6.2. Implications for Predictive Modeling Under Administrative Constraints

This study illustrates how supervised learning models trained on routinely collected administrative data can anticipate borrower declaration behavior under information constraints. From a modeling perspective, the results demonstrate that pre-event academic and financial variables contain sufficient signal to support reliable binary classification, even in the presence of moderate class imbalance.
The combination of predictive performance and interpretability indicates that complex behavioral outcomes can be approximated using transparent decision structures. Rule-based paths extracted from decision trees and locally additive SHAP explanations allow predictions to be decomposed into verifiable feature contributions, facilitating analytical scrutiny rather than opaque score assignment.
More broadly, the proposed framework exemplifies how predictive modeling can be integrated into administrative data environments without reliance on latent variables or post-event information. This characteristic supports transferability to other income-contingent loan systems or institutional datasets with similar structural limitations, where explainability and traceability are as critical as predictive accuracy.
The generalizability of these findings should be interpreted with appropriate scope. Several elements of the results are inherently context-specific, including the exact distribution of declaration outcomes, the magnitude of predictive performance metrics, the relative importance of individual features, and the absence of institutionally defined decision thresholds. These aspects reflect the regulatory framework, borrower population, and administrative practices of the FSCU system at the PUCV, and should not be assumed to transfer directly to other institutions or funding schemes.
In contrast, the methodological structure of the proposed framework is potentially transferable across administrative compliance contexts. Specifically, the pre-event formulation of the predictive task under informational constraints, the use of a unified and leakage-aware validation protocol, the prioritization of imbalance-appropriate evaluation metrics such as the Matthews Correlation Coefficient, and the positioning of interpretability mechanisms as audit-oriented decision support tools are applicable to other income-contingent loan systems and regulated administrative domains where outcomes are delayed and class imbalance is structural.

6.3. Limitations

Several limitations should be acknowledged. First, the study is constrained by the scope and structure of the available administrative data, which, while comprehensive, exclude certain socioeconomic variables that could further explain borrower behavior (e.g., employment type or household composition). Second, the analysis focuses exclusively on first-declaration outcomes; subsequent declarations and long-term repayment behavior remain outside the scope of the present study. Extending the framework to a longitudinal setting would allow the identification of recurrent non-compliance patterns and the assessment of persistence in compliance behavior over time.
Another limitation concerns generalizability. The data and administrative context correspond to a single higher education institution. Although the proposed modeling framework is transferable, predictive performance and feature relevance may vary across universities with different borrower profiles, regulatory environments, or collection practices. Future research should validate the approach using multi-institutional data, particularly within the CRUCH network, to assess external validity and support the development of standardized predictive tools for income-contingent student loan management in Chile.
Finally, although models are trained on data pooled across multiple cohorts and enforceability periods, the present study adopts a cross-sectional predictive perspective rather than a time-aware longitudinal one. Potential temporal drift arising from regulatory changes, labour-market conditions, or evolving institutional practices is therefore acknowledged but not explicitly modelled. This limitation also constrains the feasibility of time-aware validation strategies, as implementing cohort-based training and testing would require a longer and more stable post-reform observation window to avoid conflating gradual temporal drift with structural breaks induced by legislative changes and exogenous shocks. Assessing robustness across cohorts under such conditions is identified as an important direction for future research.

7. Conclusions and Future Work

7.1. Summary of Main Findings

This study developed a predictive framework for estimating whether borrowers of the Fondo Solidario de Crédito Universitario (FSCU) at the Pontificia Universidad Católica de Valparaíso (PUCV) would submit their first income declaration using only pre-declaration administrative and academic data. By combining standard machine learning algorithms with rigorous preprocessing, the models achieved strong and consistent predictive performance.
Linear classifiers—Logistic Regression and a Support Vector Machine—demonstrated high interpretability and stability, while ensemble models such as the Random Forest and LightGBM offered slightly higher predictive accuracy, reaching F1-scores above 0.85 and Matthews Correlation Coefficients around 0.41. Interpretability analyses, including permutation importance, decision-tree visualization, and SHAP values, consistently identified financial and academic features—particularly total debt (deud_monto), number of enrollments (conteo_matr), marital status (estado_civil), and loan enforceability year (anio_exigibilidad)—as the most influential determinants of declaration behavior. Together, these results validate the feasibility of leveraging administrative data for anticipating declaration behavior and demonstrate that transparent, interpretable models can achieve reliable performance within income-contingent loan settings.

7.2. Methodological Implications

From a methodological perspective, the study highlights the importance of combining predictive performance with interpretability when modeling compliance-related outcomes using administrative data. The results show that relatively simple classifiers, when properly tuned and evaluated, can achieve competitive performance while preserving transparency and analytical tractability.
The integration of interpretable structures—such as explicit decision paths and additive explanation models—demonstrates that complex ensemble methods can remain accessible to inspection and validation. This balance between accuracy and explainability is particularly relevant for modeling tasks involving regulated or high-stakes outcomes, where understanding the contribution of individual features is as important as predictive accuracy itself.

7.3. Directions for Future Research

While the results are encouraging, several research opportunities remain open. Future work should extend the predictive framework to longitudinal analysis, examining how borrower behavior evolves across successive income declarations and repayment cycles. Incorporating additional socioeconomic variables—such as employment stability, regional context, or household composition—could further enhance predictive performance and interpretability.
From a machine learning point of view, a natural extension is the incorporation of cost-sensitive or constrained learning strategies. In this domain, their meaningful adoption requires an explicit institutional definition of misclassification costs, since the operational consequences of false positives and false negatives are administrative-policy-dependent. Future work should formalize these cost structures and evaluate cost-sensitive learning under the same pre-event constraints.
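As a hedged sketch of what such an extension might look like, the example below uses scikit-learn's `class_weight` parameter to encode a misclassification-cost ratio. The 5:1 ratio is purely an illustrative assumption standing in for an institutionally defined cost structure, which, as noted above, does not yet exist for this domain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=(n, 3))
# Minority positive class standing in for non-compliant borrowers
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 1.2).astype(int)

plain = LogisticRegression().fit(X, y)
# Hypothetical cost ratio: a missed positive (FN) costs 5x a false alarm (FP)
costly = LogisticRegression(class_weight={0: 1, 1: 5}).fit(X, y)

r_plain = recall_score(y, plain.predict(X))
r_costly = recall_score(y, costly.predict(X))
print("recall (unweighted):", r_plain)
print("recall (cost-weighted):", r_costly)
```

Reweighting shifts the effective decision threshold toward the costly class, raising recall at the expense of more false alarms, which is exactly the trade-off an explicit institutional cost policy would need to govern.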
Methodologically, integrating advanced explainable AI techniques (e.g., SHAP interaction values, LIME, or counterfactual explanations) would allow for a deeper understanding of feature contributions at both the individual and subgroup levels. Expanding the dataset to include multiple universities or linking it with national administrative records could test the model’s generalizability and scalability. Ultimately, these extensions would contribute to the development of adaptive and transparent predictive frameworks suitable for complex, regulated administrative datasets in Chile and comparable contexts worldwide.
An additional avenue for future research is to examine the extent to which the proposed framework generalizes across administrative contexts with different structural characteristics. In particular, controlled sensitivity analyses under alternative class imbalance ratios or simulated administrative conditions would allow for a systematic assessment of robustness beyond the specific distributional properties of the FSCU system. Such extensions would help clarify which performance and interpretability patterns are stable across institutions and which are contingent on local regulatory or population features, while preserving the pre-event and audit-oriented design principles adopted in this study.
Overall, the results confirm the achievement of this study’s objectives: the predictive models identify the key factors associated with first-declaration behavior while maintaining reliable performance and interpretability under administrative constraints. This work contributes a replicable modeling framework that bridges supervised learning, explainable AI, and real-world administrative data, reinforcing the role of transparent predictive methods in applied computational research.

Author Contributions

Conceptualization, Á.P., B.C., E.M., E.R.-T., J.B.-G., F.C.-C., B.L.C., A.Y. and Á.P.F.; methodology, Á.P., B.C., E.M., E.R.-T. and R.S.; software, Á.P., J.B.-G., F.C.-C., B.L.C. and Y.L.; validation, Á.P., B.C., E.M., E.R.-T., J.B.-G., F.C.-C., B.L.C., Y.L., A.Y., Á.P.F. and R.S.; formal analysis, Á.P., J.B.-G., F.C.-C., B.L.C., Y.L. and A.Y.; investigation, Á.P., J.B.-G., F.C.-C., B.L.C. and Y.L.; resources, B.C. and Á.P.F.; data curation, Á.P., J.B.-G., F.C.-C., B.L.C. and Y.L.; writing—original draft, Á.P., J.B.-G., F.C.-C., B.L.C. and Y.L.; writing—review & editing, Á.P., B.C., J.B.-G., F.C.-C., B.L.C., Y.L., Á.P.F. and R.S.; visualization, Á.P., J.B.-G., F.C.-C., B.L.C., Y.L. and A.Y.; supervision, B.C., E.M. and R.S.; project administration, Á.P. and B.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

José Barrera-García is supported by the National Agency for Research and Development (ANID)/Scholarship Program/DOCTORADO NACIONAL/2024-21242516. Felipe Cisternas-Caneo is supported by the National Agency for Research and Development (ANID)/Scholarship Program/DOCTORADO NACIONAL/2023-21230203.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Missing values across dataset features compared to the dataset size (red-dotted line).
Figure 2. Distribution of selected numerical features.
Figure 3. Distribution of selected categorical features (Part I).
Figure 4. Distribution of selected categorical features (Part II).
Figure 5. Distribution of cod_carr_ult_matr.
Figure 6. Distribution of escuela.
Figure 7. Pearson correlation heatmap among numerical features.
Figure 8. Distribution of enforceability dates (deud_fecha_exigibilidad).
Figure 9. Decomposition of enforceability dates into day, month, and year components.
Figure 10. Confusion Matrices (Part I).
Figure 11. Confusion Matrices (Part II).
Figure 12. Average Permutation Feature Importance of all Models.
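The permutation importances summarized in Figure 12 measure how much a model's score drops when one feature's values are shuffled. A minimal sketch with scikit-learn follows; the dataset, model choice, and scoring metric here are illustrative stand-ins for the study's pipeline, not its exact configuration.

```python
# Illustrative sketch (not the study's exact pipeline): permutation feature
# importance as in Figure 12, computed with scikit-learn on synthetic data
# standing in for the administrative features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           weights=[0.3, 0.7], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature column and measure the mean drop in the chosen score
# (MCC is also available via scoring="matthews_corrcoef"); larger drops
# indicate more influential features.
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=42, scoring="f1")
ranking = np.argsort(result.importances_mean)[::-1]
print("Features ranked by importance:", ranking)
```

Averaging such rankings across models, as in Figure 12, gives a model-agnostic view of which features carry the discriminative signal.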
Figure 13. Linear Models Feature Importance.
Figure 14. Feature Importance for Tree-Based Models.
Figure 15. Decision Tree Snapshot of Model OPTRUSDecisionTree at Depth = 4. Blue (orange) nodes indicate higher association with declaration (non-declaration), with color intensity reflecting node purity.
Figure 16. Decision Tree Snapshot of Model OPTRUSDecisionTree at Depth = 5. Blue (orange) nodes indicate higher association with declaration (non-declaration), with color intensity reflecting node purity; boxes (“...”) denote truncated branches beyond the selected tree depth.
Figure 17. Decision Tree Snapshot of Model OPTRUSDecisionTree at Depth = 11. Blue (orange) nodes indicate higher association with declaration (non-declaration), with color intensity reflecting node purity; boxes (“...”) denote truncated branches beyond the selected tree depth.
Figure 18. SHAP Values for LGBM.
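The SHAP summary in Figure 18 rests on additivity: per-feature contributions for an instance sum to the difference between that instance's model output and the average output. For a linear decision function this has a closed form, sketched below to make the property easy to verify; the study applies the same principle to LightGBM via tree-based SHAP, which this simplified linear case only approximates in spirit.

```python
# Sketch of the additive-explanation principle behind Figure 18: for a linear
# decision function f(x) = w.x + b, the exact SHAP values (assuming feature
# independence) are w_i * (x_i - mean(x_i)), and they sum to f(x) - E[f(X)].
# The study uses TreeSHAP on LightGBM; this linear case just illustrates the
# additivity property.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=6, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

w = model.coef_.ravel()
baseline = X.mean(axis=0)

x = X[0]
contributions = w * (x - baseline)      # exact SHAP values for a linear f

f_x = model.decision_function(x.reshape(1, -1))[0]
f_mean = (X @ w + model.intercept_[0]).mean()
print("Sum of contributions:", contributions.sum())
print("f(x) - E[f(X)]     :", f_x - f_mean)
```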
Figure 19. Confusion matrix for the rule-based threshold baseline.
Table 1. Initial feature pool prior to preprocessing.

| Name | Data Type | Feature Type | Detail |
| --- | --- | --- | --- |
| estado_civil | Boolean | Categorical | 1 and 2 |
| nacionalidad | Boolean | Categorical | 1 and 2 |
| sexo | Boolean | Categorical | M and F |
| fecha_nacimiento | Date | Date | 1 January 1900 to 26 March 1991 |
| edad | Integer | Numerical | 21 to 119 |
| edad_dias | Integer | Numerical | 7860 to 43,646 |
| deud_monto | Float | Numerical | 0.571 to 1285.762 |
| deud_fecha_exigibilidad | Date | Date | 1 January 1994 to 1 January 2023 |
| deud_t_deuda | Integer | Categorical | 1 value |
| tiene_declaracion | Boolean | Target | 0 and 1 |
| monto_total_pagare | Float | Numerical | 1.32 to 922.22 |
| conteo_pagare | Integer | Numerical | 1 to 29 |
| anio_ult_matr | Integer | Categorical | 13 values |
| e_ult_matr | Integer | Categorical | 1 value |
| cod_carr_ult_matr | Integer | Categorical | 80 values |
| carr_t_carrera | Integer | Categorical | 1 value |
| cod_inst_ult_matr | Integer | Categorical | 1 value |
| conteo_matr | Integer | Numerical | 1 to 31 |
| facultad | String | Categorical | 9 values |
| escuela | String | Categorical | 34 values |
| stem | Boolean | Categorical | 0 and 1 |
Table 2. Feature set: detailed descriptions of each variable.

| Feature | Description |
| --- | --- |
| estado_civil | Last known marital status of the debtor. It can take the following values: 1 not married, 2 married |
| nacionalidad | Whether the debtor is Chilean or foreign. 1 means Chilean, 2 means foreign |
| sexo | Gender of the debtor. M means male and F means female |
| fecha_nacimiento | Birth date of the debtor |
| edad | Age in years of the debtor at the moment the debt becomes enforceable |
| edad_dias | Age in days of the debtor at the moment the debt becomes enforceable |
| deud_monto | Total loan amount |
| deud_fecha_exigibilidad | Date of enforceability of the loan |
| deud_t_deuda | Type of loan contracted |
| tiene_declaracion | Whether the debtor handed in their first income declaration. 1 means they handed it in and 0 means they did not. Target variable |
| monto_total_pagare | Total value of promissory notes signed by the debtor |
| conteo_pagare | Number of promissory notes signed by the debtor |
| anio_ult_matr | Year of the last college enrollment of the debtor |
| e_ult_matr | Status of the last college enrollment of the debtor. 1 means the enrollment has a valid status |
| cod_carr_ult_matr | Code of the degree program covered by the loan |
| carr_t_carrera | Type of degree program in the last college enrollment of the debtor. 1 means undergraduate program |
| cod_inst_ult_matr | Institution code in the last college enrollment of the debtor |
| conteo_matr | Total number of enrollments of the debtor within the degree program covered by the loan |
| facultad | Faculty of the degree program |
| escuela | School of the degree program |
| stem | Whether the degree program covered by the loan is a STEM one. 1 means it is a STEM program, 0 means it is not |
Table 3. Features with missing values.

| Column Name | Missing Values |
| --- | --- |
| sexo | 9583 |
| fecha_nacimiento | 9614 |
| edad | 9614 |
| edad_dias | 9614 |
| escuela | 22 |
Table 4. Final feature set used for model training after preprocessing and feature engineering.

| Name | Data Type | Feature Type | Detail |
| --- | --- | --- | --- |
| estado_civil | Boolean | Categorical | 1 and 2 |
| nacionalidad | Boolean | Categorical | Filtered (foreign category removed) |
| sexo | Boolean | Categorical | M and F |
| deud_monto | Float | Numerical | Standardized |
| dia_exigibilidad | Integer | Numerical | 1 to 31 |
| anio_exigibilidad | Integer | Numerical | Multiple years |
| anio_ult_matr | Integer | Categorical | Filtered values |
| conteo_matr | Integer | Numerical | 1 to 31 |
| facultad | String | Categorical | 9 values |
| stem | Boolean | Categorical | 0 and 1 |
| tiene_declaracion | Boolean | Target | 0 and 1 |
Table 5. Hyperparameter grid used for model grid search strategy.

| Model | Hyperparameter | Values |
| --- | --- | --- |
| KNN | n_neighbors | 5, 7, 9, 11, 14, 18, 22, 25, 28, 35, 40, 45, 50, 70 |
| | weights | “uniform”, “distance” |
| | p | 1, 2 |
| Random Forest | n_estimators | 20, 50, 100, 200, 500, 800 |
| | max_depth | None, 5, 10, 20, 50, 70 |
| | min_samples_split | 2, 10, 20, 50, 70, 100 |
| | min_samples_leaf | 1, 2, 4, 6, 10, 30 |
| | max_features | “sqrt”, “log2” |
| LightGBM | n_estimators | 20, 30, 50, 100, 500, 600, 800, 1000 |
| | max_depth | None, 5, 10, 20, 50, 70 |
| | learning_rate | 0.01, 0.05, 0.1, 0.12, 0.15, 0.2 |
| | num_leaves | 2, 4, 8, 15, 31, 50, 100 |
| SVM | C | 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10 |
| | loss | “hinge”, “squared_hinge” |
| | penalty | “l1”, “l2” |
| Logistic Regression | C | 0.01, 0.1, 1, 5, 10 |
| | solver | “lbfgs”, “sag”, “saga” |
| Naive Bayes | var_smoothing | 1 × 10⁻⁹, 1 × 10⁻⁸, 1 × 10⁻⁷, 1 × 10⁻⁶, 1 × 10⁻⁴, 1 × 10⁻² |
| Decision Tree | max_depth | None, 5, 10, 20, 50, 70 |
| | min_samples_split | 2, 10, 20, 50, 70, 100 |
| | min_samples_leaf | 1, 2, 4, 6, 10, 30 |
| | max_features | “sqrt”, “log2” |
Table 6. Hyperparameter grid used for Sampling grid search strategy.

| Sampling Strategy | Hyperparameter | Values |
| --- | --- | --- |
| OverSampling | Ratio | 0.6, 0.7, 0.8, 1 |
| UnderSampling | Target Sample Values | 4000, 5000, 5500, 6000 |
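The under-sampling grid in Table 6 fixes the number of majority-class samples retained. The following NumPy sketch shows the mechanic directly (keep every minority sample, draw a random subset of the majority class); the study's pipelines use library resamplers such as those in imbalanced-learn rather than this hand-rolled version.

```python
# Hedged sketch of random under-sampling to a fixed majority-class target,
# mirroring the "Target Sample Values" grid in Table 6. Implemented with
# plain NumPy; the study uses library resamplers (e.g., imbalanced-learn).
import numpy as np

def random_undersample(X, y, majority_target, seed=0):
    """Keep all minority samples and a random subset of the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    keep = rng.choice(maj_idx, size=min(majority_target, maj_idx.size),
                      replace=False)
    idx = np.concatenate([min_idx, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

X = np.arange(20_000, dtype=float).reshape(10_000, 2)
y = np.array([1] * 7000 + [0] * 3000)      # imbalanced: 7000 vs. 3000
X_rs, y_rs = random_undersample(X, y, majority_target=5000)
print(np.bincount(y_rs))                    # class counts after resampling
```

Sweeping `majority_target` over the Table 6 values turns the class ratio itself into a tunable hyperparameter.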
Table 7. K-Nearest Neighbors Results.

| Pipeline | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- |
| KNN | 0.746 | 0.795 | 0.872 | 0.832 | 0.323 |
| Smote KNN | 0.701 | 0.814 | 0.758 | 0.785 | 0.298 |
| Adasyn KNN | 0.701 | 0.817 | 0.753 | 0.784 | 0.303 |
| RUS KNN | 0.720 | 0.805 | 0.807 | 0.806 | 0.304 |
| OPT KNN | 0.765 | 0.785 | 0.928 | 0.850 | 0.344 |
| OPT Smote KNN | 0.746 | 0.814 | 0.839 | 0.826 | 0.354 |
| OPT Adasyn KNN | 0.740 | 0.823 | 0.814 | 0.819 | 0.361 |
| OPT RUS KNN | 0.763 | 0.804 | 0.887 | 0.843 | 0.366 |
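All five columns reported in Tables 7–13 derive from the confusion matrix; MCC in particular uses all four cells, which is why it stays informative under imbalance. A short sketch with illustrative predictions (not values from the tables):

```python
# Sketch of how the reported metrics (Accuracy, Precision, Recall, F1, MCC)
# follow from a confusion matrix; the labels below are illustrative only.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)

y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# MCC combines all four confusion-matrix cells, so a trivial majority-class
# predictor scores near zero even when accuracy looks high.
mcc_manual = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("MCC      :", round(mcc_manual, 3))
```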
Table 8. Gaussian Naive Bayes Results.

| Pipeline | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- |
| Naive Bayes | 0.615 | 0.874 | 0.543 | 0.670 | 0.309 |
| Smote Naive Bayes | 0.569 | 0.892 | 0.458 | 0.605 | 0.293 |
| Adasyn Naive Bayes | 0.562 | 0.888 | 0.449 | 0.596 | 0.282 |
| RUS Naive Bayes | 0.599 | 0.875 | 0.518 | 0.651 | 0.296 |
| OPT Naive Bayes | 0.715 | 0.847 | 0.738 | 0.789 | 0.366 |
| OPT Smote Naive Bayes | 0.754 | 0.810 | 0.860 | 0.835 | 0.361 |
| OPT Adasyn Naive Bayes | 0.735 | 0.827 | 0.800 | 0.813 | 0.360 |
| OPT RUS Naive Bayes | 0.725 | 0.835 | 0.770 | 0.801 | 0.360 |
Table 9. Logistic Regression Results.

| Pipeline | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.770 | 0.792 | 0.923 | 0.853 | 0.366 |
| Smote Logistic Regression | 0.729 | 0.847 | 0.760 | 0.802 | 0.383 |
| Adasyn Logistic Regression | 0.729 | 0.850 | 0.757 | 0.801 | 0.387 |
| RUS Logistic Regression | 0.753 | 0.821 | 0.841 | 0.831 | 0.376 |
| OPT Logistic Regression | 0.770 | 0.792 | 0.922 | 0.852 | 0.364 |
| OPT Smote Logistic Regression | 0.752 | 0.822 | 0.837 | 0.830 | 0.377 |
| OPT Adasyn Logistic Regression | 0.738 | 0.840 | 0.785 | 0.812 | 0.384 |
| OPT RUS Logistic Regression | 0.756 | 0.820 | 0.846 | 0.833 | 0.379 |
Table 10. Linear Support Vector Machine Results.

| Pipeline | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- |
| Linear SVM | 0.775 | 0.788 | 0.939 | 0.857 | 0.374 |
| Smote Linear SVM | 0.727 | 0.847 | 0.758 | 0.800 | 0.380 |
| Adasyn Linear SVM | 0.729 | 0.851 | 0.756 | 0.801 | 0.388 |
| RUS Linear SVM | 0.754 | 0.818 | 0.846 | 0.832 | 0.373 |
| OPT Linear SVM | 0.776 | 0.789 | 0.940 | 0.858 | 0.376 |
| OPT Smote Linear SVM | 0.729 | 0.840 | 0.770 | 0.804 | 0.373 |
| OPT Adasyn Linear SVM | 0.750 | 0.828 | 0.824 | 0.826 | 0.382 |
| OPT RUS Linear SVM | 0.726 | 0.846 | 0.758 | 0.800 | 0.377 |
Table 11. Decision Tree Results.

| Pipeline | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- |
| DecisionTree | 0.716 | 0.804 | 0.801 | 0.802 | 0.297 |
| Smote DecisionTree | 0.702 | 0.818 | 0.755 | 0.785 | 0.306 |
| Adasyn DecisionTree | 0.695 | 0.811 | 0.751 | 0.780 | 0.286 |
| RUS DecisionTree | 0.688 | 0.808 | 0.743 | 0.774 | 0.273 |
| OPT DecisionTree | 0.764 | 0.791 | 0.913 | 0.848 | 0.351 |
| OPT Smote DecisionTree | 0.704 | 0.868 | 0.696 | 0.772 | 0.384 |
| OPT Adasyn DecisionTree | 0.735 | 0.841 | 0.780 | 0.809 | 0.381 |
| OPT RUS DecisionTree | 0.742 | 0.844 | 0.787 | 0.814 | 0.394 |
Table 12. Random Forest Results.

| Pipeline | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- |
| Random Forest | 0.741 | 0.808 | 0.840 | 0.824 | 0.337 |
| Smote Random Forest | 0.733 | 0.818 | 0.808 | 0.813 | 0.343 |
| Adasyn Random Forest | 0.733 | 0.821 | 0.804 | 0.813 | 0.348 |
| RUS Random Forest | 0.720 | 0.816 | 0.788 | 0.802 | 0.323 |
| OPT Random Forest | 0.786 | 0.802 | 0.933 | 0.863 | 0.416 |
| OPT Smote Random Forest | 0.770 | 0.828 | 0.859 | 0.843 | 0.412 |
| OPT Adasyn Random Forest | 0.762 | 0.834 | 0.835 | 0.835 | 0.407 |
| OPT RUS Random Forest | 0.752 | 0.844 | 0.805 | 0.824 | 0.408 |
Table 13. Light Gradient Boosting Machine Results.

| Pipeline | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- |
| LightGBM | 0.786 | 0.807 | 0.922 | 0.861 | 0.418 |
| Smote LightGBM | 0.760 | 0.825 | 0.845 | 0.835 | 0.392 |
| Adasyn LightGBM | 0.765 | 0.832 | 0.843 | 0.838 | 0.409 |
| RUS LightGBM | 0.765 | 0.825 | 0.855 | 0.840 | 0.400 |
| OPT LightGBM | 0.786 | 0.803 | 0.931 | 0.862 | 0.415 |
| OPT Smote LightGBM | 0.756 | 0.836 | 0.823 | 0.829 | 0.403 |
| OPT Adasyn LightGBM | 0.758 | 0.836 | 0.826 | 0.831 | 0.405 |
| OPT RUS LightGBM | 0.750 | 0.838 | 0.809 | 0.823 | 0.397 |
Table 14. Faculty Dummy Feature Values.

| Dummy Feature | Real Value |
| --- | --- |
| 0 | Ecclesiastical Faculty of Theology |
| 1 | Faculty of Sciences |
| 2 | Faculty of Philosophy and Education |
| 3 | Faculty of Economic and Administrative Sciences |
| 4 | Faculty of Engineering |
| 5 | Faculty of Marine and Geographical Sciences |
| 6 | Faculty of Agronomic and Food Sciences |
| 7 | Faculty of Law |
| 8 | Faculty of Architecture and Urbanism |
Table 15. Example decision path from the optimized Decision Tree (depth = 4).

| Observed Value | Split Condition | Feature Meaning | Branch Taken |
| --- | --- | --- | --- |
| 1 | estado_civil ≤ 1.5 | Borrower is single or without dependents | True (left branch) |
| 2019 | anio_exigibilidad ≤ 2018.5 | Loan enforceability year (2019) | False (right branch) |
| 0.5 | conteo_matr ≤ 0.696 | Total number of enrollments (standardized) | True (left branch) |
| 0.4 | deud_monto ≤ 0.478 | Total debt amount (standardized) | True (left branch) |
| 2019 | anio_exigibilidad ≤ 2020.5 | Loan enforceability year (2021) | End branch |

Predicted class: Never Declared (estimated probability ≈ 0.85, 282 cases of no declaration over 333 total in this node).
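A path like the one in Table 15 can be read directly out of a fitted scikit-learn tree by following the node sequence for a single sample. The sketch below uses synthetic data; the feature names are borrowed from the study purely for illustration and do not reproduce its actual splits.

```python
# Sketch of extracting a verifiable decision path (as in Table 15) from a
# fitted scikit-learn tree. Data are synthetic; feature names are
# illustrative placeholders borrowed from the study's feature set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

feature_names = ["estado_civil", "anio_exigibilidad", "conteo_matr",
                 "deud_monto"]
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=7)
tree = DecisionTreeClassifier(max_depth=4, random_state=7).fit(X, y)

sample = X[0].reshape(1, -1)
node_path = tree.decision_path(sample).indices   # node ids visited by sample
t = tree.tree_

rules = []
for node in node_path:
    if t.children_left[node] == -1:              # leaf: report class distribution
        rules.append(f"leaf: class distribution {t.value[node].ravel()}")
    else:
        name = feature_names[t.feature[node]]
        thr = t.threshold[node]
        went_left = sample[0, t.feature[node]] <= thr
        rules.append(f"{name} <= {thr:.3f} -> {'True' if went_left else 'False'}")

print("\n".join(rules))
```

Because each line is a concrete inequality on an input feature, the path can be audited against the raw record, which is the sense in which the tree decomposes into verifiable decision rules.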
Table 16. Applied Practical Threshold Metrics.

| Pipeline | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- |
| Practical Threshold | 0.385 | 0.876 | 0.172 | 0.288 | 0.141 |
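The baseline in Table 16 applies a fixed rule rather than a learned model. The sketch below shows the general shape of such a rule-based classifier: predict the positive class when a single feature exceeds a cut-off, then score it with the same metrics used for the learned pipelines. The feature index and threshold here are hypothetical, not the study's actual rule.

```python
# Sketch of a rule-based threshold baseline in the spirit of Table 16:
# classify by comparing one feature against a fixed cut-off, then evaluate
# with the same metrics as the learned models. The feature and threshold
# below are illustrative assumptions, not the study's actual rule.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, matthews_corrcoef

X, y = make_classification(n_samples=1000, n_features=5, weights=[0.3, 0.7],
                           random_state=3)

feature_idx, threshold = 0, 0.0                 # hypothetical rule
y_rule = (X[:, feature_idx] > threshold).astype(int)

print("Rule F1 :", round(f1_score(y, y_rule), 3))
print("Rule MCC:", round(matthews_corrcoef(y, y_rule), 3))
```

Comparing such a rule's MCC against the optimized ensembles quantifies how much signal the learned models extract beyond a simple administrative cut-off.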
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Paz, Á.; Crawford, B.; Monfroy, E.; Rodriguez-Tello, E.; Barrera-García, J.; Cisternas-Caneo, F.; Cortés, B.L.; Lazo, Y.; Yáñez, A.; Peña Fritz, Á.; et al. Interpretable Binary Classification Under Constraints for Financial Compliance Modeling. Mathematics 2026, 14, 429. https://doi.org/10.3390/math14030429


