This section presents the results obtained from the experiments described in the previous sections.
Section 5.1 summarizes the evaluation metrics achieved by each algorithm introduced in Section 4.4, including both the baseline configurations and those incorporating the data-balancing techniques discussed in Section 4.5. In the best-performing configurations reported in this section, the synthetic oversampling methods SMOTE and ADASYN used a sample ratio of , while Random Undersampling reduced the majority class to 5500 instances. Each configuration was also tested with and without hyperparameter optimization, as described in Section 4.6, resulting in eight experimental combinations per algorithm.
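For illustration, the eight combinations per algorithm can be enumerated as a small configuration grid. This is a minimal sketch: only the "OPT RUS" label appears verbatim later in the text, so the "UNOPT" and "BASE" tags used here are assumed naming conventions, not the study's actual identifiers.

```python
from itertools import product

# Four sampling strategies (baseline plus the three balancing techniques)
# crossed with two optimization states yield eight configurations per model.
# "BASE" and "UNOPT" are assumed labels; "OPT RUS" mirrors the text.
SAMPLING = ["BASE", "SMOTE", "ADASYN", "RUS"]
OPTIMIZATION = ["UNOPT", "OPT"]

def experiment_grid(algorithm):
    """Return the eight (optimization, sampling) configurations for one algorithm."""
    return [f"{opt} {samp} {algorithm}" for opt, samp in product(OPTIMIZATION, SAMPLING)]

configs = experiment_grid("DecisionTree")
print(len(configs))   # 8
print(configs[-1])    # OPT RUS DecisionTree
```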
5.1. Model Performance
Table 7, Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13 summarize the test-set performance of all algorithms under the eight experimental configurations described earlier. As the primary selection criterion, MCC differentiates the most competitive configurations under class imbalance, while Accuracy, Precision, Recall, and the F1-score provide complementary views of classification behavior. Unoptimized Decision Trees and Naive Bayes pipelines show weaker performance, whereas their optimized variants improve substantially.
Precision, recall, and the F1-score followed a consistent trend across models, generally remaining above , while MCC exhibited greater sensitivity to model and sampling choices. Slight deviations were observed in a few configurations, such as the KNN with ADASYN, where precision reached , and the unoptimized Naive Bayes, where recall dropped to and F1-scores remained below . Once optimized, however, all Naive Bayes variants improved notably, reaching F1-scores of approximately . This pattern indicates that even simple models benefit from parameter adjustment when trained on administrative data with moderate class imbalance.
The Matthews Correlation Coefficient (MCC) displayed higher variability across experiments, as expected under class imbalance. Models such as Naive Bayes and Decision Tree showed the greatest sensitivity to hyperparameter tuning, while Random Forest and LightGBM achieved consistent improvements after optimization. In particular, LightGBM achieved the highest MCC values across all experiments, suggesting that boosting methods capture nonlinear interactions among financial and academic features more effectively than other algorithms.
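Because MCC serves as the primary selection criterion, it is worth recalling how it is computed from the confusion matrix and why it resists majority-class inflation. The counts below are illustrative only, not values from the reported tables:

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from raw confusion-matrix counts.

    Returns 0.0 when any marginal is empty, the conventional value for a
    degenerate classifier that never predicts one of the classes.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Illustrative counts: with a 9:1 imbalance, always predicting the majority
# class scores 90% accuracy yet MCC = 0, which is why MCC (not Accuracy)
# differentiates competitive configurations under class imbalance.
degenerate = mcc(tp=900, fp=100, fn=0, tn=0)   # predicts everything positive
balanced   = mcc(tp=850, fp=30, fn=50, tn=70)  # actually separates the classes
print(degenerate, round(balanced, 3))
```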
Overall, the results indicate that both linear and ensemble classifiers achieve reliable generalization on the held-out test set using exclusively pre-declaration features. Linear models (Logistic Regression and Linear SVM) yield stable performance with transparent decision functions, while ensemble methods (Random Forest and LightGBM) provide a modest gain in predictive power, as reflected by their higher MCC values. From an error-analysis standpoint, the joint inspection of Precision, Recall, and MCC shows that competitive configurations maintain a favorable trade-off between false positives and false negatives under class imbalance, without collapsing into degenerate majority-class predictions.
From an operational perspective, differences in Matthews Correlation Coefficient (MCC) translate into meaningful trade-offs between Type I and Type II errors, which are directly relevant for institutional decision-making. In the present context, Type I errors (false positives) correspond to borrowers incorrectly classified as compliant, potentially delaying preventive outreach, whereas Type II errors (false negatives) correspond to borrowers incorrectly classified as non-compliant, potentially triggering unnecessary monitoring actions. Because MCC jointly accounts for all cells of the confusion matrix, improvements in MCC reflect a more balanced reduction of both error types, rather than gains driven by majority-class dominance or asymmetric error minimization.
Consequently, configurations achieving higher MCC values—such as optimized Random Forest and LightGBM models—offer more robust discrimination capacity under uncertainty, supporting earlier and more proportionate administrative responses. Importantly, these gains should not be interpreted as deterministic decision thresholds, but as improvements in risk ranking quality that enhance the efficiency of targeted communication and follow-up strategies while preserving institutional discretion.
These findings address RQ2 by showing that supervised models can predict declaration outcomes with consistent performance using only pre-event information. They also support RQ3 by demonstrating that interpretability can be preserved under constrained administrative feature spaces: linear decision functions and tree-based structures provide explicit, verifiable decision criteria, while the cross-model stability of the top-ranked predictors motivates the interpretability analyses developed in the subsequent subsections.
Regarding computational cost, the full experimental training pipeline was executed on a standard commercial off-the-shelf workstation (as described in
Section 4.1) and required approximately three days to complete, including hyperparameter optimization and cross-validation across all evaluated configurations. Once trained, inference is computationally lightweight: the average prediction time is approximately 0.002 seconds per instance on the held-out test set.
Given that the institutional dataset comprises on the order of records per year, batch inference over new cohorts can be performed in negligible time on conventional hardware, without imposing any operational burden. From an institutional deployment perspective, this clear separation between moderate offline training cost and negligible online inference cost makes the proposed framework fully feasible for routine use in administrative settings.
5.2. Confusion Matrices
Figure 10 and
Figure 11 display the confusion matrices corresponding to the best-performing experiment for each model. These visualizations provide a more granular view of how each classifier distinguishes between borrowers who submitted their first income declaration and those who did not.
Overall, all models exhibit a strong ability to differentiate between the two classes, though the nature of the misclassifications varies. Some models show a tendency toward Type I errors (false positives—predicting a borrower will declare when they will not), while others lean toward Type II errors (false negatives—predicting a borrower will not declare when they actually do).
The confusion matrices for Naive Bayes, Logistic Regression, Linear SVM, and Decision Tree reveal a predominance of Type II errors, consistent with their lower recall values reported in
Section 5.1. These models tend to miss a portion of actual declarants, prioritizing conservative classifications that favor the majority class.
In contrast, KNN, Random Forest, and LightGBM display a stronger inclination toward Type I errors, predicting more declarants than actually filed. Although this behavior slightly reduces precision, it prevents severe drops in recall and yields higher overall F1-scores. For early-warning purposes, however, this bias warrants monitoring, since labeling a non-declarant as compliant can delay preventive outreach to potentially defaulting borrowers.
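The two error tendencies discussed above can be quantified directly from a confusion matrix. A minimal sketch follows, with the positive class taken as "Declares" as defined earlier; the counts are hypothetical and serve only to contrast the two behaviors:

```python
def error_profile(tp, fp, fn, tn):
    """Type I (false positive) and Type II (false negative) rates, with the
    positive class taken as 'Declares'."""
    type_i  = fp / (fp + tn)   # non-declarants mislabeled as declarants
    type_ii = fn / (fn + tp)   # actual declarants the model missed
    return type_i, type_ii

# Hypothetical counts contrasting the two behaviors described in the text.
conservative = error_profile(tp=700, fp=20, fn=200, tn=80)  # Type II-leaning
permissive   = error_profile(tp=880, fp=60, fn=20, tn=40)   # Type I-leaning
print(conservative, permissive)
```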
From an error-analysis perspective, the observed asymmetry between Type I and Type II errors has direct implications for model selection under uncertainty. Configurations exhibiting a mild bias toward Type I errors prioritize higher recall at the cost of a moderate increase in false positives, whereas models dominated by Type II errors achieve higher precision but risk systematically missing true positive cases. This trade-off is consistent with the metric profiles reported in
Section 5.1, particularly the joint behavior of Recall, F1-score, and MCC.
Under a constrained feature setting and class imbalance, ensemble models such as Random Forest and LightGBM exhibit a more balanced error structure, avoiding extreme concentration on either error type. Their confusion matrices show that gains in recall are not achieved at the expense of severe precision degradation, which explains their consistently higher MCC values. From a computational standpoint, this balance indicates a more robust discrimination capacity across both classes, rather than reliance on majority-class dominance.
5.3. Model Interpretability
This section analyzes which variables most strongly drive the predictive behavior of the models and how these relationships can be interpreted to provide transparent and verifiable explanations of model predictions. Beyond supporting transparency, this interpretability layer also plays a key role in identifying and monitoring potential socioeconomic biases present in the underlying administrative data. By making feature contributions, split thresholds, and decision rules explicit, the proposed approach allows institutional analysts to detect patterns that may disproportionately affect specific groups, enabling informed oversight and periodic review. Importantly, interpretability is not presented as a bias-mitigation mechanism per se, but as a diagnostic tool to support responsible use, human judgment, and the design of complementary governance or corrective strategies when needed.
Section 5.3.1 reports the average permutation feature importance (PFI) across all trained models, providing a global view of variable relevance.
Section 5.3.2 presents the top fifteen model-wise importances for the best-performing experiment of each interpretable model, highlighting differences between linear and tree-based algorithms.
To complement these aggregate analyses,
Section 5.3.3 illustrates decision paths extracted from the optimized Decision Tree (OPT RUS DecisionTree) at multiple depths, showing how model structure can be translated into human-readable rules. Finally,
Section 5.3.4 introduces SHAP value visualizations, which quantify the individual contribution of each feature to specific predictions, enhancing transparency and case-level explainability.
5.3.1. Permutation Feature Importance Results
Figure 12 shows the averaged PFI computed for every model. To reduce the variance introduced by random shuffling, the procedure was repeated thirty-one times per model and the results were averaged.
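The PFI procedure (shuffle one column, measure the performance drop, average over repeats) can be sketched in a few lines. The toy data, threshold "model", and accuracy metric below are illustrative stand-ins for the study's trained pipelines, although the thirty-one repeats match the text:

```python
import random

def permutation_importance(predict, X, y, col, n_repeats=31, seed=0):
    """Mean accuracy drop when one feature column is shuffled,
    averaged over n_repeats independent shuffles."""
    rng = random.Random(seed)
    acc = lambda rows: sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    base = acc(X)
    drops = []
    for _ in range(n_repeats):
        shuffled_col = [row[col] for row in X]
        rng.shuffle(shuffled_col)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled_col)]
        drops.append(base - acc(X_perm))
    return sum(drops) / n_repeats

# Toy data: feature 0 determines the label, feature 1 is pure noise, and a
# fixed threshold rule stands in for a trained model, so only feature 0
# should register a meaningful importance.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(400)]
y = [int(row[0] > 0) for row in X]
model = lambda row: int(row[0] > 0)
print(permutation_importance(model, X, y, col=0))  # large positive drop
print(permutation_importance(model, X, y, col=1))  # exactly zero here
```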
Two features stand out clearly: deud_monto (total loan amount) and conteo_matr (total number of enrollments). They are followed by estado_civil (marital status) and anio_exigibilidad (year of enforceability). The consistent prominence of these four variables across models indicates that financial exposure, academic trajectory, and basic demographics jointly explain most of the predictive signal.
The remaining variables contribute progressively less. Most faculty dummies have limited impact, with the notable exception of the indicator corresponding to the Faculty of Law (see Table 14), which ranks among the top features and suggests program-specific differences in declaration behavior. This result indicates that representing academic affiliation at the faculty level provides sufficient and stable information to capture program-level trends, allowing newly introduced academic programs to be accommodated through their faculty assignment without altering the model structure.
At the lower end of the chart, some features exhibit slightly negative average PFI values. Given their very small magnitude and the known sensitivity of permutation to sampling noise and collinearity, these values do not by themselves justify feature removal.
While a formal feature ablation study was not conducted, the permutation feature importance analysis provides an indirect indication of model sensitivity to reduced feature availability. Across all evaluated models, predictive performance is largely driven by a small subset of highly influential features, whereas the permutation of remaining variables results in negligible changes in performance. This suggests that the learned decision structure is not critically dependent on a large number of marginal features. However, it should be noted that permutation importance reflects sensitivity to information degradation rather than actual feature removal; a systematic retraining-based ablation analysis is therefore left as future work.
5.3.2. Model-Wise Feature Importance
Figure 13 shows the feature importances for the best experiments of the linear models, while
Figure 14 reports the importances for the best tree-based models.
For the linear models, the coefficient-based importances in Figure 13 show that estado_civil (marital status) dominates the decision boundaries in both Logistic Regression and Linear SVMs, reflecting its strong marginal effect under the standardized feature space. In the Logistic Regression model, academic and institutional variables such as facultad_7 (Faculty of Law), the STEM indicator, and several anio_ult_matr dummies are also influential, suggesting that the academic program and enrollment history contribute to the likelihood of timely declaration. In contrast, the Linear SVM assigns higher relative weights to recent enrollment years (anio_ult_matr_2011, 2015, and 2020) and to the total debt amount (deud_monto), capturing the impact of both temporal and financial dimensions. These differences are expected, since permutation importance evaluates overall predictive dependence, whereas linear coefficients reflect local marginal effects conditioned on feature scaling.
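The idea of reading coefficient magnitudes on standardized features as importances can be illustrated with a tiny gradient-descent logistic regression. This is a self-contained sketch on synthetic data, not the study's scikit-learn pipeline, and the feature roles are assumed for the example:

```python
import random
from math import exp

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Plain batch gradient-descent logistic regression; on standardized
    inputs, |w[j]| then serves as a coefficient-based importance proxy."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for row, t in zip(X, y):
            p = 1.0 / (1.0 + exp(-(sum(wi * xi for wi, xi in zip(w, row)) + b)))
            err = p - t
            gb += err
            for j in range(d):
                gw[j] += err * row[j]
        w = [wi - lr * gj / n for wi, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

# Synthetic standardized data: feature 0 carries the signal, feature 1 is
# noise, so |w[0]| should dominate the coefficient-based ranking.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y = [int(row[0] + 0.3 * rng.gauss(0, 1) > 0) for row in X]
w, b = fit_logistic(X, y)
importance = sorted(range(len(w)), key=lambda j: -abs(w[j]))
print(importance)  # feature 0 ranked first
```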
For the tree-based models (
Figure 14), the feature importance rankings are broadly consistent across the Decision Tree, Random Forest, and LightGBM. In all three cases, the same dominant predictors identified by permutation importance define the core predictive structure.
The Decision Tree model assigns the greatest weight to conteo_matr, followed closely by estado_civil and deud_monto, indicating that a single borrower’s academic trajectory and financial exposure are key splitting criteria. Random Forest and LightGBM reinforce this pattern but invert the top two variables—deud_monto slightly surpasses conteo_matr—highlighting that ensemble averaging emphasizes financial magnitude over enrollment frequency. The consistent presence of anio_exigibilidad among the top features across all three models underscores the importance of the repayment timeline in distinguishing between declaring and non-declaring borrowers.
Lower-ranked variables, such as facultad indicators and STEM affiliation, contribute marginally to model performance, offering limited incremental information once the main financial and academic variables are included. This stability of rankings across independent tree-based architectures suggests that the predictive signal is dominated by a small, interpretable subset of features directly linked to borrower behavior and loan structure.
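Tree-based importances of this kind accumulate the impurity decrease contributed by each feature's splits. A minimal sketch of the quantity for a single split follows, using a hand-made four-row sample rather than the study's data:

```python
def gini(labels):
    """Gini impurity of a binary label list."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_importance(X, y, feat, thr):
    """Impurity decrease achieved by one split; summing these over a tree's
    nodes (weighted by node size) yields impurity-based feature importances."""
    left  = [t for row, t in zip(X, y) if row[feat] <= thr]
    right = [t for row, t in zip(X, y) if row[feat] > thr]
    n = len(y)
    return gini(y) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Hand-made sample: a split on feature 0 separates the labels perfectly,
# while a split on feature 1 is uninformative.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
y = [0, 0, 1, 1]
print(split_importance(X, y, feat=0, thr=0.5))  # 0.5 (perfect split)
print(split_importance(X, y, feat=1, thr=3.0))  # 0.0 (uninformative)
```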
From a predictive and interpretability standpoint, these results align with the performance analysis. Variables related to debt magnitude and academic trajectory consistently carry the strongest explanatory weight across models, indicating that a small subset of administrative features concentrates most of the discriminative signal. This concentration supports stable interpretation under uncertainty, as the same variables govern both predictive accuracy and explanatory structure.
5.3.3. Decision Tree Snapshots at Different Depths
To illustrate how model structure supports decision-making,
Figure 15,
Figure 16 and
Figure 17 display the same Decision Tree trained under one of the best-performing configurations, namely the Optimized Random Undersampling Decision Tree (OPT RUS DecisionTree) described in Section 5.1, visualized at three increasing depths. These visualizations are intended as illustrative artifacts rather than as objects of exhaustive node-by-node inspection. The two shallower representations highlight a small set of high-yield splits that can be readily examined, whereas the deepest tree introduces finer partitions that capture niche interactions at the cost of interpretability, exemplifying how structural complexity rapidly limits direct human inspection in administrative prediction settings.
From an institutional perspective, the decision tree structure enables the extraction of explicit and auditable decision rules that can be interpreted as early-warning signals rather than deterministic prescriptions. Split thresholds and branch conditions identify combinations of administrative and academic attributes that are systematically associated with elevated risk of non-submission. When used with appropriate caution, these rules can inform high-level monitoring criteria or screening heuristics to prioritize outreach, communication, or follow-up actions while avoiding automated enforcement or exclusion. Importantly, these rule-based patterns are intended to support human oversight and contextual judgment, not to replace institutional decision-making processes.
At the shallowest depth, the tree typically places estado_civil, anio_exigibilidad, and facultad_4 among the first splits, followed by conteo_matr and deud_monto. These nodes yield compact rules with broad coverage: for example, a single borrower with a low loan amount and few enrollments may have an increased probability of not submitting the first income declaration. Such rules are easy to operationalize as “portfolio filters” for early outreach.
At the intermediate depth, the model refines these segments, introducing thresholds that separate borderline cases, for instance specific ranges of deud_monto, particular faculties (facultad_X), or whether the undergraduate program is a STEM program (stem). This level balances fidelity and interpretability.
At increasing depths, the Decision Tree exposes progressively finer-grained interactions among features. While deeper representations may improve local fit by capturing higher-order combinations, they also reduce transparency and increase sensitivity to sampling variability. In contrast, shallower trees emphasize a small set of high-yield splits that yield compact and stable decision rules. From an interpretability standpoint, these shallow structures provide a favorable balance between expressive power and human verifiability, making them suitable for analytical inspection and rule-based reasoning under uncertainty.
These structural observations are consistent with the global and model-wise importance analyses: the first-level splits systematically involve the same dominant variables (estado_civil, deud_monto, conteo_matr, and anio_exigibilidad) identified by permutation importance and ensemble-based rankings. This alignment indicates that the learned decision paths are not artifacts of model depth, but rather reflect stable predictive signals present in the restricted administrative feature space. Consequently, the extracted rules provide explicit, auditable explanations of individual predictions, reinforcing the interpretability claims examined in relation to RQ3.
To illustrate the internal reasoning of the chosen model,
Table 15 summarizes one representative decision path extracted from the tree. This path shows how a borrower’s characteristics sequentially lead the model to predict a higher probability of not submitting the first income declaration.
This path illustrates a borrower whose marital status corresponds to a single individual (estado_civil = 1), with a loan enforceable in 2019 and below-average academic enrollments (conteo_matr standardized value = −0.5). The model first follows the left branch for single borrowers, then the right branch for recent enforceability years, and subsequently the left branches for both low enrollment count and below-average debt (deud_monto = −0.4).
The resulting classification, Never Declared, arises from the combination of limited academic continuity (fewer enrollments than average) and below-average financial exposure (total debt below the mean). The probability the model assigns to this outcome is the class proportion at the terminal node reached by the path; it reflects the empirical class frequency observed at that leaf and is not a calibrated posterior probability.
This example shows how the decision tree structure enables a transparent, rule-based explanation of predictions: each split represents a human-interpretable condition that links administrative attributes to behavioral outcomes. Such explicit reasoning enables predictions to be traced, verified, and analytically justified through a sequence of human-interpretable conditions defined on observed features.
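Such a path can itself be represented as an explicit, auditable data structure and re-evaluated against individual records. The sketch below paraphrases the example above (single borrower, loan enforceable in 2019, standardized conteo_matr = −0.5, deud_monto = −0.4); the split thresholds are hypothetical stand-ins, not values read from the actual tree:

```python
# Each condition is (feature, operator, threshold); thresholds here are
# hypothetical illustrations of the branch directions described in the text.
PATH = [
    ("estado_civil", "<=", 1.5),         # left branch: single borrowers
    ("anio_exigibilidad", ">", 2017.5),  # right branch: recent enforceability
    ("conteo_matr", "<=", 0.0),          # left branch: below-average enrollments
    ("deud_monto", "<=", 0.0),           # left branch: below-average debt
]

def follow_path(borrower, path):
    """Return True if the borrower satisfies every condition on the path."""
    ops = {"<=": lambda a, b: a <= b, ">": lambda a, b: a > b}
    return all(ops[op](borrower[feat], thr) for feat, op, thr in path)

borrower = {"estado_civil": 1, "anio_exigibilidad": 2019,
            "conteo_matr": -0.5, "deud_monto": -0.4}
label = "Never Declared" if follow_path(borrower, PATH) else "Declares"
print(label)  # Never Declared
```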
5.3.4. SHAP Values for the Light Gradient Boosting Machine
Figure 18 shows the SHAP value distribution for all features in the Base Light Gradient Boosting Machine (LGBM) model. The TreeExplainer method from the SHAP library was applied, as it provides accurate local attributions for ensemble-based algorithms. Each point represents a single observation: its position along the
x-axis indicates the magnitude and direction of its contribution to the model output, while the color encodes the original feature value (blue for low and red for high). Points distributed farther from zero correspond to stronger impacts on the final prediction.
The SHAP summary plot reveals patterns consistent with the permutation and tree-based feature importance analyses (Figure 12 and Figure 14). The dominant variables, deud_monto (loan amount), conteo_matr (number of enrollments), estado_civil (marital status), and anio_exigibilidad (loan enforceability year), exhibit the largest SHAP magnitudes. These features drive the model’s predictions in interpretable directions: high deud_monto, high conteo_matr, and higher estado_civil codes (married borrowers) tend to push predictions toward the Declares class, while older anio_exigibilidad values (earlier repayment years) shift the prediction toward Never Declared.
Features related to academic programs (faculty dummies and the STEM indicator) show minimal dispersion around zero, confirming their marginal influence on model decisions. Notably, the dummy variable corresponding to the Faculty of Law (facultad_7) displays a slightly asymmetric distribution, suggesting a weak but consistent positive contribution to declaration probability.
These patterns indicate that marital status exerts a moderate but consistent influence on declaration behavior, with married or partnered borrowers showing slightly higher compliance. Financial exposure also plays a central role: larger loan amounts are associated with higher declaration probability, whereas smaller debts correspond to increased non-declaration risk. Academic continuity further contributes to the model output, as a lower number of enrollments (conteo_matr) is systematically linked to reduced declaration likelihood. Temporal effects are present but weaker, with earlier enforceability years (anio_exigibilidad) marginally shifting predictions toward non-declaration. Finally, program-related variables such as faculty affiliation exhibit only secondary effects, with the Faculty of Law showing a small but consistent positive contribution relative to other faculties.
Overall, the SHAP analysis complements the global and model-wise interpretability results by providing instance-level attributions that are consistent with the previously identified feature rankings. The agreement between permutation importance, tree-based importances, and SHAP value distributions indicates that the contribution of the dominant predictors is stable across explanation paradigms and model families.
From a formal interpretability perspective, SHAP values offer a locally additive decomposition of the model output, enabling each prediction to be expressed as a sum of feature-level contributions relative to a baseline expectation. This property ensures traceability and internal coherence of explanations, even in ensemble-based models with complex nonlinear decision functions. Under the restriction to pre-declaration administrative features, such locally consistent explanations allow predictions to be examined, compared, and validated without reliance on latent or post-event information.
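The locally additive decomposition can be made concrete by computing exact Shapley values by brute force for a toy two-feature model; this is only an illustration of the property, since TreeExplainer computes the same quantities efficiently for tree ensembles, and the model, background sample, and instance below are all assumed for the example:

```python
from itertools import permutations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley attributions by averaging each feature's marginal
    contribution over all feature orderings; features outside the coalition
    are imputed from a background sample. Feasible only for very few features."""
    d = len(x)
    def value(subset):
        # Expected model output with the features in `subset` fixed to x's values.
        total = 0.0
        for bg in background:
            z = [x[j] if j in subset else bg[j] for j in range(d)]
            total += f(z)
        return total / len(background)
    phi = [0.0] * d
    for order in permutations(range(d)):
        fixed = set()
        for j in order:
            before = value(fixed)
            fixed.add(j)
            phi[j] += value(fixed) - before
    return [p / factorial(d) for p in phi], value(set())

# Toy nonlinear model over two features (purely illustrative, not the LGBM).
f = lambda z: 2.0 * z[0] + z[0] * z[1]
background = [[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]]
x = [1.0, 1.0]
phi, base = shapley_values(f, x, background)
# Local additivity: the baseline expectation plus the feature-level
# attributions recovers the model output f(x) exactly.
print(base + sum(phi), f(x))
```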
Taken together, the stability of feature rankings, the availability of explicit decision rules in tree-based models, and the locally faithful explanations provided by SHAP jointly address RQ3. They demonstrate that reliable interpretability can be achieved in supervised classification tasks operating on constrained institutional datasets, supporting transparent reasoning about predictions under uncertainty rather than opaque score-based classification.
5.3.5. Consistency and Complementarity Across Interpretation Layers
The interpretability framework adopted in this study integrates global (permutation feature importance), structural (decision paths), and local (SHAP) explanation methods, each addressing a distinct aspect of model behavior. These approaches are not expected to yield identical explanations, as they operate at different analytical levels and respond to different interpretative questions.
Global explanations identify variables that exert consistent influence across the borrower population, structural explanations reveal how such variables are combined within the internal decision logic of the models, and local explanations provide instance-level attributions for individual predictions. Apparent discrepancies between explanation layers are therefore not treated as methodological inconsistencies, but rather as complementary perspectives that jointly characterize predictive behavior.
From an institutional perspective, this layered interpretability strategy supports decision-making at multiple levels. Global explanations inform strategic prioritization and policy-level resource allocation, structural explanations enhance transparency and auditability of decision rules, and local explanations enable case-by-case review when targeted monitoring or preventive actions are considered. Rather than resolving disagreements by privileging a single interpretability method, the proposed framework emphasizes triangulation across explanation layers to ensure robust, interpretable, and context-aware decision support.