1. Introduction
Credit risk management is of fundamental importance in ensuring the stability of financial institutions, as it determines their capacity to absorb losses arising from credit defaults (Jorion, 2000). In this context, international standards, particularly those developed by the Comité de Supervisión Bancaria de Basilea (2011), have played a central role in the development of regulatory frameworks that guide the identification, measurement, and mitigation of financial risks. The Basel II Accord represented a significant landmark in the field of credit risk modeling, introducing three distinct approaches for this purpose: the standardized approach, the foundation internal ratings-based (IRB) approach, and the advanced IRB approach. Each method provides varying degrees of flexibility in estimating the components of expected loss: probability of default (PD), loss given default (LGD), and exposure at default (EAD). While the standardized approach relies mainly on external ratings, the IRB methodologies enable institutions to leverage behavioral and transactional data to enhance expected loss modeling. The fundamental distinction between the foundation and advanced IRB methods lies in the degree of autonomy afforded to banks in estimating risk components. Although advanced methodologies require significant investment in infrastructure and analytical capacity, they allow institutions to achieve more accurate models, reduce excess capital holdings, and strengthen their ability to manage risks and foster sustainable growth.
In the Colombian context, institutions supervised by the Superintendence of the Solidarity Economy (SES) are required to conduct continuous evaluations of the credit risk associated with their loan portfolios. This evaluation must be performed both at origination and at regular intervals during the loan lifecycle, across all lending modalities (consumer, housing, commercial, and microcredit). To support these processes, the SES has developed a reference model that serves as a standardized guide for rating credit portfolios and estimating expected losses (Superintendencia de la Economia Solidaria, 2024). While this regulatory model ensures consistency across entities, it is constrained by its reliance on standardized frameworks that do not consider the operational, social, and financial heterogeneity of solidarity-based institutions.
The Colombian solidarity sector encompasses a diverse array of financial entities, including savings and credit cooperatives, employee funds, and mutual associations. These organizations are characterized by their not-for-profit orientation, democratic governance, and strong community ties, resulting in operational dynamics that differ markedly from those of commercial banks. Heterogeneity in institutional size, membership composition, credit products, and data availability poses unique challenges for credit risk assessment. Consequently, standardized rating models, often designed for traditional banking environments, fail to capture the social and financial specificities of these institutions. Tailored credit scoring models, particularly those employing explainable machine learning techniques, are therefore essential to ensure accurate risk assessments, fair access to credit, and alignment with both regulatory requirements and cooperative principles. Furthermore, the current regulatory model relies predominantly on binary variables (e.g., default/non-default, presence/absence of collateral). While operationally straightforward, this approach limits the explanatory power of risk assessments by disregarding informative continuous variables (e.g., income, tenure, payment history, or debt level) and by constraining the identification of complex patterns associated with default probability. These technical limitations hinder decision-making and may lead to inefficiencies in credit risk management.
Recent credit risk research has delivered a rich toolkit for risk classification and prediction, ranging from logistic regression models to more sophisticated methods such as random forests, gradient boosting, and LightGBM, all of which exhibit high predictive accuracy (Gatla, 2023; Aguilar-Valenzuela, 2024; Machado & Karray, 2022; Sharma et al., 2022). Additionally, various studies incorporate model-agnostic interpretability frameworks such as SHapley Additive exPlanations (SHAP) (Bussmann et al., 2021; Li & Wu, 2024). However, most existing works concentrate on default prediction at origination, leaving the post-disbursement phase, where lifetime PD, LGD, and EAD must be estimated to support forward-looking provisioning, relatively underexplored (Jacobs, 2020; Botha et al., 2025). This gap is even wider in the Colombian solidarity sector, where limited data volumes and regulatory constraints provide little empirical evidence for the application of machine learning frameworks to provisioning models. Few studies, such as that of Bermudez Vera et al. (2025), have attempted to predict default without fully integrating behavioral data across the loan lifecycle, leaving significant opportunities for methodological advances.
Within this broader landscape, Gambacorta et al. (2024) provide evidence from a Chinese fintech firm showing that machine learning models and non-traditional data improve credit risk prediction, particularly during periods of economic stress. Their findings underscore the potential of combining advanced analytics with contextualized datasets to enhance resilience in credit scoring. Complementing this perspective, Alsuhabi (2024) introduced a novel Topp–Leone exponentiated exponential distribution for financial data, offering new insights into risk modeling through innovative statistical frameworks. Together, these contributions highlight a global research trend toward integrating non-traditional data, advanced models, and distributional innovations into credit risk management. However, applications in cooperative and solidarity-based financial systems, particularly in Latin America, remain scarce.
Given this scenario, there is a clear need to develop more adaptive and technically robust credit scoring models tailored to the operational realities of solidarity institutions. The objective of this research is therefore to apply explainable machine learning techniques to credit rating in the Colombian solidarity sector, and to propose a methodology aligned with the principles of the Basel II IRB approach. To this end, we analyze a dataset of 17,518 members from a cooperative, applying both linear and tree-based regression models, including LightGBM. Model performance is evaluated using root-mean-square error (RMSE), while interpretability is ensured through the SHAP framework. The findings demonstrate that models incorporating continuous variables drawn from real institutional data can generate credit ratings comparable to those produced by the SES regulatory model, while improving transparency and predictive accuracy. The main novelty of this research lies in the adaptation and validation of explainable machine learning models (specifically LightGBM combined with SHAP) for credit risk rating in the Colombian solidarity sector.
LightGBM outperforms linear methods such as ridge regression due to its ability to capture nonlinear relationships, and the SHAP analysis provides actionable insights by tracing the influence of predictors on individual scores. This interpretability, combined with high predictive performance, positions LightGBM not only as a technically sound alternative but also as a management-oriented tool that fulfills traceability and accountability requirements in modern risk management systems. This work thus strengthens credit risk management in solidarity-based financial institutions, fostering regulatory alignment while respecting the sector's social and operational specificities.
The remainder of this article is organized as follows: Section 2 presents the dataset, outlines the regulatory model used in the Colombian solidarity sector, and provides a brief introduction to the machine learning models employed. Section 3 evaluates the predictive performance of the models, namely regularized linear regression (ridge), decision trees, random forests, and LightGBM, and analyzes the contribution of each variable to the prediction of default. Section 4 provides a brief discussion of recent regulatory reforms in the Colombian solidarity sector, illustrating how the proposed models align with the logic of the 2025 regulatory changes. Section 5 presents the conclusions.
3. Results
For the practical implementation of the model, the risk analysis team should begin by requesting from the institution the data corresponding to the 12 variables that make up the model (as described in Appendix A). These variables are related to payment and delinquency behavior and must be collected from a monthly historical record covering the three years prior to the evaluation date. The features related to the historical records of the borrowers are collapsed into a single scalar value: for instance, MORA12, representing the maximum delinquency recorded over the past 12 months. Next, the proposed models should be applied, and the SHAP analysis should be incorporated as described in the following subsections.
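As an illustration, the following sketch shows how a monthly delinquency history could be collapsed into the MORA12 scalar. The column names (member_id, month, dias_mora) and the toy data are assumptions for exposition, not the cooperative's actual schema.

```python
import pandas as pd

# Hypothetical monthly history: one row per member per month, with the
# days past due observed in that month (column names are illustrative).
history = pd.DataFrame({
    "member_id": [101, 101, 101, 102, 102, 102],
    "month": pd.to_datetime(["2024-10-01", "2024-11-01", "2024-12-01"] * 2),
    "dias_mora": [0, 35, 12, 0, 0, 0],
})

cutoff = pd.Timestamp("2024-12-31")
last_12m = history[history["month"] > cutoff - pd.DateOffset(months=12)]

# Collapse the monthly series into a single scalar per member:
# MORA12 = maximum delinquency observed over the past 12 months.
mora12 = last_12m.groupby("member_id")["dias_mora"].max().rename("MORA12")
print(mora12)
```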
3.1. Exploratory Analysis
The purpose of the exploratory data analysis (EDA) was to gain an initial understanding of the dataset’s structural characteristics and to investigate the relationships between the independent variables and the target variable Z (credit rating). To this end, univariate descriptive statistics were computed, and the distributions of variables were visualized through histograms and scatter plots. The associations between features and the target were assessed using the Spearman rank correlation coefficient, which is well suited for detecting monotonic (including nonlinear) relationships.
Based on the results of this analysis, variables that contributed little to predictive performance were excluded. Specifically, the variables TipoCuota and Activo were removed from the debit consumer credit dataset, while Reestr was eliminated from the non-debit consumer credit dataset. This refinement of the feature space aimed to improve the data quality for subsequent modeling steps and to mitigate issues such as multicollinearity and model overfitting.
Figure 2 displays the heatmap corresponding to the Spearman correlation matrix. This visualization captures the monotonic relationships between all predictor variables and the target variable, Z (credit rating). As shown, the target variable does not exhibit strong correlations with most of the predictors; many of the correlation values are represented in light blue or white tones, indicating weak or negligible associations. Features EA and TC show near-zero correlation coefficients, suggesting that they have minimal or no direct influence on the target variable. In contrast, a cluster of features (MORA1230, MORA1260, MORA2430, MORA2460, MORATRIM, and MORA15) exhibits strong positive intercorrelations, as evidenced by the presence of deep red tones. This pattern suggests a high degree of co-movement among different default indicators: increases in default within one time window tend to be associated with increased default across other temporal ranges.
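For readers wishing to reproduce this kind of visualization, a minimal sketch follows. It assumes the prepared data, including the target Z, sit in a pandas DataFrame named df, and uses seaborn for the heatmap.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Spearman captures monotonic (including nonlinear) relationships.
corr = df.corr(method="spearman")

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Spearman correlation matrix (predictors and target Z)")
plt.tight_layout()
plt.show()
```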
3.2. Data Preparation for Model Implementation
The selection of predictor variables was based on the regulatory model, which employs 12 variables for debit consumer credit and 14 for non-debit consumer credit. Unlike the regulatory approach, which predominantly uses binary variables, the proposed model utilizes their continuous values. This methodological shift enables a more nuanced characterization of individual financial behavior, thereby avoiding the loss of information inherent in variable binarization.
Two variables—TC (credit type) and EA (active status)—were excluded due to a lack of variability. The EA variable is highly imbalanced, with 6,658 observations coded as 1 and only two as 0, while TC offers no meaningful differentiation across observations. Both were deemed irrelevant and removed from further analysis. Additionally, the delinquency-related variables (MORA1230, MORA1260, MORA2430, MORA2460, MORATRIM, and MORA15) exhibited strong positive intercorrelations, indicating significant redundancy among them. To address this, they were consolidated into a single variable, MORA12, representing the maximum delinquency recorded over the past 12 months. This transformation reduces multicollinearity while retaining the most informative aspect of recent payment behavior, thereby enhancing the model’s representativeness.
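A minimal sketch of these two steps, assuming the features sit in a DataFrame df with the column names used in the paper:

```python
# Consolidate the correlated delinquency windows into MORA12
# (row-wise maximum) and drop the two near-constant features.
mora_cols = ["MORA1230", "MORA1260", "MORA2430", "MORA2460", "MORATRIM", "MORA15"]

df["MORA12"] = df[mora_cols].max(axis=1)  # most severe delinquency in 12 months
df = df.drop(columns=mora_cols + ["TC", "EA"])
```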
A detailed description of all variables and the modifications applied is provided in Appendix A. However, the most relevant adjustments are summarized below.
For variables reflecting the member’s financial relationship and solvency, the proposed model replaces binary indicators with continuous metrics. For example, instead of using a binary variable to indicate the presence of savings, contributions, or term deposits (CDATs), the model incorporates the actual balance at the time of evaluation, offering a more precise representation of the member’s financial engagement with the institution. Regarding seniority, the model replaces categorical groupings with a continuous variable that measures the length of affiliation in months. This adjustment enhances the granularity of the analysis, allowing for a more accurate assessment of how tenure impacts credit behavior. For payment behavior, a continuous variable capturing the maximum delinquency observed during the evaluation period is used, rather than classifying arrears into predefined risk thresholds as mandated by Supersolidaria. This approach yields a more detailed view of credit history and enhances the model’s predictive accuracy. These refinements result in a reduction in the explanatory variables: from 12 to 9 in the debit consumer credit dataset, and from 14 to 10 in the non-debit consumer credit dataset.
3.3. Models for Debit Consumer Credit
For this credit line, several regression models, including both linear and decision tree-based models, were evaluated. The best results obtained are presented below.
3.3.1. Linear Regression Model
The linear model that produced the best results was a ridge regression model with λ = 0.010 and a power transformation of the numerical variables. With this model, a mean validation RMSE of 0.338, a test RMSE of 0.344, and an R2 of 0.640 were obtained. The equation resulting from this model was
All variables were transformed using the Yeo–Johnson method (Yeo & Johnson, 2000), except for the SINMORA variable, which is binary.
As previously observed in Equations (1) and (2), the regulatory model incorporates a large number of explanatory features. In contrast, the model based on ridge regression, developed using real historical data, presents a simpler structure, retaining only those variables that proved statistically significant under the L2 penalty.
The residuals of this model had a mean of 5.59 × 10−4 and a standard deviation of 0.344.
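A sketch of how this configuration could be assembled with scikit-learn follows. The feature lists and the training split names (X_train, y_train) are assumptions, while the Yeo–Johnson transform and λ = 0.010 follow the text.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

numeric = ["MORA12", "ANTI", "COOCDAT", "AP"]  # continuous predictors
binary = ["SINMORA"]                           # left untransformed (binary)

# Yeo-Johnson, unlike Box-Cox, accepts zero and negative values.
preprocess = ColumnTransformer([
    ("yeo_johnson", PowerTransformer(method="yeo-johnson"), numeric),
    ("keep", "passthrough", binary),
])

ridge = Pipeline([
    ("preprocess", preprocess),
    ("model", Ridge(alpha=0.010)),  # alpha plays the role of lambda
])
ridge.fit(X_train, y_train)
```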
3.3.2. Tree-Based Models
With a single decision tree, the best model obtained had a complexity coefficient (ccp_alpha) of 1.317 × 10−2. With this model, the mean validation RMSE was 0.248, the test RMSE was 0.248, and the R2 in the test set was 0.813. It is worth clarifying that no variable preprocessing was applied to this or to any other tree-based model.
The most important feature of this model was MORA12, with a score of 0.674, followed by ANTI (score = 0.136) and SINMORA (score = 0.101). Finally, the COOCDAT score was 0.089, and the AP score was 0. The residuals of this model had a mean of 1.06 × 10−2 and a standard deviation of 0.248.
Random forest models were also evaluated, with the best performer having 226 estimators and a ccp_alpha of 1.02 × 10−2. With this model, the mean validation RMSE was 0.237, the test RMSE was 0.240, and the R2 in the test set was 0.825. The most important feature of this model was MORA12, with a score of 0.673, followed by ANTI (score = 0.140) and SINMORA (score = 0.097). Finally, the COOCDAT score was 0.090, and the AP score was 0. The residuals of this model had a mean of 7.99 × 10−3 and a standard deviation of 0.24.
XGBoost models were also evaluated. Of these, the one with which the best results were obtained had the following hyperparameters: colsample_bytree: 0.755, gamma: 0.170, learning_rate: 0.156, max_delta_step: 2, max_depth: 4, min_child_weight: 8, n_estimators: 422, reg_alpha: 1.009 × 10−2, reg_lambda: 4.787, and subsample: 0.624. With this model, the mean validation RMSE was 0.219, the test RMSE was 0.228, and the R2 in the test set was 0.841. In this model, the residuals’ mean was 3.23 × 10−3, and the standard deviation was 0.227. The most important feature of this model was SINMORA, with a score of 0.593, followed by MORA12 (score = 0.170) and COOCDAT (score = 0.166). Lastly, the ANTI feature had a score of 0.054, and AP was 0.018.
The last family of models evaluated was the LightGBM models. Of these, the one that gave the best results had the following hyperparameters: colsample_bytree: 0.648, learning_rate: 0.071, max_depth: 13, min_child_samples: 30, n_estimators: 134, num_leaves: 74, reg_alpha: 1.592, reg_lambda: 0.277, and subsample: 0.518. With this model, the mean validation RMSE was 0.215, the test RMSE was 0.223, and the R2 in the test set was 0.849. In this model, the residuals had a mean of 4.27 × 10−3 and a standard deviation of 0.222. The most important variable in this model was MORA12 (normalized score = 0.541), followed by SINMORA (normalized score = 0.191) and ANTI (normalized score = 0.133). Finally, the COOCDAT score was 0.093, and the AP score was 0.043.
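For reference, the reported configuration translates directly into the LightGBM Python API. The data names (X_train, y_train, X_test) and the random seed are assumptions.

```python
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(
    colsample_bytree=0.648,
    learning_rate=0.071,
    max_depth=13,
    min_child_samples=30,
    n_estimators=134,
    num_leaves=74,
    reg_alpha=1.592,
    reg_lambda=0.277,
    subsample=0.518,
    random_state=42,  # assumed; not reported in the text
)
lgbm.fit(X_train, y_train)
predictions = lgbm.predict(X_test)
```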
Table 1 summarizes the results of the metrics obtained with each model evaluated. It can be observed that the model yielding the best results is LightGBM. However, the other ensemble models evaluated (random forests and XGBoost) also provided similar results in terms of the RMSE metric.
It can be observed that the ridge model tends to make reasonably accurate predictions for values less than or equal to −3, but for values greater than −3 it tends to predict values close to −3. For the ensemble models, on the other hand, Figure 3 shows that predictions are reasonably accurate for values less than or equal to −1, while for values greater than this they tend to cluster close to −1.
Figure 4 shows the relative importance of the features in each tree-based model evaluated. It can be seen that, for three of these models, the most important feature is MORA12; only for the XGBoost model is it SINMORA. All models agree that the AP feature is the least important, to the point that, for the decision tree and random forest models, it has zero importance. Although they are not directly comparable with the sizes of the coefficients of each feature in the linear ridge model, it is interesting to note that, in this model, the highest coefficient in absolute value corresponds to the SINMORA feature (0.800), followed by MORA12 (0.178) and COOCDAT (0.154), coinciding with the order of importance of the XGBoost model.
3.3.3. SHAP Global Analysis of the LightGBM Model
Given that the LightGBM model yielded the best results, and that it is essential for the entity to understand how the model’s characteristics influence its predictions, an interpretability analysis was conducted using SHAP. This analysis facilitates the justification of decisions to associates or users, thereby promoting trust and confidence among them.
First, a global interpretability analysis was conducted, the results of which are shown in Figure 5, to understand how the variables impact the predictions. Low values in the ANTI and SINMORA features result in an increased predicted value, indicating a higher risk. Additionally, high values in the MORA12 feature also increase the prediction value. Finally, high values in COOCDAT cause the prediction value to decrease; that is, subjects with high values in this feature tend to have a better risk rating.
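A minimal sketch of how such a global SHAP summary can be produced for the fitted LightGBM model (here called lgbm, with held-out features X_test; both names are assumptions):

```python
import shap

# TreeExplainer computes exact SHAP values for tree ensembles such as LightGBM.
explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(X_test)

# Beeswarm summary: each dot is one member; position encodes the impact
# on the predicted rating Z, color encodes the feature value.
shap.summary_plot(shap_values, X_test)
```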
A similar analysis, conducted only for high-risk loans (B rating or higher), reveals that, in this case, the most significant characteristics are MORA12 and SINMORA, with the other characteristics having a minor impact.
Comparing the order of feature importance obtained from the global SHAP analysis with that from LightGBM's own method reveals a difference. This is explained by the fact that, while LightGBM calculates the relative importance of the features from the gain each of them provides in optimizing the loss function, SHAP calculates the marginal contribution of each feature, which can be distorted by strongly correlated variables (Holzinger et al., 2022), as is the case here with the MORA12 and SINMORA features. We chose to retain both variables, despite their high correlation, because excluding either noticeably degraded the performance of all models.
3.3.4. SHAP Local Analysis of LightGBM Model
With SHAP, local analysis can also be performed, identifying how each feature influences an individual prediction. For example, Figure 6 illustrates the case of a low-risk subject. For this subject, the prediction was −4.823, lower than the base prediction of −4.6 (the base value corresponds to the average of the values of the target variable). The lower prediction is driven by the values of the features ANTI (2185, a high value), SINMORA (1, a high value), and MORA12 (0, a low value). On the other hand, the value of COOCDAT (0, a low value) pushes the prediction upward; that is, it acts in the opposite direction to the final prediction.
It is important to note that, in a local analysis, the order of relative importance of the features does not necessarily coincide with the order of the global analysis.
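A waterfall plot like the one in Figure 6 can be obtained as follows (a sketch; lgbm and X_test are the assumed model and feature matrix, and i indexes the member being explained):

```python
import shap

i = 0  # index of the member whose prediction is being explained
explanation = shap.Explainer(lgbm)(X_test)  # Explanation object with base values
shap.plots.waterfall(explanation[i])        # per-feature contributions to f(x)
```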
3.4. Models for Non-Debit Consumer Credit
For non-debit consumer credit, several regression models, both linear and based on decision trees, were evaluated. The best results obtained are presented below.
3.4.1. Linear Regression Model
The linear model that yielded the best results was a ridge model with λ = 0.164, a power transformation of the numerical variables, and the elimination of outliers (observations above the 99th percentile) for the AP and SALPRES variables. With this model, a mean validation RMSE of 1.30, a test RMSE of 1.34, and an R2 in the test set of 0.768 were obtained. The equation resulting from this model was
The residuals of this model had a mean of 3.01 × 10−2 and a standard deviation of 1.344.
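The outlier rule described above amounts to a simple percentile filter; a sketch, assuming the raw features sit in a DataFrame df:

```python
# Drop observations above the 99th percentile of AP and SALPRES
# before fitting the ridge model.
for col in ["AP", "SALPRES"]:
    p99 = df[col].quantile(0.99)
    df = df[df[col] <= p99]
```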
3.4.2. Tree-Based Models
With a single decision tree, the best model obtained had a complexity coefficient (ccp_alpha) of 2.21 × 10−2. With this model, the mean validation RMSE was 0.486, the test RMSE was 0.476, and the adjusted R2 in the test set was 0.971. It is worth clarifying that neither this model nor the following ones underwent any preprocessing of variables.
The most important variable in this model was MORA12, with a score of 0.974, followed by ANTI (score = 0.015) and AP (score = 0.012). Finally, the scores of SALPRES and Activo were 0. The residuals of this model had a mean of −1.50 × 10−2 and a standard deviation of 0.476.
With random forest models, the one that gave the best results had 316 estimators and a ccp_alpha of 2.96 × 10−4. With this model, the mean validation RMSE was 0.296, the test RMSE was 0.280, and the R2 in the test set was 0.990. In this model, the residuals’ mean was −5.24 × 10−4, and the standard deviation was 0.280. The most important feature of this model was MORA12, with a score of 0.955, followed by ANTI (score = 0.021) and Activo (score = 0.009). Finally, the AP feature had a score of 0.008, and SALPRES had a score of 0.007.
With XGBoost models, the one that gave the best results had the following hyperparameters: colsample_bytree: 0.982, gamma: 1.555, learning_rate: 0.152, max_delta_step: 4, max_depth: 9, min_child_weight: 8, n_estimators: 266, reg_alpha: 1.986 × 10−3, reg_lambda: 2.616, and subsample: 0.987. With this model, the mean validation RMSE was 0.332, the test RMSE was 0.294, and the R2 in the test set was 0.989. In this model, the residuals’ mean was 6.80 × 10−5, and the standard deviation was 0.295. The most important feature of this model was MORA12, with a score of 0.918, followed by Activo (score = 0.034) and AP (score = 0.021). Finally, SALPRES had a score of 0.006.
The last model evaluated, and the one that gave the best results, was a LightGBM regression model. With this model, the mean validation RMSE was 0.288, the test RMSE was 0.276, and the adjusted R2 in the test set was 0.990. In this model, the residuals’ mean was 1.75 × 10−3, and the standard deviation was 0.272. The values of the hyperparameters tuned in this model were as follows: colsample_bytree: 0.963, learning_rate: 0.245, max_depth: 11, min_child_samples: 16, n_estimators: 168, num_leaves: 47, reg_alpha: 6.660, reg_lambda: 0.938, and subsample: 0.818. The most important feature of this model was MORA12, with a normalized score of 0.966, followed by ANTI (score = 0.017) and AP (score = 0.014). Lastly, the SALPRES feature had a score of 0.003, and Activo had a score of 0.000.
Table 2 summarizes the results of the metrics obtained with each model evaluated. It can be observed that the model yielding the best results is LightGBM. However, the other ensemble models evaluated (random forests and XGBoost) also provided satisfactory results in terms of the RMSE metric.
Figure 7 shows the scatter plots of the actual values versus the values estimated by the different models. It can be observed that the ridge and decision tree models exhibit high variability in their predictions, as evidenced by their high RMSE values. Paradoxically, however, this variability is not reflected in the R2 values. The ensemble models, by contrast, tend to make more accurate predictions and exhibit a lower level of variability, particularly for high values of Z (greater than 0).
Despite the higher R2, ensemble models are not overfitted, as a visual inspection reveals a high correlation between the estimated and real values, albeit not a perfect one. In addition, the validation and test RMSEs are similar in all cases, which would not be the case if the model were overfitted.
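The overfitting check described here can be reproduced by comparing cross-validated and held-out RMSE; a sketch with assumed data and model names:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# A large gap between validation and test RMSE would indicate overfitting.
cv_rmse = -cross_val_score(
    lgbm, X_train, y_train,
    scoring="neg_root_mean_squared_error", cv=5,
).mean()
test_rmse = np.sqrt(mean_squared_error(y_test, lgbm.predict(X_test)))
print(f"validation RMSE: {cv_rmse:.3f}  test RMSE: {test_rmse:.3f}")
```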
Figure 8 shows the relative importance of the features in each tree-based model evaluated. It can be seen that, in all of these models, the most important feature is MORA12. It should also be noted that, in general, all other features have little importance in the models.
Although they are not directly comparable with the sizes of the coefficients of each feature in the linear ridge model, it is interesting to note that, in this model, the highest coefficient, in terms of absolute value, corresponds to the MORA12 feature (2.104), followed by Activo (1.421) and AP (0.380), coinciding (again) with the order of importance of the XGBoost model.
3.4.3. Global SHAP Analysis of LightGBM Model
As with the debit consumer portfolio, an interpretability analysis was conducted using SHAP. The global analysis (see Figure 9) shows that low values of MORA12 have a negative impact on the value of Z, while high values have a positive impact. Low values of ANTI and AP also have a positive impact, albeit not as significant as that of MORA12. In contrast, low SALPRES values have a negative impact on the Z value.
On the other hand, for subjects with an elevated level of risk, it can be observed that the most important characteristic remains MORA12; however, ANTI and AP exchange their positions, although the direction of the impacts remains the same as in the general case.
In this case, the order of feature importance given by the LightGBM and SHAP methods coincides over the set of all subjects. However, for high-risk subjects, SHAP finds that the second most important feature is AP rather than ANTI.
3.4.4. SHAP Local Analysis of LightGBM Model
Figure 10 corresponds to a SHAP waterfall plot visualization, which enables the decomposition and analysis of the prediction generated by the model for a particular individual, allowing for the accurate interpretation of the contribution of each explanatory feature to the model’s results.
For example, we present a local analysis conducted on a high-risk individual. The model has a base prediction of −2.073. From this reference point, the marginal contributions of each customer feature modify the prediction until a final value of 2.031 is reached. The feature with the most significant impact is MORA12 (a value of 270 days, about 9 months, of delinquency), which contributes +3.94 units to the prediction, evidencing a strong association between high levels of default in the last 12 months and the increase in the credit risk rating assigned by the model. This feature dominates the explanation of the result, suggesting that the history of recent defaults is the primary determinant of the risk profile in this case. Other features have marginal impacts: ANTI (tenure at the institution of 1360 days) increases the prediction by +0.22.
On the other hand, the AP feature (a value of USD 216,274 in available contributions) has a mitigating effect, with a contribution of −0.12; this financial feature reduces the perceived risk. Finally, SALPRES (a ratio of outstanding balance to loan value of 0.778) provides a small positive contribution (+0.05) without significantly affecting the model’s decision. The weighted sum of these effects enables the model to adjust its prediction from the base value to an individualized output, which, in this case, represents an adverse credit rating.
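Because SHAP values are additive, this waterfall decomposition can be verified numerically: the base value plus the sum of the per-feature contributions must equal the model's prediction. A minimal check, with the assumed names lgbm and X_test:

```python
import numpy as np
import shap

explanation = shap.Explainer(lgbm)(X_test)
i = 0  # the individual being explained
reconstructed = explanation.base_values[i] + explanation.values[i].sum()
assert np.isclose(reconstructed, lgbm.predict(X_test)[i])
```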
4. Discussion
Recent regulatory reforms in Colombia’s solidarity sector reflect a global trend of alignment with the Basel II and III agreements. There is a significant transition between the reference model used in 2024 and the new model adopted in 2025 for credit risk assessment. The 2024 model, described in Section 3 using Equations (1) and (2), incorporates multiple binary variables that represent the structural characteristics of the associate. On the other hand, the 2025 model, as seen in Equations (5) and (6), focuses more clearly on variables related to credit behavior, especially those associated with non-performing loans.
Debit consumer credit 2025:
Non-debit consumer credit 2025:
The findings of this study highlight the practical and regulatory value of explainable machine learning models in improving credit risk assessment within Colombia’s solidarity financial sector. The ridge regression model, which prioritizes variables directly associated with borrowers’ repayment behavior, particularly their delinquency history, shows a clear alignment with the internal ratings-based (IRB) principles of Basel II. By leveraging continuous variables and avoiding the discretization common in traditional regulatory models (e.g., binary indicators for default in specific timeframes), Ridge retains the granularity of the original data, aligning with the logic of the 2025 regulatory update by the Superintendence of the Solidarity Economy (SES), which encourages the use of behavioral variables over rigid rule-based inputs. Ridge’s simplicity, transparency, and low dimensionality make it especially appropriate for cooperatives with limited technical capacity, offering a viable entry point for the gradual adoption of advanced internal risk models.
In contrast, the LightGBM model stands out for its good predictive performance. It achieved an RMSE of 0.224 and an adjusted R2 of 0.847 in the debit portfolio, and an RMSE of 0.272 with an adjusted R2 of 0.990 in the non-debit portfolio. These results substantially surpass those of linear models and simple decision trees, confirming LightGBM’s capacity to capture complex interactions and nonlinear patterns that traditional models often miss. This outcome is consistent with the existing literature on the superior predictive power of boosting methods such as LightGBM and XGBoost in credit scoring (Gatla, 2023). Furthermore, LightGBM provides operational flexibility by allowing the model to be retrained periodically or updated with new indicators without awaiting formal regulatory revisions, an advantage in dynamic credit environments.
However, the results for the R2 metric must be interpreted with caution, since several previous studies have shown that it is ill suited to evaluating nonlinear predictive models such as ours, as it tends to yield inflated values (Sapra, 2014; Book & Young, 2006). We nevertheless report it because it is a widely used metric in economics and finance. Consequently, all of our analyses were based on the RMSE metric, which does not share the problems of R2, and on visual analysis of the scatter plots of estimated versus actual values in the test set.
Another important consideration is that, since the Superintendency model applies the sigmoid function to the target variable (Z) and then discretizes it into five ranges to grant a risk level rating, it is likely that, for subjects whose estimated values of Z are close to the limits between ranges, the level of risk estimated with our models will be different from that obtained with the models of the Superintendency.
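To make this mapping concrete, it can be sketched as a sigmoid followed by binning into five grades. The cut points below are illustrative placeholders, not the Superintendency's official thresholds.

```python
import numpy as np
import pandas as pd

def rating_from_z(z, cuts=(0.10, 0.25, 0.50, 0.75)):
    """Map the score Z to a risk grade: sigmoid, then five ranges.

    The cut points are illustrative, not the official ones; subjects whose
    sigmoid(Z) lands near a cut may switch grade between models."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))  # sigmoid of Z
    return pd.cut(p, bins=[0.0, *cuts, 1.0], labels=list("ABCDE"))

print(rating_from_z([-4.8, -2.0, 0.0, 2.0]))
```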
Comparing the models obtained for the two credit modalities reveals that we achieved better results for non-debit consumer credit, not because of the higher R2 values, but because these models perform better in the upper Z range, which corresponds to a higher level of risk. We also found that, for this credit type, all models, and SHAP as well, agree that the most relevant feature is MORA12, with the others being far less important. For debit consumer credit, there are divergences in the order of relative feature importance: while for three models the most important feature is MORA12, for the other two (and for SHAP) it is SINMORA. These differences may be due, among other reasons, to the high collinearity between these variables, which we attempted to address by excluding some of them or by applying techniques such as PCA to obtain new linearly independent features, but without satisfactory results.
A common concern with high-performing machine learning models is their lack of interpretability. This limitation was effectively addressed through the integration of SHAP (SHapley Additive exPlanations), which decomposes each prediction into the individual contributions of its explanatory variables. SHAP values make it possible to meet supervisory expectations of transparency and accountability by enabling auditors and regulators to trace the rationale behind each credit decision, including adverse ratings or rejections. Additionally, SHAP enables institutions to audit the influence of sensitive variables and to identify and mitigate potential biases. SHAP-based explanations also offer actionable value: institutions can provide personalized feedback to applicants, target financial education efforts toward specific risk factors, and refine credit policies based on behavioral insights. SHAP values also help adjudicate borderline credit decisions, providing a defensible basis for approvals or denials during reviews or appeals. This level of explainability aligns with principles of social accountability, traceability, and cooperative member engagement, further reinforcing the ethical dimension of the proposed framework.
Although SHAP has been primarily applied in this study for interpretability and internal decision support, its potential in regulatory and governance contexts is increasingly recognized in both academic and applied settings. For example, Bussmann et al. (2021) showed that SHAP can be integrated into model governance workflows to document variable importance and ensure consistency in decision-making processes. In our context, SHAP values can be used to generate standardized explanation reports for internal audit trails, identifying the primary risk drivers for each credit decision. These explanations can serve as supporting documentation during supervisory reviews or appeals, particularly in borderline or adverse decisions. A stylized use case is presented in Figure 6 (Section 3.3.4), where SHAP values are used to justify a low-risk classification by detailing the contribution of each feature. This level of detail could be integrated into credit committee reports, automated decision dashboards, or client-facing disclosure formats, reinforcing institutional accountability.
The ridge and LightGBM models can be applied at every stage of the loan lifecycle. During origination, they enhance credit assessments with richer behavioral inputs while maintaining interpretability. In the monitoring phase, they support dynamic risk reassessment using updated borrower data, enabling early identification of deterioration. Critically, the models allow for the estimation of expected credit loss (ECL), leading to more accurate provisioning practices and stronger financial management. In the collection and resolution phase, risk estimates inform targeted recovery strategies. While easing capital adequacy requirements under the IRB approach is a long-term regulatory incentive, the immediate value of the proposed framework lies in improving credit decision quality, portfolio oversight, and borrower engagement.
It is important to emphasize that the implementation of internal models, as proposed in this study, does not inherently result in an automatic or unjustified reduction in regulatory capital requirements. Under the Basel II IRB approach, such a reduction is only warranted when internal models demonstrate superior accuracy in estimating expected credit losses, thereby reducing uncertainty and improving risk-adjusted solvency. Although this study does not include a quantitative simulation of capital impacts (an area identified for future research), the high explanatory power of LightGBM (with R2 values near 0.99) indicates that the model captures a significant portion of credit loss variability, laying a technical foundation for more efficient capital use. However, adopting these models must not lead to a weakening of prudential standards. Instead, it allows for more effective capital allocation, redirecting resources to lower-risk segments or strengthening risk mitigation strategies. Provided that models are robust, validated, and subject to adequate supervision, capital optimization can be achieved without increasing credit risk or compromising institutional soundness. Ultimately, the proposed explainable ML framework complements rather than contradicts regulatory standards, enabling a shift toward more adaptive, transparent, and ethically grounded credit risk management in the solidarity sector.
5. Conclusions
In this paper, we propose an alternative model to compute a credit rating for the borrowers of financial cooperatives in Colombia. Notably, the LightGBM model demonstrates a superior ability to capture the complexity of credit behavior in this sector. One of the main strengths of this approach, in comparison with the reference model, is the use of continuous financial variables, which enable the detection of subtle differences between members. A key point in terms of practical applicability is the incorporation of SHAP analysis, which facilitates the interpretation of the credit scores generated by LightGBM. SHAP could enable cooperatives to identify the variables that have the most significant impact on the score, and to communicate this information clearly to members, managers, control entities, and supervisors. This makes LightGBM not only a highly accurate model but also a tool aligned with the principles of traceability and accountability required in modern risk management systems.
The implementation of explainable machine learning models across the loan lifecycle enhances credit risk management by improving risk assessment, enabling dynamic monitoring through expected credit loss estimation and better provisioning, and supporting tailored recovery strategies. Unlike the reference model, which defines parameters for adjusting to national regulations in a generic way, our model was built from information specific to a solidarity sector entity. Our methodology makes it possible to identify which variables influenced a specific rating and which ones are most sensitive for the entity in its credit management process.
This research opens multiple lines of future work: the incorporation of alternative data (such as transactional information or digital behaviors); the use of deep learning models combined with SHAP-based interpretability; the multicenter validation of the model using data from different cooperatives across the country; and the development of complementary models that estimate expected loss by integrating exposure at default, loss severity, and probability of default at the associate or segment level, based on the scores obtained from the current models.