Journal of Risk and Financial Management
  • Article
  • Open Access

10 November 2025

Optimizing the Collection Process in Credit Risk Management: A Comparison of Machine Learning Techniques for Predicting Payment Probability at Different Stages of Arrears

Artificial Intelligence and Computer Vision Research Laboratory, Departamento de Informática y Ciencias de la Computación, Escuela Politécnica Nacional, Quito 170150, Ecuador
* Author to whom correspondence should be addressed.
This article belongs to the Section Mathematics and Finance

Abstract

In credit risk, scoring models based on logistic regression have been developed to optimize default risk assessment. However, these models require complex feature engineering, and their accuracy worsens as arrears progress. This study proposes the use of machine learning techniques (XGBoost and artificial neural networks) to generate scores in different arrears segments (No Arrears Segment, 1–30 Days of Arrears Segment, 31–90 Days of Arrears Segment, and All Segments). The Kolmogorov–Smirnov (KS) metric is used to assess the efficiency and predictive power of the models. To ensure the accuracy and reliability of the models, a five-step methodology is employed: it starts with the formulation of the problem, followed by the selection of a data sample and the definition of the target variable; a descriptive analysis of the data is then performed to support data cleaning; subsequently, the models are trained and tested; and finally, the results are analyzed and the resulting models are interpreted. The results show that both XGBoost and artificial neural network models outperform logistic regression in most of the arrears segments. In the No Arrears Segment, the XGBoost model performs best with KS = 63.36%. In the 1–30 Segment, XGBoost is also the best with KS = 51.38%. In the 31–90 Segment, the artificial neural network model is the best with KS = 38.77%. Finally, with all arrears segments combined, the XGBoost model is again the best with KS = 74.05%.

1. Introduction

1.1. Problem Statement

In mass credit management, scoring models have proven to be the most valuable tool for the past two decades. By analyzing historical data, these models provide predictions of future behavior, which help control portfolios with greater accuracy and less uncertainty. A scoring model considers numerous variables simultaneously, which helps to establish a pattern and group members together based on their likelihood of experiencing an event. These models work best when dealing with large volumes of data with relatively homogeneous values. It is important to note that scoring models are designed to identify patterns and groupings rather than to provide precise predictions for individual cases (see Figure 1).
Figure 1. Credit scoring scheme.
Due to its ease of interpretation, logistic regression is the preferred technique for this kind of problem. However, with the increase in available information, the growth in the number of debtors over time, and the growing number of lenders, obtaining a score with high predictive capacity has become a much more complex task.
To measure which model has greater predictive power, the Kolmogorov–Smirnov (KS) statistic is used (), which measures how different or how far apart two probability distributions are. A model is said to have greater predictive power if its KS value is closer to 1 (see Figure 2).
Figure 2. Representation of the separation or divergence of two probability distributions.

1.2. Scoring Models for Credit Collection

Effective collections management is a critical aspect of managing large credit portfolios. It affects not only customer interactions but also collections operations. Without agile and effective tools, customers are more likely to react negatively to their payment obligations. In the credit world, it is commonly understood that as the age of arrears increases, the chances of recovering funds decrease. It is therefore imperative to develop effective strategies to prevent this situation from occurring ().
Typically, collections management relies on portfolio segmentation and customer contact channels to determine the appropriate actions based on the number of days in arrears, the type of product, and the likelihood of payment. This approach may involve a variety of methods such as phone calls, field visits, text messages, emails, or letters to encourage customers to meet their obligations and ensure a positive outcome (see Figure 3).
Figure 3. Example of a credit collection strategy.
When implementing scoring models for collections, it is crucial to consider the number of days that have passed since the loan was due in order to determine whether it is still recoverable. Hence, it is recommended to segment the portfolio into 30- or 15-day arrears ranges based on the loan disbursement terms. Therefore, the following arrears segments are possible:
  • 0 No arrears segment
  • 1–30 segment
  • 31–60 segment
  • 61–90 segment
  • 91–120 segment
  • More than 120 segment
The more-than-120-days range is considered a loss segment.
As arrears increase, it is crucial to differentiate between segments and design a scoring model that distinguishes reliable payers from those who default on payments. It is essential to keep in mind that as arrears increase, the number of individuals decreases, which may impact the model’s predictive ability. Fortunately, the scoring model’s ability to discriminate can be measured using the Kolmogorov–Smirnov (KS) metric.

1.3. Literature Review

This section reviews some efforts to predict the probability of client default in order to make informed lending decisions. We explore various modeling techniques, including logistic regression, XGBoost, and artificial neural networks, and evaluate their performance using the Kolmogorov–Smirnov (KS) statistic.

1.3.1. Credit Scoring with Logistic Regression

Logistic regression is a frequently utilized statistical method in the credit rating industry because of its capability to model the likelihood of a binary event, such as a credit default (). This model is renowned for its interpretability and predictive accuracy. Recent research has shown that logistic regression has achieved remarkably high accuracy in forecasting credit risk, reaching up to 99% in certain instances ().

1.3.2. Credit Scoring with XGBoost

XGBoost, a decision tree-based machine learning algorithm, is widely acclaimed for its exceptional performance in classification and regression tasks. This model excels at uncovering intricate patterns in data and has demonstrated successful applications in credit risk prediction (). However, there have been instances where XGBoost has underestimated credit risk, indicating the necessity for further refinements and validations ().

1.3.3. Credit Scoring with Artificial Neural Networks

Artificial neural networks are adept at capturing non-linear and complex relationships between variables, making them well-suited for predicting credit defaults (; ). In comparative studies, neural networks have demonstrated performance comparable to logistic regression, achieving an accuracy of 71% in training and 72% in testing ().

1.3.4. Performance Evaluation

The Kolmogorov–Smirnov (KS) statistic is a metric utilized to evaluate the predictive performance of scoring models. It measures the difference between the cumulative distributions of good and bad payers’ scores, indicating the degree of distinction between the two sets of scores. Recent studies have incorporated KS alongside other metrics like the area under the ROC curve and the GINI test to evaluate and compare the effectiveness of various models (; ).

1.3.5. Comparison of Models

In comparing models for credit risk prediction, logistic regression, neural networks, and XGBoost have all been identified as suitable techniques. Each method has its own strengths and weaknesses. Logistic regression is highly interpretable and exhibits high accuracy in predicting credit extension (; ). XGBoost, on the other hand, delivers high performance and can handle large amounts of data, although it may require additional adjustments to prevent underestimating risk (). While neural networks are complex, they can capture non-linear relationships and have demonstrated performance comparable to logistic regression (; ).
In sum, selecting the most suitable modeling technique for credit scoring and collections depends on various factors such as interpretability, data volume, and predictive capability. Viable techniques include logistic regression, XGBoost, and artificial neural networks, each with their distinct advantages and limitations. Evaluating performance using metrics like the Kolmogorov–Smirnov statistic is essential for assessing the efficacy of each model in predicting credit risk.
This study, unlike previous studies, explores the effectiveness of conventional logistic regression compared to two machine learning techniques, extreme gradient boosting (XGBoost) and artificial neural networks (ANNs), in predicting the ability to pay at different delinquency levels of a large retail loan portfolio. Using the KS metric, the performance of each model at each arrears segment is measured.
Currently, efforts are focused on feature engineering to find better predictors to maximize the discrimination of logistic regression-based models. The objective is often to estimate the probability that a customer will pay off his debt, determine his risk level in case of loan approval, or assess the cost-effectiveness of offering several products to the same person.
In this paper, we focus on maximizing the predictive power of the model by testing different supervised learning techniques in the recovery phase of loans at different arrears segments.

2. Methodology

2.1. Kolmogorov–Smirnov Test

The K-S test, or Kolmogorov–Smirnov test, is a non-parametric method utilized to assess the similarity of two distinct continuous distributions. It evaluates the hypothesis of whether or not they are identical. The KS statistic is computed by employing the cumulative empirical distribution function ().
\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{x_i \le x\}} \quad (1)
Consider two samples x_s and y_s of sizes n_1 and n_2, respectively, with cumulative distribution functions F_1 and F_2 of a continuous random variable X. The KS test is used to test the following hypotheses:
H_0: F_1(x) = F_2(x) \ \forall x \qquad \text{vs.} \qquad H_1: F_1(x) \ne F_2(x) \text{ for some } x
Based on the use of the empirical cumulative distribution function (1), the KS statistic is used to test the null hypothesis H_0. Its value is obtained using the following expression:
KS = \max_x \left| \hat{F}_1(x) - \hat{F}_2(x) \right|
The notation \hat{F}_1 represents the empirical distribution function of x_s, and \hat{F}_2 represents the empirical distribution function of y_s. If the KS statistic is greater than the critical value KS_α for a given significance level α, we reject the null hypothesis H_0. In (), you can find a table of critical values for different sample sizes.
Then, the KS statistic is a measure of divergence between the distributions of two variables (see Figure 2). It is the maximum distance between F_1 and F_2, and its value ranges between 0 and 1. Values close to 0 indicate that the distributions of x_s and y_s are identical, while values close to 1 indicate that the distributions of x_s and y_s differ. Therefore, the KS statistic is useful for distinguishing the differences between two distributions.
In our current project, we aim to determine the classification technique that achieves the highest KS value. We will compare the results of logistic regression, extreme gradient boosting, and artificial neural networks. The predicted score values for individuals categorized as good customers (0) will be represented by x_s, while the predicted score values for individuals categorized as bad customers (1) will be represented by y_s.
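To make the use of the statistic concrete, the following minimal Python sketch (not part of the original study, whose references suggest it was carried out in R) computes the KS value between two illustrative score distributions for good and bad payers; the beta-distributed scores are purely synthetic placeholders.

```python
# Illustrative sketch: KS statistic between the score distributions of good and
# bad customers. The beta-distributed scores below are synthetic placeholders;
# in practice they would be the model's predicted probabilities on a test set.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
scores_good = rng.beta(2, 5, size=4000)   # placeholder scores for good payers (Y = 0)
scores_bad = rng.beta(5, 2, size=400)     # placeholder scores for bad payers (Y = 1)

# ks_2samp computes max_x |F1_hat(x) - F2_hat(x)| between the two empirical CDFs
ks = ks_2samp(scores_good, scores_bad).statistic
print(f"KS = {ks:.2%}")
```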

2.2. Data Sample Selection

When building a scoring model, it is important to collect as much information as possible that has been generated during the life of the loan. At the collection stage, it is especially essential to have data on payment behavior at the lending institution, information on the social and demographic profile of the customer, and data on collection performance. Sometimes it may also be useful to include information on payment behavior at other financial institutions.
For our study, we have focused on loans granted directly to consumers in the retail sector, which includes credit for items such as televisions, computers, technology, household appliances, and other consumer goods. To explain the information needed to develop the scoring models, consider Figure 4. The period of time prior to the observation point is called the behavioral window, which cannot be longer than 36 months according to the provisions of the Superintendence of Banks and Insurance of Ecuador.
Figure 4. Historic data selection.
Normally, a 36-month history is used when scoring models are created. However, during the collection phase, using such a long history can be counterproductive. This is because the collection stage is much more dynamic and unpredictable, and mistakes can be made by considering outdated payment behavior that may not reflect the current situation. As a result, it can be difficult to predict a customer’s next payment, so 12 months of history are used. During this period, variables related to the individual’s credit history are generated, such as payment and borrowing habits, maximum and average delinquency, open transactions, telephone transactions, actual telephone contacts, card payments, consumption amounts, etc. At the observation point, socio-demographic variables such as age, marital status, province, region, etc., are generated.
After the observation point, we evaluate an individual’s payment behavior during a period called the “performance window”. This window provides crucial information that helps us define good and bad individuals (dependent variable Y). Since payments are made monthly, we use a one-month window to determine whether a payment has been made or not. After that, we evaluate the individual’s payment behavior over a period of 6 months.

2.3. Dependent Variable Setting

The dependent variable Y is binary, taking the value 0 for individuals classified as a “Good Customer” and 1 for those classified as a “Bad Customer”. Two alternative definitions are considered to evaluate model performance. The first definition relies on payment events observed within a one-month performance window, whereas the second focuses on monthly payment behavior over a six-month performance window.
The adoption of both definitions is motivated by operational practices in credit risk management, where collection strategies are commonly structured according to delinquency ranges to maximize the recovery of overdue installments. Nevertheless, in certain contractual arrangements—such as loans granted with grace periods or those subject to renewals—the first payment obligations may not arise until several months after disbursement. In such cases, disregarding delinquency ranges would be conceptually misleading, as the absence of arrears does not necessarily imply the absence of credit risk. Instead, early identification of potential non-payment patterns allows for the implementation of proactive and preemptive collection strategies, thereby enhancing risk mitigation efforts.
Definition 1.
Payment event
Y_1 = \begin{cases} 0, & \text{if a complete installment was paid} \\ 1, & \text{otherwise} \end{cases}
Definition 2.
Monthly payment behavior
Y_2 = \begin{cases} 0, & \text{if paid before 30 days of arrears} \\ 1, & \text{otherwise} \end{cases}
With Y_1, the aim is to have as few clients as possible switch to higher arrears ranks. This definition will be used in each arrears range to discriminate between good and bad clients. On the other hand, with Y_2, the aim is to control the deterioration of the portfolio in the medium term in order to avoid excessive losses.
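As an illustration of the two definitions, the short sketch below builds Y_1 and Y_2 from a hypothetical performance-window table; the column names paid_full_installment_m1 and max_days_arrears_6m are assumptions made for the example, not fields of the study’s dataset.

```python
# Hypothetical performance-window table: one row per customer.
import pandas as pd

perf = pd.DataFrame({
    "paid_full_installment_m1": [1, 0, 1],  # 1 = paid a complete installment in the 1-month window
    "max_days_arrears_6m": [0, 45, 15],     # maximum days of arrears over the 6-month window
})

# Definition 1 (payment event): Y1 = 0 if a complete installment was paid, 1 otherwise.
perf["Y1"] = (perf["paid_full_installment_m1"] == 0).astype(int)
# Definition 2 (monthly payment behavior): Y2 = 0 if paid before 30 days of arrears, 1 otherwise.
perf["Y2"] = (perf["max_days_arrears_6m"] >= 30).astype(int)
print(perf)
```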

3. Train and Test Models

A scoring model is created by identifying patterns within the predictor variables and can be used to classify individuals into good and bad categories based on the event to be predicted. In machine learning, this model is developed through supervised learning, in which the model is trained on one portion of the data and then evaluated on data that was not used during training.
It is crucial to have a diverse dataset during the training phase to ensure that the model is trained on a wide range of information. In order to achieve this, the data needs to be randomly split into three datasets. The first dataset, comprising 60% of the data, is used for training. The second dataset, containing 25% of the data, is used for testing. Finally, the remaining 15% of the data is used for validation.
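A minimal sketch of this 60/25/15 random split is shown below; the feature matrix and target are synthetic placeholders, and two successive calls to scikit-learn’s train_test_split reproduce the three proportions.

```python
# Sketch of a 60% / 25% / 15% train / test / validation split with placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))              # placeholder predictors
y = rng.integers(0, 2, size=1000)           # placeholder binary target (Y1 or Y2)

# 60% training set first...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.60, stratify=y, random_state=42)
# ...then split the remaining 40% into 25% test and 15% validation (0.625 / 0.375 of the rest).
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, train_size=0.625, stratify=y_rest, random_state=42)
```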
Table 1 shows the distribution of the training, testing, and validation samples. Meanwhile, Table 2 indicates the distribution of customers classified as good and bad for Y_1 and Y_2.
Table 1. Distribution of the training, testing, and validation samples.
Table 2. Distribution of customers classified as good (G) and bad (B).

3.1. Logistic Regression Training

Logistic regression is a widely used technique for predicting a categorical variable using a set of explanatory variables. It is a parametric method that is formulated as follows:
Consider N quantitative variables X_1, …, X_N. For each combination of these variables, the response variable Y follows a Bernoulli distribution ().
Y \mid (X_1 = x_1, \ldots, X_N = x_N) \sim B\big(1, p(x_1, \ldots, x_N)\big)
We are interested in modeling the conditional expectation
E[Y \mid X_1 = x_1, \ldots, X_N = x_N] = P[Y = 1 \mid X_1 = x_1, \ldots, X_N = x_N] = p(x_1, \ldots, x_N)
The multiple logistic regression model for Y in terms of the values of the variables X can be written as
p(x_1, \ldots, x_N) = \frac{\exp\left(\alpha + \sum_{n=1}^{N} \beta_n x_n\right)}{1 + \exp\left(\alpha + \sum_{n=1}^{N} \beta_n x_n\right)}
with \alpha = \beta_0 and x_0 = 1.
In matrix terms it would be
p(\mathbf{x}) = \frac{\exp(\boldsymbol{\beta}^{t} \mathbf{x})}{1 + \exp(\boldsymbol{\beta}^{t} \mathbf{x})}
with \mathbf{x} = (1, x_1, \ldots, x_N) and \boldsymbol{\beta} = (\beta_0, \ldots, \beta_N).
Finally, a linear model for the logit transformation is obtained:
\ln \frac{p(\mathbf{x})}{1 - p(\mathbf{x})} = \sum_{n=0}^{N} \beta_n x_n

Unbalanced Problem

In some cases where logit, probit, or linear probability models are used, the number of observations in one group is much smaller than in the other. For instance, in lending, the number of bad clients is expected to be much smaller than the number of good clients because if both were equal, the financial institution would face bankruptcy. Therefore, to reach accurate predictions, either a large dataset or a balanced sample containing equal proportions of both groups is needed. In this case, all of the bad customers are considered, and the good customers are sampled to achieve a 50/50 ratio.
The question arises as to how one can analyze data in such cases. We suggest using a weighted logit (or probit or linear probability) model, similar to weighted least squares. If the logit model is used for estimating the coefficients of the explanatory variables, the different sample sizes for the two groups do not affect the coefficients ().
Let m_1 and m_2 be the sample proportions of the two groups, with m_2 > m_1. Since m_1 is the probability that an observation belonging to the first group is selected, and m_2 is the probability that an observation belonging to the second group is selected, when the samples are disproportionate, the logistic function is shifted as follows:
\ln\left( \frac{p(\mathbf{x})}{1 - p(\mathbf{x})} \cdot \frac{m_2}{m_1} \right) = \sum_{n=0}^{N} \beta_n x_n
When m_1 = m_2, the logistic function cuts the x-axis at the value 0.5, as seen in Figure 5. Now, if m_1 = 0.2 and m_2 = 0.8, the curve shifts and cuts the x-axis at 0.8, as seen in Figure 6.
Figure 5. Logit function.
Figure 6. Shifted logit function.
Therefore, the disproportionality of the samples only affects the constant term of the model, and one has that
p(\mathbf{x}) = \frac{\exp(\gamma + \boldsymbol{\beta}^{t} \mathbf{x})}{1 + \exp(\gamma + \boldsymbol{\beta}^{t} \mathbf{x})}
where \gamma = \ln(m), m = m_2 / m_1 ().
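The following hedged sketch illustrates the 50/50 balanced sample and the constant-term shift γ = ln(m_2/m_1) described above on synthetic data; the two predictors and the 8% bad rate are assumptions made for the example only, not properties of the study’s portfolio.

```python
# Sketch: balanced 50/50 sample for fitting, then the shifted constant term
# gamma = ln(m2/m1) when scoring the full (disproportionate) portfolio.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = (rng.random(n) < 0.08).astype(int)   # ~8% bad customers (illustrative)
features = ["x1", "x2"]

bad = df[df["y"] == 1]                                       # keep every bad customer
good = df[df["y"] == 0].sample(n=len(bad), random_state=0)   # sample good customers 1:1
balanced = pd.concat([bad, good])

logit = LogisticRegression(max_iter=1000).fit(balanced[features], balanced["y"])

m1 = (df["y"] == 1).mean()          # proportion of the smaller group in the full portfolio
m2 = (df["y"] == 0).mean()          # proportion of the larger group
gamma = np.log(m2 / m1)             # constant-term shift, gamma = ln(m), m = m2/m1

def p_shifted(x_row):
    """p(x) = exp(gamma + b'x) / (1 + exp(gamma + b'x)), following the expression above."""
    z = gamma + logit.intercept_[0] + np.dot(logit.coef_[0], x_row)
    return 1.0 / (1.0 + np.exp(-z))

print(p_shifted(np.array([0.0, 0.0])))
```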

3.2. Extreme Gradient Boosting Training

XGBoost (extreme gradient boosting) is a machine-learning algorithm introduced by Chen and Guestrin in 2016. It uses the concept of gradient tree boosting to improve performance and speed. XGBoost was designed to reduce overfitting by introducing regularization parameters. Gradient boosting trees use regression trees as weak learners in a sequential learning process. These regression trees are similar to decision trees, but they assign a continuous score to each leaf, and these scores are summed to produce the final prediction ().
Several hyperparameters are relevant when training a model, including the learning rate, column subsampling, and the regularization rate. Additionally, subsampling (which involves bootstrapping the training sample), the maximum depth of the trees, the minimum child weight required to split, and the number of estimators (trees) are commonly used to manage the bias–variance trade-off. Higher values of the regularization rate, the number of estimators, and the minimum child weight are associated with reduced overfitting, whereas the learning rate, maximum depth, subsample, and column-subsample ratios should take lower values to reduce overfitting. However, setting extreme values for any of these hyperparameters can lead to a poorly fitted model.

3.2.1. Hyperparameter Selected and Tuning

Hyperparameters are settings or configurations of the methods (models) which are freely selectable within a certain range and influence model performance (quality).
Grid search in XGBoost is an optimization technique that seeks to find the set of hyperparameters that yields the most accurate predictive model. It operates by defining a grid of hyperparameter values and evaluating the model’s performance for each combination of these values. This process is facilitated by the use of cross-validation, typically k-fold cross-validation, to assess the performance of the model on different subsets of the training data, thereby ensuring that the model’s performance is robust and not overly dependent on the particularities of one set of training data ().
The hyperparameters commonly tuned in XGBoost through grid search include max_depth, min_child_weight, gamma, subsample, colsample_bytree, and learning_rate (eta). The grid search process evaluates the model for each combination of hyperparameters in the grid, which can be computationally intensive but is necessary for identifying the optimal parameters that minimize overfitting and maximize predictive performance.
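A hedged sketch of such a grid search is given below; the grid values and the 5-fold setup are illustrative assumptions rather than the configuration used in the study, and the fit call is left commented out because it expects the training data produced by the 60/25/15 split.

```python
# Sketch: k-fold grid search over the XGBoost hyperparameters listed above.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 4, 6],
    "min_child_weight": [1, 5],
    "gamma": [0, 1],
    "subsample": [0.7, 1.0],
    "colsample_bytree": [0.7, 1.0],
    "learning_rate": [0.05, 0.1],   # eta
}

search = GridSearchCV(
    estimator=XGBClassifier(n_estimators=300, objective="binary:logistic", eval_metric="logloss"),
    param_grid=param_grid,
    scoring="roc_auc",              # a custom KS scorer could be used instead
    cv=5,                           # 5-fold cross-validation
    n_jobs=-1,
)
# search.fit(X_train, y_train)      # X_train / y_train come from the 60/25/15 split
# print(search.best_params_)
```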

3.2.2. XGBoost Hyperparameter Nrounds

The parameter nrounds specifies the number of boosting steps and takes integer values in [1, ∞). Since a tree is created in each individual boosting step, nrounds also controls the number of trees that are integrated into the ensemble as a whole. Its practical meaning is as follows: larger values of nrounds mean a more complex and possibly more precise model, but they also cause a longer running time.

3.2.3. XGBoost Hyperparameter Eta

The parameter eta is a learning rate, also called the “shrinkage” parameter, and takes values in [0, 1]. It controls the lowering of the weights in each boosting step. Its practical meaning is as follows: lowering the weights helps to reduce the influence of individual trees on the final prediction.

3.2.4. XGBoost Hyperparameter Max_Depth

The m a x _ d e p t h hyperparameter in XGBoost refers to the maximum depth of a tree and takes values between [ 0 ,   n ] , where only integer values are valid. It is used to control how deep the decision trees within the model can grow during any boosting round. A deeper tree can model more complex patterns in the data, but it also increases the risk of overfitting. The default value is typically set to 6, but it can be adjusted depending on the complexity of the task and the amount of data available.

3.2.5. XGBoost Hyperparameter Min_Child_Weight

Like gamma and max_depth, min_child_weight restricts the number of splits of each tree and takes values in [0, ∞). In the case of min_child_weight, this restriction is determined using the Hessian matrix of the loss function.

3.2.6. XGBoost Hyperparameter Subsample

In each boosting step, the new tree to be created is usually trained on only a subset of the entire dataset, similar to random forest. The subsample parameter specifies the fraction of the training data that is randomly selected in each iteration and takes values in (0, 1]. Its practical significance is as follows: an obvious effect of small subsample values is a shorter running time for the training of individual trees, which is proportional to the subsample value.

3.2.7. XGBoost Hyperparameter Colsample_Bytree

The parameter colsample_bytree controls the number of features considered for the splits of a tree and takes values in [0, 1]. In XGBoost, this choice is made only once for each tree that is created rather than for each split. Here, colsample_bytree is a relative factor: the number of selected features is colsample_bytree × n. colsample_bytree allows the trees of the ensemble to have greater diversity. The runtime is also reduced, since a smaller number of splits must be checked each time (if colsample_bytree < 1).

3.2.8. XGBoost Hyperparameter Lambda

The parameter lambda is used for the regularization of the model. This parameter influences the complexity of the model and takes values in [0, ∞). Its practical significance is as follows: as a regularization parameter, lambda helps to prevent overfitting. With larger values, smoother or simpler models are to be expected.

3.3. Artificial Neural Networks Training

Training a neural network revolves around the following objects ():
  • Layers, which are combined into a network (or model)
  • The input data and corresponding targets
  • The loss function, which defines the feedback signal used for learning
  • The optimizer, which determines how learning proceeds

3.3.1. Building the Neural Networks

When feeding data into a neural network, it is important to first apply one-hot encoding to the categorical variables. This means that for a variable with n categories, one would create n − 1 dummy variables of 0s and 1s. After that, it is essential to standardize the data so that all variables have the same scale. This standardized data is then used as the input for the first layer of the neural network. As for the Y variable, it is kept numerical with 1s and 0s.
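A minimal preprocessing sketch along these lines is shown below; the three toy records and the region values are fabricated for illustration, while the column names are borrowed from Table A1.

```python
# Sketch: dummy-encode categorical predictors and standardize the inputs.
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_raw = pd.DataFrame({                       # tiny toy frame; the real predictors are listed in Table A1
    "Edad": [34, 51, 27],
    "Saldo_cuota_credito": [120.5, 0.0, 310.2],
    "region": ["COSTA", "SIERRA", "COSTA"],
})

# n - 1 dummy columns per categorical variable (drop_first=True)
X_encoded = pd.get_dummies(X_raw, columns=["region"], drop_first=True)

# Standardize so that all inputs share the same scale before feeding the first layer
X_scaled = StandardScaler().fit_transform(X_encoded)
```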
A type of network that performs well on binary classification problems is a simple stack of fully connected (“dense”) layers ().
There are two key architecture decisions to be made about such a stack of dense layers:
  • How many layers to use
  • How many hidden units to choose for each layer
The intermediate layers will use relu as an activation function, and the final layer will use a sigmoid activation to output a probability (a score between 0 and 1 indicating how likely the sample is to have the target “1”; that is, how likely the customer is to be a bad payer). A relu (rectified linear unit) is a function meant to zero out negative values, whereas a sigmoid “squashes” arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability.
When setting up a neural network, it is important to select a loss function and an optimizer. For a binary classification problem with network output as probabilities, it is best to use the binary cross-entropy loss. Cross-entropy is a reliable choice for models that deal with probabilities, as it measures the distance between probability distributions or, in this case, the actual distribution and its predictions ().
The optimizer of choice is Adam, (adaptive moment estimation). Adam adjusts the neural network weights more efficiently by calculating adaptive learning rates for each parameter. It uses first and second moment estimates of the gradients (i.e., the mean and non-centered variance) to perform parameter updates.
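The sketch below shows a network of this kind in Keras; the layer widths and input size are illustrative, since the actual architectures used in the study are those reported in Tables 3–6.

```python
# Sketch: a small stack of dense layers for binary classification,
# compiled with binary cross-entropy and the Adam optimizer.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 73                    # illustrative input width (depends on the encoding)

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # outputs a probability of being a bad payer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
# model.fit(X_train_scaled, y_train, epochs=20, batch_size=512,
#           validation_data=(X_val_scaled, y_val))
```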

3.3.2. Adding Dropout

Dropout is a widely used regularization technique for neural networks that was developed by Geoffrey Hinton at the University of Toronto. When dropout is applied to a layer during training, a certain number of output features are randomly set to zero (). For example, if a layer would normally return the vector [ 0.2 ,   0.5 ,   1.3 ,   0.8 ,   1.1 ] for a given input sample during training, applying dropout might result in a vector like [ 0 ,   0.5 ,   1.3 ,   0 ,   1.1 ] .
The dropout rate is the fraction of features that are zeroed out, typically set between 0.2 and 0.5. During testing, no units are dropped out; instead, the layer’s output values are scaled down by a factor equal to the dropout rate to balance the fact that more units are active than at training time. The technique may seem strange and arbitrary, but why would it help reduce overfitting? Hinton says he was inspired, among other things, by a fraud prevention mechanism used by banks.
In his own words:“I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know, but they changed them a lot. I assumed it must be because it would take cooperation among the employees to get the bank to cheat. This made me realize that randomly removing a different subset of neurons in each example would prevent conspiracies and thus reduce over-fitting” (). The central idea is that by introducing noise into the output values of a neural network layer, you can break random patterns that are not meaningful (what Hinton calls conspiracies), which the network will start to memorize if there is no noise.
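Continuing the sketch above, dropout layers can be inserted after each hidden layer; the 0.3 rate is an arbitrary value within the 0.2–0.5 range mentioned, not the rate used in the study.

```python
# Sketch: the same dense stack with dropout regularization added during training.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 73                    # illustrative input width

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),           # randomly zero 30% of this layer's outputs during training
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```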
The architecture for each neural network constructed is shown in Table 3, Table 4, Table 5 and Table 6.
Table 3. No Arrears Segment neural network.
Table 4. 1–30 Segment neural network.
Table 5. 31–90 Segment neural network.
Table 6. All Arrears neural network architecture and parameters.

4. Interpretability

4.1. Interpretation of Logistic Regression Coefficients

The estimated coefficients β n in a regression can be better understood by considering the concept of relative risk. Relative risk is the ratio of the probability of an event occurring (p) to the probability of it not occurring ( 1 p ), also known as an odds ratio. Odds ratios indicate how much the odds change per unit change in the explanatory variables ().
The exponential of β n , exp ( β n ) , represents the relative risk, which measures the influence of the variables x n on the risk of the event occurring, assuming all other variables in the model remain constant. Once the values of β n have been estimated, we can determine the probability of the event for different values of x n .
The coefficients of logistic regression are not as easy to interpret as those of linear regression. While the β n coefficients are useful for model validation tests, exp ( β n ) is easier to interpret. exp ( β n ) represents the change in the odds ratio for each one-unit change in the variable x n .
For example, take the variable V4 (cp_pl > 0.77) in the No Arrears Segment model, which corresponds to individuals whose ratio of installments paid over the loan term is greater than 0.77 (see Table A2 in Appendix B). Its estimated coefficient β_4 is 1.612, so exp(β_4) = 5.014 indicates that the odds ratio for these individuals is 5.014 times higher than for other customers if all other variables are held constant. In other words, the probability that individuals with cp_pl > 0.77 will make a payment next month is 5.014 times higher than for others.
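The arithmetic behind this figure is a one-line check using the V4 coefficient reported in Table A2.

```python
# Worked check of the odds ratio quoted above (coefficient of V4 in Table A2).
import numpy as np

beta_4 = 1.61231
odds_ratio = np.exp(beta_4)
print(round(odds_ratio, 3))   # 5.014
```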

4.2. Interpretation of XGBoost Models Results

XGBoost is often considered a “black box” algorithm because, while it is highly effective at making accurate predictions, it can be difficult to understand how it arrives at these predictions. This is due to the complexity of the decision tree models it creates and how these trees are combined to form the final model.
Machine learning models like XGBoost, which utilize ensemble and boosting techniques, generate multiple decision trees during the training process. Each tree is constructed to fix the errors of the previous one, leading to a final model that is a weighted sum of many trees ().
As a result of this combination of models and their interactions, it can be challenging to precisely determine which features are influencing the predictions and how they are doing so. Nevertheless, ongoing efforts are being made to enhance the interpretability of these models, including the use of feature importance techniques, SHAP values, and tree visualizations, which can provide insight into model decisions ().
The significance of variables in the XGBoost algorithm pertains to the impact of each feature in the dataset on the accuracy of the model. From an interpretability standpoint, this helps in understanding which variables carry the most weight in the decisions made by the model and how each influences the final result.
In XGBoost, the importance of variables can be assessed in various ways, including information gain, coverage, or frequency of occurrence of a feature in the decision trees. These metrics offer a clear understanding of the relevance of each variable and enable data scientists and analysts to make well-informed decisions regarding feature selection and model optimization.
Gain represents the average contribution of a feature to model improvement each time it is utilized in a tree. A higher value signifies the feature’s greater importance in making splits that enhance model performance.
Weight refers to the frequency of a feature’s appearance in all trees of the model. A feature with a higher weight is deemed more significant.
Cover measures the frequency of a feature’s utilization in the trees, weighted by the amount of data passing through those splits. A high coverage suggests that the feature substantially impacts the model’s predictions.
Figure 7 shows the 10 variables that, in terms of gain, contribute to the splits that improve the performance of the segment model without arrears.
Figure 7. Top 10 important variables in XGBoost No Arrears Segment model.
It has been noted that for the 0 - No Arrears Segment (Figure 7), the variables with the most significant influence on the calculation of the probability of payment in the next month are cp_pl, saldo_cuota_credito, and ctr_pl.
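As a hedged illustration of these three measures, the sketch below fits a small XGBoost classifier on synthetic data and prints its gain, weight, and cover rankings; the data and model settings are placeholders, not those of the study.

```python
# Sketch: extracting gain / weight / cover importances from a fitted XGBClassifier.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                        # placeholder features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss").fit(X, y)
booster = model.get_booster()

for importance_type in ("gain", "weight", "cover"):
    scores = booster.get_score(importance_type=importance_type)      # {feature: score}
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(importance_type, top)
```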

4.3. Interpretation of Neural Networks

The local interpretable model—agnostic explanations (LIME) technique is used to interpret the results of neural networks. This technique allows individual predictions to be explained by locally approximating the decision surface with an interpretable model, such as weighted linear regression (). This approach generates perturbations in the input variables close to the case of interest, evaluates the model’s response, and estimates the contribution of each feature to the prediction, facilitating the interpretation of black box models without the need to access their internal structure. Its applicability in deep learning models has been documented in various areas, including credit risk and health, due to its flexibility and independence from the type of model (; ).
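A hedged sketch of producing such a local explanation with the lime package is shown below; X_train_scaled, feature_names, and the fitted Keras model are assumed to come from the preceding training steps and are not defined here.

```python
# Sketch: LIME local explanation for a single scored customer.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train_scaled,        # standardized training inputs (assumed available)
    feature_names=feature_names,         # e.g., "Saldo_cuota_credito", "cp_pl", ...
    class_names=["good", "bad"],
    mode="classification",
)

def predict_fn(batch):
    """Return [P(good), P(bad)] per row, wrapping the Keras model's sigmoid output."""
    p_bad = model.predict(batch).ravel()
    return np.column_stack([1.0 - p_bad, p_bad])

explanation = explainer.explain_instance(X_train_scaled[0], predict_fn, num_features=3)
print(explanation.as_list())             # (feature condition, weight) pairs, as in Table 7
```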
The results in Table 7 show the local interpretation of a customer with a 99.8% probability of being a bad customer (Y_1 = 1). The LIME analysis revealed three main characteristics that influence this prediction:
Table 7. LIME results for one customer in a neural network for the No Arrears Segment.
  • Credit_installment_balance (standardized value: −1.1, weight: 0.0000237): This variable, which represents the normalized credit installment balance, shows a positive contribution to the prediction of Y 1 , suggesting that lower installment balance values are associated with a customer’s failure to make payments.
  • cp_pl (value: 1.82, weight: 0.0000232): This feature also contributes positively to the prediction, although with a slightly lower weight.
  • Cadena.ARTEFACTA (value: −2.8, weight: −0.000023): This categorical variable, indicating membership to a specific retail chain, shows a negative contribution, suggesting that being a customer of this retail chain reduces the probability of being a bad customer.
The similar magnitude of the weights (of the order of 10^−5) indicates that these features have relatively balanced influences on the model’s decision for this specific instance.

5. Results and Discussion

Based on the data in Table 8, it appears that both XGBoost and artificial neural networks (ANN) tend to outperform logistic regression (LR) in some segments. However, the superiority of one model over the other may depend on the specific segment.
Table 8. KS in percentage (%) and time in seconds (s) results by arrears segment and dependent variable Y definition.
  • In No Arrears Segment models, XGBoost (63.36%) and ANN (61.84%) outperform LR (56.42%).
  • In 1–30 Segment models, XGBoost (51.38%) and ANN (50.35%) also outperform LR (47.32%).
  • In the 31–90 Segment models, however, ANN (38.77%) outperforms LR (36.62%), but XGBoost (34.47%) does not.
  • Finally, in the All Segments models, both XGBoost (74.05%) and ANN (73.59%) outperform LR (71.01%).
These findings suggest that XGBoost and ANN outperform LR in predicting events. It is essential to consider that these results can vary based on the data characteristics, the quality of feature engineering, and the hyperparameters of the models, among other factors. Furthermore, while XGBoost and artificial neural networks (ANNs) often demonstrate superior predictive capabilities, their training procedures are inherently more intricate than those of logistic regression (LR). In the case of XGBoost, this complexity arises from the need to systematically optimize a potentially large set of hyperparameters, whereas ANNs require the design, calibration, and validation of an appropriate network architecture. Therefore, the selection of a modeling approach should be informed by a rigorous assessment of the trade-off between predictive accuracy and computational efficiency, taking into account the operational constraints and the specific objectives of the predictive task.
Finally, it is crucial to note that these results are specific to this dataset and cannot be generalized to other datasets or prediction tasks. Therefore, it is good practice to cross-validate and fine-tune the hyperparameters for each model and dataset.

6. Conclusions

  • This paper compares three supervised learning models—logistic regression, XGBoost, and artificial neural networks—using the Kolmogorov–Smirnov (KS) statistic as a performance metric. The XGBoost algorithm consistently demonstrated superior performance across various segments, achieving KS values of 63.36% for the segment with no arrears, 51.38% for the segment with 1–30 days of arrears, and 74.05% when all segments were analyzed together. These results indicate that XGBoost is more effective for this binary classification task than logistic regression and neural networks in most arrears segments.
  • Although logistic regression requires more time for training, its performance in terms of KS did not outperform XGBoost and neural networks in any of the arrears segments.
  • In the 31–90 Days of Arrears Segment, neural networks outperformed XGBoost with a 38.77% KS value, indicating that the complexity and adaptability of neural networks can be advantageous in certain scenarios, despite the longer training time required.
  • The interpretation of XGBoost results relies on how much each variable contributes to the splits of the boosted trees. In contrast, neural networks are still being studied to achieve satisfactory interpretability. This indicates that when developing a scoring model, one must choose between interpretability and predictability. If the goal is to enhance predictive or discriminative power, XGBoost or neural networks are preferred options. These algorithms can be effectively utilized during the placement, servicing, and collection phases of the credit cycle, as there is a larger volume of data to analyze. This is particularly relevant in collections, where results can change rapidly.
  • Future work involves exploring combinations of models to leverage the individual strengths of algorithms like XGBoost and neural networks, aiming to improve prediction accuracy across different segments or scenarios.

Author Contributions

Conceptualization, A.C.; methodology, A.C.; software, A.C.; validation, M.E.B.; formal analysis, A.C.; investigation, A.C.; resources, A.C.; data curation, A.C.; writing—original draft preparation, A.C.; writing—review and editing, M.E.B.; visualization, M.E.B.; supervision, M.E.B.; project administration, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request to the corresponding author due to confidentiality agreements with the data provider.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Set of Variables for Training Models

Table A1. Variables’ description.
Alias | Variable | Description
M1 | Atraso_Max_3meses | Maximum arrears in the last three months
M2 | Atraso_Max_6meses | Maximum arrears in the last six months
M3 | Atraso_Max_9meses | Maximum arrears in the last nine months
M4 | Atraso_Max_Credito | Maximum arrears since the start of the loan
M5 | Atraso_Prom_12meses | Average arrears over the last 12 months
M6 | Atraso_Prom_3meses | Average arrears over the last three months
M7 | Atraso_Prom_Credito | Average arrears since the start of the loan
M8 | Cadena | Distribution chain to which the purchased product belongs
M9 | canal_vta | Sales channel through which the loan was acquired
M10 | cant_con_efe_dom_12meses | Number of direct contacts at the customer’s home in the last twelve months
M11 | cant_con_efe_dom_3meses | Number of direct contacts at the customer’s home in the last three months
M12 | cant_con_efe_dom_6meses | Number of direct contacts at the customer’s home in the last six months
M13 | cant_con_efe_dom_9mese | Number of direct contacts at the customer’s home in the last nine months
M14 | cant_con_efe_tel_12meses | Number of phone contacts with the customer in the last twelve months
M15 | cant_con_efe_tel_6meses | Number of phone contacts with the customer in the last six months
M16 | cant_con_efe_tel_9meses | Number of phone contacts with the customer in the last nine months
M17 | cant_con_efe_tel_mesant | Number of phone contacts with the customer in the previous month
M18 | cant_ges_dom_12meses | Number of visits to the customer’s home in the last twelve months
M19 | cant_ges_dom_3meses | Number of visits to the customer’s home in the last three months
M20 | cant_ges_dom_9meses | Number of visits to the customer’s home in the last nine months
M21 | cant_ges_efe_dom_12meses | Number of unsuccessful home visits in the last twelve months
M22 | cant_ges_efe_dom_6meses | Number of unsuccessful home visits in the last six months
M23 | cant_ges_efe_dom_9meses | Number of unsuccessful home visits in the last nine months
M24 | cant_ges_efe_dom_mesant | Number of unsuccessful home visits in the previous month
M25 | cant_ges_efe_tel_9meses | Number of unsuccessful phone calls in the last nine months
M26 | cant_ges_efe_tel_mesant | Number of unsuccessful phone calls in the previous month
M27 | cant_ges_tel_12meses | Number of phone calls in the last 12 months
M28 | cant_ges_tel_6meses | Number of phone calls in the last 6 months
M29 | cant_ges_tel_9meses | Number of phone calls in the last 9 months
M30 | Cant_Num_Telef_Referen | Number of reference phone numbers held by the customer
M31 | Cant_Productos | Number of billed products
M32 | CapitalInteres | Capital with interest
M33 | cp_ctr | Ratio of paid installments to installments due
M34 | cp_pl | Ratio of paid installments to loan term
M35 | ctr_pl | Ratio of remaining installments to loan term
M36 | Cuotas_pagad_credito | Number of installments paid on the loan
M37 | Cuotas_pendt_credito | Number of installments due
M38 | CuotasGratis | Indicator of whether the customer has a free installment promotion
M39 | desc_mejor_resp_dom_3meses | Best response obtained from home visits in the last three months
M40 | desc_mejor_resp_dom_6meses | Best response obtained from home visits in the last six months
M41 | desc_mejor_resp_dom_9meses | Best response obtained from home visits in the last nine months
M42 | desc_mejor_resp_tel_6meses | Best response obtained from phone calls in the last six months
M43 | dif_mes | Number of months between the loan disbursement and the reporting date
M44 | Edad | Customer’s age in years at the time of data extraction
M45 | ID_Num_Telef_Particular1 | Indicator of whether the customer has a landline at home
M46 | ind_ges_preventiva | Indicator of whether preventive management was performed
M47 | IngresosPropios | Estimated value of customer’s own income
M48 | Inicial | Total initial payment at the time the loan was taken
M49 | Inicialbono | Amount less than an installment paid at the start of the loan
M50 | linea | Product line
M51 | MesesGracia | Number of grace months before the first installment due date
M52 | Num_atra_may_60dias_anio | Number of delinquencies over 60 days in the past year
M53 | Num_atra_may30dias_anio | Number of delinquencies over 30 days in the past year
M54 | Num_pag_12meses | Number of payments made in the last twelve months
M55 | Num_pag_3meses | Number of payments made in the last three months
M56 | Num_pag_6meses | Number of payments made in the last six months
M57 | Num_pag_9meses | Number of payments made in the last nine months
M58 | Pago_efec_1mes | Payment made for the installment due on the reporting date
M59 | Plazo | Total number of loan installments
M60 | RalacionTrabajo | Indicator of the customer’s current employment status
M61 | Rango_mora_max_mesant | Maximum arrear range in the month prior to the reporting date
M62 | Rango_mora_mesact | Current delinquency range as of the reporting month
M63 | region | Geographic region of the customer’s home
M64 | Saldo_cuota_credito | Outstanding installment balance
M65 | Saldo_vencido_Credito | Overdue loan balance
M66 | Sexo | Gender as self-identified by the customer
M67 | TasaCredito | Effective interest rate of the loan
M68 | tipoinicialbono | Type of initial bonus received when the loan was taken
M69 | TotFacturaInicial | Total billed amount excluding interest
M70 | Val_pag_1meses | Amount paid the month before the reporting date
M71 | Val_pag_2meses | Amount paid two months before the reporting date
M72 | Val_pag_3meses | Amount paid three months before the reporting date
M73 | ValorCuota | Total installment amount including interest

Appendix B. Logistic Regression and XG Boost Results

Table A2. Variables and estimated coefficients of the final logistic regression model for the No Arrears Segment (See Table A1 for description details).
Alias | Beta Estimated | Description
V1 | −1.74701 | M34 ≤ 0.56 & M58(0; 77.62] & M55(2;3]
V2 | −1.74726 | M34 ≤ 0.56 & M58(77.62; 102.88] & M72(159.2;293.94]
V3 | −0.40177 | M34(0.56;0.77]
V4 | 1.61231 | M34 > 0.77
V5 | 0.87094 | M64 ≤ 197.79
V6 | −0.85414 | M64 > 1121.02 & M33(0.8;1] & M70 > 24.61
V7 | −0.27386 | M35 ≤ 6.5 & M56(5;6] & M36 ≤ 11
V8 | 0.10523 | (M38 == BONO INICIAL + N CUOTAS GRATIS | M38 == CUOTAS GRATIS) & M49 <= 0 & (M63 == QUITO | M63 == GUAYAQUIL)
V9 | −0.46984 | M38 == NULL
V10 | 0.08418 | M67(0;15] & M48 > 59
V11 | −0.15439 | (M68 == NULL & M71(102.58;136.09] & M73(47.18;77.8]) | (M68 == NULL & M71(153.54;236.16] & M73 > 77.8)
V12 | −0.11464 | M31 > 2 & (M50 == Tienda | M50 == Recojo) & M60 == NO
V13 | −0.67996 | M35 ≤ 0.75
Intercept | −0.80574 |
Table A3. Variables of the final logistic regression model for the 1–30 Segment (see Table A1 for description details).
Alias | Beta Estimated | Description
V1 | 1.56644 | M64 ≤ 123.25
V2 | −0.00106 | M64(123.25;311.02] (cont)
V3 | −0.81316 | M64 > 922.58 & M37 ≤ 0 & M58(58.87;117.52]
V4 | −0.23837 | M34(0.28;0.65] & M55(2;3] & M4 ≤ 5
V5 | 0.70477 | M34 > 0.83 (cont)
V6 | −0.17784 | M55(2;3] & M4 ≤ 5
V7 | −0.54773 | M35 ≤ 3.2 & M65 ≤ 0
V8 | 0.64187 | M35 > 0.87 (cont)
V9 | −0.01297 | M36(3;14] (cont)
Intercept | −0.39452 |
Table A4. Variables of the final logistic regression model for the 31–90 Segment (see Table A1 for description details).
Alias | Beta Estimated | Variable
V1 | −0.52883 | M1 ≤ 11
V2 | −0.58201 | M1(11;27] & M64 > 297.19
V3 | 0.19268 | M1 > 42
V4 | −0.30622 | M3(29;45]
V5 | 0.51679 | M3(45;55] & M58 ≤ 0
V6 | 0.77430 | M6 > 49.67 & M53 ≤ 2
V7 | 0.12466 | M52 ≤ 0 & M65 > 108.23
V8 | 0.16562 | (M50 == VIDEO || M50 == AUDIO || M50 == CONSTRUCCION) & M15 ≤ 0 & M69 ≤ 2046.82
V9 | 0.16522 | (M39 == MENSAJE A TERCEROS || M39 == CONTACTO SIN COMPROMISO) & M28 > 10
V10 | 0.18916 | M12 ≤ 0 & M26 ≤ 0 & M47 ≤ 353
Intercept | 0.69485 |
Figure A1. Top 10 important variables in XGBoost No Arrears Segment model.
Figure A2. Top 10 important variables in XGBoost 1–30 Segment model.
Figure A3. Top 10 important variables in XGBoost 31–90 Segment model.

References

  1. Arnold, T. B., & Emerson, J. W. (2011). Nonparametric goodness-of-fit tests for discrete null distributions. R Journal, 3(2), 34–39. [Google Scholar] [CrossRef]
  2. Bartz, E., Bartz-Beielstein, T., Zaefferer, M., & Mersmann, O. (2023). Hyperparameter tuning for machine and deep learning with r: A practical guide. Springer Nature. [Google Scholar] [CrossRef]
  3. Chen, T., & Guestrin, C. (2016, August 13–17). Xgboost: A scalable tree boosting system. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794), San Francisco, CA, USA. [Google Scholar]
  4. Chen, Y., Wang, H., Han, Y., Feng, Y., & Lu, H. (2024). Comparison of machine learning models in credit risk assessment. Applied and Computational Engineering, 74, 278–288. [Google Scholar] [CrossRef]
  5. Chollet, F. (2018). Deep learning with R (with J. J. Allaire). Manning. [Google Scholar]
  6. Cifuentes Baquero, N., & Gutiérrez Murcia, L. (2022). Modelo predictivo de la probabilidad de aumento de los días de mora para usuarios de tarjeta de crédito [Masters’s thesis, Universidad de los Andes]. [Google Scholar]
  7. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi, D. (2019). A survey of methods for explaining black box models. ACM Computing Surveys, 51(5), 93. [Google Scholar] [CrossRef]
  8. Iñiguez, C., & Morales, M. (2009). Selección de perfiles de clientes mediante regresión logística para muestras desproporcionadas, validación, monitoreo y aplicación en la proyección de provisiones. Escuela Politécnica Nacional. [Google Scholar]
  9. Li, Z. (2022). Extracting spatial effects from machine learning model using local interpretation method: An example of SHAP and XGBoost. Computers, Environment and Urban Systems, 96, 101845. [Google Scholar] [CrossRef]
  10. Maddala, G. S., Contreras García, J., Lozano López, V., & García Ferrer, A. (1985). Econometría. McGraw-Hill. [Google Scholar]
  11. Massey, F. J., Jr. (1951). The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253), 68–78. [Google Scholar] [CrossRef]
  12. Molnar, C. (2021). Interpretable machine learning. Available online: https://fedefliguer.github.io/AAI/redes-neuronales.html (accessed on 22 August 2025).
  13. Pérez Tatamués, A. E. (2014). Modelo de activación de tarjetas de crédito en el mercado crediticio ecuatoriano a través de una metodología analítica y automatizada en r [Bachelor’s thesis, Escuela Politécnica Nacional]. [Google Scholar]
  14. Reche, J. L. C. (2013). Regresión logística. Tratamiento computacional con R. Universidad de Granada. [Google Scholar]
  15. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August 13–17). “Why should I trust you?”: Explaining the predictions of any classifier. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 135–144), San Francisco, CA, USA. [Google Scholar] [CrossRef]
  16. Sánchez Farfán, Y. S. (2023). Aplicación del modelo credit scoring y regresión logística en la predicción del crédito, en una entidad financiera de la ciudad del Cusco. Universidad de los Andes. [Google Scholar]
  17. Suquillo Llumiquinga, J. A. (2021). Credit scoring: Aplicando técnicas de regresión logística y modelos aditivos generalizados para una cartera de crédito en una entidad financiera [Bachelor’s thesis, Escuela Politécnica Nacional]. [Google Scholar]
  18. Thomas, L., Crook, J., & Edelman, D. (2017). Credit scoring and its applications. SIAM. [Google Scholar]
  19. Vargas Lara, D. O. (2015). Metodología para la obtención de un modelo de cobranza de créditos masivos. desarrollo y obtención de un modelo de score [Unpublished master’s thesis, Escuela Politécnica Nacional]. [Google Scholar]
  20. Yeh, I.-C., & Lien, C.-H. (2023). Credit scoring using machine learning techniques: A review and open research issues. Mathematics, 11(4), 839. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
