Machine Learning for Credit Risk Prediction: A Systematic Literature Review

Abstract
In this systematic review of the literature on using Machine Learning (ML) for credit risk prediction, we highlight the need for financial institutions to use Artificial Intelligence (AI) and ML to assess credit risk by analyzing large volumes of information. We posed research questions about the algorithms, metrics, results, datasets, variables, and limitations involved in predicting credit risk. We searched renowned databases to answer them and identified 52 relevant studies within the microfinance credit industry. We identified challenges and approaches in credit risk prediction using ML models: the opacity of black-box models and the resulting need for explainable artificial intelligence, the importance of selecting relevant features, addressing multicollinearity, and the problem of imbalance in the input data. By answering the research questions, we found that the Boosted Category is the most researched family of ML models, and that the most commonly used evaluation metrics are Area Under Curve (AUC), Accuracy (ACC), Recall, the F1 measure (F1), and Precision. Research mainly uses public datasets to compare models and private ones to generate new knowledge when applied to the real world. The most significant limitation identified is the representativeness of reality, and the variables primarily used in the microcredit industry relate to Demographic, Operation, and Payment behavior data. This study aims to guide developers of credit risk management tools and software toward the existing ML methods, metrics, and techniques used to forecast credit risk, thereby minimizing possible losses due to default and guiding risk appetite.


Introduction
The digitalization of processes and AI are already part of our daily lives and have been developing in all areas with which we interact, especially during the COVID-19 pandemic [1,2]. This trend continued with the promotion of online loans and Internet sales [3] and, in turn, with the increase in short-term demand for credit and crowdfunding through the Internet [4][5][6]. The pandemic, as an external factor, has considerably changed the economy, increasing uncertainty for financial institutions and, consequently, the need to generate new models to manage it [4]. The World Bank (WB) emphasizes that the banking sector is crucial because it improves the well-being of a country's population and is essential for economic growth [7].
In this competitive environment, financial institutions seek to differentiate themselves, generate shareholder value, improve the customer experience, and promote financial inclusion. In this sense, they face the challenge of adopting data-driven innovation (DDI) to manage credit, customer, and operational risk appetite while seeking efficiency [8], supported by information technology (cloud services, Internet of Things (IoT), BIG DATA, AI, and mobile telephony) known as the fourth industrial revolution [9]. This is followed by its sustainable evolution toward a virtuous interaction between humans, technology, and the environment, called the fifth industrial revolution [10].
There are still many use cases for DDI and ML in financial institutions to solve [11,12]. There are gaps in evaluation and weaknesses in the results when processing large volumes of information [13] using ML algorithms for real-time applications [8]. On the other hand, using machine learning techniques such as association and collaborative filtering, along with recommended and personalized content, systems can identify individual preferences that could enrich risk assessment [8]. Likewise, BIG DATA tools and technology could help improve forecasts in changing markets, considering their analytical potential [11,14].
Utilizing machine learning in business intelligence to reduce the uncertainty of payment behavior, also called credit risk, is a necessity in the microfinance industry, since it allows for the analysis of the large volumes of information generated, especially in the context of post-pandemic COVID-19 and technological development [4,15]. The challenges are determining which configurations of attributes and algorithms are best suited for the tasks, together with identifying limitations in the applications. Consequently, we propose the research topics shown in Table 1.

Table 1. Motivations and research topics.

Motivation: We wish to know what models the industry and academics use to predict credit risk.
Research topic: The algorithms, methods, and models used to predict credit risk.

Motivation: We wish to know what metrics the industry and academics use to evaluate the performance of algorithms, methods, or models to predict credit risk.
Research topic: The metrics to evaluate the performance of algorithms, methods, or models.

Motivation: We wish to know the accuracy, precision, F1 measure, and AUC values of algorithms, methods, or models to predict credit risk.

Motivation: We wish to know what datasets the industry and academics use to predict credit risk.
Research topic: The datasets used in the prediction of credit risk.

Motivation: We wish to know what variables or features the industry and academics use to predict credit risk.
Research topic: The variables or features used in the prediction of credit risk.

Motivation: We wish to know the main problems or limitations in predicting credit risk.
Research topic: The main problems or limitations of predicting credit risk.

Materials and Methods
Based on the research topics, we pose the questions in Table 2 as a first step. Using the most relevant words related to our research topic, we build our search string and apply it to the recognized databases: IEEE Xplore, Scopus, and Web of Science (WOS). We consider studies from journal articles and conference papers published from 2019 to May 2023 that are related to the computer science area, as shown in Figure 1.
We identified 275 studies. Of these, 77 were eliminated as duplicates, 131 were eliminated by applying the eligibility criteria (no full access to the document, no Scimagojr ranking, review article), and 15 were eliminated for lacking relevance after complete analysis of the document. We obtained 52 relevant articles as a result; Table 3 shows the result of applying the inclusion and exclusion criteria.

RQ1. What are the algorithms, methods, and models used to predict credit risk?

Research Strategy
To determine the importance of studies, we assess whether they include relevant conclusions or results; attributes or features of datasets; descriptions of the proposed model or algorithm; the metrics with which the models were evaluated; preprocessing techniques; identified problems or limitations; and future studies. When an article includes more than one dataset or experiment, we select one of them to present the metrics, choosing the most relevant in terms of ACC, AUC, or another metric if the indicated ones are not used. Where applicable, we chose the German Dataset from the University of California Irvine (UCI) repository, considering that it is imbalanced [16] and frequently used, which allows us to compare results across investigations that include it.
The main limitation when carrying out a systematic review is bias, considering how researchers' decisions, framed by their experience and prior knowledge, influence the application of the method: for example, the choice of topic, the choice of electronic resources, the proposal of research questions, the methodology for data collection, and their evaluation. We have tried to follow the PRISMA method meticulously, applying the most appropriate criteria and procedures in the construction of this document.

Current Research
The demand for online credit generates considerable information which, when analyzed using BIG DATA [17], supports the design of new products, machine learning models, and credit risk assessment methods [18]. Consequently, in scenarios of increased demand, credit risks also escalate considerably and in a non-linear manner, considering the level of risk, rate, and terms of credit [19]. Similarly, there is an expectation of an increase in fraud in the following years [20]. Another problem to consider is the consistency of the information recorded at the different stages of the process, such as sales data [21], cultural variables, environmental data [22], macroeconomics [23], innovation capacity management and development, exchange rate evolution, Gross Domestic Product (GDP) growth trends [23,24], economic activity, and experience [25].
The relevant problems are addressed by various research papers using different ML approaches for the respective interpretations and decision-making [26]. However, there are also difficulties with the implemented models [27], which generally follow the black-box model, for example when predicting good and bad payers [28][29][30]. These models have presented problems, especially in difficult times such as the financial crisis of 2008 [1,31], since financial institutions focus on the loans that generate the most income, which are therefore of higher risk due to payment defaults [19,22]. Automatic evaluation models based on credit data could mistake good paying customers for bad ones [20] and apply penalties on possible benefits [32]. The low explainability of advanced non-linear ML models is slowing down their implementation [33]. The challenge lies in the development of Explainable Artificial Intelligence (XAI), whose objective is to provide predictive models with inherent transparency, especially in complex cases [34,35]. In that sense, one could use Shapley Additive exPlanations (SHAP) or the Generalized Shapley Choquet integral (GSCI) to expose parameter dependency [36,37], and an interpretable payment risk forecast model based on a penalized logistic tree to enhance the performance of Logistic Regression (LR) and LR + Slack Based Measure (SBM) [38,39]. The MLIA model and the variable logistic regression coefficient more intuitively reveal the contribution of a variable to the predicted outcome [40]. The non-payment problem is important because it could generate significant losses for financial entities [41,42].
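The SHAP values mentioned above rest on the classical Shapley decomposition. As a minimal, model-agnostic sketch (not the implementation used in the reviewed studies), the exact Shapley attribution for one prediction can be computed by brute force over feature coalitions, replacing features outside a coalition with their background mean; the synthetic data, the logistic model, and the instance `x_star` below are illustrative assumptions:

```python
import itertools
import math

import numpy as np
from sklearn.linear_model import LogisticRegression

def shapley_values(predict, x, background):
    """Exact Shapley attribution for one instance x: features outside a
    coalition are replaced by the background mean (a common simplification)."""
    n = len(x)
    base = background.mean(axis=0)

    def value(subset):
        z = base.copy()
        z[list(subset)] = x[list(subset)]
        return predict(z.reshape(1, -1))[0]

    phi = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for r in range(n):
            for S in itertools.combinations(others, r):
                # classic Shapley weight |S|! (n - |S| - 1)! / n!
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[j] += w * (value(S + (j,)) - value(S))
    return phi

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)  # only feature 0 matters
clf = LogisticRegression(max_iter=1000).fit(X, y)
predict = lambda z: clf.predict_proba(z)[:, 1]

x_star = np.array([2.0, -1.0, 0.5])  # hypothetical applicant
phi = shapley_values(predict, x_star, X)
print(np.round(phi, 3))  # feature 0 dominates the attribution
```

The brute force is exponential in the number of features, which is exactly why SHAP approximates these values for real credit datasets; the sketch does, however, satisfy the efficiency property (the attributions sum to the difference between the prediction and the background prediction).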
Here, the challenge for machine learning is to handle the multicollinearity present in the input data [43,44], where variables are highly correlated and some are not useful for classification [2]. Imbalance in the data actually used, with probable overfitting [20], could generate biases in machine learning [45,46], causing chaotic reputation management and malicious or criminal deception [47]. In other words, the challenge is to determine the effective, relevant features [48], for example for the training of neural networks (NNs), which is performed end-to-end by interactions. These are additional constraints with desirable features known a priori in order to reduce the number of interactions and prioritize smoothness; they are factors of explanatory interest in the domain, control, and generative modeling of features [49]. Other authors recommend the use of genetic algorithms (GAs) to guide training with the data sequences that give the best result [50]; the use of hive intelligence (HI) is also highlighted for this purpose [51][52][53], as are Boosted Category models (BCat) such as AdaBoost (ADAB), XGBoost (XGB), LightGBM (LGBM), and Bagging (BAG) [30,40,[54][55][56]; multi-classification (MC) and information fusion models (MIFCA) [44]; and noise removal using fuzzy logic (FL), which contributes to the identification of main attributes [28,37,57]. It is worth noting the possibility of evaluating the interaction of borrowers within their supply chain to enrich predictive models [17,58]. The use of images, interviews, text and social information, and interaction with mobile technology would give credit risk assessment a multidimensional and multi-focus character [15,59,60]; this points to the inclusion of integrated accounting information with statistical variables of profitability, liquidity, asset quality, management indices, capital adequacy, efficiency, the SCORECARD, the maximization of the internal rate of return (IRR), industry risks, and GDP [1,23].
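Multicollinearity of the kind described above is commonly screened with the Variance Inflation Factor (VIF). The snippet below is a minimal sketch on synthetic data; the variable names and the conventional cut-off of around 10 are illustrative assumptions, not values taken from the reviewed studies:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns.
    Values above ~10 are conventionally read as severe multicollinearity."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return np.array(vifs)

rng = np.random.default_rng(0)
income = rng.normal(50, 10, 500)
debt = 0.6 * income + rng.normal(0, 1, 500)  # nearly collinear with income
age = rng.normal(40, 12, 500)                # independent of the others
X = np.column_stack([income, debt, age])
print(np.round(vif(X), 1))  # income and debt show high VIF, age stays near 1
```

A feature-selection step would then drop or combine the offending columns before model training.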
Some authors maintain that the most relevant characteristics are gender, educational level, mortgage credit, microfinance, debt balance, and days past due [61]. Others maintain that the most relevant variables are the days of default, especially those greater than 90 days, to determine non-payment behavior, and consider that discriminatory variables such as gender, age, and marital status should not be used [55,61]. Consequently, the challenge of validating the quality of the features arises [39,62]. On this point, [17] maintains that the number of influential variables for risk assessment has increased, and that linear and non-linear time series relationships have increased in complexity.
In Table 4, we summarize the challenges in credit risk prediction and the ML methods and techniques proposed by the authors to address them. The main challenge is high uncertainty. The authors propose various classification models, including Boosted Category models, neural networks, and deep learning enhanced with fuzzy logic, genetic algorithms, and hive intelligence. For the low explainability of the results, the authors propose applying Explainable Artificial Intelligence, Shapley Additive exPlanations, the Generalized Shapley Choquet integral, and MLIA. To address the complexity of ML models, the authors propose optimizing the hyperparameters of the models supported by KFold CV, genetic algorithms, and grid search. For the multivariate origin of the data, the authors propose applying BIG DATA to take advantage of its volume characteristics and its unstructured nature. To address the naturally unbalanced character of the data, the authors propose applying SMOTE, RUS, ROS, KFold, and ADASYN. In their research to forecast credit risk, the authors use ML models that are 72.76% non-assembled (N-Ass) and 27.24% assembled (Ass), as shown in Table 5. In Figure 3, the Boosted Category family was used 43.9% of the time, NN/DL has a use rate of 9.8%, Traditional 22%, Other Models 3.7%, Fuzzy Logic 12.2%, and Collective Intelligence 8.5% of the total. These results demonstrate that N-Ass models are the most used in credit risk prediction. However, the authors use these as a baseline to compare them with Ass models generated by nesting N-Ass models and improving them using fuzzy logic and hive intelligence. We collected the metrics used by the researchers in their articles, taking into account that, in cases where more than one dataset is evaluated and more than one model is applied, a pair is selected by looking for the best ACC and AUC values, or another value where these are not used, as shown in Figure 4. From this simplification, the authors propose 48% assembled models and 52% non-assembled models. Of the total assembled models, the Boosted Category accounts for 21%, Collective Intelligence 8%, NN/DL 8%, Traditional 8%, and Fuzzy Logic 4%. Of the non-assembled models, the Boosted Category makes up 25%, Traditional 21%, and NN/DL 6%.
These results could demonstrate that the most-used metrics for credit risk prediction are AUC and ACC, since they allow for the comparison of different models. However, AUC is prioritized because it is not influenced by the class distribution and behaves better when using unbalanced data. We list the values of the five metrics most used by the authors in their research (AUC, ACC, F1 measure, Precision, and Recall) in Table 6. Furthermore, we have taken the values in each case according to the tuple defined in the question. To compare experiments, the characteristics of the dataset and the metric to be evaluated should correspond. In Table 7, we show only the metric values the authors used and evaluated in their research. Where the metrics have an empty value, the authors did not consider them in their experiment design. Considering that there are experiments in which the same dataset is used, we can compare their prediction capabilities. This occurs in the case of the UCI German Dataset, in which the XGB + DT model has an ACC of 84 [63], against the LR models [28] and Random Forest (RF) + C4.5 [16]. Other, less-used datasets on which we can compare the metrics are the Tsinghua-Pekin U RESSET database and the Kaggle Prosper dataset, as shown in Table 7. It is essential to indicate that the Lending Club dataset also permits comparison; however, it is necessary to consider the different date ranges that the authors used in their investigations [35,36]. The datasets used by the authors can be divided into 53.85% public and 46.15% private; the public ones are mainly used to validate new models and compare them with other experiments, while the private ones, as shown in Figure 5, are used to extract knowledge by revalidating existing models in real scenarios. In the public group, the most used is the UCI German dataset, with a usage of 15.38%, possibly due to its characteristics [16]; the second most used is the Lending Club platform, a P2P loan platform, used in 15.38% of the studies. Some authors validate both public and private data, and Ass and N-Ass models, in their experiments to determine their behavior in different scenarios.
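The reason AUC is preferred over ACC on unbalanced data can be seen with a toy example: a degenerate classifier that always predicts the majority class reaches high accuracy yet has no discriminative power by AUC. The 95/5 class split below is an illustrative assumption:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(42)
y_true = (rng.random(1000) < 0.05).astype(int)  # roughly 5% defaulters
naive_pred = np.zeros(1000, dtype=int)          # always predict "no default"

acc = accuracy_score(y_true, naive_pred)
auc = roc_auc_score(y_true, naive_pred)
print(round(acc, 3), auc)  # accuracy near 0.95, but AUC exactly 0.5
```

An AUC of 0.5 exposes the naive classifier as no better than chance, which is why the reviewed studies report AUC alongside or instead of ACC when the default rate is low.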
The financial industry needs machine learning tools to support credit risk prediction, as many experiments with private datasets demonstrate.However, since these use sensitive data, their access is limited, which could slow down community efforts to improve prediction using actual cases.

Answer to RQ5: What Variables or Features Are Used to Predict Credit Risk?
The authors propose many variables, using different methods to identify those with the best predictive capacity. GA, FL, hive intelligence, and statistical methods and functions are used to determine correlation; we display the details in Table 8.
To simplify the analysis, we have grouped the proposed variables into Demographic, with a 54.09% share; Operation, with 29.18%; Payment behavior, with 7.62%; External factors, with 6.69%; Unstructured data, with 1.30%; and Transaction, with 1.12%, as shown in Table 9.
The preference for demographic variables in credit risk prediction can be explained by their ability to represent the behavior, preferences, and socioeconomic profile of the client or the segments to which they belong; their inclusion contributes positively to the models, but it is not sufficient. They must be accompanied by Payment behavior variables, characteristics of the operation, and environmental variables that could influence the results. In the reviewed articles, the authors state limitations or problems faced during the experiments, although in each case there are nuances. We have grouped them as follows. Representativeness of reality, seen in 32% of the studies, refers to the fact that many of the existing variables do not reflect the true nature of the information. Unbalanced data, seen in 28%, refers to the fact that, according to some authors, highly unbalanced data significantly reduces the performance of the models. Inconsistency in information recording, noted in 17%, refers to existing records having been entered with errors, bias, and noise, generating the need to apply cleaning techniques at the risk of losing some information. Lack of ability to explain the proposed results, in 13%, refers to the explainability limitation of most robust ML models. Availability of information and centralized processing, in 6%, refers to the need to process information centrally, which can generate additional losses, noise, and delays. Adaptability in processing structured and unstructured information, in 4%, refers to the need to process structured and unstructured data within the operational process. We show the results in Table 10. For credit risk prediction models, the difficulties center on the consistency of the available information. Furthermore, the models' capacity to process it is a limitation, since in this industry the data is fundamentally unbalanced.

Additional Findings
During the development of this systematic literature review, we identified the preprocessing techniques that the authors refer to during their experiments.
The techniques most used by the authors for estimating the hyperparameters are KFold, in 58.33% of the studies, and Grid Search, in 22.22%. We display the details in Table 11.
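A typical k-fold cross-validation plus grid search setup of the kind the studies report can be sketched as follows; the parameter grid, the synthetic dataset, and the 90/10 class split are illustrative assumptions, not values taken from any reviewed study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced "credit" data stands in for a real portfolio.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}  # illustrative grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each grid point is scored by mean AUC over the 5 stratified folds.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      grid, scoring="roc_auc", cv=cv)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the minority class proportion stable across splits, which matters precisely because default data is unbalanced.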
The most-used algorithms to mitigate the problem of imbalance in the datasets are SMOTE at 29.55%, followed by KFold at 20.45%, RUS at 11.76%, and ROS at 11.36%. We show the details in Table 12. Also used are ADASYN; some variants of SMOTE such as B-SMOTE and SMOTE-T; adapted classification algorithms such as KN-SMOTE and Under-Bagging; and techniques to identify missing values such as NMISS, CC, CS-Classifiers, and RESAMPLE.
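The core idea of SMOTE, synthesizing minority samples by interpolating between a minority point and one of its k nearest minority neighbors, can be sketched as follows. This is a simplified reimplementation for illustration, not the reference implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between each chosen minority point and a random one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self included
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]  # skip idx[i][0], the point itself
        lam = rng.random()                  # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

rng = np.random.default_rng(3)
minority = rng.normal(loc=2.0, size=(20, 4))  # hypothetical defaulter records
synthetic = smote(minority, n_new=80)
print(synthetic.shape)  # (80, 4)
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the minority region rather than duplicating records as plain random oversampling (ROS) does.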
Credit risk prediction can be schematized as a classification problem with unbalanced samples [20]; in that sense, the authors' preprocessing techniques can be considered the baseline for new research.

Discussion
The most widely used ML models to assess credit risk correspond to the Traditional family, possibly due to their easy implementation. On the other hand, those with the best prediction results correspond to the Boosted Category, in both the Ass and N-Ass groups. This trend is evidenced in Table 13, where this family obtains 24 evaluations out of 52 and has grown constantly through recent years. Another trend identified is that better results are acquired with the N-Ass models [67], for example in the experiments that obtained an AUC of 91.00 with the AdaB + RF model [63] and 91.20 with XGB + DT [25], respectively, as shown in Table 7 and Figure 4. These results could be explained by gradient-based optimization features, parallelism, high-volume throughput, and the handling of missing values. The most-used metrics correspond to AUC, ACC, F1 measure, Recall, and Precision, although the authors also propose more specialized metrics according to the situation being evaluated. AUC and ACC appear in 16.11% and 14.22% of the studies, respectively, mainly due to their ability to measure the capacity of different types of ML models. The first does not vary under normalization, which allows the analysis of unbalanced data with a high cost of misclassification; the second works better with balanced data and is easy to explain. Likewise, of the experiments carried out, 53.85% use public datasets, while 46.15% correspond to private ones; the former serve to evaluate the predictive capacity of the new models proposed by the authors by comparing the results with previous experiences, and the latter to generate new knowledge through application in the real world. In the experiments, the authors identify the misrepresentation of reality, due to possible bias, inconsistency, or error when recording the information, as the main problem in the datasets for the design of ML models. The second problem corresponds to the imbalance of the data, which can impair the performance of the models. To face this problem, the SMOTE algorithm is mainly used, and for the optimization of the hyperparameters, the KFold CV and Grid Search techniques are utilized. However, some authors propose hive intelligence [53] and genetic algorithms [26,52] to guide optimization. Finally, the most-used variables, those that best represent the intention and ability to pay, which in turn originate credit risk, correspond to Demographic, Transaction, and Payment behavior features. These encompass the main characteristics expected to predict it; see Table 8. However, external features and unstructured data must also be considered, bearing in mind the influence of the hyperconnected world, the growing development of DDI, and the processing capacity of BIG DATA.

Conclusions and Future Research
In this systematic review article of the literature on credit risk prediction using ML, we reach the following conclusions:

•
The Boosted Category is the most investigated family of ML models. They are the most used in both Ass and N-Ass situations, highlighting the XGB model, with the tendency being toward its use in Ass configurations. In the experiments, this category obtained better results than other models due to its ability to process categorical and numerical variables with noise, missing values, and unbalanced data, and because applying regularization can avoid overfitting. However, since they are complex models, they are challenging to interpret and not very tolerant of atypical values.

•
The five most-used metrics are AUC, ACC, Recall, F1 measure, and Precision, although, in practice, the problem at hand must be considered when choosing the most appropriate metrics. AUC stands out because it is not influenced by the distribution of classes and behaves preferably when processing unbalanced data.

•

Public datasets are the most utilized; of this group, the most commonly used are the UCI German Dataset and the Lending Club Dataset. Their main use is to validate behavior against other models under the same conditions. Private datasets generate knowledge from application to a specific situation.

•
For the evaluation of credit risk through ML, demographic variables are mainly used, representing behavior, preferences, and socioeconomic profile, along with operation variables that represent the characteristics of the financial product acquired. However, this information is insufficient and must be complemented by external variables and those related to unstructured data, such as images, video, or others generated from hyperconnectivity, supported by DDI and BIG DATA development and processing.

•
The main problems are the representativeness of reality, the imbalance of data for training, and the inconsistency in recording information. All cases arise due to biases, errors, or problems in recording information.

•
The most widely used method to solve the imbalance problem is SMOTE, which optimizes the performance of ML models, while the methods to determine the hyperparameters are KFold-CV and Grid Search, which guide their optimization.
The contribution of credit risk prediction corresponds to the stage where the credit originates. In this sense, we propose extending the application of ML to precise credit datasets from specialized companies and including these models in other processes, such as credit collection and customer retention, considering the regulatory impositions governments are implementing to mitigate possible losses in the industry. Credit risk prediction can be enhanced with BIG DATA analysis, especially on unstructured data such as images, text, writing, and sentiment, and with hive intelligence to assess adaptability to changing scenarios [17]. Finally, in the same sense, including variables that represent the state of the environment could contribute to reducing uncertainty in this sector in the face of unexpected external events.

4.1. Answer to RQ1: What Are the Algorithms, Methods, and Models Used to Predict Credit Risk?
For better presentation, we group them into the following families: Boosted Category, for the models related to the Boosted algorithm; Collective Intelligence, for models related to collective or swarm intelligence, including Ant Colony Optimization (ACO), Bat Algorithm (BAT), and Particle Swarm Optimization (PSO); Fuzzy Logic, for models related to fuzzy logic, including Clustering-Based Fuzzy Classification (CBFC); NN/DL, for models related to neural networks or Deep Learning (DL), including Back Propagation Neural Network (BPNN), Artificial Neural Network (ANN), Multilayer Perceptron (MLP), Wide and Deep Neural Networks (Wide&Deep), Gated Recurrent Unit (GRU), Geometric Deep Learning (GDL), Graph Neural Networks (GNNs), Deep Genetic Hierarchical Network of Learners (DG), and Convolutional Neural Networks (CNNs); Other Models, for the models not cataloged; and Traditional, for the models cataloged but not related to the previous families, including Decision Tree (DT), Decision Tree C4.5 (C4.5), Classification And Regression Tree (CART), K-Means (KM), Linear Discriminant Analysis (LDA), Non-Linear Logistic Regression (NLR), Naive Bayes (NB), Random Forest (RF), Random Tree (RT), Support Vector Machine (SVM), and Sequential Minimal Optimization (SMO). Of the total models used, 50.83% correspond to the Traditional family, 27.24% to the Boosted Category, and 11.96% to NN/DL. Analyzing the N-Ass group separately, as shown in Figure 2, the Boosted Category family was used in 21% of the studies, NN/DL has a use rate of 12.8%, Traditional 61.6%, and Other Models 4.6%.
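The contrast between a Traditional-family model and a Boosted Category model can be sketched on synthetic, imbalanced data; the dataset and settings below are illustrative assumptions, and the scores are not comparable to those reported in the reviewed studies:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data stands in for a credit portfolio (15% "defaults").
X, y = make_classification(n_samples=2000, n_features=15, n_informative=6,
                           weights=[0.85, 0.15], flip_y=0.02, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

results = {}
for name, model in [("Traditional: LR", LogisticRegression(max_iter=1000)),
                    ("Boosted: GBM", GradientBoostingClassifier(random_state=7))]:
    model.fit(X_tr, y_tr)
    results[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {results[name]:.3f}")
```

Evaluating both families on the same held-out split with the same metric mirrors the comparison protocol the reviewed studies apply to the public benchmark datasets.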

4.2. Answer to RQ2: Which Are the Metrics to Evaluate the Performance of Algorithms, Methods, or Models?

Figure 4. Best models with family and author.

4.4. Answer to RQ4: What Datasets Are Used in the Prediction of Credit Risk?

Table 3. Application of inclusion and exclusion criteria.

Table 4. Challenges of credit risk prediction.

Table 5. Families of algorithms, methods, and models.

Table 11. Techniques for determination of hyperparameters.