Mathematical Modeling and Analysis of Credit Scoring Using the LIME Explainer: A Comprehensive Approach

Abstract: Credit scoring models serve as pivotal instruments for lenders and financial institutions, facilitating the assessment of creditworthiness. Traditional models, while instrumental, grapple with challenges related to efficiency and subjectivity. The advent of machine learning heralds a transformative era, offering data-driven solutions that transcend these limitations. This research delves into a comprehensive analysis of various machine learning algorithms, emphasizing their mathematical underpinnings and their applicability in credit score classification. A comprehensive evaluation is conducted on a range of algorithms, including logistic regression, decision trees, support vector machines, and neural networks, using publicly available credit datasets. Within the research, a unified mathematical framework is introduced, which encompasses preprocessing techniques and critical algorithms such as Particle Swarm Optimization (PSO), the Light Gradient Boosting Model, and Extreme Gradient Boosting (XGB), among others. The focal point of the investigation is the LIME (Local Interpretable Model-agnostic Explanations) explainer. This study offers a comprehensive mathematical model using the LIME explainer, shedding light on its pivotal role in elucidating the intricacies of complex machine learning models. This study's empirical findings offer compelling evidence of the efficacy of these methodologies in credit scoring, with notable accuracies of 88.84%, 78.30%, and 77.80% for the Australian, German, and South German datasets, respectively. In summary, this research not only amplifies the significance of machine learning in credit scoring but also accentuates the importance of mathematical modeling and the LIME explainer, providing a roadmap for practitioners to navigate the evolving landscape of credit assessment.


Introduction
In today's dynamic and interconnected global economy, credit plays a pivotal role in facilitating economic activities and fostering financial growth. Lenders and financial institutions rely heavily on credit scoring models to assess the creditworthiness of individuals, businesses, and other entities seeking access to financial products and services.
A credit score is a numerical representation of an individual's creditworthiness, which helps lenders gauge the risk associated with extending credit and making lending decisions. As such, credit scoring has become an indispensable tool in the modern financial landscape, shaping access to credit and influencing financial outcomes for millions of borrowers worldwide. The development and refinement of credit scoring models have evolved significantly over the years, driven by advancements in data analytics, statistical modeling techniques, and the availability of vast amounts of financial and nonfinancial data [1,2]. Traditionally, credit scoring was primarily based on a few key factors, such as payment history, outstanding debt, length of credit history, and new credit applications. However, contemporary credit scoring models have incorporated a more diverse set of variables and sophisticated algorithms to enhance predictive accuracy and provide a more comprehensive assessment of credit risk. The importance of credit scoring cannot be overstated, as it not only affects the availability of credit but also influences interest rates, loan terms, and overall financial inclusion. Access to affordable credit is crucial for individuals and businesses to pursue their aspirations, invest in productive ventures, and contribute to economic growth. Moreover, credit scoring also plays a vital role in mitigating risks for lenders, enabling them to make informed decisions, manage their loan portfolios effectively, and maintain the stability of the financial system [3,4].
Traditionally, credit scoring has heavily relied on manual processes, limited variables, and subjective criteria, leading to inefficiencies and potential biases in the evaluation process. However, the advent of machine learning techniques has revolutionized the credit scoring landscape, offering automated and data-driven approaches that can significantly enhance the accuracy and efficiency of credit assessments. Machine learning algorithms have demonstrated remarkable capabilities in handling complex and high-dimensional data, learning patterns and relationships, and making predictions based on historical information. By training on large-scale credit datasets, machine learning models can capture intricate credit patterns that may be overlooked by traditional methods. Furthermore, these algorithms can adapt and evolve as new data become available, ensuring their relevance in dynamic credit markets. The utilization of machine learning in credit scoring holds the promise of providing lenders with more objective, consistent, and reliable credit assessment models [5][6][7].
The objective of this research is to conduct experiments and analyses using various machine learning classifiers on different credit approval datasets. The research aims to evaluate and compare the performance of these classifiers in terms of multiple evaluation metrics, including accuracy, sensitivity, specificity, precision, F1 score, receiver operating characteristic (ROC), balanced accuracy, and weighted sum metric (WSM) performance. Through these systematic experiments, the research seeks to assess how well these classifiers can predict credit approval outcomes and identify which classifiers perform best under different dataset conditions. Additionally, the research involves analyzing the impact of various data preprocessing techniques, such as feature selection and scaling, on the classifiers' performance. The key research questions to be addressed in this study include the following:
- How do mathematical formulations underpin the machine learning algorithms employed in credit score classification, and which algorithms demonstrate superior predictive performance?
- In the context of credit scoring models, how does the mathematical modeling of feature selection, especially using the PSO metaheuristic optimizer, influence the model's accuracy and efficiency?
- In what ways can the mathematical optimization of hyperparameters (e.g., learning rates, regularization strengths) in machine learning models influence their performance in credit score classification, and are there specific optimization algorithms that are more effective for this domain?
- Through the incorporation of the mathematical model of the LIME explainer, what insights can be gleaned concerning the strengths, limitations, and interpretability of various machine learning approaches in the context of credit scoring?
- Based on the empirical findings and mathematical rigor introduced in this study, how can practitioners be better equipped to choose the most suitable machine learning techniques and feature selection methods for credit score classification?
To achieve these research objectives, a comprehensive experimental analysis is conducted using publicly available credit datasets. Various machine learning algorithms, including but not limited to logistic regression, decision trees, random forests, support vector machines, and neural networks, are implemented and evaluated. The performance of these algorithms is measured using standard evaluation metrics such as accuracy, precision, recall, and F1 score. Furthermore, the impact of feature selection methods is investigated to determine the most relevant variables for credit scoring. The significance of this research extends across several dimensions, providing a comprehensive understanding of credit scoring models due to the following:
- This study introduces a robust mathematical framework underpinning machine learning algorithms and preprocessing techniques in the realm of credit scoring. This framework ensures the solidity of credit scoring models, providing a sound basis for analysis and decision making.
- Through the rigorous mathematical modeling of feature selection, particularly harnessing the Particle Swarm Optimization (PSO) metaheuristic optimizer, this research provides valuable insights into the identification of the most relevant variables for credit scoring. This optimization process not only improves the accuracy of credit scoring models but also enhances their computational efficiency, making them more practical for real-world applications.
The findings of this research not only contribute to the current state of credit scoring but also pave the way for further advancements in the field. By identifying areas where mathematical rigor, feature selection, and interpretability can be enhanced, it opens doors to future research, innovation, and continuous improvement in credit scoring models and practices. This advancement is essential in keeping pace with evolving financial landscapes and data-driven technologies.
The rest of this research paper is structured as follows: Section 2 presents a review of the literature concerned with feature selection and machine learning algorithms for credit scoring. Section 3 presents the datasets utilized in this study. Section 4 provides a detailed discussion of the proposed approach. Section 5 describes the experiments conducted and discusses their outcomes. Lastly, Section 6 concludes this paper and outlines future research directions.

Feature Selection, Machine Learning, and Credit Scoring: A Review of Literature
Default risk is a primary concern in online lending, prompting the use of credit scoring models to assess borrower creditworthiness. Existing efforts have mainly focused on improving assessment methods without adequately addressing the quality of credit data, which is often plagued by noisy, redundant, or irrelevant features that hinder model accuracy. Effective feature selection methods are crucial for enhancing credit evaluation accuracy. Current feature selection methods in online credit scoring suffer from issues like subjectivity, time consumption, and low accuracy, necessitating the introduction of innovative approaches. Zhang et al. [8] proposed a solution called the local binary social spider algorithm (LBSA), which incorporates two local optimization strategies (i.e., opposition-based learning (OBL) and an improved local search algorithm (ILSA)) into BinSSA. These strategies address the aforementioned drawbacks. Comparative experiments conducted on three typical online credit datasets (i.e., Paipaidai (PPD) and Renrendai (RRD) in China, and Lending Club (LC) in the United States) concluded that LBSA significantly reduces feature subset redundancy, enhances iterative stability, and improves credit scoring model accuracy and effectiveness.
Tripathi et al. [9] directed their efforts toward enhancing credit scoring models employed by financial institutions and credit industries. Their primary objective was to enhance model performance by introducing a hybrid methodology that combines feature selection with a multilayer ensemble classifier framework. This hybrid model was meticulously crafted in three distinct phases: initial preprocessing and classifier ranking, followed by ensemble feature selection, and ultimately the utilization of the selected features within a multilayer ensemble classifier framework. To further optimize ensemble performance, they introduced a classifier placement algorithm based on the Choquet integral value. Then, the researchers conducted experiments using real-world datasets, including Australian (AUS), Japanese (JPD), German-categorical (GCD), and German-numerical (GND). The findings indicated that the features chosen through their proposed approach exhibited enhanced representativeness, leading to improved classification accuracy across various classifiers such as quadratic discriminant analysis (QDA), Naïve Bayes (NB), multilayer feed-forward neural network (MLFN), time-delay neural network (TDNN), distributed time-delay neural network (DTNN), decision tree (DT), and support vector machine (SVM). Additionally, for all the credit scoring datasets considered, the proposed ensemble model consistently outperformed traditional ensemble models in terms of accuracy, sensitivity, and G-measure.
Furthermore, Zhang et al. [10] introduced a novel multistage ensemble model with enhanced outlier adaptation to enhance credit scoring predictions. To mitigate the impact of outliers in noisy credit datasets, an improved local outlier factor algorithm was employed, incorporating a bagging strategy to identify and integrate outliers into the training set, thereby enhancing base classifier adaptability. Additionally, for improved feature interpretability, a novel dimension-reduced feature transformation method was proposed to hierarchically evolve and extract salient features. To further enhance predictive power, a stacking-based ensemble learning approach with self-adaptive parameter optimization was introduced, automatically optimizing base classifier parameters and constructing a multistage ensemble model. The performance of this model was evaluated across ten datasets (e.g., the Australian, Japanese, German, Taiwan, and Polish credit datasets) using six evaluation metrics, and the reported experimental results demonstrated the superior performance and effectiveness of the suggested approach.
A sequential ensemble credit scoring model based on XGBoost, a variation of the gradient boosting machine, was proposed by Xia et al. [11]. The proposed XGBoost-based credit scoring model consists of three phases (i.e., data preprocessing, data scaling, and missing value marking). The redundant features are then removed using a model-based feature selection approach, which enhances performance and lowers computing costs. The final model is trained using the acquired configuration after the hyperparameters have been tuned using the Tree-structured Parzen Estimator (TPE) method. The results show that TPE hyperparameter optimization outperforms grid search, random search, and manual search.
The proposed model also provides feature importance scores and decision charts, which enhance the interpretability of the credit scoring model. Moreover, Liu et al. [12] introduced two tree-based augmented GBDTs, AugBoost-RFS and AugBoost-RFU. These methods incorporate a stepwise feature augmentation mechanism to diversify base classifiers within GBDT, and they maintain interpretability through tree-based embedding techniques. Experimental results on four large-scale credit scoring datasets demonstrated that AugBoost-RFS and AugBoost-RFU outperform standard GBDT. Moreover, their supervised tree-based feature augmentation achieved competitive results compared with neural network-based methods, while significantly improving efficiency.
Chen et al. [13] proposed a multilevel Weighted Voting classification algorithm based on the combination of classifier ranking and the AdaBoost algorithm. Four feature selection methods were used to select the features; then, seven commonly used heterogeneous classifiers were used to select five classifiers and calculate their ranks, and then AdaBoost was used to boost the performance of the selected base classifiers and calculate the updated F1 scores and ranks. The effects of the ensemble frameworks Majority Voting (MV), Weighted Voting (WV), Layered Majority Voting (LMV), and Layered Weighted Voting (LWV) were all evaluated from the aspects of accuracy, sensitivity, specificity, and G-measure. The outcome of the experiments showed that the presented method achieved significant results on the Australian credit score data and some progress on the German loan approval data. In Gicić et al. [14], stacked unidirectional and bidirectional LSTM networks were applied to solve credit scoring tasks. The proposed model exploited the full potential of the three-layer stacked LSTM and BiLSTM architecture with the treatment and modeling of public datasets. The attributes of each loan instance were transformed into a matrix sequence using a fixed sliding window approach with a one-time step. The proposed models outperformed existing and more complex deep learning models and, thus, succeeded in preserving their simplicity.
Kazemi et al. [15] proposed an approach based on a Genetic Algorithm (GA) and neural networks (NNs) to automatically find customized cut-off values. Since credit scoring is a binary classification problem, two popular credit scoring datasets (i.e., the "Australian" and "German" credit datasets) were used to test the proposed approach. The numerical results reveal that the proposed GA-NN model could successfully find customized acceptance thresholds, considering predetermined performance criteria, including Accuracy, Estimated Misclassification Cost (EMC), and AUC for the tested datasets. Furthermore, the best-obtained results and the paired samples t-test results showed that utilizing the customized cut-off points leads to a more accurate classification than the commonly used threshold value of 0.5. Khatir and Bee [16] aimed to pinpoint the most significant predictors of credit default to construct machine learning classifiers capable of efficiently distinguishing defaulters from nondefaulters. They proposed five machine learning classifiers, and each of them was combined with different feature selection techniques and various data-balancing approaches. Given the imbalance in the used dataset (i.e., German Credit Data), three sample-modifying algorithms were used, and their impact on the performance of the classification models was evaluated. The key findings highlighted that the most effective classifier is a random forest combined with random forest recursive feature elimination and random oversampling. Moreover, it underscored the value of data-balancing algorithms, particularly in enhancing sensitivity.
Khan and Ghosh [17] introduced an improved version of the random wheel classifier. Their proposed approach was evaluated using two datasets (i.e., the Australian and South German credit approval datasets). The results showed that their approach not only delivers more accurate and precise recommendations but also offers interpretable confidence levels. Additionally, it provided explanations for each credit application recommendation. This inclusion of recommendation confidence and explanations can instill greater trust in machine-provided intelligence, potentially enhancing the efficiency of the credit approval process. Haldankar [18] discussed the use of data mining techniques to identify fraud in various domains, particularly focusing on risk detection. The study proposed a cost-sensitive classifier for detecting risk using the Statlog (German Credit Data) dataset.
The study demonstrated the effectiveness of proper feature selection combined with an ensemble approach and thresholding in reducing the overall cost. The study reported an accuracy (ACC) of 76% and a specificity (SPC) of 55%.
Wang et al. [19] focused on ensemble classification. They conducted an analysis and comparison of SVM ensembles using four different ensemble constructing techniques. They reported the highest ACC of 85.35% using the Statlog (Australian credit approval) dataset and 76.41% using the Statlog (German Credit Data) dataset. Additionally, Novakovic et al. [20] presented the performance of the C4.5 decision tree algorithm with wrapper-based feature selection. They conducted tests using eighteen datasets to compare the classification ACC results with the C4.5 decision tree algorithm. The authors demonstrated that wrapper-based feature selection, when applied to the C4.5 decision tree classifier, effectively contributed to the detection and elimination of irrelevant and redundant data and noise. They reported an ACC of 71.72% using the J48 reduced approach on the Statlog (German Credit Data) dataset.

Attribute Description

Methodology
The current study proposes the framework depicted in Figure 1. The figure comprises three components: (a) the abstract view of the suggested training and optimization framework, which involves loading the dataset, applying a normalization technique, extracting the most promising features, selecting a model, and tuning the selected model; (b) the flow of the feature selection process utilizing the PSO metaheuristic optimizer, which encompasses initializing the PSO hyperparameters and solutions, calculating fitness scores for different solutions, and updating the solutions; and (c) a model explanation using the LIME explainer, which takes the instance that requires explanation and the tuned model as input and generates an explanation for it.

Feature Scaling Techniques
Scalers, also known as data normalization or feature scaling techniques, are preprocessing methods used to transform the values of features in a dataset to a common scale. Scaling is crucial in machine learning tasks, as it helps to ensure that features with different ranges or units contribute equally to the learning process [24]. In this section, a background on several commonly used scalers is provided, including the L1, L2, Max, Standardization, MinMax, and Max-Absolute scalers.

The L1 scaler, also known as the least absolute deviations scaler, normalizes the features in a dataset by dividing each feature by the sum of their absolute values. This scaler ensures that the sum of absolute feature values is equal to 1. It is particularly useful when the presence or absence of features is important and their magnitudes are not relevant. The L2 scaler, also known as the Euclidean norm scaler, normalizes the features by dividing each feature by the square root of the sum of their squares. This scaler ensures that the sum of squared feature values is equal to 1. It is commonly used when both the presence or absence of features and their magnitudes are relevant [25,26]. Equations (1) and (2) show how to calculate the L1 and L2 scalers, respectively, where X represents the original feature values, X_scaled represents the scaled feature values, and |X| represents the absolute values of the elements in X:

X_scaled = X / Σ|X|    (1)

X_scaled = X / √(Σ X²)    (2)

The Max scaler, also known as the maximum scaler, scales the features by dividing each feature by the maximum value across the entire feature set. This scaler maps the features into the range [0, 1]. It is particularly useful when the distribution of the features is highly skewed or contains outliers. Standardization (STD), also known as the Z-score scaler, transforms the features by subtracting the mean of the feature set and dividing by the standard deviation. This scaler ensures that the transformed features have zero mean and unit variance. It is commonly used when the features follow a Gaussian distribution or when algorithms assume standardized input. The MinMax scaler scales the features by subtracting the minimum value and dividing by the difference between the maximum and minimum values. This scaler maps the features into the range [0, 1]. It preserves the relative relationships and proportions of the feature values and is useful when the distribution of the features is not necessarily Gaussian. The Max-Absolute scaler scales the features by dividing each feature by the maximum absolute value across the entire feature set. This scaler maps the features into the range [−1, 1]. It is particularly useful when preserving the sign of the data is important, such as in sparse datasets [27,28]. Equations (3)-(6) show how to calculate the Max, STD, MinMax, and Max-Absolute scalers, respectively, where µ represents the mean of the feature set and σ represents the standard deviation of the feature set:

X_scaled = X / max(X)    (3)

X_scaled = (X − µ) / σ    (4)

X_scaled = (X − min(X)) / (max(X) − min(X))    (5)

X_scaled = X / max(|X|)    (6)
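As an illustration, the six scalers can be sketched in a few lines of NumPy. This is a minimal per-vector sketch: in practice, Standardization, MinMax, and Max-Absolute are applied per feature column, and library implementations such as scikit-learn's scalers should be preferred:

```python
import numpy as np

def l1_scale(x):
    # L1: divide by the sum of absolute values so that sum(|x_scaled|) == 1
    return x / np.sum(np.abs(x))

def l2_scale(x):
    # L2: divide by the Euclidean norm so that sum(x_scaled**2) == 1
    return x / np.sqrt(np.sum(x ** 2))

def max_scale(x):
    # Max: divide by the maximum value (maps non-negative data into [0, 1])
    return x / np.max(x)

def std_scale(x):
    # Standardization (Z-score): zero mean, unit variance
    return (x - np.mean(x)) / np.std(x)

def minmax_scale(x):
    # MinMax: map into [0, 1] while preserving relative proportions
    return (x - np.min(x)) / (np.max(x) - np.min(x))

def maxabs_scale(x):
    # Max-Absolute: divide by the largest absolute value (maps into [-1, 1])
    return x / np.max(np.abs(x))

x = np.array([2.0, 4.0, 6.0, 8.0])
print(l1_scale(x))      # elements sum to 1
print(minmax_scale(x))  # smallest value maps to 0, largest to 1
```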

Feature Selection Using Particle Swarm Optimization (PSO)
Feature selection plays a crucial role in machine learning and data mining tasks by identifying the most informative and relevant subset of features from a given dataset. It aims to improve model performance, reduce computational complexity, and enhance interpretability by selecting a subset of features that are highly predictive of the target variable. Particle Swarm Optimization (PSO) is a population-based optimization algorithm inspired by the social behavior of bird flocking or fish schooling. It has been widely applied to feature selection due to its ability to efficiently explore high-dimensional search spaces and find near-optimal solutions [29]. The main goal of using PSO for feature selection is to find an optimal subset of features that maximizes the performance of a given machine learning model. The process involves defining a fitness function that quantifies the quality of a feature subset based on its predictive power or some other criterion. The fitness function can be based on classification accuracy, regression error, or any other evaluation metric appropriate for the task at hand. PSO-based feature selection offers several advantages. It can effectively handle high-dimensional feature spaces and explore a large number of possible feature combinations. PSO's ability to balance exploration and exploitation helps in finding near-optimal solutions efficiently. Furthermore, PSO is a versatile technique that can be combined with various machine learning algorithms, making it applicable to different problem domains [30][31][32].
In PSO-based feature selection, each particle in the swarm represents a potential feature subset. The swarm collectively explores the search space of possible feature combinations by adjusting their positions and velocities. The position of a particle corresponds to a binary string, where each bit represents the presence or absence of a particular feature. The velocity represents the direction and magnitude of change in the binary string. During the optimization process, particles update their velocities and positions based on their own experience (i.e., personal best) and the best solution found by any particle in the swarm (i.e., global best). The personal best represents the best feature subset the particle has encountered so far, while the global best represents the best feature subset found by any particle in the swarm. These best positions guide the movement of particles towards promising regions in the search space. The update equations for PSO-based feature selection involve modifying the velocities and positions of particles based on the current velocities, personal bests, and global best. The specific equations may vary depending on the variant of PSO used and the problem formulation. The iterative optimization process continues until a termination criterion is met, such as reaching a maximum number of iterations or convergence of the particle positions. The resulting global best position represents the selected feature subset that optimizes the performance of the chosen machine learning model [33][34][35].
Let P be the population of particles in the swarm, where each particle p_i represents a potential feature subset (i.e., the ith particle). Each particle p_i has a position vector x_i and a velocity vector v_i, where x_ij and v_ij represent the jth element of x_i and v_i, respectively. The feature subset is represented as a binary string x_i of length n, where x_ij denotes the presence (x_ij = 1) or absence (x_ij = 0) of feature j in particle i.
The position of particle p_i is denoted as x_i = (x_i1, x_i2, ..., x_in), and the velocity is represented as v_i = (v_i1, v_i2, ..., v_in). During the optimization process, particles update their velocities and positions based on their personal best (p_best_i) and the global best (g_best) solution found by any particle in the swarm.
The velocity update equation for particle p_i is given by Equation (7), where v_ij^(t+1) is the updated velocity of feature j in particle i at iteration (t + 1), w is the inertia weight, c_1 and c_2 are acceleration constants, r_1^(t) and r_2^(t) are random values at iteration t, p_best_ij is the personal best value of feature j for particle i, and g_best_j is the global best value of feature j among all particles:

v_ij^(t+1) = w · v_ij^(t) + c_1 · r_1^(t) · (p_best_ij − x_ij^(t)) + c_2 · r_2^(t) · (g_best_j − x_ij^(t))    (7)

The position update equation for particle p_i is given by Equation (8); since positions are binary, the velocity is mapped through the Sigmoid function S(·) and used as the probability of setting the bit to 1, where r^(t) is a uniform random value:

x_ij^(t+1) = 1 if r^(t) < S(v_ij^(t+1)); otherwise, x_ij^(t+1) = 0    (8)

The personal best value p_best_ij is updated if the fitness f(·) of the current position is better than that of the previous personal best, as presented in Equation (9). The global best value g_best_j is updated by selecting the best position from all particles, as presented in Equation (10), where i* is the index of the particle with the best fitness among all particles:

p_best_ij = x_ij^(t+1) if f(x_i^(t+1)) > f(p_best_i); otherwise, p_best_ij is unchanged    (9)

g_best_j = p_best_{i*,j}    (10)
The process continues iteratively until a termination criterion is met, such as a maximum number of iterations or convergence. The final feature subset is represented by the binary string of the global best position (i.e., g_best), which optimizes the performance of the chosen model.
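The binary PSO loop described above can be sketched in NumPy as follows. The fitness function here is a hypothetical toy (correlation of the selected-feature mean with the target, minus a per-feature penalty), standing in for the cross-validated classifier performance a real credit-scoring pipeline would use; the inertia weight, acceleration constants, and swarm size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Toy fitness (hypothetical): correlation of the mean of the selected
    # features with the target, minus a small penalty per selected feature.
    if mask.sum() == 0:
        return -np.inf
    score = abs(np.corrcoef(X[:, mask == 1].mean(axis=1), y)[0, 1])
    return score - 0.01 * mask.sum()

def binary_pso(X, y, n_particles=20, n_iter=30, w=0.7, c1=1.5, c2=1.5):
    n = X.shape[1]
    pos = rng.integers(0, 2, (n_particles, n))          # binary feature masks
    vel = rng.normal(0.0, 1.0, (n_particles, n))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[np.argmax(pbest_fit)].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, n))
        # Velocity update (Equation (7))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -6.0, 6.0)                   # keep the Sigmoid well-behaved
        # Binary position update: Sigmoid(velocity) as bit probability (Equation (8))
        pos = (rng.random((n_particles, n)) < 1.0 / (1.0 + np.exp(-vel))).astype(int)
        fit = np.array([fitness(p, X, y) for p in pos])
        improved = fit > pbest_fit                      # Equation (9)
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        gbest = pbest[np.argmax(pbest_fit)].copy()      # Equation (10)
    return gbest

X = rng.normal(size=(200, 10))
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)      # only features 0 and 1 are informative
mask = binary_pso(X, y)
print("selected features:", np.flatnonzero(mask))
```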

Machine Learning Classification and Tuning
Machine learning classifiers are algorithms that are designed to learn patterns and make predictions based on labeled training data. They are widely used in various domains, including image recognition, natural language processing, fraud detection, and credit scoring. This study utilized the following machine learning classifiers: LGBM, XGB, KNN, DT, LR, RF, AdaBoost, HGB, and MLP. The LGBM (Light Gradient Boosting Model), as presented in Equation (11), is a gradient boosting framework that uses tree-based learning algorithms. It is known for its high efficiency and scalability, making it suitable for large-scale datasets. LGBM utilizes a gradient-based optimization strategy to construct an ensemble of weak models that sequentially minimize the loss function. It incorporates features such as histogram-based binning and leafwise tree growth to achieve faster training and better accuracy. In Equation (11), N is the number of weak models, α_i are the coefficients, and h_i(x) are the weak models:

F(x) = Σ_{i=1}^{N} α_i · h_i(x)    (11)

XGB (Extreme Gradient Boosting), as presented in Equation (12), is another popular gradient boosting algorithm that excels in predictive accuracy. It uses a similar approach to LGBM but incorporates additional regularization techniques to prevent overfitting. XGB employs a combination of gradient boosting and decision tree algorithms, optimizing a differentiable loss function through successive iterations. It offers flexibility in terms of customizing the optimization objectives and evaluation metrics [36,37]. In Equation (12), f_i(x) are the base models, T(x; Θ_t) are the decision trees, and γ_t are the step sizes:

F(x) = Σ_i f_i(x), where f_t(x) = γ_t · T(x; Θ_t)    (12)
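The additive ensemble form shared by LGBM and XGB can be made concrete with a small from-scratch sketch: depth-1 regression trees (stumps) are fit sequentially to the residuals, i.e., the negative gradient of the squared loss. This is a deliberately simplified illustration, not the LGBM/XGB implementations, which add histogram binning, regularization, and second-order gradient information; the learning rate and round count are assumed values:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, r):
    # Find the depth-1 regression tree (stump) minimizing squared error on residuals r
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            lv, rv = r[left].mean(), r[~left].mean()
            err = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if err < best_err:
                best_err, best = err, (j, thr, lv, rv)
    return best

def stump_predict(stump, X):
    j, thr, lv, rv = stump
    return np.where(X[:, j] <= thr, lv, rv)

def gradient_boost(X, y, n_rounds=50, lr=0.1):
    # Additive model F(x) = mean(y) + sum_t lr * h_t(x): each stump is fit to the
    # current residuals, i.e., the negative gradient of the squared loss
    pred = np.full(len(y), y.mean())
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y - pred)
        pred = pred + lr * stump_predict(stump, X)
        stumps.append(stump)
    return stumps, pred

X = rng.uniform(-3, 3, (200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
stumps, pred = gradient_boost(X, y)
mse = np.mean((pred - y) ** 2)
print("train MSE:", mse, "vs. target variance:", np.var(y))
```

Each round shrinks the training error, mirroring how Equations (11) and (12) accumulate weighted weak learners into a single strong predictor.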
KNN (K-Nearest Neighbors), as presented in Equation (13), is a nonparametric algorithm that classifies new instances based on their similarity to the labeled training instances.
It operates on the principle that objects with similar attributes tend to belong to the same class. KNN determines the class of an unseen instance by considering the labels of its k-nearest neighbors in the feature space. The choice of k influences the trade-off between model complexity and accuracy. In Equation (13), y_neighbors are the class labels of the k-nearest neighbors of x. DT (Decision Trees), as presented in Equation (14), are hierarchical models that recursively partition the feature space based on attribute values. Each internal node represents a decision based on a specific feature, while each leaf node represents a class label or a prediction. Decision trees are interpretable, capable of handling both categorical and numerical features, and they are resistant to outliers. However, they are prone to overfitting, especially when the trees become too complex. In Equation (14), N_i are the internal nodes, S_i are the splits, and θ_i are the threshold values.
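The mode-of-neighbors rule of Equation (13) is small enough to write directly; a minimal sketch with Euclidean distance on a toy two-cluster dataset (k = 3 is an assumed choice):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Class(x) = mode(y_neighbors): majority label among the k nearest training points
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances to all points
    neighbors = y_train[np.argsort(d)[:k]]          # labels of the k closest
    return Counter(neighbors).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0 (near the first cluster)
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # -> 1 (near the second cluster)
```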
KNN: Class(x) = mode(y_neighbors)    (13)

LR (Logistic Regression), as presented in Equation (15), is a linear classifier that models the relationship between the input features and the probability of belonging to a certain class. It is commonly used for binary classification tasks but can be extended to handle multiclass problems as well. LR applies a Sigmoid function to the linear combination of the input features, mapping the result to a probability between 0 and 1. It learns the optimal weights through maximum likelihood estimation. In Equation (15), β_0 to β_n are the factors of the LR equation:

LR: P(y = 1 | x) = 1 / (1 + e^−(β_0 + β_1·x_1 + ... + β_n·x_n))    (15)

RF (Random Forest), as presented in Equation (16), is an ensemble learning method that combines multiple decision trees to make predictions. It constructs each tree by using a random subset of the training data and a random subset of the input features. RF leverages the principle of the "wisdom of crowds" to reduce overfitting and improve generalization performance. It provides feature importance measures and can handle high-dimensional data effectively [38,39]. In Equation (16), T is the number of trees and T_i(X) are the individual decision trees.
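The Sigmoid-of-a-linear-combination form of Equation (15) and the maximum-likelihood fit can be sketched with plain gradient ascent; the synthetic data, learning rate, and iteration count are illustrative assumptions, and a production model would use an optimized solver such as scikit-learn's:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    # Gradient ascent on the log-likelihood of P(y=1|x) = sigmoid(beta . [1, x])
    Xb = np.hstack([np.ones((len(X), 1)), X])   # column of ones carries the intercept beta_0
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T @ (y - sigmoid(Xb @ beta)) / len(y)
        beta += lr * grad
    return beta

X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # synthetic, linearly separable labels
beta = fit_logistic(X, y)
Xb = np.hstack([np.ones((300, 1)), X])
acc = np.mean((sigmoid(Xb @ beta) > 0.5) == y)
print("training accuracy:", acc)
```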
AdaBoost (Adaptive Boosting), as presented in Equation (17), is an ensemble learning technique that combines weak classifiers to create a strong classifier: H(x) = sign(Σ_t α_t h_t(x)). It assigns higher weights to misclassified instances, allowing subsequent classifiers to focus on difficult examples. The final prediction is determined by a weighted vote of all weak classifiers. AdaBoost is particularly effective on complex datasets and can achieve high accuracy even with weak base classifiers. α_t are the weights and h_t(x) are the weak classifiers. HGB (Histogram Gradient Boosting), as presented in Equation (18), is a histogram-based gradient boosting algorithm that combines the advantages of gradient boosting and histogram binning: F_t(x) = F_{t−1}(x) + γ_t H(x; Θ_t). It discretizes the continuous input features into histograms, allowing for faster training and efficient memory usage. HGB incorporates various optimization techniques, including early stopping and feature subsampling, to enhance performance. f_i(x) are the base models, H(x; Θ_t) are the histogram-based learners, and γ_t are the step sizes.
MLP (Multilayer Perceptron), as presented in Equation (19), is a type of artificial neural network that consists of multiple layers of interconnected nodes (neurons). It learns by adjusting the weights and biases associated with each connection, enabling it to approximate complex nonlinear functions. MLP is a versatile classifier capable of handling a wide range of problem domains. It requires careful architecture design and appropriate activation functions to achieve good performance [37,40].
Hyperparameter tuning is a critical aspect of machine learning model development, as it involves selecting the optimal configuration of hyperparameters that govern the behavior of the model. Hyperparameters significantly impact the model's performance, and finding the best combination can be a challenging and time-consuming process. One popular technique for hyperparameter optimization is the Tree-structured Parzen Estimator (TPE). The TPE algorithm is a sequential model-based optimization approach that uses Bayesian optimization to efficiently search the hyperparameter space. It models the relationship between hyperparameters and the performance metric of interest, typically using a Gaussian Process. TPE divides the search space into two parts: the exploration space, where hyperparameters are randomly sampled, and the exploitation space, where the most promising hyperparameters are selected based on their expected improvement. Adaptive TPE takes the TPE algorithm further by incorporating adaptive mechanisms to dynamically adjust the search process based on the observed performance. It continuously learns from the optimization process and adapts the exploration-exploitation trade-off accordingly. This adaptivity allows Adaptive TPE to focus the search on promising regions of the hyperparameter space and efficiently explore different configurations [41-43].
The Adaptive TPE algorithm follows these key steps. Initialization: The search process begins with an initial set of hyperparameter configurations randomly sampled from the search space. Evaluation: Each configuration is evaluated using cross-validation or another appropriate evaluation method to obtain the performance metric. Modeling: A probabilistic model, such as a Gaussian Process, is constructed to capture the relationship between hyperparameters and the performance metric. Selection: Based on the probabilistic model, the next set of hyperparameter configurations is selected using the expected improvement or another acquisition function. This balances the exploration of unexplored regions and the exploitation of promising configurations. Update: The selected configurations are evaluated, and the performance results are used to update the probabilistic model. Iteration: The selection and update steps are repeated iteratively until a stopping criterion is met, such as a maximum number of iterations or convergence of the performance metric [41,44].
The advantages of Adaptive TPE for hyperparameter tuning include its ability to efficiently explore the search space, adapt to the observed performance, and converge to promising configurations. It balances exploration and exploitation to find the best hyperparameters within a reasonable computational budget. Adaptive TPE is applicable to various machine learning algorithms and can significantly improve model performance compared with using default or suboptimal hyperparameter settings [41,42,44].
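The core TPE selection rule can be illustrated in a few lines (a simplified numpy/scipy sketch under assumed settings: a one-dimensional search space, a toy objective, and a fixed good/bad quantile γ; this is not the Adaptive TPE implementation used in the experiments):

```python
# Sketch of the TPE selection rule: split observed configurations into
# "good" and "bad" groups by a quantile of the loss, fit two densities
# l(x) and g(x), and propose the candidate maximizing l(x)/g(x).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def objective(x):
    # Hypothetical 1-D validation loss with a minimum near x = 2.
    return (x - 2.0) ** 2

# History of evaluated hyperparameter values and their losses.
xs = rng.uniform(-5, 5, size=50)
ys = np.array([objective(x) for x in xs])

gamma = 0.25                              # fraction treated as "good"
cut = np.quantile(ys, gamma)
l = gaussian_kde(xs[ys <= cut])           # density over good configs
g = gaussian_kde(xs[ys > cut])            # density over bad configs

candidates = rng.uniform(-5, 5, size=200)
next_x = candidates[np.argmax(l(candidates) / g(candidates))]
print(round(float(next_x), 2))            # proposal concentrates near the minimum
```
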
As mentioned, TPE aims to maximize the conditional probability of hyperparameters given the performance metric y, as presented in Equation (20): TPE: arg max_x p(x | y), with p(x | y) = p(y | x) p(x)/p(y), where x represents hyperparameter configurations, p(x | y) represents the conditional probability of hyperparameters given the performance metric, p(y | x) represents the conditional probability of the performance metric given the hyperparameters, p(x) represents the prior probability of hyperparameters, and p(y) represents the marginal probability of the performance metric. The Adaptive TPE algorithm follows the key steps presented in Equation (21). In it, the Gaussian Process (GP), as presented in Equation (22), models the underlying function f(x) mapping hyperparameters x to the performance metric y, considering Gaussian noise ε. The expected improvement (EI) acquisition function, as presented in Equation (23), measures the potential improvement over the current best performance metric f(x_min) for a given hyperparameter configuration x: EI(x) = E[max(f(x_min) − f(x), 0)].

Adaptive TPE: Initialization → Evaluation → Modeling → Selection → Update → Iteration (Equation (21)).
Gaussian Process: y = f(x) + ε, with ε Gaussian noise (Equation (22)). (5) k-Fold Cross-Validation helps in identifying overfitting, which occurs when a model performs well on the training set but fails to generalize to new, unseen data. By evaluating the model's performance on multiple validation sets, it provides insights into the model's generalization ability and potential overfitting issues [45,46]. k-Fold Cross-Validation can be expressed as presented in Equation (24): CV = (1/k) Σ_{i=1}^{k} Performance(M_i), where k is the number of folds, M_i is the model trained on the ith fold, and Performance(M_i) is the performance of the model on the validation set.
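In practice, Equation (24) amounts to averaging per-fold scores, e.g., with scikit-learn (the classifier and k = 5 are illustrative assumptions, not the study's setup):

```python
# Minimal k-fold cross-validation sketch: train/validate k times and
# average the per-fold accuracies, as in Equation (24).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())   # k per-fold scores and their average
```
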
Performance metrics are essential tools used to evaluate the effectiveness and quality of machine learning models. They provide quantitative measures to assess how well a model performs on classification or prediction tasks. In this section, a background on several commonly used performance metrics is provided, including accuracy, recall, precision, F1 score, specificity, balanced accuracy, and the receiver operating characteristic (ROC) curve. Accuracy, as presented in Equation (25), is a widely used performance metric that measures the proportion of correctly predicted instances out of the total number of instances. It provides a general overview of how well a model performs across all classes. However, accuracy may not be suitable for imbalanced datasets, where the majority class dominates the performance evaluation. Recall (Sensitivity or True Positive Rate), as presented in Equation (26), measures the proportion of correctly predicted positive instances (true positives) out of all actual positive instances. It focuses on identifying as many positive instances as possible and is particularly useful when the goal is to minimize false negatives.
In medical diagnostics or fraud detection, recall is crucial to ensure the identification of all relevant cases, even at the cost of higher false positives [47,48]. Precision, as presented in Equation (27), measures the proportion of correctly predicted positive instances (true positives) out of all predicted positive instances (true positives and false positives). It focuses on the accuracy of positive predictions and is particularly useful when minimizing false positives is crucial. Precision is important in scenarios where false positives have significant consequences, such as in spam email filtering or legal systems. The F1 score, as presented in Equation (28), combines precision and recall into a single metric, providing a balanced measure of a model's performance. It is the harmonic mean of precision and recall, offering a single value that represents both metrics. The F1 score is suitable when there is an imbalance between precision and recall, and a balance between the two is desired. Specificity (True Negative Rate), as presented in Equation (29), measures the proportion of correctly predicted negative instances (true negatives) out of all actual negative instances. It is the complement of the false positive rate (FPR) and provides a measure of how well a model identifies negative instances. Specificity is particularly relevant when minimizing false positives is critical, such as in medical testing or manufacturing quality control [46,48]. Balanced Accuracy, as presented in Equation (30), takes into account the proportion of correctly predicted instances for each class, providing an overall measure of model performance that accounts for class imbalance. It calculates the average of sensitivity (recall) across all classes. Balanced accuracy is valuable when there are significant differences in the number of instances among different classes. The ROC (receiver operating characteristic) curve is a graphical representation of the trade-off between the true positive rate (i.e., sensitivity) and the false positive rate (i.e., 1 − specificity) for different classification thresholds. It helps evaluate the performance of a model across various thresholds and provides a visual tool to compare different models [46,47].
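The metrics above map directly onto scikit-learn functions (toy labels for illustration only):

```python
# Compute the metrics of Equations (25)-(30) plus ROC AUC on toy labels.
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, balanced_accuracy_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]  # scores for the ROC curve

print(accuracy_score(y_true, y_pred))               # Equation (25): 0.875
print(recall_score(y_true, y_pred))                 # Equation (26): 0.75
print(precision_score(y_true, y_pred))              # Equation (27): 1.0
print(f1_score(y_true, y_pred))                     # Equation (28)
# Specificity (Equation (29)) is the recall of the negative class.
print(recall_score(y_true, y_pred, pos_label=0))    # 1.0
print(balanced_accuracy_score(y_true, y_pred))      # Equation (30): 0.875
print(roc_auc_score(y_true, y_prob))                # area under the ROC curve
```
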

Model Explainability Using LIME
Model explainability is a crucial aspect of machine learning, particularly in domains where decisions have significant implications, such as healthcare, finance, and legal systems. While complex machine learning models, such as deep neural networks, often deliver high predictive accuracy, they lack interpretability, making it challenging to understand how they arrive at their predictions. Local Interpretable Model-agnostic Explanations (LIME) is a popular technique that addresses this issue by providing post hoc explanations for black-box models. LIME aims to explain the predictions of any machine learning model by approximating its behavior locally. The key idea behind LIME is to create interpretable models, such as linear models or decision trees, that are locally faithful to the predictions of the black-box model. By explaining the model's predictions in a human-understandable manner, LIME helps users comprehend and trust the decisions made by the machine learning model. The advantages of LIME include its model-agnostic nature, as it can be applied to any black-box model without requiring knowledge of its internal workings. LIME also provides interpretable explanations at the local level, which can enhance trust and understanding of the model's decisions. Additionally, LIME can handle various types of data, including text, images, and structured data [49-51].
The LIME process involves the following steps. Selection of Instances: Initially, a set of instances or data points for which explanations are required is selected. These instances represent the inputs for which the model's predictions need to be explained. Perturbation: For each selected instance, LIME generates perturbed versions by randomly sampling data points near the original instance while preserving the important features. The perturbed instances are created to assess the model's behavior in the local neighborhood of the original instance. Model Prediction: The black-box model's predictions are obtained for the perturbed instances, capturing the output of the model within the local neighborhood of the original instance. Feature Selection: LIME identifies the important features for the selected instance by employing a technique such as sparse linear regression or decision tree induction. These features play a significant role in the model's predictions within the local context. Weights and Explanations: LIME assigns weights to the perturbed instances based on their proximity to the original instance and uses these weights to learn an interpretable model. This interpretable model approximates the behavior of the black-box model in the local neighborhood, providing explanations for the original instance's prediction. Explanation Generation: Finally, LIME generates explanations by highlighting the important features and their contributions to the prediction. These can be visualized as feature importance scores or rules that indicate the influence of each feature on the model's output [50,52]. The LIME process can be summarized mathematically as follows. Step 1: Model Explainability: Understanding the decisions made by a complex machine learning model f_ML(x) is critical in domains like healthcare and finance.
Step 2: LIME Approximation: f_LIME(x') ≈ f_ML(x), where f_LIME is an interpretable model and x' is a local perturbation. The LIME objective is to approximate the black-box model f_ML(x) with an interpretable model f_LIME(x') by minimizing the loss between them (Equation (31)): min_{f_LIME} L(f_ML, f_LIME, π_x), where π_x weights perturbed samples by their proximity to x. Step 3: LIME Process: - Selection of Instances: D = {x_1, x_2, …, x_n}, where the x_i are the instances to be explained. - Perturbation: x'_i = x_i + ε_i, where ε_i is a small perturbation. - Model Prediction: f_ML(x'_i), predicting using the black-box model. - Feature Selection: Identify significant features of x_i = [x_i1, x_i2, …, x_im] from the perturbed instances using a regression model, such as Lasso regression (Equation (32)): min_w Σ_i (f_ML(x'_i) − w · x'_i)² + λ Σ_j |w_j|. - Weights and Explanations: Learn weights w_j for the features and create the interpretable model (Equation (33)): f_LIME(x') = w_0 + Σ_j w_j x'_j.
- Explanation Generation: Generate explanations such as feature importance scores or rules. - Feature Importance Scores (Explanation Generation): Calculate feature importance scores based on the absolute values of the feature weights (Equation (34)).
Feature Importance_j = |w_j|. - Rule Extraction (Explanation Generation): Extract rules from the interpretable model. For example, "If x'_1 is significantly different from x_1, it strongly influences the prediction".
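The steps above can be condensed into a from-scratch sketch (a simplified illustration of the LIME procedure, not the lime library's implementation; the perturbation scale and kernel bandwidth are assumed values):

```python
# Perturb an instance, query the black-box model, weight samples by
# proximity, and fit a weighted linear surrogate whose coefficients
# give the feature importances of Equation (34).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)  # f_ML

x0 = X[0]                                    # instance to explain
# Perturbation: x'_i = x0 + eps_i
Xp = x0 + rng.normal(scale=0.5, size=(1000, x0.size))
# Model Prediction: positive-class probabilities from the black box
yp = black_box.predict_proba(Xp)[:, 1]
# Kernel weights: closer perturbations count more (Gaussian kernel)
d = np.linalg.norm(Xp - x0, axis=1)
w = np.exp(-(d ** 2) / (2 * 0.75 ** 2))
# Weights and Explanations: weighted linear surrogate f_LIME
surrogate = Ridge(alpha=1.0).fit(Xp, yp, sample_weight=w)
importance = np.abs(surrogate.coef_)         # Equation (34)
print(np.argsort(importance)[::-1])          # features ranked by influence
```
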
As noted, LIME often uses a kernel function to assign weights to perturbed samples based on their proximity to the original instance. The choice of kernel function (e.g., a Gaussian kernel) and the bandwidth parameter affect the weighting mathematically. The kernel weighting formula typically looks like Equation (35): w_i = exp(−d(x, x_i)²/σ²), where the weight w_i is assigned to a perturbed sample x_i based on its distance to the original instance x and the bandwidth parameter σ. d(x, x_i) represents the distance between the original instance x and the perturbed sample x_i [53].
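Equation (35) can be written as a small helper (the Gaussian form and the bandwidth value are illustrative assumptions; the lime library uses a closely related exponential kernel):

```python
# Gaussian proximity kernel: weight decays with distance to the
# original instance, controlled by the bandwidth sigma.
import math

def kernel_weight(d, sigma=0.75):
    """w_i = exp(-d(x, x_i)^2 / sigma^2) for a precomputed distance d."""
    return math.exp(-(d ** 2) / (sigma ** 2))

print(kernel_weight(0.0))   # identical sample -> weight 1.0
print(kernel_weight(2.0))   # distant sample -> weight near 0
```
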
The computational complexity of LIME can be analyzed by considering the number of operations required for its various steps. Firstly, in the step of generating perturbed samples, LIME creates these samples for each of the n instances. For each instance, small random values are added to the features, resulting in k perturbed samples per instance. This operation's complexity is O(n × m × k), where n is the number of instances, m is the dimensionality of the feature space, and k is the number of perturbed samples per instance.
Next, when evaluating the black-box model for each perturbed sample, the computational cost depends on the complexity of the black-box model itself. If the evaluation of the black-box model has a complexity of O(f), then the total complexity for this step is O(n × k × f). Moving on to feature selection and the creation of an interpretable model, techniques like Lasso regression are employed, involving optimization problems. The complexity of solving these optimization problems depends on factors like the number of iterations (t) and the number of features selected (s). Training the interpretable model, typically a linear regression model, has a complexity of O(n × s × m), where n is the number of instances.
Finally, generating explanations, such as feature importance scores, is typically a linear-time operation with respect to the number of features. The complexity of generating explanations is O(s × m). The overall computational complexity of LIME can be approximated as formulated in Equation (36).
The most significant factors affecting complexity include the number of instances (n), the dimensionality of the feature space (m), the number of perturbed samples per instance (k), the complexity of the black-box model evaluation (f), the number of iterations in feature selection (t), and the number of selected features (s) [54].
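As a rough bookkeeping aid, the terms described above for Equation (36) can be tallied symbolically (the input values below are illustrative assumptions, not measured costs):

```python
# Operation-count model following the per-step complexities above:
# perturbation, black-box prediction, feature selection, surrogate
# training, and explanation generation.
def lime_complexity(n, m, k, f, t, s):
    perturb = n * m * k      # O(n x m x k): generating perturbed samples
    predict = n * k * f      # O(n x k x f): black-box evaluations
    select = t * s           # feature-selection iterations
    surrogate = n * s * m    # O(n x s x m): fitting the interpretable model
    explain = s * m          # O(s x m): producing importance scores
    return perturb + predict + select + surrogate + explain

# e.g., 100 instances, 20 features, 500 perturbations per instance
print(lime_complexity(n=100, m=20, k=500, f=1000, t=100, s=5))  # 51010600
```

The dominant term here is the black-box prediction cost, which matches the observation that f is a key driver of LIME's runtime.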

Experiments and Discussions
The present study utilized the Python 3.11.4 programming language to conduct experiments, employing various packages, including scikit-learn 0.24.2 and LIME 0.2.0.1. The experiments were executed on a device equipped with 128 GB of RAM, a 4 GB NVIDIA graphics card, and the Windows 11 operating system.
Among the classifiers, AdaBoost achieved an ACC of 87.54%. It demonstrated good SNS (88.27%) and SPC (86.95%), indicating its ability to correctly identify both positive and negative instances. However, its PRC (84.42%) was slightly lower compared with the other classifiers. The F1 score (86.31%) and ROC (87.61%) were reasonably high, reflecting a balanced performance. The BAC was 87.61%, and the WSM performance was 86.96%. AdaBoost utilized L1 regularization, and its hyperparameters included a logistic-regression parameter of approximately 0.87672 and 26 estimators. The selected features were X1, X3, X5, X8, X9, X11, and X12.

Overall Discussion and Explainability
Table 6 presents the performance of the best classifiers on three different datasets after conducting 100 trials. The results are reported in terms of mean values and their corresponding standard deviations. The first metric, accuracy, displays the mean accuracy values followed by their standard deviations for the datasets. For example, the mean accuracy value for the Australian Credit dataset is 87.57, with a standard deviation of 0.45. The subsequent rows similarly provide the mean values and standard deviations of the remaining metrics for the corresponding datasets. For the "Statlog (Australian Credit Approval)" dataset, the selected features were X4, X5, X8, X9, X10, X11, and X13, utilizing the L2 scaler and the RF classifier (with 90 estimators, a max depth of 2, and the Gini criterion). For the "South German Credit" dataset, the selected features were laufkont, laufzeit, moral, hoehe, beszeit, buerge, verm, pers, and telef, utilizing the STD scaler and the MLP classifier (with ReLU activation, the Adam optimizer, 272 hidden-layer neurons, and a constant learning rate). For the "Statlog (German Credit Data)" dataset, the selected features were C1, C2, C3, C5, C6, C7, C9, C10, C11, C12, C13, C17, and C20, utilizing the STD scaler and the LGBM classifier (with 81 estimators, a max depth of 9, and a learning rate of ≈0.15232).
In the "Statlog (Australian Credit Approval)" dataset, Figure 2 provides the LIME explanation for the model's positive decision (i.e., Yes) with 97% confidence regarding a testing instance. The figure demonstrates that this decision was primarily influenced by the high confidence value of X8, which was 1 (within the range of −1 to 1). In the "South German Credit" dataset, Figure 3 provides the LIME explanation for the model's positive decision (i.e., Yes) with 75% confidence regarding a testing instance. The figure demonstrates that this decision was primarily influenced by the high confidence values of laufkont, which was 4 (within the range of 2 to 4), and buerge, which was 1 (less than or equal to 1). In the "Statlog (German Credit Data)" dataset, Figure 4 provides the LIME explanation for the model's negative decision (i.e., No) with 96% confidence regarding a testing instance. The figure demonstrates that this decision was primarily influenced by the high confidence value of C1, which was 3 (within the range of 1 to 3).

Related Studies Comparison
Table 7 shows a comparison between the suggested approach and related studies on the same datasets. It can be observed from the literature that there are various research works on credit scoring. For "Statlog (Australian Credit Approval)" [21], the current study achieved an accuracy of 88.84%. This result is competitive with the highest accuracy reported for this dataset, 91.91%, by Kazemi et al. [15]. In the case of "Statlog (German Credit Data)" [22], the current study achieved an accuracy of 78.30%. While this accuracy is an improvement over some earlier studies, it falls short of the highest accuracy reported for this dataset, 88.89%, by Gicić et al. [14]. For "South German Credit" [23], this study achieved an accuracy of 77.80%. In comparison, Khan and Ghosh [17] reported a slightly higher accuracy of 80.50%. In summary, the current work demonstrates consistent and competitive performance across all three datasets. The credit scoring model developed in this study appears to be effective in predicting credit risk for various datasets, with accuracy values ranging from 77.80% to 88.84%.
While some related works surpass the performance of the current study on individual datasets, our approach demonstrates superior performance on others. For instance, although Kazemi et al. [15] achieved an impressive 91.91% accuracy on the "Statlog (Australian Credit Approval)" [21] dataset, our model achieves an accuracy of 78.30% on the "Statlog (German Credit Data)" dataset, surpassing Kazemi et al.'s result on that dataset (accuracy: 67.49%). This underscores the adaptability and competitiveness of our credit scoring model, showcasing its ability to outperform others in various scenarios across diverse datasets. It is worth noting that relative performance depends on the specific dataset and the benchmarks set by each work.

Conclusions and Future Work
Credit scoring models have evolved significantly, transitioning from traditional manual processes to sophisticated machine learning techniques. This transformation has been catalyzed by the availability of vast datasets and advancements in data analytics. Contemporary models now leverage intricate algorithms and diverse variables, ensuring more accurate credit risk assessments. These advancements not only shape access to credit but also bolster the stability of the financial system. This research has delved deep into the mathematical underpinnings of machine learning techniques in credit scoring. Central

Figure 1. Graphical presentation of the suggested framework in the current study.

4.4. Cross-Validation and Evaluation Metrics
k-Fold Cross-Validation is a widely used technique in machine learning and model evaluation. It provides a robust and unbiased estimate of a model's performance by partitioning the available data into k subsets or folds. The process involves training and testing the model k times, each time using a different fold as the validation set and the remaining folds as the training set. Its major benefits are as follows: (1) k-Fold Cross-Validation provides a more reliable estimate of a model's performance compared with a single train-test split. By using multiple validation sets and averaging the results, it reduces the potential bias and variability that can arise from a particular data split. (2) k-Fold Cross-Validation makes efficient use of available data by utilizing all instances in both the training and validation phases. This maximizes the amount of information used for model training and evaluation, resulting in more robust and accurate performance estimates. (3) k-Fold Cross-Validation is commonly used in model selection and hyperparameter tuning. It allows for comparing different models or different hyperparameter settings by evaluating their performance across multiple iterations and providing a fair comparison. (4) k-Fold Cross-Validation is beneficial when dealing with imbalanced datasets, where the distribution of classes is uneven. It ensures that each fold contains a representative distribution of instances, reducing the potential for biased evaluation due to class imbalance.

Figure 2. LIME explanation of a decision taken for a testing instance from the "Statlog (Australian Credit Approval)" dataset.

Figure 3. LIME explanation of a decision taken for a testing instance from the "South German Credit" dataset.

Figure 4. LIME explanation of a decision taken for a testing instance from the "Statlog (German Credit Data)" dataset.
wohnzeit: Length of time (in years) the debtor has lived in the current residence
verm: The debtor's most valuable property
alter: Age in years
weitkred: Installment plans from providers other than the issuing bank
wohn: Type of housing the debtor resides in
bishkred: Number of credits, including the current one, the debtor has (or had) with this bank
beruf: Quality of the debtor's job
pers: Number of individuals financially dependent on the debtor
telef: Presence of a landline telephone registered under the debtor's name
gastarb: Whether the debtor is a foreign worker
kredit: Compliance status of the credit contract (good or bad)

Table 4. Performance report using the "South German Credit" dataset.

Table 6. Performance of the best classifiers on the different datasets after running 100 trials. The results are reported as mean (standard deviation).

Table 7. Comparison between the current study and the related studies.