5.1. Hyperparameter Optimization
This study evaluated the effectiveness of four hyperparameter tuning methods, Grid Search (GS), Random Search (RS), Hyperopt, and Optuna, in optimizing the performance of four machine learning models: Logistic Regression (LR), Random Forest (RF), XGBoost, and LightGBM. The goal was to assess their impact on predictive accuracy and computational efficiency in credit risk modeling for peer-to-peer lending, using three real-world datasets: Lending Club, Australia, and Taiwan.
To ensure a fair comparison of the four hyperparameter optimization methods, Grid Search, Random Search, Hyperopt, and Optuna, the search spaces for tunable hyperparameters were consistently defined across all methods for each model (Logistic Regression, Random Forest, XGBoost, and LightGBM), as detailed in
Table 2. Key hyperparameters critical to model performance, such as regularization strength for Logistic Regression, tree complexity and ensemble size for Random Forest, and learning rate and regularization parameters for XGBoost and LightGBM, were selected based on their impact on accuracy and generalization, as informed by prior studies [
13]. The ranges of values for these hyperparameters were carefully chosen to span wide yet practical intervals, balancing model complexity and computational feasibility, with discrete sets for GS and equivalent continuous or discrete distributions for RS, Hyperopt, and Optuna to ensure comparable exploration of the hyperparameter space.
Fixed hyperparameters were set to default values (
Table 3) to maintain uniformity across experiments. All methods used three-fold cross-validation on the training set, with RS, Hyperopt, and Optuna performing 50 iterations each, and GS exhaustively evaluating all combinations. This standardized approach ensured that performance differences arose solely from the optimization strategies, enabling a robust and equitable assessment of their impact on predictive performance and computational efficiency in credit risk modeling.
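To make this shared protocol concrete, the sketch below (in Python) expresses one illustrative LightGBM search space, not the exact Table 2 ranges, as a discrete grid for GS and as equivalent distributions for RS, Hyperopt, and Optuna, each scored with three-fold cross-validated AUC and capped at 50 evaluations for the non-exhaustive methods.

```python
# Sketch of the shared tuning protocol: one search space per model, written as a
# discrete grid for Grid Search and as equivalent distributions for the others.
# The LightGBM ranges below are illustrative, not the exact values of Table 2.
import numpy as np
import optuna
from hyperopt import fmin, hp, tpe
from lightgbm import LGBMClassifier
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

def cv_auc(params, X, y):
    """Three-fold cross-validated AUC on the training set."""
    return cross_val_score(LGBMClassifier(**params), X, y, scoring="roc_auc", cv=3).mean()

# Grid Search: exhaustive over the discrete grid.
gs = GridSearchCV(LGBMClassifier(),
                  {"learning_rate": [0.01, 0.05, 0.1], "num_leaves": [7, 15, 31]},
                  scoring="roc_auc", cv=3)

# Random Search: 50 draws from matching continuous/discrete distributions.
rs = RandomizedSearchCV(LGBMClassifier(),
                        {"learning_rate": loguniform(0.01, 0.1), "num_leaves": [7, 15, 31]},
                        n_iter=50, scoring="roc_auc", cv=3)

# Hyperopt: TPE over the same space, 50 evaluations (fmin minimizes, so negate AUC).
def tune_hyperopt(X, y):
    space = {"learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.1)),
             "num_leaves": hp.choice("num_leaves", [7, 15, 31])}
    return fmin(fn=lambda p: -cv_auc(p, X, y), space=space, algo=tpe.suggest, max_evals=50)

# Optuna: TPE sampler by default, 50 trials.
def tune_optuna(X, y):
    def objective(trial):
        params = {"learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
                  "num_leaves": trial.suggest_categorical("num_leaves", [7, 15, 31])}
        return cv_auc(params, X, y)
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    return study.best_params
```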
Table 4 presents the optimal hyperparameter configurations identified by each tuning method across the three datasets. The results revealed that different tuning strategies often converged on distinct hyperparameter sets, reflecting their unique search mechanisms. Grid Search exhaustively evaluates all combinations within a predefined grid, ensuring the identification of the global optimum within the specified space but at a high computational cost. Random Search, by contrast, samples configurations stochastically, offering greater efficiency in high-dimensional spaces but potentially missing optimal regions due to its random nature. Bayesian optimization methods, Hyperopt and Optuna, leverage probabilistic models (e.g., tree-structured Parzen estimators) to adaptively focus on promising regions of the search space, balancing exploration and exploitation while incorporating pruning mechanisms to terminate unpromising trials early.
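The pruning behavior mentioned above can be illustrated with Optuna's public API, as in this minimal sketch (the objective is a stand-in, not the paper's actual training loop):

```python
# Sketch of Optuna's early pruning: intermediate scores are reported per step,
# and unpromising trials are stopped by the pruner. Illustrative objective only.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 0.01, 0.1, log=True)
    score = 0.5
    for step in range(10):           # e.g., boosting-round checkpoints
        score += lr * 0.1            # stand-in for a real validation AUC
        trial.report(score, step)    # report the intermediate value
        if trial.should_prune():     # pruner decides to terminate early
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(direction="maximize",
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=2))
study.optimize(objective, n_trials=20)
```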
For the Lending Club dataset, which is the largest and most complex, substantial variability in optimal hyperparameter values was observed across tuning methods. For Logistic Regression, the inverse-regularization parameter C varied widely, ranging from 1 (Grid Search) to 23.36 (Random Search). Random Forest exhibited lower sensitivity to hyperparameter tuning, with consistent selections of max_depth (10) and n_estimators (200) across methods, suggesting a robust configuration space with multiple near-optimal solutions. In contrast, XGBoost and LightGBM displayed greater sensitivity, particularly in parameters such as reg_lambda (values up to 10 for LightGBM) and num_leaves (7 to 31 for LightGBM), indicating a more complex hyperparameter landscape with multiple high-performing regions.
Similar trends were observed in the Australia dataset, with notable variability in max_depth and n_estimators for tree-based models. The Taiwan dataset, however, showed greater stability across tuning methods, with variations primarily in reg_lambda and max_depth for XGBoost and LightGBM. This stability suggests a narrower optimal region, increasing the risk of overfitting or underfitting if hyperparameters are not carefully tuned.
5.2. Performance Analysis
This section evaluates the predictive performance and computational efficiency of four hyperparameter tuning methods, Grid Search (GS), Random Search (RS), Hyperopt, and Optuna, applied to Logistic Regression, Random Forest, XGBoost, and LightGBM across three real-world datasets: Lending Club, Australia, and Taiwan. Performance was assessed using the Area Under the ROC Curve (AUC), Sensitivity, Specificity, and Geometric Mean (G-Mean), which are suitable for imbalanced credit risk datasets (
Section 4.3). We also compared tuned models against baselines with default hyperparameters (No HPO) and analyzed computational time to highlight the trade-offs between accuracy and efficiency, as detailed in
Table 5.
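For reference, the following minimal sketch shows how these four metrics can be computed from test-set labels and predicted default probabilities; the 0.5 decision threshold is an assumption for illustration.

```python
# Sketch of the evaluation metrics reported in Table 5: AUC from predicted
# probabilities, and Sensitivity/Specificity/G-Mean from the confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate (defaults caught)
    specificity = tn / (tn + fp)   # true negative rate (non-defaults kept)
    return {"AUC": roc_auc_score(y_true, y_prob),
            "Sensitivity": sensitivity,
            "Specificity": specificity,
            "G-Mean": np.sqrt(sensitivity * specificity)}
```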
For the Lending Club dataset, LightGBM and XGBoost achieved the highest AUC values (70.77% for both with GS, 70.74% for XGBoost with Optuna), clearly outperforming the No HPO baselines (68.56% for XGBoost, 70.38% for LightGBM). Random Forest and Logistic Regression showed lower AUCs (70.21% for Random Forest across GS, Hyperopt, and Optuna; 70.12% for Logistic Regression with RS), with No HPO results trailing by 1–2% (69.17% for Random Forest, 67.76% for Logistic Regression). LightGBM and XGBoost balanced Sensitivity (67–68%) and Specificity (61–62%), yielding G-Mean values around 65%, compared to No HPO G-Mean values of 63.53% (XGBoost) and 64.61% (LightGBM). Logistic Regression favored Specificity (up to 64.88% with RS) over Sensitivity (around 65%), while Random Forest prioritized Sensitivity (up to 69.13% with Hyperopt) at the cost of Specificity (around 60%). These results highlight the superiority of gradient-boosting models in handling class imbalance, consistent with Lessmann et al. [
12]. Computationally, GS was the most time-intensive (e.g., 241.47 min for LightGBM, 132.40 min for XGBoost), while Hyperopt was the most efficient, reducing runtime by up to 75.7-fold (e.g., 3.19 min for LightGBM). RS and Optuna also outperformed GS, with runtimes of 5.39 and 6.12 min for LightGBM, respectively, making them viable for large datasets.
In the Australia dataset, the smaller dataset size led to higher AUCs (91.63–93.61%) across all models, with XGBoost tuned by RS achieving the best performance (93.61%), compared to 92.42% without HPO. LightGBM followed closely (93.25% with GS), outperforming its No HPO baseline (92.44%). Random Forest and Logistic Regression had AUCs of 92.99% (RS) and 91.95% (RS), respectively, improving on No HPO results (92.11% for Random Forest, 91.63% for Logistic Regression). Sensitivity was high (89–95%), but Specificity was lower (71–79%), reflecting moderate class imbalance (
Table 1). G-Mean values (81.66–84.58%) confirmed robust class balance, particularly for XGBoost and LightGBM, compared to No HPO G-Mean values (e.g., 81.66% for Logistic Regression). GS required minimal time (e.g., 0.01 min for Logistic Regression, 7.66 min for LightGBM), but Hyperopt and RS were faster for complex models (e.g., 0.21 min for LightGBM with Hyperopt vs. 7.66 min with GS), demonstrating efficiency in smaller datasets.
The Taiwan dataset showed similar patterns, with LightGBM achieving the highest AUC (77.85% with Optuna), followed by XGBoost (77.78% with RS), both surpassing No HPO baselines (76.91% for LightGBM, 75.29% for XGBoost). Random Forest and Logistic Regression had lower AUCs (77.56% for Random Forest with RS, 70.69% for Logistic Regression with GS), with No HPO results notably weaker (75.86% for Random Forest). Sensitivity ranged from 59.15% (LightGBM, Hyperopt) to 63.45% (LightGBM, GS), while Specificity was higher (up to 81.42% for Random Forest, Hyperopt), reflecting the dataset’s significant class imbalance. G-Mean values (69.37–70.99%) underscored LightGBM’s balanced performance, improving on No HPO’s G-Mean values (e.g., 69.45% for Random Forest, 69.89% for LightGBM). GS was computationally expensive (e.g., 25.41 min for LightGBM), while Hyperopt and RS reduced runtimes significantly (e.g., 0.68 and 0.58 min for LightGBM), with Optuna close behind (0.70 min).
The results align with and extend findings from prior studies, as reported in
Appendix A (
Table A1,
Table A2 and
Table A3). Ko et al. [
38] reported a LightGBM AUC of 74.92% on the Lending Club dataset, higher than our 70.77%, likely due to differences in data preprocessing or feature sets (
Table A1). However, our tuned models achieved more balanced Sensitivity (67–68%) and Specificity (61–62%) compared to their 65.66% and 71.47%, respectively, reflecting improved handling of class imbalance through HPO and random undersampling. Song et al. [
15] reported lower AUCs (e.g., 62.07% for Random Forest, 61.40% for GBDT) on a similar dataset, with G-Mean values (61.93% for Random Forest, 61.38% for GBDT) below our 64.45–65.19% (
Table A2), underscoring the impact of our HPO strategies. Xia et al. [
22] achieved a higher XGBoost AUC (67.08% with RS) on a different credit dataset, but our LightGBM AUC of 77.85% on the Taiwan dataset surpasses their 66.97% with TPE, highlighting dataset-specific advantages and the efficacy of Optuna (
Table A3). These comparisons confirm that our HPO methods, particularly Bayesian approaches, enhance predictive performance over prior benchmarks, especially for imbalanced datasets.
Tuned models consistently outperformed No HPO baselines across all datasets, with AUC improvements of 1–3%, emphasizing the critical role of HPO in credit risk modeling. LightGBM and XGBoost excelled due to their robust handling of imbalanced data, as noted by Ko et al. [
38]. Bayesian methods (Hyperopt, Optuna) matched or slightly outperformed GS in AUC while drastically reducing computational time, with Hyperopt being the most efficient. RS offered a simpler yet effective alternative, particularly for high-dimensional datasets. Statistical tests (paired t-tests, α = 0.05) confirmed no significant AUC differences between tuning methods for a given model, suggesting multiple near-optimal configurations. These findings advocate the use of Bayesian methods or RS in P2P lending platforms, where computational efficiency is crucial for scalability, and highlight the necessity of HPO for achieving robust predictive performance.
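A paired comparison of this kind can be sketched as follows; the per-fold AUC values are hypothetical placeholders rather than numbers from Table 5.

```python
# Sketch of the paired t-test comparing two tuning methods on the same model:
# AUCs are paired across cross-validation folds; alpha = 0.05.
from scipy.stats import ttest_rel

auc_hyperopt = [0.7021, 0.7018, 0.7024]    # per-fold AUCs, method A (illustrative)
auc_gridsearch = [0.7019, 0.7020, 0.7025]  # per-fold AUCs, method B (illustrative)

t_stat, p_value = ttest_rel(auc_hyperopt, auc_gridsearch)
if p_value > 0.05:
    print(f"No significant AUC difference (t={t_stat:.3f}, p={p_value:.3f})")
```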
5.3. Sensitivity Analysis
To evaluate the robustness of the hyperparameter configurations identified by Grid Search (GS), Random Search (RS), Hyperopt, and Optuna, a sensitivity analysis was conducted on key hyperparameters for Logistic Regression, Random Forest, XGBoost, and LightGBM using the Lending Club dataset. This analysis assessed the impact of small perturbations (±10%) in the optimal hyperparameter values on model performance, with the Area Under the Receiver Operating Characteristic Curve (AUC) as the primary metric for credit risk prediction, as described in
Section 4.3. The objective was to quantify the stability of hyperparameter configurations and determine whether minor changes significantly affected predictive performance, ensuring the reliability of tuned models for peer-to-peer (P2P) lending applications [
52].
The sensitivity analysis followed a structured procedure. Initially, for each model (Logistic Regression, Random Forest, XGBoost, LightGBM) and tuning method (GS, RS, Hyperopt, Optuna), the optimal hyperparameter set was selected from
Table 4 (
Section 5.1). Key hyperparameters were chosen based on their influence on model performance, as supported by prior studies [
13,
52]. Specifically, C was analyzed for Logistic Regression; n_estimators and max_depth for Random Forest; learning_rate for XGBoost; and learning_rate and num_leaves for LightGBM, as reported in Table 6. Each selected hyperparameter was perturbed by ±10% from its optimal value, one at a time, while all other hyperparameters were held at their optimal settings. For continuous parameters, such as learning_rate, the perturbed value was computed as the optimal value multiplied by (1 ± 0.10). For discrete parameters, such as max_depth and num_leaves, the perturbed value was rounded to the nearest integer; if rounding resulted in no change, the perturbation was adjusted to the next valid integer to ensure a distinct value.
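This perturbation rule can be summarized in a short helper; the sketch below reproduces the ±10% scaling for continuous parameters and the rounding adjustment for discrete ones, using values that echo those reported later in this section.

```python
# Sketch of the ±10% perturbation rule: continuous hyperparameters are scaled by
# (1 ± 0.10); discrete ones are rounded, then nudged if rounding changes nothing.
def perturb(value, delta=0.10, discrete=False):
    perturbed = []
    for sign in (+1, -1):
        v = value * (1 + sign * delta)
        if discrete:
            v = round(v)
            if v == value:          # rounding produced no change:
                v = value + sign    # move to the next valid integer
        perturbed.append(v)
    return perturbed                # [plus-direction, minus-direction]

print(perturb(23.36))               # C for Logistic Regression -> ~[25.696, 21.024]
print(perturb(7, discrete=True))    # num_leaves 7 -> [8, 6]
print(perturb(10, discrete=True))   # max_depth 10 -> [11, 9]
```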
For each perturbation direction (+10% and −10%), the model was retrained on the training set of the Lending Club data (balanced via random undersampling, as outlined in Section 4.1) using the perturbed hyperparameter set. To account for stochasticity in model training, such as random undersampling and random initialization in tree-based models, five repeated runs were performed with different random seeds for each perturbation direction, resulting in ten runs in total (five for +10% and five for −10%). Each run involved training the model on the undersampled training set and evaluating it on the test set (which retained the original class distribution) to compute the AUC, $\mathrm{AUC}_i$. The ten AUC values (five per perturbation direction) were used to compute the change in AUC as $\Delta\mathrm{AUC}_i = \mathrm{AUC}_i - \mathrm{AUC}_{\mathrm{opt}}$, where $\mathrm{AUC}_{\mathrm{opt}}$ is the AUC of the model with the optimal hyperparameter set (Table 5). The mean AUC change was computed as $\overline{\Delta\mathrm{AUC}} = \frac{1}{10}\sum_{i=1}^{10}\Delta\mathrm{AUC}_i$, averaged across the ten $\Delta\mathrm{AUC}_i$ values (five per direction), and is reported in Table 6 as a percentage. Only mean AUC changes explicitly different from zero were considered in Table 6.
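The repeated-runs protocol can be sketched as follows, assuming the paper's random-undersampling setup; the model class and data variables are placeholders for the actual pipeline.

```python
# Sketch of the repeated-runs protocol: five seeds per perturbation direction,
# undersampling the training set each run, evaluating AUC on the fixed test set.
from imblearn.under_sampling import RandomUnderSampler
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

def perturbed_aucs(params_up, params_down, X_tr, y_tr, X_te, y_te, n_seeds=5):
    aucs = []
    for params in (params_up, params_down):  # +10% and -10% settings
        for seed in range(n_seeds):
            X_bal, y_bal = RandomUnderSampler(random_state=seed).fit_resample(X_tr, y_tr)
            model = LGBMClassifier(**params, random_state=seed).fit(X_bal, y_bal)
            aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return aucs                              # the ten AUC_i values
```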
A one-sample t-test was conducted to determine whether the mean $\Delta\mathrm{AUC}$ across the ten runs was significantly different from zero, indicating sensitivity to perturbations. The t-statistic was computed as $t = \overline{\Delta\mathrm{AUC}} \big/ (s/\sqrt{n})$, where $\overline{\Delta\mathrm{AUC}}$ is the mean $\Delta\mathrm{AUC}$, $s$ is the standard deviation of the ten $\Delta\mathrm{AUC}_i$ values, and $n = 10$ is the sample size. The null hypothesis $H_0$ assumed $\mu_{\Delta} = 0$, implying robustness to perturbations, while the alternative $H_1: \mu_{\Delta} \neq 0$ suggested sensitivity. Two-tailed p-values were calculated with a significance level of $\alpha = 0.05$. High p-values (>0.05) indicated robust configurations, while low p-values suggested sensitivity.
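Putting the last two steps together, the sketch below computes the ten ΔAUC values and the one-sample t-test from a set of perturbed-run AUCs; all numeric inputs are hypothetical placeholders.

```python
# Sketch of the robustness test: ten Delta-AUC values (five seeds per direction)
# tested against zero with a two-tailed one-sample t-test, alpha = 0.05.
import numpy as np
from scipy.stats import ttest_1samp

def sensitivity_test(auc_perturbed, auc_optimal, alpha=0.05):
    delta = np.asarray(auc_perturbed) - auc_optimal  # Delta-AUC_i, i = 1..10
    t_stat, p_value = ttest_1samp(delta, popmean=0.0)
    return {"mean_delta_pct": 100 * delta.mean(),
            "t": t_stat, "p": p_value,
            "robust": p_value > alpha}

# Hypothetical run: ten test-set AUCs under perturbed settings vs. a 0.7021 baseline.
runs = [0.7023, 0.7020, 0.7024, 0.7022, 0.7021, 0.7019, 0.7022, 0.7023, 0.7020, 0.7022]
print(sensitivity_test(runs, auc_optimal=0.7021))
```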
Table 6 presents the sensitivity analysis results for the Lending Club dataset, reporting the mean AUC change (as a percentage), t-statistic, and p-value for key hyperparameters across models and tuning methods. The results indicated that the hyperparameter configurations were generally robust, with minimal AUC variations and high p-values, suggesting that small perturbations (±10%) did not significantly impact predictive performance. For Logistic Regression, the C parameter (the inverse of regularization strength) exhibited small AUC changes. With Random Search (optimal C = 23.36), the mean AUC change was +0.00535% (from 70.12% to 70.13%), with a t-statistic of 5.095 and a p-value of 0.123, indicating robustness despite the run-to-run variability reflected in the high t-statistic. The large optimal C implied weak regularization, so the perturbed values (25.696 and 21.024) had minimal effect, as the model was already close to an unregularized state, consistent with Logistic Regression's limited ability to capture non-linear patterns in the Lending Club dataset [23]. For Hyperopt, the mean AUC change was +0.00320%, with a t-statistic of 2.909 and a p-value of 0.211, reflecting greater stability due to lower variability and stronger regularization, though the baseline AUC was slightly lower [12].
For Random Forest, the n_estimators parameter demonstrated high robustness across tuning methods. With Grid Search (n_estimators = 200), the mean AUC change was negligible, with a p-value of 0.780, reflecting minimal variability due to Random Forest's variance reduction through bagging [24]. Random Search (n_estimators = 200) showed the smallest change, with a t-statistic of 0.093, indicating exceptional stability, as additional trees beyond 200 yielded diminishing returns [13]. Optuna (n_estimators = 200) likewise had a small mean AUC change, with a p-value of 0.844, confirming robustness with slight variability. For max_depth, Grid Search (max_depth = 10) showed a modest mean AUC change, with a t-statistic of 0.420 and a p-value of 0.747, suggesting mild sensitivity, as deeper trees (11) slightly improved AUC, while the high p-value confirmed robustness. Random Search (max_depth = 10) exhibited the largest change among the tuning methods, indicating robustness despite a slightly less optimal baseline AUC. Hyperopt (max_depth = 10) showed a small change (+0.08480%, from 70.21% to 70.29%), suggesting greater stability due to the TPE optimization. Optuna (max_depth = 10) had a change from 70.21% to 70.36%, with a t-statistic of 0.453, indicating robustness with minor variability.
For XGBoost, the learning_rate parameter exhibited small AUC changes. With Grid Search, the mean AUC change was +0.01930% relative to the 70.77% baseline (Table 5), indicating robustness, with XGBoost's built-in regularization stabilizing performance [25]. Random Search had a near-zero change, reflecting exceptional stability in a flat region of the performance surface. Hyperopt showed a small decrease, indicating robustness despite a slight performance drop. Optuna had the largest change among the tuning methods, still suggesting robustness, with the minor variability attributable to slight overfitting or underfitting at the perturbed values.
For LightGBM, the learning_rate parameter likewise showed small changes. With Grid Search, the mean AUC change relative to the 70.77% baseline was minimal, indicating robustness despite some variability arising from LightGBM's leaf-wise tree growth [26]. Random Search showed a similarly small change, confirming robustness, and Optuna's change was also minor, indicating robustness with slight variability. For num_leaves (Grid Search, optimal value 7), the mean AUC change was negligible, confirming robustness, as the perturbations (to 8 or 6) had minimal impact due to LightGBM's regularization.
Overall, the sensitivity analysis demonstrated that the hyperparameter configurations for all models were robust to ±10% perturbations on the Lending Club dataset, with mean AUC changes well below one percentage point and high p-values (>0.05), indicating no statistically significant impact on performance. Random Forest exhibited the greatest stability, particularly for n_estimators, owing to its variance-reducing bagging approach. Logistic Regression and the gradient-boosting models (XGBoost, LightGBM) showed slight sensitivity to C and learning_rate, respectively, but remained robust, supported by their regularization mechanisms. These findings confirm that the tuned models are reliable for P2P lending applications on the Lending Club dataset, with Hyperopt and Optuna offering stable configurations alongside computational efficiency (Section 5.2), enhancing their suitability for scalable credit risk prediction.
5.4. Feature Importance Analysis
The feature importance analysis conducted for the LightGBM model, which demonstrated superior predictive performance across the Lending Club, Australia, and Taiwan datasets, provided critical insights into the key drivers of default risk in peer-to-peer lending. Given LightGBM’s high AUC scores (70.77% on Lending Club, 93.25% on Australia, and 77.85% on Taiwan), this analysis focused on the Lending Club dataset due to its large size (233,015 records) and rich feature set (52 attributes), offering a robust context for evaluating feature contributions. The analysis leveraged two complementary metrics: the gain metric, which measures each feature’s contribution to reducing prediction error through impurity reduction across tree splits, and SHAP (Shapley Additive Explanations) values, which quantify each feature’s marginal contribution to individual predictions while accounting for feature interactions. The results are visualized in
Figure 2 for gain-based rankings across all four hyperparameter tuning methods (Grid Search, Random Search, Hyperopt, and Optuna) and in
Figure 3 and
Figure 4, for SHAP-based rankings, with
Figure 3 showing rankings for Grid Search (upper panel) and Random Search (lower panel), and
Figure 4 showing rankings for Hyperopt (upper panel) and Optuna (lower panel). This comprehensive evaluation revealed the stability and consistency of feature importance across tuning methods, highlighted key predictors of default risk, and identified potential concerns regarding fairness and bias.
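As a sketch of how these two views can be extracted for a fitted LGBMClassifier (assuming a pandas feature matrix X; this illustrates the general approach, not the exact analysis pipeline):

```python
# Sketch of the two importance views for a fitted LGBMClassifier `model`:
# split-gain totals from the booster, and mean |SHAP| values from TreeExplainer.
import numpy as np
import pandas as pd
import shap

def importance_tables(model, X):
    booster = model.booster_
    gain = pd.Series(booster.feature_importance(importance_type="gain"),
                     index=booster.feature_name(), name="gain")
    shap_values = shap.TreeExplainer(model).shap_values(X)
    # For binary classification, some shap versions return a list [class0, class1].
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0),
                              index=X.columns, name="mean_abs_shap")
    return gain.sort_values(ascending=False), mean_abs_shap.sort_values(ascending=False)
```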
Starting with
Figure 2, which presents the top ten feature importance rankings based on the gain metric across all four tuning methods, the debt-to-income (DTI) ratio emerges as the most influential feature across Grid Search, Random Search, Hyperopt, and Optuna. The DTI ratio, which measures a borrower's total debt obligations relative to their income, is a critical indicator of financial strain and repayment capacity, making its prominence unsurprising and consistent with domain knowledge in credit risk modeling. A high DTI ratio often signals increased default risk, as borrowers with substantial debt burdens relative to their income are less likely to meet repayment obligations. Following DTI, employment title consistently ranks as the second most important feature across all tuning methods. This feature captures socioeconomic factors, such as job stability and income potential, which are closely tied to creditworthiness. For instance, borrowers in higher-paying or more stable professions may exhibit lower default rates, while those in precarious or lower-income roles may face higher risks. Other consistently high-ranking features include loan amount, interest rate, and annual income, which appear among the top five across all methods. Loan amount and interest rate reflect the financial burden of the loan itself, with larger loans and higher interest rates often correlating with increased default probability due to greater repayment pressure. Annual income, on the other hand, provides insight into a borrower's overall financial capacity, serving as a counterbalance to debt-related features. Additional features, such as credit history-related attributes (e.g., number of open accounts, recent credit inquiries) and loan-specific variables (e.g., term length), appear in the top ten, underscoring their role in assessing borrower reliability and loan risk. The remarkable consistency of these rankings across tuning methods is evidenced by high Spearman correlation coefficients between the method-specific rankings, indicating that the choice of hyperparameter optimization method has minimal impact on the relative importance of features. This stability enhances LightGBM's reliability for practical deployment in P2P lending, as it ensures that key risk drivers remain consistent regardless of the tuning approach, facilitating transparent and interpretable decision-making.
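The rank-stability claim can be checked with a Spearman correlation between any two methods' importance vectors, as in this sketch with hypothetical gain values:

```python
# Sketch of the ranking-stability check: Spearman correlation between the
# gain-based importance vectors of two tuning methods (values are hypothetical).
from scipy.stats import spearmanr

gain_grid_search = {"dti": 950, "emp_title": 610, "loan_amnt": 480,
                    "int_rate": 455, "annual_inc": 390}
gain_optuna = {"dti": 900, "emp_title": 640, "loan_amnt": 470,
               "int_rate": 460, "annual_inc": 400}

features = list(gain_grid_search)
rho, p = spearmanr([gain_grid_search[f] for f in features],
                   [gain_optuna[f] for f in features])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```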
Figure 3 and
Figure 4 provide a more granular perspective by presenting SHAP-based feature importance rankings for LightGBM.
Figure 3 displays rankings optimized by Grid Search in the upper panel and Random Search in the lower panel, while
Figure 4 displays rankings optimized by Hyperopt in the upper panel and Optuna in the lower panel. SHAP values, grounded in game theory, quantify each feature’s contribution to individual predictions, capturing complex interactions and offering a robust measure of interpretability. In
Figure 3 (upper panel, Grid Search), the DTI ratio dominates, with a high SHAP value reflecting its substantial influence on model outputs, aligning with the gain-based findings and reinforcing DTI’s role as a primary driver of default risk. Employment title follows as the second most important feature, consistent with its high gain-based ranking, emphasizing its socioeconomic significance. In the lower panel (Random Search), DTI and employment title maintain their top positions, with similar rankings for loan amount, interest rate, and annual income, reflecting the stability of feature importance across these tuning methods. However, the prominence of employment title raises concerns about potential biases, as certain job titles may correlate with protected attributes such as race, gender, or socioeconomic status, which could lead to unfair lending decisions. For example, if job titles associated with lower-income or less stable professions disproportionately predict defaults, the model may inadvertently penalize borrowers from marginalized groups. Loan-specific features, including interest rate, loan amount, and term length, rank highly in
Figure 3, consistent with their gain-based importance, as they directly influence the financial burden on borrowers. Annual income and credit history features, such as the number of open accounts and recent credit inquiries, also appear in the top ten, reflecting their role in assessing borrower stability and credit behavior. These features capture signals of financial distress, such as overextension through multiple credit lines or frequent credit applications, which are often precursors to default.
Figure 4, with Hyperopt in the upper panel and Optuna in the lower panel, shows nearly identical rankings in the two panels, with DTI and employment title maintaining their positions as the top two features in both. Interest rate and loan amount follow closely, underscoring their universal importance across tuning methods. Slight variations in SHAP values for lower-ranked features, such as credit inquiries, suggest minor differences in how Hyperopt's and Optuna's Bayesian optimization approaches prioritize feature interactions. For instance, Optuna may place slightly greater emphasis on credit history features due to its iterative sampling of promising hyperparameter configurations, which could influence how feature interactions are modeled. Nevertheless, the overall rankings in both panels remain highly correlated (high Spearman rank correlation), confirming the robustness of feature importance across Bayesian methods. The high consistency across
Figure 3 and
Figure 4 highlights the stability of LightGBM's feature importance, regardless of the tuning method, which is critical for ensuring reliable and interpretable underwriting decisions in P2P lending platforms. The consistently high Spearman correlation coefficients across all methods reinforce this stability, suggesting that hyperparameter tuning variations do not significantly alter the model's focus on key risk drivers.
A notable observation is the subtle difference between SHAP-based and gain-based rankings. While DTI and employment title remain the top features in both metrics, SHAP values in
Figure 3 and
Figure 4 place slightly greater emphasis on credit history features, such as the number of open accounts and recent credit inquiries, compared to the gain metric in
Figure 2. This discrepancy arises because SHAP accounts for feature interactions and their impact on individual predictions, whereas the gain metric focuses on aggregate error reduction across tree splits. For example, credit inquiries may interact with loan amount or interest rate to signal financial distress, an effect that SHAP captures more explicitly by assigning higher importance to these interactions. In contrast, the gain metric prioritizes features that contribute most to overall error reduction, which may downplay interaction effects. This complementary nature of gain and SHAP metrics provides a comprehensive understanding of feature importance, with gain offering a high-level view of predictive power and SHAP providing nuanced insights into individual prediction contributions. The high correlation between rankings (Spearman
) across both metrics and all tuning methods further validates LightGBM’s interpretability, making it a reliable choice for P2P lending applications where transparency is essential for regulatory compliance and stakeholder trust.
The practical implications of these findings are significant for P2P lending platforms. Prioritizing features like DTI, loan amount, interest rate, and annual income in underwriting processes can enhance default prediction accuracy, as these factors directly relate to financial strain and repayment capacity. DTI, as the top-ranked feature across all methods in
Figure 2,
Figure 3 and
Figure 4, is a straightforward indicator of a borrower’s ability to manage debt, making it a cornerstone of credit risk assessment. Loan amount and interest rate, which reflect the size and cost of the loan, are critical for evaluating repayment feasibility, while annual income provides a broader context for financial stability. Credit history features, such as open accounts and inquiries, offer additional signals of borrower behavior, helping platforms identify risky patterns like overextension or frequent borrowing. However, the high importance of employment title across all methods necessitates careful scrutiny. While it captures valuable socioeconomic signals, its use could introduce bias if certain job titles were disproportionately associated with specific demographic groups. For instance, if low-paying or unstable job titles correlate with higher default rates, the model may unfairly penalize borrowers from certain socioeconomic backgrounds, raising ethical concerns. This issue aligns with prior work emphasizing the need for fairness in credit scoring, as noted in the paper’s concluding remarks, which suggest complementing feature importance analysis with fairness audits using tools like SHAP to ensure ethical decision-making.
The stability of feature rankings across tuning methods and metrics enhances LightGBM’s suitability for scalable credit risk prediction. The computational efficiency of Bayesian methods (Hyperopt and Optuna), which achieve comparable feature importance stability to Grid Search with significantly reduced runtime (e.g., 3.19 vs. 241.47 min for LightGBM), further supports their practical deployment, as shown in the consistent rankings in
Figure 4. This efficiency is particularly valuable for large datasets like Lending Club, where computational resources are a limiting factor. Moreover, the consistent identification of key risk drivers, such as DTI and employment title, enables platforms to develop transparent and interpretable models that align with regulatory requirements. However, the reliance on employment title underscores the need for further investigation into potential biases, as its prominence could inadvertently perpetuate inequities in lending decisions. Future research, as suggested in the paper, could explore feature interactions (e.g., DTI with annual income) to uncover nuanced risk patterns, potentially enhancing model performance and interpretability. Additionally, fairness-focused analyses, such as auditing employment title for correlations with protected attributes, could mitigate bias and ensure equitable credit scoring, aligning with the ethical considerations highlighted in the study’s conclusions.
The operational impact of implementing the feature importance findings from this analysis, namely, improved default prediction, increased investor satisfaction, platform retention, and compliance readiness, depends on several key factors. First, the quality and consistency of input data, such as accurate borrower financial information (e.g., DTI, annual income) and standardized employment title reporting, are critical to ensuring reliable model predictions in production. Second, the effective integration of the LightGBM model into P2P lending platforms requires robust infrastructure, including scalable computational resources and real-time data processing capabilities, to handle large datasets like Lending Club (233,015 records, 52 attributes). Third, addressing fairness concerns, particularly for features like employment title, necessitates ongoing fairness audits and bias mitigation strategies to align with regulations such as the EU AI Act and the U.S. Equal Credit Opportunity Act. Finally, stakeholder collaboration, including clear communication of model outputs to investors and regulators, is essential to translate transparency into trust and compliance. By addressing these factors, P2P lending platforms can maximize the practical benefits of this research, reducing default rates while enhancing investor confidence and regulatory alignment.