1. Introduction
Corporate credit ratings are central to modern financial systems, influencing investment decisions, debt pricing, and regulatory capital requirements. Beyond their static role, rating transitions and defaults reflect the dynamic evolution of firm risk, particularly during periods of macroeconomic stress or structural disruption. Accurately modeling and explaining these transitions is vital for investors, regulators, and institutions seeking forward-looking insights into credit deterioration.
Traditionally, credit risk models have relied on structural frameworks such as Merton’s [
1] option-theoretic approach and statistical models like logistic regression (LR) [
2,
3]. While theoretically sound, these models assume linearity, firm homogeneity, and stable relationships between financial variables—assumptions that are increasingly challenged by the complexity of real-world credit behavior. Moreover, reliance on risk-weighted assets (RWAs) under Basel frameworks has been criticized for encouraging regulatory arbitrage and weakening alignment between modeled and actual risk [
4].
Recent advances in machine learning (ML) offer promising alternatives. ML models such as random forests (RF), XGBoost, and long short-term memory (LSTM) networks can detect nonlinearities, handle high-dimensional data, and adapt to changing environments. For instance, Kandi and García-Dopico [
5] showed that LSTM outperforms XGBoost in imbalanced datasets by capturing temporal dependencies without complex feature engineering—an important benefit in financial sequence data. However, many ML models remain opaque, hindering their adoption in regulated settings.
To address this, explainable artificial intelligence (XAI) techniques such as SHapley Additive exPlanations (SHAP) and LIME are gaining traction. Nallakaruppan et al. [
6] demonstrated that integrating these tools into financial models supports both interpretability and regulatory compliance. Despite these developments, key gaps persist: few studies apply segment-specific modeling across rating categories [
7], and many overlook firm-level disclosure complexity, which increases informational asymmetry [
8].
This study proposes a hybrid ML framework to explain corporate rating transitions and defaults. We stratify firms into rating segments, apply out-of-time validation to simulate real-world performance, and use SHAP values to interpret model outputs. By benchmarking LR, RF, XGBoost, and SVM, we contribute a robust, explainable, and segment-aware early warning system (EWS) for credit risk management.
2. Theoretical Background and Literature Review
2.1. Traditional Approaches to Credit Risk Modeling
Traditional credit risk assessment relies on structural and statistical models grounded in economic theory. Merton’s [
1] structural model conceptualizes a firm’s equity as a European call option on its assets, with default occurring when asset value falls below debt obligations at maturity. Although this model offers a theoretical link between default probability, asset volatility, and capital structure, its practical use is limited by the need for unobservable inputs and strong assumptions about market efficiency and continuous trading.
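The option-theoretic mechanics can be illustrated with a short sketch under the standard Black–Scholes assumptions Merton adopts. The asset value and volatility inputs below are hypothetical; in practice they are unobservable and must be backed out from equity data, which is exactly the limitation noted above.

```python
from math import log, sqrt, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def merton_default_prob(V, D, sigma_V, r, T):
    """Risk-neutral default probability under Merton's structural model.

    V: current asset value, D: face value of debt due at maturity T,
    sigma_V: asset volatility, r: risk-free rate, T: horizon in years.
    Default occurs if asset value falls below D at maturity; its
    risk-neutral probability is N(-d2).
    """
    d1 = (log(V / D) + (r + 0.5 * sigma_V**2) * T) / (sigma_V * sqrt(T))
    d2 = d1 - sigma_V * sqrt(T)
    return norm_cdf(-d2)

# Hypothetical leveraged firm: 80% debt-to-assets, 30% asset volatility
p = merton_default_prob(V=100.0, D=80.0, sigma_V=0.3, r=0.02, T=1.0)
```

The sketch makes the model's comparative statics concrete: raising leverage (higher D relative to V) or asset volatility increases the default probability, which is the theoretical link the paragraph describes.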
Due to these limitations, empirical research has increasingly favored statistical models based on observable firm-level data. LR gained prominence for its interpretability and operational simplicity. Foundational studies by Altman [
2] and Ohlson [
3] demonstrated the predictive power of financial ratios for bankruptcy and default, laying the groundwork for early credit scoring systems used in academic and regulatory contexts.
However, logistic models have important limitations. They assume linear and additive relationships between predictors and default probability and often presume homogeneity across firms and conditions. As Böhnke et al. [
4] point out, reliance on internally developed RWA models under Basel II and III can lead to regulatory arbitrage and a weakening link between modeled risk and actual exposures. These models also struggle to capture dynamic risk profiles and economic shocks.
Structural models have been further criticized for ignoring stochastic debt issuance. Feldhütter and Schaefer [
9] provide evidence that short-term leverage volatility often exceeds asset volatility, contradicting assumptions of constant debt levels. This highlights the need for models that incorporate evolving capital structures.
While LR remains common due to its transparency and ease of use, it is increasingly outpaced by more flexible ML methods. Zedda [
10] confirms that although logistic models are still used in regulatory settings, they underperform in capturing nonlinear interactions in firm data. These shortcomings have paved the way for advanced approaches better suited to modeling heterogeneity and dynamic credit behavior, as explored in the next section.
2.2. Evolution to ML in Credit Risk
Limitations of traditional models have accelerated the adoption of ML techniques in credit risk assessment. Unlike LR, which assumes linearity, ML models can capture nonlinear relationships and adapt to changing economic conditions. With fewer parametric assumptions, ML techniques can extract patterns from high-dimensional data, improving prediction of defaults and rating transitions [
11].
A key strength of ML is its ability to handle data heterogeneity and class imbalance, which are common in credit datasets where defaults are relatively rare. Ensemble models, in particular, have shown strong classification performance and robustness to noise and imbalanced class distributions. Techniques such as Synthetic Minority Oversampling Technique (SMOTE) further enhance generalizability by balancing training data without distorting distributions [
12]. In personal credit contexts, Wang et al. [
11] show that XGBoost outperforms traditional models in assessing borrower risk with high precision.
Empirical evidence highlights the strengths of ML in credit risk modeling. Xu and Zhang [
13] demonstrated that combining SVM with ensemble methods improves classification performance in the technology sector, showcasing the adaptability and effectiveness of hybrid ML models in corporate credit scoring.
AutoML platforms like H2O-AutoML and AutoGluon lower technical barriers and streamline deployment. Papík and Papíková [
14] show these tools deliver superior area under the curve (AUC) scores and faster model development in bankruptcy prediction, particularly in data-constrained environments like Slovak manufacturing firms. Similar findings are echoed by Papík and Papíková [
15], who highlight the operational feasibility of AutoML systems for practitioners with limited modeling expertise.
To address concerns about model opacity, explainable AI (XAI) tools have become indispensable. SHAP stands out as a leading method for interpreting ML outputs. Hamida et al. [
16] highlight the importance of explainability in high-stakes domains, emphasizing that XAI enhances transparency and fosters stakeholder trust in AI-driven decision systems.
Hybrid approaches are also emerging. Sun et al. [
17] propose an interpretable SHAP-based framework for macroeconomic forecasting, incorporating news narrative sentiment, with applications in credit cycle modeling. Tribuvan et al. [
18] further demonstrate that combining feature selection with ensemble classifiers enhances model precision in default prediction across diverse financial datasets.
2.3. Gaps in Literature
Despite ML’s promise, several key limitations hinder its broader application in credit risk modeling.
Interpretability remains a primary concern. While SHAP and other XAI tools provide insights, few studies systematically integrate them into credit rating transition or early warning models. Noriega et al. [
19] point out that many models still operate as “black boxes,” limiting their acceptance by regulators and practitioners.
Segment-level performance is often overlooked. Most models are evaluated on aggregated data, assuming firm homogeneity. This masks important variations across rating categories. Bitetto et al. [
7] and Beltman et al. [
20] highlight the need to tailor models to firm-specific profiles, warning that a lack of stratification leads to biased predictions and weaker diagnostics.
Validation practices are also inadequate. Many studies rely on random cross-validation, ignoring the temporal structure of financial data and risking information leakage. Machado et al. [
21] advocate for out-of-time validation to ensure models remain robust across economic cycles.
Feature diversity remains limited, as most models rely heavily on traditional financial ratios while overlooking non-financial indicators. Hu [
22] and Safiullah et al. [
23] demonstrate that incorporating ESG factors can enhance forecast accuracy, particularly in sensitive sectors.
Macroeconomic regime shifts are under-modeled. James and Menzies [
24] show that structural changes in the market environment affect the predictive relevance of risk factors, yet most models fail to account for such dynamics.
Methodologically, techniques like generative adversarial networks (GANs) remain underused despite their promise. Strelcenia and Prakoonwit [
25] find GAN-based models outperform traditional sampling in fraud detection, suggesting their utility in default prediction.
Ethical concerns are often treated as an afterthought. Fox and Rey [
26] argue that fairness should be embedded in model design, not assessed post hoc. Horobet et al. [
27] note that limited collaboration between finance and AI fields hampers innovation and policy alignment.
These gaps highlight the need for credit models that are not only accurate but also interpretable, adaptable, and ethically robust. The following framework integrates these priorities.
2.4. Theoretical Framework
This study adopts an interdisciplinary framework to support a hybrid ML approach to credit risk modeling. It integrates perspectives from credit theory, decision science, information economics, portfolio theory, XAI, institutional theory, and stakeholder theory—aligning innovation with regulatory and ethical requirements.
Credit risk theory provides a foundation, with models such as Altman’s Z-score [
2], Ohlson’s [
3] logit model, and Merton’s [
1] framework still central to Basel II and III regulation [
4]. However, assumptions like static debt structures are unrealistic [
9], requiring more dynamic modeling. ML enhances these models by identifying nonlinear patterns and enabling forward-looking indicators like expected default frequency.
Decision science frames credit risk as a classification problem with asymmetric costs—where false negatives (missed defaults) are more damaging than false positives. ML facilitates probabilistic outputs and threshold tuning, strengthening EWS [
20], especially when validated through temporal forecasting [
21].
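The asymmetric-cost logic above can be made concrete with a small sketch: a probabilistic classifier's decision threshold is tuned to minimize expected misclassification cost. The 10:1 cost ratio and the synthetic data below are hypothetical, not taken from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical cost structure: a missed default (false negative) is
# assumed 10x as costly as a false alarm (false positive).
C_FN, C_FP = 10.0, 1.0

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=500).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

def expected_cost(threshold):
    pred = (proba >= threshold).astype(int)
    fn = np.sum((pred == 0) & (y_te == 1))  # missed defaults
    fp = np.sum((pred == 1) & (y_te == 0))  # false alarms
    return C_FN * fn + C_FP * fp

thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=expected_cost)
# With FN costs dominating, the chosen threshold tends to sit below the
# default 0.5 cutoff, flagging more firms as at-risk.
```

This is the sense in which ML's probabilistic outputs strengthen an EWS: the threshold becomes a policy lever reflecting the institution's loss function rather than a fixed 0.5 cutoff.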
Information economics and signaling theory emphasize ML’s role in reducing information asymmetries. Models can detect latent signals from firm disclosures, ESG reports, or behavioral data [
28,
29].
Portfolio theory [
30] underscores the importance of accurate credit predictions for capital allocation and diversification. ML improves prediction quality at the portfolio level [
14]. Peer-matching techniques, such as in Bitetto et al. [
7], refine assessments of unlisted firms, enhancing capital efficiency.
Explainability is critical for adoption. Gafsi [
31] emphasizes that interpretability is essential for building regulatory trust and fostering stakeholder engagement. SHAP values contribute to this by offering both global and local transparency in model behavior.
According to institutional theory, the adoption of interpretable ML models can be understood as a response to coercive (regulatory), normative (professional), and mimetic (competitive) pressures that drive organizational isomorphism [
32]. In sensitive domains such as credit scoring, embedding fairness and accountability into model design is essential to meet these institutional expectations [
26].
Stakeholder theory supports the integration of ESG and innovation indicators. Green innovation enhances reputational capital and financial resilience [
23], while strong ESG profiles correlate with reduced default risk [
33].
This framework informs the design of a robust, interpretable, and segment-aware ML approach to credit risk management.
2.5. This Study’s Contribution
This study contributes to the literature by proposing a hybrid ML framework that addresses four challenges: performance, interpretability, segment-level heterogeneity, and temporal robustness.
First, it applies segment-specific modeling, grouping firms by credit quality (e.g., investment-grade, speculative-grade, distressed). This stratification improves diagnostic precision by revealing patterns obscured in aggregate evaluations [
7,
20].
Second, it uses temporal validation to assess generalizability across future periods, avoiding biases introduced by random cross-validation [
21]. It also applies SHAP to enhance interpretability at global and segment levels, clarifying feature importance across firm types and time horizons.
Third, it compares four algorithms—LR, RF, XGBoost, and SVM—assessing both predictive accuracy and interpretability. This enables a nuanced understanding of trade-offs between transparency and performance [
13,
18,
31]. By including SVM-based models and applying feature selection techniques where relevant, the study reflects contemporary best practices in the credit modeling literature.
Finally, it develops a practical EWS that integrates segmentation, dynamic validation, and explainable outputs. This EWS supports timely insights for credit monitoring, portfolio management, and regulatory oversight [
14,
20].
2.6. Research Hypotheses
Building on the theoretical and empirical foundations discussed in the literature review, this study develops and tests five hypotheses that structure the analysis of rating transitions and defaults. These hypotheses address three dimensions: the comparative performance of different modeling approaches, the value of segment-specific analysis, and the heterogeneity of explanatory drivers across rating categories.
H1. Ensemble models (RF, XGBoost) outperform traditional models (LR, SVM) in predicting rating transitions and defaults.
Structural and statistical models such as Merton’s option-theoretic framework [
1] and LR [
3] provide transparency but assume linearity, firm homogeneity, and stable relationships between financial variables. Recent advances in ML, however, demonstrate that ensemble models can capture nonlinearities and high-dimensional interactions that traditional approaches overlook [
5]. We therefore expect RF and XGBoost to achieve superior predictive performance relative to LR and SVM.
H2. Segment-specific modeling by rating band produces higher predictive performance than aggregated modeling.
Traditional approaches often impose the assumption of uniform risk drivers across firms. Yet prior research highlights that credit risk dynamics differ substantially by rating category, undermining the homogeneity assumption embedded in linear frameworks [
2]. Consistent with this evidence, we hypothesize that stratifying firms by rating segment will yield more robust and accurate predictive outcomes than a single aggregated model.
H3. The determinants of rating transitions vary significantly across rating segments.
The literature shows that the factors influencing rating changes are not uniform across credit qualities. We thus hypothesize that SHAP analysis will reveal heterogeneous sets of explanatory drivers across investment-grade and speculative-grade categories.
H4. Liquidity and coverage metrics play a stronger role in explaining transitions for lower-rated firms.
Liquidity constraints and refinancing risks have long been identified as key vulnerabilities in speculative-grade entities [
2]. Metrics such as funds from operations to debt and interest coverage ratios are therefore expected to exert stronger influence in explaining transitions among lower-quality firms.
H5. Management and Governance factors are more relevant in explaining transitions for higher-rated firms.
For investment-grade issuers, where liquidity constraints are less severe, long-term stability is more closely tied to governance and financial policy choices. Prior studies emphasize the role of governance and capital structure in maintaining rating stability and credibility in regulated settings [
6]. Accordingly, we expect these factors to dominate in higher-quality rating segments.
Together, these hypotheses connect the predictive performance of ML models with the interpretability of risk drivers across rating categories. In subsequent sections, we describe the methodological procedures employed to test these hypotheses and present their empirical verification.
3. Research Design and Data
3.1. Data Collection and Sources
The dataset employed in this study was compiled from Capital IQ Pro and Bloomberg, comprising firm-year observations from 2017 to 2022. The initial dataset included over 35,000 observations across multiple geographies and industries. Following the removal of duplicates, incomplete entries, and firms lacking consistent credit rating histories, the final panel consisted of 31,151 firm-year observations.
The sample ends in 2022 because more recent financial and rating data were still incomplete or pending validation at the time of collection. Using 2017–2022 ensures consistency, comparability, and reliability across firms and rating segments.
To ensure consistency and minimize reporting bias, we applied several data quality filters:
Only firms with publicly available financial statements and recognized credit ratings from S&P Global were retained;
All variables were cross-validated to eliminate entry errors and outliers;
Financial ratios were calculated using standardized formulas to avoid distortion from varying accounting practices.
Credit ratings were assigned based on the full set of 23 S&P Global notches (AAA to D), subsequently grouped into six broader rating segments—High Quality, BBB, BB, B, CCC, and High Risk + Default—to balance granularity and statistical power.
The binary target variable was set to 1 if the firm experienced a rating downgrade or default within the subsequent 12 months, and 0 otherwise. Rating upgrades were treated as non-events (Class 0), consistent with the study’s focus on identifying credit deterioration risks.
Table 1 summarizes the distribution of downgrade/default events (Class 1) and non-events (Class 0) across six credit rating segments. While the total sample size is held constant across segments (
N = 31,151), the proportion of Class 1 events varies significantly, reflecting heterogeneity in downgrade or default risk. To ensure comparability across segments and control for sample size effects, we created balanced subsamples of 31,151 firm-year observations for each rating category. This design allows for direct performance comparisons without bias introduced by unequal class distributions.
The High Risk + Default segment shows the lowest incidence of Class 1 events (0.64%), suggesting these ratings are typically assigned to already deteriorated credits with limited further room for rating decline. In contrast, the B segment presents the highest proportion of Class 1 cases (30.71%), highlighting it as a critical transition zone where firms are more prone to rating actions.
The BBB and BB segments exhibit moderate Class 1 proportions (27.39% and 20.21%, respectively), indicating elevated but still manageable risk levels relative to lower-rated entities. The High Quality segment displays a relatively low downgrade/default rate (12.97%), while the CCC segment’s lower Class 1 proportion (7.15%) may reflect a floor effect—entities at this level already face severe credit stress, limiting further deterioration within the rating scale.
This distribution highlights the need for segment-specific modeling, as downgrade and default probabilities vary substantially across the rating spectrum. Applying a one-size-fits-all approach may obscure meaningful risk patterns within each segment and ultimately reduce predictive performance.
3.2. Rating Grouping and Class Consolidation
Rating Grouping
In this study, we utilize the full range of S&P Global credit rating grades, comprising 23 ordered levels from AAA (indicating the highest creditworthiness) to D (default). Given the concentration of credit risk at the lower end of the spectrum, market participants often consolidate ratings of CCC+ and below into broader categories due to their elevated probability of default and financial distress.
To address issues of class imbalance—particularly relevant in ML applications—and to preserve interpretability, we consolidated the 23 rating levels into six broader segments. This grouping reduces data sparsity in rare rating categories and improves the statistical power of model training and segment-level analyses.
Table 2 presents the mapping of the original S&P credit ratings into these six consolidated classes:
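A hypothetical reconstruction of this consolidation in code is shown below. The exact notch boundaries follow Table 2, which is not reproduced here, so the composition of the High Quality and High Risk + Default groups is an assumption (the 23-notch scale is taken to include SD alongside D).

```python
# Assumed notch-to-segment mapping; the exact boundaries in Table 2 may differ.
SEGMENT_MAP = {
    **dict.fromkeys(["AAA", "AA+", "AA", "AA-", "A+", "A", "A-"], "High Quality"),
    **dict.fromkeys(["BBB+", "BBB", "BBB-"], "BBB"),
    **dict.fromkeys(["BB+", "BB", "BB-"], "BB"),
    **dict.fromkeys(["B+", "B", "B-"], "B"),
    **dict.fromkeys(["CCC+", "CCC", "CCC-"], "CCC"),
    **dict.fromkeys(["CC", "C", "SD", "D"], "High Risk + Default"),
}

def to_segment(notch: str) -> str:
    """Collapse a 23-level S&P notch into one of six modeling segments."""
    return SEGMENT_MAP[notch]
```

Consolidating this way trades notch-level granularity for denser classes, which is precisely the statistical-power motivation stated above.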
3.3. Data Preprocessing and Treatment
To prepare the dataset for ML modeling, several preprocessing steps were undertaken to ensure data integrity, address class imbalance, and enhance predictive performance.
First, all continuous numerical variables were standardized using the StandardScaler, which removes each variable’s mean and rescales it to unit variance so that predictors with larger numeric ranges do not dominate the analysis. Categorical variables, such as industry classification and management and governance indicators, were transformed via one-hot encoding to retain non-ordinal relationships.
To reduce redundancy and multicollinearity, highly correlated features (Pearson’s ρ > 0.9) were identified and removed based on correlation matrices and variance inflation factors (VIF). Missing values in financial ratio variables were imputed using industry-median values; records were excluded when variables exhibited a high proportion of missingness or could not be reliably imputed.
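These preprocessing steps can be sketched as follows. The column names and data are synthetic, and overall medians stand in for the industry medians used in the study; VIF screening is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "roa": rng.normal(size=200),
    "roe": rng.normal(size=200),
    "leverage": rng.normal(size=200),
    "industry": rng.choice(["mfg", "tech", "retail"], size=200),
})
df["roa_dup"] = df["roa"] * 1.01 + rng.normal(scale=0.01, size=200)  # near-duplicate
df.loc[rng.choice(200, 10, replace=False), "leverage"] = np.nan

# 1) Median imputation (the paper uses industry medians)
num_cols = ["roa", "roe", "leverage", "roa_dup"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 2) Drop one of each highly correlated pair (|rho| > 0.9)
corr = df[num_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
kept = [c for c in num_cols if c not in to_drop]

# 3) Standardize numeric features, one-hot encode categoricals
pre = ColumnTransformer([
    ("num", StandardScaler(), kept),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["industry"]),
])
X = pre.fit_transform(df)
```

Keeping the scaler and encoder inside a ColumnTransformer lets the same fitted transformation be applied to future (e.g., 2022 hold-out) data without re-fitting, which matters for the temporal validation described later.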
The dataset was further stratified by credit rating segment, ensuring that Class 1 (downgrade or default event) was defined within each segment to capture segment-specific credit risk. This stratification allowed for more context-aware binary classification modeling.
Given the inherent class imbalance—especially in segments such as High Risk + Default—SMOTE was applied post-train–test split to prevent information leakage [
34,
35].
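The leakage-avoiding order of operations can be illustrated with a minimal from-scratch SMOTE-style oversampler; the study would use a library implementation, and the interpolation scheme below is only a sketch of the core idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X, y, minority=1, k=5, seed=0):
    """Minimal SMOTE-style oversampling: each synthetic minority point is
    interpolated between a real minority point and one of its k nearest
    minority-class neighbors, until the classes are balanced."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = np.sum(y != minority) - len(X_min)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # col 0 is self
    gap = rng.random((n_new, 1))
    X_syn = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return np.vstack([X, X_syn]), np.concatenate([y, np.full(n_new, minority)])

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
# Split FIRST, then oversample only the training partition: the test set
# never sees synthetic points, so no information leaks across the split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = smote_oversample(X_tr, y_tr)
```

The crucial detail is the ordering: applying SMOTE before the split would let interpolations of test-set observations appear in training data, inflating measured performance.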
Two validation strategies were employed: (1) five-fold cross-validation for in-sample evaluation and hyperparameter tuning, and (2) temporal validation, wherein models were trained on observations from 2017 to 2021 and evaluated on 2022 data to approximate real-world generalization.
Finally, all models underwent hyperparameter optimization via RandomizedSearchCV, with performance evaluated primarily using the F1-Score and the Matthews Correlation Coefficient (MCC), a balanced measure of classification quality that is particularly suitable for imbalanced datasets. The parameter grids and tuning strategy for each model are detailed in
Section 3.7.
3.4. ML Models
We evaluated four supervised learning algorithms—LR, RF, XGBoost, and SVM—to address the challenge of predicting rare credit rating downgrades and defaults. These models were selected to reflect a range of trade-offs across interpretability, nonlinearity handling, and computational scalability, which are critical design considerations in credit risk modeling.
LR is a benchmark model widely adopted in credit scoring due to its interpretability and computational simplicity. It assumes a linear relationship between predictors and the log-odds of the target class. While its transparent coefficient structure aids explainability, it lacks the flexibility to capture nonlinear relationships commonly observed in distressed corporate environments. Recent applications in bankruptcy prediction show that automated model selection frameworks outperform baseline LR in predictive power while retaining transparency [
15].
RF is an ensemble-based model that aggregates multiple decision trees to reduce overfitting and improve generalization. Its use of feature bagging and bootstrap sampling enhances robustness in high-dimensional and noisy datasets—a common feature of financial and accounting data in credit risk. Comparative studies have demonstrated that RF achieves competitive performance in credit risk classification, particularly when combined with feature selection techniques [
18].
XGBoost builds on ensemble methods through sequential model optimization via gradient boosting. Its regularization mechanisms help mitigate overfitting while improving classification performance, especially for imbalanced data. In personal credit risk applications, XGBoost has outperformed traditional models in both accuracy and sensitivity to default-prone classes, underscoring its potential in financial domains involving rare-event prediction [
11].
SVM offers strong performance in high-dimensional spaces by constructing optimal hyperplanes that maximize class separation margins. However, its sensitivity to kernel selection and computational cost limits scalability in large datasets. Moreover, its performance tends to decline under severe class imbalance—an inherent feature in downgrade/default prediction. Empirical studies in the credit assessment of listed firms confirm that SVM may require hybrid approaches to remain competitive [
13].
3.5. Target Variables
The target variable is defined as binary: the value of 1 indicates that the firm experienced a rating downgrade or default within a 12-month horizon; a value of 0 denotes all other cases. A downgrade is operationalized as any transition to a lower rating category (e.g., from BBB to BB, or from BB to B), including default events. In contrast, upgrades—transitions to higher rating categories (e.g., from BB to BBB)—are considered non-events and thus classified as Class 0, consistent with the modeling objective of early identification of credit deterioration.
Although upgrades are grouped within Class 0 to maintain a binary classification structure, we acknowledge the potential for rating volatility, including downgrade-upgrade cycles within short time frames. In such cases, the downgrade event is prioritized if it occurs within the 12-month prediction window, as the analytical focus is on identifying early signs of credit deterioration, which hold greater relevance for risk management and regulatory purposes.
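The labeling rule, including the downgrade-priority convention for volatile histories, can be sketched as follows. The rating history, ordinal levels, and column names are hypothetical.

```python
import pandas as pd

# Hypothetical ordinal scale: a larger level means a worse rating.
RATING_ORDER = {"BBB": 3, "BB": 4, "B": 5, "CCC": 6, "D": 7}

hist = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2019-06-30", "2019-12-31", "2020-03-31",
                            "2019-06-30", "2021-06-30"]),
    "rating": ["BBB", "BB", "BBB", "BB", "BB"],
}).sort_values(["firm", "date"])
hist["level"] = hist["rating"].map(RATING_ORDER)

def label_observation(row, history, horizon_days=365):
    """Class 1 if ANY downgrade (move to a higher ordinal level) occurs
    within the next 12 months. Upgrades alone remain Class 0, and a
    downgrade inside the window takes priority over a later upgrade."""
    fw = history[(history["firm"] == row["firm"])
                 & (history["date"] > row["date"])
                 & (history["date"] <= row["date"] + pd.Timedelta(days=horizon_days))]
    return int((fw["level"] > row["level"]).any())

hist["target"] = hist.apply(label_observation, axis=1, history=hist)
```

In this toy history, firm A's first observation is labeled 1 because a downgrade to BB occurs within the window, even though the firm is later upgraded back to BBB, illustrating the downgrade-priority rule.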
3.6. Feature Engineering
The final model incorporates 46 features, comprising a diverse set of profitability ratios, leverage and coverage metrics, efficiency indicators, cash flow variables, balance sheet components, structural attributes, and qualitative governance factors. These variables were selected based on their theoretical relevance to credit risk and their empirical performance in prior studies.
Table 3 provides an overview of the variables used in model training, grouped into thematic categories along with a brief description or calculation formula for each.
This comprehensive feature set captures both quantitative financial health and qualitative risk dimensions, aligning with credit theory and enabling explainable modeling through SHAP analysis.
3.7. Methodology
This study evaluated the predictive performance of four supervised ML models—LR, RF, XGBoost, and SVM—to classify corporate credit rating transitions. All models were implemented in Python v3.10 using the scikit-learn v1.3.2 and XGBoost v1.7.6 libraries.
To optimize model performance, we conducted hyperparameter tuning using the RandomizedSearchCV function with a 5-fold stratified cross-validation strategy. The F1-score was selected as the primary metric to guide the search. The hyperparameter search space for each model was defined as follows:
3.7.1. XGBoost
n_estimators: [50, 100, 200];
max_depth: [3, 6, 9];
learning_rate: [0.01, 0.1, 0.2];
subsample: [0.7, 0.8, 0.9].
3.7.2. RF
n_estimators: [50, 100, 200];
max_depth: [10, 20, 30, None];
min_samples_split: [2, 5, 10];
min_samples_leaf: [1, 2, 4].
3.7.3. LR
C (inverse of regularization strength): [0.01, 0.1, 1, 10, 100];
max_iter: [500, 1000, 1500].
3.7.4. SVM
To address class imbalance, SMOTE was applied to the training set within each fold of cross-validation. This ensured that oversampling did not introduce information leakage from the test set. After hyperparameter optimization, the best-performing estimator for each model was further evaluated using a separate 5-fold cross-validation to confirm the robustness and generalizability of the results.
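The tuning procedure can be sketched with the RF grid from Section 3.7.2. The number of sampled candidates (n_iter) is not reported in the text, so the value below is a placeholder, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)

# RF search space as listed in Section 3.7.2
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=8,                      # placeholder; the study's value is unreported
    scoring="f1",                  # F1 guides the search, as stated above
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
best_model = search.best_estimator_
```

RandomizedSearchCV samples candidate configurations rather than exhausting the grid, which keeps the 5-fold tuning loop tractable across four models and six rating segments.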
3.8. Validation Approach
To ensure robust evaluation, we adopt two validation strategies: (i) 5-fold cross-validation and (ii) temporal validation, where models were trained using data from 2017 to 2021 (approximately 25,000 observations) and tested on a hold-out set from 2022 (6151 observations). This approach simulates real-world scenarios by assessing generalization to future periods, a critical step for credit risk prediction models with temporal dependencies.
The combination of both validation methods enhances the robustness and external validity of the results.
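The out-of-time split described above reduces to a simple year-based partition. The sketch below uses a synthetic panel; the column names `year` and `target` are assumptions, as the dataset's schema is not shown in the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "year": rng.integers(2017, 2023, size=500),   # 2017..2022 inclusive
    "x": rng.normal(size=500),
    "target": rng.integers(0, 2, size=500),
})

# Out-of-time split: fit on 2017-2021, evaluate on the held-out 2022 cohort.
train = panel[panel["year"] <= 2021]
test = panel[panel["year"] == 2022]
assert train["year"].max() < test["year"].min()  # no temporal overlap
```

Unlike random cross-validation, this split guarantees that every test observation postdates every training observation, mimicking how the model would actually be deployed.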
3.9. Evaluation Metrics
Performance is evaluated using Accuracy, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), Precision, Recall, F1-Score, and MCC.
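A minimal sketch of computing this metric set with scikit-learn follows, using a synthetic classifier and data; the study's actual pipeline is described in Section 3.7.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

metrics = {
    "Accuracy": accuracy_score(y_te, pred),
    "AUC-ROC": roc_auc_score(y_te, proba),   # needs scores, not hard labels
    "Precision": precision_score(y_te, pred),
    "Recall": recall_score(y_te, pred),
    "F1-Score": f1_score(y_te, pred),
    "MCC": matthews_corrcoef(y_te, pred),    # robust under class imbalance
}
```

Note that AUC-ROC is computed from predicted probabilities while the remaining metrics use thresholded labels, which is why both `predict_proba` and `predict` outputs are kept.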
4. Results
4.1. Comparative Performance by Rating Segment
As hypothesized in H1, ensemble models (RF and XGBoost) consistently outperformed traditional approaches (LR and SVM) across all rating segments (
Table 4).
This section evaluates the predictive performance of four supervised learning algorithms—LR, RF, XGBoost, and SVM—across six credit rating segments. Each model was assessed using a comprehensive set of classification metrics, including AUC-ROC, F1-Score, MCC, Precision, Recall, and computational efficiency. Together, these metrics capture both model efficacy and the trade-offs that arise in large-scale credit risk settings.
XGBoost demonstrated the highest overall performance and computational efficiency. In the High Quality segment, it achieved perfect predictive metrics (AUC-ROC = 1.000, F1-Score = 1.000, MCC = 1.000), completing the evaluation in just 6 s. In more challenging segments, such as CCC and High Risk + Default, XGBoost maintained strong generalization capacity (F1-Score = 0.987 in CCC). In the High Risk + Default segment, it attained perfect Precision and Recall (1.000), as shown in
Table 5, underscoring its robustness under class imbalance and noisy data conditions [
36]. Moreover, its integration with explainable AI techniques, such as SHAP values, supports transparent and interpretable decision-making in regulatory environments [
6,
37].
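The study's interpretability layer is SHAP; reproducing it verbatim would require the shap package, so the sketch below uses scikit-learn's permutation importance as a dependency-light stand-in that conveys the same global-attribution idea, ranking features by how much shuffling each one degrades out-of-sample AUC. Model and data are synthetic, and the substitution is an illustration, not the paper's method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stand-in for the boosted-tree model (the study uses XGBoost)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Global attribution: how much does permuting each feature degrade AUC?
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)
ranked = sorted(enumerate(result.importances_mean),
                key=lambda t: t[1], reverse=True)
```

SHAP additionally provides local, per-observation attributions (why this firm was flagged), which is what makes it attractive in regulatory settings; permutation importance captures only the global ranking shown here.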
RF also delivered strong results, particularly in the B segment (AUC-ROC = 0.999, F1 = 0.973, MCC = 0.960). While competitive with XGBoost across several segments, its training times were significantly longer—e.g., 52 s in the CCC segment. These findings align with ensemble benchmarking studies, which emphasize RF’s robustness and interpretability, albeit with higher computational costs [
38].
LR showed intermediate performance. While it performed adequately in investment-grade segments (e.g., F1 = 0.877 in High Quality), its effectiveness diminished in speculative categories. In the BB segment, F1 dropped to 0.548 and MCC to 0.424, suggesting that LR’s linear assumptions may limit its capacity to model complex nonlinearities present in distressed firms. These results are consistent with earlier critiques of traditional linear credit scoring models [
2].
SVM exhibited the lowest overall predictive performance. In the BB segment, it attained AUC-ROC = 0.714, F1 = 0.419, and MCC = 0.233, with training times of 6 s. Although performance was more stable in the High Quality segment (F1 = 0.912), SVM consistently lagged behind ensemble-based models. Its limited scalability and sensitivity to class imbalance reduce its applicability in operational credit risk settings [
38].
In summary, XGBoost emerges as the most suitable model for EWS in credit risk, especially where timely identification of deterioration is critical. RF remains a viable alternative, balancing performance and interpretability at the expense of longer training times. LR may still be favored in low-risk environments that prioritize transparency. SVM, however, is less suited to large-scale applications due to its lower performance and limited scalability.
Table 4 presents a comparative summary of classification metrics—AUC-ROC, F1-Score, MCC, Precision, Recall—and training times for each model across the six credit rating segments. This facilitates a detailed, segment-wise evaluation of predictive accuracy and computational efficiency among the supervised learning algorithms.
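The segment-wise metrics discussed above (AUC-ROC, F1-Score, MCC, Precision, Recall, and training time) can be computed with standard scikit-learn functions. The following is a minimal sketch on synthetic imbalanced data; the model set and data here are illustrative stand-ins, not the study's actual configuration.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_auc_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

# Synthetic imbalanced dataset standing in for one rating segment.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {"RF": RandomForestClassifier(random_state=0),
          "LR": LogisticRegression(max_iter=1000)}

for name, model in models.items():
    t0 = time.time()
    model.fit(X_tr, y_tr)
    elapsed = time.time() - t0
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)
    print(f"{name}: AUC={roc_auc_score(y_te, proba):.3f} "
          f"F1={f1_score(y_te, pred):.3f} "
          f"MCC={matthews_corrcoef(y_te, pred):.3f} "
          f"P={precision_score(y_te, pred):.3f} "
          f"R={recall_score(y_te, pred):.3f} "
          f"time={elapsed:.1f}s")
```

Repeating this loop once per rating segment (with segment-specific training data) yields a table of the same shape as the comparison described above.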
4.2. Confusion Matrices
Figure 1 presents the consolidated confusion matrices for the four evaluated ML models—LR, RF, XGBoost, and SVM—applied across six credit rating segments: High Quality, BBB, BB, B, CCC, and High Risk + Default. The primary focus of interpretation lies in the diagonal cells, which represent correctly classified instances where the predicted rating segment matches the true label. High values along the diagonal indicate effective classification and segmentation of credit quality.
Among the evaluated models, XGBoost demonstrates the most consistent performance, with diagonal accuracies exceeding 90% across nearly all segments. This result suggests that XGBoost effectively captures complex, non-linear relationships within financial and operational features, leading to more precise credit risk differentiation. RF also performs well, particularly in the High Quality and High Risk + Default segments. However, moderate confusion is observed in intermediate categories such as BBB and BB, indicating some misclassification between adjacent risk classes. These off-diagonal errors are particularly challenging to manage when class boundaries are ambiguous or overlapping, as often occurs in mid-tier ratings [
39].
In contrast, LR exhibits lower discriminative power, especially between BB and B, where misclassifications are more frequent. This limitation likely stems from its linear nature and reduced capacity to model intricate feature interactions. SVM shows the least reliable performance, with dispersed prediction patterns and lower diagonal values—particularly in the CCC and High Risk + Default segments—highlighting challenges in detecting distressed firms.
Overall,
Figure 1 illustrates the superior performance of tree-based ensemble models, particularly XGBoost, in credit risk classification tasks. These models achieve lower false positive and false negative rates, which is crucial for balancing sensitivity (correctly identifying deteriorating firms) and specificity (avoiding false alarms) in EWS.
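The diagonal reading of the confusion matrices above corresponds to a row-normalized confusion matrix, whose diagonal entries give per-class recall. A brief sketch on synthetic multi-class data, assuming nothing about the study's actual feature set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

# Three synthetic classes standing in for adjacent rating segments.
X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# normalize="true" scales each row to sum to 1, so the diagonal is the
# share of each true class that was predicted correctly.
cm = confusion_matrix(y_te, clf.predict(X_te), normalize="true")
diag = np.diag(cm)
print(np.round(cm, 3))
print("per-class accuracy:", np.round(diag, 3))
```

Off-diagonal mass between neighboring classes in this matrix is the analogue of the BBB/BB confusion noted above for mid-tier ratings.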
4.3. ROC Curve Analysis
The Receiver Operating Characteristic (ROC) curve is a standard diagnostic tool for evaluating the discriminatory power of classification models in credit risk analysis. It depicts the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across different decision thresholds. The summary statistic, AUC-ROC, quantifies a model’s ability to distinguish between risky and safe firms, with values closer to 1 indicating stronger discriminative performance [
40].
In our study, XGBoost and RF consistently achieved AUC-ROC values equal to or approaching 1.000 across all six credit rating segments. In investment-grade segments (e.g., High Quality and BBB), both models demonstrated near-perfect separation. Even in speculative-grade categories such as BB and CCC, AUC-ROC values remained above 0.998. Conversely, LR exhibited deteriorating performance as credit risk intensified, with AUC declining to 0.826 in the BB segment. SVM presented the weakest results, particularly in the BB segment (AUC = 0.714) and the High Risk + Default group, where high error rates undermined its reliability.
To complement this analysis, we assessed TPR at fixed FPR thresholds (0.1, 0.2, and 0.5), offering practical insights into model sensitivity under varying false alarm tolerances. Assessing classifier sensitivity at pre-defined false positive levels has been widely used in screening applications to evaluate performance under real-world operational constraints [
41]. XGBoost and RF consistently maintained TPR values near 1.000 across all thresholds and segments. In contrast, LR and SVM underperformed, particularly in high-risk segments. For instance, in the BB segment at FPR = 0.1, LR achieved a TPR of 0.483, while SVM registered only 0.293, compared to 0.999 for XGBoost.
A detailed summary of AUC-ROC values, TPR at fixed FPR thresholds (0.1, 0.2, and 0.5), and Equal Error Rates (EER) for each model and rating segment is presented in
Table 5. These metrics provide a comprehensive view of each algorithm’s discriminative capacity and operational sensitivity across varying levels of credit risk.
We also report the EER—the point at which the FPR equals the False Negative Rate (FNR)—as an indicator of model calibration. Both XGBoost and RF exhibited minimal EER values (e.g., EER FPR < 0.01; EER TPR > 0.99), reinforcing their robustness in imbalanced credit risk settings, where accurately identifying minority-class (risky) firms is especially challenging [
42].
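The operating-point metrics reported above (TPR at fixed FPR tolerances, and the EER where FPR equals FNR) can be read off the ROC curve by interpolation. A minimal sketch, using synthetic scores in place of model output:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic imbalanced scores: 900 "safe" firms, 100 "risky" firms.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(900), np.ones(100)])
scores = np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)])

fpr, tpr, _ = roc_curve(y_true, scores)

# TPR at fixed false-alarm tolerances, via interpolation on the ROC curve
# (fpr is non-decreasing, as np.interp requires).
for target in (0.1, 0.2, 0.5):
    print(f"TPR@FPR={target}: {np.interp(target, fpr, tpr):.3f}")

# EER: the point where FPR equals FNR (= 1 - TPR).
fnr = 1 - tpr
idx = np.argmin(np.abs(fpr - fnr))
print(f"EER point: FPR={fpr[idx]:.3f}, TPR={tpr[idx]:.3f}")
```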
4.4. Precision–Recall Curve Analysis
The Precision–Recall (PR) Curve is a critical tool for evaluating classifier performance under class imbalance, which is inherent in the prediction of corporate rating transitions and defaults. Unlike the ROC Curve, which assesses global class separability, the PR Curve focuses on the minority class—firms experiencing downgrades or defaults—making it particularly relevant in financial applications where early identification of distress is essential [
43,
44].
Key indicators extracted from this analysis include:
Average Precision (AP): Reflects the area under the PR curve, summarizing the trade-off between precision and recall across all thresholds. Higher AP values indicate stronger performance in ranking positive cases early in the decision process;
Precision at Fixed Recall Levels (0.2, 0.5, 0.8): Demonstrates the model’s ability to retain precision as it seeks to identify a broader set of positive instances;
Maximum F1-Score: Represents the optimal balance between precision and recall, indicating the point of highest classification efficiency.
These metrics are summarized in
Table 6, providing a comparative basis for evaluating model robustness across different rating segments.
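The three indicators above (AP, precision at fixed recall levels, and maximum F1) are all derivable from the PR curve. A hedged sketch with synthetic scores standing in for the study's classifiers:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Rare-positive setting: 50 distressed firms out of 1000.
rng = np.random.default_rng(1)
y_true = np.concatenate([np.zeros(950), np.ones(50)])
scores = np.concatenate([rng.normal(0, 1, 950), rng.normal(1.5, 1, 50)])

precision, recall, _ = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)

# precision_recall_curve returns recall in decreasing order, so reverse
# both arrays before interpolating at fixed recall levels.
for r in (0.2, 0.5, 0.8):
    p_at_r = np.interp(r, recall[::-1], precision[::-1])
    print(f"Precision@Recall={r}: {p_at_r:.3f}")

# Maximum F1 over all thresholds (clip guards against 0/0 at the ends).
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
print(f"AP={ap:.3f}, max F1={f1.max():.3f}")
```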
4.4.1. LR: Stable in Investment-Grade, Limited in Speculative Segments
LR displayed consistent and acceptable performance in investment-grade categories. For instance, AP values reached 0.911 for High Quality and 0.850 for BBB, indicating reliable performance where class imbalance is less severe. However, its precision deteriorated in lower-rated segments, particularly at higher recall levels:
In the BB segment, AP declined to 0.505, with a significant drop in precision at recall = 0.5538;
In CCC and High Risk + Default categories, AP values fell to 0.661 and 0.685, respectively, with maximum F1-scores below 0.732.
These results suggest that LR struggles to maintain predictive quality when required to identify a broader population of risky firms, thus limiting its applicability in early-warning or proactive risk surveillance contexts.
4.4.2. SVM: Inconsistent Performance and Poor Precision–Recall Trade-Offs
SVM presented the most variable performance among all models. While it achieved competitive AP values in isolated cases—0.922 for High Quality and 0.843 for B—the model underperformed in more speculative segments:
In the BB segment, SVM recorded an AP of only 0.376 and a maximum F1-score of 0.448;
In the High Risk + Default category, AP dropped to 0.588, with an F1-score of 0.636.
Precision at recall = 0.8 remained consistently low across these segments, revealing the model’s vulnerability when tasked with identifying a larger set of at-risk firms. These weaknesses undermine the model’s reliability for credit risk applications that demand high sensitivity and early detection.
4.4.3. Tree-Based Models: Superior Precision–Recall Trade-Offs Across All Segments
Tree-based models—XGBoost and RF—consistently outperformed other classifiers across all segments, particularly in scenarios of heightened credit risk. Both models maintained high precision even at elevated recall levels, highlighting their capacity to detect risky firms without incurring excessive false positives:
In the CCC and High Risk + Default segments, XGBoost achieved AP values near 1.000, with precision at recall = 0.8 exceeding 0.900;
RF also demonstrated robust results, with AP values above 0.990 in all segments and competitive F1-scores.
These findings align with previous studies emphasizing the effectiveness of PR Curve-based evaluation for tree-based classifiers under class imbalance [
43] and reinforce the importance of such metrics in EWS for credit risk [
44]. Their favorable trade-offs between precision and recall—especially under imbalance—highlight their practical relevance for dynamic credit surveillance and stress testing frameworks.
The PR-based findings are consistent with the ROC-based results presented in
Section 4.3, reinforcing the models’ generalizability and resilience under multiple validation schemes.
4.5. Temporal Validation (Out-of-Time Testing)
Consistent with H2, segment-specific modeling by rating band delivered higher predictive performance than aggregated analysis, underscoring the value of stratified approaches in credit risk (
Table 4 and
Table 7).
To evaluate the temporal robustness of the models, we implemented an out-of-time validation procedure. All models were trained using data from 2017 through 2021 and tested exclusively on firm-year observations from 2022. This approach replicates a realistic forecasting scenario, in which credit risk models are applied to future data unseen during model development. The objective was to assess each model’s generalization capacity under temporal distributional shifts, which is critical in the context of volatile economic and credit environments.
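The out-of-time protocol reduces to a simple chronological split: fit on 2017–2021 firm-years, score only 2022. The sketch below uses a synthetic panel; the column names (`year`, `default_flag`, and the two ratio features) are illustrative assumptions, not the study's schema.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score

# Synthetic firm-year panel for 2017-2022.
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "year": rng.integers(2017, 2023, n),       # 2017..2022 inclusive
    "liquidity": rng.normal(0, 1, n),
    "coverage": rng.normal(0, 1, n),
})
# Label loosely tied to the features, for illustration only.
df["default_flag"] = (df["liquidity"] + df["coverage"]
                      + rng.normal(0, 1, n) < -1.5).astype(int)

features = ["liquidity", "coverage"]
train = df[df["year"] <= 2021]                 # in-time training window
test = df[df["year"] == 2022]                  # unseen holdout year

clf = RandomForestClassifier(random_state=0)
clf.fit(train[features], train["default_flag"])
proba = clf.predict_proba(test[features])[:, 1]
pred = (proba >= 0.5).astype(int)
print(f"2022 AUC-ROC: {roc_auc_score(test['default_flag'], proba):.3f}")
print(f"2022 F1:      {f1_score(test['default_flag'], pred):.3f}")
```

Comparing these holdout metrics against the in-time cross-validated figures gives the degradation measure discussed in the subsections below.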
4.5.1. XGBoost: Strongest Temporal Generalization
XGBoost consistently exhibited superior generalization performance when applied to the 2022 dataset.
Robustness Across Segments: Across all credit risk categories, XGBoost preserved near-perfect performance, with AUC-ROC values exceeding 0.999 and F1-Scores above 0.987;
High-Risk Segment Performance: In the “High Risk + Default” segment, the model achieved an AUC-ROC and F1-Score of 1.000, identical to the in-time evaluation. This confirms its exceptional capacity to identify high-risk firms with complete accuracy, even under temporal shifts;
Stability Analysis: The comparison between in-time and out-of-time performance revealed negligible degradation, reinforcing the model’s robustness for use in dynamic credit risk settings where data patterns evolve over time.
4.5.2. RF: Slight Degradation, Yet High Reliability
RF showed mild, yet measurable, performance decline in the temporal test, although it remained a highly reliable model overall.
Segment-Level Degradation: In the BB segment, the F1-Score declined from 0.965 to 0.954, and AUC-ROC decreased from 0.999 to 0.997. Despite this reduction, the model continued to deliver robust predictions;
Consistent in Low-Risk Segments: In the “High Quality” and “BBB” categories, F1-Scores remained above 0.979 and AUC-ROC values above 0.999, suggesting minimal sensitivity to temporal drift in stable firms;
Implications: While slightly less stable than XGBoost, RF remains a strong candidate for temporal credit forecasting, especially when model interpretability and resource efficiency are prioritized.
4.5.3. LR: Stable but Limited Predictive Capability
LR maintained stable performance between in-time and out-of-time evaluations; however, its overall predictive capability remained inferior, particularly in high-risk segments.
Predictive Ceiling: In the BB segment, the F1-Score increased modestly from 0.548 to 0.556, with AUC-ROC improving from 0.826 to 0.830. Despite stability, these values are significantly below those of tree-based models, limiting the model’s practical utility;
High-Risk Segment Performance: The “High Risk + Default” category yielded an F1-Score of 0.533, slightly higher than the in-time score of 0.523. Nonetheless, this level of performance falls short of the accuracy required for proactive risk management;
Structural Constraints: The model’s linear assumptions hinder its ability to capture complex patterns inherent in credit risk data, particularly in speculative-grade segments where firm characteristics are more heterogeneous and volatile.
4.5.4. SVM: Marked Instability over Time
SVM demonstrated the most significant performance deterioration in the temporal validation, confirming its limited applicability in credit forecasting contexts.
Sharp Declines: In the BB segment, the F1-Score declined from 0.474 to 0.357, while AUC-ROC dropped sharply from 0.749 to 0.422, indicating a substantial loss of discriminatory power;
Performance in CCC Segment: The CCC segment also showed substantial degradation, with the F1-Score falling from 0.636 to 0.283—a reduction of over 50% in predictive efficacy;
Conclusion: SVM exhibited high sensitivity to temporal changes, particularly in segments with greater class imbalance or non-linear risk patterns. Its instability and computational inefficiency suggest that it is not well-suited for credit risk applications where model robustness over time is essential.
Table 7 summarizes the temporal validation results (2022 test set) for all four models—XGBoost, RF, LR, and SVM—across six credit risk segments. Metrics include AUC-ROC, Accuracy, F1-Score, MCC, Precision, and Recall.
4.6. Summary of Model Findings
To consolidate the main findings of this study,
Table 8 provides a comparative overview of the four models analyzed—XGBoost, RF, LR, and SVM—highlighting their performance across key dimensions: AP, temporal robustness, F1-score range, FPR in risk segments, and recommended use cases.
This comparative summary confirms the superior performance of XGBoost, which consistently achieved perfect or near-perfect results across all metrics (e.g., AUC = 1.000, AP = 1.000), along with strong generalization in the temporal holdout. Its main limitation remains moderate interpretability, a known trade-off in complex ensemble models.
RF performed nearly as well, with slightly lower precision but improved interpretability and greater operational simplicity. This makes it a viable alternative in contexts where transparency or computational resources are limiting factors.
LR, while interpretable and fast to deploy, underperformed in speculative-grade and high-risk segments. Its application should be limited to stable portfolios or as a benchmark in model comparisons.
SVM demonstrated the weakest results, exhibiting low temporal generalization, wide variability in performance, and high FPR in critical segments. These issues render it unsuitable for practical use in financial credit risk classification systems.
In summary, this synthesis emphasizes that beyond predictive accuracy, robustness, interpretability, and deployment context are essential considerations when selecting models for real-world credit risk applications.
5. Explainability: SHAP Analysis
Supporting H3, SHAP results reveal heterogeneous determinants of rating transitions across rating categories, with different sets of features driving outcomes in investment-grade versus speculative-grade firms (
Figure 2 and
Figure 3).
Overall SHAP Framework
In line with H5, governance quality and capital structure were dominant explanatory factors among higher-rated entities, consistent with prior evidence on rating stability (
Figure 3).
Aligned with H4, liquidity and coverage variables exerted the strongest influence in lower-rated firms, particularly in the CCC and High Risk + Default segments (
Figure 3).
To enhance the interpretability of the ML models, we applied SHAP, a model-agnostic technique that attributes prediction outcomes to individual input variables. SHAP enables both global and segment-level analyses by quantifying the marginal contribution of each feature to the model’s output. This capability is essential for understanding and validating automated credit risk assessments, particularly in regulated environments where transparency is critical.
SHAP has been successfully adopted across multiple high-stakes domains, including geological classification [
45], healthcare diagnostics [
46], and photovoltaic forecasting [
47]. These applications reinforce its relevance in settings that demand explainability, robustness, and domain-specific interpretability—characteristics that align closely with the requirements of credit risk modeling.
Figure 2 presents the consolidated SHAP summary plots by credit rating segment, revealing the most influential predictors of rating transitions and defaults.
Across all segments, key variables include management and governance quality, industry classification, liquidity, fixed charge coverage, and operating efficiency. These features are not only statistically important but also economically intuitive, reflecting established principles in credit evaluation.
In lower-rated segments—specifically CCC and High Risk + Default—liquidity emerges as a dominant variable, highlighting its relevance in assessing short-term solvency and distress risk. Conversely, higher-rated segments such as High Quality and BBB place greater weight on structural and strategic attributes, including management quality, industry exposure, and scale or diversification, which reflect the importance of forward-looking risk drivers among financially sound entities.
Profitability volatility, measured via earnings fluctuations, demonstrates consistent relevance across segments, particularly in BB and B categories. This underscores the role of earnings stability in signaling future rating transitions.
These findings suggest that the ML models dynamically adjust their explanatory basis based on the firm’s credit quality. Rather than relying on a static set of predictors, the algorithms prioritize different variables depending on the risk profile, demonstrating adaptability to the heterogeneity of real-world credit environments.
Figure 3 further illustrates this pattern, summarizing the top features across rating segments.
While variables like liquidity, operating efficiency, and governance consistently appear among the most impactful, their relative influence shifts: short-term solvency and earnings risk dominate in distressed segments, whereas industry dynamics and firm-level strategic characteristics prevail in investment-grade categories. These results align with domain knowledge and regulatory practices.
In conclusion, the SHAP-based interpretability framework confirms that the ML models offer not only strong predictive performance but also economic and theoretical coherence. Their decisions are grounded in well-established financial indicators, enhancing auditability, transparency, and trust—essential characteristics for the practical adoption of AI systems in credit risk management.
6. Discussion
To ensure coherence between the theoretical foundations, the research design, and the empirical findings, we evaluated the five hypotheses formulated in
Section 2.6. Each hypothesis was tested against the results presented in
Section 4, covering both predictive performance and feature interpretability across rating segments.
Table 9 summarizes the verification outcomes, indicating whether each hypothesis was supported by the evidence.
The empirical evidence provides comprehensive support for all five hypotheses, emphasizing the significance of this study’s contributions. The superior performance of ensemble models demonstrates the capacity of advanced ML techniques to model nonlinear relationships and complex interdependencies within credit risk data. The benefits of segment-specific analysis reveal the heterogeneous dynamics underlying different rating categories, underscoring the necessity of disaggregated approaches. Moreover, the distinct roles of liquidity, coverage, and governance variables illustrate the value of interpretable methods such as SHAP in uncovering the economic rationale for rating transitions. Taken together, these findings highlight that predictive accuracy and interpretability can be jointly pursued, offering a more transparent and practically relevant framework for credit risk assessment and management.
6.1. Strengths of Tree-Based Models
The results indicate that tree-based models, particularly XGBoost and RF, outperform traditional classifiers in predicting rating transitions and corporate defaults. These models achieved exceptional predictive performance across all rating segments, consistently posting AUC-ROC and F1-Scores near or equal to 1.000. Their robustness was further confirmed in temporal validation, where XGBoost maintained high predictive power in 2022 out-of-sample data, demonstrating strong generalization capacity.
Both models also showed resilience to class imbalance. In minority-class segments such as “High Risk + Default” and “CCC,” XGBoost achieved high recall and precision simultaneously—a task where traditional models often struggle—due to its capacity to model complex, non-linear relationships among financial and governance indicators.
From an operational standpoint, RF delivered competitive performance with slightly greater interpretability and reduced computational demands, suggesting its suitability for deployment in resource-constrained environments. SHAP analysis confirmed that model predictions were driven by meaningful variables (e.g., liquidity, fixed charge coverage, management and governance quality, and operating efficiency), reinforcing their reliability and explainability [
48,
49].
As detailed in
Table 9, XGBoost consistently achieved cross-validated F1-Scores above 0.986 and MCC values near 1.000 across all rating segments. RF also demonstrated solid generalization, with F1-Scores and MCCs typically exceeding 0.950. These findings support the efficacy of ensemble-based models in predictive credit risk analytics and their viability for real-world application in financial decision-making.
6.2. Limitations of Linear and SVM Models
While LR produced adequate results in high-quality and investment-grade segments, its performance declined substantially in lower-rated classes. Its linear assumptions limit the model’s ability to capture the complex, nonlinear relationships often exhibited by financially deteriorating firms. For instance, in the BB segment, LR yielded a cross-validated F1-Score of 0.544 and an MCC of 0.420 (
Table 9), limiting its practical applicability in EWS and proactive risk management.
In contrast, SVM demonstrated the most inconsistent outcomes among the evaluated models. Despite employing kernel methods and regularization, SVM underperformed across most evaluation metrics, particularly in segments characterized by data imbalance. It exhibited elevated FPR, longer computational times, and a high sensitivity to hyperparameter settings. In the BB segment, for example, it recorded a CV F1-Score of just 0.353 and an MCC of 0.106 (
Table 9), highlighting its instability in this context.
These limitations underscore the importance of utilizing models that can effectively manage high-dimensional, nonlinear, and imbalanced data—features that are intrinsic to corporate credit risk datasets.
6.3. Model Trade-Offs: Performance vs. Interpretability
Although XGBoost outperformed other models in terms of accuracy and generalization, it did so at the expense of interpretability. Its complexity—stemming from multiple decision trees and iterative boosting—hinders direct human understanding. In contrast, RF, while slightly less accurate, provides greater model transparency due to its ensemble averaging mechanism and reduced sensitivity to overfitting.
LR continues to serve as a benchmark for interpretability, as its coefficients can be directly associated with economic reasoning. This makes it particularly suitable for explainability-focused contexts, such as regulatory oversight [
6,
37]. However, its simplicity also constrains its performance in more complex or volatile rating segments.
Table 10 further illustrates this trade-off: LR maintained stable cross-validated F1-Scores in investment-grade segments (e.g., 0.947 in High Quality and 0.830 in BBB), while declining noticeably across speculative-grade classes.
The application of SHAP values helps mitigate the trade-off between predictive performance and interpretability by providing post hoc explanations of model behavior. SHAP allows complex models such as XGBoost to be interpreted at both global and local levels, offering transparency into how individual features influence predictions. Although originally applied in geospatial and environmental modeling [
48,
49], this interpretability framework is equally valuable in financial applications, where model transparency supports more responsible and explainable decision-making.
6.4. Practical Implications for Credit Risk Monitoring
These findings have important implications for financial institutions and regulators. Tree-based models can be effectively integrated into credit monitoring systems as early warning tools capable of flagging firms at considerable risk of downgrade or default. Their capacity to identify rare yet impactful events with low error rates is essential for mitigating systemic risk.
XGBoost, with its superior predictive accuracy, is particularly suitable for high-stakes applications where false negatives must be minimized. In contrast, RF offers a robust alternative when computational efficiency and interpretability are prioritized. LR, despite its limitations, may still be appropriate for low-risk portfolios or as a benchmark model due to its transparency and simplicity.
Furthermore, the use of temporal validation underscores these models’ potential to remain effective under changing macroeconomic conditions—an essential feature in volatile financial environments. This adaptability supports their application in dynamic credit risk models, stress testing frameworks, and real-time surveillance systems.
6.5. Limitations and Future Work
Despite rigorous validation, several limitations should be acknowledged. First, the dataset used in this study originates from a single institutional source, which may limit the external validity and generalizability of the findings. Future research could incorporate data from multiple rating agencies or international sources to capture broader market heterogeneity.
Second, the analysis relies exclusively on structured numerical and categorical data. Integrating unstructured sources—such as management commentary, news sentiment, and analyst reports, along with forward-looking indicators (e.g., CDS spreads or market-implied signals), may enhance predictive accuracy and contextual depth.
Third, while SHAP was effectively applied to interpret model outputs, future work could explore methods that embed explainability directly into the training process. Techniques such as explainability-aware objectives or inherently interpretable architectures could improve transparency without compromising performance. In addition, hybrid models that combine traditional credit scoring with ML, as well as semi-supervised or transfer learning approaches to exploit sparse or unlabeled data, present promising directions for advancing model robustness.
Finally, although temporal validation was conducted using a hold-out year (2022), extending the validation across longer horizons and multiple economic cycles would further support the model’s reliability. Evaluating performance across diverse market regimes, industries, and geographic regions will be critical to confirm the stability and adaptability of these models in dynamic financial environments.
To mitigate overfitting and enhance generalizability, several safeguards were implemented. Hyperparameter tuning was conducted using stratified cross-validation and independent test sets. Temporal validation demonstrated that models trained on 2017–2021 data maintained high performance when applied to 2022 observations, suggesting the capture of generalizable patterns. Furthermore, SHAP analysis across folds and rating segments revealed consistent variable importance, reinforcing the models’ stability and resilience to noise and sampling variation.
7. Conclusions
This study proposed a hybrid ML framework to explain and predict corporate credit rating transitions and defaults. By integrating advanced classification algorithms—XGBoost, RF, LR, and SVM—with segment-specific modeling, temporal validation, and SHAP-based explainability, the research offers a robust and interpretable approach to credit risk assessment.
Empirical results confirm that tree-based ensemble models, particularly XGBoost, outperform traditional linear models and SVM across all rating segments and under various validation schemes. These models demonstrated superior predictive accuracy, resilience to class imbalance, and stability in out-of-time testing. Importantly, the use of SHAP values enabled transparent interpretation of key risk drivers, tailored to each credit rating segment, thereby enhancing model trustworthiness in regulatory and operational settings.
The findings contribute to both academic literature and industry practice by addressing several known gaps: (i) the lack of segment-level model stratification, (ii) the underuse of temporal validation in credit risk studies, (iii) the need for model interpretability in regulated environments, and (iv) the practical integration of explainability into EWS. The results suggest that ML-based frameworks—when properly tuned and explained—can support more proactive and transparent credit surveillance.
These contributions, however, are bounded by several assumptions and limitations. The analysis assumes that the 2017–2022 period adequately reflects the dynamics of rating transitions, even though more recent shocks may alter risk patterns. It also assumes that the chosen financial and governance variables sufficiently capture rating determinants, leaving aside unstructured disclosures, macro indicators, and market-based signals that may enrich predictive performance. Methodologically, the study assumes that the selected ML algorithms and tuning strategy fairly represent the performance–interpretability trade-off, though alternative approaches may yield different outcomes. Finally, the segmentation by rating bands is assumed to be the most relevant dimension of heterogeneity, even though industry or regional factors could provide complementary insights.
Future research should explicitly test these assumptions by extending datasets beyond 2022, incorporating qualitative and forward-looking variables, and experimenting with hybrid models that embed interpretability directly into the learning process. Such efforts would not only validate the robustness of the present framework but also enhance its adaptability to systemic shocks and evolving regulatory expectations.
Overall, this research demonstrates that integrating ML with domain-relevant segmentation and explainable AI tools can significantly improve the effectiveness, transparency, and adaptability of credit risk modeling. As financial markets evolve and regulatory scrutiny increases, such frameworks are likely to become indispensable for institutions seeking to strengthen risk governance and decision-making under uncertainty.