1. Introduction
Corporate credit ratings are central to modern financial systems, influencing investment decisions, debt pricing, and regulatory capital requirements. Beyond their static role, rating transitions and defaults reflect the dynamic evolution of firm risk, particularly during periods of macroeconomic stress or structural disruption. Accurately modeling and explaining these transitions is vital for investors, regulators, and institutions seeking forward-looking insights into credit deterioration.
Traditionally, credit risk models have relied on structural frameworks such as Merton’s [
1] option-theoretic approach and statistical models like logistic regression (LR) [
2,
3]. While theoretically sound, these models assume linearity, firm homogeneity, and stable relationships between financial variables—assumptions that are increasingly challenged by the complexity of real-world credit behavior. Moreover, reliance on risk-weighted assets (RWAs) under Basel frameworks has been criticized for encouraging regulatory arbitrage and weakening alignment between modeled and actual risk [
4].
Recent advances in machine learning (ML) offer promising alternatives. ML models such as random forests (RF), XGBoost, and long short-term memory (LSTM) networks can detect nonlinearities, handle high-dimensional data, and adapt to changing environments. For instance, Kandi and García-Dopico [
5] showed that LSTM outperforms XGBoost in imbalanced datasets by capturing temporal dependencies without complex feature engineering—an important benefit in financial sequence data. However, many ML models remain opaque, hindering their adoption in regulated settings.
To address this, explainable artificial intelligence (XAI) techniques such as SHapley Additive exPlanations (SHAP) and LIME are gaining traction. Nallakaruppan et al. [
6] demonstrated that integrating these tools into financial models supports both interpretability and regulatory compliance. Despite these developments, key gaps persist: few studies apply segment-specific modeling across rating categories [
7], and many overlook firm-level disclosure complexity, which increases informational asymmetry [
8].
This study proposes a hybrid ML framework to explain corporate rating transitions and defaults. We stratify firms into rating segments, apply out-of-time validation to simulate real-world performance, and use SHAP values to interpret model outputs. By benchmarking LR, RF, XGBoost, and SVM, we contribute a robust, explainable, and segment-aware early warning system (EWS) for credit risk management.
2. Theoretical Background and Literature Review
2.1. Traditional Approaches to Credit Risk Modeling
Traditional credit risk assessment relies on structural and statistical models grounded in economic theory. Merton’s [
1] structural model conceptualizes a firm’s equity as a European call option on its assets, with default occurring when asset value falls below debt obligations at maturity. Although this model offers a theoretical link between default probability, asset volatility, and capital structure, its practical use is limited by the need for unobservable inputs and strong assumptions about market efficiency and continuous trading.
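The option-theoretic mechanics can be illustrated with a short sketch under the standard Black–Scholes assumptions Merton adopts. The asset value and volatility inputs below are hypothetical; in practice they are unobservable and must be backed out from equity data, which is exactly the limitation noted above.

```python
from math import log, sqrt, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def merton_default_prob(V, D, sigma_V, r, T):
    """Risk-neutral default probability under Merton's structural model.

    V: current asset value, D: face value of debt due at maturity T,
    sigma_V: asset volatility, r: risk-free rate, T: horizon in years.
    Default occurs if asset value falls below D at maturity; its
    risk-neutral probability is N(-d2).
    """
    d1 = (log(V / D) + (r + 0.5 * sigma_V**2) * T) / (sigma_V * sqrt(T))
    d2 = d1 - sigma_V * sqrt(T)
    return norm_cdf(-d2)

# Hypothetical leveraged firm: 80% debt-to-assets, 30% asset volatility
p = merton_default_prob(V=100.0, D=80.0, sigma_V=0.3, r=0.02, T=1.0)
```

The sketch makes the model's comparative statics concrete: raising leverage (higher D relative to V) or asset volatility increases the default probability, which is the theoretical link the paragraph describes.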
Due to these limitations, empirical research has increasingly favored statistical models based on observable firm-level data. LR gained prominence for its interpretability and operational simplicity. Foundational studies by Altman [
2] and Ohlson [
3] demonstrated the predictive power of financial ratios for bankruptcy and default, laying the groundwork for early credit scoring systems used in academic and regulatory contexts.
However, logistic models have important limitations. They assume linear and additive relationships between predictors and default probability and often presume homogeneity across firms and conditions. As Böhnke et al. [
4] point out, reliance on internally developed RWA models under Basel II and III can lead to regulatory arbitrage and a weakening link between modeled risk and actual exposures. These models also struggle to capture dynamic risk profiles and economic shocks.
Structural models have been further criticized for ignoring stochastic debt issuance. Feldhütter and Schaefer [
9] provide evidence that short-term leverage volatility often exceeds asset volatility, contradicting assumptions of constant debt levels. This highlights the need for models that incorporate evolving capital structures.
While LR remains common due to its transparency and ease of use, it is increasingly outpaced by more flexible ML methods. Zedda [
10] confirms that although logistic models are still used in regulatory settings, they underperform in capturing nonlinear interactions in firm data. These shortcomings have paved the way for advanced approaches better suited to modeling heterogeneity and dynamic credit behavior, as explored in the next section.
2.2. Evolution to ML in Credit Risk
Limitations of traditional models have accelerated the adoption of ML techniques in credit risk assessment. Unlike LR, which assumes linearity, ML models can capture nonlinear relationships and adapt to changing economic conditions. With fewer parametric assumptions, ML techniques can extract patterns from high-dimensional data, improving prediction of defaults and rating transitions [
11].
A key strength of ML is its ability to handle data heterogeneity and class imbalance, which are common in credit datasets where defaults are relatively rare. Ensemble models, in particular, have shown strong classification performance and robustness to noise and imbalanced class distributions. Techniques such as Synthetic Minority Oversampling Technique (SMOTE) further enhance generalizability by balancing training data without distorting distributions [
12]. In personal credit contexts, Wang et al. [
11] show that XGBoost outperforms traditional models in assessing borrower risk with high precision.
Empirical evidence highlights the strengths of ML in credit risk modeling. Xu and Zhang [
13] demonstrated that combining SVM with ensemble methods improves classification performance in the technology sector, showcasing the adaptability and effectiveness of hybrid ML models in corporate credit scoring.
AutoML platforms like H2O-AutoML and AutoGluon lower technical barriers and streamline deployment. Papík and Papíková [
14] show these tools deliver superior area under the curve (AUC) scores and faster model development in bankruptcy prediction, particularly in data-constrained environments like Slovak manufacturing firms. Similar findings are echoed by Papík and Papíková [
15], who highlight the operational feasibility of AutoML systems for practitioners with limited modeling expertise.
To address concerns about model opacity, explainable AI (XAI) tools have become indispensable. SHAP stands out as a leading method for interpreting ML outputs. Hamida et al. [
16] highlight the importance of explainability in high-stakes domains, emphasizing that XAI enhances transparency and fosters stakeholder trust in AI-driven decision systems.
Hybrid approaches are also emerging. Sun et al. [
17] propose an interpretable SHAP-based framework for macroeconomic forecasting, incorporating news narrative sentiment, with applications in credit cycle modeling. Tribuvan et al. [
18] further demonstrate that combining feature selection with ensemble classifiers enhances model precision in default prediction across diverse financial datasets.
2.3. Gaps in Literature
Despite ML’s promise, several key limitations hinder its broader application in credit risk modeling.
Interpretability remains a primary concern. While SHAP and other XAI tools provide insights, few studies systematically integrate them into credit rating transition or early warning models. Noriega et al. [
19] point out that many models still operate as “black boxes,” limiting their acceptance by regulators and practitioners.
Segment-level performance is often overlooked. Most models are evaluated on aggregated data, assuming firm homogeneity. This masks important variations across rating categories. Bitetto et al. [
7] and Beltman et al. [
20] highlight the need to tailor models to firm-specific profiles, warning that a lack of stratification leads to biased predictions and weaker diagnostics.
Validation practices are also inadequate. Many studies rely on random cross-validation, ignoring the temporal structure of financial data and risking information leakage. Machado et al. [
21] advocate for out-of-time validation to ensure models remain robust across economic cycles.
Feature diversity remains limited, as most models rely heavily on traditional financial ratios while overlooking non-financial indicators. Hu [
22] and Safiullah et al. [
23] demonstrate that incorporating ESG factors can enhance forecast accuracy, particularly in sensitive sectors.
Macroeconomic regime shifts are under-modeled. James and Menzies [
24] show that structural changes in the market environment affect the predictive relevance of risk factors, yet most models fail to account for such dynamics.
Methodologically, techniques like generative adversarial networks (GANs) remain underused despite their promise. Strelcenia and Prakoonwit [
25] find GAN-based models outperform traditional sampling in fraud detection, suggesting their utility in default prediction.
Ethical concerns are often treated as an afterthought. Fox and Rey [
26] argue that fairness should be embedded in model design, not assessed post hoc. Horobet et al. [
27] note that limited collaboration between finance and AI fields hampers innovation and policy alignment.
These gaps highlight the need for credit models that are not only accurate but also interpretable, adaptable, and ethically robust. The following framework integrates these priorities.
2.4. Theoretical Framework
This study adopts an interdisciplinary framework to support a hybrid ML approach to credit risk modeling. It integrates perspectives from credit theory, decision science, information economics, portfolio theory, XAI, institutional theory, and stakeholder theory—aligning innovation with regulatory and ethical requirements.
Credit risk theory provides a foundation, with models such as Altman’s Z-score [
2], Ohlson’s [
3] logit model, and Merton’s [
1] framework still central to Basel II and III regulation [
4]. However, assumptions like static debt structures are unrealistic [
9], requiring more dynamic modeling. ML enhances these models by identifying nonlinear patterns and enabling forward-looking indicators like expected default frequency.
Decision science frames credit risk as a classification problem with asymmetric costs—where false negatives (missed defaults) are more damaging than false positives. ML facilitates probabilistic outputs and threshold tuning, strengthening EWS [
20], especially when validated through temporal forecasting [
21].
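The asymmetric-cost logic above can be made concrete with a small sketch: a probabilistic classifier's decision threshold is tuned to minimize expected misclassification cost. The 10:1 cost ratio and the synthetic data below are hypothetical, not taken from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical cost structure: a missed default (false negative) is
# assumed 10x as costly as a false alarm (false positive).
C_FN, C_FP = 10.0, 1.0

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=500).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

def expected_cost(threshold):
    pred = (proba >= threshold).astype(int)
    fn = np.sum((pred == 0) & (y_te == 1))  # missed defaults
    fp = np.sum((pred == 1) & (y_te == 0))  # false alarms
    return C_FN * fn + C_FP * fp

thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=expected_cost)
# With FN costs dominating, the chosen threshold tends to sit below the
# default 0.5 cutoff, flagging more firms as at-risk.
```

This is the sense in which ML's probabilistic outputs strengthen an EWS: the threshold becomes a policy lever reflecting the institution's loss function rather than a fixed 0.5 cutoff.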
Information economics and signaling theory emphasize ML’s role in reducing information asymmetries. Models can detect latent signals from firm disclosures, ESG reports, or behavioral data [
28,
29].
Portfolio theory [
30] underscores the importance of accurate credit predictions for capital allocation and diversification. ML improves prediction quality at the portfolio level [
14]. Peer-matching techniques, such as in Bitetto et al. [
7], refine assessments of unlisted firms, enhancing capital efficiency.
Explainability is critical for adoption. Gafsi [
31] emphasizes that interpretability is essential for building regulatory trust and fostering stakeholder engagement. SHAP values contribute to this by offering both global and local transparency in model behavior.
According to institutional theory, the adoption of interpretable ML models can be understood as a response to coercive (regulatory), normative (professional), and mimetic (competitive) pressures that drive organizational isomorphism [
32]. In sensitive domains such as credit scoring, embedding fairness and accountability into model design is essential to meet these institutional expectations [
26].
Stakeholder theory supports the integration of ESG and innovation indicators. Green innovation enhances reputational capital and financial resilience [
23], while strong ESG profiles correlate with reduced default risk [
33].
This framework informs the design of a robust, interpretable, and segment-aware ML approach to credit risk management.
2.5. This Study’s Contribution
This study contributes to the literature by proposing a hybrid ML framework that addresses four challenges: performance, interpretability, segment-level heterogeneity, and temporal robustness.
First, it applies segment-specific modeling, grouping firms by credit quality (e.g., investment-grade, speculative-grade, distressed). This stratification improves diagnostic precision by revealing patterns obscured in aggregate evaluations [
7,
20].
Second, it uses temporal validation to assess generalizability across future periods, avoiding biases introduced by random cross-validation [
21]. It also applies SHAP to enhance interpretability at global and segment levels, clarifying feature importance across firm types and time horizons.
Third, it compares four algorithms—LR, RF, XGBoost, and SVM—assessing both predictive accuracy and interpretability. This enables a nuanced understanding of trade-offs between transparency and performance [
13,
18,
31]. By including SVM-based models and applying feature selection techniques where relevant, the study reflects contemporary best practices in the credit modeling literature.
Finally, it develops a practical EWS that integrates segmentation, dynamic validation, and explainable outputs. This EWS supports timely insights for credit monitoring, portfolio management, and regulatory oversight [
14,
20].
2.6. Research Hypotheses
Building on the theoretical and empirical foundations discussed in the literature review, this study develops and tests five hypotheses that structure the analysis of rating transitions and defaults. These hypotheses address three dimensions: the comparative performance of different modeling approaches, the value of segment-specific analysis, and the heterogeneity of explanatory drivers across rating categories.
H1. Ensemble models (RF, XGBoost) outperform traditional models (LR, SVM) in predicting rating transitions and defaults.
Structural and statistical models such as Merton’s option-theoretic framework [
1] and LR [
3] provide transparency but assume linearity, firm homogeneity, and stable relationships between financial variables. Recent advances in ML, however, demonstrate that ensemble models can capture nonlinearities and high-dimensional interactions that traditional approaches overlook [
5]. We therefore expect RF and XGBoost to achieve superior predictive performance relative to LR and SVM.
H2. Segment-specific modeling by rating band produces higher predictive performance than aggregated modeling.
Traditional approaches often impose the assumption of uniform risk drivers across firms. Yet prior research highlights that credit risk dynamics differ substantially by rating category, undermining the homogeneity assumption embedded in linear frameworks [
2]. Consistent with this evidence, we hypothesize that stratifying firms by rating segment will yield more robust and accurate predictive outcomes than a single aggregated model.
H3. The determinants of rating transitions vary significantly across rating segments.
The literature shows that the factors influencing rating changes are not uniform across credit qualities. We thus hypothesize that SHAP analysis will reveal heterogeneous sets of explanatory drivers across investment-grade and speculative-grade categories.
H4. Liquidity and coverage metrics play a stronger role in explaining transitions for lower-rated firms.
Liquidity constraints and refinancing risks have long been identified as key vulnerabilities in speculative-grade entities [
2]. Metrics such as funds from operations to debt and interest coverage ratios are therefore expected to exert stronger influence in explaining transitions among lower-quality firms.
H5. Management and Governance factors are more relevant in explaining transitions for higher-rated firms.
For investment-grade issuers, where liquidity constraints are less severe, long-term stability is more closely tied to governance and financial policy choices. Prior studies emphasize the role of governance and capital structure in maintaining rating stability and credibility in regulated settings [
6]. Accordingly, we expect these factors to dominate in higher-quality rating segments.
Together, these hypotheses connect the predictive performance of ML models with the interpretability of risk drivers across rating categories. In subsequent sections, we describe the methodological procedures employed to test these hypotheses and present their empirical verification.
3. Research Design and Data
3.1. Data Collection and Sources
The dataset employed in this study was compiled from Capital IQ Pro and Bloomberg, comprising firm-year observations from 2017 to 2022. The initial dataset included over 35,000 observations across multiple geographies and industries. Following the removal of duplicates, incomplete entries, and firms lacking consistent credit rating histories, the final panel consisted of 31,151 firm-year observations.
The sample ends in 2022 because more recent financial and rating data were still incomplete or pending validation at the time of collection. Using 2017–2022 ensures consistency, comparability, and reliability across firms and rating segments.
To ensure consistency and minimize reporting bias, we applied several data quality filters:
Only firms with publicly available financial statements and recognized credit ratings from S&P Global were retained;
All variables were cross-validated to eliminate entry errors and outliers;
Financial ratios were calculated using standardized formulas to avoid distortion from varying accounting practices.
Credit ratings were assigned based on the full set of 23 S&P Global notches (AAA to D), subsequently grouped into six broader rating segments—High Quality, BBB, BB, B, CCC, and High Risk + Default—to balance granularity and statistical power.
The binary target variable was set to 1 if the firm experienced a rating downgrade or default within the subsequent 12 months, and 0 otherwise. Rating upgrades were treated as non-events (Class 0), consistent with the study’s focus on identifying credit deterioration risks.
Table 1 summarizes the distribution of downgrade/default events (Class 1) and non-events (Class 0) across six credit rating segments. While the total sample size is held constant across segments (
N = 31,151), the proportion of Class 1 events varies significantly, reflecting heterogeneity in downgrade or default risk. To ensure comparability across segments and control for sample size effects, we created balanced subsamples of 31,151 firm-year observations for each rating category. This design allows for direct performance comparisons without bias introduced by unequal class distributions.
The High Risk + Default segment shows the lowest incidence of Class 1 events (0.64%), suggesting these ratings are typically assigned to already deteriorated credits with limited further room for rating decline. In contrast, the B segment presents the highest proportion of Class 1 cases (30.71%), highlighting it as a critical transition zone where firms are more prone to rating actions.
The BBB and BB segments exhibit moderate Class 1 proportions (27.39% and 20.21%, respectively), indicating elevated but still manageable risk levels relative to lower-rated entities. The High Quality segment displays a relatively low downgrade/default rate (12.97%), while the CCC segment’s lower Class 1 proportion (7.15%) may reflect a floor effect—entities at this level already face severe credit stress, limiting further deterioration within the rating scale.
This distribution highlights the need for segment-specific modeling, as downgrade and default probabilities vary substantially across the rating spectrum. Applying a one-size-fits-all approach may obscure meaningful risk patterns within each segment and ultimately reduce predictive performance.
3.2. Rating Grouping and Class Consolidation
Rating Grouping
In this study, we utilize the full range of S&P Global credit rating grades, comprising 23 ordered levels from AAA (indicating the highest creditworthiness) to D (default). Given the concentration of credit risk at the lower end of the spectrum, market participants often consolidate ratings of CCC+ and below into broader categories due to their elevated probability of default and financial distress.
To address issues of class imbalance—particularly relevant in ML applications—and to preserve interpretability, we consolidated the 23 rating levels into six broader segments. This grouping reduces data sparsity in rare rating categories and improves the statistical power of model training and segment-level analyses.
Table 2 presents the mapping of the original S&P credit ratings into these six consolidated classes:
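A hypothetical reconstruction of this consolidation in code is shown below. The exact notch boundaries follow Table 2, which is not reproduced here, so the composition of the High Quality and High Risk + Default groups is an assumption (the 23-notch scale is taken to include SD alongside D).

```python
# Assumed notch-to-segment mapping; the exact boundaries in Table 2 may differ.
SEGMENT_MAP = {
    **dict.fromkeys(["AAA", "AA+", "AA", "AA-", "A+", "A", "A-"], "High Quality"),
    **dict.fromkeys(["BBB+", "BBB", "BBB-"], "BBB"),
    **dict.fromkeys(["BB+", "BB", "BB-"], "BB"),
    **dict.fromkeys(["B+", "B", "B-"], "B"),
    **dict.fromkeys(["CCC+", "CCC", "CCC-"], "CCC"),
    **dict.fromkeys(["CC", "C", "SD", "D"], "High Risk + Default"),
}

def to_segment(notch: str) -> str:
    """Collapse a 23-level S&P notch into one of six modeling segments."""
    return SEGMENT_MAP[notch]
```

Consolidating this way trades notch-level granularity for denser classes, which is precisely the statistical-power motivation stated above.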
3.3. Data Preprocessing and Treatment
To prepare the dataset for ML modeling, several preprocessing steps were undertaken to ensure data integrity, address class imbalance, and enhance predictive performance.
First, all continuous numerical variables were standardized using the StandardScaler, which removes each variable’s mean and rescales it to unit variance so that predictors with larger numeric ranges do not dominate the analysis. Categorical variables, such as industry classification and management and governance indicators, were transformed via one-hot encoding to retain non-ordinal relationships.
To reduce redundancy and multicollinearity, highly correlated features (Pearson’s ρ > 0.9) were identified and removed based on correlation matrices and variance inflation factors (VIF). Missing values in financial ratio variables were imputed using industry-median values; records were excluded when variables exhibited a high proportion of missingness or could not be reliably imputed.
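These preprocessing steps can be sketched as follows. The column names and data are synthetic, and overall medians stand in for the industry medians used in the study; VIF screening is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "roa": rng.normal(size=200),
    "roe": rng.normal(size=200),
    "leverage": rng.normal(size=200),
    "industry": rng.choice(["mfg", "tech", "retail"], size=200),
})
df["roa_dup"] = df["roa"] * 1.01 + rng.normal(scale=0.01, size=200)  # near-duplicate
df.loc[rng.choice(200, 10, replace=False), "leverage"] = np.nan

# 1) Median imputation (the paper uses industry medians)
num_cols = ["roa", "roe", "leverage", "roa_dup"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 2) Drop one of each highly correlated pair (|rho| > 0.9)
corr = df[num_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
kept = [c for c in num_cols if c not in to_drop]

# 3) Standardize numeric features, one-hot encode categoricals
pre = ColumnTransformer([
    ("num", StandardScaler(), kept),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["industry"]),
])
X = pre.fit_transform(df)
```

Keeping the scaler and encoder inside a ColumnTransformer lets the same fitted transformation be applied to future (e.g., 2022 hold-out) data without re-fitting, which matters for the temporal validation described later.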
The dataset was further stratified by credit rating segment, ensuring that Class 1 (downgrade or default event) was defined within each segment to capture segment-specific credit risk. This stratification allowed for more context-aware binary classification modeling.
Given the inherent class imbalance—especially in segments such as High Risk + Default—SMOTE was applied post-train–test split to prevent information leakage [
34,
35].
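The leakage-avoiding order of operations can be illustrated with a minimal from-scratch SMOTE-style oversampler; the study would use a library implementation, and the interpolation scheme below is only a sketch of the core idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X, y, minority=1, k=5, seed=0):
    """Minimal SMOTE-style oversampling: each synthetic minority point is
    interpolated between a real minority point and one of its k nearest
    minority-class neighbors, until the classes are balanced."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = np.sum(y != minority) - len(X_min)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # col 0 is self
    gap = rng.random((n_new, 1))
    X_syn = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return np.vstack([X, X_syn]), np.concatenate([y, np.full(n_new, minority)])

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
# Split FIRST, then oversample only the training partition: the test set
# never sees synthetic points, so no information leaks across the split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = smote_oversample(X_tr, y_tr)
```

The crucial detail is the ordering: applying SMOTE before the split would let interpolations of test-set observations appear in training data, inflating measured performance.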
Two validation strategies were employed: (1) five-fold cross-validation for in-sample evaluation and hyperparameter tuning, and (2) temporal validation, wherein models were trained on observations from 2017 to 2021 and evaluated on 2022 data to approximate real-world generalization.
Finally, all models underwent hyperparameter optimization via RandomizedSearchCV, with performance evaluated primarily using the F1-Score and the Matthews Correlation Coefficient (MCC), a balanced measure of classification quality that is particularly suitable for imbalanced datasets. The parameter grids and tuning strategy for each model are detailed in
Section 3.7.
3.4. ML Models
We evaluated four supervised learning algorithms—LR, RF, XGBoost, and SVM—to address the challenge of predicting rare credit rating downgrades and defaults. These models were selected to reflect a range of trade-offs across interpretability, nonlinearity handling, and computational scalability, which are critical design considerations in credit risk modeling.
LR is a benchmark model widely adopted in credit scoring due to its interpretability and computational simplicity. It assumes a linear relationship between predictors and the log-odds of the target class. While its transparent coefficient structure aids explainability, it lacks the flexibility to capture nonlinear relationships commonly observed in distressed corporate environments. Recent applications in bankruptcy prediction show that automated model selection frameworks outperform baseline LR in predictive power while retaining transparency [
15].
RF is an ensemble-based model that aggregates multiple decision trees to reduce overfitting and improve generalization. Its use of feature bagging and bootstrap sampling enhances robustness in high-dimensional and noisy datasets—a common feature of financial and accounting data in credit risk. Comparative studies have demonstrated that RF achieves competitive performance in credit risk classification, particularly when combined with feature selection techniques [
18].
XGBoost builds on ensemble methods through sequential model optimization via gradient boosting. Its regularization mechanisms help mitigate overfitting while improving classification performance, especially for imbalanced data. In personal credit risk applications, XGBoost has outperformed traditional models in both accuracy and sensitivity to default-prone classes, underscoring its potential in financial domains involving rare-event prediction [
11].
SVM offers strong performance in high-dimensional spaces by constructing optimal hyperplanes that maximize class separation margins. However, its sensitivity to kernel selection and computational cost limits scalability in large datasets. Moreover, its performance tends to decline under severe class imbalance—an inherent feature in downgrade/default prediction. Empirical studies in the credit assessment of listed firms confirm that SVM may require hybrid approaches to remain competitive [
13].
3.5. Target Variables
The target variable is defined as binary: the value of 1 indicates that the firm experienced a rating downgrade or default within a 12-month horizon; a value of 0 denotes all other cases. A downgrade is operationalized as any transition to a lower rating category (e.g., from BBB to BB, or from BB to B), including default events. In contrast, upgrades—transitions to higher rating categories (e.g., from BB to BBB)—are considered non-events and thus classified as Class 0, consistent with the modeling objective of early identification of credit deterioration.
Although upgrades are grouped within Class 0 to maintain a binary classification structure, we acknowledge the potential for rating volatility, including downgrade-upgrade cycles within short time frames. In such cases, the downgrade event is prioritized if it occurs within the 12-month prediction window, as the analytical focus is on identifying early signs of credit deterioration, which hold greater relevance for risk management and regulatory purposes.
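The labeling rule, including the downgrade-priority convention for volatile histories, can be sketched as follows. The rating history, ordinal levels, and column names are hypothetical.

```python
import pandas as pd

# Hypothetical ordinal scale: a larger level means a worse rating.
RATING_ORDER = {"BBB": 3, "BB": 4, "B": 5, "CCC": 6, "D": 7}

hist = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2019-06-30", "2019-12-31", "2020-03-31",
                            "2019-06-30", "2021-06-30"]),
    "rating": ["BBB", "BB", "BBB", "BB", "BB"],
}).sort_values(["firm", "date"])
hist["level"] = hist["rating"].map(RATING_ORDER)

def label_observation(row, history, horizon_days=365):
    """Class 1 if ANY downgrade (move to a higher ordinal level) occurs
    within the next 12 months. Upgrades alone remain Class 0, and a
    downgrade inside the window takes priority over a later upgrade."""
    fw = history[(history["firm"] == row["firm"])
                 & (history["date"] > row["date"])
                 & (history["date"] <= row["date"] + pd.Timedelta(days=horizon_days))]
    return int((fw["level"] > row["level"]).any())

hist["target"] = hist.apply(label_observation, axis=1, history=hist)
```

In this toy history, firm A's first observation is labeled 1 because a downgrade to BB occurs within the window, even though the firm is later upgraded back to BBB, illustrating the downgrade-priority rule.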
3.6. Feature Engineering
The final model incorporates 46 features, comprising a diverse set of profitability ratios, leverage and coverage metrics, efficiency indicators, cash flow variables, balance sheet components, structural attributes, and qualitative governance factors. These variables were selected based on their theoretical relevance to credit risk and their empirical performance in prior studies.
Table 3 provides an overview of the variables used in model training, grouped into thematic categories along with a brief description or calculation formula for each.
This comprehensive feature set captures both quantitative financial health and qualitative risk dimensions, aligning with credit theory and enabling explainable modeling through SHAP analysis.
3.7. Methodology
This study evaluated the predictive performance of four supervised ML models—LR, RF, XGBoost, and SVM—to classify corporate credit rating transitions. All models were implemented in Python v3.10 using the scikit-learn v1.3.2 and XGBoost v1.7.6 libraries.
To optimize model performance, we conducted hyperparameter tuning using the RandomizedSearchCV function with a 5-fold stratified cross-validation strategy. The F1-score was selected as the primary metric to guide the search. The hyperparameter search space for each model was defined as follows:
3.7.1. XGBoost
n_estimators: [50, 100, 200];
max_depth: [3, 6, 9];
learning_rate: [0.01, 0.1, 0.2];
subsample: [0.7, 0.8, 0.9].
3.7.2. RF
n_estimators: [50, 100, 200];
max_depth: [10, 20, 30, None];
min_samples_split: [2, 5, 10];
min_samples_leaf: [1, 2, 4].
3.7.3. LR
C (inverse of regularization strength): [0.01, 0.1, 1, 10, 100];
max_iter: [500, 1000, 1500].
3.7.4. SVM
To address class imbalance, SMOTE was applied to the training set within each fold of cross-validation. This ensured that oversampling did not introduce information leakage from the test set. After hyperparameter optimization, the best-performing estimator for each model was further evaluated using a separate 5-fold cross-validation to confirm the robustness and generalizability of the results.
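The tuning procedure can be sketched with the RF grid from Section 3.7.2. The number of sampled candidates (n_iter) is not reported in the text, so the value below is a placeholder, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)

# RF search space as listed in Section 3.7.2
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=8,                      # placeholder; the study's value is unreported
    scoring="f1",                  # F1 guides the search, as stated above
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
best_model = search.best_estimator_
```

RandomizedSearchCV samples candidate configurations rather than exhausting the grid, which keeps the 5-fold tuning loop tractable across four models and six rating segments.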
3.8. Validation Approach
To ensure robust evaluation, we adopt two validation strategies: (i) 5-fold cross-validation and (ii) temporal validation, where models were trained using data from 2017 to 2021 (approximately 25,000 observations) and tested on a hold-out set from 2022 (6151 observations). This approach simulates real-world scenarios by assessing generalization to future periods, a critical step for credit risk prediction models with temporal dependencies.
The combination of both validation methods enhances the robustness and external validity of the results.
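The out-of-time split described above reduces to a simple year-based partition. The sketch below uses a synthetic panel; the column names `year` and `target` are assumptions, as the dataset's schema is not shown in the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "year": rng.integers(2017, 2023, size=500),   # 2017..2022 inclusive
    "x": rng.normal(size=500),
    "target": rng.integers(0, 2, size=500),
})

# Out-of-time split: fit on 2017-2021, evaluate on the held-out 2022 cohort.
train = panel[panel["year"] <= 2021]
test = panel[panel["year"] == 2022]
assert train["year"].max() < test["year"].min()  # no temporal overlap
```

Unlike random cross-validation, this split guarantees that every test observation postdates every training observation, mimicking how the model would actually be deployed.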
3.9. Evaluation Metrics
Performance is evaluated using Accuracy, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), Precision, Recall, F1-Score, and MCC.
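A minimal sketch of computing this metric set with scikit-learn follows, using a synthetic classifier and data; the study's actual pipeline is described in Section 3.7.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

metrics = {
    "Accuracy": accuracy_score(y_te, pred),
    "AUC-ROC": roc_auc_score(y_te, proba),   # needs scores, not hard labels
    "Precision": precision_score(y_te, pred),
    "Recall": recall_score(y_te, pred),
    "F1-Score": f1_score(y_te, pred),
    "MCC": matthews_corrcoef(y_te, pred),    # robust under class imbalance
}
```

Note that AUC-ROC is computed from predicted probabilities while the remaining metrics use thresholded labels, which is why both `predict_proba` and `predict` outputs are kept.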
4. Results
4.1. Comparative Performance by Rating Segment
As hypothesized in H1, ensemble models (RF and XGBoost) consistently outperformed traditional approaches (LR and SVM) across all rating segments (
Table 4).
This section evaluates the predictive performance of four supervised learning algorithms—LR, RF, XGBoost, and SVM—across six credit rating segments. Each model was assessed using a comprehensive set of classification metrics, including AUC-ROC, F1-Score, MCC, Precision, Recall, and computational efficiency. Together, these metrics capture both model efficacy and the trade-offs that arise in large-scale credit risk settings.
XGBoost demonstrated the highest overall performance and computational efficiency. In the High Quality segment, it achieved perfect predictive metrics (AUC-ROC = 1.000, F1-Score = 1.000, MCC = 1.000), completing the evaluation in just 6 s. In more challenging segments, such as CCC and High Risk + Default, XGBoost maintained strong generalization capacity (F1-Score = 0.987 in CCC). In the High Risk + Default segment, it attained perfect Precision and Recall (1.000), as shown in
Table 5, underscoring its robustness under class imbalance and noisy data conditions [
36]. Moreover, its integration with explainable AI techniques, such as SHAP values, supports transparent and interpretable decision-making in regulatory environments [
6,
37].
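The study's interpretability layer is SHAP; reproducing it verbatim would require the shap package, so the sketch below uses scikit-learn's permutation importance as a dependency-light stand-in that conveys the same global-attribution idea, ranking features by how much shuffling each one degrades out-of-sample AUC. Model and data are synthetic, and the substitution is an illustration, not the paper's method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stand-in for the boosted-tree model (the study uses XGBoost)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Global attribution: how much does permuting each feature degrade AUC?
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)
ranked = sorted(enumerate(result.importances_mean),
                key=lambda t: t[1], reverse=True)
```

SHAP additionally provides local, per-observation attributions (why this firm was flagged), which is what makes it attractive in regulatory settings; permutation importance captures only the global ranking shown here.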
RF also delivered strong results, particularly in the B segment (AUC-ROC = 0.999, F1 = 0.973, MCC = 0.960). While competitive with XGBoost across several segments, its training times were significantly longer—e.g., 52 s in the CCC segment. These findings align with ensemble benchmarking studies, which emphasize RF’s robustness and interpretability, albeit with higher computational costs [
38].
LR showed intermediate performance. While it performed adequately in investment-grade segments (e.g., F1 = 0.877 in High Quality), its effectiveness diminished in speculative categories. In the BB segment, F1 dropped to 0.548 and MCC to 0.424, suggesting that LR’s linear assumptions may limit its capacity to model complex nonlinearities present in distressed firms. These results are consistent with earlier critiques of traditional linear credit scoring models [
2].
SVM exhibited the lowest overall predictive performance. In the BB segment, it attained AUC-ROC = 0.714, F1 = 0.419, and MCC = 0.233, with training times of 6 s. Although performance was more stable in the High Quality segment (F1 = 0.912), SVM consistently lagged behind ensemble-based models. Its limited scalability and sensitivity to class imbalance reduce its applicability in operational credit risk settings [
38].
In summary, XGBoost emerges as the most suitable model for EWS in credit risk, especially where timely identification of deterioration is critical. RF remains a viable alternative, balancing performance and interpretability at the expense of longer training times. LR may still be favored in low-risk environments that prioritize transparency. SVM, however, is less suited to large-scale applications due to its lower performance and limited scalability.
Table 4 presents a comparative summary of classification metrics—AUC-ROC, F1-Score, MCC, Precision, Recall—and training times for each model across the six credit rating segments. This facilitates a detailed, segment-wise evaluation of predictive accuracy and computational efficiency among the supervised learning algorithms.
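The segment-wise metrics discussed above (AUC-ROC, F1-Score, MCC, Precision, Recall, and training time) can be computed with standard scikit-learn functions. The following is a minimal sketch on synthetic imbalanced data; the model set and data here are illustrative stand-ins, not the study's actual configuration.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_auc_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

# Synthetic imbalanced dataset standing in for one rating segment.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {"RF": RandomForestClassifier(random_state=0),
          "LR": LogisticRegression(max_iter=1000)}

for name, model in models.items():
    t0 = time.time()
    model.fit(X_tr, y_tr)
    elapsed = time.time() - t0
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)
    print(f"{name}: AUC={roc_auc_score(y_te, proba):.3f} "
          f"F1={f1_score(y_te, pred):.3f} "
          f"MCC={matthews_corrcoef(y_te, pred):.3f} "
          f"P={precision_score(y_te, pred):.3f} "
          f"R={recall_score(y_te, pred):.3f} "
          f"time={elapsed:.1f}s")
```

Repeating this loop once per rating segment (with segment-specific training data) yields a table of the same shape as the comparison described above.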
4.2. Confusion Matrices
Figure 1 presents the consolidated confusion matrices for the four evaluated ML models—LR, RF, XGBoost, and SVM—applied across six credit rating segments: High Quality, BBB, BB, B, CCC, and High Risk + Default. The primary focus of interpretation lies in the diagonal cells, which represent correctly classified instances where the predicted rating segment matches the true label. High values along the diagonal indicate effective classification and segmentation of credit quality.
Among the evaluated models, XGBoost demonstrates the most consistent performance, with diagonal accuracies exceeding 90% across nearly all segments. This result suggests that XGBoost effectively captures complex, non-linear relationships within financial and operational features, leading to more precise credit risk differentiation. RF also performs well, particularly in the High Quality and High Risk + Default segments. However, moderate confusion is observed in intermediate categories such as BBB and BB, indicating some misclassification between adjacent risk classes. These off-diagonal errors are particularly challenging to manage when class boundaries are ambiguous or overlapping, as often occurs in mid-tier ratings [
39].
In contrast, LR exhibits lower discriminative power, especially between BB and B, where misclassifications are more frequent. This limitation likely stems from its linear nature and reduced capacity to model intricate feature interactions. SVM shows the least reliable performance, with dispersed prediction patterns and lower diagonal values—particularly in the CCC and High Risk + Default segments—highlighting challenges in detecting distressed firms.
Overall,
Figure 1 illustrates the superior performance of tree-based ensemble models, particularly XGBoost, in credit risk classification tasks. These models achieve lower false positive and false negative rates, which is crucial for balancing sensitivity (correctly identifying deteriorating firms) and specificity (avoiding false alarms) in EWS.
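The diagonal reading of the confusion matrices above corresponds to a row-normalized confusion matrix, whose diagonal entries give per-class recall. A brief sketch on synthetic multi-class data, assuming nothing about the study's actual feature set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

# Three synthetic classes standing in for adjacent rating segments.
X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# normalize="true" scales each row to sum to 1, so the diagonal is the
# share of each true class that was predicted correctly.
cm = confusion_matrix(y_te, clf.predict(X_te), normalize="true")
diag = np.diag(cm)
print(np.round(cm, 3))
print("per-class accuracy:", np.round(diag, 3))
```

Off-diagonal mass between neighboring classes in this matrix is the analogue of the BBB/BB confusion noted above for mid-tier ratings.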
4.3. ROC Curve Analysis
The Receiver Operating Characteristic (ROC) curve is a standard diagnostic tool for evaluating the discriminatory power of classification models in credit risk analysis. It depicts the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) across different decision thresholds. The summary statistic, AUC-ROC, quantifies a model’s ability to distinguish between risky and safe firms, with values closer to 1 indicating stronger discriminative performance [
40].
In our study, XGBoost and RF consistently achieved AUC-ROC values equal to or approaching 1.000 across all six credit rating segments. In investment-grade segments (e.g., High Quality and BBB), both models demonstrated near-perfect separation. Even in speculative-grade categories such as BB and CCC, AUC-ROC values remained above 0.998. Conversely, LR exhibited deteriorating performance as credit risk intensified, with AUC declining to 0.826 in the BB segment. SVM presented the weakest results, particularly in the BB segment (AUC = 0.714) and the High Risk + Default group, where high error rates undermined its reliability.
To complement this analysis, we assessed TPR at fixed FPR thresholds (0.1, 0.2, and 0.5), offering practical insights into model sensitivity under varying false alarm tolerances. Assessing classifier sensitivity at pre-defined false positive levels has been widely used in screening applications to evaluate performance under real-world operational constraints [
41]. XGBoost and RF consistently maintained TPR values near 1.000 across all thresholds and segments. In contrast, LR and SVM underperformed, particularly in high-risk segments. For instance, in the BB segment at FPR = 0.1, LR achieved a TPR of 0.483, while SVM registered only 0.293, compared to 0.999 for XGBoost.
A detailed summary of AUC-ROC values, TPR at fixed FPR thresholds (0.1, 0.2, and 0.5), and Equal Error Rates (EER) for each model and rating segment is presented in
Table 5. These metrics provide a comprehensive view of each algorithm’s discriminative capacity and operational sensitivity across varying levels of credit risk.
We also report the EER—the point at which the FPR equals the False Negative Rate (FNR)—as an indicator of model calibration. Both XGBoost and RF exhibited minimal EER values (e.g., EER FPR < 0.01; EER TPR > 0.99), reinforcing their robustness in imbalanced credit risk settings, where accurately identifying minority-class (risky) firms is especially challenging [
42].
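The operating-point metrics reported above (TPR at fixed FPR tolerances, and the EER where FPR equals FNR) can be read off the ROC curve by interpolation. A minimal sketch, using synthetic scores in place of model output:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic imbalanced scores: 900 "safe" firms, 100 "risky" firms.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(900), np.ones(100)])
scores = np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)])

fpr, tpr, _ = roc_curve(y_true, scores)

# TPR at fixed false-alarm tolerances, via interpolation on the ROC curve
# (fpr is non-decreasing, as np.interp requires).
for target in (0.1, 0.2, 0.5):
    print(f"TPR@FPR={target}: {np.interp(target, fpr, tpr):.3f}")

# EER: the point where FPR equals FNR (= 1 - TPR).
fnr = 1 - tpr
idx = np.argmin(np.abs(fpr - fnr))
print(f"EER point: FPR={fpr[idx]:.3f}, TPR={tpr[idx]:.3f}")
```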
4.4. Precision–Recall Curve Analysis
The Precision–Recall (PR) Curve is a critical tool for evaluating classifier performance under class imbalance, which is inherent in the prediction of corporate rating transitions and defaults. Unlike the ROC Curve, which assesses global class separability, the PR Curve focuses on the minority class—firms experiencing downgrades or defaults—making it particularly relevant in financial applications where early identification of distress is essential [
43,
44].
Key indicators extracted from this analysis include:
Average Precision (AP): Reflects the area under the PR curve, summarizing the trade-off between precision and recall across all thresholds. Higher AP values indicate stronger performance in ranking positive cases early in the decision process;
Precision at Fixed Recall Levels (0.2, 0.5, 0.8): Demonstrates the model’s ability to retain precision as it seeks to identify a broader set of positive instances;
Maximum F1-Score: Represents the optimal balance between precision and recall, indicating the point of highest classification efficiency.
These metrics are summarized in
Table 6, providing a comparative basis for evaluating model robustness across different rating segments.
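The three indicators above (AP, precision at fixed recall levels, and maximum F1) are all derivable from the PR curve. A hedged sketch with synthetic scores standing in for the study's classifiers:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Rare-positive setting: 50 distressed firms out of 1000.
rng = np.random.default_rng(1)
y_true = np.concatenate([np.zeros(950), np.ones(50)])
scores = np.concatenate([rng.normal(0, 1, 950), rng.normal(1.5, 1, 50)])

precision, recall, _ = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)

# precision_recall_curve returns recall in decreasing order, so reverse
# both arrays before interpolating at fixed recall levels.
for r in (0.2, 0.5, 0.8):
    p_at_r = np.interp(r, recall[::-1], precision[::-1])
    print(f"Precision@Recall={r}: {p_at_r:.3f}")

# Maximum F1 over all thresholds (clip guards against 0/0 at the ends).
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
print(f"AP={ap:.3f}, max F1={f1.max():.3f}")
```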
4.4.1. LR: Stable in Investment-Grade, Limited in Speculative Segments
LR displayed consistent and acceptable performance in investment-grade categories. For instance, AP values reached 0.911 for High Quality and 0.850 for BBB, indicating reliable performance where class imbalance is less severe. However, its precision deteriorated in lower-rated segments, particularly at higher recall levels:
In the BB segment, AP declined to 0.505, with a significant drop in precision at recall = 0.5538;
In CCC and High Risk + Default categories, AP values fell to 0.661 and 0.685, respectively, with maximum F1-scores below 0.732.
These results suggest that LR struggles to maintain predictive quality when required to identify a broader population of risky firms, thus limiting its applicability in early-warning or proactive risk surveillance contexts.
4.4.2. SVM: Inconsistent Performance and Poor Precision–Recall Trade-Offs
SVM presented the most variable performance among all models. While it achieved competitive AP values in isolated cases—0.922 for High Quality and 0.843 for B—the model underperformed in more speculative segments:
In the BB segment, SVM recorded an AP of only 0.376 and a maximum F1-score of 0.448;
In the High Risk + Default category, AP dropped to 0.588, with an F1-score of 0.636.
Precision at recall = 0.8 remained consistently low across these segments, revealing the model’s vulnerability when tasked with identifying a larger set of at-risk firms. These weaknesses undermine the model’s reliability for credit risk applications that demand high sensitivity and early detection.
4.4.3. Tree-Based Models: Superior Precision–Recall Trade-Offs Across All Segments
Tree-based models—XGBoost and RF—consistently outperformed other classifiers across all segments, particularly in scenarios of heightened credit risk. Both models maintained high precision even at elevated recall levels, highlighting their capacity to detect risky firms without incurring excessive false positives:
In the CCC and High Risk + Default segments, XGBoost achieved AP values near 1.000, with precision at recall = 0.8 exceeding 0.900;
RF also demonstrated robust results, with AP values above 0.990 in all segments and competitive F1-scores.
These findings align with previous studies emphasizing the effectiveness of PR Curve-based evaluation for tree-based classifiers under class imbalance [
43] and reinforce the importance of such metrics in EWS for credit risk [
44]. Their favorable trade-offs between precision and recall—especially under imbalance—highlight their practical relevance for dynamic credit surveillance and stress testing frameworks.
The PR-based findings are consistent with the ROC-based results presented in
Section 4.3, reinforcing the models’ generalizability and resilience under multiple validation schemes.
4.5. Temporal Validation (Out-of-Time Testing)
Consistent with H2, segment-specific modeling by rating band delivered higher predictive performance than aggregated analysis, underscoring the value of stratified approaches in credit risk (
Table 4 and
Table 7).
To evaluate the temporal robustness of the models, we implemented an out-of-time validation procedure. All models were trained using data from 2017 through 2021 and tested exclusively on firm-year observations from 2022. This approach replicates a realistic forecasting scenario, in which credit risk models are applied to future data unseen during model development. The objective was to assess each model’s generalization capacity under temporal distributional shifts, which is critical in the context of volatile economic and credit environments.
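The out-of-time protocol reduces to a simple chronological split: fit on 2017–2021 firm-years, score only 2022. The sketch below uses a synthetic panel; the column names (`year`, `default_flag`, and the two ratio features) are illustrative assumptions, not the study's schema.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score

# Synthetic firm-year panel for 2017-2022.
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "year": rng.integers(2017, 2023, n),       # 2017..2022 inclusive
    "liquidity": rng.normal(0, 1, n),
    "coverage": rng.normal(0, 1, n),
})
# Label loosely tied to the features, for illustration only.
df["default_flag"] = (df["liquidity"] + df["coverage"]
                      + rng.normal(0, 1, n) < -1.5).astype(int)

features = ["liquidity", "coverage"]
train = df[df["year"] <= 2021]                 # in-time training window
test = df[df["year"] == 2022]                  # unseen holdout year

clf = RandomForestClassifier(random_state=0)
clf.fit(train[features], train["default_flag"])
proba = clf.predict_proba(test[features])[:, 1]
pred = (proba >= 0.5).astype(int)
print(f"2022 AUC-ROC: {roc_auc_score(test['default_flag'], proba):.3f}")
print(f"2022 F1:      {f1_score(test['default_flag'], pred):.3f}")
```

Comparing these holdout metrics against the in-time cross-validated figures gives the degradation measure discussed in the subsections below.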
4.5.1. XGBoost: Strongest Temporal Generalization
XGBoost consistently exhibited superior generalization performance when applied to the 2022 dataset.
Robustness Across Segments: Across all credit risk categories, XGBoost preserved near-perfect performance, with AUC-ROC values exceeding 0.999 and F1-Scores above 0.987;
High-Risk Segment Performance: In the “High Risk + Default” segment, the model achieved an AUC-ROC and F1-Score of 1.000, identical to the in-time evaluation. This confirms its exceptional capacity to identify high-risk firms with complete accuracy, even under temporal shifts;
Stability Analysis: The comparison between in-time and out-of-time performance revealed negligible degradation, reinforcing the model’s robustness for use in dynamic credit risk settings where data patterns evolve over time.
4.5.2. RF: Slight Degradation, Yet High Reliability
RF showed mild, yet measurable, performance decline in the temporal test, although it remained a highly reliable model overall.
Segment-Level Degradation: In the BB segment, the F1-Score declined from 0.965 to 0.954, and AUC-ROC decreased from 0.999 to 0.997. Despite this reduction, the model continued to deliver robust predictions;
Consistent in Low-Risk Segments: In the “High Quality” and “BBB” categories, F1-Scores remained above 0.979 and AUC-ROC values above 0.999, suggesting minimal sensitivity to temporal drift in stable firms;
Implications: While slightly less stable than XGBoost, RF remains a strong candidate for temporal credit forecasting, especially when model interpretability and resource efficiency are prioritized.
4.5.3. LR: Stable but Limited Predictive Capability
LR maintained stable performance between in-time and out-of-time evaluations; however, its overall predictive capability remained inferior, particularly in high-risk segments.
Predictive Ceiling: In the BB segment, the F1-Score increased modestly from 0.548 to 0.556, with AUC-ROC improving from 0.826 to 0.830. Despite stability, these values are significantly below those of tree-based models, limiting the model’s practical utility;
High-Risk Segment Performance: The “High Risk + Default” category yielded an F1-Score of 0.533, slightly higher than the in-time score of 0.523. Nonetheless, this level of performance falls short of the accuracy required for proactive risk management;
Structural Constraints: The model’s linear assumptions hinder its ability to capture complex patterns inherent in credit risk data, particularly in speculative-grade segments where firm characteristics are more heterogeneous and volatile.
4.5.4. SVM: Marked Instability over Time
SVM demonstrated the most significant performance deterioration in the temporal validation, confirming its limited applicability in credit forecasting contexts.
Sharp Declines: In the BB segment, the F1-Score declined from 0.474 to 0.357, while AUC-ROC dropped sharply from 0.749 to 0.422, indicating a substantial loss of discriminatory power;
Performance in CCC Segment: The CCC segment also showed substantial degradation, with the F1-Score falling from 0.636 to 0.283—a reduction of over 50% in predictive efficacy;
Conclusion: SVM exhibited high sensitivity to temporal changes, particularly in segments with greater class imbalance or non-linear risk patterns. Its instability and computational inefficiency suggest that it is not well-suited for credit risk applications where model robustness over time is essential.
Table 7 summarizes the temporal validation results (2022 test set) for all four models—XGBoost, RF, LR, and SVM—across six credit risk segments. Metrics include AUC-ROC, Accuracy, F1-Score, MCC, Precision, and Recall.
4.6. Summary of Model Findings
To consolidate the main findings of this study,
Table 8 provides a comparative overview of the four models analyzed—XGBoost, RF, LR, and SVM—highlighting their performance across key dimensions: AP, temporal robustness, F1-score range, FPR in risk segments, and recommended use cases.
This comparative summary confirms the superior performance of XGBoost, which consistently achieved perfect or near-perfect results across all metrics (e.g., AUC = 1.000, AP = 1.000), along with strong generalization in the temporal holdout. Its main limitation remains moderate interpretability, a known trade-off in complex ensemble models.
RF performed nearly as well, with slightly lower precision but improved interpretability and greater operational simplicity. This makes it a viable alternative in contexts where transparency or computational resources are limiting factors.
LR, while interpretable and fast to deploy, underperformed in speculative-grade and high-risk segments. Its application should be limited to stable portfolios or as a benchmark in model comparisons.
SVM demonstrated the weakest results, exhibiting low temporal generalization, wide variability in performance, and high FPR in critical segments. These issues render it unsuitable for practical use in financial credit risk classification systems.
In summary, this synthesis emphasizes that beyond predictive accuracy, robustness, interpretability, and deployment context are essential considerations when selecting models for real-world credit risk applications.
5. Explainability: SHAP Analysis
Supporting H3, SHAP results reveal heterogeneous determinants of rating transitions across rating categories, with different sets of features driving outcomes in investment-grade versus speculative-grade firms (
Figure 2 and
Figure 3).
Overall SHAP Framework
In line with H5, governance quality and capital structure were dominant explanatory factors among higher-rated entities, consistent with prior evidence on rating stability (
Figure 3).
Aligned with H4, liquidity and coverage variables exerted the strongest influence in lower-rated firms, particularly in the CCC and High Risk + Default segments (
Figure 3).
To enhance the interpretability of the ML models, we applied SHAP, a model-agnostic technique that attributes prediction outcomes to individual input variables. SHAP enables both global and segment-level analyses by quantifying the marginal contribution of each feature to the model’s output. This capability is essential for understanding and validating automated credit risk assessments, particularly in regulated environments where transparency is critical.
SHAP has been successfully adopted across multiple high-stakes domains, including geological classification [
45], healthcare diagnostics [
46], and photovoltaic forecasting [
47]. These applications reinforce its relevance in settings that demand explainability, robustness, and domain-specific interpretability—characteristics that align closely with the requirements of credit risk modeling.
Figure 2 presents the consolidated SHAP summary plots by credit rating segment, revealing the most influential predictors of rating transitions and defaults.
Across all segments, key variables include management and governance quality, industry classification, liquidity, fixed charge coverage, and operating efficiency. These features are not only statistically important but also economically intuitive, reflecting established principles in credit evaluation.
In lower-rated segments—specifically CCC and High Risk + Default—liquidity emerges as a dominant variable, highlighting its relevance in assessing short-term solvency and distress risk. Conversely, higher-rated segments such as High Quality and BBB place greater weight on structural and strategic attributes, including management quality, industry exposure, and scale or diversification, which reflect the importance of forward-looking risk drivers among financially sound entities.
Profitability volatility, measured via earnings fluctuations, demonstrates consistent relevance across segments, particularly in BB and B categories. This underscores the role of earnings stability in signaling future rating transitions.
These findings suggest that the ML models dynamically adjust their explanatory basis based on the firm’s credit quality. Rather than relying on a static set of predictors, the algorithms prioritize different variables depending on the risk profile, demonstrating adaptability to the heterogeneity of real-world credit environments.
Figure 3 further illustrates this pattern, summarizing the top features across rating segments.
While variables like liquidity, operating efficiency, and governance consistently appear among the most impactful, their relative influence shifts: short-term solvency and earnings risk dominate in distressed segments, whereas industry dynamics and firm-level strategic characteristics prevail in investment-grade categories. These results align with domain knowledge and regulatory practices.
In conclusion, the SHAP-based interpretability framework confirms that the ML models offer not only strong predictive performance but also economic and theoretical coherence. Their decisions are grounded in well-established financial indicators, enhancing auditability, transparency, and trust—essential characteristics for the practical adoption of AI systems in credit risk management.
6. Discussion
To ensure coherence between the theoretical foundations, the research design, and the empirical findings, we evaluated the five hypotheses formulated in
Section 2.6. Each hypothesis was tested against the results presented in
Section 4, covering both predictive performance and feature interpretability across rating segments.
Table 9 summarizes the verification outcomes, indicating whether each hypothesis was supported by the evidence.
The empirical evidence provides comprehensive support for all five hypotheses, emphasizing the significance of this study’s contributions. The superior performance of ensemble models demonstrates the capacity of advanced ML techniques to model nonlinear relationships and complex interdependencies within credit risk data. The benefits of segment-specific analysis reveal the heterogeneous dynamics underlying different rating categories, underscoring the necessity of disaggregated approaches. Moreover, the distinct roles of liquidity, coverage, and governance variables illustrate the value of interpretable methods such as SHAP in uncovering the economic rationale for rating transitions. Taken together, these findings highlight that predictive accuracy and interpretability can be jointly pursued, offering a more transparent and practically relevant framework for credit risk assessment and management.
6.1. Strengths of Tree-Based Models
The results indicate that tree-based models, particularly XGBoost and RF, outperform traditional classifiers in predicting rating transitions and corporate defaults. These models achieved exceptional predictive performance across all rating segments, consistently posting AUC-ROC and F1-Scores near or equal to 1.000. Their robustness was further confirmed in temporal validation, where XGBoost maintained high predictive power in 2022 out-of-sample data, demonstrating strong generalization capacity.
Both models also showed resilience to class imbalance. In minority-class segments such as “High Risk + Default” and “CCC,” XGBoost achieved high recall and precision simultaneously—a task where traditional models often struggle—due to its capacity to model complex, non-linear relationships among financial and governance indicators.
From an operational standpoint, RF delivered competitive performance with slightly greater interpretability and reduced computational demands, suggesting its suitability for deployment in resource-constrained environments. SHAP analysis confirmed that model predictions were driven by meaningful variables (e.g., liquidity, fixed charge coverage, management and governance quality, and operating efficiency), reinforcing their reliability and explainability [
48,
49].
As detailed in
Table 9, XGBoost consistently achieved cross-validated F1-Scores above 0.986 and MCC values near 1.000 across all rating segments. RF also demonstrated solid generalization, with F1-Scores and MCCs typically exceeding 0.950. These findings support the efficacy of ensemble-based models in predictive credit risk analytics and their viability for real-world application in financial decision-making.
6.2. Limitations of Linear and SVM Models
While LR produced adequate results in high-quality and investment-grade segments, its performance declined substantially in lower-rated classes. Its linear assumptions limit the model’s ability to capture the complex, nonlinear relationships often exhibited by financially deteriorating firms. For instance, in the BB segment, LR yielded a cross-validated F1-Score of 0.544 and an MCC of 0.420 (
Table 9), limiting its practical applicability in EWS and proactive risk management.
In contrast, SVM demonstrated the most inconsistent outcomes among the evaluated models. Despite employing kernel methods and regularization, SVM underperformed across most evaluation metrics, particularly in segments characterized by data imbalance. It exhibited elevated FPR, longer computational times, and a high sensitivity to hyperparameter settings. In the BB segment, for example, it recorded a CV F1-Score of just 0.353 and an MCC of 0.106 (
Table 9), highlighting its instability in this context.
These limitations underscore the importance of utilizing models that can effectively manage high-dimensional, nonlinear, and imbalanced data—features that are intrinsic to corporate credit risk datasets.
6.3. Model Trade-Offs: Performance vs. Interpretability
Although XGBoost outperformed other models in terms of accuracy and generalization, it did so at the expense of interpretability. Its complexity—stemming from multiple decision trees and iterative boosting—hinders direct human understanding. In contrast, RF, while slightly less accurate, provides greater model transparency due to its ensemble averaging mechanism and reduced sensitivity to overfitting.
LR continues to serve as a benchmark for interpretability, as its coefficients can be directly associated with economic reasoning. This makes it particularly suitable for explainability-focused contexts, such as regulatory oversight [
6,
37]. However, its simplicity also constrains its performance in more complex or volatile rating segments.
Table 10 further illustrates this trade-off: LR maintained stable cross-validated F1-Scores in investment-grade segments (e.g., 0.947 in High Quality and 0.830 in BBB), while declining noticeably across speculative-grade classes.
The application of SHAP values helps mitigate the trade-off between predictive performance and interpretability by providing post hoc explanations of model behavior. SHAP allows complex models such as XGBoost to be interpreted at both global and local levels, offering transparency into how individual features influence predictions. Although originally applied in geospatial and environmental modeling [
48,
49], this interpretability framework is equally valuable in financial applications, where model transparency supports more responsible and explainable decision-making.
6.4. Practical Implications for Credit Risk Monitoring
These findings have important implications for financial institutions and regulators. Tree-based models can be effectively integrated into credit monitoring systems as early warning tools capable of flagging firms at considerable risk of downgrade or default. Their capacity to identify rare yet impactful events with low error rates is essential for mitigating systemic risk.
XGBoost, with its superior predictive accuracy, is particularly suitable for high-stakes applications where false negatives must be minimized. In contrast, RF offers a robust alternative when computational efficiency and interpretability are prioritized. LR, despite its limitations, may still be appropriate for low-risk portfolios or as a benchmark model due to its transparency and simplicity.
Furthermore, the use of temporal validation underscores these models’ potential to remain effective under changing macroeconomic conditions—an essential feature in volatile financial environments. This adaptability supports their application in dynamic credit risk models, stress testing frameworks, and real-time surveillance systems.
6.5. Limitations and Future Work
Despite rigorous validation, several limitations should be acknowledged. First, the dataset used in this study originates from a single institutional source, which may limit the external validity and generalizability of the findings. Future research could incorporate data from multiple rating agencies or international sources to capture broader market heterogeneity.
Second, the analysis relies exclusively on structured numerical and categorical data. Integrating unstructured sources—such as management commentary, news sentiment, and analyst reports, along with forward-looking indicators (e.g., CDS spreads or market-implied signals), may enhance predictive accuracy and contextual depth.
Third, while SHAP was effectively applied to interpret model outputs, future work could explore methods that embed explainability directly into the training process. Techniques such as explainability-aware objectives or inherently interpretable architectures could improve transparency without compromising performance. In addition, hybrid models that combine traditional credit scoring with ML, as well as semi-supervised or transfer learning approaches to exploit sparse or unlabeled data, present promising directions for advancing model robustness.
Finally, although temporal validation was conducted using a hold-out year (2022), extending the validation across longer horizons and multiple economic cycles would further support the model’s reliability. Evaluating performance across diverse market regimes, industries, and geographic regions will be critical to confirm the stability and adaptability of these models in dynamic financial environments.
To mitigate overfitting and enhance generalizability, several safeguards were implemented. Hyperparameter tuning was conducted using stratified cross-validation and independent test sets. Temporal validation demonstrated that models trained on 2017–2021 data maintained high performance when applied to 2022 observations, suggesting the capture of generalizable patterns. Furthermore, SHAP analysis across folds and rating segments revealed consistent variable importance, reinforcing the models’ stability and resilience to noise and sampling variation.
7. Conclusions
This study proposed a hybrid ML framework to explain and predict corporate credit rating transitions and defaults. By integrating advanced classification algorithms—XGBoost, RF, LR, and SVM—with segment-specific modeling, temporal validation, and SHAP-based explainability, the research offers a robust and interpretable approach to credit risk assessment.
Empirical results confirm that tree-based ensemble models, particularly XGBoost, outperform traditional linear models and SVM across all rating segments and under various validation schemes. These models demonstrated superior predictive accuracy, resilience to class imbalance, and stability in out-of-time testing. Importantly, the use of SHAP values enabled transparent interpretation of key risk drivers, tailored to each credit rating segment, thereby enhancing model trustworthiness in regulatory and operational settings.
The findings contribute to both academic literature and industry practice by addressing several known gaps: (i) the lack of segment-level model stratification, (ii) the underuse of temporal validation in credit risk studies, (iii) the need for model interpretability in regulated environments, and (iv) the practical integration of explainability into EWS. The results suggest that ML-based frameworks—when properly tuned and explained—can support more proactive and transparent credit surveillance.
These contributions, however, are bounded by several assumptions and limitations. The analysis assumes that the 2017–2022 period adequately reflects the dynamics of rating transitions, even though more recent shocks may alter risk patterns. It also assumes that the chosen financial and governance variables sufficiently capture rating determinants, leaving aside unstructured disclosures, macro indicators, and market-based signals that may enrich predictive performance. Methodologically, the study assumes that the selected ML algorithms and tuning strategy fairly represent the performance–interpretability trade-off, though alternative approaches may yield different outcomes. Finally, the segmentation by rating bands is assumed to be the most relevant dimension of heterogeneity, even though industry or regional factors could provide complementary insights.
Future research should explicitly test these assumptions by extending datasets beyond 2022, incorporating qualitative and forward-looking variables, and experimenting with hybrid models that embed interpretability directly into the learning process. Such efforts would not only validate the robustness of the present framework but also enhance its adaptability to systemic shocks and evolving regulatory expectations.
Overall, this research demonstrates that integrating ML with domain-relevant segmentation and explainable AI tools can significantly improve the effectiveness, transparency, and adaptability of credit risk modeling. As financial markets evolve and regulatory scrutiny increases, such frameworks are likely to become indispensable for institutions seeking to strengthen risk governance and decision-making under uncertainty.