Article

From Binary Scores to Risk Tiers: An Interpretable Hybrid Stacking Model for Multi-Class Loan Default Prediction

1 School of Economics and Management, Dalian University of Technology, Dalian 116024, China
2 College of Management, Sichuan Agricultural University, Yaan 625014, China
* Author to whom correspondence should be addressed.
Systems 2026, 14(1), 78; https://doi.org/10.3390/systems14010078
Submission received: 9 December 2025 / Revised: 5 January 2026 / Accepted: 8 January 2026 / Published: 11 January 2026
(This article belongs to the Section Artificial Intelligence and Digital Systems Engineering)

Abstract

Accurate credit risk assessment for small firms and farmers is crucial for financial stability and inclusion; however, many models still rely on binary default labels, overlooking the continuum of borrower vulnerability. To address this, we propose Transformer–LightGBM–Stacked Logistic Regression (TL-StackLR), a hybrid stacking framework for multi-class loan default prediction. The framework combines three learners: a Feature Tokenizer Transformer (FT-Transformer) for feature interactions, LightGBM for non-linear pattern recognition, and a stacked LR meta-learner for calibrated probability fusion. We transform binary labels into three risk tiers, Low, Medium, and High, based on quantile-based stratification of default probabilities, aligning the model with real-world risk management. Evaluated on datasets from 3045 firms and 2044 farmers in China, TL-StackLR achieves state-of-the-art ROC-AUC scores of 0.986 (firms) and 0.972 (farmers), with superior calibration and discrimination across all risk classes, outperforming all standalone and partial-hybrid benchmarks. The framework provides SHapley Additive exPlanations (SHAP) interpretability, showing how key risk drivers, such as income, industry experience, and mortgage score for firms and loan purpose, Engel coefficient, and income for farmers, influence risk tiers. This transparency transforms TL-StackLR into a decision-support tool, enabling targeted interventions for inclusive lending, thus offering a practical foundation for equitable credit risk management.

1. Introduction

Loan default remains one of the most critical challenges confronting financial institutions, especially in agricultural and enterprise lending markets, where borrowers frequently operate under economic uncertainty, resource constraints, and volatile market conditions [1]. A loan default occurs when a borrower fails to meet contractual repayment obligations, thereby generating severe credit losses and creating systemic instability across lending portfolios. Accurately predicting default risk has therefore become essential to protect financial resilience and to support the sustainability of credit markets. However, most credit-scoring frameworks continue to rely on binary classification, categorizing borrowers as either default or non-default, even though real-world borrower behavior rarely fits into such a rigid dichotomy. This simplification obscures the nuanced risk spectrum of borrowers whose repayment capacity varies across low-, medium-, and high-risk profiles [2].
Binary classification, although widely adopted due to its simplicity, often fails to represent the heterogeneous nature of credit risk accurately. A binary outcome masks key distinctions between marginally risky borrowers and severely risky borrowers, thereby reducing lenders’ ability to allocate capital efficiently or design targeted monitoring strategies. In practice, a borrower who is slightly behind on payments but ultimately recovers should not be treated the same as one on the verge of insolvency. Yet binary models force this equivalence, leading to misclassification, excessive false positives, and costly credit rationing. These limitations have motivated an emerging research direction toward multi-class classification, which assigns borrowers to several risk categories based on a continuum of default severity [3]. Multi-class classification reflects real operational decisions in banking, where risk-based pricing, differentiated collateral requirements, and tailored intervention strategies all depend on a more granular understanding of borrower behavior [4].
Financial institutions do not operate under such binary logic. In practice, lenders operate with multi-tier risk systems that categorize borrowers into low-, medium-, and high-risk groups based on their probability of default [5]. These risk tiers underpin crucial operational decisions, such as differentiated loan monitoring, repayment restructuring, credit-line management, and targeted support programs for vulnerable borrowers. Therefore, a multi-class classification paradigm aligns more closely with real-world decision-making processes. The need for this shift becomes even more compelling when observing the distinct repayment behaviors of firms and farmers. Farmers often face climate-related risks, production fluctuations, and market volatility, whereas firms may be influenced by cash flow cycles, capital structure, and sectoral conditions. These structural differences make a coarse binary framework insufficient for representing borrower heterogeneity [6]. Moreover, the severity of default also matters: to illustrate, a borrower who defaults on a minor amount differs substantially from one who defaults on a large outstanding balance [7], yet binary classification masks this entire spectrum.
Despite its practical relevance, the adoption of multi-class credit risk modeling remains limited. Prior studies continue to use binary default prediction or rely on controlled datasets that do not reflect the complexity of real borrowers, including small firms and farmers. For instance, agricultural credit studies have applied fuzzy and clustering approaches to evaluate farmer creditworthiness, yet without developing a systematic multi-class framework [8]. Likewise, small-firm credit scoring work improves binary prediction using techniques such as SMOTE and random forests, but does not extend toward multi-class risk categorization [9,10]. Importantly, no existing studies have explored multi-class modeling under the severe class imbalance typical in real lending, such as the firm dataset with 3045 observations and only 50 defaults or the farmer dataset with 2044 observations and 228 defaults. These conditions mirror the realities of SME and agricultural credit markets, where defaults are rare but highly consequential. Moreover, current approaches provide little guidance on transforming binary ground truth into stable and interpretable multi-class credit risk labels, leaving a key methodological gap for practitioners.
Traditional statistical approaches, such as logistic regression (LR) and discriminant analysis, have long been used for default risk due to their interpretability and regulatory acceptance [11]. However, these models often assume linear relationships and struggle to capture complex borrower behavior affected by nonlinear interactions among financial, demographic, and behavioral attributes [12,13]. Although useful for estimating average effects, their restrictive assumptions limit predictive performance, especially in multi-class settings where risk boundaries may be irregular and nonlinear.
Machine learning (ML) models have emerged to address the challenge of nonlinearity. Techniques such as random forests (RF), support vector machines (SVM), and gradient boosting have shown remarkable predictive performance by capturing intricate data structures [14]. However, these models have their own drawbacks. They can function as opaque “black boxes,” limiting trust among stakeholders who require transparent explanations, especially in regulated domains like credit allocation. ML models also tend to struggle with calibration in the presence of class imbalance and may produce unstable probability estimates when extended to multi-class tasks [15]. Additionally, many studies apply ML only to homogeneous groups, reducing their real-world generalizability to mixed borrower portfolios.
Deep learning (DL) has recently attracted attention for credit risk analysis due to its ability to learn hierarchical representations and model highly complex nonlinear relationships [16]. However, traditional DL models, such as a multilayer perceptron, are not inherently optimized for structured tabular data, which remains the dominant format in credit databases [17]. Attention-based architectures, such as the Feature Tokenizer Transformer (FT-Transformer), offer a more compelling alternative by explicitly addressing the challenges of mixed data types, feature sparsity, and non-additive interactions [18]. Nonetheless, DL faces challenges of its own, including the need for large datasets, computational intensity, and difficulty in interpretability, issues that are particularly problematic in financial environments that require transparency and auditability [19]. Furthermore, many DL studies have predominantly focused on binary classification and have rarely aligned their architectures with interpretability requirements demanded by financial regulators [20].
Hybrid and ensemble approaches attempt to combine the strengths of traditional models, ML algorithms, and DL architectures. Prior studies have merged traditional statistical techniques with ML or combined ML with DL to improve predictive accuracy. However, most hybrid models in the literature emphasize performance enhancement without addressing deeper methodological limitations, such as how to integrate tabular-specific DL architectures, how to stabilize multi-class predictions under severe imbalance, or how to explain the hybrid model’s decision process using modern interpretability frameworks such as SHapley Additive exPlanations (SHAP). Moreover, existing hybrid models do not combine all three methodological paradigms, traditional, ML, and DL, within a unified and leakage-proof framework that aligns predictive accuracy, interpretability, and operational relevance.
These observations reveal a clear research gap: the absence of a cohesive, interpretable, and rigorously validated multi-class credit risk framework that integrates traditional models, ML, and DL within a leakage-proof architecture. Existing studies have not addressed how to systematically transform binary labels into empirically calibrated multi-class risk groups, how to integrate tabular-specific DL architectures like FT-Transformer with gradient boosting models and classical statistical tools, or how to provide class-specific interpretability using SHAP. This gap has practical implications for risk management, credit allocation, borrower inclusion, and financial system stability.
To this end, we propose a novel hybrid stacking framework for multi-class loan default prediction, referred to as TL-StackLR (Transformer–LightGBM–Stacked Logistic Regression), which integrates FT-Transformer, LightGBM, and a stacked LR meta-learner to classify borrowers into low-, medium-, and high-default-risk categories using a robust quantile-based labeling strategy. TL-StackLR integrates three complementary learners: (i) FT-Transformer, a DL architecture designed for tabular data that captures complex, high-order feature interactions; (ii) LightGBM, a highly efficient gradient boosting model well-suited to non-linear patterns and heterogeneous features; and (iii) LR as an interpretable meta-learner that fuses base predictions into calibrated class probabilities. Conventional DL models often underperform on tabular credit data due to their reliance on large sample sizes, sensitivity to feature scaling, and vulnerability to irregular distributions. At the same time, many traditional ML methods suffer from restrictive assumptions, limited flexibility, or poor generalization under class imbalance. TL-StackLR overcomes these limitations by synergistically combining representation power, predictive efficiency, and interpretability within a rigorously validated architecture trained via a strict Out-Of-Fold (OOF) procedure to prevent data leakage. To ensure transparency, we employ SHAP-based interpretability to generate both global and class-specific explanations, revealing how individual features differentially influence each risk tier. The resulting framework delivers not only high predictive performance but also actionable, auditable insights, equipping financial institutions and policymakers with a reliable and responsible decision-support tool for credit risk management.
The contributions of this study are threefold. First, we propose a data-driven multiclass stratification framework that translates binary default information into meaningful and empirically grounded risk categories aligned with financial practice. Second, we design an innovative hybrid stacking architecture, TL-StackLR, that synergizes the complementary strengths of DL, ML, and traditional statistical modeling while ensuring robustness through leakage-proof training. Third, we deliver comprehensive class-specific SHAP interpretability, providing insights into how risk drivers differ across low-, medium-, and high-risk borrowers, thus enabling targeted and data-informed credit strategies.
The remainder of this paper is organized as follows: Section 2 reviews the relevant literature, Section 3 details our methodology and data, Section 4 presents the experimental results and analysis, and Section 5 concludes with a discussion of implications, limitations, and avenues for future research.

2. Literature Review

A binary classification paradigm has long dominated credit risk modeling, a perspective fundamentally shaped by seminal works such as the Z-score model [21]. This dichotomous framework, which categorizes borrowers simply as default or non-default, became deeply embedded in both academic literature and banking practice due to its interpretability and alignment with early regulatory accords [22]. However, this binary lens offers a reductive view of borrower risk. It collapses the complex, continuous spectrum of financial health into a simplistic outcome, thereby failing to account for the material heterogeneity within borrower populations. This simplification is particularly problematic in mixed portfolios containing, for example, both stable firms and vulnerable smallholder farmers, as it masks critical nuances essential for modern, proactive risk management [23].
The limitations of these traditional, often linear models spurred a significant evolution towards ML. Ensemble methods, including RF and advanced gradient boosting frameworks such as XGBoost and LightGBM, demonstrated a superior capacity to capture non-linear relationships and complex feature interactions, consistently outperforming logistic regression in binary prediction tasks [24]. However, a critical shortcoming persisted: the predominant focus of these powerful ML models remained firmly on binary classification. While they excelled at discriminating between defaulters and non-defaulters, they failed to provide the granular risk stratification required for differentiated lending strategies. This represents a significant missed opportunity, as their sophisticated pattern-recognition capabilities were not leveraged to inform nuanced decisions for low-, medium-, and high-risk segments.
The persistence of this binary focus, however, reveals a fundamental conceptual misalignment with the continuous nature of credit risk. Decision theory posits that risk exists not as a binary state, but as a continuum of possibilities [25]. Consequently, a broader paradigm shift towards multi-class default prediction is gaining traction. This approach categorizes borrowers into discrete tiers, such as low, medium, and high risk, aligning theoretical modeling with practical portfolio management and credit rating systems. It enables proactive interventions for financially stressed borrowers in the “grey zone”, those who are not yet insolvent but are at a heightened risk of default. Nevertheless, while some studies have ventured beyond binary classification, they often rely on ad hoc methods, such as post hoc clustering of probability scores. This decouples risk class creation from the model’s training objective, undermining statistical coherence and calibration [26]. A truly integrated, end-to-end multi-class framework, trained directly on empirically derived risk segments, thus remains notably underexplored.
Parallel to these developments, the advent of DL promised a further leap through automated feature representation learning. Architectures like the FT-Transformer, specifically adapted for tabular data, can model intricate, hierarchical feature dependencies through self-attention mechanisms, potentially offering state-of-the-art predictive performance [27]. However, DL models face two persistent barriers in credit risk: susceptibility to overfitting on small, imbalanced datasets, and a black-box opacity that conflicts with regulatory mandates for explainable AI, such as the ‘right to explanation’ under the European Union’s General Data Protection Regulation and the transparency requirements outlined by the European Banking Authority [28]. More fundamentally, when trained on binary labels, even the most sophisticated DL models remain inherently incapable of discerning gradations of risk; they may predict default likelihood with high accuracy, but they cannot distinguish a resilient non-defaulter from a borrower on the brink of distress without explicit multiclass supervision.
The limitations inherent in individual model families have catalyzed the development of hybrid ensemble methods, which aim to synthesize their complementary strengths. The core rationale is the synergistic combination of the representation learning power of DL, the efficiency of tree-based ML for tabular data, and the calibration benefits of traditional statistics. Stacking ensembles, in particular, integrate multiple base learners through a meta-model and have shown promise in domains like fintech and healthcare [29,30]. Nevertheless, existing hybrid approaches in credit risk often exhibit critical flaws. They typically integrate only two modeling paradigms, missing the full synergistic potential of a tri-model architecture that unites traditional statistics, ML, and DL. Furthermore, they frequently suffer from data leakage due to improper preprocessing protocols, leading to inflated performance estimates. Most consequentially, they treat the ensemble as a monolithic predictor, overlooking the critical need for class-specific interpretability that can reveal how risk drivers shift across different segments of the borrower population.
This interpretability is not ancillary; it is a regulatory and ethical imperative. Explainable AI techniques, particularly SHAP, have become a gold standard for demystifying complex models [31]. However, the application of SHAP in credit risk has been largely superficial, confined to global feature importance in binary settings [32]. A profound gap exists: no prior study has leveraged SHAP to generate granular, class-specific explanations for a multi-class hybrid system on this data. For instance, the current literature cannot show how a feature like “loan size” might strongly mitigate high-default risk yet exert negligible influence in the low-risk group, or how “access to extension services” becomes decisive only at the medium-risk threshold. Such insights are indispensable for designing targeted financial products and satisfying regulatory scrutiny, yet they remain absent.
Synthesizing this landscape reveals a tripartite and critical research gap:
(i) Conceptual gap: a persistent reliance on binary logic that is misaligned with the continuous nature of risk and the practical needs of tiered lending, especially in heterogeneous portfolios.
(ii) Methodological gap: the absence of a leakage-proof, stacked ensemble that fully integrates a DL transformer (FT-Transformer), a state-of-the-art ML algorithm (LightGBM), and a traditional meta-learner (logistic regression) within a unified multi-class framework.
(iii) Interpretability gap: the lack of a framework that provides transparent, class-specific explanations for a multi-class, hybrid DL/ML system using SHAP analysis, which is crucial for both regulatory compliance and actionable insights.
Therefore, this study is designed to bridge these interconnected gaps by introducing and validating a novel TL-StackLR framework. Our work has three pivotal objectives: First, to construct risk classes directly from quantiles of calibrated default probabilities, ensuring alignment with empirical default incidence. Second, to employ strict OOF cross-validation with fold-wise preprocessing to eliminate data leakage, thereby providing robust performance estimates. Third, and most significantly, to deploy a comprehensive, class-specific SHAP analysis to decode the distinct socioeconomic and operational drivers of low, medium, and high default risk. By unifying representation learning, non-linear modeling, and probabilistic calibration within a transparent, multiclass architecture, TL-StackLR advances both methodological rigor and practical relevance, fulfilling the dual promise of AI in inclusive finance: to predict with precision and to explain with purpose. In doing so, it offers lenders a decision-support system that is simultaneously accurate, auditable, and actionable, precisely what is needed to serve the underserved responsibly.

3. Data and Methodological Background

3.1. Data Description and Multi-Class Transformation

This study leverages two real-world loan datasets representing small-scale borrowers originally sourced from the Postal Savings Bank of China [33]: farmers and small firms. Crucially, this dataset has been continuously maintained and updated through an ongoing academic collaboration with PSBC, with the most recent revision finalized in 2023. These updates incorporate newly observed default events, extended repayment and deposit histories, and enhanced borrower features reflective of evolving lending practices and economic conditions over the past decade.
The final analytical sample includes 2044 farmer loans (228 defaults, 11.1%) and 3045 small-firm loans (50 defaults, 1.6%). Although the updated dataset contains dozens of raw variables, we retain a curated subset of 20 highly predictive features, not for convenience, but for methodological rigor. Including all available features would introduce noise, redundancy, and potential data leakage, especially under severe class imbalance, and could degrade model stability and interpretability. The selected features span borrower demographics, financial capacity indicators, collateral characteristics, and macro-behavioral factors (e.g., Engel coefficient, loan purpose). This strategy prioritizes empirical relevance, regulatory interpretability, and model stability over sheer dimensionality, while remaining adaptable to alternative datasets and institutional settings. All selected variables remain central to modern risk assessment in rural and SME lending and enable transparent, actionable modeling aligned with real-world decision-making.
The original target variable is binary (Default or not), where 1 indicates default and 0 indicates non-default. This binary representation, while standard in credit risk literature, presents several critical limitations for practical risk management. First, it collapses the continuous spectrum of borrower risk into two broad categories, masking substantial heterogeneity within the large non-default group. Second, it fails to distinguish between financially stable borrowers and those in the “grey zone” who exhibit early warning signs but have not yet defaulted. Third, binary classification cannot support the tiered decision-making framework that financial institutions actually employ, where distinct strategies are required for low-, medium-, and high-risk borrowers regarding pricing, monitoring intensity, and intervention protocols. A recent study confirms that binary classification is inadequate for operational credit risk assessment, whereas multi-class approaches better capture real-world risk stratification and reduce misclassification costs [4].
To address these limitations and align predictive modeling with operational realities, we transform the binary classification problem into a multi-class risk assessment framework through a probability-based stratification approach.
The complete dataset is denoted as $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $x^{(i)} \in \mathbb{R}^{d}$ is the feature vector of borrower $i$ and the binary label $y^{(i)} \in \{0, 1\}$ indicates default status. The dimensionality $d$ represents the number of explanatory variables selected based on data availability and prior empirical evidence. In the present study, $d = 20$, corresponding to a set of economically and empirically validated borrower, loan, and contextual characteristics commonly used in credit risk assessment [9,34].
We first calibrate an LR model to estimate the conditional default probability:
$$\hat{p}^{(i)} = P\!\left(y^{(i)} = 1 \mid x^{(i)}\right) = \sigma\!\left(w^{\top} x^{(i)} + b\right),$$
where $\sigma(t) = 1/(1 + e^{-t})$.
The model parameters $w \in \mathbb{R}^{20}$ and $b \in \mathbb{R}$ are learned by minimizing a weighted binary cross-entropy loss on the training set $\mathcal{T}$:
$$\mathcal{L}_{\mathrm{LR}}(w, b) = -\frac{1}{|\mathcal{T}|} \sum_{i \in \mathcal{T}} \left[ \alpha_1\, y^{(i)} \log \hat{p}^{(i)} + \alpha_0 \left(1 - y^{(i)}\right) \log\!\left(1 - \hat{p}^{(i)}\right) \right],$$
where the class weights $\alpha_1$ and $\alpha_0$ are inversely proportional to class prevalence to counterbalance the severe dataset imbalance. The optimal parameters are obtained as:
$$(\hat{w}, \hat{b}) = \arg\min_{w, b}\; \mathcal{L}_{\mathrm{LR}}(w, b).$$
The resulting probabilities $\hat{p}^{(i)}$ are not mere scores; they represent a calibrated, continuous mapping of each borrower onto a latent risk dimension. This continuous perspective is the crucial first step away from the simplistic binary worldview.
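The calibration step can be sketched in a few lines with scikit-learn, whose `class_weight="balanced"` option applies prevalence-inverse weights in the spirit of $\alpha_1$ and $\alpha_0$ in Equation (2). The data below are synthetic placeholders, not the PSBC sample, and the variable names are illustrative.

```python
# Sketch of the class-weighted LR calibration step (synthetic placeholder data,
# not the PSBC sample).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2044, 20))            # d = 20 borrower features
y = (rng.random(2044) < 0.11).astype(int)  # ~11% default rate, as in the farmer sample

# class_weight="balanced" makes the class weights inversely proportional to
# class prevalence, mirroring the weighted cross-entropy loss in Equation (2)
lr = LogisticRegression(class_weight="balanced", max_iter=1000)
lr.fit(X, y)

p_hat = lr.predict_proba(X)[:, 1]          # continuous default probabilities
```

The fitted probabilities `p_hat` then serve as the continuous latent risk scores on which the quantile-based stratification operates.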
The proposed framework does not impose a fixed dimensionality on the input space; rather, the number of explanatory variables can be adapted to different datasets and institutional contexts, provided that feature selection is guided by domain knowledge and empirical relevance.

3.2. Probability-Based Risk Stratification

Our methodological contribution lies in transforming these continuous probabilities into a discrete, yet meaningfully ordered, risk classification. We segment borrowers into three tiers, low, medium, and high risk, using quantile thresholds derived from the empirical distribution of $\hat{p}^{(i)}$. This ensures each class contains a statistically comparable number of borrowers while preserving the monotonic relationship between assigned risk and actual default likelihood, consistent with the quantile-based risk stratification approach of [35].
Formally, let $Q_q(\hat{p})$ denote the empirical $q$-th quantile of the estimated probabilities on the training set. We define the tercile thresholds:
$$\tau_1 = Q_{0.33}(\hat{p}), \qquad \tau_2 = Q_{0.66}(\hat{p}).$$
Each borrower is then assigned a risk class label $z^{(i)} \in \{0, 1, 2\}$ according to:
$$z^{(i)} = \begin{cases} 0\ \text{(Low Risk)} & \text{if } \hat{p}^{(i)} < \tau_1, \\ 1\ \text{(Medium Risk)} & \text{if } \tau_1 \le \hat{p}^{(i)} < \tau_2, \\ 2\ \text{(High Risk)} & \text{if } \hat{p}^{(i)} \ge \tau_2. \end{cases}$$
The validity of this stratification is confirmed empirically by computing the class-specific default rate:
$$DR_g = \frac{\sum_{i:\, z^{(i)} = g} \mathbb{I}\!\left(y^{(i)} = 1\right)}{\sum_{i:\, z^{(i)} = g} 1}, \quad \text{for } g \in \{0, 1, 2\}.$$
As Table 1 demonstrates, this transformation successfully induces a strong monotonic gradient. For farmers, the default rate escalates from 6.02% (low risk) to 15.71% (high risk). For firms, the gradient is even steeper: the high-risk group’s default rate (3.58%) is over seven times that of the low-risk group (0.49%). This is not merely a re-labeling exercise; it is the creation of a new target variable $z^{(i)}$ that faithfully encodes the ordinal nature of credit risk, enabling models to learn the distinct signatures of financial stability, stress, and severe distress.
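The tercile stratification and the default-rate validity check can be expressed compactly; the helper names `stratify_terciles` and `class_default_rates` are ours, and the synthetic data merely illustrate the expected monotonic gradient, not the empirical rates in Table 1.

```python
# Illustrative sketch of the tercile stratification and the default-rate
# validity check; helper names and synthetic data are ours, not the paper's code.
import numpy as np

def stratify_terciles(p_hat):
    """Assign risk tiers 0/1/2 from the 33rd/66th percentiles of p_hat."""
    tau1, tau2 = np.quantile(p_hat, [0.33, 0.66])
    return np.where(p_hat < tau1, 0, np.where(p_hat < tau2, 1, 2))

def class_default_rates(z, y):
    """Realized default rate DR_g within each risk tier g."""
    return {g: y[z == g].mean() for g in (0, 1, 2)}

# Synthetic check: when defaults are drawn with probability p_hat, realized
# default rates should rise monotonically across the three tiers.
rng = np.random.default_rng(0)
p_hat = rng.random(3000)
y = (rng.random(3000) < p_hat).astype(int)
z = stratify_terciles(p_hat)
dr = class_default_rates(z, y)
assert dr[0] < dr[1] < dr[2]   # monotonic risk gradient
```

Because the tiers depend only on the ranking of `p_hat`, moderate perturbations of the cutoffs move only boundary observations, which is the robustness property discussed above.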
Although the proposed framework can, in principle, be extended to support more than three risk categories by adjusting the quantile thresholds, we deliberately adopt a three-tier classification (low, medium, and high risk). This choice reflects a balance between statistical reliability, operational usability, and interpretability. Introducing additional tiers would further fragment an already imbalanced dataset, leading to unstable class boundaries and reduced robustness for minority high-risk groups. From a practical perspective, three risk levels align closely with real-world credit risk management workflows, where institutions typically distinguish between safe borrowers, a “grey zone” requiring enhanced monitoring, and high-risk clients requiring intervention. Empirically, our results show that the three-tier design produces strong monotonic default gradients and clear class separation (Table 1), confirming that it captures meaningful risk heterogeneity without sacrificing stability or interpretability.
An important consideration in probability-based risk stratification is whether the empirical choice of quantile thresholds materially affects downstream model performance. In our framework, this concern is addressed by both methodological design and empirical validation. First, risk classes are constructed based on the relative ordering of calibrated default probabilities rather than their absolute values, ensuring that class assignment reflects borrowers’ positions along a continuous latent risk spectrum. As a result, moderate variations in the precise location of quantile cutoffs primarily affect boundary observations and do not alter the global risk ranking. Second, the validity of the resulting stratification is empirically confirmed through realized default rates. As shown in Table 1, default frequencies increase monotonically and substantially from low- to high-risk tiers for both farmers and firms, demonstrating stable and well-separated risk classes. Together, these results indicate that the proposed multi-class formulation captures intrinsic credit risk structure rather than artifacts of a particular threshold choice, providing a robust foundation for the subsequent TL-StackLR modeling stage.
While tercile-based stratification is adopted in this study for its balance between statistical stability and operational interpretability, alternative thresholding schemes could also be considered in principle, such as asymmetric quantiles or cost-based cutoffs reflecting institution-specific risk preferences. We deliberately do not pursue such designs here, as they require externally specified cost parameters that vary across lenders and regulatory regimes. Importantly, our empirical analysis shows that the proposed framework is robust to moderate variations in quantile thresholds, as the induced risk ordering and monotonic default gradients remain stable. This suggests that the predictive performance of TL-StackLR is driven by intrinsic risk structure in the data rather than a particular choice of cutoff values.
A potential concern in probability-based risk stratification is the dependency on the initial probabilistic model used to estimate default likelihoods. In our framework, LR is deliberately employed not as a final decision model but as a calibration anchor that maps borrowers onto a continuous latent risk dimension. The subsequent transformation into discrete risk tiers relies on quantile-based thresholds, which depend primarily on the relative ordering of predicted probabilities rather than their absolute magnitudes. Consequently, moderate systematic bias or scale distortion in LR estimates does not compromise class assignment, as long as the ordinal risk ranking is preserved.
Several design choices further mitigate the risk of error propagation. First, LR is trained using a class-weighted cross-entropy loss (Equation (2)) to explicitly address severe class imbalance, a setting in which more flexible non-linear models often produce unstable or poorly calibrated probabilities. Second, the validity of the induced labels is empirically verified through monotonic default rate analysis: realized default rates increase consistently across low-, medium-, and high-risk tiers for both farmers and firms, confirming that the stratification captures genuine heterogeneity in observed credit risk rather than artifacts of model misspecification. Finally, once the ordinal risk labels are established, the TL-StackLR framework learns the multi-class prediction task independently of the LR model, using FT-Transformer and LightGBM trained directly on the original feature space within a strict leakage-proof protocol. This decoupling ensures that any residual imperfections in the initial LR estimation are not mechanically propagated into the final predictive model.

3.3. Leakage-Proof Data Partitioning

With the multi-class target $\{z^{(i)}\}$ established, we partition the data for model development under a strict protocol designed to prevent data leakage, a common source of inflated performance in ML studies. The complete dataset is split into a training set (80%) and a hold-out test set (20%) using stratified sampling based on $z^{(i)}$, preserving the proportional distribution of each risk tier in both subsets.
All subsequent preprocessing steps, including mean imputation for missing values, standardization of continuous features $x' = (x - \mu_{\text{train}})/\sigma_{\text{train}}$, and one-hot encoding of categorical variables, are fitted exclusively on the training fold. Their parameters are then applied to the test set, ensuring no information from the test set inadvertently influences the training process. This rigorous separation guarantees that our final performance metrics provide an unbiased estimate of the model’s ability to generalize to unseen borrowers, a cornerstone of credible predictive modeling. We have transitioned from a simplistic binary view of credit risk to a nuanced, ordinal three-class problem. This transformation is not a preprocessing trick but a conceptual shift that aligns our modeling objective with the practical reality of risk management.
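The stratified split and fit-on-train-only preprocessing protocol can be sketched as follows, assuming a scikit-learn toolchain; the estimator choices, data, and variable names are illustrative rather than taken from the paper:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.1] = np.nan     # sprinkle missing values
z = rng.integers(0, 3, size=100)          # three-class risk target

# Stratified 80/20 split preserves the tier proportions in both subsets
X_tr, X_te, z_tr, z_te = train_test_split(
    X, z, test_size=0.2, stratify=z, random_state=42)

# Fit preprocessing on the training fold only, then apply it to the test set
imputer = SimpleImputer(strategy="mean").fit(X_tr)
scaler = StandardScaler().fit(imputer.transform(X_tr))
X_te_proc = scaler.transform(imputer.transform(X_te))
```

The key point is that `fit` is never called on test data: the imputation means and standardization parameters $\mu_{\text{train}}, \sigma_{\text{train}}$ come exclusively from the training fold.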

3.4. TL-StackLR Hybrid Multi-Class Stacking Framework

Having established a meaningful multi-class target through probability-based stratification, we face the core predictive challenge: building a model that can accurately distinguish between low-, medium-, and high-risk borrowers while remaining interpretable and robust to the severe class imbalance and feature heterogeneity present in our data. To this end, we introduce the TL-StackLR framework, a deliberate synthesis of three complementary paradigms, each chosen to address specific weaknesses in the others, culminating in a robust, leakage-proof, and interpretable ensemble. The complete end-to-end process of the TL-StackLR framework is illustrated in Figure 1.
The TL-StackLR framework learns a function $f : \mathcal{X} \to \Delta^{2}$, mapping the feature space $\mathcal{X} \subseteq \mathbb{R}^{20}$ to the probability simplex over the three risk classes:
$$f(x) = \big( P(z=0 \mid x),\ P(z=1 \mid x),\ P(z=2 \mid x) \big)$$
This is achieved through a two-layer stacked generalization:
$$f(x) = g\big( h_1(x), h_2(x) \big)$$
where $h_1$ is the FT-Transformer (DL), $h_2$ is LightGBM (ML), and $g$ is a LR meta-learner. The rationale for this specific triad is grounded in their complementary strengths and the specific challenges of our multi-class credit risk prediction task.

3.4.1. Base Learner A: FT-Transformer for Tabular Learning

Traditional DL models like MLPs and CNNs often underperform on tabular data because they lack inductive biases suited for structured features [36]. We select the FT-Transformer, a transformer-based architecture designed explicitly for tabular data, for its superior ability to model high-order, non-linear feature interactions through self-attention mechanisms [37]. Moreover, when compared to other tabular DL models like TabNet, the FT-Transformer offers a more general and flexible architecture for capturing complex dependencies without relying on strong feature-sparsity assumptions, which are not suitable for our curated set of 20 predictive features.
For an input $x \in \mathbb{R}^{20}$, each scalar feature $x_j$ is first projected into a $d_{\text{model}}$-dimensional embedding ($d_{\text{model}} = 64$):
$$e_j = W_j^{\text{embed}} x_j + b_j^{\text{embed}}, \quad j = 1, \ldots, 20$$
These embeddings form the initial token sequence $E^{(0)} \in \mathbb{R}^{20 \times 64}$. The core of the architecture consists of $L = 2$ transformer layers. Each layer applies multi-head self-attention (with $n_{\text{heads}} = 2$), allowing the model to learn contextual relationships between features. For a single attention head:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V$$
where $Q$, $K$, $V$ are linear projections of the input. This mechanism is crucial for credit risk, where the importance of a feature, such as the Engel coefficient, depends dynamically on the value of other features, like collateral value or business sector. After processing through the transformer layers, we obtain a contextualized representation $H^{(L)}$, which is aggregated via mean pooling:
$$h_{\text{agg}} = \frac{1}{20} \sum_{j=1}^{20} H_j^{(L)} \in \mathbb{R}^{64}$$
A final linear layer projects this to class logits, followed by a softmax activation to produce the probability vector $\pi(x) \in \Delta^{2}$. The model is trained by minimizing the cross-entropy loss.
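A numerical sketch of the single-head attention computation, mean pooling, and softmax head follows. It uses random weights in plain NumPy purely to make the tensor shapes concrete; it is not the trained FT-Transformer, and omits multi-head splitting, residual connections, and layer normalization:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, Wq, Wk, Wv):
    """One scaled dot-product attention head over the feature tokens."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # tokens x tokens
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d_model = 20, 64                 # one token per tabular feature
E = rng.normal(size=(n_tokens, d_model))   # embedded token sequence E^(0)
Wq, Wk, Wv = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3))
H = self_attention(E, Wq, Wk, Wv)          # contextualized representation
h_agg = H.mean(axis=0)                     # mean pooling over the 20 tokens
logits = h_agg @ rng.normal(size=(d_model, 3))
pi = softmax(logits)                       # probability vector on the simplex
```

The `tokens x tokens` attention matrix is what lets the contribution of one feature (e.g., the Engel coefficient) be modulated by the values of the other nineteen.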

3.4.2. Base Learner B: LightGBM for Efficient Multiclass Learning

While the FT-Transformer excels at learning representations, gradient boosting machines are widely recognized for their predictive efficiency and robustness on tabular data. Among these, LightGBM stands out for its numerous advantages. It employs a histogram-based algorithm and a leaf-wise tree growth strategy, making it significantly faster and more memory-efficient than other boosting frameworks like XGBoost, which is crucial for iterative model development [38]. Furthermore, LightGBM natively handles categorical features and missing values, streamlining preprocessing efforts. Most importantly, for imbalanced multi-class problems, LightGBM incorporates effective regularization techniques and can automatically adjust class weights, providing stable and reliable probability estimates across all risk categories. LightGBM represents the state-of-the-art in tree-based ensembles for tabular data, offering an optimal blend of speed, accuracy, and built-in functionality for handling our data characteristics.
LightGBM constructs an additive ensemble of $M = 500$ decision trees. For class $c$, the model output is an additive score:
$$F_c(x) = \sum_{m=1}^{M} f_{m,c}(x)$$
where each $f_{m,c}$ is a regression tree. The class probabilities are obtained via softmax:
$$q_c(x) = \frac{\exp\big(F_c(x)\big)}{\sum_{r=0}^{2} \exp\big(F_r(x)\big)}, \quad c = 0, 1, 2$$
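The mapping from per-class additive tree scores to class probabilities can be illustrated directly. The scores below are toy values, not outputs of the fitted LightGBM model:

```python
import numpy as np

def class_probabilities(F):
    """Softmax over per-class additive tree scores F (n_samples x 3)."""
    F = F - F.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(F)
    return e / e.sum(axis=1, keepdims=True)

# Toy summed tree scores F_c(x) for two borrowers
F = np.array([[2.0, 0.5, -1.0],    # scores favouring the low-risk class
              [-1.5, 0.2, 1.8]])   # scores favouring the high-risk class
q = class_probabilities(F)
```

Subtracting the row maximum before exponentiating leaves the probabilities unchanged while avoiding overflow for large scores.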

3.4.3. Meta-Learner: Logistic Regression for Calibrated Fusion

The predictions from the FT-Transformer ($\pi(x)$) and LightGBM ($q(x)$) provide two complementary views of the data: one from a deep, interaction-aware perspective, and another from a robust, non-linear but more local perspective. Simply averaging these probabilities is suboptimal, as the reliability of each base learner may vary across different regions of the feature space [39]. Therefore, we employ a LR model as a meta-learner to learn the optimal linear combination of these base predictions.
LR is chosen for several key motivations. It is statistically well-understood and widely accepted in credit risk modeling. It enforces a simple, linear combination in the probability space, which aids interpretability. It provides regularization tools that prevent overfitting to noisy base predictions [40].
Formally, we construct a stacked feature vector for each training sample:
$$r = \pi(x) \oplus q(x) \in \mathbb{R}^{6}$$
where $\oplus$ denotes concatenation. The meta-learner is a multinomial logistic regression that maps $r$ to final class probabilities:
$$u_c(r) = P(z = c \mid r) = \frac{\exp\big(w_c^{\top} r + b_c\big)}{\sum_{j=0}^{2} \exp\big(w_j^{\top} r + b_j\big)}, \quad c = 0, 1, 2$$
The parameters $\{w_c, b_c\}$ are learned by minimizing the cross-entropy loss with $L_2$ regularization on the stacked features $r$, which are themselves generated via a leakage-proof OOF procedure described subsequently.
A common flaw in hybrid ensemble studies is data leakage, where information from the validation or test set inadvertently influences the training of base learners or the meta-learner, leading to optimistically biased performance estimates. We eliminate this risk through a strict OOF stacking protocol integrated with 5-fold stratified cross-validation.
Let $\{(T_k, V_k)\}_{k=1}^{5}$ partition the training set into five folds, preserving the proportion of each risk class. For each fold $k$:
(i) Fold-specific preprocessing: An imputer and a scaler are fitted exclusively on the training fold $T_k$. These transformers are then applied to the validation fold $V_k$. (ii) Base model training: The FT-Transformer and LightGBM are trained on the processed $T_k$. (iii) OOF prediction: The trained base models predict probabilities for every sample in $V_k$. These predictions, $\pi_{\text{OOF}}^{(i)}$ and $q_{\text{OOF}}^{(i)}$, are guaranteed to be independent of the sample’s true label in the training process for that fold.
After iterating through all folds, we assemble the complete set of OOF predictions for the entire training set into the matrix $R_{\text{train}} \in \mathbb{R}^{N_{\text{train}} \times 6}$. Crucially, no sample’s label is used in both training a base model and contributing to the meta-features derived from that same model. This matrix is then used to train the LR meta-learner.
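The OOF stacking loop can be sketched as follows. For brevity, the fold-specific imputation and scaling of step (i) are omitted, and two simple scikit-learn classifiers stand in for the FT-Transformer and LightGBM base learners; all data and names are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
z = rng.integers(0, 3, size=300)

# Stand-ins for the two base learners (FT-Transformer and LightGBM)
base_a = LogisticRegression(max_iter=1000)
base_b = GradientBoostingClassifier(n_estimators=20, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
R = np.zeros((len(X), 6))                      # OOF meta-feature matrix
for tr_idx, va_idx in skf.split(X, z):
    base_a.fit(X[tr_idx], z[tr_idx])
    base_b.fit(X[tr_idx], z[tr_idx])
    # Validation-fold samples are scored by models that never saw them
    R[va_idx, :3] = base_a.predict_proba(X[va_idx])
    R[va_idx, 3:] = base_b.predict_proba(X[va_idx])

meta = LogisticRegression(max_iter=1000).fit(R, z)   # LR meta-learner
final_probs = meta.predict_proba(R[:5])
```

Each row of `R` concatenates two three-class probability vectors, so the meta-learner sees only out-of-fold information about its own training samples.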
Finally, to create a production ensemble, we retrain both base learners on the entire training set (with global preprocessing fitted on this full set). The final TL-StackLR model for inference on a new sample x * is:
$$\hat{z}^{*} = \operatorname*{arg\,max}_{c \in \{0, 1, 2\}} u_c\big( \pi^{\text{full}}(x^{*}) \oplus q^{\text{full}}(x^{*}) \big)$$
This rigorous protocol ensures that all reported performance metrics on the held-out test set are unbiased estimates of true generalization error.

3.5. Class-Specific SHAP Interpretability

Predictive accuracy, while necessary, is insufficient for high-stakes credit decisions. Regulators and financial practitioners demand explainability: understanding why a borrower was assigned a particular risk class. To meet this demand, we deploy SHAP analysis, with a focus on the LightGBM component. We choose LightGBM for this analysis due to the availability of the highly efficient TreeSHAP algorithm, which computes exact Shapley values in polynomial time for tree ensembles.
For a given prediction score $F_c(x)$ for class $c$, the SHAP value $\phi_j^{(c)}(x)$ for feature $j$ represents its average marginal contribution across all possible feature orderings:
$$\phi_j^{(c)}(x) = \sum_{S \subseteq J \setminus \{j\}} \frac{|S|!\,\big(|J| - |S| - 1\big)!}{|J|!} \Big[ F_c\big(x_{S \cup \{j\}}\big) - F_c\big(x_S\big) \Big]$$
where $J = \{1, \ldots, 20\}$ is the set of all features, and $x_S$ is the feature vector with only the subset $S$ present. These values satisfy the additivity property:
$$F_c(x) - \mathbb{E}\big[F_c(X)\big] = \sum_{j=1}^{20} \phi_j^{(c)}(x)$$
Our interpretability framework generates three complementary insights:
(i) Global feature importance: $I_j = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=0}^{2} \big| \phi_j^{(c)}(x^{(i)}) \big|$, identifying which features are most influential overall. (ii) Class-specific importance: We decompose $I_j$ by class, revealing if a feature (e.g., loan tenure) is primarily discriminative for high-risk borrowers but irrelevant for low-risk ones. (iii) Directional beeswarm plots: For each risk class $c$, we plot the distribution of $\phi_j^{(c)}$ values against feature values, showing not just the magnitude but the direction of influence (e.g., higher collateral value decreases the predicted probability of high risk).
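For readers who wish to verify the Shapley definitions above, the following toy computation enumerates all subsets exactly for a three-feature score function. This brute-force enumeration is feasible only for tiny feature sets (TreeSHAP avoids it for tree ensembles), and a single baseline vector stands in for the expectation over absent features:

```python
import itertools
import math
import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values for f at x; absent features are set to a
    single baseline vector (a simple stand-in for the expectation)."""
    n = len(x)
    phi = np.zeros(n)
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for r in range(n):
            for S in itertools.combinations(others, r):
                # Weight |S|! (|J| - |S| - 1)! / |J|! from the SHAP formula
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                x_S = baseline.copy()
                x_S[list(S)] = x[list(S)]
                x_Sj = x_S.copy()
                x_Sj[j] = x[j]
                phi[j] += w * (f(x_Sj) - f(x_S))
    return phi

# Toy score with an interaction between features 0 and 1; feature 2 is inert
f = lambda v: 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[0] * v[1]
x = np.array([1.0, 2.0, 3.0])
base = np.zeros(3)
phi = exact_shapley(f, x, base)
# Additivity holds: phi sums to f(x) - f(baseline), the inert feature gets
# zero, and the interaction term 0.5*x0*x1 is split equally between 0 and 1.
```

The split of the interaction term across both participating features is exactly the behaviour that makes per-class beeswarm plots informative about joint effects.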
This multi-faceted SHAP analysis transforms the TL-StackLR framework from a powerful predictor into a diagnostic tool. It enables lenders to move beyond a simple risk score to answer critical questions: What are the top factors that place a given farmer or firm in the high-risk group? Or which actionable levers could move a medium-risk borrower into the low-risk category? This level of transparency is essential for responsible lending, regulatory compliance, and building trust with borrowers.

4. Results Analysis

The experimental pipeline, implemented in Python 3.12.3 using Visual Studio Code 1.106.3, successfully incorporated a robust training and validation framework. Our primary goal was not just to achieve statistical accuracy in stratifying borrowers into low, medium, and high default risk, but to provide a practical solution for equitable, efficient, and resilient financial institutions. To this end, we present the empirical performance of the proposed TL-StackLR framework on two distinct datasets: farmers and small firms. We benchmark its performance against a diverse set of models, including standalone models such as FT-Transformer, LGBM, XGBoost, RF, ANN, and LR, as well as hybrid models like FT-LGBM, LGBM-LR, and FT-LR.
Our analysis addresses three pivotal questions: (i) Does a rigorously designed, interpretable, multi-class hybrid model translate into tangible, actionable advantages for both lenders and borrowers? (ii) How robust is the model to severe class imbalance and data heterogeneity? (iii) Can it offer class-specific interpretability that enhances decision-making in real-world scenarios? The results consistently affirm that TL-StackLR excels on all fronts, delivering superior performance and addressing longstanding challenges in credit risk modeling.

4.1. Comparative Performance of TL-StackLR for Multi-Class Loan Default Prediction

A singular focus on accuracy often obscures the complexities of multi-class loan default prediction. In financial risk modeling, failing to identify high-risk borrowers (false negatives) is far more costly than misclassifying low-risk borrowers (false positives). To capture this nuance, we utilized twelve complementary performance metrics, each serving a distinct diagnostic role. We present nine of these metrics in Table 2, providing precise, verifiable evidence critical for peer review. The remaining three metrics are illustrated in Figure 2, which offers a compelling visual narrative of TL-StackLR’s dominance across all performance dimensions. This dual presentation serves a complementary purpose: the table offers forensic detail, while the figure conveys the overarching story of the proposed TL-StackLR model.
The rationale for each metric is deeply intertwined with the practical objectives of financial institutions and the methodological challenges of multi-class loan default prediction. ROC-AUC (Receiver Operating Characteristic Area Under Curve) serves as the primary global benchmark, reflecting the ability of each model to rank borrowers across the three risk tiers. PR-AUC (Precision-Recall AUC) complements ROC-AUC under extreme imbalance by focusing on the precision and recall of the minority, default-dominated regions. Log-loss evaluates probability calibration, which is essential when institutions use model outputs to estimate expected losses and capital buffers. For ROC-AUC, 95% confidence intervals (CI 95%) quantify statistical stability across cross-validation folds; narrow intervals indicate that performance is not driven by a few favorable splits but is consistently reproduced. Recall (sensitivity) captures the share of true defaulters correctly detected, while Precision protects against unjustly penalizing sound borrowers. Specificity assesses how well the model preserves access to credit for non-defaulters, and MCC (Matthews Correlation Coefficient) summarizes performance across all cells of the confusion matrix, making it especially informative in imbalanced, multi-class settings. Accuracy provides a familiar baseline, but its interpretation is tempered by imbalance; F1-score and G-mean instead highlight the trade-off and balance between sensitivity to high-risk borrowers and protection of the low-risk majority.
To ensure a fair and unbiased comparison, all benchmark models were evaluated under an identical experimental pipeline. Specifically, the same data preprocessing steps, including missing value handling, feature scaling, and encoding, were applied consistently within each cross-validation fold for all models. Moreover, all models, including standalone learners and hybrid variants, were trained and evaluated using the same stratified cross-validation splits and strict leakage-prevention protocols. This design ensures that performance differences reflect genuine architectural and modeling capabilities, rather than disparities in preprocessing, validation strategy, or engineering rigor.
Table 2 reveals a clear and consistent hierarchy of models. TL-StackLR emerges as the best performer across both borrower types and all core metrics. On the firm dataset, TL-StackLR attains a ROC-AUC of 0.986 (95% CI: 0.970–0.994), surpassing the next-best hybrid (FT-LGBM) and exceeding the strongest single learner (LightGBM) by nearly three percentage points. In high-stakes financial prediction, such margins are economically meaningful, as they can translate into fewer missed defaults and a more efficient allocation of monitoring resources. TL-StackLR also registers the lowest log-loss (0.170), indicating well-calibrated probabilities and confirming that gains are not confined to a single class but extend across low-, medium-, and high-risk classes. The MCC of 0.903 for firms corroborates this result, showing that the model performs well across all classes of correct and incorrect decisions, despite the fact that the high-risk group contains only a small number of defaulting firms.
The farmer dataset poses an even more demanding test due to its higher overall default rate and greater socioeconomic and climatic variability. Here, TL-StackLR still achieves a ROC-AUC of 0.972 (95% CI: 0.956–0.988) and an MCC of 0.892, outperforming all alternative models. Notably, these results are obtained without resorting to synthetic oversampling or aggressive resampling that could distort the actual risk structure. LR, which approximates traditional scoring practice, achieves ROC-AUC values of 0.793 (firms) and 0.776 (farmers), while even strong methods, such as standalone LightGBM, reach 0.957 and 0.943, respectively. The gap between TL-StackLR and these baselines is not cosmetic; it represents the difference between a model that provides sharp, operationally meaningful differentiation across risk tiers and one that yields blurred boundaries between marginal and severe risk.
The runtime results indicate that TL-StackLR is more computationally demanding than single-model baselines; nonetheless, it remains practical for real-world portfolio sizes, with training times on the order of minutes. For institutions already running gradient boosting and DL models, adopting a leakage-safe stacked architecture chiefly requires disciplined cross-validation and model orchestration rather than an entirely new structure. Given the financial stakes of default misclassification, the modest additional computation is an acceptable trade-off for the observed gains in discrimination, calibration, and class balance.
While ROC–AUC is a standard measure of discrimination, differences in AUC values alone do not establish whether one model significantly outperforms another. To formally assess the statistical significance of the observed AUC improvements, we conducted pairwise DeLong tests for correlated ROC curves [41] on the final hold-out test sets.
The DeLong test is a non-parametric procedure specifically designed for comparing AUCs evaluated on the same test samples, and is therefore appropriate for assessing between-model differences in our setting. For each dataset, predicted probabilities from the identical hold-out test observations were used to compare TL-StackLR against the strongest competing single and hybrid models.
As reported in Table 3, TL-StackLR achieves statistically significant AUC improvements over all key benchmark models on both datasets. In particular, the AUC gains relative to traditional logistic regression and standalone LightGBM are highly significant (p < 0.01). Importantly, TL-StackLR also significantly outperforms the strongest hybrid baseline (FT-LGBM) on both the firm dataset (p = 0.021) and the farmer dataset (p = 0.018). These findings confirm that the superior discrimination performance of TL-StackLR is not attributable to random variation, but reflects a genuine and robust enhancement in credit risk ranking.
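A compact sketch of the DeLong procedure for two correlated AUCs is given below, assuming binary labels (e.g., a one-vs-rest view of a single risk tier) and SciPy for the normal CDF. It illustrates the mechanics of the test, structural components, covariance of the AUC difference, and a two-sided z-test, and is not the exact implementation used in the paper:

```python
import numpy as np
from scipy.stats import norm

def _components(pos, neg):
    """DeLong structural components: V10 per positive, V01 per negative."""
    psi = (pos[:, None] > neg[None, :]).astype(float) \
        + 0.5 * (pos[:, None] == neg[None, :])
    return psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y, scores_a, scores_b):
    """Two-sided DeLong test for the difference of two correlated AUCs."""
    y = np.asarray(y)
    v10a, v01a = _components(scores_a[y == 1], scores_a[y == 0])
    v10b, v01b = _components(scores_b[y == 1], scores_b[y == 0])
    auc_a, auc_b = v10a.mean(), v10b.mean()
    m, n = len(v10a), len(v01a)
    s10 = np.cov(np.vstack([v10a, v10b]))   # covariance across positives
    s01 = np.cov(np.vstack([v01a, v01b]))   # covariance across negatives
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n)
    zstat = (auc_a - auc_b) / np.sqrt(var)
    return auc_a, auc_b, 2.0 * (1.0 - norm.cdf(abs(zstat)))

y = np.array([1, 1, 1, 0, 0, 0])
sa = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # perfect ranking, AUC = 1
sb = np.array([0.9, 0.2, 0.7, 0.3, 0.8, 0.1])   # two ranking errors
auc_a, auc_b, p = delong_test(y, sa, sb)
```

Because both score vectors are evaluated on the same test observations, the cross-model covariance terms `s10[0, 1]` and `s01[0, 1]` shrink the variance of the AUC difference relative to treating the two models as independent.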
Following the table representation, Figure 2 reports Accuracy, G-mean, and macro-F1 for each model, emphasizing robustness under class imbalance and across the three risk levels. TL-StackLR achieves the highest Accuracy (0.964 for firms and 0.956 for farmers). However, more importantly, it attains the strongest G-mean (0.959 and 0.945) and F1-score, demonstrating that it does not sacrifice minority-class performance to improve aggregate correctness. G-mean is particularly revealing; it penalizes any model that neglects the high-risk tier, and yet TL-StackLR maintains top scores, avoiding the typical pattern where models perform well on low-risk borrowers but fail to identify vulnerable clients.
F1-scores show that the tri-learning stack improves recall and precision simultaneously, rather than trading one against the other. This property is critical in lending, where missing a high-risk borrower (a false negative) can trigger significant credit losses, but systematically over-flagging safe borrowers can erode trust, increase exclusion, and generate reputational risk. From a theoretical perspective, these findings support the decision-theoretic view that credit risk is better modeled as an ordered spectrum than a binary state. The probability-based three-class design enables TL-StackLR to learn separate patterns associated with financial stability, emerging stress, and severe distress, rather than compressing all non-defaults into a single category. The strong performance in the medium-risk class is critical, as this “grey zone” represents borrowers for whom timely, targeted intervention can prevent future defaults.

4.2. Synergy of TL-StackLR’s Hybrid Architecture

The superiority of TL-StackLR is not an artifact of a single metric but the product of a deliberately complementary design. FT-Transformer uncovers high-order feature interactions in tabular credit data that linear and tree-based models alone may overlook. LightGBM, in turn, provides highly efficient non-linear partitioning and strong performance on structured features. The logistic regression meta-learner fuses these two perspectives into a final set of calibrated class probabilities, learning when to trust the deep representation and when to rely more heavily on the boosted trees. This tri-modal integration, implemented within a strict out-of-fold protocol, ensures that the stacking gains are genuine and not inflated by information leakage.
This architecture directly addresses the limitations of individual paradigms. It relaxes the restrictive linear assumptions of LR and the instability of ANN on small tabular datasets. It mitigates the calibration challenges often observed in pure ML models such as RF and XGBoost under severe imbalance. It also overcomes the data hunger and opacity of standalone DL by constraining the FT-Transformer within a broader ensemble and explaining its influence through the meta-learner. In financial terms, even a 2–3 percentage-point improvement in AUC over already strong baselines can correspond to substantial savings in expected and unexpected losses, especially when portfolios contain thousands of small loans to vulnerable farmers and small enterprises.
To rigorously assess the contribution of each component in the proposed TL-StackLR framework, we conduct a structured ablation study following established robustness practices in hybrid ensemble modeling [42,43]. Specifically, we evaluate performance degradation when individual components are removed or simplified, while keeping all other experimental settings unchanged.
We consider three ablation variants: (i) removal of the FT-Transformer branch (LGBM-LR), (ii) removal of the LightGBM branch (FT-LR), and (iii) replacement of the logistic regression meta-learner with a simple probability averaging scheme. The results, summarized in Table 2, show a consistent and non-trivial decline across ROC-AUC, MCC, and PR-AUC metrics relative to the full TL-StackLR model for both farmer and firm datasets.
Notably, removing FT-Transformer leads to weaker discrimination in the medium-risk “grey zone” class, while replacing the LR meta-learner with naïve averaging degrades probability calibration and increases log-loss. These findings confirm that the superior performance of TL-StackLR arises from the complementary interaction of all three components, rather than from any single learner in isolation.

4.3. Interpreting Model Decisions: Class-Specific Risk Drivers via SHAP

Predictive power alone is insufficient in regulated environments where lenders must justify decisions, explain risk classifications, and withstand external audits. To this end, the study applies TreeSHAP to the LightGBM component to obtain class-specific explanations for each borrower and each risk tier. These SHAP values decompose the raw class scores into additive feature contributions, providing a transparent link between observed characteristics and the assigned risk class. The per-class SHAP beeswarm plots in Figure 3 and Figure 4 reveal how feature impacts shift across low, medium, and high risk, while stacked global bar plots (Figure 5) highlight a small set of consistently influential features.
For farmers (Figure 3), variables such as loan purpose, Engel coefficient, house value, net income, deposit balance, and seasonal patterns emerge as the dominant risk drivers. For example, a high Engel coefficient (reflecting a large share of income spent on basic consumption) and low house value push SHAP values upward for the high-risk tier, whereas diversified income and stronger assets dampen the predicted risk.
For firms (Figure 4), features such as per capita disposable income, relevant industry experience, mortgage score, capitalization ratio, and the legal representative’s automobile and real-estate holdings emerge as key drivers. Higher income, longer experience, strong collateral, and substantial pledged assets generally produce negative SHAP values for the high-risk class, pulling borrowers toward medium or low risk. In contrast, weak or absent collateral and thin disposable income shift SHAP contributions in the opposite direction.
Crucially, class-specific SHAP plots reveal that the same feature can have distinct economic meanings across different tiers. For some borrowers, an increase in the Engel coefficient or leverage can move them from low to medium risk without immediately triggering high-risk classification, highlighting a transitional zone where early restructuring or technical assistance may be most effective. Medium-risk borrowers often exhibit moderate changes in income, spending patterns, or seasonal exposure; their SHAP profiles signal emerging stress rather than imminent insolvency. In contrast, high-risk borrowers combine adverse values across several key features, such as weak collateral, low-income buffers, and exposure to unfavorable macroeconomic conditions. These patterns confirm that the model has internalized economically plausible relationships: assets and collateral mitigate risk; excessive leverage and income fragility amplify it; and macroeconomic stress disproportionately harms already fragile clients.
The stacked SHAP bar plots in Figure 5 further decompose feature importance by risk class, using distinct colors for low, medium, and high risk. This visualization clearly demonstrates that risk drivers are class-dependent and dynamic, rather than static. For example, a feature like house value may show modest influence for low-risk borrowers but emerges as a decisive protective factor in the high-risk group, indicating that property can act as a last line of defense against severe distress. Similarly, features related to savings or deposit balances may be most important in the medium-risk tier, where liquidity buffers can determine whether a temporary shock leads to recovery or escalation into high risk. Such insights enable loan officers to move beyond generic checklists and focus on the specific levers that can shift a borrower from one risk band to another.
The SHAP analysis in this study is intentionally conducted on the LightGBM component rather than on the full TL-StackLR ensemble. This choice reflects a deliberate trade-off between interpretability fidelity and architectural complexity. LightGBM operates directly on the original feature space and supports exact, stable TreeSHAP decompositions, enabling reliable and actionable feature-level explanations. In contrast, the FT-Transformer captures high-order feature interactions within a latent representation space that is not directly amenable to faithful SHAP attribution.
Moreover, the LR meta-learner in TL-StackLR does not introduce additional feature transformations but linearly combines the calibrated outputs of the base models. Accordingly, the SHAP results should be interpreted as component-level explanations that illuminate dominant risk drivers derived from the original features, rather than as a complete end-to-end explanation of the ensemble. The contribution of the FT-Transformer is therefore reflected implicitly through improved discrimination and calibration across risk tiers, an inherent and acknowledged limitation of hybrid deep-ensemble interpretability.
Our framework addresses three critical gaps: it replaces the oversimplified binary view of default with an empirically grounded three-class structure that mirrors banks’ tiered decision-making, enabling tailored pricing and intervention strategies; it overcomes the accuracy, robustness trade-off through a leakage-proof hybrid architecture that achieves state-of-the-art performance while remaining resilient to real-world data imbalance and heterogeneity; and it closes the interpretability gap in complex ensembles by delivering class-specific SHAP explanations, transforming opaque scores into actionable, regulator-compliant narratives for each risk segment.
Overall, the TL-StackLR framework delivers a dual advance. It pushes the frontier of predictive performance in multi-class credit risk while providing the transparency required for responsible deployment. For financial institutions, this translates to more efficient capital allocation, targeted support for vulnerable borrowers, and stronger defenses against systemic risk. For farmers and small firms, the backbone of many economies, it enables a more inclusive credit system: one that assesses risk with nuance, explains decisions with clarity, and fosters resilience through insight. The framework is not merely a better algorithm but a more effective decision-support system for building a more stable and equitable financial future.

5. Conclusions

This study has successfully addressed the critical gap between binary credit risk modeling and the nuanced, tiered decision-making used in real-world lending. We introduced the TL-StackLR framework, a novel hybrid ensemble that integrates a DL transformer (FT-Transformer), a gradient boosting model (LightGBM), and a logistic regression meta-learner within a leakage-proof stacking architecture. By first transforming binary default data into three ordinally meaningful risk classes, Low, Medium, and High, using a probability-based stratification, we aligned the predictive task with the actual risk management processes of financial institutions.
Empirical validation on two severely imbalanced, real-world datasets, small firms and farmers, demonstrates that TL-StackLR delivers state-of-the-art predictive performance. It achieved the highest ROC-AUC (0.986 for firms, 0.972 for farmers), superior calibration (lowest log-loss), and balanced discrimination across all risk tiers, outperforming all standalone and partial-hybrid benchmarks. Beyond prediction, TL-StackLR provides actionable, class-specific interpretability through SHAP analysis, revealing how key risk drivers, such as disposable income, relevant industry experience, mortgage score, capitalization ratio for firms, and Engel coefficient, as well as loan purpose, house value, and net income for farmers, differentially influence each risk category. This transparency transforms the model from a black-box scorer into a diagnostic system, enabling lenders to understand the why behind risk ratings and design targeted interventions.
Therefore, the contributions of this work are threefold. Conceptually, we advance credit risk modeling beyond the restrictive binary paradigm to a continuous, tiered framework that reflects true borrower heterogeneity. Methodologically, we propose a rigorously validated, tri-model hybrid architecture that synergizes deep representation learning, efficient non-linear modeling, and statistical calibration. Practically, we deliver an interpretable, regulator-ready decision-support tool that enhances both predictive accuracy and operational transparency. For financial institutions serving vulnerable borrowers such as small firms and farmers, TL-StackLR offers a pathway toward more inclusive, responsible, and resilient credit systems, fulfilling the imperative to predict with precision and explain with purpose. These explanations translate model outputs into actionable insights, enabling loan officers to understand why a borrower is classified as low-, medium-, or high-risk and which levers might shift that classification.

Limitations and Future Research Directions

While the TL-StackLR framework demonstrates strong performance and interpretability in multi-class credit risk prediction, several limitations suggest important avenues for future research. First, the empirical analysis is based on data from a single financial institution in China; validation across diverse geographical, regulatory, and institutional contexts is necessary to assess broader generalizability. Credit markets differ across countries in terms of capital requirements, default definitions, disclosure standards, and borrower protection regimes, which may affect probability calibration and risk-tier interpretation. Although TL-StackLR relies on relative risk ordering rather than institution-specific thresholds, deployment in other jurisdictions may require recalibration to align with local regulatory constraints and lending practices.
Second, the current framework relies on static, tabular data. Incorporating alternative or high-frequency data sources, such as mobile transactions, satellite imagery, or climate indicators, could further enhance predictive power, particularly for borrowers with limited credit histories. Third, credit risk is inherently dynamic, whereas our model assesses risk at a single point in time; future extensions could integrate temporal modeling to capture risk migration and behavioral changes over the loan lifecycle. Finally, while quantile-based risk tiers provide a robust and interpretable stratification, future work may explore cost-sensitive or utility-based thresholding schemes that explicitly optimize economic or policy objectives. Addressing these directions would further strengthen the framework’s robustness, adaptability, and real-world relevance.
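The cost-sensitive thresholding raised as future work could, for example, replace quantile cuts with a decision rule that assigns each borrower the tier minimizing expected misclassification cost. The cost matrix below is entirely illustrative (understating risk is penalized more heavily than overstating it); `cost_sensitive_tier` is a hypothetical helper, not part of the paper's method.

```python
import numpy as np

# rows: true tier (Low, Medium, High); columns: assigned tier.
# Illustrative costs: understating risk (lower triangle) is penalized
# more heavily than overstating it (upper triangle).
COST = np.array([[0.0, 1.0, 2.0],   # true Low
                 [2.0, 0.0, 1.0],   # true Medium
                 [6.0, 3.0, 0.0]])  # true High

def cost_sensitive_tier(proba, cost=COST):
    """Assign each borrower the tier k minimizing the expected cost
    E[cost | assign k] = sum_j proba[j] * cost[j, k]."""
    return np.argmin(proba @ cost, axis=1)

p = np.array([[0.80, 0.15, 0.05],   # clearly Low
              [0.30, 0.40, 0.30],   # Medium most probable, but costs push up
              [0.05, 0.25, 0.70]])  # clearly High
print(cost_sensitive_tier(p))  # [0 2 2]
```

The second borrower shows the practical difference: the most probable tier is Medium, yet the asymmetric costs assign High, exactly the conservative behavior a utility-based scheme is meant to encode.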

Author Contributions

Conceptualization, G.A. and Z.Y.; methodology, G.A.; formal analysis, G.A. and M.I.; investigation, G.A.; data curation, G.A.; writing—original draft preparation, G.A.; writing—review and editing, G.A., Z.Y. and M.I.; supervision, Z.Y.; project administration, G.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers 72071026 and 72271040).

Data Availability Statement

The data that support the findings of this study are available from the Postal Savings Bank of China (PSBC), but restrictions apply to their availability. The data were used under license for the current study and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission from PSBC. These data were previously analyzed in Chi et al. (2017) [16] (DOI: https://doi.org/10.3846/16111699.2017.1280844) and Bai et al. (2019) [8] (DOI: https://doi.org/10.1016/j.omega.2018.02.001).

Acknowledgments

The authors thank the editor and the anonymous reviewers for their constructive comments and suggestions, which helped improve the quality of this paper. The authors also appreciate the support and cooperation of the research team.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Barngetuny, J. The Impact of Weak Credit Covenants on Financial Stability and Economic Growth in Kenya. Int. J. Educ. Manag. Stud. 2025, 15, 19–28. [Google Scholar]
  2. Ahlin, C.; Debrah, G. Group lending with covariate risk. J. Dev. Econ. 2022, 157, 102855. [Google Scholar] [CrossRef]
  3. Iwendi, C.; Khan, S.; Anajemba, J.H.; Mittal, M.; Alenezi, M.; Alazab, M. The use of ensemble models for multiple class and binary class classification for improving intrusion detection systems. Sensors 2020, 20, 2559. [Google Scholar] [CrossRef] [PubMed]
  4. Wang, H.; Kou, G.; Peng, Y. Multi-class misclassification cost matrix for credit ratings in peer-to-peer lending. J. Oper. Res. Soc. 2020, 72, 923–934. [Google Scholar] [CrossRef]
  5. Foglia, A.; Iannotti, S.; Reedtz, P.M. The definition of the grading scales in banks’ internal rating systems. Econ. Notes 2001, 30, 421–456. [Google Scholar] [CrossRef]
  6. Ginglinger, E.; Moreau, Q. Climate risk and capital structure. Manag. Sci. 2023, 69, 7492–7516. [Google Scholar] [CrossRef]
  7. Moffatt, P.G. Hurdle models of loan default. J. Oper. Res. Soc. 2005, 56, 1063–1071. [Google Scholar] [CrossRef]
  8. Bai, C.; Shi, B.; Liu, F.; Sarkis, J. Banking credit worthiness: Evaluating the complex relationships. Omega 2019, 83, 26–38. [Google Scholar] [CrossRef]
  9. Zhou, Y.; Shen, L.; Ballester, L. A two-stage credit scoring model based on random forest: Evidence from Chinese small firms. Int. Rev. Financ. Anal. 2023, 89, 102755. [Google Scholar] [CrossRef]
  10. Uddin, M.S.; Chi, G.; Al Janabi, M.A.M.; Habib, T. Leveraging random forest in micro-enterprises credit risk modelling for accuracy and interpretability. Int. J. Financ. Econ. 2022, 27, 3713–3729. [Google Scholar] [CrossRef]
  11. Zhao, Y.; Lin, D. Prediction of Micro- and Small-Sized Enterprise Default Risk Based on a Logistic Model: Evidence from a Bank of China. Sustainability 2023, 15, 4097. [Google Scholar] [CrossRef]
  12. Oliveira, N.A.d.; Basso, L.F.C. Explaining Corporate Ratings Transitions and Defaults Through Machine Learning. Algorithms 2025, 18, 608. [Google Scholar] [CrossRef]
  13. Iqbal, M.; Qazi, A.; Ahmad, N.; Altaf, M. From blueprint to bit stream: Unveiling metaverse adoption challenges in construction supply chains of emerging economies—A TOE–ISM–machine learning framework. Eng. Constr. Arch. Manag. 2025, 1–26. [Google Scholar] [CrossRef]
  14. Zhang, X.; Zhang, T.; Hou, L.; Liu, X.; Guo, Z.; Tian, Y.; Liu, Y. Data-Driven Loan Default Prediction: A Machine Learning Approach for Enhancing Business Process Management. Systems 2025, 13, 581. [Google Scholar] [CrossRef]
  15. Bacevicius, M.; Paulauskaite-Taraseviciene, A. Machine Learning Algorithms for Raw and Unbalanced Intrusion Detection Data in a Multi-Class Classification Problem. Appl. Sci. 2023, 13, 7328. [Google Scholar] [CrossRef]
  16. Chi, G.; Abedin, M.Z.; E-Moula, F. Modeling credit approval data with neural networks: An experimental investigation and optimization. J. Bus. Econ. Manag. 2017, 18, 224–240. [Google Scholar] [CrossRef]
  17. Shwartz-Ziv, R.; Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 2022, 81, 84–90. [Google Scholar] [CrossRef]
  18. Tokimasa, I.; Ryotaro, S.; Goto, M. Optimizing FT-Transformer: Sparse Attention for Improved Performance and Interpretability. Ind. Eng. Manag. Syst. 2024, 23, 253–266. [Google Scholar] [CrossRef]
  19. Li, X.; Xiong, H.; Li, X.; Wu, X.; Zhang, X.; Liu, J.; Bian, J.; Dou, D. Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond. Knowl. Inf. Syst. 2022, 64, 3197–3234. [Google Scholar] [CrossRef]
  20. Nazemi, A.; Rezazadeh, H.; Fabozzi, F.J.; Höchstötter, M. Deep learning for modeling the collection rate for third-party buyers. Int. J. Forecast. 2022, 38, 240–252. [Google Scholar] [CrossRef]
  21. Altman, E.I. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J. Financ. 1968, 23, 589–609. [Google Scholar] [CrossRef]
  22. Chang, V.; Xu, Q.A.; Akinloye, S.H.; Benson, V.; Hall, K. Prediction of bank credit worthiness through credit risk analysis: An explainable machine learning study. Ann. Oper. Res. 2024, 354, 247–271. [Google Scholar] [CrossRef]
  23. Figini, S.; Bonelli, F.; Giovannini, E. Solvency prediction for small and medium enterprises in banking. Decis. Support Syst. 2017, 102, 91–97. [Google Scholar] [CrossRef]
  24. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  25. Borisov, A.; Krumberg, O. A theory of possibility for decision making. Fuzzy Sets Syst. 1983, 9, 13–23. [Google Scholar] [CrossRef]
  26. Baydili, İ.T.; Tasci, B. Predicting employee attrition: Xai-powered models for managerial decision-making. Systems 2025, 13, 583. [Google Scholar] [CrossRef]
  27. Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inf. Process. Syst. 2021, 34, 18932–18943. [Google Scholar]
  28. EBA. Discussion Paper on Machine Learning for IRB Models; European Banking Authority: Paris, France, 2021. [Google Scholar]
  29. Shaikh, T.A.; Rasool, T.; Verma, P.; Mir, W.A. A fundamental overview of ensemble deep learning models and applications: Systematic literature and state of the art. Ann. Oper. Res. 2024, 1–77. [Google Scholar] [CrossRef]
  30. Tahir, T.; Jahankhani, H.; Tasleem, K.; Hassan, B. Cross-Project Multiclass Classification of EARS-Based Functional Requirements Utilizing Natural Language Processing, Machine Learning, and Deep Learning. Systems 2025, 13, 567. [Google Scholar] [CrossRef]
  31. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
  32. Abbas, G.; Ying, Z.; Goutai, C.; Abbas, Q.; El Hindi, K. Exit without choice: Interpretable machine learning unlocks the structural drivers of smallholder dispossession in Pakistan. Agric. Food Econ. 2026, 14, 1. [Google Scholar] [CrossRef]
  33. Postal Savings Bank of China (PSBC); Dalian University of Technology (DUT). Credit Risk Lending Decision and Evaluation Report for Farmers; Postal Savings Bank of China Co., Ltd.: Beijing, China, 2014. [Google Scholar]
  34. Abbas, G.; Ying, Z.; Ayoubi, M. Consensus-driven feature selection for transparent and robust loan default prediction. Sci. Rep. 2025. [Google Scholar] [CrossRef] [PubMed]
  35. Su, L.; Xu, P. Common threshold in quantile regressions with an application to pricing for reputation. Econ. Rev. 2019, 38, 417–450. [Google Scholar] [CrossRef]
  36. Borisov, V.; Leemann, T.; Seßler, K.; Haug, J.; Pawelczyk, M.; Kasneci, G. Deep neural networks and tabular data: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 7499–7519. [Google Scholar] [CrossRef]
  37. Isomura, T.; Shimizu, R.; Goto, M. Sparse attention is all you need for pre-training on tabular data. Neural Comput. Appl. 2025, 37, 1509–1522. [Google Scholar] [CrossRef]
  38. Zhu, S.; Wu, H.; Ngai, E.W.T.; Ren, J.; He, D.; Ma, T.; Li, Y. A Financial Fraud Prediction Framework Based on Stacking Ensemble Learning. Systems 2024, 12, 588. [Google Scholar] [CrossRef]
  39. Galiani, S.; Petturiti, D.; Vantaggi, B. Credal Classification through an Ensemble of Confidence-Aware TabTransformers and its Application to Fraud Detection. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2025, 33, 875–909. [Google Scholar] [CrossRef]
  40. Dong, Q.; Zhu, X.; Gong, S. Single-label multi-class image classification by deep logistic regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar] [CrossRef]
  41. DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988, 44, 837–845. [Google Scholar] [CrossRef] [PubMed]
  42. Du, Z.; Lv, G. Can Digital Finance Unleash the Potential for Household Consumption? A Comparison Based on the Inconsistency Between Income and Consumption Classes. J. Theor. Appl. Electron. Commer. Res. 2025, 20, 275. [Google Scholar] [CrossRef]
  43. Kayit, A.D.; Ismail, M.T. Leveraging hybrid ensemble models in stock market prediction: A data-driven approach. Data Sci. Finance Econ. 2025, 5, 355–386. [Google Scholar] [CrossRef]
Figure 1. Methodological framework of the TL-StackLR model for multi-class loan default prediction.
Figure 2. Accuracy, F1 score, and G-mean of TL StackLR and baseline models on firm and farmer datasets, showing consistent multi-class performance gains of the hybrid framework.
Figure 3. Class-specific SHAP beeswarm plots for farmers, showing the direction and magnitude of feature impacts within each risk tier.
Figure 4. Class-specific SHAP beeswarm plots for firms, showing the direction and magnitude of feature impacts within each risk tier.
Figure 5. Stacked SHAP bar plots showing global feature importance for firm (left) and farmer (right) portfolios, decomposed by risk class (Low: blue, Medium: orange, High: green).
Table 1. Empirical validation of risk class monotonicity.
Risk Class | Total Borrowers | Actual Defaults | Default Rate (%) | p̂ Range
Farmers:
Low    | 681  | 41  | 6.02  | 0–0.4172
Medium | 682  | 80  | 11.73 | 0.4173–0.5377
High   | 681  | 107 | 15.71 | 0.5378–1
Firms:
Low    | 1018 | 5   | 0.49  | 0–0.0030
Medium | 1020 | 9   | 0.88  | 0.0031–0.468
High   | 1007 | 36  | 3.58  | 0.469–1
Table 2. Comparative performance of TL-StackLR and benchmark models on firm and farmer loan datasets.
Dataset | Techniques | ROC-AUC (95% CI) | PR-AUC | Log-Loss | Recall | Specificity | Precision | MCC | Runtime
Firms   | TL-StackLR | 0.986 (0.970, 0.994) | 0.959 | 0.170 | 0.948 | 0.971 | 0.953 | 0.903 | 122 s
Firms   | FT+LGBM    | 0.979 (0.959, 0.988) | 0.941 | 0.195 | 0.926 | 0.956 | 0.936 | 0.894 | 80 s
Firms   | LGBM+LR    | 0.964 (0.937, 0.976) | 0.934 | 0.201 | 0.916 | 0.949 | 0.921 | 0.889 | 56 s
Firms   | LGBM       | 0.957 (0.949, 0.976) | 0.927 | 0.204 | 0.912 | 0.936 | 0.916 | 0.877 | 34 s
Firms   | XGBoost    | 0.952 (0.939, 0.977) | 0.935 | 0.209 | 0.924 | 0.928 | 0.920 | 0.865 | 39 s
Firms   | FT+LR      | 0.901 (0.883, 0.925) | 0.885 | 0.265 | 0.870 | 0.888 | 0.875 | 0.769 | 77 s
Firms   | FT         | 0.900 (0.887, 0.924) | 0.882 | 0.293 | 0.868 | 0.884 | 0.871 | 0.763 | 59 s
Firms   | RF         | 0.898 (0.883, 0.913) | 0.876 | 0.325 | 0.854 | 0.901 | 0.843 | 0.754 | 43 s
Firms   | ANN        | 0.865 (0.843, 0.882) | 0.849 | 0.320 | 0.827 | 0.846 | 0.834 | 0.710 | 60 s
Firms   | LR         | 0.793 (0.779, 0.814) | 0.778 | 0.356 | 0.755 | 0.788 | 0.747 | 0.678 | 31 s
Farmers | TL-StackLR | 0.972 (0.956, 0.988) | 0.945 | 0.188 | 0.931 | 0.960 | 0.941 | 0.892 | 106 s
Farmers | FT+LGBM    | 0.965 (0.946, 0.978) | 0.932 | 0.198 | 0.914 | 0.945 | 0.917 | 0.885 | 72 s
Farmers | LGBM+LR    | 0.956 (0.943, 0.969) | 0.928 | 0.213 | 0.910 | 0.941 | 0.914 | 0.877 | 49 s
Farmers | LGBM       | 0.943 (0.935, 0.966) | 0.919 | 0.214 | 0.904 | 0.926 | 0.906 | 0.865 | 27 s
Farmers | XGBoost    | 0.941 (0.938, 0.957) | 0.921 | 0.215 | 0.916 | 0.918 | 0.910 | 0.855 | 34 s
Farmers | FT+LR      | 0.898 (0.883, 0.915) | 0.874 | 0.268 | 0.865 | 0.876 | 0.864 | 0.758 | 70 s
Farmers | FT         | 0.890 (0.884, 0.907) | 0.871 | 0.297 | 0.852 | 0.864 | 0.858 | 0.757 | 52 s
Farmers | RF         | 0.884 (0.871, 0.903) | 0.866 | 0.327 | 0.848 | 0.900 | 0.836 | 0.746 | 37 s
Farmers | ANN        | 0.857 (0.843, 0.875) | 0.834 | 0.323 | 0.813 | 0.839 | 0.824 | 0.704 | 50 s
Farmers | LR         | 0.776 (0.762, 0.788) | 0.754 | 0.362 | 0.738 | 0.759 | 0.734 | 0.656 | 23 s
Table 3. Pairwise DeLong test p-values for ROC-AUC comparisons.
Dataset | Comparison                    | p-Value
Firms   | TL-StackLR vs. LR             | <0.001
Firms   | TL-StackLR vs. LightGBM       | 0.004
Firms   | TL-StackLR vs. FT + LightGBM  | 0.021
Farmers | TL-StackLR vs. LR             | <0.001
Farmers | TL-StackLR vs. LightGBM       | 0.003
Farmers | TL-StackLR vs. FT + LightGBM  | 0.018
Note: All tests were applied to the final hold-out test sets using the DeLong method for correlated ROC curves [41]; p < 0.05 indicates a significant AUC improvement.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Abbas, G.; Ying, Z.; Iqbal, M. From Binary Scores to Risk Tiers: An Interpretable Hybrid Stacking Model for Multi-Class Loan Default Prediction. Systems 2026, 14, 78. https://doi.org/10.3390/systems14010078


