Article

An Innovation of the Zero-Inflated Binary Classification in Credit Scoring Using Two-Stage Algorithms

1
Public Administration Department, Fujian Police College, Fuzhou 350007, China
2
Department of Mathematical Sciences, University of South Dakota, Vermillion, SD 57069, USA
3
Department of Statistics and Data Science, Tamkang University, New Taipei City 251301, Taiwan
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(5), 800; https://doi.org/10.3390/math14050800
Submission received: 27 January 2026 / Revised: 23 February 2026 / Accepted: 24 February 2026 / Published: 27 February 2026

Abstract

Zero-inflated and class-imbalanced data present significant challenges in credit scoring. Zero-Inflated Bernoulli Distribution (ZIBD) models help handle excess zeros. However, the S-shaped function and the neglect of misclassification costs may degrade the ZIBD model’s classification performance. To address these challenges, this paper proposes a novel two-stage algorithm that integrates an optimized ZIBD model with Random Forest, Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), respectively. Specifically, we develop a new loss function that incorporates cross-entropy and example-dependent cost-sensitive factors to optimize the ZIBD model, thereby minimizing cost risks. Subsequently, we integrate baseline models to compensate for the ZIBD model’s classification deficiencies. This hybrid approach effectively mitigates the impact of structural zeros in imbalanced data while enhancing model robustness. The performance of the proposed method is validated using two real-world banking datasets. Experimental results demonstrate that the proposed two-stage algorithm significantly outperforms its competitors in both machine-learning metrics and savings. Hence, the proposed two-stage algorithm offers a more effective solution for zero-inflated banking data.

1. Introduction

In the modern financial landscape, accurately predicting credit default is crucial for effective risk management and institutional profitability. As market complexity increases, credit default prediction has become a high-stakes decision-support task. Misclassifying a default case as non-default can lead to significant financial losses and systemic risks [1,2]. A persistent challenge in this domain is the prevalence of imbalanced class distributions, characterized by zero inflation, in which the majority of observations are non-defaulters. These datasets typically contain a mixture of structural zeros and stochastic outcomes governed by a Bernoulli process, which standard classifiers struggle to distinguish [3]. While standard models often fail to disentangle these two distinct sources of non-default, zero-inflated models have demonstrated superior performance by explicitly accounting for this dual-source mechanism.
Lambert [4] pioneered the zero-inflated Poisson (ZIP) model and discussed its applications. Hall [5] studied the ZIP and zero-inflated binomial regression models with random effects. Rodrigues [6] applied the Bayesian method to estimate the parameters of zero-inflated distributions. Ghosh et al. [7] and Gelfand et al. [8] studied Bayesian inference procedures to obtain the estimates of the zero-inflated regression model parameters. Diop et al. [9] proposed a simulation-based procedure for the inference on a zero-inflated Bernoulli regression (ZIBR) model. Staub and Winkelmann [10] studied the estimation consistency of zero-inflated count models. Lee et al. [11] discussed the validation of the maximum likelihood estimation method for a ZIBR model when missing covariates are found. Li and Lu [12] studied the inference procedure of the semiparametric ZIBR model. Chiang et al. [13] proposed an expectation-maximization algorithm to obtain the maximum likelihood estimates of the ZIBD model parameters. Lu et al. [14] investigated the penalized estimation for ZIBR models. Pho [15] proposed a goodness-of-fit test method to infer the ZIBR model. Xin et al. [16] used a regularization rule in the loss function to estimate ZIBD model parameters. Pho [17] proposed a zero-inflated Probit Bernoulli model and discussed its applications. Su et al. [18] proposed machine learning methods of using gradient descent methods with a non-constant learning rate to estimate the ZIBD model parameters.
The ZIBD model provides a robust alternative to many machine learning (ML) methods, enabling the expression of causal relationships through analytically defined functional forms. This interpretability is crucial for the real-world banking business, where understanding the link between features and response variables is mandatory. However, in high-dimensional feature spaces, maintaining a high classification rate while identifying significant features becomes increasingly complex [16,18]. Despite its strengths, applying ZIBD in modern credit scoring faces several challenges.
A critical discrepancy arises from the misalignment between statistical loss functions and actual financial stakes. Traditional ZIBD models typically use a standard cross-entropy loss function, which treats all misclassifications equally. In credit scoring, the cost of a False Negative varies significantly by loan line and borrower profile. A cost-sensitivity consideration necessitates a mathematically optimized approach that incorporates example-dependent cost-sensitive (EDCS) factors to align model objectives with real-world economic risks [19].
Even with an appropriate loss function, the model’s reliability depends on the robustness of its optimization trajectory. Parameter estimation in ZIBD models often relies on gradient-based methods, such as the Newton–Raphson algorithm, which is notoriously sensitive to initial values. In high-dimensional and imbalanced credit data, a poorly chosen starting point can lead the search to a local optimum rather than the global minimum. This sensitivity calls for algorithmic intervention, specifically the integration of meta-heuristic search strategies to stabilize the search over the parameter space.
The challenges of applying a ZIBD model with the existing parameter estimation methods are as follows:
1.
A standalone ZIBD model remains constrained by its structural simplicity. Although it excels at identifying structural zeros, its binary predictions rely on a simple S-shaped logistic function, which may be unable to capture the intricate, non-linear interactions between financial variables.
2.
If cost is a critical concern, the existing estimation method for the ZIBD model does not include cost factors in the maximum likelihood or loss functions. The resulting model may be unrealistic for handling a dataset with cost factors.
3.
There is a lack of discussion of two-stage estimation that integrates ensemble machine-learning methods and uses heuristic algorithms to support gradient-based estimation for the ZIBD model.
Advanced ensemble models, such as Random Forest, XGBoost, and LightGBM, are designed to address the challenges mentioned above. Consequently, there is a compelling rationale for a two-stage framework that leverages the “divide and conquer” principle: using ZIBD as a structural filter to handle the distributional anomaly of zero inflation, while delegating the complex non-linear mapping to high-performance ensemble algorithms.
By addressing these intertwined challenges, this study proposes a comprehensive algorithmic framework that transforms the ZIBD from a theoretical statistical model into a powerful, cost-aware decision tool for financial risk management. The contributions of this study are as follows:
1.
A new loss function and an optimization process are proposed to construct the ZIBD model. To the best of our knowledge, no studies have considered using EDCS factors in the loss function to build a ZIBD model. In this study, the new loss function integrates the EDCS factors and the cross-entropy function.
2.
Because the quality of the gradient method is sensitive to initial parameter solutions, it is challenging to obtain better initial parameter solutions in real-world applications. The initial parameter solutions of the Newton–Raphson method are provided from solutions generated by the particle swarm optimization (PSO) and differential evolution (DE) algorithms. Then, the best model is produced by competing with any two initial solutions to minimize the target loss function.
3.
The ZIBD model accounts for structural zeros but uses an S-shaped function for binary prediction. The S-shaped function is simple to use, but it may struggle to achieve a high classification rate in some cases. The optimized ZIBD model is used to remove structural-zero cases that are correctly classified. The cleaned dataset, after removing structural zeros, is used to train the RF, XGBoost, and LightGBM models, thereby improving their classification performance. In this study, the proposed two-stage models are referred to as RF-ZIBD, XGB-ZIBD, and LG-ZIBD, respectively.
Because the final predicted values from the proposed two-stage strategy are not weighted via stacking, the proposed two-stage method differs from SuperLearner. The implementation of the proposed two-stage strategy will be discussed in Section 2.
The proposed two-stage strategy employs a metaheuristic-inspired approach that combines DE and PSO with the Newton–Raphson method to navigate the solution space. Typically, metaheuristic learning combines heuristic optimization algorithms with machine learning methods to navigate complex search spaces. Kaveh and Mesgari [20] reviewed the optimization of Artificial Neural Networks (ANNs) and Deep Learning methods using metaheuristic algorithms. Rahman [21] proposed a new group metaheuristic algorithm inspired by interactions among individuals and the group leader’s influence on members. Inspired by the intelligence and life of pumas, Abdollahzadeh et al. [22] proposed a novel metaheuristic algorithm, named the Puma Optimizer, and showed its good performance on all kinds of optimization problems. Jia and Lu [23] improved three metaheuristic algorithms, including Stochastic Fractal Search and the Marine Predators Algorithm; the models’ performance was then evaluated on 57 constrained engineering problems. Dasi et al. [24] studied the performance of support vector regression and XGBoost algorithms enhanced by the Satin Bowerbird Optimizer, Ant Lion Optimizer, Artificial Ecosystem-Based Optimizer, Slime Mold Optimizer, Moth-Flame Optimizer, and PSO. Helforoush and Sayyad [25] studied the prediction performance of machine-learning techniques to improve obesity risk assessment and recommended their proposed ANN-PSO model for achieving this goal. Kowalski et al. [26] investigated the ability of metaheuristic algorithms to enhance the training performance of Probabilistic Neural Networks. Ahmed et al. [27] investigated the performance of 10 machine learning methods on a dataset of dye contamination in water sources; in their study, DE, PSO, and genetic algorithms (GA) were used to optimize tree-based models for predicting contaminant levels.
Donmez and Gucluer [28] provided a comprehensive discussion of the performance of LightGBM, Categorical Boosting, RF, and Extremely Randomized Trees. PSO and Dwarf Mongoose Optimization methods were used to tune the hyperparameters of these models. Tran [29] proposed a metaheuristic model that integrates an artificial neural network with Teaching–Learning-Based Optimization to improve the performance of breaking wave height prediction.
The remainder of this paper is organized as follows. Section 2 addresses the zero-inflated model and presents the new loss function for ZIBD modeling; in addition, algorithms are proposed to minimize the new loss function using DE and PSO. Section 3 presents the proposed two-stage algorithms. In Section 4 and Section 5, the performance of the proposed RF-ZIBD, XGB-ZIBD, and LG-ZIBD methods is illustrated using two real banking datasets, and the performance of the proposed two-stage methods is compared with that of the baseline methods RF, XGBoost, and LightGBM. Model transparency, governance, and validation are discussed in Section 6. Finally, the strengths and limitations of the proposed estimation strategy are discussed, and future studies on this topic are outlined in Section 7.

2. Methods

The ZIBD model is addressed in Section 2.1. Moreover, we analytically present the process of optimizing the loss function of the ZIBD model with cost considerations.

2.1. The ZIBD Model

Let p = P(Y = 1), where Y = 0 or 1 is a binary response variable. Typically, Y follows the Bernoulli distribution, denoted by Y ∼ B(1, p). However, many collected samples exhibit a preponderance of structural zeros in the majority group because those subjects are not at risk for the specific event of interest. Structural zeros make the probability model of Y different from B(1, p), and the collected dataset is imbalanced with excess zeros.
Let W be a latent binary variable that controls whether Y = 0 is structural or not. Specifically, W = 1 indicates that Y = 0 is a structural zero, and W = 0 means that B(1, p) determines the outcome of Y. In other words, W = 0 implies that Y ∼ B(1, p) with p = P(Y = 1 | W = 0). Let y_i denote the binary response of the ith of n individuals randomly sampled from the distribution, and let π_i = P(Y_i = 1) = 1 − P(Y_i = 0) denote the probability that the ith individual is at risk of the specific event for i = 1, 2, …, n. It is trivial to show that
1 − π_i = P(Y_i = 0) = δ_i + (1 − δ_i)(1 − p_i) = 1 − (1 − δ_i) p_i, i.e., π_i = (1 − δ_i) p_i,
where δ_i = P(W_i = 1) and p_i = P(Y_i = 1 | W_i = 0), i = 1, 2, …, n. Assume that two sets of features x_i = (x_{i,1}, x_{i,2}, …, x_{i,m_1})^T and z_i = (z_{i,1}, z_{i,2}, …, z_{i,m_2})^T are available and can be linked to p_i and δ_i, respectively, by
p_i = 1 / (1 + exp(−x_i^T β))
and
δ_i = 1 / (1 + exp(−z_i^T γ)), i = 1, 2, …, n,
where β = (β_1, β_2, …, β_{m_1})^T and γ = (γ_1, γ_2, …, γ_{m_2})^T. Let y = (y_1, y_2, …, y_n)^T denote the set of responses in the ZIBD model with the features x_i and z_i, i = 1, 2, …, n. Denote the complete dataset by d = {y, x_1, …, x_n, z_1, …, z_n}. The typical cross-entropy loss function can be presented by
L_1(Θ | d) = −∑_{i=1}^{n} [ y_i ln π_i(Θ) + (1 − y_i) ln(1 − π_i(Θ)) ],
where Θ = (θ_1, θ_2, …, θ_{m_1+m_2})^T = (β_1, β_2, …, β_{m_1}, γ_1, γ_2, …, γ_{m_2})^T.
In banking, credit scoring is a powerful tool for risk assessment and a crucial decision-making factor for financial institutions. Many existing studies have adopted cost-sensitive models for various binary classification methods, including logistic regression, survival analysis, and ML models, to implement credit risk assessments. Let c F P denote the cost of misclassifying a non-default customer as default, and c F N denote the cost of misclassifying a default customer as non-default. Using cost-sensitive factors in the loss function L 1 ( Θ | d ) results in
L_2(Θ | d) = −∑_{i=1}^{n} [ c_i^{FN} y_i ln π_i + c_i^{FP} (1 − y_i) ln(1 − π_i) ].
It can be shown that L_2(Θ | d) reduces to L_1(Θ | d) when c_i^{FP} = c_i^{FN} = 1 for i = 1, 2, …, n. Hence, L_2(Θ | d) is a generalization of L_1(Θ | d).
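As a concrete illustration, L_2(Θ | d) under the ZIBD parameterization can be sketched in Python as follows. This is a minimal sketch assuming NumPy; the function and variable names are ours, not from a specific implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def zibd_cost_sensitive_loss(theta, X, Z, y, c_fn, c_fp):
    """Cost-sensitive cross-entropy loss L_2 for the ZIBD model.

    theta stacks (beta, gamma); pi_i = (1 - delta_i) * p_i, where
    p_i = sigmoid(x_i' beta) and delta_i = sigmoid(z_i' gamma).
    c_fn and c_fp are the example-dependent misclassification costs.
    """
    m1 = X.shape[1]
    beta, gamma = theta[:m1], theta[m1:]
    p = sigmoid(X @ beta)        # P(Y_i = 1 | W_i = 0)
    delta = sigmoid(Z @ gamma)   # P(W_i = 1): structural-zero probability
    pi = np.clip((1.0 - delta) * p, 1e-12, 1.0 - 1e-12)
    # L_2 reduces to the plain cross-entropy L_1 when c_fn = c_fp = 1
    return -np.sum(c_fn * y * np.log(pi) + c_fp * (1 - y) * np.log(1 - pi))
```

Setting both cost vectors to ones recovers L_1, matching the generalization property noted above.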
The gradient method can be used to minimize the target loss function. However, the quality of the gradient method is quite sensitive to the initial parameter values. For big data with a large number of features, the Newton–Raphson method may not yield reliable parameter estimates for the ZIBD model, thereby affecting its classification performance. Hence, heuristic algorithms, including PSO (see Algorithm 1) and DE (see Algorithm 2), are suggested to minimize L_2(Θ | d). Then, the two resulting solution sets are used as initial values for the Newton–Raphson method to search for the optimal ZIBD model.
Let Θ_i = (θ_{i,1}, θ_{i,2}, …, θ_{i,m_1+m_2}) denote the ith particle for i = 1, 2, …, N_p. Let θ_{j,L} and θ_{j,U} be the minimum and maximum of θ_j for j = 1, 2, …, m_1 + m_2. The PSO can be implemented using the following steps:
Algorithm 1 The PSO algorithm
  • P-I. Initialization: Let t = 0 .
 -
Generating Θ(0) = {Θ_1(0), Θ_2(0), …, Θ_{N_p}(0)}: For 1 ≤ i ≤ N_p, θ_{i,j}(0) ∼ U(θ_{j,L}, θ_{j,U}), j = 1, 2, …, m_1 + m_2.
 -
Select w, c_1, c_2, and t_U^PSO. For i = 1, 2, …, N_p,
pB_i ← Θ_i(0); gB ← Θ_k(0) if L(Θ_k(0) | d) = min_{i=1,2,…,N_p} L(Θ_i(0) | d).
  • P-II. Update parameters: For iteration t = 0, 1, 2, …, t_U^PSO, where t_U^PSO is the maximal number of iterations for PSO. For i = 1, 2, …, N_p,
    ν_i(t+1) ← w × ν_i(t) + c_1 × r_1 × (pB_i − Θ_i(t)) + c_2 × r_2 × (gB − Θ_i(t)),
    Θ_i(t+1) ← Θ_i(t) + ν_i(t+1),
    pB_i ← Θ_i(t+1) if L(Θ_i(t+1) | d) ≤ L(pB_i | d),
    gB ← Θ_i(t+1) if L(Θ_i(t+1) | d) ≤ L(gB | d),
    where ν_i denotes the velocity of particle Θ_i; r_1 and r_2 are random numbers drawn from U(0, 1); and the inertia weight w and acceleration coefficients (c_1 and c_2) are used to control the impact of the previous velocity and to balance the personal best (pB) and global best (gB).
  • P-III. Termination: Implement Step P-II until a convergence condition is achieved or t = t_U^PSO.
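Steps P-I to P-III can be sketched as a short NumPy routine. This is a minimal sketch with generic hyperparameter defaults; all names are illustrative, and the loss argument would be L_2(Θ | d) in the present context.

```python
import numpy as np

def pso_minimize(loss, lower, upper, n_particles=30, max_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO sketch following Steps P-I to P-III."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    # P-I: initialize positions uniformly within the bounds
    pos = rng.uniform(lower, upper, size=(n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([loss(p) for p in pos])
    g = np.argmin(pbest_val)
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for _ in range(max_iter):  # P-II: velocity and position updates
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([loss(p) for p in pos])
        improved = vals <= pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = np.argmin(pbest_val)
        if pbest_val[g] <= gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, gbest_val  # P-III: best solution found
```

On a simple test function such as the sphere loss(Θ) = ∑_j θ_j², this sketch converges close to the global minimum at the origin.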
Let F denote the differential weight and CR be a crossover probability. The DE algorithm can be implemented using the following steps:
Algorithm 2 The DE algorithm
  • D-I. Initialization: Let t = 0 .
 -
Let Θ(0) = {Θ_1(0), Θ_2(0), …, Θ_{N_p}(0)}: For i = 1, 2, …, N_p and j = 1, 2, …, m_1 + m_2, generate rand ∼ U(0, 1) and set
θ_{i,j} ← θ_{j,L} + rand × (θ_{j,U} − θ_{j,L}),
 -
Select parameters F and CR from the intervals of [ 0 , 2 ] and [ 0 , 1 ] , respectively.
  • D-II. Update Parameters: For t = 0, 1, 2, …, t_U^DE, where t_U^DE is the maximum number of iterations.
 -
Mutation: For Θ_k(t), randomly select Θ_i(t), Θ_j(t), and Θ_h(t), i ≠ j ≠ h, from Θ_{(k)}, which is the set Θ with Θ_k(t) removed.
τ_k(t+1) ← Θ_i(t) + F × (Θ_j(t) − Θ_h(t)), k = 1, 2, …, N_p.
 -
Crossover: Randomly select R from {1, 2, …, N_p}. Generate rand ∼ U(0, 1).
u_k(t+1) ← τ_k(t+1) if rand < CR or k = R;
otherwise, u_k(t+1) ← Θ_k(t).
 -
Selection:
Θ_k(t+1) ← u_k(t+1) if L(u_k(t+1) | d) ≤ L(Θ_k(t) | d);
otherwise, Θ_k(t+1) ← Θ_k(t).
  • D-III. Termination: Implement Step D-II until the convergence condition is reached or t = t_U^DE.
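Analogously, Steps D-I to D-III can be sketched as follows. Note that this sketch uses the common component-wise (binomial) crossover with one guaranteed index per trial vector, rather than the vector-wise rule in Algorithm 2; all names are illustrative.

```python
import numpy as np

def de_minimize(loss, lower, upper, n_pop=30, max_iter=200,
                F=0.8, CR=0.9, seed=0):
    """Minimal DE sketch: mutation, binomial crossover, greedy selection."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    # D-I: initialize the population within the bounds
    pop = lower + rng.random((n_pop, dim)) * (upper - lower)
    vals = np.array([loss(p) for p in pop])
    for _ in range(max_iter):  # D-II
        for k in range(n_pop):
            # Mutation: three distinct members other than k
            i, j, h = rng.choice([m for m in range(n_pop) if m != k],
                                 size=3, replace=False)
            trial = pop[i] + F * (pop[j] - pop[h])
            # Binomial crossover with a guaranteed component R
            R = rng.integers(dim)
            mask = rng.random(dim) < CR
            mask[R] = True
            u = np.where(mask, trial, pop[k])
            v = loss(u)
            if v <= vals[k]:  # Selection: keep the better vector
                pop[k], vals[k] = u, v
    best = np.argmin(vals)
    return pop[best], vals[best]  # D-III
```

As with the PSO sketch, the returned solution can serve as one of the two initial solutions fed to the Newton–Raphson refinement.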

2.2. The RF Model

RF is an ensemble learning method that builds a number of decision trees and combines their results to enhance predictive performance and robustness. The core idea of RF is to use bagging (bootstrap aggregating) instead of relying on a single decision tree. Each tree in the RF strategy is trained on a different bootstrap sample from the original dataset, and only a random subset of features is considered at each split. This randomness efficiently reduces the correlation among trees, enhances the ensemble’s ability to capture nonlinear patterns, and reduces the instability of an individual decision tree.
For categorical response variables, predictions are aggregated by majority voting, which also mitigates overfitting. RF can handle noisy observations with minimal preprocessing and efficiently process high-dimensional data. RF has been widely used in areas such as finance, healthcare, marketing, and environmental science for prediction and key-feature identification. Recent applications of RF include Salman et al. [30], Iranzad and Liu [31], and Mallala et al. [32].
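As an illustration, a random forest with bootstrap sampling and per-split feature subsampling can be fitted with scikit-learn; the toy data below are hypothetical, not the paper’s datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# max_features controls the random feature subset considered at each split;
# each of the 200 trees is trained on its own bootstrap sample.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)
pred = rf.predict(X)
```

Class predictions are obtained by majority voting over the individual trees, as described above.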

2.3. The XGBoost Model

Unlike the RF method, XGBoost is an ensemble learning method based on a gradient-boosting framework that emphasizes efficiency, scalability, and high predictive accuracy. XGBoost develops decision trees sequentially, and each new tree corrects the errors introduced by the previous trees. The mechanism of XGBoost is to optimize an objective function with penalty terms in the loss function, which enables the model to learn complex nonlinear patterns while maintaining better generalization performance.
Because XGBoost incorporates practical engineering features, including efficient handling of missing values, overfitting penalties, shrinkage via the learning rate, and row and column subsampling, it is also effective at handling large-scale, high-dimensional datasets. Moreover, the XGBoost method performs well in terms of computational speed. XGBoost is a popular ensemble learning method widely used in recommendation systems, fraud and credit risk identification, and data science competitions. Recent applications of XGBoost include Niazkar et al. [33], Wiens et al. [34], and Imani et al. [35].
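The sequential error-correcting mechanism described above can be illustrated with a bare-bones least-squares boosting loop. This is a sketch only: XGBoost additionally uses second-order gradients, regularized objectives, and many engineering optimizations, and the toy data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1]

pred = np.zeros(len(y))
lr = 0.1          # shrinkage (learning rate)
trees = []
for _ in range(100):
    resid = y - pred                      # errors of the current ensemble
    t = DecisionTreeRegressor(max_depth=2).fit(X, resid)  # fit the residuals
    trees.append(t)
    pred += lr * t.predict(X)             # add the new tree's contribution
```

Each shallow tree is fitted to the residuals of the ensemble so far, so the training error shrinks round by round.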

2.4. The LightGBM Model

Similar to XGBoost, LightGBM also uses a gradient-boosting framework. LightGBM was designed to maintain high efficiency and scalability in real-world applications, especially for processing large and high-dimensional datasets. LightGBM sequentially builds decision trees. Each new tree is trained to reduce the errors from the previous ensemble. However, LightGBM uses an efficient strategy, called leaf-wise (or best-first) tree growth, to replace the level-wise strategy used in traditional gradient-boosting ensemble methods. The leaf-wise strategy allows LightGBM to focus on splitting the leaf that yields the largest loss reduction, enabling it to learn more complex patterns with fewer trees.
LightGBM also incorporates several optimization techniques, including histogram-based feature discretization and efficient handling of categorical features. Another strength of LightGBM is its support for parallel and GPU training, which significantly reduces memory usage and computation time while maintaining high predictive accuracy. LightGBM is widely applied in domains such as online advertising, ranking systems, finance, and large-scale business analytics, and it retains the strengths of fast training and strong performance on large-scale datasets. Recent applications of LightGBM include Li et al. [36], Long et al. [37], and Lian et al. [38].

3. The Two-Stage Algorithms

The following steps can be used to implement the proposed RF-ZIBD algorithm (see Algorithm 3):
Algorithm 3 The RF-ZIBD algorithm
I.
Obtaining the Initial Solutions of Θ .
-
Θ̂_P ← arg min_{Θ ∈ Ω_Θ} L(Θ | d) using Algorithm 1, where Θ̂_P = (θ̂_{P,1}, θ̂_{P,2}, …, θ̂_{P,m_1+m_2}).
-
Θ̂_D ← arg min_{Θ ∈ Ω_Θ} L(Θ | d) using Algorithm 2, where Θ̂_D = (θ̂_{D,1}, θ̂_{D,2}, …, θ̂_{D,m_1+m_2}).
II.
Obtaining the Optimal Solution of Θ :
-
Θ̂_PSO ← arg min_{Θ ∈ Ω_Θ} L(Θ | d) using the Newton–Raphson method with Θ̂_P as the initial solution.
-
Θ̂_DE ← arg min_{Θ ∈ Ω_Θ} L(Θ | d) using the Newton–Raphson method with Θ̂_D as the initial solution.
-
Θ̂ ← Θ̂_PSO if L(Θ̂_PSO | d) ≤ L(Θ̂_DE | d); otherwise, Θ̂ ← Θ̂_DE. The optimal estimate of Θ is Θ̂.
III.
Removing Structural-Zero Individuals:
-
Condition A: If δ̂_i ≥ 0.80 and y_i = 0, identify y_i as a structural zero.
-
Use Condition A to check all individuals and remove all individuals that are classified as structural-zero cases. Denote the reduced dataset by d*.
IV.
Establish the RF-ZIBD Model:
-
The RF-ZIBD Model: Use an RF algorithm to establish a classification model based on the dataset d*.
-
Use the k-fold cross-validation method to evaluate the performance of the RF-ZIBD model based on the metrics ACC, REC, SPE, PRE, F_1, AUC, and Savings, defined by
Savings = (Cost_L − Cost_f) / Cost_L,
where Cost_f is the loss based on the RF-ZIBD model and Cost_L is the minimum of Cost_0 and Cost_1; Cost_0 and Cost_1 are the losses incurred when the classifier predicts all examples as y = 0 and y = 1, respectively.
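Step III (Condition A) amounts to a simple filtering rule. The sketch below assumes the δ̂_i values from the fitted ZIBD model are available as an array; the function and variable names are illustrative.

```python
import numpy as np

def remove_structural_zeros(X, y, delta_hat, threshold=0.80):
    """Return the reduced dataset d* after applying Condition A.

    An individual is dropped when delta_hat_i >= threshold and y_i = 0,
    i.e., when the ZIBD model classifies it as a structural zero.
    """
    structural = (delta_hat >= threshold) & (y == 0)
    keep = ~structural
    return X[keep], y[keep]
```

The second-stage classifier (RF, XGBoost, or LightGBM) is then trained on the returned arrays.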
By replacing the RF in the integrated model with XGBoost and LightGBM, the RF-ZIBD model can be converted to an XGB-ZIBD and LG-ZIBD model, respectively. The metrics accuracy (ACC), recall (REC or sensitivity), specificity (SPE), precision (PRE), F 1 -score ( F 1 ), and the area under the curve (AUC) are used to evaluate the prediction performance of the competing models in this study. They are defined by
ACC = (TP + TN) / (TP + TN + FP + FN),
REC = TP / (TP + FN),
SPE = TN / (TN + FP),
PRE = TP / (TP + FP),
F_1 = 2 / (1/REC + 1/PRE) = (2 × REC × PRE) / (REC + PRE),
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in a confusion matrix, respectively. The AUC is the area under the receiver operating characteristic (ROC) curve, which plots REC against 1 − SPE. REC and SPE estimate the true-positive and true-negative rates, respectively. F_1 is the harmonic mean of REC and PRE. Only categorical response variables are investigated in this study; metrics for continuous response variables, such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE), are not used.
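The metrics above and the Savings measure can be computed directly from a confusion matrix. The following is a minimal sketch; the function names are ours.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """ACC, REC, SPE, PRE, and F1 from a binary confusion matrix."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    rec = tp / (tp + fn)                 # true-positive rate
    spe = tn / (tn + fp)                 # true-negative rate
    pre = tp / (tp + fp)
    f1 = 2 * rec * pre / (rec + pre)     # harmonic mean of REC and PRE
    return {"ACC": acc, "REC": rec, "SPE": spe, "PRE": pre, "F1": f1}

def savings(cost_f, cost_0, cost_1):
    """Savings = (Cost_L - Cost_f) / Cost_L, Cost_L = min(Cost_0, Cost_1)."""
    cost_L = min(cost_0, cost_1)
    return (cost_L - cost_f) / cost_L
```

For instance, a model whose total cost is half the cost of the better all-0/all-1 baseline yields a Savings of 0.5.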
Existing methods struggle to balance performance across important metrics. The proposed two-stage algorithms remove structural-zero cases in the first stage using the ZIBD model, which is optimized to minimize a loss function with cost considerations. The cleaned data are then used to establish the RF, XGBoost, and LightGBM models, and the one with the best performance is selected. Hence, the performance of the proposed two-stage algorithms is better than that of a single RF, XGBoost, or LightGBM model. To illustrate the applications of the proposed two-stage algorithms, two banking examples are used to evaluate their performance, and the results are compared with those of the RF, XGBoost, and LightGBM models.

4. Example 1: The Micro and Small Enterprises Dataset

The first dataset, from a Chinese commercial bank, concerns loans to Micro and Small Enterprises (SMEs). The objective of this example is to build a model that successfully identifies default SMEs. This dataset cannot be made publicly available for confidentiality reasons.
The SME loan dataset contains fifteen features, and the response variable y indicates whether an SME is a default or non-default case. There are 8524 loan records from 2014 to 2019. The dataset is highly imbalanced: only 1.68% of the SMEs are default cases. Key features include Industry, Loan Line, Interest Rate, Term, and entrepreneur characteristics such as age, education, and income. The features and response variable in this dataset are defined as follows:
  • Industry: the investment-targeted industry, with 1 = Primary Industry, 2 = Secondary Industry, and 3 = Tertiary Industry.
  • Entrusted payment: 0 = no, 1 = yes, and 2 = unknown.
  • Prepay: prepayment status, with 0 = no, 1 = yes, and 2 = unknown.
  • Interest rate: exercise interest rate.
  • Rate mode: interest rate setting type, with 0 = fixed, 1 = floating, 2 = periodic auto-adjustment.
  • Rate floating direction: 0 = upward, 1 = downward, 2 = non-floating/fixed.
  • Loan line: loan amount.
  • Term: loan duration in months.
  • Gender: entrepreneur’s gender, with 1 = male, 2 = female.
  • Age: entrepreneur’s age in years.
  • Education: entrepreneur’s education, with 1 = primary school, 2 = junior high school, 3 = technical secondary school or senior high school, 4 = junior college or bachelor’s degree, 5 = postgraduate and above, 0 = unknown.
  • Household income: entrepreneur’s annual household income.
  • Personal income: entrepreneur’s personal monthly income.
  • Type of enterprises: 1 = non-state-owned domestic/foreign enterprises, 2 = self-employed, 3 = unknown.
  • Residency: entrepreneur’s housing status, with 0 = rental housing, 1 = self-owned (with mortgage balance), 2 = self-owned (without mortgage balance).
  • Marriage: entrepreneur’s marital status, with 0 = unmarried/divorced; 1 = married; 2 = unknown.
Preprocessing involved removing inconsistent interest rate records. Descriptive statistics are summarized in Table 1. Key model parameters for the dataset, including the average interest rate, cost of funds, loan term, and loss given default, are provided in Table 2. Popular ML methods such as RF, XGBoost, and LightGBM cannot consistently achieve satisfactory results across metrics such as ACC, AUC, PRE, SPE, F_1, and Savings; the proposed integrated method significantly mitigates this drawback. The metrics are obtained from 10 repetitions of the 5-fold cross-validation method, and the mean and standard deviation (sd) of the ten metric values are reported in Table 3.
Table 3 summarizes the performance of the proposed ZIBD-integrated methods (RF-ZIBD, XGB-ZIBD, and LG-ZIBD) compared to the baseline ML classifiers RF, XGBoost, and LightGBM. The empirical results demonstrate that incorporating the ZIBD module significantly enhances the predictive performance of all three base classifiers. As shown in Table 3, the proposed ZIBD-integrated methods consistently outperform the standard algorithms across all critical metrics. This improvement suggests that the ZIBD framework effectively captures the characteristics of structural zeros and addresses the data imbalance inherent in credit risk assessment.
In terms of discriminatory power, the proposed methods achieved superior AUC scores. The LG-ZIBD model yielded the highest AUC of 0.9125, followed by RF-ZIBD (0.9053) and XGB-ZIBD (0.9026). In contrast, the second panel of Table 3 indicates that the baseline ML models ranged from 0.8385 to 0.8418. Furthermore, REC (or Sensitivity) is a critical metric for identifying actual default cases. RF-ZIBD achieved the highest REC at 0.7855, representing a substantial improvement over the standard RF model (0.6984). This indicates that the proposed approach is significantly more effective in reducing Type II errors for missed default cases, which is crucial for effective risk management.
The F1-score, which balances precision and recall, further validates the robustness of the proposed methods. The baseline ML models achieved F1 scores ranging from approximately 0.38 to 0.45. The ZIBD-enhanced models, by contrast, demonstrated a significant leap in performance: RF-ZIBD reached an F1-score of 0.6360, followed by LG-ZIBD (0.6227) and XGB-ZIBD (0.5395). These results indicate that the proposed framework successfully mitigates the trade-off observed in imbalanced classification tasks.
From a practical banking perspective, the ‘Savings’ metric is the most indicative of business value. The baseline ML models yielded relatively low savings rates, ranging from 0.2364 (RF) to 0.4222 (LightGBM). The corresponding values reported by the proposed methods are roughly double these figures: XGB-ZIBD achieved the highest savings rate of 0.8436, followed closely by RF-ZIBD at 0.8352 and LG-ZIBD at 0.8245. These dramatic increases suggest that the proposed ZIBD-integrated strategies can significantly reduce financial losses associated with unrecognized defaults and optimize risk-mitigation costs. The figures in the rows of Table 3 marked “(sd)” are standard deviations; smaller values are better. Table 3 shows that the proposed two-stage methods perform better in terms of means while remaining competitive in terms of sd compared with the existing methods reported in the second-row block.

5. Example 2: Give Me Some Credit Dataset

The Give Me Some Credit Dataset, available from Kaggle at https://www.kaggle.com/c/GiveMeSomeCredit/ (accessed on 26 January 2026), is used for Example 2. This dataset has been discussed in many studies, such as Alonso Robisco and Carbó Martínez [39], Imteaj and Amini [40], Yan [41], Bakare and Odunaike [42], and Chia [43]. The response variable and features are given as follows:
  • y: the response variable. A default customer is denoted by y = 1 and a non-default customer by y = 0 . In this example, y = 1 indicates that a borrower has been 90 days or more past due.
  • RevolvingUtilizationOfUnsecuredLines ( X 1 ): the total balance on credit cards and personal lines of credit (excluding real estate and installment debt) divided by the sum of credit limits.
  • Age ( X 2 ): the borrower’s age (years).
  • PastDue30–59 ( X 3 ): the number of times the borrower has had past due 30-59 days but no worse in the last 2 years.
  • DebtRatio ( X 4 ): monthly debt payments, alimony, and living costs divided by monthly gross income.
  • MonthlyIncome ( X 5 ): the monthly income.
  • OpenCredit ( X 6 ): the number of open loans and credit lines.
  • PastDue90+ ( X 7 ): the number of times a borrower has been past due 90 days or more.
  • RealEstateLoans ( X 8 ): the number of mortgage and real estate loans.
  • PastDue60–89 ( X 9 ): the number of times a borrower has been past due 60-89 days but no worse in the last 2 years.
  • Dependents ( X 10 ): the number of dependents in the family.
This dataset comprises 112,915 observations on 10 features, with a default proportion of 6.74%. For the Kaggle Credit dataset, we use the same model parameters as in previous work in 2015. The model parameters of this dataset are summarized in Table 4, and the descriptive statistics of the Kaggle default dataset are summarized in Table 5. The histograms of the features are displayed in Figure 1. The loss given default ( L g d ) in the Kaggle default dataset (75%) is significantly higher than that in the SMEs dataset (48.58%).
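To illustrate how the Table 4 parameters could enter example-dependent costs, the sketch below follows one plausible construction from the cost-sensitive credit-scoring literature: a missed default (false negative) loses Lgd times the exposure, while a rejected good applicant (false positive) forgoes the interest margin over the loan term. The `credit_line` argument is a hypothetical per-applicant input, and this construction is an assumption for illustration, not the paper's exact cost model:

```python
LGD = 0.75         # loss given default (Table 4)
R_AVG = 0.0479     # average annual interest rate (Table 4)
R_CF = 0.0294      # cost of funds (Table 4)
TERM_YEARS = 2.0   # average loan term of 24 months (Table 4)

def example_costs(credit_line):
    """Per-applicant misclassification costs (one plausible construction)."""
    c_fn = LGD * credit_line                           # principal lost on a missed default
    c_fp = credit_line * (R_AVG - R_CF) * TERM_YEARS   # margin forgone on a rejected good loan
    return c_fp, c_fn

c_fp, c_fn = example_costs(10_000.0)  # c_fn = 7500.0, c_fp ≈ 370.0
```

The large asymmetry between the two costs (roughly 20:1 here) is exactly what the example-dependent cost-sensitive loss exploits: the higher Lgd of the Kaggle dataset makes missed defaults even more expensive relative to false alarms than in the SMEs dataset.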
Table 6 summarizes the results for the Kaggle default dataset. The R code implementing the RF and RF-ZIBD methods for Example 2 is presented in Appendix A; the R code implementing the XGBoost, LightGBM, XGB-ZIBD, and LG-ZIBD methods for Section 5 is similar and is omitted for brevity. Unlike the dramatic improvements observed in Example 1, the proposed ZIBD-integrated methods yield results comparable to the baseline ML classifiers. This behavior is expected and validates the adaptive nature of the hybrid approach. After analyzing the data with the ZIBD model, we found that the Kaggle default dataset contains fewer structural-zero cases. The ZIBD component therefore converges toward the baseline behavior, avoiding the performance degradation often seen when specialized models are applied to general data. Crucially, even with less pronounced structural zero-inflation, the method still captures subtle risk patterns: the F1 scores and Savings metrics of RF-ZIBD, XGB-ZIBD, and LG-ZIBD all show positive improvements over their base models. This suggests that the incorporated cost-sensitive loss function continues to optimize for misclassification costs, making the hybrid algorithm a safe and effective solution across varying degrees of zero inflation. Rows labeled “(sd)” in Table 6 report standard deviations; the smaller the standard deviation, the more stable the method. Table 6 shows that the proposed two-stage methods and the existing ones in the second row block are competitive in both mean and sd metrics.

6. Discussion

Traditional ML models often struggle to precisely identify positive cases in imbalanced datasets, especially when many observations are structural zeros. In this paper, we incorporate example-dependent cost-sensitive (EDCS) factors into the ZIBD model’s loss function and propose a ZIBD-integrated process to improve prediction performance for imbalanced data with a large proportion of structural zeros. To implement the proposed ZIBD-integrated process, three two-stage algorithms are provided. Moreover, the PSO and DE heuristics are employed to provide stable initial solutions for the Newton–Raphson method, thereby improving estimation accuracy.
To stabilize the initial parameters for the Newton–Raphson algorithm that minimizes the target loss function, the PSO and DE algorithms are first run to generate candidate initial model parameters, and the best candidate is selected through competition. After the ZIBD model is established, the next step is to remove observations identified as structural zeros: the method predicts the probability that a response is a structural zero and removes observations whose predicted probability exceeds a conservative threshold, such as 0.80. Setting a high threshold reduces the risk of incorrect removal.
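The removal step itself is straightforward once the fitted ZIBD model supplies each observation's structural-zero probability. A minimal sketch, assuming `p_zero` holds those fitted probabilities (all names here are illustrative):

```python
import numpy as np

def drop_structural_zeros(X, y, p_zero, threshold=0.80):
    """Keep only observations whose estimated structural-zero
    probability is below the conservative threshold."""
    keep = np.asarray(p_zero) < threshold
    return X[keep], y[keep]

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 0, 0])
p_zero = np.array([0.95, 0.30, 0.10, 0.85, 0.50])
X2, y2 = drop_structural_zeros(X, y, p_zero)  # rows 0 and 3 removed
```

Raising the threshold trades a smaller reduction in training size for a lower chance of discarding a genuinely at-risk borrower.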
The data remaining after removing structural-zero cases are used to train the RF, XGBoost, and LightGBM models to improve prediction accuracy. Because the proposed strategy involves two stages, it is inevitably more time-consuming than the standalone RF, XGBoost, and LightGBM models; this is a trade-off between computational time and prediction accuracy. Moreover, the K-fold cross-validation procedures of Section 4 and Section 5 can be repeated for model validation.
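A sketch of how the two stages nest inside K-fold cross-validation: the structural-zero filter is applied only to each training fold, never to the evaluation fold, so the reported metrics remain honest. The `structural_zero_prob` function below is a placeholder standing in for the fitted ZIBD component, and the synthetic data are purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# imbalanced toy data standing in for a credit dataset
X, y = make_classification(n_samples=600, weights=[0.9], random_state=0)

def structural_zero_prob(X):
    # placeholder for the fitted ZIBD structural-zero component
    return np.zeros(len(X))

scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    keep = structural_zero_prob(X[tr]) < 0.80            # stage 1: filter the training fold only
    clf = RandomForestClassifier(random_state=0).fit(X[tr][keep], y[tr][keep])
    scores.append(f1_score(y[te], clf.predict(X[te])))   # stage 2: score the untouched fold
```

Filtering inside the loop (rather than once on the full dataset) prevents the structural-zero decision from leaking information about the held-out fold.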
The convergence of the proposed two-stage algorithms can be addressed stage by stage. At the first stage, the candidate-solution search integrates the Newton–Raphson method, PSO, and DE. When the PSO algorithm searches for the initial solution, step P-II of Algorithm 1 determines convergence; likewise, step D-II of Algorithm 2 determines convergence when DE is used to search for the initial solution. In a typical Newton–Raphson method, the search converges when the loss function can no longer be improved. Finally, the best initial solution, which attains the maximal Savings, is selected. At the second stage, the RF, XGBoost, and LightGBM methods are used, so the convergence of each method follows its traditional design.
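The competition for the initial solution followed by Newton-type refinement can be sketched as below. SciPy ships DE but not PSO, so random restarts stand in for the PSO branch; a convex toy surface stands in for the cost-sensitive ZIBD loss, and BFGS stands in for the paper's Newton–Raphson step. All of these substitutions are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

def loss(beta):
    # toy stand-in for the cost-sensitive ZIBD loss surface
    return float(np.sum((np.asarray(beta) - np.array([0.5, -1.2])) ** 2))

bounds = [(-5.0, 5.0), (-5.0, 5.0)]

# First stage: candidate initial solutions compete on the loss value.
de_best = differential_evolution(loss, bounds, seed=1).x
rs_candidates = np.random.default_rng(1).uniform(-5.0, 5.0, size=(50, 2))
rs_best = min(rs_candidates, key=loss)          # random-restart stand-in for PSO
start = de_best if loss(de_best) <= loss(rs_best) else rs_best

# Second step of stage one: Newton-type refinement from the winning start.
fit = minimize(loss, start, method="BFGS")
```

The point of the competition is robustness: a Newton-type iteration converges quickly only from a good starting point, and the metaheuristics supply one even when the loss surface is poorly behaved.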
The proposed two-stage algorithms are more computationally complex than the individual RF, XGBoost, and LightGBM methods, and the implementation times of Examples 1 and 2 show different patterns. The dataset of Example 1 is not large, and many structural-zero cases are identified. Hence, the secondary dataset obtained by removing structural-zero cases accelerates the second-stage implementation of the RF, XGBoost, and LightGBM methods, with Savings higher than those achieved by the three machine-learning methods on the original dataset. As a result, the implementation times of the proposed two-stage methods are almost equivalent to those of the RF, XGBoost, and LightGBM methods: the ratio of the time to implement the proposed two-stage methods to that of the RF, XGBoost, and LightGBM methods on the original dataset is 0.986 (1.968 h over 1.995 h). Example 2, however, shows a different result. Its dataset is large, and few structural-zero cases are found, so the time to implement the RF, XGBoost, and LightGBM methods on the original dataset is significantly shorter than that of the proposed two-stage method: the two-stage method spends more time finding the best initial solutions at the first stage on a large dataset and then uses a dataset of almost the same size at the second stage. The corresponding time ratio is 1.428 (10.041 h over 7.032 h). Based on the findings in Figure 1 and Section 4, the proposed two-stage method nevertheless improves Savings over the RF, XGBoost, and LightGBM methods applied to the original datasets.

7. Concluding Remarks

Integrating a cost-sensitive loss function aligns model optimization with real-world financial risks, where misclassifying a default is far more costly than a false positive. The fusion of ZIBD and baseline ML models leverages ZIBD’s ability to model zero-inflated distributions and the ML models’ strength in characterizing nonlinear relationships. The aim is to yield a more comprehensive and robust classifier, consistent with the trend toward hybrid algorithms for addressing complex credit-risk challenges and with the expectation of Zheng et al. [44].
Practically, the proposed ZIBD-integrated models, RF-ZIBD, XGB-ZIBD, and LG-ZIBD, offer actionable value for financial institutions. Their high Recall ensures that most default risks are identified, while their high Precision avoids excessive risk aversion that could restrict credit access for viable loan applicants. The considerable cost Savings help promote bank profitability and the efficiency of risk management, addressing the critical need for accurate and cost-effective credit scoring.
Despite its strengths, this study has gaps that future studies can address. First, the computational complexity of the proposed two-stage algorithm may be higher than that of competitors, potentially requiring optimization for large-scale datasets. Second, the current ZIBD-integrated framework does not integrate deep learning models to capture complex, latent patterns (e.g., high-dimensional sequential financial records and unstructured text from loan applications), which may limit its predictive performance. To address these gaps, future work can focus on optimizing the algorithm’s computational efficiency through feature selection or parallel computing. Additionally, exploring fusion with deep learning models to capture more complex data patterns and extending the cost-sensitive framework to dynamic risk assessment (e.g., real-time loan monitoring) could further expand the method’s practical utility.

Author Contributions

C.Z. and T.-R.T. led the conceptualization, investigation, and drafting of the manuscript; Y.L. handled the writing—review and editing; C.Z. collected the data; C.Z. and T.-R.T. carried out data analysis; C.Z. secured funding. All authors have read and agreed to the published version of the manuscript.

Funding

The authors gratefully acknowledge the financial support provided by the National Office for Philosophy and Social Sciences of China under Grant 23BTJ044 for Zheng C.

Data Availability Statement

The data presented in this study are openly available in Kaggle at https://www.kaggle.com/c/GiveMeSomeCredit/ (accessed on 26 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. The R Codes to Implement the RF and RF-ZIBD Methods

[The R code listing for the RF and RF-ZIBD methods is provided as a sequence of images (i001–i008) in the published version of the article and is not reproduced here.]

References

  1. Chi, G.; Dong, B.; Zhou, Y.; Jin, P. Long-horizon predictions of credit default with inconsistent customers. Technol. Forecast. Soc. Chang. 2024, 198, 123008.
  2. Sun, W.; Zhang, X.; Li, M.; Wang, Y. Interpretable high-stakes decision support system for credit default forecasting. Technol. Forecast. Soc. Chang. 2023, 196, 122825.
  3. Han, S.; Jung, H.; Yoo, P.D.; Provetti, A.; Cali, A. NOTE: Non-parametric oversampling technique for explainable credit scoring. Sci. Rep. 2024, 14, 26070.
  4. Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14.
  5. Hall, D.B. Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics 2000, 56, 1030–1039.
  6. Rodrigues, J. Bayesian analysis of zero-inflated distributions. Commun. Stat.–Theory Methods 2003, 32, 281–289.
  7. Ghosh, S.K.; Mukhopadhyay, P.; Lu, J.C. Bayesian analysis of zero-inflated regression models. J. Stat. Plan. Inference 2006, 136, 1360–1375.
  8. Gelfand, A.E.; Citron-Pousty, S. Zero-inflated models with application to spatial count data. Environ. Ecol. Stat. 2002, 9, 341–355.
  9. Diop, A.; Diop, A.; Dupuy, J.-F. Simulation-based inference in a zero-inflated Bernoulli regression model. Commun. Stat.–Simul. Comput. 2016, 45, 3597–3614.
  10. Staub, K.E.; Winkelmann, R. Consistent estimation of zero-inflated count models. Health Econ. 2013, 22, 673–686.
  11. Lee, S.M.; Pho, K.H.; Li, C.S. Validation likelihood estimation method for a zero-inflated Bernoulli regression model with missing covariates. J. Stat. Plan. Inference 2021, 214, 105–127.
  12. Li, C.S.; Lu, M. Semiparametric zero-inflated Bernoulli regression with applications. J. Appl. Stat. 2022, 49, 2845–2869.
  13. Chiang, J.-Y.; Lio, Y.L.; Hsu, C.-Y.; Ho, C.-L.; Tsai, T.-R. Binary classification with imbalanced data. Entropy 2024, 26, 15.
  14. Lu, M.; Li, C.S.; Wagner, K.D. Penalised estimation of partially linear additive zero-inflated Bernoulli regression models. J. Nonparametr. Stat. 2024, 36, 863–890.
  15. Pho, K.H. Goodness of fit test for a zero-inflated Bernoulli regression model. Commun. Stat.–Simul. Comput. 2024, 53, 756–771.
  16. Xin, H.; Lio, Y.L.; Chen, H.-C.; Tsai, T.-R. Zero-inflated binary classification model with elastic net regularization. Mathematics 2024, 12, 2990.
  17. Pho, K.H. Zero-inflated probit Bernoulli model: A new model for binary data. Commun. Stat.–Simul. Comput. 2025, 54, 2324–2344.
  18. Su, C.-J.; Chen, I.-F.; Tsai, T.-R.; Lio, Y.L. A hybrid algorithm with a data augmentation method to enhance the performance of the zero-inflated Bernoulli model. Mathematics 2025, 13, 1702.
  19. Xiao, J.; Li, S.; Tian, Y.; Huang, J.; Jiang, X.; Wang, S. Example dependent cost sensitive learning based selective deep ensemble model for customer credit scoring. Sci. Rep. 2025, 15, 6000.
  20. Kaveh, M.; Mesgari, M.S. Application of meta-heuristic algorithms for training neural networks and deep learning architectures: A comprehensive review. Neural Process. Lett. 2023, 55, 4519–4622.
  21. Rahman, C.M. Group learning algorithm: A new metaheuristic algorithm. Neural Comput. Appl. 2023, 35, 14013–14028.
  22. Abdollahzadeh, B.; Khodadadi, N.; Barshandeh, S.; Trojovský, P.; Gharehchopogh, F.S.; El-kenawy, E.S.M.; Abualigha, L.; Mirjalili, S. Puma optimizer (PO): A novel metaheuristic optimization algorithm and its application in machine learning. Clust. Comput. 2024, 27, 5235–5283.
  23. Jia, H.; Lu, C. Guided learning strategy: A novel update mechanism for metaheuristic algorithms design and improvement. Knowl.-Based Syst. 2024, 286, 111402.
  24. Dasi, H.; Ying, Z.; Ashab, M.F.B. Proposing hybrid prediction approaches with the integration of machine learning models and metaheuristic algorithms to forecast the cooling and heating load of buildings. Energy 2024, 291, 130297.
  25. Helforoush, Z.; Sayyad, H. Prediction and classification of obesity risk based on a hybrid metaheuristic machine learning approach. Front. Big Data 2024, 7, 1469981.
  26. Kowalski, P.A.; Kucharczyk, S.; Mańdziuk, J. Constrained hybrid metaheuristic algorithm for probabilistic neural networks learning. Inf. Sci. 2025, 713, 122185.
  27. Ahmed, Y.; Dutta, K.R.; Nepu, S.N.C.; Prima, M.; AlMohamadi, H.; Akhtar, P. Optimizing photocatalytic dye degradation: A machine learning and metaheuristic approach for predicting methylene blue in contaminated water. Results Eng. 2025, 25, 103538.
  28. Donmez, I.; Gucluer, K. Metaheuristic-optimized machine learning models for predicting compressive strength and assessing sustainability of waste glass powder additive mortars. Clean. Eng. Technol. 2026, 31, 101156.
  29. Tran, K.Q.; Tra, N.Q.N.; Tran, L.H.; Luu, L.X.; Duong, N.T. Hybrid metaheuristic–ANN for accurate, stable, and generalized breaking wave height prediction. Ocean Eng. 2026, 344, 123660.
  30. Salman, H.A.; Kalakech, A.; Steiti, A. Random forest algorithm overview. Babylonian J. Mach. Learn. 2024, 2024, 69–79.
  31. Iranzad, R.; Liu, X. A review of random forest-based feature selection methods for data science education and applications. Int. J. Data Sci. Anal. 2025, 20, 197–211.
  32. Mallala, B.; Ahmed, A.I.U.; Pamidi, S.V.; Faruque, M.O.; Reddy, R. Forecasting global sustainable energy from renewable sources using random forest algorithm. Results Eng. 2025, 25, 103789.
  33. Niazkar, M.; Menapace, A.; Brentan, B.; Piraei, R.; Jimenez, D.; Dhawan, P.; Righetti, M. Applications of XGBoost in water resources engineering: A systematic literature review (Dec 2018–May 2023). Environ. Model. Softw. 2024, 174, 105971.
  34. Wiens, M.; Verone-Boyle, A.; Henscheid, N.; Podichetty, J.T.; Burton, J. A tutorial and use case example of the eXtreme gradient boosting (XGBoost) artificial intelligence algorithm for drug development applications. Clin. Transl. Sci. 2025, 18, e70172.
  35. Imani, M.; Beikmohammadi, A.; Arabnia, H.R. Comprehensive analysis of random forest and XGBoost performance with SMOTE, ADASYN, and GNUS under varying imbalance levels. Technologies 2025, 13, 88.
  36. Li, S.; Dong, X.; Ma, D.; Dang, B.; Zang, H.; Gong, Y. Utilizing the LightGBM algorithm for operator user credit assessment research. arXiv 2024, arXiv:2403.14483.
  37. Long, L.; Shi, Q.; Zhang, Q.; Hu, J.; Zhang, H. Dual-warning model for coal spontaneous combustion temperature prediction and risk classification based on BO-LightGBM. Process Saf. Environ. Prot. 2025, 201, 107624.
  38. Lian, H.; Ji, Y.; Niu, M.; Gu, J.; Xie, J.; Liu, J. A hybrid load prediction method of office buildings based on physical simulation database and LightGBM algorithm. Appl. Energy 2025, 377, 124620.
  39. Alonso Robisco, A.; Carbó Martínez, J.M. Measuring the model risk-adjusted performance of machine learning algorithms in credit default prediction. Financ. Innov. 2022, 8, 70.
  40. Imteaj, A.; Amini, M.H. Leveraging asynchronous federated learning to predict customers financial distress. Intell. Syst. Appl. 2022, 14, 200064.
  41. Yan, G. Autoencoder based generator for credit information recovery of rural banks. Int. J. Ind. Eng. Theory Appl. Pract. 2023, 30.
  42. Bakare, A.; Odunaike, A. Machine learning for enhanced credit scoring. Algora 2024, 1, 1–15.
  43. Chia, L.H. Finding the sweet spot: Optimal data augmentation ratio for imbalanced credit scoring using ADASYN. arXiv 2025, arXiv:2510.18252.
  44. Zheng, C.; Zhu, J.; Weng, F.; Zhang, Z.; Feng, C.; Wang, L. A two-stage machine learning method for personal credit risk scoring with fragmentary data. Expert Syst. Appl. 2026, 299, 130268.
Figure 1. The histograms of the features of Example 2.
Table 1. Descriptive statistics of the SMEs dataset.

Covariates               Mean       Std        Min     Max        Median
Industry                 2.821      0.391      1       3          3
Entrusted payment        0.917      0.280      0       2          1
prepay                   0.253      0.438      0       2          0
Interest rate            0.006      0.001      0.003   0.012      0.006
Rate mode                1.564      0.691      0       2          2
Rate Floating direction  0.232      0.640      0       2          0
Loan line                3,299,279  2,308,286  50,000  9,990,000  2,750,000
Term                     57.510     64.006     1       240        35
Gender                   1.364      0.481      1       2          1
Age                      42.877     9.230      20      68         43
Education                3.087      1.398      0       5          4
Household income         426,286    751,996    0       5,976,000  0
Personal income          29,884     54,901     0       498,000    0
Type of enterprises      1.523      0.769      1       3          1
Residency                1.433      0.777      0       2          2
Marriage                 1.398      0.493      0       2          1
Table 2. Model parameters of the SME dataset.

Parameter                      SME Credit
Average interest rate ( r ¯ )  6.05%
Cost of funds ( r c f )        3.01%
Average loan term in months    57.51
Loss given default (Lgd)       48.58%
Table 3. Five-fold cross-validation results for the SMEs dataset.

Methods          ACC     AUC     REC     SPE     F1      PRE     Savings
RF-ZIBD (mean)   0.9683  0.9053  0.7855  0.9726  0.6360  0.6032  0.8352
RF-ZIBD (sd)     0.0237  0.0134  0.0258  0.0243  0.0725  0.0942  0.1129
XGB-ZIBD (mean)  0.9571  0.9026  0.7792  0.9611  0.5395  0.4644  0.8436
XGB-ZIBD (sd)    0.0176  0.0105  0.0113  0.0183  0.0606  0.0725  0.0682
LG-ZIBD (mean)   0.9670  0.9125  0.7679  0.9716  0.6227  0.5973  0.8245
LG-ZIBD (sd)     0.0244  0.0097  0.0219  0.0253  0.1317  0.1745  0.0851
RF (mean)        0.9293  0.8408  0.6984  0.9332  0.4137  0.4061  0.2364
RF (sd)          0.0403  0.0187  0.0432  0.0414  0.1309  0.1745  0.2853
XGBoost (mean)   0.9225  0.8385  0.6883  0.9265  0.3882  0.3625  0.3654
XGBoost (sd)     0.0315  0.0196  0.0548  0.0324  0.0962  0.1353  0.1507
LightGBM (mean)  0.9286  0.8418  0.6828  0.9327  0.4495  0.4400  0.4222
LightGBM (sd)    0.0380  0.0213  0.0403  0.0390  0.0762  0.0934  0.1519
Table 4. Model parameters of the Kaggle default dataset.

Parameter                      Kaggle Default Data Set
Average interest rate ( r ¯ )  4.79%
Cost of funds ( r c f )        2.94%
Average loan term in months    24
Loss given default (Lgd)       75%
Table 5. Descriptive statistics of the Kaggle default dataset.

Features                              Mean      Std         Min  Max        Median
RevolvingUtilizationOfUnsecuredLines  5.825     254.977     0    50,708     0.173
age                                   51.357    14.451      0    103        51
PastDue30–59                          0.379     3.522       0    98         0
DebtRatio                             0.306     0.223       0    1.000      0.278
MonthlyIncome                         6959.809  14,781.926  1    3,008,750  5600
OpenCredit                            8.676     5.125       0    57         8
PastDue90+                            0.214     3.490       0    98         0
RealEstateLoans                       1.016     1.081       0    29         1
PastDue60–89                          0.189     3.472       0    98         0
Dependents                            0.855     1.149       0    20         0
Table 6. Five-fold cross-validation results for the Kaggle Credit dataset.

Methods          ACC     AUC     REC     SPE     F1      PRE     Savings
RF-ZIBD (mean)   0.8011  0.8342  0.7160  0.8075  0.3380  0.2214  0.4967
RF-ZIBD (sd)     0.0058  0.0005  0.0078  0.0068  0.0046  0.0047  0.0024
XGB-ZIBD (mean)  0.7875  0.8498  0.7590  0.7897  0.3366  0.2167  0.5132
XGB-ZIBD (sd)    0.0087  0.0005  0.0113  0.0102  0.0059  0.0059  0.0026
LG-ZIBD (mean)   0.7851  0.8520  0.7679  0.7864  0.3368  0.2162  0.5205
LG-ZIBD (sd)     0.0035  0.0002  0.0028  0.0041  0.0030  0.0029  0.0009
RF (mean)        0.8045  0.8377  0.7191  0.8106  0.3326  0.2168  0.4940
RF (sd)          0.0093  0.0003  0.0106  0.0108  0.0071  0.0070  0.0010
XGBoost (mean)   0.7870  0.8538  0.7688  0.7883  0.3279  0.2086  0.5114
XGBoost (sd)     0.0013  0.0004  0.0017  0.0015  0.0013  0.0013  0.0006
LightGBM (mean)  0.7885  0.8561  0.7722  0.7897  0.3305  0.2105  0.5184
LightGBM (sd)    0.0074  0.0002  0.0092  0.0086  0.0054  0.0051  0.0006
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
