A Hybrid Algorithm with a Data Augmentation Method to Enhance the Performance of the Zero-Inflated Bernoulli Model

Su, Chih-Jen; Chen, I-Fei; Tsai, Tzong-Ru; Lio, Yuhlong

doi:10.3390/math13111702


¹ Department of Management Sciences, Tamkang University, Tamsui District, New Taipei City 251301, Taiwan

² Department of Statistics, Tamkang University, Tamsui District, New Taipei City 251301, Taiwan

³ Department of Mathematical Sciences, University of South Dakota, Vermillion, SD 57069, USA

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(11), 1702; https://doi.org/10.3390/math13111702

Submission received: 11 April 2025 / Revised: 11 May 2025 / Accepted: 21 May 2025 / Published: 22 May 2025

(This article belongs to the Special Issue Significant Applications in Economics, Business, Management and Industrial Statistics)


Abstract

The zero-inflated Bernoulli (ZIBer) model, enhanced with elastic net regularization, effectively handles binary classification for zero-inflated datasets. The zero-inflated structure contributes significantly to data imbalance. To improve the ZIBer model’s ability to accurately identify minority classes, we explore the use of momentum and Nesterov’s gradient descent methods, particle swarm optimization, and a novel hybrid algorithm combining particle swarm optimization with Nesterov’s accelerated gradient techniques. Additionally, the synthetic minority oversampling technique (SMOTE) is employed to augment the data used for training the model. Extensive simulations using holdout cross-validation reveal that the proposed hybrid algorithm with data augmentation excels in identifying true positive cases. Conversely, the hybrid algorithm without data augmentation is preferable when aiming for a balance between the metrics of recall and precision. Two case studies about diabetes and biopsy are provided to demonstrate the model’s effectiveness, with performance assessed through K-fold cross-validation.

Keywords:

data augmentation; gradient descent method; Monte Carlo simulation; particle swarm optimization; SMOTE

MSC:

62-08; 62-11

1. Introduction

A binary dataset with the responses $Y = 0$ and $Y = 1$ is zero-inflated if it has excess zeros. The zero-inflated Bernoulli (ZIBer) model can be used to characterize binary zero-inflated datasets. The excess zeros can be structural or chance zeros. A structural zero is a zero caused by inherent constraints or conditions. Apart from structural zeros, the response variable Y follows a Bernoulli distribution, in which $Y = 1$ with a probability p, $0 \leq p \leq 1$, and $Y = 0$ with the probability $1 - p$. Diop et al. [1] provided a medical example. Assume that the infection response is related to some disease, where $Y = 1$ means an infected individual and $Y = 0$ otherwise. An immunity agent can determine whether the individual can be infected; that is, an individual who cannot be infected is immune and is labeled by $Y = 0$. See Ridout et al. [2] for more discussion of structural and chance zeros. The ZIBer model has been suggested for modeling zero-inflated datasets because of its superiority over the logistic model. To save space, the abbreviations in Table 1 are used in this study.

Datasets with a highly disproportionate distribution of categories are increasingly common. Such datasets are called imbalanced data in machine learning analysis. The ZIBer model can be an alternative for characterizing imbalanced datasets; see Chiang et al. [3] and Xin et al. [4].

Lambert [5] was a pioneer in using the zero-inflated model with the Poisson distribution, named the ZIP model. He used the logit function to link the probability of the response $Y = 1$ and the covariates. Moreover, the maximum likelihood estimation method was used for making statistical inferences. The ZIP model addresses random events that include an abundance of zero-count data within a given time frame. The ZIP model combines two processes that produce zeros. The first process is responsible for generating zeros, while the second process, guided by a Poisson distribution, produces counts. Since then, the ZIP model has received increasing attention. However, Lambert [5] mentioned that the ZIP model has the drawback of over-dispersion inherited from the Poisson model. To overcome this drawback, he also proposed the zero-inflated negative Binomial model. The zero-inflated model was also extended to the Binomial distribution; see [6,7,8,9,10,11] for more comprehensive discussions. Recent studies on using zero-inflated models for characterizing binary responses can be found in [12,13,14,15,16].

Compared with the ZIP, zero-inflated negative Binomial, and zero-inflated Binomial models, binary classification for zero-inflated datasets has received less attention. Diop et al. [1] proposed a simulation-based inference method for binary responses with zero-inflated datasets. Chiang et al. [3] proposed an expectation–maximization (EM) algorithm to obtain reliable maximum likelihood estimates of the ZIBer parameters and named it EM-ZIBer. To penalize overfitting in the ZIBer model, Xin et al. [4] developed a modeling process that uses elastic net regularization (ENR) and named the resulting model the ENR-ZIBer model. However, the sensitivity of the method proposed by Xin et al. [4] remains unsatisfactorily low, so there is room to improve their estimation method for the ZIBer model on zero-inflated datasets.

The zero-inflated structure is a crucial factor in making the data imbalanced. The ENR-ZIBer model outperforms the logistic model in characterizing zero-inflated datasets. However, there is room to enhance the performance of the ENR-ZIBer model in characterizing imbalanced datasets and in maintaining high performance when the model contains many explanatory variables. The regularization rule can force the ENR-ZIBer model to screen out less important explanatory variables, but it also increases the possibility of not obtaining reliable model estimates. For more information, readers are referred to Xin et al. [4].

For zero-inflated datasets, Chiang et al. [3] showed that the EM-ZIBer model is competitive with other weak learners, for example, the light gradient boosting machine and artificial neural network methods. Xin et al. [4] showed the superiority of the ENR-ZIBer model over the EM-ZIBer model proposed by Chiang et al. [3] and the logistic model. However, the ENR-ZIBer model in Xin et al. [4] can fail when the covariates of the latent variable are difficult to identify. Moreover, Xin et al. [4] did not study the performance of the ENR-ZIBer model with data augmentation. To enhance the classification performance of the ENR-ZIBer model for zero-inflated data, other machine learning algorithms have the potential to outperform the gradient descent method with the momentum learning rate, labeled as GDM-Mom, proposed by Xin et al. [4]. Because zero inflation can itself make the data imbalanced, the performance of the ENR-ZIBer model on imbalanced data needs to be verified.

Kennedy and Eberhart [17] first proposed the particle swarm optimization (PSO) method in 1995. Inspired by the collective behavior of fish schooling and birds flocking, PSO is a population-based metaheuristic algorithm that solves optimization problems by iteratively updating a swarm of particles, that is, candidate solutions. The PSO algorithm is effective for continuous optimization problems and can be implemented with a few hyperparameters. Wang [18] used multi-kernel functions to improve Fuzzy C-Means, named MK-FCM, and then improved the performance of the PSO method.

In machine learning applications for binary-response data, the synthetic minority oversampling technique, denoted by SMOTE, is a popular data augmentation method for dealing with imbalanced data. The augmented data are then used to train the model for binary classification. When the positive class is under-represented, standard classifiers can be biased toward the majority class, which leads to poor detection of the minority class. SMOTE is particularly effective when used in combination with ensemble methods or resampling techniques. Dablain et al. [19] proposed a novel oversampling algorithm, named DeepSMOTE, for deep learning models. Integrating the geometric SMOTE method and the SMOTE method for Nominal and Continuous features, Fonseca et al. [20] proposed a geometric SMOTE method for Nominal and Continuous features, named the G-SMOTENC method. G-SMOTENC is an oversampling method that achieves a significant improvement in classification performance. Elreedy et al. [21] derived the mathematical formulation for the probability distribution of samples generated using the SMOTE method. More applications using different SMOTE methods can be found in [22,23,24,25,26,27,28,29]. To the best of the authors’ knowledge, whether the SMOTE method can enhance the performance of the ENR-ZIBer model is an open issue that needs to be verified.

The methods of Xin et al. [4] have three weaknesses:

1.: The ENR-ZIBer model used by Xin et al. [4] will fail if the covariates of the latent variable cannot be clearly identified. We suggest using an unknown proportion parameter to replace the second logit model used by Xin et al. [4] to enhance the performance of the ENR-ZIBer model.
2.: The GDM-Mom method proposed by Xin et al. [4] has room to be improved to obtain reliable parameter estimates.
3.: Xin et al. [4] did not evaluate the effect of data augmentation for the ENR-ZIBer model. SMOTE is a popular technique for data augmentation in machine learning applications. It is important to check if SMOTE can enhance the predictive performance of the ENR-ZIBer model.

To validate the quality of the competitive models used in this study, three popular cross-validation methods for machine learning models are considered: holdout cross-validation, leave-one-out cross-validation, and K-fold cross-validation. In holdout cross-validation, a sample is randomly split into a training sample and a test sample. It is common to keep a large proportion of the observations for the training sample and use the remaining observations for the test sample. Leave-one-out cross-validation keeps the hth observation for testing and uses the other $n - 1$ observations to train the machine learning model, $h = 1, 2, \dots, n$. Then, the mean metric value over the n runs is used to evaluate the quality of the machine learning method. Leave-one-out cross-validation is reliable but time-consuming. K-fold cross-validation can be an alternative to overcome this drawback. It splits the sample into K equal parts. Each part is used once as the test sample, while the other K − 1 parts form the training sample used to train the machine learning model. In this paper, the holdout cross-validation and K-fold cross-validation methods are used to assess the performance of the models.
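The two splitting schemes used in this paper can be sketched in a few lines. The following is a minimal illustration in Python rather than the authors' R code; the function names `holdout_split` and `kfold_splits`, the seeds, and the shuffling strategy are assumptions made for the example.

```python
import random

def holdout_split(n, train_frac=0.7, seed=0):
    """Randomly split the indices 0..n-1 into training and test index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(round(train_frac * n))
    return idx[:n_train], idx[n_train:]

def kfold_splits(n, k=5, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # fold sizes differ by at most one
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]

# Example with n = 10 observations.
train, test = holdout_split(10, train_frac=0.7)
splits = list(kfold_splits(10, k=5))
```

Setting `k` equal to `n` in `kfold_splits` reproduces leave-one-out cross-validation, which makes its cost of n model fits explicit.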

Based on the aforementioned reasons, three tasks are carried out in this study:

Task 1.: We establish the GDM update process using the NAG factor, denoted by GDM-NAG. Monte Carlo simulations are conducted to screen out the best algorithmic methods for the implementation of the ENR-ZIBer model, from among GDM-Mom, GDM-NAG, PSO, and the proposed hybrid PSO-NAG algorithm that integrates the PSO and GDM-NAG algorithms.
Task 2.: The selected best performance algorithms from Task 1 are used for training the ENR-ZIBer model via SMOTE data augmented sampling. Monte Carlo simulations are conducted to verify the performance of the GDM-Mom, GDM-NAG, PSO, PSO-NAG, and PSO-NAG-SMOTE algorithms in terms of proper metrics.
Task 3.: Holdout cross-validation is used for performance assessment in the Monte Carlo simulation study. Moreover, K-fold cross-validation is used to verify the model’s quality for the real example of diabetes.

The rest of this article is organized as follows. Section 2 introduces zero-inflated data and their modeling. The ZIBer model is established in Section 2.1. The loss function with the ENR rule is defined. Moreover, the implementation of the proposed optimization process using the GDM-Mom, GDM-NAG, PSO, PSO-NAG, and PSO-NAG-SMOTE methods for the implementation of the ENR-ZIBer model is presented in Section 2.2. In Section 3, the performance of using the aforementioned algorithmic methods for the optimization process of the ENR-ZIBer model is evaluated using intensive Monte Carlo simulations. Two examples are used in Section 4 to demonstrate the applications of the proposed methods. Some concluding remarks are given in Section 5 to comment on the proposed methods. Moreover, the topics for future study are discussed.

2. The SMOTE-ZIBer Model

Let $Y_{i}$ ($= 0, 1$) follow a Bernoulli distribution with a vector of covariates defined by $x_{i}^{T} = (x_{1 i}, x_{2 i}, \dots, x_{m i})$, $i = 1, 2, \dots, n$, with $x_{1 i} = 1$. Without considering the structural zero assumption, denote the probability of $Y_{i} = 1$ by $p_{i}$, where $p_{i} \in (0, 1)$ can be linked to $x_{i}$ by

\log \left(\frac{p_{i}}{1 - p_{i}}\right) = β^{T} x_{i}, \quad i = 1, 2, \dots, n,

(1)

where $β^{T} = (β_{1}, β_{2}, \dots, β_{m})$.

2.1. The Zero-Inflated Bernoulli Model

Assume that some zeros in the responses are structural zeros, which are determined through an unobserved or latent factor Z, where $Z_{i} = 1$ indicates that the zero in the ith response is a structural zero. Because it is not easy in practical applications to precisely identify the covariates correlated with the logarithm of the odds formed by $P (Z_{i} = 0)$ and $P (Z_{i} = 1)$, an unknown constant $δ$ is used to denote the probability of a structural zero. Then, we have

\begin{aligned} P (Y_{i} = 0) & = P (Z_{i} = 1) P (Y_{i} = 0 | Z_{i} = 1) + P (Z_{i} = 0) P (Y_{i} = 0 | Z_{i} = 0) \\ & = δ \times 1 + (1 - δ) \times (1 - p_{i}) \\ & = 1 - (1 - δ) p_{i}, \quad i = 1, 2, \dots, n . \end{aligned}

(2)

After considering the structural zero assumption, the probability $P (Y_{i} = 1)$ becomes

P (Y_{i} = 1) = (1 - δ) p_{i}, \quad i = 1, 2, \dots, n .

(3)

For simplicity, denote

π_{i} \equiv π_{i} (Θ) = P (Y_{i} = 1) = (1 - δ) p_{i}, \quad i = 1, 2, \dots, n,

where $Θ = (β, δ)$.

The loss function based on the data $D = \{(x_{i}, y_{i}), i = 1, 2, \dots, n\}$ can be defined by

L (Θ) \equiv L (Θ | D) = - \sum_{i = 1}^{n} (y_{i} \ln (π_{i}) + (1 - y_{i}) \ln (1 - π_{i})) .

(4)

Let $λ_{1}$ and $λ_{2}$ be constants for the regularization strength of the $L_{2}$ and $L_{1}$ norms, respectively. The loss function with the ENR regularization rule is given by

L (Θ) = - \sum_{i = 1}^{n} (y_{i} \ln (π_{i}) + (1 - y_{i}) \ln (1 - π_{i})) + \frac{λ_{1}}{2} \sum_{j = 1}^{m} β_{j}^{2} + λ_{2} \sum_{j = 1}^{m} | β_{j} | .

(5)
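The regularized loss can be written down directly from Equations (1), (3), and (4). The sketch below is one possible Python rendering, not the authors' implementation; the helper names `sigmoid`, `zib_prob`, and `enr_loss`, the clipping constant `eps`, and the toy data are assumptions, and the penalty is taken over all entries of β, including the intercept, as the displayed formula suggests.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def zib_prob(beta, delta, x):
    """pi_i = (1 - delta) * p_i, with p_i from the logit link."""
    eta = sum(b * v for b, v in zip(beta, x))
    return (1.0 - delta) * sigmoid(eta)

def enr_loss(beta, delta, X, y, lam1, lam2, eps=1e-12):
    """Cross-entropy loss plus the elastic net penalty on beta."""
    ce = 0.0
    for xi, yi in zip(X, y):
        # Clip pi away from 0 and 1 so the logarithms stay finite.
        pi = min(max(zib_prob(beta, delta, xi), eps), 1.0 - eps)
        ce -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    ridge = 0.5 * lam1 * sum(b * b for b in beta)
    lasso = lam2 * sum(abs(b) for b in beta)
    return ce + ridge + lasso

# Toy data (hypothetical): intercept plus one covariate.
X = [[1.0, 0.0], [1.0, 1.0]]
y = [1, 0]
base = enr_loss([0.0, 0.0], 0.5, X, y, lam1=0.0, lam2=0.0)
```

With β = 0 and δ = 0.5, every $π_{i}$ equals 0.25, so `base` is simply −ln(0.25) − ln(0.75), which gives a quick sanity check of the formula.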

Because the domain of $δ$, (0, 1), is bounded on both sides, a practical strategy is to search over a grid of candidate values, $Ω_{δ} = \{δ_{1}, δ_{2}, \dots, δ_{H}\} \subset (0, 1)$. Given $δ = δ_{h}$, the estimate of $β$ that minimizes the loss $L (Θ)$ over the search space $Ω_{β}$ is denoted by

{\hat{β}}^{(h)} = \arg \min_{β \in Ω_{β}} L (β, δ_{h}),

(6)

and the approximate optimal solution can be obtained as

(\hat{β}, \hat{δ}) = \arg \min_{h = 1, 2, \dots, H} L ({\hat{β}}^{(h)}, δ_{h}) .

(7)
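Equations (6) and (7) describe a profile search: fit β for each candidate δ, then keep the pair with the smallest loss. The following Python sketch shows only that control flow. The function `profile_grid_search` and the toy loss, whose inner minimizer is known in closed form, are hypothetical stand-ins for the ENR-ZIBer loss and its gradient-based fit.

```python
def profile_grid_search(fit_beta, loss, delta_grid):
    """Return (beta_hat, delta_hat, loss_hat) over a grid of delta values."""
    best = None
    for delta in delta_grid:
        beta_h = fit_beta(delta)        # inner minimization, Equation (6)
        value = loss(beta_h, delta)
        if best is None or value < best[2]:
            best = (beta_h, delta, value)
    return best                         # outer minimization, Equation (7)

# A uniform grid of H points inside (0, 1).
H = 19
delta_grid = [h / (H + 1) for h in range(1, H + 1)]

def toy_loss(beta, delta):
    # Hypothetical loss with its minimum at beta = 0.6, delta = 0.3.
    return (beta - 2.0 * delta) ** 2 + (delta - 0.3) ** 2

def toy_fit_beta(delta):
    # Closed-form inner minimizer of toy_loss for a fixed delta.
    return 2.0 * delta

beta_hat, delta_hat, loss_hat = profile_grid_search(toy_fit_beta, toy_loss, delta_grid)
```

The cost grows linearly in H because each grid point requires a full inner fit, which is one motivation for estimating δ by other means later in the paper.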

Following the same setting in the R package glmnet for the values of $λ_{1}$ and $λ_{2}$, let $λ_{1} = λ \times \frac{1 - α}{2}$ and $λ_{2} = λ \times α$. The value of $α = 0.5$ can be a good reference to balance the $L_{1}$-norm and $L_{2}$-norm penalties. If H is large and the grid $Ω_{δ}$ is uniformly spread over the interval (0, 1), the obtained optimal solution is close to the global optimal solution.

The target function ${CE}_{ENR}$, that is, the cross-entropy loss with the ENR penalty in Equation (5), is complicated. Xin et al. [4] used the gradient descent method with the momentum learning rate to obtain the estimate of the ENR-ZIBer model. The gradient descent method is a popular method to obtain local optimal solutions for a target function. The gradient of $L (Θ)$ with respect to $β$ can be obtained by

\begin{aligned} \nabla_{β} L (Θ) & = - \sum_{i = 1}^{n} \left\{\frac{y_{i}}{π_{i}} - \frac{1 - y_{i}}{1 - π_{i}}\right\} \frac{\partial π_{i}}{\partial β} + λ_{1} β + \boldsymbol{λ}_{2} \\ & = - (1 - δ) \sum_{i = 1}^{n} \left\{\frac{y_{i}}{π_{i}} - \frac{1 - y_{i}}{1 - π_{i}}\right\} p_{i} (1 - p_{i}) x_{i} + λ_{1} β + \boldsymbol{λ}_{2}, \end{aligned}

(8)

where $\boldsymbol{λ}_{2}$ is an m-dimensional vector with jth entry $λ_{2}$ if $β_{j} > 0$ and $- λ_{2}$ if $β_{j} < 0$.
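An analytic gradient is easy to get wrong, so it helps to compare it with a finite-difference approximation. The Python sketch below assumes that the gradient of the penalty is $λ_{1} β_{j}$ plus $λ_{2}$ times the sign of $β_{j}$ for each coefficient; the function names and the small dataset are hypothetical, and the values of `lam1` and `lam2` correspond to λ = 0.3 and α = 0.5 under the glmnet-style mapping stated above.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def enr_loss(beta, delta, X, y, lam1, lam2):
    total = 0.0
    for xi, yi in zip(X, y):
        pi = (1.0 - delta) * sigmoid(sum(b * v for b, v in zip(beta, xi)))
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return (total + 0.5 * lam1 * sum(b * b for b in beta)
            + lam2 * sum(abs(b) for b in beta))

def enr_grad(beta, delta, X, y, lam1, lam2):
    """Analytic gradient of enr_loss with respect to beta (beta_j != 0)."""
    grad = [lam1 * b + lam2 * ((b > 0) - (b < 0)) for b in beta]
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        pi = (1.0 - delta) * p
        # d(pi)/d(beta) = (1 - delta) * p * (1 - p) * x_i
        weight = (yi / pi - (1 - yi) / (1.0 - pi)) * (1.0 - delta) * p * (1.0 - p)
        for j, v in enumerate(xi):
            grad[j] -= weight * v
    return grad

def numeric_grad(f, beta, h=1e-6):
    """Central finite differences, used only as a check."""
    out = []
    for j in range(len(beta)):
        up, dn = beta[:], beta[:]
        up[j] += h
        dn[j] -= h
        out.append((f(up) - f(dn)) / (2.0 * h))
    return out

X = [[1.0, 0.2], [1.0, -1.5], [1.0, 0.7]]
y = [1, 0, 0]
beta, delta, lam1, lam2 = [0.5, -1.0], 0.3, 0.075, 0.15
analytic = enr_grad(beta, delta, X, y, lam1, lam2)
numeric = numeric_grad(lambda b: enr_loss(b, delta, X, y, lam1, lam2), beta)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The check is only meaningful away from $β_{j} = 0$, where the absolute value is not differentiable.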

The GDM-NAG method can be obtained as follows. Let $γ$ be the learning rate. At iteration $(t + 1)$, update the velocity $ν^{(t)}$ by

ν^{(t + 1)} = m \times ν^{(t)} - γ \nabla_{β} L (Θ^{(t)} + m \times ν^{(t)}), \quad t \geq 0,

(9)

where the momentum factor m is 0.9 and $ν^{(0)} = 0$. Update $β^{(t)}$ by

β^{(t + 1)} = β^{(t)} + ν^{(t + 1)}, \quad t = 1, 2, \dots, t_{max},

(10)

where $t_{max}$ is the preset maximum number of iterations for updating parameters. If the update formula in Equation (9) is replaced by

ν^{(t + 1)} = m \times ν^{(t)} - γ \nabla_{β} L (Θ^{(t)}), \quad t \geq 0,

(11)

the update process reduces to the GDM-Mom method. Normally, the GDM-NAG method converges to the minimum of the loss function more quickly than the GDM-Mom method.
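The only difference between the two update rules is the point at which the gradient is evaluated. A minimal Python sketch follows; the function `gdm`, the toy quadratic loss, and the iteration count are assumptions for illustration, while `lr = 0.1` and `m = 0.9` mirror the values quoted in the text.

```python
def gdm(grad, beta0, lr=0.1, m=0.9, t_max=500, nesterov=True):
    """Gradient descent with momentum; nesterov=True uses the look-ahead gradient."""
    beta = list(beta0)
    vel = [0.0] * len(beta)                                   # nu^(0) = 0
    for _ in range(t_max):
        if nesterov:
            point = [b + m * v for b, v in zip(beta, vel)]    # look-ahead, Equation (9)
        else:
            point = beta                                      # current point, Equation (11)
        g = grad(point)
        vel = [m * v - lr * gj for v, gj in zip(vel, g)]
        beta = [b + v for b, v in zip(beta, vel)]             # Equation (10)
    return beta

def toy_grad(beta):
    # Gradient of the hypothetical loss (b0 - 1)^2 + 5 * (b1 + 2)^2.
    return [2.0 * (beta[0] - 1.0), 10.0 * (beta[1] + 2.0)]

beta_nag = gdm(toy_grad, [0.0, 0.0], nesterov=True)
beta_mom = gdm(toy_grad, [0.0, 0.0], nesterov=False)
```

Both variants reach the minimizer (1, −2) of the toy loss. For the ENR-ZIBer model, `grad` would be the gradient in Equation (8) with δ held fixed.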

PSO is another competitive algorithm with the potential to have better computational efficiency than the gradient descent method. In this study, the PSO algorithm is used to obtain the optimal estimates, $\hat{β}$ and $\hat{δ}$, that minimize $L (β, δ)$, which is defined by

L (β, δ) = - \sum_{i = 1}^{n} \left(y_{i} \times \ln \left(\frac{π_{i}}{1 - π_{i}}\right) + \ln (1 - π_{i})\right) + \frac{λ_{1}}{2} \sum_{j = 1}^{m} β_{j}^{2} + λ_{2} \sum_{j = 1}^{m} | β_{j} |,

(12)

where $π_{i} = (1 - δ) p_{i}$. The steps to implement the PSO algorithm are given below:

The implementation of the PSO algorithm:

Initialization Step.: Using $L (β, δ)$ in Equation (12) as the target function, initialize a population of particles as candidate solutions in the solution space and assign each particle a random velocity.
Evaluation Step.: Evaluate the fitness of each particle using the target function. Update the personal best for each particle if the current position is better. Update the global best if any particle’s personal best is better than the current one.
Update Velocity Step.: Update the particle’s velocity based on its current velocity, the distance to its personal best, and the global best. The velocity updating formula typically includes inertia, cognitive (individual), and social (global) components.
Update Position Step.: Move each particle to a new position based on its updated velocity. Ensure the new positions are in the solution space.
Iteration Step.: Repeat the evaluation, velocity, and position updating steps for a predetermined number of iterations or until a stopping criterion is met.

Let N denote the number of particles; generate $β_{i, j} \sim U (β_{j, min}, β_{j, max})$, $j = 1, 2, \dots, m$, and $δ_{i} \sim U (0, 1)$, $i = 1, 2, \dots, N$, where $U (a, b)$ is the uniform distribution over the range from a to b, and $β_{j, min}$ and $β_{j, max}$ are the minimum and maximum of $β_{j}$, respectively. Let $pb$ and $gb$ denote the personal best and global best positions. At the initial step, $t = 0$, set $Θ_{i}^{(0)} = (β_{i}^{(0)}, δ_{i}^{(0)})$, where $β_{i}^{(0)} = {(β_{i, 1}^{(0)}, β_{i, 2}^{(0)}, \dots, β_{i, m}^{(0)})}^{T}$, $i = 1, 2, \dots, N$, and let

{pb}_{i}^{(0)} = Θ_{i}^{(0)}, \quad i = 1, 2, \dots, N .

(13)

Evaluate the value of $L (Θ_{i}^{(0)})$, $i = 1, 2, \dots, N$, and let

{gb}^{(0)} = \arg \min_{Θ \in \{Θ_{1}^{(0)}, Θ_{2}^{(0)}, \dots, Θ_{N}^{(0)}\}} L (Θ) .

(14)

Update the velocity vector $υ_{i} = {(υ_{i, 1}, υ_{i, 2}, \dots, υ_{i, m + 1})}^{T}$ and $Θ_{i}$ at iteration $t + 1$ by

υ_{i}^{(t + 1)} = w^{(t)} υ_{i}^{(t)} + c_{1} R_{1} ({pb}_{i}^{(t)} - Θ_{i}^{(t)}) + c_{2} R_{2} ({gb}^{(t)} - Θ_{i}^{(t)}), \quad i = 1, 2, \dots, N,

(15)

Θ_{i}^{(t + 1)} = Θ_{i}^{(t)} + υ_{i}^{(t + 1)}, \quad i = 1, 2, \dots, N,

(16)

where $c_{1} > 0$ and $c_{2} > 0$ are the cognitive and social coefficients, respectively; $R_{1}$ and $R_{2}$ are generated from $U (0, 1)$; and $w^{(t)}$ is the inertia weight used to reduce $υ_{i}^{(t)}$. The initial velocity $υ_{i}^{(0)}$ can be zero or generated from a uniform distribution. In practical applications, $w^{(t)}$ can be evaluated by

w^{(t)} = w_{max} - \frac{w_{max} - w_{min}}{t_{max}} \times t,

(17)

where $w_{min}$ and $w_{max}$ are the predetermined minimum and maximum of the weights, and $t_{max}$ is the maximum number of PSO iterations. While searching for the optimal solution, we can impose upper bounds, $υ_{j}^{max}$, $j = 1, 2, \dots, m$, on the velocity as follows:

If $υ_{i, j}^{(t + 1)} > υ_{j}^{max}$ , then $υ_{i, j}^{(t + 1)} = υ_{j}^{max}$ , $i = 1, 2, \dots, N$ ;
If $υ_{i, j}^{(t + 1)} < - υ_{j}^{max}$ , then $υ_{i, j}^{(t + 1)} = - υ_{j}^{max}$ , $i = 1, 2, \dots, N$ .

The bound $υ_{j}^{max}$ can be set as

υ_{j}^{max} = ℓ \times \frac{Θ_{j, max} - Θ_{j, min}}{2}, \quad ℓ \in (0, 1] .

(18)

Update $pb$ and $gb$ at iteration $t + 1$ by the following:

If $L (Θ_{i}^{(t + 1)}) \leq L ({pb}_{i}^{(t)})$ , then ${pb}_{i}^{(t + 1)} = Θ_{i}^{(t + 1)}$ , $i = 1, 2, \dots, N$ .
If $L (Θ_{i}^{(t + 1)}) \leq L ({gb}^{(t)})$ , then ${gb}^{(t + 1)} = Θ_{i}^{(t + 1)}$ , $i = 1, 2, \dots, N$ .
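The PSO steps above can be sketched compactly. The following Python code is an illustrative sketch and not the implementation used in the paper: the function `pso`, the defaults `w_max = 0.9`, `w_min = 0.4`, and `ell = 0.5`, the per-coordinate draws of $R_{1}$ and $R_{2}$, the clipping of positions to the search box, and the toy loss are all assumptions.

```python
import random

def pso(loss, lower, upper, n_particles=20, t_max=200,
        w_max=0.9, w_min=0.4, c1=1.193, c2=1.193, ell=0.5, seed=0):
    """Minimize loss over the box [lower, upper]; returns (gb, loss(gb))."""
    rng = random.Random(seed)
    dim = len(lower)
    v_max = [ell * (u - l) / 2.0 for l, u in zip(lower, upper)]   # Equation (18)
    pos = [[rng.uniform(l, u) for l, u in zip(lower, upper)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]               # zero initial velocity
    pb = [p[:] for p in pos]                                      # Equation (13)
    pb_loss = [loss(p) for p in pos]
    best = min(range(n_particles), key=lambda i: pb_loss[i])
    gb, gb_loss = pb[best][:], pb_loss[best]                      # Equation (14)
    for t in range(t_max):
        w = w_max - (w_max - w_min) * t / t_max                   # Equation (17)
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][j]
                     + c1 * r1 * (pb[i][j] - pos[i][j])
                     + c2 * r2 * (gb[j] - pos[i][j]))             # Equation (15)
                vel[i][j] = max(-v_max[j], min(v_max[j], v))      # velocity bounds
                # Equation (16), then keep the particle inside the search box.
                pos[i][j] = min(upper[j], max(lower[j], pos[i][j] + vel[i][j]))
        for i in range(n_particles):                              # update pb and gb
            value = loss(pos[i])
            if value <= pb_loss[i]:
                pb[i], pb_loss[i] = pos[i][:], value
            if value <= gb_loss:
                gb, gb_loss = pos[i][:], value
    return gb, gb_loss

# Hypothetical smooth test loss with its minimum at (1, -2).
best_pos, best_loss = pso(lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2,
                          lower=[-5.0, -5.0], upper=[5.0, 5.0])
```

For the ENR-ZIBer model, the position vector would hold the m coefficients of β followed by δ, with the last coordinate bounded to (0, 1).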

2.2. The PSO-NAG and PSO-NAG-SMOTE Algorithms

For an imbalanced dataset with majority and minority classes, SMOTE can be a data augmentation technique for the minority class. First, SMOTE selects a minority class instance, say A, at random. Its k nearest neighbors are found in the minority class. The synthetic instance can be created by choosing one of the k nearest neighbors, say B, at random. Then, A and B are connected to form a line segment in the feature space. The synthetic instances are generated as a convex combination of instances A and B.
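The interpolation step described above can be illustrated with a short, dependency-free Python sketch. It is a simplified version for continuous features only; the function `smote`, its arguments, and the four-point toy minority class are assumptions, and a maintained library implementation would normally be preferred in practice.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)                          # instance A
        others = [p for p in minority if p is not a]
        neighbors = sorted(others, key=lambda p: dist2(a, p))[:k]
        b = rng.choice(neighbors)                         # neighbor B
        u = rng.random()                                  # position on the segment AB
        synthetic.append([ai + u * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_new=6, k=2)
```

With k = 2, each corner's nearest neighbors are the two adjacent corners, so every synthetic point lies on an edge of the square, which shows that SMOTE never leaves the convex hull of the minority class.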

First, partition the dataset into a training dataset and a test dataset. Apply SMOTE to the training dataset only to augment the minority class, and then use the augmented dataset to train the model; that is, minimize the target loss function. Two integrated algorithms are introduced to implement the ENR-ZIBer model. The first is the PSO-NAG algorithm, which can be implemented using the following steps.

The implementation of the PSO-NAG algorithm:

Step 1.: Use the PSO algorithm described in Section 2.1 to obtain 100 estimates $({\hat{β}}^{(j)}, {\hat{δ}}^{(j)})$ , $j = 1, 2, \dots, 100$ .
Step 2.: Use the median of 100 estimates of ${\hat{δ}}^{(j)}$ , $j = 1, 2, \dots, 100$ as the optimal estimate of $δ$ , and denote it by $\hat{δ}$ .
Step 3.: Use the GDM-NAG updating method mentioned in Section 2.1 to obtain the estimate $\hat{β}$ , which minimizes the loss function defined by

$\begin{matrix} L (β, \hat{δ}) & = - \sum_{i = 1}^{n} (y_{i} ln (π_{i}^{*}) + (1 - y_{i}) ln (1 - π_{i}^{*})) + \frac{λ_{1}}{2} \sum_{j = 1}^{m} β_{j}^{2} + λ_{2} \sum_{j = 1}^{m} | β_{j} |, \end{matrix}$

(19)

where $π_{i}^{*} = (1 - \hat{δ}) \times p_{i}$ , and $p_{i} = \frac{e^{β^{T} x_{i}}}{1 + e^{β^{T} x_{i}}}$ , $i = 1, 2, \dots, n$ .

The PSO-NAG-SMOTE algorithm is similar to the PSO-NAG algorithm. The only difference is adding one more step for data augmentation before using the PSO-NAG algorithm. The PSO-NAG-SMOTE algorithm can be implemented with the following steps.

Data Augmentation Step.: Use the SMOTE method to augment the data.
Following Steps.: Use the augmented data to train the ENR-ZIBer model based on Step 1 to Step 3 of the PSO-NAG algorithm.

Split the data $D$ into training and test samples, and use the training sample to train the ENR-ZIBer model. The contingency table of the true and predicted results is given in Table 2. In Table 2, $TP$, $TN$, $FP$, and $FN$ denote the number of true positive cases, true negative cases, false positive cases, and false negative cases, respectively. After the ENR-ZIBer model is established using the training sample, the test sample is used to evaluate the classification performance based on the metrics of accuracy, recall (sensitivity), precision, and $F_{1}$-score, which are defined as follows:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},

(20)

\mathrm{Recall\ (Sensitivity)} = \frac{TP}{TP + FN},

(21)

\mathrm{Precision} = \frac{TP}{TP + FP},

(22)

and

F_{1}\text{-score} = \frac{2}{\frac{1}{\mathrm{Recall}} + \frac{1}{\mathrm{Precision}}} = 2 \times \frac{\mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} .

(23)

Accuracy measures the rate of correct classification over the whole sample. Accuracy works well for balanced classes; however, it can be misleading if the dataset is imbalanced. Recall measures how well the model identifies the truly positive cases. A high recall indicates fewer FN cases. Recall can be an important metric for medical diagnoses or fraud detection. Precision is the proportion of truly positive cases among the cases predicted as positive. A high precision indicates fewer FP cases, which is crucial in spam detection or legal cases. The $F_{1}$-score is the harmonic mean of the precision and recall metrics, so it balances the FP and FN cases.
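A small worked example makes the warning about accuracy concrete. The Python sketch below computes the four metrics in Equations (20)–(23) from label vectors; the function name `classification_metrics`, the convention of returning 0 when a denominator is zero, and the toy labels are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision, and F1-score for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2.0 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Imbalanced toy sample: 3 positives and 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
fitted = classification_metrics(y_true, [1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
all_zero = classification_metrics(y_true, [0] * 10)
```

In this toy sample, the classifier that always predicts 0 still reaches an accuracy of 0.7 while its recall and $F_{1}$-score are both 0, which is why accuracy alone is not used for imbalanced data.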

3. Performance Evaluation for the ENR-ZIBer Model

In this section, an intensive Monte Carlo simulation study is conducted to verify the classification performance of the proposed methods based on zero-inflated data with a medium and high probability of structural zeros. The parameters for the Monte Carlo simulation are addressed as follows.

The coefficient parameter $β^{T} = (2, 2, 3, 0.5, 1, 0.1, 4, 1.5)$ is used in the logit link to generate the positive responses and the chance zeros.
The values of $δ = 0.5, 0.7$ , and 0.85 are used to generate zero-inflated data with different levels of imbalance. The proportion of the minority group is about 31%, 19%, and 9% in a sample for $δ = 0.5, 0.7$ , and 0.85, respectively.
The explanatory variables are generated based on the following distributions: $x_{1, j} = 1$ , $x_{2, j} \sim N (0, 1)$ , $x_{3, j} \sim B (1, 0.3)$ , $x_{4, j} \sim N (10, 1)$ , $x_{5, j} \sim N (2, 1)$ , $x_{6, j} \sim N (20, 2)$ , $x_{7, j} \sim N (0, 1)$ , and $x_{8, j} \sim B (5, 0.3)$ , $j = 1, 2, \dots, n$ . $N (μ, σ)$ denotes the normal distribution with the mean $μ$ and standard deviation $σ$ ; $B (s, p_{s})$ denotes the Binomial distribution with the number of trials s and success probability $p_{s}$ .
The sample sizes are $n = 300$ , 500, and 800 for $δ = 0.5$ and 0.7, and $n = 500$ and 800 for $δ = 0.85$ . For an imbalanced sample, a bigger sample is needed to establish the model so that the test sample contains more positive cases for performance evaluation. When $δ = 0.85$ , the proportion of the minority group is smaller than 10%. Hence, the sample size is kept at $n = 500$ or more in this case.
The number of iterations to evaluate metrics is 1000.
The learning rate $γ = 0.1$ is used to implement the GDM-Mom and GDM-NAG methods.
The penalty coefficients are given by $α = 0.5$ and $λ = 0.3$ for regularization.
The searching range of $β_{i}$ is from 1 to 10, and the searching range of $δ$ is from 0 to 1.
Random samples are generated from the Bernoulli distribution with a success probability of

$π_{i} = (1 - δ) \times \frac{e^{β^{T} x_{i}}}{1 + e^{β^{T} x_{i}}}, i = 1, 2, \dots, n .$
The maximum number of iterations to implement the PSO algorithm is 1000.
The weight at iteration t is set to be $w^{(t)} = \frac{1}{2 log (2)} = 0.721$ and $c_{1} = c_{2} = \frac{1}{2} + log (2) = 1.193$ .
The swarm size s is the number of particles in the swarm. In this study, it is given by $s = \lfloor 10 + 2 \sqrt{d} \rfloor$ , where d is the length of the parameter vector and $\lfloor x \rfloor$ is the largest integer not greater than x. More particles in the swarm result in a larger initial diversity of the swarm, which allows larger parts of the search space to be covered in each iteration. However, more particles also increase the computational complexity in each iteration. The empirical heuristic value of s is in the interval of [10, 30].
The holdout cross-validation method is used for evaluating the model’s performance. For each generated sample, 70% of the observations are randomly selected as the training sample to train the models, and the other 30% of the observations form the test sample.
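The PSO constants in the settings above can be checked with a few lines of arithmetic. In this Python sketch, the parameter count `n_params = 9` is an assumption: it counts the eight coefficients of β plus δ.

```python
import math

# Inertia weight and acceleration coefficients from the settings list.
w = 1.0 / (2.0 * math.log(2.0))     # approximately 0.721
c = 0.5 + math.log(2.0)             # approximately 1.193

# Swarm size rule: floor(10 + 2 * sqrt(number of parameters)).
n_params = 9                        # assumption: 8 entries of beta plus delta
s = math.floor(10 + 2 * math.sqrt(n_params))
```

Under that assumption the rule gives a swarm of 16 particles, which falls inside the empirical range [10, 30] quoted in the list.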

The metrics of accuracy, recall, precision, and $F_{1}$ -score defined in Equations (20)–(23) are used to compare the performance of the methods considered. R codes are prepared to implement the GDM-Mom and GDM-NAG methods to update the model parameters. The function psoptim in the R package pso is used to implement the PSO algorithm. First, we focus on the metric of recall, which is the proportion of truly positive cases that are correctly identified. The violin plots of recall based on 1000 simulation runs for various methods are displayed in Figure 1. A violin plot is a hybrid of a box plot and a kernel density plot that shows both the summary statistics and the density of a variable.

From Figure 1, we can see that the GDM-Mom, GDM-NAG, and PSO-NAG methods have similar shapes and are competitive in terms of the recall metric. The PSO method computes more quickly than all the GDMs. However, its recall is slightly lower than that of the GDM-Mom, GDM-NAG, and PSO-NAG methods, although its spread range is narrower. To overcome the time-consuming drawback of gradient descent methods, we suggest obtaining the estimate of $δ$ based on the PSO method and then running the GDM-NAG method by replacing $δ$ with the mean of 100 PSO estimates of $δ$. The method is named the PSO-NAG method. The PSO-NAG method can run more quickly than the GDM-NAG method.

In view of Figure 1, we also find that data augmentation can help the ZIBer model to enhance the value of recall; that is, the ability of the ZIBer model to correctly identify if the positive case is truly positive can be enhanced after using the SMOTE method for data augmentation. Figure 1 shows that the PSO-NAG-SMOTE method outperforms all the competitors in terms of a high value of recall and the narrowest spread range.

Figure 1. Violin plots of recall based on 1000 simulation runs for various methods.

Recall is not the only metric for evaluating the ability to correctly identify the TP cases. We study this ability using recall, precision, and their integrated metric, the $F_{1}$-score. All simulation results are reported in Table 3, Table 4 and Table 5. As mentioned in the parameter settings, only $n = 500$ and $n = 800$ are used for the case of $δ = 0.85$ in Table 5, so that the test sample contains more cases with $Y = 1$ for model evaluation. Unlike the screening process suggested by Xin et al. [4], we use the $F_{1}$-score instead of accuracy to determine the cut point. High recall means that most TP cases have been successfully identified, and high precision means that most predicted positive cases are truly positive. A method with both high recall and high precision is preferred; otherwise, a balance between recall and precision is needed. Hence, the $F_{1}$-score is used in this study in place of recall alone for assessing classification performance.
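Choosing the cut point by the $F_{1}$-score can be sketched as a simple grid search. This is a minimal sketch, not the screening procedure of the paper: the grid of candidate cut points, the function names, and the toy scores are assumptions made for illustration.

```python
def f1_score(y_true, y_pred):
    # F1 from the confusion-matrix counts; returns 0 when there are no true positives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_cut_point(y_true, scores, grid=None):
    # Try each candidate cut point and keep the first one with the highest F1.
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01, ..., 0.99
    best_c, best_f1 = None, -1.0
    for c in grid:
        preds = [1 if s >= c else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_c, best_f1 = c, f1
    return best_c, best_f1

# Toy predicted probabilities (hypothetical).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60]
cut, f1 = best_cut_point(y_true, scores)
```

In this toy case the selected cut point accepts one FP in order to capture all three positives, which is the kind of recall and precision compromise the $F_{1}$-score rewards.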

From Table 3, Table 4 and Table 5, we find the following results.

1.: The classification quality of the ENR-ZIBer model depends on the level of imbalance. The mean recall columns in Table 3, Table 4 and Table 5 report high values, which indicates that the ENR-ZIBer model has a good ability to correctly identify the TP cases. However, we also find that the value of precision is significantly lower than the value of recall.
2.: The violin plots of precision based on 1000 simulation runs for various methods shown in Figure 2 also indicate that the values of precision are lower than the values of recall shown in Figure 1. Figure 2 shows that the PSO-NAG method slightly outperforms the other competitors with a higher precision metric. The findings indicate that enhancing the ability to identify the TP cases through using the SMOTE method also incurs more FP cases in classification. Table 5 shows that when the proportion of the minority group is lower than 10%, the precision significantly drops from the corresponding figures in Table 3 and Table 4. This is a trade-off to enhance the recall of the ENR-ZIBer model by using the SMOTE method for data augmentation.
3.: The rule of using a high $F_{1}$ -score to screen out the best model hurts the accuracy. It is also important to have a compromise between correctly identifying the TP cases and keeping a high rate of TP prediction. Considering a higher value of the $F_{1}$ -score as the metric to screen out the best model, we can find that the GDM-NAG algorithm outperforms the GDM-Mom and PSO algorithms for most of the cells in Table 3, Table 4 and Table 5. Moreover, we can take advantage of the PSO algorithm, which can save computation time to obtain a reliable estimate of $δ$ . The PSO-NAG method is recommended due to the fact that it can save more computation time than the GDM-Mom and GDM-NAG methods.
4.: Table 3, Table 4 and Table 5 show that the GDM-NAG method produces an estimation quality similar to that of the GDM-Mom method. The PSO algorithm is the most efficient in terms of computation time; however, its performance is inferior to that of the GDM-NAG and GDM-Mom methods. The PSO-NAG algorithm uses the PSO algorithm to obtain the estimate of $δ$ based on 100 iterations and then obtains the estimate of $β$ using the GDM-NAG method with the PSO estimate of $δ$. Table 3, Table 4 and Table 5 show that the PSO-NAG algorithm offers a good compromise between a high $F_{1}$-score and low computation time (Figure 3).
5.: The SMOTE method significantly increases recall but also decreases precision. After balancing the two with the $F_{1}$-score, we find that the proposed PSO-NAG method beats the PSO-NAG-SMOTE method. SMOTE is a popular technique for improving the performance of machine learning tools on imbalanced data. However, for the ENR-ZIBer model, the trade-off of using SMOTE is higher recall at the cost of lower precision. Moreover, the PSO-NAG-SMOTE method requires more computation time. The benefit of using SMOTE with the PSO-NAG algorithm in the ENR-ZIBer model is insignificant once the loss in precision is taken into account. If dealing with FP cases is not an issue for practical applications, the PSO-NAG-SMOTE method is recommended.
6.: Table 3, Table 4 and Table 5 show that increasing the sample size helps to decrease the standard deviation of the metrics from the simulation study. Hence, reliable estimation results depend on the sample size. Basically, at least 300 observations in a sample are recommended to implement the ZIBer model if the level of zero-inflated data and imbalances is moderate. If the level of zero-inflated data and imbalances is extremely high, then at least 500 observations in a sample are recommended to implement the ZIBer model.

In summary, if dealing with FP cases is not an issue for practical applications, the PSO-NAG-SMOTE method is recommended. If a balance between recall and precision is desired when implementing the ENR-ZIBer model, the PSO-NAG method is recommended based on the $F_{1}$-score. The recall value from PSO-NAG is high overall in the simulation results. The findings indicate that the proposed hybrid algorithm with SMOTE for data augmentation can significantly enhance the ability to correctly identify the TP cases.

4. Examples

In this section, two examples are given for illustration. The first example is about Pima Indian diabetes, and the second example is about the biopsy of breast tumors.

4.1. Example 1

In Example 1, a Pima Indian diabetes dataset with 768 cases is used to demonstrate the applications of the proposed methods. This dataset contains eight explanatory variables and a binary response that indicates whether a Pima Indian has a positive response for diabetes. The explanatory variables are defined as follows:

$x_{2}$ :: The number of pregnancies.
$x_{3}$ :: The plasma glucose concentration based on the 2 h oral glucose tolerance test numbers.
$x_{4}$ :: The diastolic blood pressure in mm Hg.
$x_{5}$ :: The triceps skin fold thickness in mm.
$x_{6}$ :: The two-hour serum insulin in $μ$ U/mL.
$x_{7}$ :: The body mass index.
$x_{8}$ :: The diabetes pedigree function.
$x_{9}$ :: The age in years.
y:: The response variable: $y = 1$ or 0 denotes that the Pima Indian has a positive or negative response for diabetes, respectively.

Readers can retrieve the dataset from the R package “mlbench” in R 4.5.0 with the code data(PimaIndiansDiabetes, package = "mlbench").

The dataset does not have missing cells, and the diabetes rate is about 34.9% in this dataset. Please note that the dataset in this study is different from the one in Xin et al. [4]: they used the data “PimaIndiansDiabetes2” in the package mlbench, whereas we use the data “PimaIndiansDiabetes” in the same package. Moreover, we do not use the second logistic model in this study. Hence, the estimation results cannot be compared with those obtained by Xin et al. [4]. In this section, the 5-fold cross-validation method (K = 5) is used to assess the quality of the models. In Section 3, we noted that PSO-NAG can save computation time while performing well. Here, we use the PSO, PSO-Mom, PSO-NAG, and PSO-NAG-SMOTE methods to model the Pima Indian diabetes dataset. PSO-Mom means that $δ$ is estimated by the median of 100 PSO estimates of $δ$, and then GDM-Mom is used to update the parameter $β$ with $δ$ replaced by the PSO estimate. For simplicity, we denote the PSO estimate of $δ$ by $δ$-PSO. The parameter settings for the PSO and SMOTE methods are the same as those given in Section 3.

The estimates of $δ$ from the 5-fold estimation are 0.5384, 0.5124, 0.46715, 0.62145, and 0.31255. Their mean value is 0.49, which is taken as the PSO estimate of $δ$. The mean values of the accuracy, recall, precision, and $F_{1}$-score of the PSO, PSO-Mom, PSO-NAG, and PSO-NAG-SMOTE methods are given in Table 6. The hybrid algorithms PSO-Mom and PSO-NAG are recommended based on the $F_{1}$-score. Moreover, the PSO-NAG-SMOTE method is recommended based on recall. The ENR-ZIBer model with the PSO-NAG-SMOTE method has a good ability to correctly predict the true positive cases.
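The fold-wise averaging of the $δ$ estimates can be checked with a few lines of code. This is a small sketch: the fold values are the ones reported in the two examples of this section, while the helper `k_fold_indices` and its contiguous, unshuffled split are assumptions made for illustration (a real analysis would typically shuffle the cases first).

```python
def k_fold_indices(n, k=5):
    # Split indices 0..n-1 into k contiguous folds whose sizes differ by at most one.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Fold layout for the 768 diabetes cases (hypothetical split, no shuffling).
folds = k_fold_indices(768, k=5)

# Fold-wise delta estimates as reported in the text.
delta_diabetes = [0.5384, 0.5124, 0.46715, 0.62145, 0.31255]
delta_biopsy = [0.05495, 0.0537, 0.0484, 0.05875, 0.085]

mean_diabetes = sum(delta_diabetes) / len(delta_diabetes)
mean_biopsy = sum(delta_biopsy) / len(delta_biopsy)
```

Rounded to two decimals, the two means reproduce the values 0.49 and 0.06 used for the full-data fits.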

Based on the full dataset with $δ = 0.49$ and using the PSO-NAG method, the estimated ENR-ZIBer model is

\begin{aligned} \log\left(\frac{p_{i}}{1 - p_{i}}\right) = {} & 3.3587 + 0.8707 x_{i,2} + 2.4353 x_{i,3} - 1.0721 x_{i,4} + 0.6627 x_{i,5} - 0.9416 x_{i,6} \\ & + 1.701 x_{i,7} + 0.747 x_{i,8} + 3.943 x_{i,9}, \quad i = 1, 2, \dots, 768. \end{aligned}

(24)
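To illustrate how a fitted model of this form could be used for prediction, the sketch below evaluates Equation (24) for one covariate vector. It rests on assumptions that are not restated in this section: the standard ZIBer form P(Y = 1 | x) = (1 - δ) p, and covariates supplied on whatever scale was used when fitting. The function name and the all-zero input are hypothetical.

```python
import math

# Coefficients of Equation (24): intercept, then x_2, ..., x_9.
BETA = [3.3587, 0.8707, 2.4353, -1.0721, 0.6627, -0.9416, 1.701, 0.747, 3.943]
DELTA = 0.49  # PSO estimate of delta reported in the text

def predict_positive_prob(x, beta=BETA, delta=DELTA):
    # x holds the eight covariates (x_2, ..., x_9) on the scale used for fitting.
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    p = 1.0 / (1.0 + math.exp(-eta))   # logistic part of the model
    return (1.0 - delta) * p           # assumed ZIBer form of P(Y = 1 | x)

# Hypothetical case with every covariate equal to zero.
p0 = predict_positive_prob([0.0] * 8)
```

Under this assumed form, the predicted probability of a positive response can never exceed 1 - δ, so a cut point for classification has to be chosen below that bound.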

4.2. Example 2

A biopsy dataset about breast cancer is used for illustration. This dataset was released by Dr. William H. Wolberg from the University of Wisconsin Hospitals, Madison. All biopsies of breast tumors were assessed for 699 patients up to 15 July 1992. In this dataset, nine attributes, except for the ID and response variable, have been scored on a scale of 1 to 10 and given as follows:

ID:: sample code number (not unique).
x₂:: clump thickness.
x₃:: uniformity of cell size.
x₄:: uniformity of cell shape.
x₅:: marginal adhesion.
x₆:: single epithelial cell size.
x₇:: bare nuclei (16 values are missing).
x₈:: bland chromatin.
x₉:: normal nucleoli.
x₁₀:: mitoses.
y:: “benign” (y = 0) or “malignant” (y = 1).

This dataset can be retrieved from the R package “MASS” in R 4.5.0 using the code data(biopsy, package = "MASS"). There are 16 missing values in bare nuclei. After removing the cases with missing values, this dataset has 683 cases, each with 9 attributes and the response variable for data analysis.

The estimates of $δ$ from the 5-fold estimation are 0.05495, 0.0537, 0.0484, 0.05875, and 0.085. Their mean value is 0.06, which is taken as the PSO estimate of $δ$. Because $δ$ is close to zero, the structural zero effect is insignificant. In the biopsy dataset, the performance of the ENR-ZIBer model is excellent in terms of all metrics (Table 7).

Using $δ = 0.06$ and the PSO-NAG method, the estimated ENR-ZIBer model based on the full dataset is

\begin{aligned} \log\left(\frac{p_{i}}{1 - p_{i}}\right) = {} & - 0.3971 + 2.0635 x_{i,2} + 1.8843 x_{i,3} + 0.2503 x_{i,4} + 0.9914 x_{i,5} - 0.024 x_{i,6} \\ & + 2.6169 x_{i,7} + 0.6191 x_{i,8} + 0.9361 x_{i,9} + 0.7805 x_{i,10}, \quad i = 1, 2, \dots, 683. \end{aligned}

(25)

The coefficient of $x_{6}$ is small, which indicates that the attribute $x_{6}$ (single epithelial cell size) is insignificant in this example.

5. Concluding Remarks

In this study, we investigated the estimation quality of using the machine learning methods of PSO, GDM-Mom, GDM-NAG, PSO-NAG, and PSO-NAG-SMOTE to establish the ENR-ZIBer model when the dataset is zero-inflated or imbalanced. Monte Carlo simulations were conducted to verify the classification performance of the ENR-ZIBer model based on the holdout cross-validation method. The purpose is to find the best competitive machine learning method for the prediction of the ENR-ZIBer model and save computation time. The simulation results have shown that the hybrid algorithm, by integrating the PSO and GDM-NAG algorithms with two steps, is the most competitive in balancing the metrics of recall and precision.

The performance of the SMOTE method for data augmentation for the zero-inflated data was also studied in this manuscript. The PSO-NAG-SMOTE method enhances the ENR-ZIBer model’s ability to identify TP cases. However, the augmented data based on the SMOTE method also result in more FP cases in the prediction results using the ENR-ZIBer model. This is a trade-off in using the SMOTE method for the hybrid algorithm of PSO-NAG. In practical applications, the PSO-NAG-SMOTE method can be used to establish the ENR-ZIBer model if the issue of FP cases is not a concern. Only the PSO-NAG-SMOTE algorithm for simulations with 1000 iterations is time-consuming. In practical applications for small to moderate sample sizes, the computation time is acceptable.

Two datasets about Pima Indian diabetes and biopsy were used to illustrate the applications of the proposed method. In these examples, the PSO, PSO-Mom, PSO-NAG, and PSO-NAG-SMOTE methods are used to establish the ENR-ZIBer model, and the performance is evaluated using the 5-fold cross-validation method. As we found in the simulation study, the hybrid algorithm outperforms the other competitors based on the $F_{1}$-score metric. We also find that the ENR-ZIBer model has a good ability to correctly predict the true positive cases.

Simulation results show that using the proposed PSO-NAG-SMOTE hybrid algorithm can enhance the recall of the ENR-ZIBer model. The findings indicate that the proposed hybrid algorithm with SMOTE for data augmentation can significantly enhance the ability to correctly identify the TP cases. How to reduce the possibility of FP cases in the prediction based on the data augmentation method is important. The ZIBer model uses a sigmoid function as a link function. It is easy to implement, but its performance has room to improve. The proposed method uses the median of 100 PSO estimates as the estimate of $δ$, which requires more time for parameter estimation. To enhance the classification performance of the ENR-ZIBer model, how to combine it with other competitive machine learning methods is also of interest. These two topics are still open and will be studied in the future.

Author Contributions

Conceptualization, investigation, writing—review and editing, and project administration: T.-R.T.; writing: Y.L. and C.-J.S.; validation, investigation, and review and editing: Y.L.; methodology: C.-J.S.; investigation: I.-F.C. and C.-J.S.; funding acquisition: T.-R.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, Taiwan, grant number NSTC 112-2221-E-032-038-MY2.

Data Availability Statement

The data can be retrieved in R 4.5.0 from the R packages “mlbench” and “MASS” with the codes data(PimaIndiansDiabetes, package = "mlbench") and data(biopsy, package = "MASS").

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Diop, A.; Diop, A.; Dupuy, J.-F. Simulation-based inference in a zero-inflated Bernoulli regression model. Commun. Stat. Simul. Comput. 2016, 45, 3597–3614.
2. Ridout, M.; Demétrio, C.G.B.; Hinde, J. Models for counts data with many zeros. In Proceedings of the XIXth International Biometric Conference, Cape Town, South Africa, 1–5 July 1996; Invited Papers. International Biometric Society: Cape Town, South Africa, 1998; pp. 179–192.
3. Chiang, J.-Y.; Lio, Y.L.; Hsu, C.-Y.; Ho, C.-L.; Tsai, T.-R. Binary classification with imbalanced data. Entropy 2024, 26, 15.
4. Xin, H.; Lio, Y.L.; Chen, H.-C.; Tsai, T.-R. Zero-inflated binary classification model with elastic net regularization. Mathematics 2024, 12, 2990.
5. Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14.
6. Hall, D.B. Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics 2000, 56, 1030–1039.
7. Gelfand, A.E.; Citron-Pousty, S. Zero-inflated models with application to spatial count data. Environ. Ecol. Stat. 2002, 9, 341–355.
8. Rodrigues, J. Bayesian analysis of zero-inflated distributions. Commun. Stat. Theory Methods 2003, 32, 281–289.
9. Ghosh, S.K.; Mukhopadhyay, P.; Lu, J.C. Bayesian analysis of zero-inflated regression models. J. Stat. Plan. Inference 2006, 136, 1360–1375.
10. Staub, K.E.; Winkelmann, R. Consistent estimation of zero-inflated count models. Health Econ. 2013, 22, 673–686.
11. Zuur, A.F.; Ieno, E.N. Beginner’s Guide to Zero-Inflated Models with R; Highland Statistics Limited: Newburgh, NY, USA, 2016.
12. Lee, S.M.; Pho, K.H.; Li, C.S. Validation likelihood estimation method for a zero-inflated Bernoulli regression model with missing covariates. J. Stat. Plan. Inference 2021, 214, 105–127.
13. Pho, K.H. Goodness of fit test for a zero-inflated Bernoulli regression model. Commun. Stat. Simul. Comput. 2024, 53, 756–771.
14. Li, C.S.; Lu, M. Semiparametric zero-inflated Bernoulli regression with applications. J. Appl. Stat. 2022, 49, 2845–2869.
15. Pho, K.H. Zero-inflated probit Bernoulli model: A new model for binary data. Commun. Stat. Simul. Comput. 2023, 1–21.
16. Lu, M.; Li, C.S.; Wagner, K.D. Penalised estimation of partially linear additive zero-inflated Bernoulli regression models. J. Nonparametric Stat. 2024, 36, 863–890.
17. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95-International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; IEEE: Piscataway, NJ, USA, 1995; pp. 1942–1948.
18. Wang, L. Imbalanced credit risk prediction based on SMOTE and multi-kernel FCM improved by particle swarm optimization. Appl. Soft Comput. 2022, 114, 108153.
19. Dablain, D.; Krawczyk, B.; Chawla, N.V. DeepSMOTE: Fusing deep learning and SMOTE for imbalanced data. IEEE Trans. Neural Networks Learn. Syst. 2023, 34, 6390–6404.
20. Fonseca, J.; Bacao, F. Geometric SMOTE for imbalanced datasets with nominal and continuous features. Expert Syst. Appl. 2023, 234, 121053.
21. Elreedy, D.; Atiya, A.F.; Kamalov, F. A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Mach. Learn. 2024, 113, 4903–4923.
22. Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 2022, 22, 3246.
23. Wongvorachan, T.; He, S.; Bulut, O. A comparison of undersampling, oversampling, and SMOTE methods for dealing with imbalanced classification in educational data mining. Information 2023, 14, 54.
24. Prasetya, J.; Abdurakhman, A. Comparison of Smote Random Forest and Smote K-Nearest Neighbors Classification Analysis on Imbalanced Data. Media Stat. 2023, 15, 198–208.
25. Zhou, H.; Wu, Z.; Xu, N.; Xiao, H. PDR-SMOTE: An imbalanced data processing method based on data region partition and K nearest neighbors. Int. J. Mach. Learn. Cybern. 2023, 14, 4135–4150.
26. Wang, Z.; Liu, T.; Wu, X.; Liu, C. A diagnosis method for imbalanced bearing data based on improved SMOTE model combined with CNN-AM. J. Comput. Des. Eng. 2023, 10, 1930–1940.
27. Wen, J.; Tang, X.; Lu, J. An imbalanced learning method based on graph tran-smote for fraud detection. Sci. Rep. 2024, 14, 16560.
28. Zhang, Y.; Deng, L.; Wei, B. Imbalanced Data Classification Based on Improved Random-SMOTE and Feature Standard Deviation. Mathematics 2024, 12, 1709.
29. Wang, J.H.; Liu, C.Y.; Min, Y.R.; Wu, Z.H.; Hou, P.L. Cancer diagnosis by gene-environment interactions via combination of SMOTE-Tomek and overlapped group screening approaches with application to imbalanced TCGA clinical and genomic data. Mathematics 2024, 12, 2209.

Figure 2. The violin plots of precision based on 1000 simulation runs for various methods.

Figure 3. The violin plots of $F_{1}$-score based on 1000 simulation runs.

Table 1. A summary of abbreviations.

ZIBer	zero-inflated Bernoulli
ZIP	zero-inflated Poisson
EM	expectation–maximization method
EM-ZIBer	ZIBer model using the EM method
ENR	elastic net regularization
ENR-ZIBer	ZIBer model using the ENR rule
GDM	gradient descent method
GDM-Mom	GDM using the momentum learning rate
SMOTE	synthesized minority oversampling technique
MK-FCM	multi-kernel Fuzzy C-Means
PSO	particle swarm optimization
G-SMOTENC	geometric SMOTE method for Nominal and Continuous features
GDM-NAG	GDM using Nesterov’s accelerated gradient factor
PSO-NAG	A new algorithm that integrates the PSO and GDM-NAG methods
PSO-NAG-SMOTE	PSO-NAG algorithm using SMOTE for data augmented sampling
TP	true positive
TN	true negative
FP	false positive
FN	false negative

Table 2. The confusion matrix.

		True
		Positive	Negative
Predicted	Positive	TP	FP	$\mathrm{Precision} = \frac{TP}{TP + FP}$
	Negative	FN	TN
		$\mathrm{Recall} = \frac{TP}{TP + FN}$		$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
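Table 2 lists precision, recall, and accuracy but not the $F_{1}$-score used throughout the comparisons. For reference, the standard definition (assumed here to be the one applied in this study) is the harmonic mean of precision and recall, which can also be written directly in terms of the confusion-matrix counts:

```latex
F_{1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
      = \frac{2\,TP}{2\,TP + FP + FN}
```

The second form makes clear that TN does not enter the $F_{1}$-score, which is why it behaves differently from accuracy on zero-inflated data.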

Table 3. The mean and standard deviation of the metrics of accuracy, recall, precision, and $F_{1}$-score based on 1000 simulation runs with $δ = 0.5$.

		Mean				Standard Deviation
$n$	Method	Accuracy	Recall	Precision	$F_{1}$	Accuracy	Recall	Precision	$F_{1}$
300	GDM-Mom	0.6737	0.8836	0.4892	0.6241	0.0640	0.0833	0.0747	0.0614
300	GDM-NAG	0.6744	0.8851	0.4894	0.6249	0.0631	0.0817	0.0741	0.0617
300	PSO	0.5954	0.8152	0.4282	0.5571	0.0684	0.0894	0.0732	0.0701
300	PSO-NAG	0.6626	0.8950	0.4828	0.6221	0.0645	0.0740	0.0744	0.0618
300	PSO-NAG-SMOTE	0.6442	0.9141	0.4674	0.6140	0.0692	0.0624	0.0752	0.0670
500	GDM-Mom	0.6697	0.8923	0.4848	0.6248	0.0494	0.0679	0.0582	0.0486
500	GDM-NAG	0.6701	0.8945	0.4848	0.6256	0.0476	0.0643	0.0575	0.0486
500	PSO	0.5906	0.8121	0.4203	0.5509	0.0600	0.0787	0.0586	0.0570
500	PSO-NAG	0.6643	0.8983	0.4793	0.6221	0.0473	0.0633	0.0551	0.0472
500	PSO-NAG-SMOTE	0.6429	0.9240	0.4654	0.6164	0.0519	0.0484	0.0562	0.0500
800	GDM-Mom	0.6682	0.8955	0.4839	0.6260	0.0396	0.0574	0.0459	0.0381
800	GDM-NAG	0.6669	0.8994	0.4829	0.6262	0.0402	0.0547	0.0462	0.0381
800	PSO	0.5882	0.8038	0.4181	0.5477	0.0523	0.0699	0.0473	0.0454
800	PSO-NAG	0.6645	0.9029	0.4820	0.6266	0.0385	0.0504	0.0434	0.0369
800	PSO-NAG-SMOTE	0.6396	0.9282	0.4601	0.6145	0.0390	0.0373	0.0444	0.0406

Table 4. The mean and standard deviation of the metrics of accuracy, recall, precision, and $F_{1}$-score based on 1000 simulation runs with $δ = 0.7$.

		Mean				Standard Deviation
$n$	Method	Accuracy	Recall	Precision	$F_{1}$	Accuracy	Recall	Precision	$F_{1}$
300	GDM-Mom	0.6260	0.8076	0.3182	0.4471	0.0968	0.1377	0.0720	0.0704
300	GDM-NAG	0.6249	0.8080	0.3176	0.4463	0.0956	0.1377	0.0737	0.0709
300	PSO	0.5306	0.8119	0.2639	0.3940	0.0759	0.1078	0.0635	0.0765
300	PSO-NAG	0.6170	0.8058	0.3150	0.4417	0.1056	0.1433	0.0788	0.0742
300	PSO-NAG-SMOTE	0.5351	0.8972	0.2781	0.4194	0.0936	0.0888	0.0664	0.0771
500	GDM-Mom	0.6074	0.8338	0.3049	0.4407	0.0788	0.1120	0.0570	0.0587
500	GDM-NAG	0.6038	0.8398	0.3030	0.4403	0.0774	0.1061	0.0556	0.0592
500	PSO	0.5265	0.8031	0.2562	0.3855	0.0659	0.0919	0.0488	0.0589
500	PSO-NAG	0.6083	0.8238	0.3037	0.4370	0.0834	0.1177	0.0585	0.0589
500	PSO-NAG-SMOTE	0.5424	0.9124	0.2811	0.4272	0.0597	0.0612	0.0472	0.0557
800	GDM-Mom	0.5994	0.8424	0.2998	0.4387	0.0626	0.0948	0.0439	0.0468
800	GDM-NAG	0.5946	0.8543	0.2982	0.4392	0.0590	0.0847	0.0427	0.0467
800	PSO	0.5271	0.7926	0.2559	0.3847	0.0619	0.0806	0.0395	0.0471
800	PSO-NAG	0.5997	0.8336	0.3011	0.4381	0.0682	0.1021	0.0445	0.0459
800	PSO-NAG-SMOTE	0.5333	0.9248	0.2763	0.4239	0.0471	0.0477	0.0370	0.0442

Table 5. The mean and standard deviation of the metrics of accuracy, recall, precision, and $F_{1}$-score based on 1000 simulation runs with $δ = 0.85$.

		Mean				Standard Deviation
$n$	Method	Accuracy	Recall	Precision	$F_{1}$	Accuracy	Recall	Precision	$F_{1}$
500	GDM-Mom	0.6114	0.7349	0.1681	0.2618	0.1366	0.1800	0.0645	0.0629
500	GDM-NAG	0.6119	0.7341	0.1680	0.2619	0.1363	0.1787	0.0632	0.0634
500	PSO	0.4819	0.8154	0.1311	0.2237	0.0763	0.1159	0.0369	0.0557
500	PSO-NAG	0.6596	0.6399	0.1865	0.2646	0.1599	0.2164	0.0798	0.0606
500	PSO-NAG-SMOTE	0.4744	0.8761	0.1377	0.2357	0.0834	0.1008	0.0370	0.0556
800	GDM-Mom	0.6025	0.7489	0.1625	0.2616	0.1060	0.1523	0.0382	0.0465
800	GDM-NAG	0.5963	0.7561	0.1611	0.2605	0.1062	0.1495	0.0369	0.0461
800	PSO	0.4857	0.8001	0.1324	0.2259	0.0707	0.0980	0.0286	0.0428
800	PSO-NAG	0.6387	0.6655	0.1765	0.2628	0.1417	0.2003	0.0576	0.0502
800	PSO-NAG-SMOTE	0.4597	0.9033	0.1380	0.2381	0.0649	0.0710	0.0294	0.0443

Table 6. The mean of accuracy, recall, precision, and $F_{1}$-score based on the Pima Indian diabetes dataset using 5-fold estimation results.

Method	Accuracy	Recall	Precision	$F_{1}$ -Score
PSO	0.669	0.761	0.519	0.616
PSO-Mom	0.710	0.863	0.558	0.675
PSO-NAG	0.707	0.851	0.563	0.672
PSO-NAG-SMOTE	0.682	0.885	0.531	0.661

Table 7. The mean of accuracy, recall, precision, and $F_{1}$-score based on the biopsy dataset using 5-fold estimation results.

Method	Accuracy	Recall	Precision	$F_{1}$ -Score
PSO	0.975	0.992	0.941	0.966
PSO-Mom	0.976	0.988	0.949	0.967
PSO-NAG	0.976	0.998	0.942	0.968
PSO-NAG-SMOTE	0.976	0.992	0.945	0.967

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
