Binary Classification with Imbalanced Data

When the binary response variable contains an excess of zero counts, the data are imbalanced. Imbalanced data cause trouble for binary classification. An expectation-maximization (EM) algorithm is proposed to simplify the numerical computation needed to obtain the maximum likelihood estimates of the zero-inflated Bernoulli (ZIBer) model parameters with imbalanced data. The logistic regression model links the Bernoulli probabilities with the covariates in the ZIBer model, and the prediction performance of the ZIBer model, LightGBM, and artificial neural network (ANN) procedures is compared by Monte Carlo simulation. The results show that no method dominates the others in predictive performance under imbalanced data. The LightGBM and ZIBer models are more competitive than the ANN model for zero-inflated, imbalanced data sets.


Introduction
When a data set contains an excess of zero counts over the expectation of a standard statistical distribution, a zero-inflated model helps to analyze such data sets. In many scenarios, count data have two sources of zeros: structural zeros and sampling zeros. Structural zeros mean that the response variable cannot take positive values because of inherent constraints or conditions, whereas sampling zeros result from random chance. The idea behind the zero-inflated model is to account for the two sources of zeros using a mixture of two separate distributions. One component characterizes the probability of structural zeros, and the other describes the non-zero counts. This mixture mechanism allows the zero-inflated model to better represent the complexities of data sets with excess zeros. The zero-inflated Poisson (ZIP) model and the zero-inflated negative binomial (ZINB) model are the two major zero-inflated models.
Imbalanced data refer to data sets in which the distribution of categories is highly disproportionate. For example, in a data set with two classes, one class has significantly more instances than the other, resulting in an unequal representation of the two categories. Data sets can be inherently imbalanced in real-world scenarios for various reasons. One example of imbalanced data in medical applications is diagnosing a rare disease, where only a tiny percentage of patients have the condition. In this case, the class size of the rare event is severely smaller than that of the common event.
Traditional machine learning algorithms and statistical models can struggle to handle imbalanced data. The prediction results can be biased toward the majority class, leading to poor performance on the minority class. However, the minority class is often of greater interest in many applications. The well-known drawbacks of machine learning algorithms caused by imbalanced data sets are model bias, poor generalization, and misleading evaluation metrics. Several techniques can be used to address the issue of imbalanced data, including resampling, synthetic data generation, ensemble methods, and more.
The artificial neural network (ANN) has been widely used to implement categorical classification. Rosenblatt [1] proposed a theory of perception about a hypothetical nervous system that can sense, remember, and recognize information from the physical world. Rumelhart et al. [2] discussed a learning algorithm for backpropagation ANNs. The algorithm adjusts the weights of the connections in the network to minimize the error between the actual output and the desired output. Hirose et al. [3] modified the backpropagation algorithm by varying the number of hidden units in the ANN. Their algorithm can reduce the probability of becoming trapped in local minima and speed up convergence compared to the conventional backpropagation ANN. Sietsma and Dow [4] studied the relationship between the size and structure of an ANN and its ability to generalize from a training set. Using a complex classification task, they investigated the effects of pruning and noise training on network performance.
In the previous decade, Devi et al. [5] emphasized the classification of cervical cancer using ANN methods to develop a strategy for accurately categorizing cervical cancer cases. Sze et al. [6] provided a comprehensive tutorial and survey on the efficient processing of deep neural networks. Abiodun et al. [7] surveyed ANN applications to provide a taxonomy of ANN models and give readers knowledge of current and emerging trends in ANN research. Nasser and Abu-Naser [8] addressed the detection of lung cancer using an ANN method to explore the feasibility of employing ANN technology for identifying lung cancer, which is a significant advancement in medical diagnostics. Muhammad et al. [9] delved into predicting pancreatic cancer using an ANN method. Umar [10] predicted student academic performance using ANN methods to forecast students' academic achievements. Shen et al. [11] introduced a novel ensemble classification model for imbalanced credit risk evaluation. The model integrates neural networks with a classifier optimization technique. The study addressed the challenge of imbalanced data sets in credit risk assessment by leveraging the strengths of neural networks and advanced optimization techniques.
Lambert [12] proposed a ZIP regression model to process count data with excess zeros. The author applied the ZIP regression model to a soldering defects data set on the printed wiring board and compared the performance of the proposed model with other models. Hall [13] presented a case study of ZIP and zero-inflated binomial (ZIB) regression with random effects. The author showed that zero-inflated models are useful when the data contain an excess of zeros, and random effects can be used to account for the correlation between observations within clusters. Cheung [14] discussed the theoretical underpinnings of zero-inflated models, provided examples of their application, and discussed the implications of using these models for accurate inference in the context of growth and development studies. Gelfand and Citron-Pousty [15] presented how to use zero-inflated models for spatial count data within the environmental and ecological statistics field. They explored how zero-inflated models can account for both the spatial correlation and the excess zero counts often found in environmental and ecological applications. Rodrigues [16] discussed how Bayesian techniques can be applied to estimate the parameters of zero-inflated models and provided practical examples to demonstrate the approach.
Ghosh et al. [17] provided a Bayesian analysis of zero-inflated regression models. Their Bayesian approach suggested a coherent way to handle the challenges of excess zeros in count data, offering insights into parameter estimation and uncertainty quantification. Harri and Zhao [18] introduced the zero-inflated ordered probit model to handle ordered response data with excess zeros. The proposed method is applied to analyze tobacco consumption data. Loeys et al. [19] explored the analysis of zero-inflated count data, going beyond the scope of ZIP regression. Diop et al. [20] studied the maximum likelihood estimation method in the logistic regression model with a cure fraction. The authors introduced the concept of cure fraction models, which analyze data where many individuals do not experience the event of interest. Staub and Winkelmann [21] studied the consistent estimation of zero-inflated count models. He et al. [22] discussed the concept of structural zeros and their relevance to zero-inflated models. Diop et al. [23] studied simulation-based inference in a zero-inflated Bernoulli regression model. This research is valuable for practitioners working with binary data exhibiting excess zeros.
Alsabti et al. [24] presented a novel decision tree classifier called CLOUDS for classification using big data. The authors showed that techniques such as discretization and data set sampling can be used to scale up decision tree classifiers to large data sets. However, both methods can cause a significant accuracy loss. Friedman [25] introduced Gradient Boosting Machines (GBMs), a powerful technique for building predictive models by iteratively combining weak learners. GBMs are known for their effectiveness in handling complex, high-dimensional data with robustness against overfitting. Jin and Agrawal [26] addressed the challenge of constructing decision trees efficiently in a parallel computing environment. They improved communication and memory efficiency during the parallel construction of decision trees.
Chen and Guestrin [27] developed "XGBoost", a scalable tree-boosting system. XGBoost is a powerful and efficient algorithm for boosting decision trees. Ke et al. [28] introduced "LightGBM", a highly efficient gradient boosting decision tree algorithm. This tool contributes dramatically to machine learning, particularly to boosting techniques. LightGBM has gained attention for its speed and scalability, making it suitable for handling large data sets and complex problems. Wang et al. [29] applied LightGBM to miRNA classification in breast cancer patients. Ma et al. [30] used multi-observation and multi-dimensional data cleaning methods and the LightGBM and XGBoost algorithms for prediction. The authors concluded that the LightGBM algorithm based on multiple observational data set classification gives the best prediction results. Machado et al. [31] introduced LightGBM, showing it is an excellent decision tree gradient boosting method. The authors applied LightGBM to predict customer loyalty within the finance industry. Minstireanu and Mesnita [32] employed the LightGBM algorithm for online click fraud detection. Daoud [33] compared three popular gradient boosting algorithms, XGBoost, LightGBM, and CatBoost, to evaluate the strengths and weaknesses of each algorithm in the context of credit risk assessment and related tasks. Some open challenges for implementing imbalanced data analysis have been comprehensively discussed by Krawczyk [34] and Nalepa and Kawulok [35]. The two works provided insights into imbalanced data modeling. Krawczyk [34] pointed out open challenges including binary and multi-class classification problems, multilabel and multi-instance learning, and unsupervised and semi-supervised handling of imbalanced data sets. It is also challenging to perform regression on skewed cases, to design a well-founded procedure for learning from imbalanced data streams under stationary and drifting environments, and then to extend the model to large-scale and big-data cases. Nalepa and Kawulok [35] reviewed the issue of selecting training sets for support vector machines. They did an extensive survey of existing methods for using the support vector machine method to train data in big-data applications. Nalepa and Kawulok [35] separated these helpful techniques into several categories. This work helps users understand the underlying ideas behind the algorithms.
Although a ZIBer regression model has been proposed in the literature, how to obtain the maximum likelihood estimates (MLEs) of the ZIBer regression model parameters remains a gap in this topic. We are also interested in studying the performance of the ZIBer regression model compared with machine learning classifiers, for example, the LightGBM and ANN methods. In this study, we propose an EM algorithm to obtain reliable MLEs of the ZIBer regression model parameters based on zero-inflated, imbalanced binary data. The logistic model links the event or non-event probability with the explanatory variables. The innovation is to propose the theoretical process of the EM algorithm to obtain reliable MLEs for the ZIBer regression model using imbalanced data sets. Monte Carlo simulations were conducted to show the predictive performance of the ZIBer, LightGBM, and ANN methods for binary classification under imbalanced data.
The rest of this article is organized as follows. Section 2 presents the ZIBer regression model and the proposed EM procedure that produces reliable MLEs of the model parameters. In Section 3, two examples are used to illustrate the application of the proposed EM algorithm to the ZIBer regression model. In Section 4, Monte Carlo simulations were conducted to compare the classification rates of the ZIBer model, LightGBM, and ANN methods in terms of three measures. The design of the Monte Carlo simulations and their implementation are studied in that section. Some concluding remarks are given in Section 5.

Zero-Inflated Bernoulli Regression Model and EM Algorithm
Consider a special case about the infection status of a specific disease. Let Y_i denote the infection status for the specific disease of the i-th individual in a sample of size n. The logistic function linking Y_i and the covariate vector x_i can be expressed by

p_i = P(Y_i = 1 | x_i) = exp(β^T x_i) / (1 + exp(β^T x_i)), (1)

where β^T = (β_1, β_2, ..., β_p) is the coefficient vector. Failure to account for the extra zeros in a zero-inflated model can result in biased estimates and inferences. Let δ_i control the probability of Y_i being a structural zero; δ_i can be an unconditional probability of Y_i being a structural zero controlled by an unobserved factor, say w_i. It can be shown that the unobserved factor w_i ∼ B(1, δ_i) for i = 1, 2, ..., n. Two situations result in Y_i = 0: the first is that Y_i is a structural zero with probability δ_i, and the other is that Y_i is not a structural zero but has probability 1 − p_i of being zero. The unconditional probability of Y_i = 0 can be presented by

P(Y_i = 0) = δ_i + (1 − δ_i)(1 − p_i) = 1 − π_i, (2)

where π_i = (1 − δ_i)p_i. The unconditional probability of Y_i = 1 can be presented by

P(Y_i = 1) = (1 − δ_i)p_i = π_i. (3)

Hence, the unconditional distribution of Y_i is Y_i ∼ B(1, π_i).
Using the second logistic function to link δ_i and the k × 1 covariate vector z_i,

δ_i = exp(θ^T z_i) / (1 + exp(θ^T z_i)), (4)

where θ^T = (θ_1, θ_2, ..., θ_k) is the vector of model parameters. After algebraic formulation, we can show that

π_i = (1 − δ_i)p_i = [1 / (1 + exp(θ^T z_i))] × [exp(β^T x_i) / (1 + exp(β^T x_i))]. (5)

Assume that the sample D = (y_1, ..., y_n, x_1, ..., x_n, z_1, ..., z_n) is available. The likelihood function can be constructed as follows:

L(θ, β) = ∏_{i=1}^{n} π_i^{y_i} (1 − π_i)^{1−y_i}. (6)

After algebraic formulation, we can obtain the log-likelihood equation

ℓ(θ, β) = Σ_{i=1}^{n} [y_i log(π_i) + (1 − y_i) log(1 − π_i)]. (7)

Directly maximizing the log-likelihood function in Equation (7) to obtain reliable maximum likelihood estimators can be challenging. So, we suggest an EM algorithm to maximize the log-likelihood function.
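The unconditional success probability π_i combines the two logistic links above. A minimal Python sketch follows (the paper's own implementation is in R, given in Appendix A; the coefficient values used here are arbitrary illustrations):

```python
import math

def logistic(t):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def mixture_prob(x, beta, z, theta):
    """Unconditional success probability pi = (1 - delta) * p under the
    ZIBer mixture, with logistic links for both p and delta."""
    p = logistic(sum(b * v for b, v in zip(beta, x)))       # P(Y=1 | not structural zero)
    delta = logistic(sum(t * v for t, v in zip(theta, z)))  # P(structural zero)
    return (1.0 - delta) * p

# When delta = 0 (no structural zeros) the model reduces to ordinary
# logistic regression; illustrative values only:
pi = mixture_prob([1.0, 0.5], [0.2, -0.4], [1.0], [-1.0])
```

Because 1 − δ_i < 1, the mixture always shrinks the success probability below the plain logistic value p_i, which is how the model accommodates the excess zeros.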
Assume that an unobserved indicator w_i determines whether Y_i is a structural zero: w_i = 1 implies that Y_i = 0 is a structural zero, and w_i = 0 implies that the Bernoulli distribution determines whether the response Y_i is 1 or 0. The likelihood function based on the complete data D can be represented by

L_c(θ, β) = ∏_{i=1}^{n} δ_i^{w_i} [(1 − δ_i) p_i^{y_i} (1 − p_i)^{1−y_i}]^{1−w_i}. (8)

The log-likelihood function can be represented by

ℓ_c(θ, β) = Σ_{i=1}^{n} {w_i log(δ_i) + (1 − w_i) log(1 − δ_i)} + Σ_{i=1}^{n} (1 − w_i){y_i log(p_i) + (1 − y_i) log(1 − p_i)}, (9)

and the log-likelihood function can be obtained by

ℓ_c(θ, β) = ℓ_1(θ) + ℓ_2(β), (10)

where

ℓ_1(θ) = Σ_{i=1}^{n} [w_i log(δ_i) + (1 − w_i) log(1 − δ_i)] (11)

and

ℓ_2(β) = Σ_{i=1}^{n} (1 − w_i)[y_i log(p_i) + (1 − y_i) log(1 − p_i)]. (12)

Equations (11) and (12) are used to implement the following E-step and M-step until the estimates of the parameters θ and β converge.

E-step: At iteration (h + 1), estimate w_i by its posterior mean w_i^{(h+1)} given the estimates β^{(h)} and θ^{(h)}. If y_i = 1, then w_i^{(h+1)} = 0; if y_i = 0, then

w_i^{(h+1)} = P(A | Y_i = 0) = δ_i^{(h)} / [δ_i^{(h)} + (1 − δ_i^{(h)})(1 − p_i^{(h)})],

where A and A^c denote the events that Y_i has a structural zero and that Y_i is from a Bernoulli distribution, respectively.

M-step: Maximize ℓ_1(θ), with w_i replaced by w_i^{(h+1)}, to obtain θ^{(h+1)}, and maximize ℓ_2(β), with w_i replaced by w_i^{(h+1)}, to obtain β^{(h+1)}.
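The E-step and M-step above can be sketched in code. The following Python outline is a sketch only: the paper's actual implementation is in R (Appendix A), and `weighted_logit_fit` is a hypothetical stand-in that maximizes each weighted log-likelihood by plain gradient ascent rather than the authors' own M-step routine:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def weighted_logit_fit(X, y, w, n_iter=200, lr=0.5):
    """Weighted logistic fit by gradient ascent on the weighted
    Bernoulli log-likelihood (a simplified stand-in for the M-step)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = logistic(X @ beta)
        beta += lr * X.T @ (w * (y - mu)) / len(y)
    return beta

def ziber_em(X, Z, y, n_iter=50, tol=1e-5):
    """EM iterations for the ZIBer model: the E-step imputes the posterior
    probability w_i that y_i = 0 is a structural zero; the M-step refits
    the two logistic regressions with the imputed weights."""
    beta = np.zeros(X.shape[1])
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = logistic(X @ beta)
        delta = logistic(Z @ theta)
        # E-step: posterior mean of the structural-zero indicator.
        w = np.where(y == 1, 0.0,
                     delta / (delta + (1.0 - delta) * (1.0 - p)))
        # M-step: theta from a logistic fit of the fractional w on Z;
        # beta from a (1 - w)-weighted logistic fit of y on X.
        theta_new = weighted_logit_fit(Z, w, np.ones_like(w))
        beta_new = weighted_logit_fit(X, y, 1.0 - w)
        converged = max(np.max(np.abs(theta_new - theta)),
                        np.max(np.abs(beta_new - beta))) < tol
        beta, theta = beta_new, theta_new
        if converged:
            break
    return beta, theta
```

The two fits mirror the decomposition into ℓ_1(θ) and ℓ_2(β): θ only enters through the imputed weights w_i, and β only through the (1 − w_i)-weighted Bernoulli terms.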

Example 1
A diabetes data set is used for illustration. The data set can be obtained from the R package "mlbench" (version 2.1-3.1). There are 768 records with eight independent variables, including

• The number of pregnancies.
• The glucose concentration in the 2-hour oral glucose tolerance test.
• The diastolic blood pressure in mm Hg.
• The triceps skin-fold thickness in millimeters.

The response variable is Diabetes, which can be negative or positive. After removing incomplete records, we have 392 records as the final sample for illustration. We label all the independent variables x_i, i = 1, 2, ..., 8. Moreover, the variables "The diastolic blood pressure in mm Hg", "Body mass", and "Age" are selected as the independent variables, labeled z_h, h = 1, 2, 3, to develop the second logistic regression model. The diabetes rate is 0.332 in this example.
All independent variables are scaled for i = 1, 2, ..., n. Using the proposed EM algorithm, we can obtain the fitted ZIBer models. When the estimated probability p_i ≥ 0.53, i = 1, 2, ..., 392, the corresponding person is identified as having diabetes. Based on the obtained model, the accuracy is 0.806. We also want to study how much efficiency is lost by using the typical logistic regression model. Based on the typical logistic regression model, we obtain the fitted estimation function. When the estimated probability p_i ≥ 0.39, i = 1, 2, ..., 392, the corresponding person is identified as having diabetes. Based on the obtained model, the accuracy is 0.798. Replacing the typical logistic regression model with the ZIBer regression model improves the accuracy, but the improvement is not significant, with only a 0.008 increment in the Accuracy. The diabetes rate is 0.332. If the success rate is high, the improvement from replacing the typical logistic regression model with the ZIBer regression model may not be significant. This finding makes sense because the imbalance in this example is insignificant. The R codes using the proposed EM algorithm to obtain the MLEs of the ZIBer regression model parameters are given in Appendix A. We use the best cut rate of 0.53 to predict the value of Diabetes, identifying it as positive if p_i ≥ 0.53 and negative otherwise. We obtain the Accuracy, Sensitivity, and Specificity as 0.806, 0.608, and 0.905, respectively.
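The best cut rate (0.53 here, 0.39 for the plain logistic fit) is found by scanning candidate thresholds for the lowest misclassification rate. A minimal sketch follows; `best_cut_point` is a hypothetical helper, not the authors' R routine, and the grid of thresholds is an assumption:

```python
def best_cut_point(probs, labels, grid=None):
    """Scan candidate thresholds and return the one achieving the
    lowest misclassification rate on the fitted probabilities."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
    best_c, best_err = grid[0], float("inf")
    for c in grid:
        # Count errors: predict positive when the probability is >= c.
        errs = sum((p >= c) != bool(y) for p, y in zip(probs, labels))
        if errs < best_err:
            best_c, best_err = c, errs
    return best_c, best_err / len(labels)
```

On the training data this returns the threshold used to classify new observations, e.g. a probability of 0.60 would be labeled positive under the 0.53 cut and negative under a 0.65 cut.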

Example 2
A Taiwan credit data set is used for illustration. This data set can be obtained from the UC Irvine Machine Learning Repository via the hyperlink https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients; it was donated to the UC Irvine Machine Learning Repository on 25 January 2016. The response variable is the default payment, Yes = 1 for default and Yes = 0 for non-default. There are 30,000 records with the following 23 explanatory variables:

• The Given Credit Amount in NT dollars, including the individual consumer's and his or her family (supplementary) credits.
• Gender, 1 for male and 2 for female.
• Education, 1 for graduate school, 2 for university, 3 for high school, and 4 for other.
• Marital Status, 1 for married, 2 for single, and 3 for other.
• Age in years.

• The history of past payments in 2005 from September to April in descending order of the columns. The measurement scale is −1 for pay duly, 1 for a one-month payment delay, 2 for a two-month payment delay, ..., 8 for an eight-month payment delay, and 9 for a nine-month payment delay and above.

• The amount of bill statements in 2005 in NT dollars from September to April in descending order of the columns.

• The amount of previous payments in 2005 in NT dollars from September to April in descending order of the columns.
The response variable of Default Payment is labeled Y. We use the columns of the Given Credit Amount in NT dollars, Gender, Education, Marital Status, Age, and the payment statuses of April, May, June, July, and August as the independent variables, labeled X_1, X_2, ..., X_10, respectively, to develop the ZIBer regression model. In many instances, we need to determine the independent variables Z's subjectively to establish the second logistic regression model. In this example, we select the columns of Gender, Education, Marital Status, and Age as the independent variables Z's for the second logistic regression model.
No missing records were found in this data set. First, we searched for the rows with labels that were not well defined in the independent variables and removed all such records from the data set. Finally, 4030 records are used in the illustrative data set for constructing the ZIBer regression model. Before model construction, we transform each categorical variable with three or more categories into dummy variables. Hence, the Marital Status needs to be transformed into two dummy variables. For X_4 = 1, 2, and 3, the new dummy variables M1 and M2 are obtained by (M_1, M_2) = (1, 0), (0, 1), and (1, 1), respectively. Let Z_5 = X_5. All independent variables are scaled. Using the proposed EM algorithm, we can obtain the fitted ZIBer models for i = 1, 2, ..., 4030.
When the estimated probability p_i ≥ 0.41, i = 1, 2, ..., 4030, the corresponding customer is identified as a default. Based on the obtained model, the Accuracy is 0.911.
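The dummy-variable mapping for Marital Status described above can be written as a small lookup table. This mirrors the stated mapping exactly; `marital_dummies` is a hypothetical helper name:

```python
def marital_dummies(x4):
    """Map Marital Status X4 in {1, 2, 3} to the two dummy variables
    (M1, M2) used in the example: 1 -> (1, 0), 2 -> (0, 1), 3 -> (1, 1)."""
    return {1: (1, 0), 2: (0, 1), 3: (1, 1)}[x4]
```

Note that the mapping follows the source text: the third category is encoded (1, 1) rather than the more common reference coding (0, 0).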

Monte Carlo Simulations
In this section, we study the performance of the proposed ZIBer regression method and compare it with that of the ANN and LightGBM methods. To help readers gain a comprehensive understanding of the Monte Carlo simulation design, the design is explained in Section 4.1, and the implementation of the performance comparison among models can be found in Section 4.2.

The Simulation Design
Before implementing Monte Carlo simulations, we need to generate imbalanced data; hence, we need a baseline model for a fair performance comparison. First, we generate zero-inflated samples. Assume that w_1, w_2, ..., w_n is a random sample from the uniform distribution over (0, 1). If w_i ≤ δ_i, let Y_i = 0; otherwise, Y_i follows a Bernoulli distribution with probability p_i, so that unconditionally Y_i ∼ B(1, π_i) with π_i = (1 − δ_i)p_i for i = 1, 2, ..., n. In the simulation studies, we use three covariates for the first logistic model: x_i1 = 1, x_i2 generated from the standard normal distribution, and x_i3 generated from the Bernoulli distribution with probability 0.75. We can obtain the value of p_i from the equation

p_i = exp(β_1 x_i1 + β_2 x_i2 + β_3 x_i3) / (1 + exp(β_1 x_i1 + β_2 x_i2 + β_3 x_i3)).

Using two z's for the second logistic model, let z_i1 = 1 and generate z_i2 from the Weibull distribution with shape parameter 3.6 and scale parameter 1 for i = 1, 2, ..., n.
We can obtain the value of δ_i from the equation

δ_i = exp(θ_1 z_i1 + θ_2 z_i2) / (1 + exp(θ_1 z_i1 + θ_2 z_i2)).

Then, we can generate Y_i from a Bernoulli distribution with probability π_i = (1 − δ_i)p_i. Repeat the above process n times to obtain a zero-inflated random sample D.
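The generation scheme above can be sketched in Python. The covariate distributions follow the stated design; the β and θ coefficient values passed in are illustrative assumptions, since the source does not list them here:

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def generate_ziber_sample(n, beta, theta, seed=1):
    """Generate a zero-inflated Bernoulli sample following the design:
    x1 = 1, x2 ~ N(0, 1), x3 ~ Bernoulli(0.75) for the first logistic
    model; z1 = 1, z2 ~ Weibull(shape 3.6, scale 1) for the second."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (1.0, rng.gauss(0.0, 1.0), 1.0 if rng.random() < 0.75 else 0.0)
        z = (1.0, rng.weibullvariate(1.0, 3.6))  # (scale, shape)
        p = logistic(sum(b * v for b, v in zip(beta, x)))
        delta = logistic(sum(t * v for t, v in zip(theta, z)))
        pi = (1.0 - delta) * p               # unconditional P(Y = 1)
        y = 1 if rng.random() < pi else 0
        data.append((x, z, y))
    return data
```

Drawing Y_i directly from B(1, π_i) is equivalent to the two-step mechanism (structural zero with probability δ_i, then a Bernoulli(p_i) draw otherwise).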

The Implementation of Performance Comparison
To guarantee that the sample size is large enough for the machine learning methods to perform stably, we consider n = 5000 in this simulation study. The ZIBer algorithm reaches the stopping rule when 50 iterative runs are completed or all the parameter differences between two adjacent iterations are less than 10^−5. For comparison purposes, we use 80%, or 4000, of the generated data to train the LightGBM and ANN models and to obtain the MLEs of the ZIBer model parameters. R codes were prepared to implement the prediction performance comparison. When the MLEs of the model parameters are obtained, the model globally searches for the best cut point with the lowest error classification rate as the trained ZIBer model. Because the separation into training and testing groups depends on random chance, we repeated the whole process 1000 times and calculated the average performance of Accuracy = (TP + TN)/(TP + FN + FP + TN), Sensitivity = TP/(TP + FN), and Specificity = TN/(TN + FP), where TP denotes true positive, FN denotes false negative, FP denotes false positive, and TN denotes true negative. The simulation results are given in Table 1. The dispersion of the metrics Accuracy, Sensitivity, and Specificity over the 1000 repetitions is given in Figures 1-3. Table 1 shows that no method dominates the others. The three methods have similar performance for Accuracy and Specificity; the differences between them are almost negligible. However, we found that the ANN method has extreme results for Sensitivity and Specificity. The average Sensitivity based on 1000 repetitions is nearly 0, and the average Specificity based on 1000 repetitions is 1. This finding means the ANN method may not be suitable for cases emphasizing the Sensitivity measure. The Sensitivity in zero-inflated, imbalanced data is not easy to measure correctly. The LightGBM and ZIBer methods outperform the ANN method in the Sensitivity and Specificity measures. We also find that the ZIBer and LightGBM methods are competitive.
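The three metrics defined from the confusion counts can be computed as follows. This is a sketch of the standard definitions used in the comparison, not the authors' R code:

```python
def confusion_metrics(preds, labels):
    """Accuracy, Sensitivity, Specificity from the confusion counts
    TP, FN, FP, TN for 0/1 predictions and labels."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity
```

The extreme ANN behavior reported above corresponds to a classifier that predicts the majority class almost everywhere: TP ≈ 0 drives Sensitivity toward 0 while FP ≈ 0 pushes Specificity toward 1, even though Accuracy stays high.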
From Figures 1-3, we can see the spread of the 1000 measures of Accuracy, Sensitivity, and Specificity, respectively. Figure 1 indicates that the Accuracy of LightGBM has a higher median than the other two competitors, with a smaller dispersion than the ZIBer method. Figure 2 shows the strength of the ZIBer method, with a higher median than the other two competitors and a dispersion similar to that of the LightGBM method. Figure 3 shows that the ZIBer method has the lowest median and the widest spread compared with the other two competitors.
Let us revisit Examples 1 and 2. These two data sets can be considered zero-inflated because the proportions of non-diabetes and non-default payments are 66.8% and 64.4%, respectively. Because the simulation design used is the same and the comparison results are similar, only Example 2 is kept for explanation. In practical applications, some non-default credit card users have an extremely low default probability. They could have a different model structure from users who could default but have not defaulted in the current state. The performance of the ANN, LightGBM, and ZIBer models is compared using model validation. The illustrative data set is randomly split into a training data set with 80% of the observations, and the remaining 20% of the observations comprise the testing data set. Meanwhile, the random sampling maintains the same proportion of default payments in the training data set as in the illustrative data set. The training data set is used to train the ANN and LightGBM models and to construct the ZIBer model. Then, the testing data set is used to evaluate the Accuracy, Sensitivity, and Specificity. The split operation is repeated 300 times. Accuracy, Sensitivity, and Specificity are evaluated using their mean values over the 300 repetitions. Moreover, the standard deviations for these three models over the 300 repetitions are computed, too. All computation results are given in Table 2, which shows that these three models are competitive. No one model dominates the others. The LightGBM model has the highest Accuracy and Specificity. The ZIBer model has the highest Sensitivity. We also note that the ZIBer and LightGBM models are competitive in the Accuracy measurement. The mean Sensitivity of the ANN is significantly lower than that of the other two competitors and has the largest standard deviation. We also find that the standard deviation of the Sensitivity of LightGBM is over two times higher than that of the ZIBer model. The ZIBer method is a statistical model-based method. Hence, the ZIBer method does not require big data to train models. When the sample size is small, the ZIBer method can be expected to perform best. The ZIBer and LightGBM methods are more competitive than the ANN, and we recommend using them to process zero-inflated, imbalanced data.
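The proportion-preserving 80/20 split described above can be sketched as stratified random sampling: sample 80% within each class so the training set keeps the default-payment rate of the full data set. `stratified_split` is a hypothetical helper, not the authors' R code:

```python
import random

def stratified_split(labels, train_frac=0.8, seed=7):
    """Return (train_idx, test_idx) such that the proportion of positive
    labels in the training set matches that of the full data set."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    k_pos = int(round(train_frac * len(pos)))
    k_neg = int(round(train_frac * len(neg)))
    train_idx = pos[:k_pos] + neg[:k_neg]
    test_idx = pos[k_pos:] + neg[k_neg:]
    return train_idx, test_idx
```

Repeating this split with different seeds and averaging the resulting Accuracy, Sensitivity, and Specificity reproduces the kind of 300-repetition comparison reported in Table 2.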

Conclusions
This paper proposes the EM maximum likelihood estimation procedure to obtain reliable MLEs of the ZIBer model parameters. The EM algorithm is simple for practical use. The machine learning models of ANN and LightGBM are used to compare the performance of the proposed ZIBer model in terms of Accuracy, Sensitivity, and Specificity. Monte Carlo simulations were conducted, and the simulation results show that no method can dominate the other two methods under zero-inflated, imbalanced data.
Two examples of diabetes and credit card defaults are used for illustration. We demonstrate how to use the proposed EM algorithm to obtain reliable parameter estimates for the ZIBer regression model. Moreover, we use a random sampling technique for Example 2 to study the performance of the ANN, LightGBM, and ZIBer models under zero-inflated, imbalanced binary data sets in Section 4. We find that the LightGBM and ZIBer models are competitive. The LightGBM has the highest Accuracy and Specificity, and the ZIBer model has the highest Sensitivity. The ANN model performs worse in the measure of Sensitivity than the two competitors. Overall, the ZIBer model has a more stable performance than its competitors.
The strength of a parametric statistical model is that its functional form is clear, and the relationship between the response and explanatory variables is interpretable. However, screening out critical explanatory variables from a large set in the model is challenging. Enhancing the model's abilities to achieve a higher Sensitivity for the ZIBer model is also challenging. These two issues could be studied in the future.

Figure 1 .
Figure 1. The box plots of Accuracy for the ANN, LightGBM, and ZIBer models.

Figure 2 .
Figure 2. The box plots of Sensitivity for the ANN, LightGBM, and ZIBer models.

Figure 3 .
Figure 3. The box plots of Specificity for the ANN, LightGBM, and ZIBer models.

Figures 4 and 5 show the scatter plots of the 300 Sensitivity measurements versus the Accuracy measurements and the 300 Sensitivity measurements versus the Specificity measurements for the ANN, LightGBM, and ZIBer models. These two scatter plots show that the Accuracy, Sensitivity, and Specificity measures of the ZIBer model spread at the top position among the three methods. In this example, the ZIBer model appears to have a more stable performance than its competitors.

Figure 4 .
Figure 4. The scatter plot of the Accuracy versus Sensitivity for the ANN, LightGBM, and ZIBer models.

Figure 5 .
Figure 5. The scatter plot of the Specificity versus Sensitivity for the ANN, LightGBM, and ZIBer models.

Table 2 .
Performance comparison among the ANN, LightGBM, and ZIBer models based on 300 split operations for the illustrative data set.