Abstract
Zero inflation and overfitting can reduce the accuracy of machine learning models for characterizing binary data sets. A zero-inflated Bernoulli (ZIBer) model can be the right model to characterize zero-inflated binary data sets. When the ZIBer model is used to characterize zero-inflated binary data sets, overcoming the overfitting problem is still an open question. To mitigate the overfitting problem of the ZIBer model, the negative log-likelihood function of the ZIBer model with the elastic net regularization rule as an overfitting penalty is proposed as the loss function. An estimation procedure to minimize the loss function is developed in this study using the gradient descent method (GDM) with a momentum term in the learning rate. The proposed estimation method has two advantages. First, it is a general method that simultaneously uses L1- and L2-norm penalty terms and includes the ridge and least absolute shrinkage and selection operator methods as special cases. Second, the momentum learning rate can accelerate the convergence of the GDM and enhance the computational efficiency of the proposed estimation procedure. The parameter selection strategy is studied, and the performance of the proposed method is evaluated using Monte Carlo simulations. A diabetes example is used as an illustration.
Keywords:
expectation-maximization algorithm; gradient descent method; learning rate; maximum likelihood estimation; zero-inflated model
MSC:
62-08; 62-11
1. Introduction
Zero-inflated models help characterize binary data sets with excess zero counts. Structural and sampling zeros can be found in a zero-inflated data set. When the response variable cannot take positive values due to inherent constraints or conditions, the zero is called a structural zero. In a medical setting, Diop et al. [] investigated the infection response related to some diseases, where Y = 1 indicates that the individual was infected; otherwise, Y = 0. Assume that an immunity mechanism controls whether an individual is infected. The individual is immune and labeled by Y = 0 if the individual cannot experience the outcome of interest. In such cases, Y = 0 is not a random zero and can be a structural zero. Ridout et al. [] have also commented on the distinction between structural and random zeros. In contrast to structural zeros, the zeros arising from random chance are sampling zeros. Using a mixture mechanism, the zero-inflated model can account for both structural and sampling zero sources and performs better than the traditional logistic, Poisson, and negative binomial regression models.
Occasionally, a highly disproportionate distribution of categories also complicates data analysis. Such data sets are referred to as imbalanced data. In binary classification, the size of one category significantly dominates the other. Binary data sets can be inherently imbalanced in real-world scenarios for various reasons.
The initial zero-inflated model was the zero-inflated Poisson (ZIP) model proposed by Lambert [], where a ZIP regression model was investigated via the maximum likelihood estimation method to make statistical inferences. Moreover, a data set about soldering defects on printed wiring boards was used to illustrate the proposed ZIP regression model. To overcome the drawback of over-dispersion in the Poisson model, the zero-inflated negative binomial (ZINB) model is another competitive model. To account for the impact of random effects on zero-inflated data sets, Hall [] conducted case studies for the ZIP and zero-inflated binomial (ZIB) regression models. Hall [] successfully showed that random effects help account for the correlation between observations within clusters. Cheung [] explored the theoretical foundations of zero-inflated models, illustrated their practical applications with examples, and analyzed the implications of using these models for precise inference in the context of growth and development studies. Gelfand and Citron-Pousty [] demonstrated the utilization of zero-inflated models for spatial count data in environmental and ecological statistics. Additionally, they investigated how these models address both spatial correlation and the frequent occurrence of excess zero counts in environmental and ecological applications. Rodrigues [] and Ghosh et al. [] studied the application of Bayesian techniques to estimate the parameters of zero-inflated models.
Harris and Zhao [] introduced a zero-inflated ordered probit model to process data sets with ordered response variables and excess zeros. They also utilized a tobacco consumption data set for modeling illustration. Loeys et al. [] considered the ZIP, ZINB, Poisson logit hurdle (PLH), and negative binomial logit hurdle (NBLH) models. Diop et al. [] investigated the maximum likelihood estimation method for the logistic regression model with a cure fraction. Staub and Winkelmann [] conducted inferences to study the consistency of the parameter estimation of zero-inflated count models. He et al. [] investigated the relevance of structural zeros to a zero-inflated model. Diop et al. [] conducted simulation-based inference for the zero-inflated binary regression model and proposed a maximum likelihood estimation procedure. Zuur and Ieno [] gave comprehensive guidance for users to apply zero-inflated models with R code.
2. Motivation and Organization
Zero inflation and overfitting can reduce the accuracy of machine learning models for characterizing binary data sets. A zero-inflated Bernoulli (ZIBer) model can be the right model to characterize zero-inflated binary data sets. To the best of our knowledge, few works have studied the ZIBer model for binary classification, and overcoming the overfitting problem of the ZIBer model is an open question. Recent works on applications of the ZIBer model can be summarized as follows: When covariates are missing at random in the ZIBer regression model, Lee et al. [] proposed a validation likelihood method to estimate the model parameters. Pho [] proposed a goodness-of-fit test for the ZIBer model besides revisiting the goodness-of-fit test of the logit model. Li and Lu [] proposed an implementation procedure and a consistent estimation method to obtain the variances of the regression parameter estimators. Then, Monte Carlo simulation methods were used to evaluate their proposed method. Pho [] proposed a new zero-inflated probit Bernoulli model and constructed the parameter estimation process. Lu et al. [] proposed a penalized estimation method for partially linear additive models with ZIBer outcome data. The B-spline method is employed to approximate unknown nonparametric components. Moreover, they proposed a two-stage iterative EM algorithm to obtain the penalized spline estimates. When the response variables in a zero-inflated data set follow a Bernoulli distribution, Chiang et al. [] suggested an expectation-maximization (EM) maximum likelihood estimation process for the ZIBer model to obtain reliable estimates of the model parameters. We denote the EM maximum likelihood estimation method proposed by Chiang et al. [] as the EM-ZIBer method.
The EM-ZIBer method of Chiang et al. [] can be an easy way to obtain more reliable maximum likelihood estimates (MLEs) of the ZIBer model than the typical maximum likelihood estimation method. However, the EM-ZIBer method loses feature selection efficiency when the zero-inflated data set is vast and has an imbalanced structure. This drawback reduces the feasibility of applying the EM-ZIBer method to extensive zero-inflated data sets. In this study, the gradient descent method (GDM) with a momentum term in the learning rate is used to minimize the loss function, which combines the negative log-likelihood function with the elastic net regularization rule as an overfitting penalty. The proposed estimation procedure has two advantages. First, the proposed estimation method simultaneously uses L1- and L2-norm penalty terms and includes the ridge and least absolute shrinkage and selection operator (lasso) methods as special cases. Hence, the proposed estimation procedure is a general method. Second, the momentum learning rate can accelerate the convergence of the GDM and enhance the computational efficiency of the proposed estimation procedure. For simplicity, we name the new parameter estimation method the ENR-ZIBer method. Because the lasso regularization uses the L1-norm penalty, the target function for optimization is neither smooth nor convex. Hence, the GDM replaces the Newton method for the loss function minimization in this study.
First, the steps to implement the proposed ENR-ZIBer method are analytically studied in Section 3.3. Second, to accelerate the convergence of the proposed ENR-ZIBer method, we suggest adding a momentum term to the learning rate when implementing the gradient descent method. The implementation of the momentum term in the ENR-ZIBer method is analytically studied in Section 3.4.
The rest of this article is organized as follows. The ZIBer model is addressed in Section 3. The maximum likelihood estimation is discussed in Section 3.2. Moreover, the proposed elastic net penalty regularization, which includes the L1- and L2-norm rules in the maximum likelihood estimation for the ZIBer model, is studied in Section 3.3. To accelerate the convergence, the use of the momentum term as the learning rate in the GDM for the elastic net regularization rule is examined in Section 3.4. In Section 4, the quality of the proposed ENR-ZIBer method is evaluated using Monte Carlo simulations. One example is used in Section 5 for illustration. Finally, some conclusions are given in Section 6.
3. The Statistical Model and Optimization
The ZIBer model will be established and the inference procedure of model parameters will be addressed in this section.
3.1. The Statistical Model
Regarding binary classification modeling, the response variable Y can be labeled as 0 or 1, where Y = 1 indicates an individual infected by a disease; otherwise, Y = 0. Let the probability of Y = 1 depend upon the covariate vector x = (1, x_1, x_2, ..., x_p)^T; that is, P(Y = 1 | x) = p(x), where p is a function of x. Therefore, Y | x ~ Bernoulli(p(x)), where Bernoulli(p(x)) denotes the Bernoulli distribution with the conditional success probability p(x). To construct the ZIBer model, the logit (or logistic regression) function is used as the link and defined by

$$\mathrm{logit}(p(\mathbf{x})) = \log\!\left(\frac{p(\mathbf{x})}{1 - p(\mathbf{x})}\right) = \mathbf{x}^{\top}\boldsymbol{\beta},$$

where β = (β_0, β_1, ..., β_p)^T is the vector of regression coefficients. It can be shown that

$$p(\mathbf{x}) = \frac{\exp(\mathbf{x}^{\top}\boldsymbol{\beta})}{1 + \exp(\mathbf{x}^{\top}\boldsymbol{\beta})}.$$
In a zero-inflated model, zero inflation occurs if the data-generating process results in too many zeros. Failure to account for the extra zeros could produce unreasonable estimates and inferences. To remove this drawback, two possible random sources of Y = 0 are considered here. The structural zero occurs with probability ω, and the non-structural (sampling) zero happens with probability (1 − ω)(1 − p(x)). Hence, the unconditional probability of Y = 0 can be obtained by

$$P(Y = 0 \mid \mathbf{x}) = \omega + (1 - \omega)\,(1 - p(\mathbf{x})).$$

It is easy to obtain the probability

$$P(Y = 1 \mid \mathbf{x}) = (1 - \omega)\, p(\mathbf{x}).$$
Assume that ω depends on the latent covariate vector z = (1, z_1, z_2, ..., z_m)^T. Using the logit model again by

$$\mathrm{logit}(\omega) = \log\!\left(\frac{\omega}{1 - \omega}\right) = \mathbf{z}^{\top}\boldsymbol{\theta},$$

where θ = (θ_0, θ_1, ..., θ_m)^T, it can be shown that

$$\omega = \frac{\exp(\mathbf{z}^{\top}\boldsymbol{\theta})}{1 + \exp(\mathbf{z}^{\top}\boldsymbol{\theta})}.$$
Given the covariate information (x_i, z_i), let π_i denote the probability of Y_i = 1. We can represent

$$\pi_i = P(Y_i = 1 \mid \mathbf{x}_i, \mathbf{z}_i) = (1 - \omega_i)\, p(\mathbf{x}_i) = \frac{\exp(\mathbf{x}_i^{\top}\boldsymbol{\beta})}{\left(1 + \exp(\mathbf{z}_i^{\top}\boldsymbol{\theta})\right)\left(1 + \exp(\mathbf{x}_i^{\top}\boldsymbol{\beta})\right)}.$$

The odds can be presented by

$$\frac{\pi_i}{1 - \pi_i} = \frac{\exp(\mathbf{x}_i^{\top}\boldsymbol{\beta})}{1 + \exp(\mathbf{z}_i^{\top}\boldsymbol{\theta}) + \exp(\mathbf{x}_i^{\top}\boldsymbol{\beta} + \mathbf{z}_i^{\top}\boldsymbol{\theta})}.$$

Therefore, the ZIBer model can be defined by the following: the response variable Y_i | x_i, z_i ~ Bernoulli(π_i), where π_i is the success probability defined above.
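To make the mixture structure concrete, the following R sketch computes p(x_i), ω_i, and π_i for one observation from given coefficient vectors; the function name ziber_prob and the numeric inputs are illustrative assumptions, while the probability formula mirrors the one used in the Appendix B code.

## Minimal sketch (illustrative): success probability of the ZIBer model for one
## observation; x and z include the leading 1 for the intercepts.
ziber_prob <- function(x, z, beta, theta) {
  eta.x <- sum(x * beta)                    # x^T beta
  eta.z <- sum(z * theta)                   # z^T theta
  p     <- exp(eta.x) / (1 + exp(eta.x))    # Bernoulli success probability p(x)
  omega <- exp(eta.z) / (1 + exp(eta.z))    # structural-zero probability
  (1 - omega) * p                           # pi = P(Y = 1 | x, z)
}
## Example with hypothetical coefficients
ziber_prob(x = c(1, 0.5, 1.2), z = c(1, 0.3),
           beta = c(-0.5, 1.0, 0.8), theta = c(-1.0, 0.6))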
3.2. Maximum Likelihood Estimation
Let π_i = P(Y_i = 1 | x_i, z_i) for i = 1, 2, ..., n. Given n independent responses, y_1, y_2, ..., y_n, of Y with the corresponding observed and latent covariates, (x_i, z_i), where x_i = (1, x_{i1}, ..., x_{ip})^T and z_i = (1, z_{i1}, ..., z_{im})^T for i = 1, 2, ..., n, the likelihood function can be presented by

$$L(\boldsymbol{\beta}, \boldsymbol{\theta}) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}.$$

The log-likelihood function of the ZIBer model can be presented by

$$\ell(\boldsymbol{\beta}, \boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log\!\left(\frac{\pi_i}{1 - \pi_i}\right) + \log(1 - \pi_i) \right].$$

The first derivatives of ℓ(β, θ) with respect to β and θ are

$$\frac{\partial \ell(\boldsymbol{\beta}, \boldsymbol{\theta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n} \left[ y_i + (1 - y_i)\,\frac{\exp(\mathbf{x}_i^{\top}\boldsymbol{\beta} + \mathbf{z}_i^{\top}\boldsymbol{\theta})}{1 + \exp(\mathbf{z}_i^{\top}\boldsymbol{\theta}) + \exp(\mathbf{x}_i^{\top}\boldsymbol{\beta} + \mathbf{z}_i^{\top}\boldsymbol{\theta})} - \frac{\exp(\mathbf{x}_i^{\top}\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i^{\top}\boldsymbol{\beta})} \right] \mathbf{x}_i$$

and

$$\frac{\partial \ell(\boldsymbol{\beta}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum_{i=1}^{n} \left[ (1 - y_i)\,\frac{\exp(\mathbf{z}_i^{\top}\boldsymbol{\theta}) + \exp(\mathbf{x}_i^{\top}\boldsymbol{\beta} + \mathbf{z}_i^{\top}\boldsymbol{\theta})}{1 + \exp(\mathbf{z}_i^{\top}\boldsymbol{\theta}) + \exp(\mathbf{x}_i^{\top}\boldsymbol{\beta} + \mathbf{z}_i^{\top}\boldsymbol{\theta})} - \frac{\exp(\mathbf{z}_i^{\top}\boldsymbol{\theta})}{1 + \exp(\mathbf{z}_i^{\top}\boldsymbol{\theta})} \right] \mathbf{z}_i,$$

respectively. The typical maximum likelihood estimates of β and θ can be obtained by

$$(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\theta}}) = \underset{(\boldsymbol{\beta}, \boldsymbol{\theta}) \in \boldsymbol{\Theta}}{\arg\max}\; \ell(\boldsymbol{\beta}, \boldsymbol{\theta}),$$

where Θ is the solution space of (β, θ).
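As a complement, the following hedged R sketch obtains the typical MLEs by numerically maximizing the log-likelihood with optim; the simulated data, the function name ziber_loglik, and the starting values are assumptions for illustration and are not the estimation routine used in the paper.

## Minimal sketch (illustrative): typical maximum likelihood estimation of the
## ZIBer model by direct numerical maximization of the log-likelihood.
ziber_loglik <- function(par, y, X, Z) {
  k1 <- ncol(X); beta <- par[1:k1]; theta <- par[-(1:k1)]
  b <- as.vector(X %*% beta); t <- as.vector(Z %*% theta)
  pi.i <- exp(b) / ((1 + exp(t)) * (1 + exp(b)))   # P(Y_i = 1 | x_i, z_i)
  sum(y * log(pi.i) + (1 - y) * log(1 - pi.i))
}
## Hypothetical data: X and Z contain a leading column of 1's.
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n)); Z <- cbind(1, rnorm(n))
pi.i <- exp(X %*% c(-0.5, 1)) / ((1 + exp(Z %*% c(-1, 0.5))) * (1 + exp(X %*% c(-0.5, 1))))
y <- rbinom(n, 1, pi.i)
fit <- optim(par = rep(0, ncol(X) + ncol(Z)), fn = ziber_loglik,
             y = y, X = X, Z = Z, method = "BFGS",
             control = list(fnscale = -1))          # fnscale = -1 maximizes
fit$par                                             # MLEs of (beta, theta)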
3.3. Optimization with Regularization
The L1-norm regularization uses the lasso rule to penalize model overfitting. The L1 norm is defined by ||β||_1 = Σ_{j=0}^{p} |β_j|. Assuming that we want to minimize −ℓ(β, θ) with the L1-norm penalty, the lasso target function can be modified as

$$\ell_{\mathrm{lasso}}(\boldsymbol{\beta}, \boldsymbol{\theta}) = -\ell(\boldsymbol{\beta}, \boldsymbol{\theta}) + \lambda_2 \lVert \boldsymbol{\beta} \rVert_1,$$

where λ2 ≥ 0 is a constant, which is used to control the regularization strength.
The L2-norm regularization uses the ridge rule to prevent model overfitting. The L2 norm is defined by ||β||_2 = (Σ_{j=0}^{p} β_j^2)^{1/2}. Without loss of generality, we use the term ||β||_2^2/2 to develop the ridge regularization for the zero-inflated binary model. Assuming that we want to minimize −ℓ(β, θ) with the L2-norm penalty, the ridge target function can be modified as

$$\ell_{\mathrm{ridge}}(\boldsymbol{\beta}, \boldsymbol{\theta}) = -\ell(\boldsymbol{\beta}, \boldsymbol{\theta}) + \frac{\lambda_1}{2} \lVert \boldsymbol{\beta} \rVert_2^2,$$

where λ1 ≥ 0 is a constant for controlling the regularization strength, and the constant 1/2 is used to reduce the computation loading.
Lasso regularization using the L1-norm can result in a parsimonious model. A large λ2 can help produce a sparse model by setting some coefficients to zero, reducing the number of effective parameters. However, the lasso regularization rule is sensitive to outliers because it takes the absolute value for the penalty.
The L2-norm penalty in ridge regularization discourages large coefficients. The L2-norm penalty pushes the model’s coefficients toward zero but never exactly to zero. Compared with the L1-norm regularization rule, the ridge regularization rule is less sensitive to outliers because it takes the square for the penalty.
It is a good idea to use a regularization method that combines the strengths of the lasso and ridge regularization rules. Elastic net regularization uses the penalties from the lasso and ridge techniques to regularize the target model. Taking advantage of the L1- and L2-norms, the elastic net regularization rule combines the lasso and ridge regularization rules and learns from their shortcomings to improve the regularization of statistical models. The loss function based on the log-likelihood function and elastic net regularization can be defined by

$$\ell_{\mathrm{EN}}(\boldsymbol{\beta}, \boldsymbol{\theta}) = -\ell(\boldsymbol{\beta}, \boldsymbol{\theta}) + \frac{\lambda_1}{2} \lVert \boldsymbol{\beta} \rVert_2^2 + \lambda_2 \lVert \boldsymbol{\beta} \rVert_1,$$

where λ1 ≥ 0 and λ2 ≥ 0 are constants to control the regularization strength. It is obvious that ℓ_EN(β, θ) reduces to ℓ_lasso(β, θ) if λ1 = 0 and reduces to ℓ_ridge(β, θ) if λ2 = 0. That is, when λ1 = 0, the elastic net regularization rule reduces to the lasso regularization rule, and when λ2 = 0, it reduces to the ridge regularization rule. Therefore, the elastic net regularization rule covers the lasso and ridge regularization rules as special cases and is a generalized rule for practical applications.
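A compact R sketch of the elastic-net loss defined above follows; the function name enr_loss is an assumption for illustration, and the like.f function in Appendix B monitors convergence with a closely related quantity.

## Minimal sketch (illustrative): elastic-net penalized loss for the ZIBer model,
## combining the negative log-likelihood with the L2 (ridge) and L1 (lasso) penalties.
enr_loss <- function(beta, theta, y, X, Z, lam1, lam2) {
  b <- as.vector(X %*% beta); t <- as.vector(Z %*% theta)
  pi.i <- exp(b) / ((1 + exp(t)) * (1 + exp(b)))
  negloglik <- -sum(y * log(pi.i) + (1 - y) * log(1 - pi.i))
  negloglik + (lam1 / 2) * sum(beta^2) + lam2 * sum(abs(beta))
}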
The first derivative of ℓ_EN(β, θ) with respect to β can be obtained by

$$\frac{\partial \ell_{\mathrm{EN}}(\boldsymbol{\beta}, \boldsymbol{\theta})}{\partial \boldsymbol{\beta}} = -\frac{\partial \ell(\boldsymbol{\beta}, \boldsymbol{\theta})}{\partial \boldsymbol{\beta}} + \lambda_1 \boldsymbol{\beta} + \lambda_2\, \mathrm{sgn}(\boldsymbol{\beta}),$$

where sgn(β) = (sgn(β_0), sgn(β_1), ..., sgn(β_p))^T. If β_j > 0, then sgn(β_j) = 1; otherwise, sgn(β_j) = −1. The first derivative of ℓ_EN(β, θ) with respect to θ can be obtained by

$$\frac{\partial \ell_{\mathrm{EN}}(\boldsymbol{\beta}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -\frac{\partial \ell(\boldsymbol{\beta}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}.$$
The gradient descent method is used to obtain the estimates of β and θ. Let β̂^(t) and θ̂^(t) denote the solutions of β and θ at the t-th iteration. Let

$$\mathbf{g}_{\boldsymbol{\beta}}^{(t)} = \left.\frac{\partial \ell_{\mathrm{EN}}(\boldsymbol{\beta}, \boldsymbol{\theta})}{\partial \boldsymbol{\beta}}\right|_{(\boldsymbol{\beta}, \boldsymbol{\theta}) = (\hat{\boldsymbol{\beta}}^{(t)}, \hat{\boldsymbol{\theta}}^{(t)})}$$

and

$$\mathbf{g}_{\boldsymbol{\theta}}^{(t)} = \left.\frac{\partial \ell_{\mathrm{EN}}(\boldsymbol{\beta}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{(\boldsymbol{\beta}, \boldsymbol{\theta}) = (\hat{\boldsymbol{\beta}}^{(t)}, \hat{\boldsymbol{\theta}}^{(t)})}.$$

The values of β̂ and θ̂ at the (t + 1)-th iteration can be updated by

$$\hat{\boldsymbol{\beta}}^{(t+1)} = \hat{\boldsymbol{\beta}}^{(t)} - \gamma_1\, \mathbf{g}_{\boldsymbol{\beta}}^{(t)}$$

and

$$\hat{\boldsymbol{\theta}}^{(t+1)} = \hat{\boldsymbol{\theta}}^{(t)} - \gamma_2\, \mathbf{g}_{\boldsymbol{\theta}}^{(t)},$$

respectively, for t = 0, 1, 2, ..., where γ1 and γ2 are learning rates. The update process stops if the condition

$$\left| \ell_{\mathrm{EN}}(\hat{\boldsymbol{\beta}}^{(t+1)}, \hat{\boldsymbol{\theta}}^{(t+1)}) - \ell_{\mathrm{EN}}(\hat{\boldsymbol{\beta}}^{(t)}, \hat{\boldsymbol{\theta}}^{(t)}) \right| < \epsilon$$

is satisfied, where ε is a small positive number. Then, β̂^(t+1) and θ̂^(t+1) are the required solutions of β and θ, respectively.
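The update scheme above can be sketched in R as follows, assuming a gradient function grad_fn that returns the two derivatives of ℓ_EN (for example, the dbe_the function in Appendix B) and a loss function loss_fn such as enr_loss; all names and the tolerance are illustrative.

## Minimal sketch (illustrative): constant-learning-rate gradient descent for the
## elastic-net penalized ZIBer loss.
gdm_fit <- function(beta, theta, grad_fn, loss_fn, gam1, gam2,
                    eps = 1e-3, max_iter = 500) {
  loss.old <- loss_fn(beta, theta)
  for (t in seq_len(max_iter)) {
    g     <- grad_fn(beta, theta)               # list(dell.be, dell.the)
    beta  <- beta  - gam1 * g$dell.be           # beta update
    theta <- theta - gam2 * g$dell.the          # theta update
    loss.new <- loss_fn(beta, theta)
    if (abs(loss.new - loss.old) < eps) break   # stopping rule
    loss.old <- loss.new
  }
  list(beta = beta, theta = theta, loss = loss.new, iter = t)
}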
3.4. Convergence Acceleration for the Gradient Descent Method
Constant learning rates have been commonly used to implement the gradient descent method. However, implementing a gradient descent method with a constant learning rate for optimization could take more computation time to reach convergence. Carrying the one-step-ahead estimation results forward to adjust the next update can be a good way to accelerate the convergence of the gradient descent method. In this study, the momentum learning rate method is adopted to accelerate the convergence speed of the gradient descent method. More comprehensive discussions on using the momentum term in the learning rates can be found in Yu and Chen [], Attoh-Okine [], Qian [], Wang et al. [], Liu et al. [], Friedman et al. [], and Simon et al. [].
At the initial step of iteration t = 0, let

$$\boldsymbol{\nu}_1^{(0)} = -\gamma_1\, \mathbf{g}_{\boldsymbol{\beta}}^{(0)}$$

and

$$\boldsymbol{\nu}_2^{(0)} = -\gamma_2\, \mathbf{g}_{\boldsymbol{\theta}}^{(0)},$$

where (β̂^(0), θ̂^(0)) is the initial solution vector of (β, θ), and ν_1^(0) and ν_2^(0) are the initial values of ν_1 and ν_2, respectively. For iteration t ≥ 1, the values of ν_1^(t) and ν_2^(t) are updated by

$$\boldsymbol{\nu}_1^{(t)} = m\, \boldsymbol{\nu}_1^{(t-1)} - \gamma_1\, \mathbf{g}_{\boldsymbol{\beta}}^{(t)}$$

and

$$\boldsymbol{\nu}_2^{(t)} = m\, \boldsymbol{\nu}_2^{(t-1)} - \gamma_2\, \mathbf{g}_{\boldsymbol{\theta}}^{(t)},$$

where m is the momentum term, a constant in [0, 1). Various existing studies have verified that an appropriate value of the momentum term can be m = 0.9; see Liu et al. []. Then, β̂ and θ̂ can be updated at iteration t by

$$\hat{\boldsymbol{\beta}}^{(t+1)} = \hat{\boldsymbol{\beta}}^{(t)} + \boldsymbol{\nu}_1^{(t)}$$

and

$$\hat{\boldsymbol{\theta}}^{(t+1)} = \hat{\boldsymbol{\theta}}^{(t)} + \boldsymbol{\nu}_2^{(t)},$$

respectively.
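A short R sketch of these momentum updates, mirroring the grad.des function in Appendix B, is given below; the value m = 0.9 follows the recommendation above, and the remaining names are illustrative.

## Minimal sketch (illustrative): gradient descent with a momentum term; nu1 and
## nu2 accumulate the update velocities for beta and theta, respectively.
gdm_momentum <- function(beta, theta, grad_fn, loss_fn, gam1, gam2,
                         m = 0.9, eps = 1e-3, max_iter = 500) {
  nu1 <- 0; nu2 <- 0
  loss.old <- loss_fn(beta, theta)
  for (t in seq_len(max_iter)) {
    g   <- grad_fn(beta, theta)
    nu1 <- m * nu1 - gam1 * g$dell.be      # velocity update for beta
    nu2 <- m * nu2 - gam2 * g$dell.the     # velocity update for theta
    beta  <- beta  + nu1
    theta <- theta + nu2
    loss.new <- loss_fn(beta, theta)
    if (abs(loss.new - loss.old) < eps) break
    loss.old <- loss.new
  }
  list(beta = beta, theta = theta, iter = t)
}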
The optimization to minimize ℓ_EN(β, θ) depends on the values of λ1 and λ2. We use the same setting as the R package glmnet. Let λ1 = λ(1 − α)/2 and λ2 = λα, where λ ≥ 0 and 0 ≤ α ≤ 1; see [,]. We can see that ℓ_EN(β, θ) reduces to ℓ_ridge(β, θ) if α = 0 and reduces to ℓ_lasso(β, θ) if α = 1. In this study, we suggest setting α ∈ {0, 0.5, 1} to cover the ridge, elastic net, and lasso rules, respectively, and then searching the value of λ over the interval [0, 1] for the optimal values of λ1 and λ2. We will study the performance of the ENR-ZIBer method in Section 4 using the Monte Carlo simulation method.
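The mapping from (λ, α) to (λ1, λ2) and the candidate penalty grid can be generated as in the following R sketch, which is a simplified variant of the penCoef construction in Appendix B; the seed and grid size are illustrative.

## Minimal sketch (illustrative): candidate (lam1, lam2) pairs from randomly drawn
## lambda values with alpha = 0 (ridge), 0.5 (elastic net), and 1 (lasso).
set.seed(123)
k   <- 150
lam <- sample(seq(0, 1, 0.001), k)
alp <- rep(c(0, 0.5, 1), length.out = k)
penCoef <- cbind(lam1 = lam * (1 - alp) / 2, lam2 = lam * alp)
head(penCoef)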
4. Monte Carlo Simulations
To assess the quality of the proposed ENR-ZIBer estimation method, we generated zero-inflated data sets from the ZIBer model. Let x_i = (1, x_{i1}, x_{i2}, ..., x_{ip})^T for i = 1, 2, ..., n, and β = (β_0, β_1, ..., β_p)^T. The logit model based on x_i and β is defined by

$$\mathrm{logit}(p(\mathbf{x}_i)) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \quad i = 1, 2, \ldots, n,$$

where the coefficients β_0, β_1, ..., β_p are fixed at the values specified in the simulation design, x_{ij} ~ N(μ_j, σ_j^2) for j = 1, 2, ..., p, and N(μ, σ^2) denotes the normal distribution with mean μ and variance σ^2. Let z_i = (1, z_{i1}, z_{i2}, ..., z_{im})^T and θ = (θ_0, θ_1, ..., θ_m)^T. The logit model based on z_i and θ is defined by

$$\mathrm{logit}(\omega_i) = \theta_0 + \theta_1 z_{i1} + \cdots + \theta_m z_{im}, \quad i = 1, 2, \ldots, n,$$

where the coefficients θ_0, θ_1, ..., θ_m are fixed at the values specified in the simulation design.
The learning rates γ1 and γ2 are used for the gradient descent method. Before generating y_i from Bernoulli(π_i), the covariates x_{ij} and z_{ij} are standardized for each j, respectively. Moreover, the coefficient vectors β and θ are also normalized by their norms, respectively. In addition, the momentum term m = 0.9 is used to update the values of ν_1 and ν_2 at each iteration run to accelerate the convergence speed of the numerical computation.
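For concreteness, the following R sketch shows how one zero-inflated binary data set can be generated from the ZIBer model; the coefficient values and covariate distributions below are hypothetical placeholders and not the values used in the simulation study.

## Minimal sketch (illustrative): generate one data set from the ZIBer model.
gen_ziber <- function(n, beta, theta) {
  p <- length(beta) - 1; m <- length(theta) - 1
  X <- cbind(1, matrix(rnorm(n * p), n, p))   # observed covariates
  Z <- cbind(1, matrix(rnorm(n * m), n, m))   # latent covariates
  p.x   <- plogis(X %*% beta)                 # p(x_i)
  omega <- plogis(Z %*% theta)                # structural-zero probability
  y <- rbinom(n, 1, (1 - omega) * p.x)        # pi_i = (1 - omega_i) p(x_i)
  list(y = y, X = X, Z = Z)
}
dat <- gen_ziber(n = 300, beta = c(-0.5, 1, -1, 0.5), theta = c(-1, 0.8))
mean(dat$y == 0)                              # proportion of zeros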
To make the proposed ENR-ZIBer method a general model that can cover the results of the lasso and ridge regularization penalty models, k values of λ are first generated, in which k/3 values of λ are used for the lasso (α = 1), ridge (α = 0), and elastic net (α = 0.5) penalty models, respectively. Hence, the lasso, ridge, and elastic net regularization penalty models are included in the pool to compete for the best model. The cross-validation rule is used in this study to make the competition fair. Seventy percent of the sample was used as the training sample for establishing the target model. Then, the other thirty percent of the sample was used as the testing sample for the error rate evaluation. The best model, with the lowest error rate on the testing sample, can be screened out from the model competition.
First, we determine how many repetitions are required to obtain a reliable error rate for the convergence of the mean error rate (MER), based on a simulation study with 5000 repetitions. The obtained MERs are reported in Figure A1. From Figure A1, we find that the MER converges at 1000 or more repetitions for the sample sizes 300 and 500. Hence, 1000 repetitions should be enough for the performance comparison of competitive models.
The ZIBer model is a weak classifier because it uses an analytical nonlinear function for classification. It is expected that the accuracy rate of the ZIBer model can be enhanced by using regularization rules to penalize model overfitting. The GDM works stably for large samples, whereas training a weak learner does not need a huge sample; however, the cross-validation rule requires splitting the original sample into training and testing samples. To balance these requirements, at least 150 cases and up to 500 cases are considered for implementing the proposed method in the simulations.
Second, the performance of the proposed ENR-ZIBer method is compared with the EM-ZIBer method and the logit model for the sample sizes n = 150, 300, and 500. Seventy percent of the observations in the sample were used as the training sample to establish the models, and the other 30% of the observations were used as the testing sample for evaluation. The MERs of the EM-ZIBer method, the logit model, and the proposed ENR-ZIBer method were evaluated based on the testing sample using the models established with the training sample. Then, the model performance is evaluated based on 1000 repetitions. All simulation results are reported in Table 1.
Table 1.
The MERs of the ENR-ZIBer, EM-ZIBer, and logistic regression methods for k = 90 and k = 150 in 1000 repetitions.
Table 1 shows that the MER is stable under each sample size for k = 90 and 150. The best ENR-ZIBer model is selected from 30 (for k = 90) and 50 (for k = 150) lasso, ridge, and elastic net penalty regularization models, respectively. Because the cross-validation rule is used, the model is established based on the training sample, which contains 70% of the observations, and the error rate is evaluated based on the testing sample, which is composed of the other 30% of the observations. Hence, the sample size cannot be small; a small testing sample will make the MER estimation unstable. We also find that the proposed ENR-ZIBer method generates a stable MER and a smaller standard deviation if k = 150. This finding will be used for the data analysis in Section 5.
The logit model in Table 1 has the largest MER compared with its competitors, the ENR-ZIBer and EM-ZIBer methods. The proposed ENR-ZIBer method with k = 150 can well characterize zero-inflated data and outperforms the EM-ZIBer method and the logit model regarding the MER and its standard deviation in most cells based on 1000 repetitions. Hence, k = 150 is recommended for the implementation of the ENR-ZIBer method. The ENR-ZIBer method with k = 150 includes 50 lasso (α = 1, i.e., λ1 = 0), 50 ridge (α = 0, i.e., λ2 = 0), and 50 elastic net (α = 0.5, with λ1 > 0 and λ2 > 0) penalty regularization sets, each with 50 randomly generated values of λ over [0, 1], in which λ1 = λ(1 − α)/2 and λ2 = λα, to compete for the best model.
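The model competition described above can be sketched as follows; enr_fit stands for any routine that returns fitted coefficients for a given (λ1, λ2) pair (for example, the grad.des function in Appendix B), and test_error stands for the testing-sample error rate. Both function arguments are assumptions for illustration.

## Minimal sketch (illustrative): 70/30 split and selection of the penalty pair
## with the lowest testing error among the lasso, ridge, and elastic net candidates.
select_enr <- function(y, X, Z, penCoef, enr_fit, test_error, train_frac = 0.7) {
  n   <- length(y)
  idx <- sample(n, floor(train_frac * n))              # training indices
  err <- apply(penCoef, 1, function(pc) {
    fit <- enr_fit(y[idx], X[idx, ], Z[idx, ], lam1 = pc[1], lam2 = pc[2])
    test_error(fit, y[-idx], X[-idx, ], Z[-idx, ])     # error on the 30% test set
  })
  list(best = penCoef[which.min(err), ], err = min(err))
}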
5. An Example
In this section, the proposed ENR-ZIBer method is illustrated by a diabetes data set. A total of 768 cases with eight covariate variables are included in this data set. The purpose is to predict whether each case will have a positive or negative response to diabetes. Hence, the response variable is Diabetes, labeled by y, and can be either a positive (y = 1) or negative (y = 0) response. The covariate variables are
- (1) Pregnant (x1): number of pregnancies;
- (2) Glucose (x2): plasma glucose concentration based on the 2-h oral glucose tolerance test;
- (3) Pressure (x3): diastolic blood pressure (mm Hg);
- (4) Triceps (x4): triceps skin fold thickness (mm);
- (5) Insulin (x5): two-hour serum insulin (μU/mL);
- (6) Mass (x6): body mass index;
- (7) Pedigree (x7): diabetes pedigree function;
- (8) Age (x8): the age in years.
This data set can be obtained from the R package “mlbench”. The R codes for this sample analysis are given in Appendix B.
First, we cleaned the data set by removing 376 cases that contained incomplete covariate information. After data cleaning, 392 cases are kept as the final data set for modeling. In the cleaned sample, the diabetes rate of the response variable is 0.332. Following the suggestion of Chiang et al. [], Pressure, Mass, and Age are selected as the covariate variables of the second logit model, labeled by z1, z2, and z3, respectively. The first entries of x_i and z_i are set to 1 to accommodate the intercept terms. Using the EM algorithm proposed by Chiang et al. [], the EM-ZIBer model was established and reported in Table 2.
Table 2.
The estimates of the model coefficients for the diabetes data set.
Chiang et al. [] suggested the best cutting probability of the ZIBer model for this data set, which results in an accuracy of 80.6%. The logit model was also established for the diabetes data set and reported in the third column of Table 2. The best cutting probability of the typical logit model results in an accuracy of 79.8%.
The ENR-ZIBer method with k = 150 was established to compete with the logit model and the EM-ZIBer method for the best model. The model coefficients of the ENR-ZIBer model are reported in the fourth column of Table 2. As discussed in Section 4, the ENR-ZIBer method with k = 150 includes 50 lasso, 50 ridge, and 50 elastic net penalty regularization models, with λ1 and λ2 generated as described in Section 4, for the model selection competition. The MLEs of the EM-ZIBer method were used as the initial solutions of the model parameters to implement the ENR-ZIBer method. Finally, we obtained the best model with the lowest error rate. Moreover, the best cutting probability based on the ENR-ZIBer method is 0.45, which results in an error rate of 0.188, or an accuracy of 1 − 0.188 = 81.2%. The accuracy of the ENR-ZIBer model beats those of the logit model and the EM-ZIBer method. We also found that Triceps (x4) contributes less to the binary classification using the ENR-ZIBer model than the other covariates in x.
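The best cutting probability reported above can be found by a simple grid search over candidate thresholds, as in the following hedged R sketch; pp.est denotes the fitted probabilities and y the observed labels, both assumed to be available (the errR.f function in Appendix B performs an equivalent search).

## Minimal sketch (illustrative): choose the cutting probability that minimizes the
## training error rate, then classify observations with it.
best_cut <- function(pp.est, y, grid = seq(0.1, 0.7, by = 0.05)) {
  err <- sapply(grid, function(cut) mean((pp.est >= cut) != y))
  grid[which.min(err)]
}
## Usage (hypothetical fitted probabilities pp.est and labels y):
## cut.p <- best_cut(pp.est, y); y.hat <- as.integer(pp.est >= cut.p)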
6. Concluding Remarks
In this study, we established a loss function composed of the negative log-likelihood function of the ZIBer model and the L1- and L2-norm penalty terms. The computation procedure to obtain reliable estimates for the proposed ENR-ZIBer method was also developed using the GDM to overcome the computational difficulties caused by the complicated target function for minimization. To accelerate the convergence of the GDM for zero-inflated data, the momentum term is used in the learning rate for the numerical computation to implement the proposed ENR-ZIBer method. We also find that the ENR-ZIBer method with k = 150 works well. This setting includes the lasso and ridge regularization rules in the pool of penalty models to compete for the best model.
Monte Carlo simulations were conducted to evaluate the performance of the proposed method. Moreover, the performance of the proposed ENR-ZIBer method was compared with the EM-ZIBer method and the logit model. Simulation results show that the proposed ENR-ZIBer method with k = 150 outperforms its competitors with the lowest MER. We recommend using the ENR-ZIBer method with k = 150 for characterizing zero-inflated data when feature selection is a consideration.
The proposed method fills the gap left by the EM-ZIBer method proposed by Chiang et al. [] by reaching a compromise between high accuracy and feature selection for zero-inflated data sets. Other feasible learning rates could have performance similar to the momentum term. Moreover, applying the elastic net regularization rule to other zero-inflated models can be another important topic. These two topics will be studied in the near future.
Author Contributions
Conceptualization, investigation, writing, review and editing, and project administration: T.-R.T.; validation, investigation, and review and editing: Y.L.; methodology: H.X.; investigation: H.-C.C. and H.X.; funding acquisition: T.-R.T. and H.X. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Science and Technology Council, Taiwan, grant number NSTC 112-2221-E-032-038-MY2; and Heilongjiang Provincial Natural Science Foundation, grant number LH2023D024.
Institutional Review Board Statement
Not applicable.
Data Availability Statement
The diabetes data set can be found from the R package “mlbench”.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Figures for Mean Error Rates
Figure A1.
The MERs from top to bottom for the sample sizes (a) n = 300 and (b) n = 500.
Appendix B. R Codes for the Diabetes Example
rm(list=(ls(all=T)))
library(mlbench)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 ...
## $ glucose : num 148 85 183 89 137 116 ...
## $ pressure: num 72 66 64 66 40 74 ...
## $ triceps : num 35 29 NA 23 35 ...
## $ insulin : num NA NA NA 94 168 NA ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 ...
## $ pedigree: num 0.627 0.351 0.672 ...
## $ age : num 50 31 32 21 33 30 4 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
## ENR-ZIBER functions---------------------------------------
## mx is a n*(p+1) matrix, the first col. is 1's
## mz is a n*(m+1) matrix, the first col. is 1's
## y is a n*1 vector
## be.o: old beta; be.n: updated beta
## the.o: old theta; the.n: updated theta
## dell.be: the vector of the gradient of ell w.r.t. beta
## dell.the: the vector of the gradient of ell w.r.t. theta
## Randomly generate k combinations of lam1 and lam2
## lam1=lam*(1-alp)/2
## lam2=lam*alp
## k: the number of combinations of (lam1, lam2)
## estimating the parameters using elastic net regularization
## be0, the0: initial solutions of be and the
##
##-----------------------------------------------------------------
## Obtain the estimators of beta and the, and the est. errR and y.hat
##-----------------------------------------------------------------
## start function--------------------------------------------------
# scaling
minmax <- function(x) {
(x - min(x)) / (max(x) - min(x))
} # min-max scaling function
## return the elastic net regularization target:
## -(log-likelihood)+(lam1/2)*(norm_2(beta))^2+lam2*norm_1(beta)
## full log-like = sum(yi*log(pi_i/(1-pi_i)) + log(1-pi_i)); only the
## sum(yi*log(pi_i/(1-pi_i))) part is used below to monitor convergence
like.f<-function(be, the, lam1, lam2){
n=nrow(x)
log.lik=0
for(i in 1:n){
s.be=sum(be*x[i,])
s.the=sum(the*z[i,])
pi=exp(s.be)/((1+exp(s.the))*(1+exp(s.be)))
log.lik=log.lik+y[i]*log(pi/(1-pi))
}
return(round(-log.lik+(lam1/2)*(sum(be^2))+lam2*sum(abs(be)),5))
}
## the first derivatives of the elastic net regularization target
dbe_the<-function(be, the, lam1, lam2){
## The 4 terms in the derivatives of ell
## input the ith rows of x, z; be.o, the.o
term.f<-function(xi, zi, be, the){
## exp. terms in the derivative of ell w.r.t. beta
s.be=sum(be*xi); s.the=sum(the*zi)
term1=exp(s.be+s.the)/(1+exp(s.the)+exp(s.be+s.the))
term2=exp(s.be)/(1+exp(s.be))
## exp. terms in the derivative of ell w.r.t. theta
term3=(exp(s.the)+exp(s.be+s.the))/(1+exp(s.the)+exp(s.be+s.the))
term4=exp(s.the)/(1+exp(s.the))
return(c(term1, term2, term3, term4))
}
## evaluate the two first derivatives
dell.be=as.vector(rep(0,p+1))
dell.the=rep(0,m+1)
for(i in 1:nrow(x)){
terms.f=term.f(x[i,], z[i,], be, the)
dell.be=dell.be+(y[i]+(1-y[i])*terms.f[1]-terms.f[2])*x[i,]
dell.the=dell.the+((1-y[i])*terms.f[3]-terms.f[4])*z[i,]
}
## L1 subgradient vector for beta
lam2.v=sign(be)*rep(lam2, p+1)
## gradients of the penalized loss: -d(log-lik)/d(beta)+lam1*be+lam2.v and -d(log-lik)/d(theta)
dell.be=-dell.be+lam1*be+lam2.v
dell.the=-dell.the
return(list(dell.be=dell.be, dell.the=dell.the))
}
## update the parameters one time
update.f<-function(be, the, gam1, gam2, lam1, lam2){
grad<-dbe_the(be, the, lam1, lam2)
be.new<-be-gam1*grad$dell.be
the.new<-the-gam2*grad$dell.the
return(list(be.new=be.new,the.new=the.new))
}
## searching the optimal estimates using momentum updating
grad.des<-function(be.o, the.o, gam1, gam2, lam1, lam2, M){
nu.t1=nu.t2=0
be.new=be.o+nu.t1
the.new=the.o+nu.t2
for (i in 1:M){
grad.para=dbe_the(be.new, the.new, lam1, lam2)
nu.t1=0.9*nu.t1-gam1*grad.para$dell.be
nu.t2=0.9*nu.t2-gam2*grad.para$dell.the
be.new=be.new+nu.t1
the.new=the.new+nu.t2
loglike.o=like.f(be.o, the.o, lam1, lam2)
loglike.n=like.f(be.new, the.new, lam1, lam2)
err<-abs(loglike.n-loglike.o)
## cat(i, err, "")
if (err<0.001){break} ## -log.like increases or err is smaller than threshold
else{
be.o<-be.new; the.o<-the.new
}
}
return(list(be.best=round(be.new,4),the.best=round(the.new,4),
iter.f=i,loglike.f=loglike.n))
}
## best cut prob. and error rate
errR.f=function(x.tra, y.tra, z.tra, be.est, the.est,
x.tes, z.tes, y.tes){
n=nrow(x.tra)
pp.est=array()
for(i in 1:n){
s.be=exp(sum(be.est*x.tra[i,]))
s.the=exp(sum(the.est*z.tra[i,]))
pp.est[i]=round(s.be/((1+s.be)*(1+s.the)),3)
}
errR=array()
seq.v=seq(0.1, 0.7, 0.05);
for (jj in 1:length(seq.v)){
y.est=rep(0,n); y.est[which(pp.est>=seq.v[jj])]=1
errR[jj]=length(which(y.est!=y.tra))/n
}
## take the minimal one if multiple values are available
best.cut=min(seq.v[which(errR==min(errR))], na.rm=TRUE)
nt=nrow(x.tes)
pp.est=array()
for(i in 1:nt){
s.be=exp(sum(be.est*x.tes[i,]))
s.the=exp(sum(the.est*z.tes[i,]))
pp.est[i]=round(s.be/((1+s.be)*(1+s.the)),3)
}
y.est=rep(0,nt); y.est[which(pp.est>=best.cut)]=1
errR.tes=round(length(which(y.est!=y.tes))/nt,4)
return(list(bestCutP=best.cut, y.est=y.est, errR.tes=errR.tes))
}
## end function----------------------------------------------
data("PimaIndiansDiabetes2", package = "mlbench")
Diabetes2 <- na.omit(PimaIndiansDiabetes2)
Diabetes2.sta=apply(Diabetes2[,c(-9)],2,minmax)
diabetes<-factor(Diabetes2[,c(9)])
Diabetes3=data.frame(Diabetes2.sta,diabetes)
xx=Diabetes3[,c(-9)]; xx=apply(xx,2,minmax)
zz=Diabetes3[,c(3,6,8)]; zz=apply(zz,2,minmax)
y=ifelse(Diabetes3$diabetes=="neg", 0, 1)
## x=xxtra; z=zztra; y=ytra
n=length(y)
cnst=rep(1,n)
x=data.frame(cnst, xx); z=data.frame(cnst, zz)
## parameters
## be: beta
## the: theta
## k: the number of randomly generated combinations of (lam1, lam2)
## gam1, gam2: constant learning rate
## M: max. number of iterations for the gradient descent algorithm
## iter: the repetition of simulations
## seed.r: used for set.seed
gam1=0.01
gam2=0.01
M=500
k=150
p=ncol(x)-1
m=ncol(z)-1
be0=rep(1,ncol(x))
the0=rep(1,ncol(z))
## update parameters by the gradient descent method
## Initial solutions of the parameters (random initial values)
## lasso, ridge, ENR
set.seed(123)
lam=sample(seq(0, 1, 0.001), k)
seq1=sample(1:k, round(k/3))
seq2=sample(1:k, round(k/3))
seq3=sample(1:k, k-2*round(k/3))
lam.1=lam[seq1]
lam.2=lam[seq2]
lam.3=lam[seq3]
alp=0; penCoef1=cbind(lam.1*(1-alp)/2, lam.1*alp) ## ridge (lam2=0)
alp=1; penCoef2=cbind(lam.2*(1-alp)/2, lam.2*alp) ## lasso (lam1=0)
alp=0.5; penCoef3=cbind(lam.3*(1-alp)/2, lam.3*alp) ## elastic net
penCoef=rbind(penCoef1,penCoef2,penCoef3)
## MLEs as initial values
## be.o=rnorm(length(be0)); the.o=rnorm(length(the0))
be.o=c(-7.562, 2.665, 6.931, 1.99, -0.19, 3.34,
5.246, 5.313, -2.863);
the.o=c(0.005, 2.232, -1.94, -10.494)
sink("Diabetes_ENR-ZIBerk150.lst")
iterBest=logLikeli=errR.v=bestCut_P=array()
beEst=array(dim=c(k, length(be0)))
theEst=array(dim=c(k, length(the0)))
for (i in 1:k){
aaa=grad.des(be.o, the.o, gam1, gam2, penCoef[i,1], penCoef[i,2], M)
be.best=as.numeric(aaa$be.best)
the.best=as.numeric(aaa$the.best)
iterBest[i]=aaa$iter.f
beEst[i,]=be.best
theEst[i,]=the.best
logLikeli[i]=aaa$loglike.f
errR.v[i]=errR.f(x, y, z, aaa$be.best, aaa$the.best, x, z, y)$errR.tes
cat(c(i, penCoef[i,1], penCoef[i,2], aaa$iter.f, be.best, the.best, aaa$loglike.f, errR.v[i]), "\n")
}
ind.errR=which(errR.v==min(errR.v)) ## using error rate
ind2=which(penCoef[ind.errR,2]==max(penCoef[ind.errR,2]))
ind=ind.errR[ind2] ## map back to the original row index of penCoef
lam1.f=penCoef[ind,1]; lam2.f=penCoef[ind,2];
be.h=beEst[ind,]; the.h=theEst[ind,]
iter.best=iterBest[ind]
errR.best=errR.v[ind]
cat("\n")
cat(iter.best, lam1.f, lam2.f, be.h, the.h, errR.best, "\n")
cat("\n")
sink()
write.csv(iterBest, file="iterBestk150.csv", row.names = FALSE)
write.csv(beEst, file="beEstk150.csv", row.names = FALSE)
write.csv(theEst, file="theEstk150.csv", row.names = FALSE)
write.csv(errR.v, file="errR150.csv", row.names = FALSE)
References
- Diop, A.; Diop, A.; Dupuy, J.-F. Simulation-based inference in a zero-inflated Bernoulli regression model. Commun. Stat.-Simul. Comput. 2016, 45, 3597–3614. [Google Scholar] [CrossRef]
- Ridout, M.; Demétrio, C.G.B.; Hinde, J. Models for counts data with many zeros. In Proceedings of the XIXth International Biometric Conference, Cape Town, South Africa, 14–18 December 1998; Invited Papers. International Biometric Society: Cape Town, South Africa, 1998; pp. 179–192. [Google Scholar]
- Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14. [Google Scholar] [CrossRef]
- Hall, D.B. Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics 2000, 56, 1030–1039. [Google Scholar] [CrossRef] [PubMed]
- Cheung, Y.B. Zero-inflated models for regression analysis of count data: A study of growth and development. Stat. Med. 2002, 21, 1461–1469. [Google Scholar] [CrossRef] [PubMed]
- Gelfand, A.E.; Citron-Pousty, S. Zero-inflated models with application to spatial count data. Environ. Ecol. Stat. 2002, 9, 341–355. [Google Scholar]
- Rodrigues, J. Bayesian analysis of zero-inflated distributions. Commun. Stat.-Theory Methods 2003, 32, 281–289. [Google Scholar] [CrossRef]
- Ghosh, S.K.; Mukhopadhyay, P.; Lu, J.C. Bayesian analysis of zero-inflated regression models. J. Stat. Plan. Inference 2006, 136, 1360–1375. [Google Scholar] [CrossRef]
- Harris, M.N.; Zhao, X. A zero-inflated ordered probit model, with an application to modelling tobacco consumption. J. Econom. 2007, 141, 1073–1099. [Google Scholar] [CrossRef]
- Loeys, T.; Moerkerke, B.; De Smet, O.; Buysse, A. The analysis of zero-inflated count data: Beyond zero-inflated Poisson regression. Br. J. Math. Stat. Psychol. 2011, 65, 163–180. [Google Scholar] [CrossRef]
- Diop, A.; Diop, A.; Dupuy, J.F. Maximum likelihood estimation in the logistic regression model with a cure fraction. Electron. J. Stat. 2011, 5, 460–483. [Google Scholar] [CrossRef]
- Staub, K.E.; Winkelmann, R. Consistent estimation of zero-inflated count models. Health Econ. 2012, 22, 673–686. [Google Scholar] [CrossRef] [PubMed]
- He, H.; Tang, W.; Wang, W.; Crits-Christoph, P. Structural zeroes and zero-inflated models. Shanghai Arch. Psychiatry 2014, 26, 236–242. [Google Scholar] [PubMed]
- Zuur, A.F.; Ieno, E.N. Beginner’s Guide to Zero-Inflated Models with R; Highland Statistics Limited: Newburgh, NY, USA, 2016. [Google Scholar]
- Lee, S.M.; Pho, K.H.; Li, C.S. Validation likelihood estimation method for a zero-inflated Bernoulli regression model with missing covariates. J. Stat. Plan. Inference 2021, 214, 105–127. [Google Scholar] [CrossRef]
- Pho, K.H. Goodness of fit test for a zero-inflated Bernoulli regression model. Commun. Stat.-Simul. Comput. 2024, 53, 756–771. [Google Scholar] [CrossRef]
- Li, C.S.; Lu, M. Semiparametric zero-inflated Bernoulli regression with applications. J. Appl. Stat. 2022, 49, 2845–2869. [Google Scholar] [CrossRef]
- Pho, K.H. Zero-inflated probit Bernoulli model: A new model for binary data. In Communications in Statistics-Simulation and Computation; Taylor & Francis: Abingdon, UK, 2023; pp. 1–21. [Google Scholar]
- Lu, M.; Li, C.S.; Wagner, K.D. Penalised estimation of partially linear additive zero-inflated Bernoulli regression models. J. Nonparametric Stat. 2024, 36, 863–890. [Google Scholar] [CrossRef]
- Chiang, J.-Y.; Lio, Y.L.; Hsu, C.-Y.; Ho, C.-L.; Tsai, T.R. Binary classification with imbalanced data. Entropy 2023, 26, 15. [Google Scholar] [CrossRef]
- Yu, X.-H.; Chen, G.A. Efficient backpropagation learning using optimal learning rate and momentum. Neural Netw. 1997, 10, 517–527. [Google Scholar] [CrossRef]
- Attoh-Okine, N.O. Analysis of learning rate and momentum term in backpropagation neural network algorithm trained to predict pavement performance. Adv. Eng. Softw. 1999, 30, 291–302. [Google Scholar] [CrossRef]
- Qian, N. On the momentum term in gradient descent learning algorithms. Neural Netw. 1999, 12, 145–151. [Google Scholar] [CrossRef]
- Wang, J.; Yang, J.; Wu, W. Convergence of cyclic and almost-cyclic learning with momentum for feedforward neural networks. IEEE Trans. Neural Netw. 2011, 22, 1297–1306. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Gao, Y.; Yin, W. An improved analysis of stochastic gradient descent with momentum. Adv. Neural Inf. Process. Syst. 2020, 33, 18261–18271. [Google Scholar]
- Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed]
- Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J. Stat. Softw. 2011, 39, 1–13. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).