Modeling the Cigarette Consumption of Poor Households Using Penalized Zero-Inflated Negative Binomial Regression with Minimax Concave Penalty

Yudhie Andriyana; Rinda Fitriani; Bertho Tantular; Neneng Sunengsih; Kurnia Wahyudi; I Gede Nyoman Mindra Jaya; Annisa Nur Falah

doi:10.3390/math11143192

¹ Department of Statistics, Faculty of Mathematics and Natural Sciences, Universitas Padjadjaran, Sumedang 45363, Indonesia

² National Bureau of Statistics of Blitar Municipality, Blitar 66137, Indonesia

³ Department of Medicine, Faculty of Medicine, Universitas Padjadjaran, Sumedang 45363, Indonesia

⁴ Doctoral Program of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Padjadjaran, Sumedang 45363, Indonesia

Mathematics 2023, 11(14), 3192; https://doi.org/10.3390/math11143192

This article belongs to the Special Issue Statistical Modeling for Analyzing Data with Complex Structures

Abstract

Cigarettes are the second largest contributor to the food poverty line, which indicates that poor people consume cigarettes despite having a minimal income. In this study, we investigate the factors that influence poor people to be active smokers. Since cigarette consumption is count data with excess zeros, we face an overdispersion problem, which means that a standard Poisson regression is not appropriate. In addition, the factors entering the model need to be selected while the coefficients are estimated. Therefore, we propose to use a zero-inflated negative binomial (ZINB) regression with a minimax concave penalty (MCP) to determine the dominant factors influencing cigarette consumption in poor households. The data used in this study were microdata from the National Socioeconomic Survey (SUSENAS) conducted in March 2019 in East Java Province, Indonesia. The results show that poor households with a male head of household, having no education, working in the informal sector, having many adult household members, and receiving social assistance tend to consume more cigarettes than others. Additionally, cigarette consumption decreases with the increasing age of the head of household.

Keywords:

penalized; zero inflated; negative binomial; minimax concave penalty; variable selection; cigarette consumption; poor household

MSC:

62J07

1. Introduction

Indonesia has committed to poverty alleviation in the agenda for Sustainable Development Goals (SDGs) and the National Medium-Term Development Plan 2020–2024. It is expected to be able to reduce the national and regional poverty rates. However, from 2010 to 2020, the poverty rate in one of the provinces in Indonesia, East Java Province, was consistently higher than the national poverty rate. The poverty rate in East Java is still at the double-digit level compared to the national poverty rate, which has reached a single-digit level since 2018. This indicates a need for targeted poverty alleviation programs in East Java.

The Indonesia National Socioeconomic Survey conducted in 2019 indicated that cigarettes are the second largest contributor to the poverty line after rice. The contribution is around 11.82 percent in urban areas and 10.74 percent in rural areas. This suggests that cigarettes are treated as one of the basic needs of poor households in East Java. Cigarette consumption in low-income families is related to their smoking behavior, which is measured by the smoking intensity, represented by the number of cigarettes consumed.

The number of cigarettes consumed is a type of count data (the non-negative integer values). Poisson and negative binomial regressions are commonly used to build predictive models on count data [1]. However, the high frequency of data with zero values can make the regression model less precise in explaining the data. The count data with excess zeros may have significant variations that cause overdispersion.

The zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models can be applied to accommodate the excess zero observations. However, the ZIP model assumes that the variance and mean of its count component are equal, which is called equidispersion, and real data rarely satisfy this assumption. The ZINB model has no such assumption [2], so it is more suitable for accommodating the overdispersion that the ZIP model cannot handle [1].

Various demographic and socioeconomic factors may determine cigarette consumption in poor households [3]. Including many predictor variables in the regression model may lead to multicollinearity. Therefore, a parsimonious model is required for better interpretation and more accurate prediction [4]. Stepwise or subset selection methods have weaknesses, including inconsistency of the model selection algorithm and multiple hypothesis testing problems [5]. Therefore, we implement a method that can select variables and estimate coefficients simultaneously, as proposed by [6]. Hence, this study aims to estimate the model parameters and select the variables to determine the dominant factors influencing cigarette consumption of poor households in East Java, Indonesia, by applying the penalized zero-inflated negative binomial (ZINB) regression model.

Binomial models are extensively used, and numerous techniques have been suggested for estimation, prediction, and various other applications. In the case of estimation and variable selection within the normal linear model, the field of sparse estimation encompasses various methods such as the least absolute shrinkage and selection operator (LASSO) [7], smoothly clipped absolute deviation (SCAD) [8], and minimax concave penalty (MCP) [9].

2. Materials and Methods

2.1. Data

The data were taken from the microdata of the National Socioeconomic Survey (SUSENAS) conducted in March 2019 in East Java Province, Indonesia. The response variable is the weekly average cigarette consumption of poor households during the past month. The predictors are related to household characteristics. For example, a discussion on the poverty data was delivered by [10]. The detailed predictors are presented in Table 1.

Table 1. Predictor Variables.

2.2. Zero-Inflated Negative Binomial (ZINB)

The zero-inflated negative binomial (ZINB) regression is a model that accommodates the overdispersion caused by excess zeros in the response variable. In the ZINB regression model, the response variable $Y_{i}$ has the following probability mass function [1,11]:

$P (Y_{i} = y_{i}) = \begin{cases} π_{i} + (1 - π_{i}) {(\frac{θ}{μ_{i} + θ})}^{θ}, & \text{if } y_{i} = 0 \\ (1 - π_{i}) \frac{Γ (θ + y_{i})}{Γ (y_{i} + 1) Γ (θ)} {(\frac{μ_{i}}{μ_{i} + θ})}^{y_{i}} {(\frac{θ}{μ_{i} + θ})}^{θ}, & \text{if } y_{i} > 0 \end{cases}$

(1)

where $0 \leq π_{i} \leq 1$ is the probability that observation $i$ is an excess (structural) zero, $μ_{i} \geq 0$ is the mean of the negative binomial component, $i = 1, \dots, n$, and $\frac{1}{θ}$ is the dispersion parameter with $θ > 0$. Assume $μ_{i}$ and $π_{i}$ depend on the predictor vectors ${\vec{x}}_{i}$ and ${\vec{v}}_{i}$, respectively. The length of the predictor vector ${\vec{x}}_{i}$ is $q_{1}$, while the length of the predictor vector ${\vec{v}}_{i}$ is $q_{2}$. The link functions are as follows:

$\log (μ_{i}) = {\vec{x}}_{i}^{t} \vec{β} \quad \text{and} \quad \log (\frac{π_{i}}{1 - π_{i}}) = {\vec{v}}_{i}^{t} \vec{ζ}$

(2)

where $\vec{β}$ and $\vec{ζ}$ are the regression coefficients. For $n$ independent random samples, let $\vec{ϕ} = {({\vec{β}}^{t}, θ, {\vec{ζ}}^{t})}^{t}$ be the parameter vector. The log-likelihood function is given by:

$l (\vec{ϕ}) = \sum_{y_{i} = 0} \log [π_{i} + (1 - π_{i}) {(\frac{θ}{μ_{i} + θ})}^{θ}] + \sum_{y_{i} > 0} \log [(1 - π_{i}) \frac{Γ (θ + y_{i})}{Γ (y_{i} + 1) Γ (θ)} {(\frac{μ_{i}}{μ_{i} + θ})}^{y_{i}} {(\frac{θ}{μ_{i} + θ})}^{θ}]$

(3)
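As an illustration of Equations (1) to (3), the following base R sketch evaluates the ZINB mass function and log-likelihood. The function names `dzinb` and `zinb_loglik`, the toy design matrices, and all numeric values are assumptions made for this example; they are not taken from the SUSENAS data or from the packages used in Appendix A. The sketch relies on `dnbinom(y, size = theta, mu = mu)`, whose parameterization matches the NB term in Equation (1).

```r
# ZINB probability mass function following Equation (1).
# pi_i: excess-zero probability, mu: NB mean, theta: NB size (1/theta = dispersion).
dzinb <- function(y, mu, theta, pi_i) {
  nb <- dnbinom(y, size = theta, mu = mu)        # NB mass at y
  ifelse(y == 0, pi_i + (1 - pi_i) * nb, (1 - pi_i) * nb)
}

# Log-likelihood of Equation (3) under the links of Equation (2).
# X and V are design matrices that already contain an intercept column.
zinb_loglik <- function(beta, zeta, theta, y, X, V) {
  mu   <- exp(drop(X %*% beta))                  # log link for the NB mean
  pi_i <- plogis(drop(V %*% zeta))               # logit link for the zero probability
  sum(log(dzinb(y, mu, theta, pi_i)))
}

# Toy data with made-up parameter values (hypothetical, for illustration only)
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))
V <- X
y <- ifelse(runif(n) < 0.4, 0, rnbinom(n, size = 1.5, mu = 20))
ll <- zinb_loglik(beta = c(3, 0.1), zeta = c(-0.4, 0), theta = 1.5,
                  y = y, X = X, V = V)
ll
```

Writing the mass function separately from the log-likelihood makes it easy to check that the two branches of Equation (1) sum to one over the support.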

2.3. Penalized Zero-Inflated Negative Binomial (ZINB) Regression

Variable selection is the key to producing a parsimonious model that facilitates interpretation [12]. Modern variable selection methods were developed to overcome the weaknesses of the traditional methods by adding a penalty function to the likelihood. The penalized log-likelihood used for variable selection in the ZINB regression model is as follows [4]:

$pl (ϕ) = l (ϕ) - p (\vec{β}, \vec{ζ})$

(4)

where $l (ϕ)$ is the log-likelihood of the ZINB model and $p (\vec{β}, \vec{ζ})$ is the non-negative penalty function given by:

$p (\vec{β}, \vec{ζ}) = n \sum_{j = 1}^{q_{1}} p (λ_{NB}; |β_{j}|) + n \sum_{k = 1}^{q_{2}} p (λ_{BI}; |ζ_{k}|)$

(5)

where $q_{1}$ and $q_{2}$ are the numbers of parameters for the NB and zero components, respectively, $β_{j}$ represents the NB component coefficients, and $ζ_{k}$ represents the zero component coefficients. The tuning parameters $λ_{NB}$ and $λ_{BI}$ are determined by data-driven methods. The intercepts and the parameter $θ$ are always included in the model; hence, they are not penalized.

This study used three penalty functions, as follows:

Least absolute shrinkage and selection operator (LASSO). The penalty function is given by [7]:

$p (λ^{L}; |ξ|) = λ^{L} |ξ|$

(6)

where $ξ$ is the coefficient being penalized.

Smoothly clipped absolute deviation (SCAD). The first derivative of the SCAD penalty function on [0, ∞) is given by [8]:

$p_{λ^{S}}^{'} (ξ) = λ^{S} \{I (ξ \leq λ^{S}) + \frac{{(γ^{S} λ^{S} - ξ)}_{+}}{(γ^{S} - 1) λ^{S}} I (ξ > λ^{S})\}$

(7)

for $γ^{S} > 2, λ^{S} \geq 0, ξ > 0$, where $I (.)$ is the indicator function and ${(a)}_{+} = \max (a, 0)$.

Minimax concave penalty (MCP). The first derivative of the MCP penalty function on [0, ∞) is given by [13]:

$p^{'}_{λ^{M}, γ^{M}} (ξ) = \begin{cases} λ^{M} - \frac{ξ}{γ^{M}}, & \text{if } ξ \leq γ^{M} λ^{M} \\ 0, & \text{if } ξ > γ^{M} λ^{M} \end{cases}$

(8)

for $γ^{M} > 1$ and $λ^{M} \geq 0$.
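To make the difference between the three penalties concrete, the base R sketch below evaluates their first derivatives, following Equations (6) to (8). The function names are hypothetical, and the default values 3.7 and 2.7 for the concavity parameters simply mirror the `gamma.count` and `gamma.zero` settings used in Appendix A. The derivative can be read as the amount of shrinkage applied to a coefficient of size ξ.

```r
# First derivatives of the penalties for xi >= 0.
# LASSO: constant shrinkage lambda, regardless of coefficient size.
d_lasso <- function(xi, lambda) rep(lambda, length(xi))

# SCAD: lambda up to xi = lambda, then linear decay to 0 at xi = gamma * lambda.
d_scad <- function(xi, lambda, gamma = 3.7) {
  stopifnot(gamma > 2)
  ifelse(xi <= lambda, lambda, pmax(gamma * lambda - xi, 0) / (gamma - 1))
}

# MCP: shrinkage decays linearly from lambda at xi = 0 to 0 at xi = gamma * lambda.
d_mcp <- function(xi, lambda, gamma = 2.7) {
  stopifnot(gamma > 1)
  pmax(lambda - xi / gamma, 0)
}

xi <- c(0, 0.5, 1, 2, 5)   # illustrative coefficient magnitudes
cbind(xi,
      lasso = d_lasso(xi, lambda = 1),
      scad  = d_scad(xi, lambda = 1),
      mcp   = d_mcp(xi, lambda = 1))
```

In this sketch the LASSO keeps shrinking large coefficients at the same rate, whereas SCAD and MCP apply no shrinkage once the coefficient exceeds γλ, which is the property that reduces bias for large effects.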

2.4. The EM Algorithm

The expectation maximization (EM) algorithm is generally applied to mixture models [12]. It is often used in missing data problems. The EM algorithm has two main steps: the expectation step (E-step) and the maximization step (M-step).

Let $Z_{i} = 1$ if $Y_{i}$ comes from the zero state and $Z_{i} = 0$ if $Y_{i}$ comes from the negative binomial (NB) distribution. Since $\vec{z} = {(z_{1}, \dots, z_{n})}^{t}$ is not observable, it is treated as missing data. If the complete data $(y_{i}, z_{i})$ were available, the penalized log-likelihood function would be given by [4]:

$pl_{c} (ϕ) = \sum_{i = 1}^{n} \{z_{i} {\vec{v}}_{i}^{t} \vec{ζ} - \log (1 + \exp ({\vec{v}}_{i}^{t} \vec{ζ})) + (1 - z_{i}) \log (f (y_{i}; \vec{β}, θ))\} - p (\vec{β}, \vec{ζ})$

(9)

where $f (y_{i}; \vec{β}, θ) = \frac{Γ (θ + y_{i})}{Γ (y_{i} + 1) Γ (θ)} {(\frac{μ_{i}}{μ_{i} + θ})}^{y_{i}} {(\frac{θ}{μ_{i} + θ})}^{θ}$ and $p (\vec{β}, \vec{ζ})$ is a non-negative penalty function.

The EM algorithm computes the expectation of the complete data log-likelihood, which is linear in $z$. The penalized log-likelihood of Equation (4) is maximized iteratively by alternating between estimating $z_{i}$ by its conditional expectation under the current estimates $\hat{ϕ}$ (E-step) and maximizing Equation (11) with $z_{i}$ replaced by its estimate from the E-step (M-step). The iteration stops when the estimate $\hat{ϕ}$ converges.

The E-step estimates $z_{i}$ by its conditional mean ${\hat{z}}_{i}^{(m)}$, given the data and assuming that the current estimates ${\hat{ϕ}}^{(m)}$ are the true model parameters. The conditional expectation of $z_{i}$ at iteration $m$ is given by:

${\hat{z}}_{i}^{(m)} = \begin{cases} {(1 + \exp (- {\vec{v}}_{i}^{t} {\hat{ζ}}^{(m)}) {[\frac{{\hat{θ}}^{(m)}}{\exp ({\vec{x}}_{i}^{t} {\hat{β}}^{(m)}) + {\hat{θ}}^{(m)}}]}^{{\hat{θ}}^{(m)}})}^{- 1}, & \text{if } y_{i} = 0 \\ 0, & \text{if } y_{i} > 0 \end{cases}$

(10)
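The E-step of Equation (10) can be sketched in a few lines of base R. The function name `e_step` and the toy inputs are assumptions for this illustration; the sketch is not the routine used by the `mpath` package. For an observed zero, the weight equals the posterior probability π/(π + (1 − π)p₀), where p₀ is the NB probability of a zero, which is algebraically the same as the first branch of Equation (10).

```r
# E-step: posterior probability that an observation belongs to the zero state.
# X, V: design matrices with intercept columns; beta, zeta, theta: current estimates.
e_step <- function(y, X, V, beta, zeta, theta) {
  mu <- exp(drop(X %*% beta))                     # NB mean under the log link
  p0 <- (theta / (mu + theta))^theta              # NB probability of a zero
  z  <- 1 / (1 + exp(-drop(V %*% zeta)) * p0)     # Equation (10), branch y_i = 0
  ifelse(y == 0, z, 0)                            # positive counts cannot be excess zeros
}

# Toy check with intercept-only designs (hypothetical values)
y <- c(0, 3, 0)
X <- matrix(1, nrow = 3, ncol = 1)
V <- X
z_hat <- e_step(y, X, V, beta = log(20), zeta = qlogis(0.4), theta = 1.5)
z_hat
```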

Therefore, the expectation of the complete data log-likelihood can be expressed as follows:

$Q (ϕ |{\hat{ϕ}}^{(m)}) = E (pl_{c} (ϕ |y, z) |y, {\hat{ϕ}}^{(m)}) = Q_{1} (\vec{β}, θ |{\hat{ϕ}}^{(m)}) + Q_{2} (\vec{ζ} |{\hat{ϕ}}^{(m)})$

(11)

where

$Q_{1} (\vec{β}, θ |{\hat{ϕ}}^{(m)}) = \sum_{i = 1}^{n} (1 - {\hat{z}}_{i}^{(m)}) \log (f (y_{i}; \vec{β}, θ)) - n \sum_{j = 1}^{q_{1}} p (λ_{NB}; |β_{j}|)$

and

$Q_{2} (\vec{ζ} |{\hat{ϕ}}^{(m)}) = \sum_{i = 1}^{n} \{{\hat{z}}_{i}^{(m)} {\vec{v}}_{i}^{t} \vec{ζ} - \log (1 + \exp ({\vec{v}}_{i}^{t} \vec{ζ}))\} - n \sum_{k = 1}^{q_{2}} p (λ_{BI}; |ζ_{k}|)$.

The EM algorithm updates the estimates by maximizing Equation (11), which can easily be achieved because $Q_{1}$ and $Q_{2}$ involve the disjoint parameter sets $(\vec{β}, θ)$ and $\vec{ζ}$ and can therefore be maximized separately. The first term, $Q_{1} (\vec{β}, θ |{\hat{ϕ}}^{(m)})$, is the weighted penalized NB log-likelihood function, whereas the second term, $Q_{2} (\vec{ζ} |{\hat{ϕ}}^{(m)})$, is the penalized logistic log-likelihood function. The EM algorithm iterates between the E-step and M-step until the optimum parameter $\hat{ϕ}$ is obtained.

The EM algorithm requires parameter estimation of the two terms, which are the components of the penalized ZINB model. The NB and logistic regression models are both generalized linear models (GLMs), and the parameters of regularized GLMs are estimated with a modified iteratively reweighted least squares (IRLS) algorithm, which uses coordinate descent to optimize the penalized weighted least squares problem [12].

2.5. Tuning Parameter Selection

The tuning parameters can be determined using the Bayesian information criterion (BIC) as follows [4]:

$BIC = - 2 l (\hat{ϕ}; λ_{NB}, λ_{BI}) + \log (n) \, df$

(12)

where $\hat{ϕ}$ is the estimated parameter vector, $λ_{NB} > 0$ is the tuning parameter of the NB component, $λ_{BI} > 0$ is the tuning parameter of the zero component, and $l (.)$ is the log-likelihood function. The degrees of freedom are given by $df = \sum_{j = 0, \dots, q_{1}} 1 \{β_{j} \neq 0\} + \sum_{k = 0, \dots, q_{2}} 1 \{ζ_{k} \neq 0\} + 1$, where the final term is the degree of freedom for the scale parameter θ. The first step is to construct a solution path based on paired shrinkage parameters. The algorithm generates two decreasing sequences of tuning parameters, $λ_{NB}^{(1)} > λ_{NB}^{(2)} > \dots > λ_{NB}^{(M)}$ and $λ_{BI}^{(1)} > λ_{BI}^{(2)} > \dots > λ_{BI}^{(M)}$, which are then paired as $(λ_{NB}^{(1)}, λ_{BI}^{(1)}), \dots, (λ_{NB}^{(M)}, λ_{BI}^{(M)})$. In principle, the largest values $λ_{NB}^{(1)}$ and $λ_{BI}^{(1)}$ can be chosen in such a way that all the coefficients are zero, except the intercepts. The tuning parameters are then chosen as the pair with the smallest BIC value [12].

3. Results

This study observed 3010 poor household samples from the National Socioeconomic Survey (SUSENAS) in March 2019 in East Java. The code for all computations in this section is presented in Appendix A. Based on the data, the highest cigarette consumption in poor households was 720 cigarettes per week during the past month, whereas the smallest was 0 cigarettes (no smoking). About 41.4 percent of poor households did not smoke, and the histogram is strongly right-skewed (Figure 1). The mean is around 45 cigarettes per week during the past month, with a relatively large standard deviation of 62.49, so cigarette consumption among poor households varied considerably. The corresponding variance (about 3905) is far greater than the mean, which indicates overdispersion.

Figure 1. Histogram of Cigarette Consumption of Poor Households in East Java 2019.

Figure 2 demonstrates that most poor households who smoked lived in rural areas. It is in line with a related study that cigarette consumption is higher among low-income families living in rural areas than in urban areas [14]. The poor households who smoked had a male head of household and were married. The head of the poor household who smoked generally had low education. Based on the previous study, families with a head of household who did not attend school tended to consume cigarettes more than those who attended school [15]. Smoking also appeared to be more prevalent in those employed than those unemployed. Cigarette consumption of poor households who worked in informal sectors is higher than those who worked in formal sectors. The previous study showed that families with a head of household who worked in the informal sector tended to smoke more than those who did not [16].

Figure 2. The Characteristics of Poor Households Who Smoke in East Java 2019.

Based on Figure 2, most poor households who smoked received social assistance. Moreover, poor households with children under five had a lower smoking percentage. In addition, more than 50% of poor households who smoked lived in their own house.

Overdispersion was checked by comparing the variance and the mean: if the variance is greater than the mean, overdispersion occurs. Overdispersion can also be assessed with the dispersion statistic, namely the Pearson chi-square statistic divided by its degrees of freedom [5]. Here, the Pearson chi-square statistic is 170,923 with 2989 degrees of freedom. Hence, the dispersion statistic is 57.1841, which is far above 1 and indicates overdispersion in the response variable.

The excess zero testing in the count data was done using the score test [17]. The resulting p-value of $2.22 \times 10^{- 16}$ is smaller than the 5 percent significance level, which indicates that the response variable has excess zeros, so a zero-inflated regression model is more appropriate. ZINB regression with backward elimination (BE) for 14 predictor variables was employed to model the cigarette consumption of poor households in East Java. The likelihood ratio test was used to test the parameters simultaneously, whereas the Wald test was employed to test the parameters individually [18]. The likelihood ratio statistic is 877.1114, greater than $χ_{(40, 0.05)}^{2}$, which is 55.76. In other words, at least one predictor variable influenced the response variable.
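The comparison with the chi-square critical value can be reproduced with base R, as in the sketch below. The statistic 877.1114 is the value reported in the text. The 40 degrees of freedom follow the `qchisq(0.95, 40)` call in Appendix A, which is consistent with the 20 design columns there entering both the count and the zero components.

```r
G    <- 877.1114                                  # likelihood ratio statistic from the text
crit <- qchisq(0.95, df = 40)                     # upper 5 percent critical value
pval <- pchisq(G, df = 40, lower.tail = FALSE)    # upper-tail p-value
round(crit, 2)
G > crit    # TRUE means the intercept-only model is rejected
```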

In the early steps of modeling the penalized ZINB regression, the tuning parameters were selected based on the smallest BIC value. In each penalty function, two tuning parameters were to be determined: the tuning parameter for the negative binomial component $λ_{NB}$ and the tuning parameter for the zero component $λ_{BI}$. In addition, there was an additional tuning parameter in the SCAD and MCP penalty functions [19,20,21].

Based on Table 2, the ZINB-LASSO model selected fewer predictor variables than the other penalized ZINB models. Because of the characteristics of the LASSO penalty function, only one variable tends to be chosen among predictor variables that are highly correlated [22]. The predictor variables selected in the zero component of the ZINB-SCAD and ZINB-MCP models were generally the same. Meanwhile, the ZINB-MCP model selected the fewest predictors in the NB component.

Table 2. ZINB Model on Poor Household Cigarette Consumption.

As seen in Table 3, the ZINB-MCP model has a smaller BIC value than the other four regression models. Moreover, the number of nonzero coefficients in the penalized ZINB-MCP model is smaller than in the other models, indicating that the ZINB-MCP model is parsimonious. Thus, the ZINB-MCP model could better model the cigarette consumption data of poor households in East Java in 2019. Based on the results of parameter estimation in Table 2, the ZINB-MCP model equation is as follows:

Table 3. Bayesian Information Criterion (BIC) on ZINB Model.

ZINB-MCP model in the negative binomial (NB) component

$\hat{μ} = \exp (3.7802 - 0.0052 X_{4} + 0.1598 X_{5 d_{1}} + 0.2368 X_{7})$
ZINB-MCP model in the zero component

$\ln (\frac{\hat{π}}{1 - \hat{π}}) = 1.8367 - 1.3676 X_{2} + 0.0173 X_{4} - 0.4035 X_{5 d_{3}} - 0.2186 X_{6 d_{2}} - 0.5599 X_{7} - 0.3240 X_{10}$
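One way to read these two equations is to exponentiate the coefficients, as in the base R sketch below. The coefficient values are copied from the fitted ZINB-MCP equations above; the vector names (`X4`, `X5d1`, and so on) are only labels chosen for this example. In the NB component, the exponentiated coefficient is a multiplicative change in the expected count; in the zero component, it is an odds ratio for being an excess zero, so a value below 1 means lower odds of zero consumption.

```r
# Coefficients copied from the fitted ZINB-MCP equations (intercepts omitted)
nb_coef   <- c(X4 = -0.0052, X5d1 = 0.1598, X7 = 0.2368)
zero_coef <- c(X2 = -1.3676, X4 = 0.0173, X5d3 = -0.4035,
               X6d2 = -0.2186, X7 = -0.5599, X10 = -0.3240)

rate_ratio <- exp(nb_coef)     # multiplicative effect on the expected count
odds_ratio <- exp(zero_coef)   # effect on the odds of being an excess zero

round(rate_ratio, 3)
round(odds_ratio, 3)
```

For instance, the coefficient of -1.3676 on `X2` corresponds to an odds ratio of roughly 0.25, and the coefficient of 0.2368 on `X7` corresponds to an expected count about 1.27 times higher per unit increase.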

The zero component model shows that the gender of the head of the household had a significant effect on cigarette consumption in poor households. Poor households with a male head of household tended to consume more cigarettes than those with a female head of household. The Indonesian Demographic and Health Survey (IDHS) in 2012 showed that most of the heads of households in Indonesia were male smokers [16]. Meanwhile, the NB and zero component models showed that the age variable significantly influenced cigarette consumption in poor households. Based on a previous study, cigarette consumption decreases as the age of the head of household increases [6]. It indicates that most of the young heads of households smoked cigarettes. Therefore, controlling tobacco or cigarette consumption should be targeted at all ages.

The education variable of the head of household significantly influenced cigarette consumption in poor households according to the NB and zero component models. Low-income families whose head of household did not attend school tended to consume more cigarettes than poor households whose head of household had higher education. This result aligns with studies in 48 low- and middle-income countries, in which those who did not attend school generally smoked three times more than those who were educated [6]. An educated person will have greater awareness of the health risks of smoking [23]. Therefore, education is the primary key to reducing cigarette consumption in poor households.

The zero component model found that the working status variable of the head of household significantly influenced cigarette consumption in poor households. Poor households with a head of household who worked in the informal sector tended to consume more cigarettes than those who did not, which is in line with the previous study [16]. Therefore, it is necessary to increase awareness of the dangers of smoking through media that are easily accessible to households. Provincial and regency/city governments also need to enforce regulations on non-smoking areas.

The NB and zero component models showed that the number of adult members variable significantly influenced cigarette consumption in poor households. In line with the previous study, adding one adult member would increase cigarette consumption in poor households. It is reasonable that the household member who smokes is an adult.

Furthermore, the zero component model found that the social assistance variable significantly influenced cigarette consumption in poor households. Poor households that received social assistance were more likely to consume cigarettes than those that did not. In line with the previous study, the average number of cigarettes consumed by the head of a household of poor households will increase along with the amount of social assistance received [24]. This phenomenon relates to the household’s behavior (moral hazard) in allocating excess income. They will likely not spend the additional income from social assistance benefits on basic needs. Another study showed that middle- or low-income households often allocate food expenditure for cigarette consumption [25]. Therefore, the government should encourage low-income families to use the social assistance benefits for food consumption, support education, health, and basic needs other than cigarette consumption, to improve their standard of living to get out of poverty.

4. Discussion

Cigarette consumption behavior was first modeled using zero-inflated negative binomial (ZINB) regression with backward elimination (BE). Because there are many potential covariates, this classical technique needs to be combined with a variable selection procedure. Hence, we penalized the ZINB regression using three penalty functions (LASSO, SCAD, and MCP). Based on the smallest BIC and RMSE values, the penalized ZINB-MCP regression performs better than the others. Out of 14 predictor variables (10 dummy), six predictors (two dummy) are selected: gender of the head of household, age of the head of household, education of the head of household in the non-school category, working status of the head of household in the informal sector category, number of adult household members, and social assistance. Some variables have a negative effect; for example, cigarette consumption decreases as the age of the head of household increases.

5. Conclusions

Cigarette consumption of poor households was modeled, and its predictors were selected, using a classical method (ZINB-BE) and modern methods, namely penalized ZINB regression with three penalty functions (LASSO, SCAD, and MCP). The best model, chosen by the smallest BIC value, was the ZINB-MCP model. This study indicated that poor households with a male head of household who was young, had no education, worked in the informal sector, had many adult household members, and received social assistance tended to consume more cigarettes. Therefore, the awareness of poor households about reducing cigarette consumption should be supported by improving their education and knowledge.

Author Contributions

Conceptualization, Y.A. and R.F.; methodology, Y.A.; software, R.F.; validation, Y.A., B.T., N.S. and K.W.; formal analysis, Y.A., B.T., N.S. and K.W.; investigation, Y.A.; resources, R.F.; data curation, R.F.; writing—original draft preparation, Y.A. and R.F.; writing—review and editing, Y.A. and A.N.F.; visualization, R.F.; supervision, Y.A., B.T., N.S., I.G.N.M.J. and K.W.; project administration, Y.A. and A.N.F.; funding acquisition, Y.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Universitas Padjadjaran via RPLK scheme with the contract No: 1549/UN6.3.1/PT.00/2023.

Data Availability Statement

Not applicable.

Acknowledgments

The authors gratefully thank Universitas Padjadjaran for supporting this research, which was funded by the RPLK scheme with contract No: 1549/UN6.3.1/PT.00/2023. We also thank the reviewers for their valuable comments on this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

 
### RCODES ###
rm(list=ls())
 
library("mpath")
library("zic")
library("pscl")
library(vcdExtra)

data=read.table(file.choose(),header=TRUE,sep=",")
 
y <- data$ciga
x1<- data$Residence
x2<- data$Gender
x3<- data$Single
x4<- data$Divorce
x5<- data$Age
x6<- data$Secondary
x7<- data$Primary
x8<- data$NoEduc
x9<- data$Formal
x10<- data$Informal
x11<- data$Adultmembers
x12<- data$HouseholdInternet
x13<- data$HouseholdWork
x14<- data$SocialAssistance
x15<- data$ToddlerExistence
x16<- data$Rent
x17<- data$FreeRent
x18<- data$Other
x19<- data$HealthExpenditure
x20<- data$EducationExpenditure
 
#--------------Histogram for the response variable----------------
h<-hist(y,main="Histogram of Cigarette Consumption",
ylab="Frequency",xlab="Cigarettes (sticks)",
xlim=c(0,600),ylim=c(0,1500),breaks=50,col="blue", freq=TRUE)

#----------Goodness of fit of the Response Variable-----------
# Poisson with KS test
ks.test(y,"ppois",lambda = mean(y))

# Poisson Dispersion Test/Variance Test
TCC <-((length(y)-1)*var(y))/mean(y)
TCC
# upper 5% critical value: overdispersion is indicated when TCC exceeds it
qchisq(0.95,length(y)-1)
 
#------------Overdispersion----------------------
#Model: POISSON
library(MASS)
Pois <-glm(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+x15+
x16+x17+x18+x19+x20, family=poisson)
summary(Pois)
 
 
#------------------Zero Excess Test----------------
zero.test(y)
 
##----------------Estimation Model Using ZINB-----------
dat<-cbind(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,
x16,x17,x18,x19,x20)
dat<-as.data.frame(dat)
m1<-zeroinfl(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+x15+
x16+x17+x18+x19+x20,dist="negbin", data=dat)
summary(m1)

cat("loglik of zero-inflated model", logLik(m1))
cat("BIC of zero-inflated model", AIC(m1, k=log(dim(dat)[1])))
cat("AIC of zero-inflated model", AIC(m1))
 
res.zinb=m1$residual
rmse.zinb=sqrt(mean((res.zinb)^2))
rmse.zinb
 
#----Likelihood ratio for simultaneous test----
m2<-zeroinfl(y~1,dist="negbin", data=dat)
summary(m2)
l0<- logLik(m2)
lp<- logLik(m1)
G<- -2*(l0-lp)
G
qchisq(0.95,40)
 
##------------Estimation Model Using ZINB BE (alpha = 0.05)------
fitbe<-be.zeroinfl(m1,data=dat, dist="negbin", alpha=0.05,
trace=FALSE)
summary(fitbe)

cat("loglik of zero-inflated model with backward selection",
logLik(fitbe))
cat("BIC of zero-inflated model with backward selection",
AIC(fitbe, k=log(dim(dat)[1])))
 
minBic <- which.min(BIC(fitbe))
AIC(fitbe)[minBic]
BIC(fitbe)[minBic]
 
res.be=fitbe$residual
rmse.be=sqrt(mean((res.be)^2))
rmse.be
 
##---------Estimation Model Using Penalized ZINB-LASSO--------
fit.lasso<-zipath(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+
x14+x15+x16+x17+x18+x19+x20,data=dat,family="negbin",
nlambda=100, lambda.zero.min.ratio=0.001,
maxit.em=300, maxit.theta=25, theta.fixed=FALSE,
trace=FALSE, penalty="enet", rescale=FALSE)

# index of the tuning parameter pair with the smallest BIC
minBic <- which.min(BIC(fit.lasso))
coef(fit.lasso, minBic)
cat("theta estimate", fit.lasso$theta[minBic])
se(fit.lasso, minBic, log=FALSE)
 
AIC(fit.lasso)[minBic]
BIC(fit.lasso)[minBic]
logLik(fit.lasso)[minBic]
 
 
#plot BIC lasso with tuning parameter indexes
BIC.Lasso<-BIC(fit.lasso)
plot(BIC.Lasso)
 
res.so=fit.lasso$residual [1:3010,22]
rmse.so=sqrt(mean((res.so)^2))
rmse.so
 
##-------Estimation Model Using Penalized ZINB-SCAD-----
tune.scad<-tuning.zipath(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+
x12+x13+x14+x15+x16+x17+x18+x19+x20,data=dat,
standardize=TRUE, family = "negbin",
penalty = "snet",lambdaCountRatio = .0001,
lambdaZeroRatio = c(.1, .01, .001), maxit.theta=1,
gamma.count=3.7, gamma.zero=3.7)

fit.scad <- zipath(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+
x13+x14+x15+x16+x17+x18+x19+x20,data = dat,
family = "negbin",lambda.count=tune.scad$lambda.count,
lambda.zero= tune.scad$lambda.zero, maxit.em=300,
maxit.theta=25, theta.fixed=FALSE, penalty="snet")

# index of the tuning parameter pair with the smallest BIC
minBic.s <- which.min(BIC(fit.scad))
coef(fit.scad, minBic.s)
cat("theta estimate", fit.scad$theta[minBic.s])
se(fit.scad, minBic.s, log=FALSE)
 
AIC(fit.scad)[minBic.s]
BIC(fit.scad)[minBic.s]
logLik(fit.scad)[minBic.s]
 
 
#plot BIC scad with tuning parameter indexes
BIC.Scad<-BIC(fit.scad)
plot(BIC.Scad)
 
res.scad=fit.scad$residual [1:3010,26]
rmse.scad=sqrt(mean((res.scad)^2))
rmse.scad
 
 
##---------Estimation Model Using Penalized ZINB-MCP----------
tune<-tuning.zipath(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12
+x13+x14+x15+x16+x17+x18+x19+x20,data=dat,standardize=TRUE,
family = "negbin",penalty = "mnet",lambdaCountRatio = .0001,
lambdaZeroRatio = c(.1, .01, .001), maxit.theta=1,
gamma.count=2.7, gamma.zero=2.7)

fit.mcp<-zipath(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+
x15+x16+x17+x18+x19+x20,data=dat,family = "negbin",
gamma.count=2.7, gamma.zero=2.7,
lambda.count=tune$lambda.count,
lambda.zero= tune$lambda.zero,maxit.em=300,
maxit.theta=1, theta.fixed=FALSE, penalty="mnet")

# index of the tuning parameter pair with the smallest BIC
minBic <- which.min(BIC(fit.mcp))
coef(fit.mcp, minBic)
cat("theta estimate", fit.mcp$theta[minBic])
se(fit.mcp, minBic, log=FALSE)
 
AIC(fit.mcp)[minBic]
BIC(fit.mcp)[minBic]
logLik(fit.mcp)[minBic]
 
#plot BIC mcp with tuning parameter
BIC.mcp<-BIC(fit.mcp)
plot(BIC.mcp)
 
res.mcp=fit.mcp$residual [1:3010,21]
rmse.mcp=sqrt(mean((res.mcp)^2))
rmse.mcp
 
##---------Residual Checking using ZINB-MCP-------
#Normality
#Histogram of the residuals
res.mcp=fit.mcp$residual [1:3010,21]
hist(res.mcp, freq = FALSE)
curve(dnorm, add = TRUE)

#Normal Probability Plot of the residual
probDist <- pnorm(res.mcp)
plot(ppoints(length(res.mcp)), sort(probDist), main = "PP Plot",
xlab = "Observed Probability", ylab = "Expected Probability")
abline(0,1, col="red")

#Plot between Residual v.s. Fitted value
pearson.res=resid(fit.mcp, type='pearson')[1:3010,21]
miu.hat=predict(fit.mcp,type='response')[1:3010,21]
plot(miu.hat,pearson.res, main="ZINB-MCP Regression",
ylab="Residuals", xlab="Predicted", col="blue")
abline(h=0,lty=1,col="red")
lines(lowess(miu.hat,pearson.res),lwd=2, lty=2)

#Independence of the residuals
lag.plot(res.mcp)

References

Said, A. Indonesian Sustainable Development Goals (SDGs) Indicators, BPS RI/BPS-Statistics Indonesia; Indonesian Statistical Bureau: Jakarta, Indonesia, 2019; p. 11.
Kang, K.I.; Kang, K.; Kim, C. Risk factors influencing cyberbullying perpetration among middle school students in Korea: Analysis using the zero-inflated negative binomial regression model. Int. J. Environ. Res. Public Health 2021, 18, 2224.
Komasari, D.; Helmi, A.F. Faktor-faktor penyebab perilaku merokok pada remaja. J. Psikol. 2000, 27, 37–47.
Wang, Z.; Ma, S.; Wang, C. Variable selection for zero-inflated and overdispersed data with application to health care demand in Germany. Biom. J. 2015, 57, 867–884.
Hilbe, J.M. Negative Binomial Regression; Cambridge University Press: Cambridge, UK, 2011.
Hosseinpoor, A.R.; Parker, L.A.; Tursan d’Espaignet, E.; Chatterji, S. Social determinants of smoking in low- and middle-income countries: Results from the World Health Survey. PLoS ONE 2011, 6, e20331.
Tibshirani, R. Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol. 2011, 73, 267–288.
Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
Park, S.; Yang, A.; Ha, H.J.; Lee, J. Measuring the Differentiated Impact of New Low-Income Housing Tax Credit (LIHTC) Projects on Households’ Movements by Income Level within Urban Areas. Urban Sci. 2021, 5, 79.
Wang, Z.; Ma, S.; Zappitelli, M.; Parikh, C.; Wang, C.-Y.; Devarajan, P. Penalized count data regression with application to hospital stay after pediatric cardiac surgery. Stat. Methods Med. Res. 2016, 25, 2685–2703.
Wang, Z.; Ma, S.; Wang, C.; Zappitelli, M.; Devarajan, P.; Parikh, C. EM for regularized zero-inflated regression models with applications to postoperative morbidity after cardiac surgery in children. Stat. Med. 2014, 33, 5192–5208.
Breheny, P.; Huang, J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 2011, 5, 232.
Hu, T.W.; Mao, Z.; Liu, Y.; de Beyer, J.; Ong, M. Smoking, standard of living, and poverty in China. Tob. Control 2005, 14, 247–250.
Siahpush, M. Socioeconomic status and tobacco expenditure among Australian households: Results from the 1998–99 Household Expenditure Survey. J. Epidemiol. Community Health 2003, 57, 798–801.
Herawati, P.; Afriandi, I.; Wahyudi, K. Determinan Paparan Asap Rokok di Dalam Rumah: Analisis Data Survei Demografi dan Kesehatan Indonesia (SDKI) 2012. Bul. Penelit. Kesehat. 2019, 47, 245–252.
Van den Broek, J. A score test for zero inflation in a Poisson distribution. Biometrics 1995, 51, 738.
Cameron, A.C.; Trivedi, P.K. Regression Analysis of Count Data; Cambridge University Press: Cambridge, UK, 2013; Volume 53.
Hirose, Y. Regularization methods based on the Lq-likelihood for linear models with heavy-tailed errors. Entropy 2020, 22, 1036.
Patil, A.R.; Kim, S. Combination of ensembles of regularized regression models with resampling-based lasso feature selection in high dimensional data. Mathematics 2020, 8, 110.
Liu, X.; Zhao, B.; He, W. Simultaneous feature selection and classification for data-adaptive Kernel-Penalized SVM. Mathematics 2020, 8, 1846.
Algamal, Z.Y.; Lee, M.H. Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst. Appl. 2015, 42, 9326–9332.
Pampel, F. Tobacco use in sub-Sahara Africa: Estimates from the demographic health surveys. Soc. Sci. Med. 2008, 66, 1772–1783.
Cendekia, D.G. Keterkaitan Transfer Pemerintah Untuk Perlindungan Sosial Terhadap Perilaku Merokok Pada Rumah Tangga Miskin Di Indonesia (The Influence of Government Transfers for Social Protection on Smoking Behaviour Among Poor Households in Indonesia). J. Kependud. Indones. 2018, 13, 133–142.
John, R.M.; Ross, H.; Blecher, E. Tobacco expenditure and its implications for household resource allocation in Cambodia. Tob. Control 2012, 21, 341–346.

Figure 1. Histogram of Cigarette Consumption of Poor Households in East Java 2019.

Figure 2. The Characteristics of Poor Households That Smoke in East Java 2019.

Table 1. Predictor Variables.

Variable	Description
Residence	0: rural, 1: urban
Gender	0: female, 1: male
Marital status	0: married, 1: single, 2: divorced
Age
Education	0: college, 1: secondary, 2: primary, 3: no education
Working status	0: not working, 1: formal, 2: informal
Adult household members
Household members who use the Internet
Household members who do not work
Social assistance	0: does not receive, 1: receives
Toddler existence	0: no, 1: yes
Housing tenure	0: owner, 1: rent, 2: free rent, 3: other
Health expenditure
Education expenditure

Table 2. ZINB Model on Poor Household Cigarette Consumption.

Variable	Category	Negative Binomial (NB) Component				Zero Component
Variable	Category	BE	LASSO	SCAD	MCP	BE	LASSO	SCAD	MCP
Intercept		3.6657 (0.1021)	3.7446 (0.0838)	3.8485 (0.082)	3.7802 (0.078)	1.7833 (0.2522)	1.4862 (0.1060)	1.7264 (0.2086)	1.8367 (0.2159)
Residence	Rural
Gender	Male	0.1262 (0.0585)				−1.3886 (0.1142)	−0.9839 (0.0937)	−1.4053 (0.1091)	−1.3676 (0.1096)
Marital status	Single					0.7911 (0.3892)
	Divorce
Age		−0.0062 (0.0014)	−0.0019 (0.0015)	−0.0072 (0.0015)	−0.0052 (0.0015)	0.0177 (0.0035)		0.0175 (0.0032)	0.0173 (0.0033)
Education	Secondary			−0.0033 (0.0527)		0.3112 (0.1204)
	Primary
	No Educ	0.2160 (0.0364)	0.0887 (0.0399)	0.2088 (0.0418)	0.1598 (0.0394)	−0.3659 (0.0988)		−0.4156 (0.0950)	−0.4035 (0.0956)
Working status	Formal			0.0105 (0.0473)
	Informal					−0.2369 (0.0881)			−0.2186 (0.0867)
Adult members		0.2436 (0.0157)	0.1976 (0.0186)	0.2444 (0.0175)	0.2368 (0.0172)	−0.4990 (0.0468)	−0.3687 (0.0395)	−0.5628 (0.0410)	−0.5599 (0.0410)
Household members who use the internet
Household members who do not work			0.0035 (0.0132)			−0.0848 (0.0347)	−0.0078 (0.0290)
Social assistance	Receive			0.0005 (0.0354)		−0.2975 (0.0826)		−0.3306 (0.0820)	−0.324 (0.0821)
Toddler existence	Yes	0.0807 (0.0380)
Housing tenure	Rent
	Free rent	0.1522 (0.0737)		0.0395 (0.0159)
	Other
Health expenditure		−0.0099 (0.0033)		−0.003 (0.0031)
Education expenditure
Theta		2.1949 (1.0339)	2.1476 (0.0814)	2.181 (0.0812)	2.17 (0.0813)

Estimated coefficients are shown with standard errors in parentheses; blank cells indicate variables not selected by the corresponding method.
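
To read the MCP columns of Table 2 on a more interpretable scale, the coefficients can be exponentiated. The sketch below assumes the usual ZINB links (a log link for the negative binomial component and a logit link for the zero component, with the zero component modeling the probability of a structural zero); the vector names are illustrative and the values are copied from the MCP columns of Table 2.

```r
# MCP estimates from Table 2 (count component, assumed log link)
mcp_count <- c(age           = -0.0052,
               no_education  =  0.1598,
               adult_members =  0.2368)

# MCP estimates from Table 2 (zero component, assumed logit link)
mcp_zero <- c(gender_male       = -1.3676,
              age               =  0.0173,
              no_education      = -0.4035,
              informal_work     = -0.2186,
              adult_members     = -0.5599,
              social_assistance = -0.3240)

# exp(coef) in the count part: multiplicative change in expected consumption
rate_ratio <- exp(mcp_count)

# exp(coef) in the zero part: multiplicative change in the odds of
# belonging to the structural-zero (non-consuming) group
odds_ratio <- exp(mcp_zero)

print(round(rate_ratio, 3))
print(round(odds_ratio, 3))
```

Under these assumptions, a rate ratio above 1 indicates higher expected consumption per unit increase in the predictor, and an odds ratio below 1 indicates lower odds of being a non-consuming household.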

Table 3. Bayesian Information Criterion (BIC) on ZINB Model.

Model	BIC
ZINB-BE	21,917.3
ZINB-LASSO	21,992.9
ZINB-SCAD	21,928.3
ZINB-MCP	21,899.4
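
The comparison in Table 3 can be reproduced with a few lines of base R. This is only a sketch that re-enters the published BIC values by hand (the vector name bic is an assumption) and reports each model's distance from the smallest BIC.

```r
# BIC values copied from Table 3
bic <- c("ZINB-BE"    = 21917.3,
         "ZINB-LASSO" = 21992.9,
         "ZINB-SCAD"  = 21928.3,
         "ZINB-MCP"   = 21899.4)

# Difference from the best (smallest) BIC; the preferred model has delta 0
delta <- bic - min(bic)
best  <- names(which.min(bic))

print(sort(delta))
print(best)
```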


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
