3. Proposed Method for Zero-Inflated Data Analysis
We propose a new analysis method based on BNNR to solve the sparsity problem that occurs in zero-inflated data analysis. To check for the zero-inflated problem in the given data, we quantify zero inflation by the zero ratio in Equation (6), defined as the proportion of observations equal to zero:

$$\text{zero ratio} = \frac{1}{n} \sum_{i=1}^{n} I(y_i = 0), \quad (6)$$

where $I(\cdot)$ is the indicator function and $n$ is the number of observations. In this study, we operationally regard a zero ratio ≥ 10% as excessive zeros. Because the appropriate threshold can be domain-dependent, we also report results across multiple zero-ratio levels to assess robustness.
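As an illustration, the zero ratio in Equation (6) can be computed in R as follows (a minimal sketch; the vector `y` and the 10% threshold check are placeholders for the user's data):

```r
# Zero ratio: proportion of observations equal to zero (Equation (6))
zero_ratio <- function(y) mean(y == 0)

# Example with a hypothetical zero-inflated count vector
y <- c(0, 0, 0, 2, 0, 1, 0, 4, 0, 3)
zero_ratio(y)          # 0.6, i.e., a 60% zero ratio
zero_ratio(y) >= 0.10  # TRUE: flagged as excessive zeros under our threshold
```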
The proposed method is intended for settings in which the zero ratio is very large and analysis is difficult even with existing zero-inflated models. Since the BNNR method adds regularization to a BNN, we first describe the structure of the BNN. The proposed BNNR is implemented with an R package for Bayesian regularization in feed-forward neural networks [16]. The network model consists of input and hidden layers with a fixed number of nodes, and an output layer that predicts the conditional mean of a zero-inflated count response. For each observation $(\mathbf{x}_i, y_i)$, the model computes the predicted value $\hat{y}_i = f(\mathbf{x}_i; \mathbf{w})$, where the vector $\mathbf{w}$ contains all weights and biases. The BNNR model is trained by minimizing the squared error between $y_i$ and $\hat{y}_i$, and is regularized through Gaussian priors on $\mathbf{w}$ and data-based tuning of the regularization strength. Given data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, the input and output are represented as $\mathbf{x}_i$ and $y_i$, respectively.
$\mathbf{w}$ is the weight vector of the neural network. We model the zero-inflated count data using a neural network with hidden layers. We use sigmoid and exponential activation functions for the hidden and output layers, respectively. Using normalization, we unify the scale of all input variables. When $y_i$ is the target, we define $y_i$ as following a Gaussian distribution whose mean is given by the neural network output $f(\mathbf{x}_i; \mathbf{w})$ and whose inverse variance is the precision $\beta$. Following [27,37], we specify the prior distribution of $\mathbf{w}$ as follows:

$$p(\mathbf{w} \mid \alpha) = N(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}) \quad (7)$$

In Equation (7), $\mathbf{0}$ and $\mathbf{I}$ are the zero vector and the identity matrix, and $\alpha$ is the precision of the prior. The likelihood function for $D$ is expressed as follows:

$$p(D \mid \mathbf{w}, \beta) = \prod_{i=1}^{n} N\!\left(y_i \mid f(\mathbf{x}_i; \mathbf{w}), \beta^{-1}\right) \quad (8)$$

Combining the prior in Equation (7) with the likelihood in Equation (8), we obtain the following posterior distribution:

$$p(\mathbf{w} \mid D, \alpha, \beta) \propto p(D \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha) \quad (9)$$
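For concreteness, the unnormalized log posterior in Equation (9) is simply the sum of the log likelihood in Equation (8) and the log prior in Equation (7). A minimal R sketch follows; `f_hat` (the network outputs), `alpha`, and `beta` are hypothetical placeholders:

```r
# Log prior (Equation (7)): w ~ N(0, alpha^{-1} I)
log_prior <- function(w, alpha) {
  sum(dnorm(w, mean = 0, sd = sqrt(1 / alpha), log = TRUE))
}

# Log likelihood (Equation (8)): y_i ~ N(f(x_i; w), beta^{-1})
log_likelihood <- function(y, f_hat, beta) {
  sum(dnorm(y, mean = f_hat, sd = sqrt(1 / beta), log = TRUE))
}

# Unnormalized log posterior (Equation (9))
log_posterior <- function(w, y, f_hat, alpha, beta) {
  log_likelihood(y, f_hat, beta) + log_prior(w, alpha)
}
```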
BNNs that use prior distributions for weights are less likely to rely too heavily on specific parameter values than ordinary neural networks, but they can still suffer from overfitting. Moreover, prior misspecification or overly concentrated or shrinking priors, together with restrictive approximate posteriors, can over-regularize Bayesian neural networks, leading to underfitting and biased uncertainty. See Foong et al. on the expressiveness limits of approximate inference in BNNs, Wenzel et al. on posterior multi-modality and sensitivity to priors and initialization, and Wilson and Izmailov on probabilistic generalization and the role of inductive bias in deep models [28,29,30,31]. Regularization is a machine learning technique that addresses the overfitting problem by adding a penalty term to the error function to prevent the coefficients from growing too large [
38]. Regularization limits the size of the network weights and suppresses model complexity to prevent overfitting; that is, we control the complexity of the model by including a regularization term in the error function. The numbers of input and output nodes are determined by the given data. Therefore, we can improve the performance of the model by optimally determining the number of hidden layers and the number of nodes in each hidden layer. The error function with regularization is defined as follows [16,21,25,26]:

$$\tilde{E}(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) \quad (10)$$
In Equation (10), $\lambda$ is the regularization coefficient that determines model complexity. By combining BNNs and regularization, BNNR uses prior distributions to prevent excessive weight growth during data learning and posterior distributions to perform accurate predictions under uncertainty. BNNR learning is performed by minimizing the following error function $F(\mathbf{w})$ [38,39]:

$$F(\mathbf{w}) = \beta \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j} w_j^2 \quad (11)$$
In Equation (11), the error function consists of the sum of squared differences between the actual and predicted values and a constraint on the size of the weights. $\alpha$ and $\beta$ are hyperparameters that control the relative importance of the two terms in the error function. From a Bayesian perspective, minimizing $F(\mathbf{w})$ corresponds to minimizing the negative logarithm of the posterior distribution in Equation (9). Therefore, the BNNR learning procedure is as follows (Algorithm 1).
| Algorithm 1 BNNR Learning Procedure |
| Input: data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$; network structure (number of hidden layers and nodes); hyperparameters $\alpha$ and $\beta$. |
| Output: posterior samples of the weights $\mathbf{w}$ and predicted values $\hat{y}_i$. |
| Procedure: |
| 1. Normalize all input variables to a common scale. |
| 2. Sample the initial weights from the prior distribution. |
| 3. Run the neural network model to calculate the predicted values. |
| 4. Evaluate the error function $F(\mathbf{w})$ in Equation (11). |
| 5. Automatically adjust the weight parameters at each epoch based on data fit and weight size. |
| 6. After the model converges, take samples of all weights and compute the predicted values. |
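A minimal R sketch of Algorithm 1, assuming the brnn package of [16] (the simulated data, the number of hidden neurons, and the epoch count are placeholders):

```r
library(brnn)

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)            # two standardized input variables
y <- rpois(100, lambda = exp(0.5 * x[, 1]))  # hypothetical count response with many zeros

# Steps 1-5: brnn() normalizes the inputs, samples initial weights,
# and re-estimates the regularization hyperparameters at each epoch
fit <- brnn(x, y, neurons = 2, epochs = 1000)

# Step 6: predicted values after convergence
y_hat <- predict(fit, x)
```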
The BNNR model with two layers is represented as follows [16]:

$$y_i = \sum_{k=1}^{s} w_k\, g\!\left(b_k + \sum_{j=1}^{p} x_{ij} \beta_j^{[k]}\right) + e_i \quad (12)$$
In Equation (12), $e_i$ is assumed to follow a Gaussian distribution with mean $0$ and variance $\sigma_e^2$, $s$ is the number of hidden nodes, and $p$ is the number of input variables. In addition, $b_k$ and $g(\cdot)$ are the bias and activation function, respectively. Thus, we minimize the error function

$$F = \beta E_D + \alpha E_W \quad (13)$$
In Equation (13), $E_D$ and $E_W$ are the error sum of squares and the sum of squares of the model parameters, respectively. In this study, the BNNR is implemented using a Bayesian regularized neural network for regression. The model uses a single-output network that predicts the conditional mean $E(y_i \mid \mathbf{x}_i)$. While ZIP and ZINB explicitly parameterize the zero probability and the count intensity as $\pi$ and $\lambda$, respectively, the current BNNR implementation neither explicitly models $(\pi, \lambda)$ within the network nor optimizes an explicit variational evidence lower bound (ELBO) [25,38,39].
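To make Equations (12) and (13) concrete, the following R sketch implements the two-layer prediction with sigmoid hidden units and the regularized error function; all weight objects (`w`, `b`, `B`) are hypothetical placeholders:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Prediction under Equation (12) with a linear output layer:
# X is an n x p input matrix, w the s output weights, b the s hidden biases,
# and B a p x s matrix of hidden-layer weights beta_j^[k]
predict_two_layer <- function(X, w, b, B) {
  hidden <- sigmoid(sweep(X %*% B, 2, b, "+"))  # n x s hidden activations
  as.vector(hidden %*% w)
}

# Error function in Equation (13): F = beta * E_D + alpha * E_W
error_fn <- function(y, y_hat, params, alpha, beta) {
  beta * sum((y - y_hat)^2) + alpha * sum(params^2)
}
```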
To evaluate the explanatory performance of count and zero-inflated models, we consider the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) in this paper [21]. We use the AIC measure to verify the goodness of model fit as follows [21,40]:

$$\mathrm{AIC} = -2 \log L(\hat{\theta}) + 2k \quad (14)$$
In Equation (14), $\hat{\theta}$ is the maximum likelihood estimate (MLE) of the parameter vector $\theta$, $L(\hat{\theta})$ is the maximized likelihood, and $k$ is the number of model parameters. The better the fitting performance of the model, the smaller the AIC value. The BIC is an index for measuring the performance of a fitted model that strengthens the AIC penalty with the sample size $n$. Equation (15) represents the BIC [21,40]:

$$\mathrm{BIC} = -2 \log L(\hat{\theta}) + k \log n \quad (15)$$
In Equation (15), $\hat{\theta}$ is the maximum a posteriori (MAP) estimate of $\theta$, $k$ is the number of model parameters, and $n$ is the number of observations. As with the AIC, the better the fitting performance of the model, the smaller the BIC value.
In addition to AIC and BIC, we consider the Watanabe–Akaike information criterion (WAIC) as a model evaluation measure. WAIC estimates the expected out-of-sample predictive accuracy using the pointwise log-likelihood and a correction for effective model complexity based on the variance of the log-likelihood across posterior draws [21,22]. Let $D = \{y_1, \ldots, y_n\}$ denote the given data and $\theta$ represent the model parameters. Given posterior draws $\theta^{(s)}$, $s = 1, \ldots, S$, the log pointwise predictive density (lppd) is computed as follows [21,22,23,24]:

$$\mathrm{lppd} = \sum_{i=1}^{n} \log\!\left( \frac{1}{S} \sum_{s=1}^{S} p\!\left(y_i \mid \theta^{(s)}\right) \right) \quad (16)$$
In Equation (16), $S$ is the number of posterior draws. The effective number of parameters, $p_{\mathrm{WAIC}}$, is estimated by Equation (17) [21,22,23,24]:

$$p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \operatorname{Var}_{s}\!\left( \log p\!\left(y_i \mid \theta^{(s)}\right) \right) \quad (17)$$
Equation (17) sums, over the observations, the posterior variance of the log predictive density. WAIC is then defined as follows [21,22,23,24]:

$$\mathrm{WAIC} = -2\left( \mathrm{lppd} - p_{\mathrm{WAIC}} \right) \quad (18)$$

In Equation (18), a smaller WAIC value represents better expected predictive performance.
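Given an $S \times n$ matrix of pointwise log-likelihood values across posterior draws, Equations (16)–(18) can be computed directly in base R (the matrix `log_lik_mat` is a placeholder for the user's posterior output):

```r
# log_lik_mat: S x n matrix with entries log p(y_i | theta^(s))
waic <- function(log_lik_mat) {
  lppd   <- sum(log(colMeans(exp(log_lik_mat))))  # Equation (16)
  p_waic <- sum(apply(log_lik_mat, 2, var))       # Equation (17)
  -2 * (lppd - p_waic)                            # Equation (18)
}
```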
As an index for comparing the predictive performance of all models, including the zero-inflated count models and BNNR, we use the mean squared error (MSE), which is defined as follows [40]:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \quad (19)$$

In Equation (19), $y_i$ and $\hat{y}_i$ are the actual and predicted values, respectively. To calculate the MSE, we divide the given data into training and test sets at a ratio of 7:3. We build the model using the training data and calculate the MSE using the test data. The smaller the MSE value of the model, the better its predictive performance. In the following sections, we compare the performance of four models, PGLM, ZIP, ZINB, and BNNR, using AIC, BIC, and MSE. For the experiments in this paper, we use the R Project for Statistical Computing as the data analysis tool [36]. R is free, open-source software that provides a variety of functions for statistical data analysis [36].
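Finally, a minimal R sketch of the 7:3 split and the MSE in Equation (19); the simulated data frame `dat` and the Poisson GLM stand in for the models compared in this paper:

```r
set.seed(1)
dat <- data.frame(x = rnorm(100))              # hypothetical data
dat$y <- rpois(100, lambda = exp(0.3 * dat$x))

idx   <- sample(nrow(dat), size = round(0.7 * nrow(dat)))  # 7:3 train/test split
train <- dat[idx, ]
test  <- dat[-idx, ]

fit   <- glm(y ~ x, family = poisson, data = train)        # model built on the training data
y_hat <- predict(fit, newdata = test, type = "response")

mean((test$y - y_hat)^2)  # MSE in Equation (19), computed on the test data
```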